Week 5 HW: Protein design part II
Part A: SOD1 Binder Peptide Design (From Pranam)
Part 1: Generate Binders with PepMLM
the human SOD1 sequence (P00441): MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
the A4V mutant SOD1 sequence: MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence with the known SOD1-binding peptide FLYRWLPSRRGG for comparison:
| index | Binder | Pseudo Perplexity |
|---|---|---|
| 1 | WRYGPAAAAHWK | 9.318684 |
| 2 | WHYPAVVLRWKX | 16.435002 |
| 3 | WLYYPAAVRLWK | 16.527933 |
| 4 | WLYYVAVVALGE | 22.958134 |
| 5 | FLYRWLPSRRGG | 20.63523127283615 |

Conclusion
The model assigned the lowest perplexity (9.32) to peptide WRYGPAAAAHWK, indicating the highest sequence plausibility according to the language model.
The experimentally validated SOD1-binding peptide FLYRWLPSRRGG showed one of the higher perplexity (20.63), suggesting that the language model does not necessarily rank experimentally verified binders as the most probable sequences.
One generated peptide (WHYPAVVLRWKX) contains the residue “X”, which denotes an unknown or unspecified amino acid. This likely reflects a tokenization or sampling artifact of the language model. Because “X” does not correspond to a defined amino acid, this peptide should be interpreted cautiously when evaluating potential binding candidates which suggests that such a peptide may be invalid.