Week 5 HW: Protein Design Part II

Part A SOD1 Binder Peptide Design

Part 1: Generate Binders with PepMLM

Human SOD1 Sequence (UniProt)

MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

A4V Mutant Sequence

Position 4: Ala β†’ Val (A4V)

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
PositionWild-typeMutant
4A (Ala)V (Val)

PepMLM-Generated Peptides

4 candidate binders generated against the A4V mutant sequence + the known reference peptide.
Lower perplexity scores indicate sequences more confidently predicted by the model.

#SequencePerplexityNote
PepMLM 1WHYPAAAAAWKK8.611β€”
PepMLM 2WRSPAVAAAHKE7.866Lowest perplexity
PepMLM 3WRYPAVALEWKK16.562Highest perplexity
PepMLM 4WHSYVVGARWWK13.338β€”
KnownFLYRWLPSRRGGβ€”Reference binder

Note on Perplexity: In PepMLM, perplexity reflects how confidently the masked language model predicts each residue in context. Lower perplexity suggests the sequence is more consistent with the model’s learned distribution of binders; however, higher perplexity sequences may still yield productive binding if their physicochemical and structural properties are favourable.

Part 2: Evaluate Binders with AlphaFold 3


For the sake of my OCD
or else with only 5 pics will look ugly

Known Peptide
ipTM = 0.36

Peptide 1
ipTM = 0.27

Peptide 2
ipTM = 0.40

Peptide 3
ipTM = 0.19

Peptide 4
ipTM = 0.39

ipTM (interface predicted TM-score) measures predicted interface accuracy.
Values range from 0 to 1 β€” higher is better. Scores β‰₯ 0.5 generally indicate confident predictions.


Binding Analysis

StructureipTMNear A4V / N-term?Ξ²-barrel engagementSurface character
Known (Reference)0.36YesLateral strand edgeSurface-bound, extended
PepMLM Peptide 10.27NoMinimalSurface, poorly engaged
PepMLM Peptide 20.40Partial β€” dimer faceLateral interface cleftSurface docked
PepMLM Peptide 30.19NoNonePeripheral, non-specific
PepMLM Peptide 40.39Distal (C-term base)Bottom loop regionSurface-bound

Notes

  • PepMLM Peptide 2 is the strongest candidate: highest ipTM, adopts Ξ±-helical secondary structure upon binding, and docks into the concave groove at the lateral Ξ²-barrel interface β€” the region destabilised by the A4V mutation. One face of the helix contacts SOD1 while the other remains solvent-exposed. This binding mode is consistent with therapeutic peptides that stabilise misfolding-prone interfaces.
  • PepMLM Peptide 4 has a comparable ipTM (0.39) but localises to the base of the barrel near C-terminal loops, distal from the A4V site, limiting its therapeutic relevance.
  • PepMLM Peptides 1 and 3 show poor interface engagement and are unlikely to be productive binders.

ipTM (interface predicted TM-score) measures predicted interface accuracy.
Values range from 0–1; scores β‰₯ 0.5 generally indicate confident predictions. All values here are modest, consistent with flexible peptide–protein interfaces typical in AlphaFold-Multimer assessments.


Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

swipe left for more

PepMLM 1
WHYPAAAAAWKK
PepMLM 2
WRSPAVAAAHKE
PepMLM 3
WRYPAVALEWKK
PepMLM 4
WHSYVVGARWWK
Known (Reference)
FLYRWLPSRRGG
PropertyPredictionValue
πŸ’§ SolubilitySoluble1.000
🩸 HemolysisNon-hemolytic0.013
πŸ”— Binding AffinityWeak binding4.902 pKd/pKi
πŸ“ Lengthβ€”12 aa
βš–οΈ Mol. Weightβ€”1399.6 Da
⚑ Net Chargeβ€”+1.84
🎯 pIβ€”9.70 pH
πŸ’¦ GRAVYβ€”βˆ’0.56
PropertyPredictionValue
πŸ’§ SolubilitySoluble1.000
🩸 HemolysisNon-hemolytic0.016
πŸ”— Binding AffinityWeak binding4.661 pKd/pKi
πŸ“ Lengthβ€”12 aa
βš–οΈ Mol. Weightβ€”1322.5 Da
⚑ Net Chargeβ€”+0.85
🎯 pIβ€”8.76 pH
πŸ’¦ GRAVYβ€”βˆ’0.58
PropertyPredictionValue
πŸ’§ SolubilitySoluble1.000
🩸 HemolysisNon-hemolytic0.027
πŸ”— Binding AffinityWeak binding5.784 pKd/pKi
πŸ“ Lengthβ€”12 aa
βš–οΈ Mol. Weightβ€”1546.8 Da
⚑ Net Chargeβ€”+1.76
🎯 pIβ€”9.70 pH
πŸ’¦ GRAVYβ€”βˆ’0.74
PropertyPredictionValue
πŸ’§ SolubilitySoluble1.000
🩸 HemolysisNon-hemolytic0.039
πŸ”— Binding AffinityWeak binding6.308 pKd/pKi
πŸ“ Lengthβ€”12 aa
βš–οΈ Mol. Weightβ€”1574.8 Da
⚑ Net Chargeβ€”+1.85
🎯 pIβ€”9.99 pH
πŸ’¦ GRAVYβ€”βˆ’0.55
PropertyPredictionValue
πŸ’§ SolubilitySoluble1.000
🩸 HemolysisNon-hemolytic0.047
πŸ”— Binding AffinityWeak binding5.968 pKd/pKi
πŸ“ Lengthβ€”12 aa
βš–οΈ Mol. Weightβ€”1507.7 Da
⚑ Net Chargeβ€”+2.76
🎯 pIβ€”11.71 pH
πŸ’¦ GRAVYβ€”βˆ’0.71

All peptides are predicted soluble and non-hemolytic. Binding affinity (pKd/pKi): higher = stronger predicted affinity. Negative GRAVY scores reflect hydrophilic character across all sequences. Across the five peptides, there is no clear correlation between ipTM and predicted binding affinity.

The peptide I selected is PepMLM Peptide 2 (WRYPAVALEWKK). While its predicted affinity is modest, it has the highest ipTM, adopts stable Ξ±-helical secondary structure upon docking β€” a hallmark of productive peptide–protein interfaces β€” and engages the lateral cleft of the Ξ²-barrel at precisely the region destabilised by A4V. It is the only candidate where the structural, physicochemical, and site-specificity evidence converge.


Part C: Final Project: L-Protein Mutants

The MS2 bacteriophage lysis protein (L-protein) is a 74 amino acid protein responsible for killing E. coli host cells by perforating the bacterial membrane. A critical vulnerability of this system is that a single point mutation in the host chaperone protein DnaJ can prevent the lysis protein from functioning, allowing E. coli to acquire resistance to MS2.

The L-protein has two structurally and functionally distinct regions:

  • Soluble N-terminal domain (positions 1–38): responsible for interaction with DnaJ
  • Transmembrane domain (positions 39–73): responsible for membrane insertion and lysis

At least 2 in the transmembrane region and at least 2 in the soluble region.

Option 1: Mutagenesis

Running the ESM-2 protein language model (facebook/esm2_t6_8M_UR50D) on the full wild-type L-protein sequence:

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

The model scores every possible single amino acid substitution at every position using a Log Likelihood Ratio (LLR):

  • Positive score β†’ the substitution looks evolutionarily natural and compatible
  • Negative score β†’ the substitution disrupts what the model expects at that position
  • Position 1 (M) showed almost entirely dark purple scores, confirming the start methionine is essential and should not be mutated
  • Rows M, W, Y were dark across most positions β€” large/bulky amino acids are generally disruptive substitutions
  • The transmembrane region (~positions 39–73) showed brighter yellow/green scores for hydrophobic substitutions (L, I, V, F) β€” consistent with the hydrophobic nature of membrane-spanning helices
  • Bright yellow hotspots at positions 29, 39, and 50 stood out as positions where specific mutations are strongly predicted

The notebook was first run with a focused query on the transmembrane region (positions 38–60), producing the following top-scored mutations:

  Amino Acid  Position     Score
0           L        50  2.561468
1           L        39  2.241780
2           I        50  1.928801
3           L        53  1.864932
4           L        52  1.813968
5           F        50  1.802069
6           V        50  1.594576
7           S        50  1.574557
8           L        45  1.539248
9           S        39  1.517457
10          L        40  1.477630
11          A        39  1.364999
12          A        50  1.357795
13          I        39  1.320103
14          T        39  1.302804
15          F        39  1.245851
16          V        39  1.244390
17          T        50  1.222131
18          L        54  1.120860
19          R        39  1.064191

Three positions dominate the top scores: 50, 39, and 45. The model strongly favors leucine (L) substitutions at positions 50 and 39, and also at position 45. This is the first signal pointing toward K50L, Y39L, and A45(L or P) as strong TM candidates. Notably, multiple substitutions at position 50 rank highly (L, I, F, V, S, A),suggesting this position is generally flexible β€” but leucine scores the highest of all.


The notebook was then run on the full protein sequence to get a global ranking across all 74 positions:

     Position Wild_Type_AA Mutation_AA  LLR_Score
989         50            K           L   2.561468
574         29            C           R   2.395427
769         39            Y           L   2.241780
575         29            C           S   2.043150
173          9            S           Q   2.014325
573         29            C           Q   1.997049
572         29            C           P   1.971029
569         29            C           L   1.960646
987         50            K           I   1.928801
1049        53            N           L   1.864932

The top 10 globally are dominated by three positions: 50 (K→L), 29 (C→R/S/Q/P/L), and 39 (Y→L). This globally confirms what the TM scan already suggested, and additionally highlights C29 in the soluble region as a computationally interesting mutation site.

The full ranking also produced a second merged output combining both score datasets:

     Position Wild_Type_AA Mutation_AA  LLR_Score
1332        50            K           L   2.561468
770         29            C           R   2.395427
1035        39            Y           L   2.241780
229          9            S           Q   2.014325
776         29            C           Q   1.997049
...

The computational shortlist from the ESM model was:

  • K50L (score: +2.56) β€” highest in entire protein
  • C29R (score: +2.40) β€” highest in soluble region
  • Y39L (score: +2.24) β€” strong TM candidate
  • A45L (score: +1.54) β€” noted in TM scan

The L-Protein Mutants CSV was uploaded into the notebook, which displayed the first rows of the experimental dataset:

| Position | Base Pair Changed | AA Position | AA Change  | Lysis | Protein Level |
|----------|------------------|-------------|------------|-------|---------------|
| 3        | G->T              | 1           | M->I       | 0     | 0             |
| 3        | G->A              | 1           | M->I       | 0     | 0             |
| 2        | T->C              | 1           | M->T       | 0     | 0             |
| 4        | G->T              | 2           | E->Stop    | 0     | N.D.          |
| 8        | C->T              | 3           | T->I       | 0     | 0             |

This dataset contains experimentally measured lysis outcomes (0 = no lysis, 1 = lysis) for mutations that have already been tested in the lab. Cross-referencing this with the ESM scores revealed which computational predictions align with real biology.


Merging both datasets exposed a critical finding: the ESM model only partially agrees with experimental lysis outcomes.

MutationESM ScoreLysis (Lab)Agreement?
P13L+0.10Yesβœ…
S15A+0.04Yesβœ…
K23E+0.18Yesβœ…
E25G+0.45Yesβœ…
A45P+0.04Yesβœ…
I46F-0.10Yes❌
R18G-0.85Yes❌
R31I-0.93Yes❌
L44P-1.59Yes❌
R20W-2.18Yes❌

The disagreements (especially R18G, I46F, L44P) suggest that the ESM model scores general protein structural fitness (the ability to fold into a stable, functional, three-dimensional shape (conformation) that is energeticaly favorable), not functional lysis activity (the process of breaking open cell membranes).

Mutations that disrupt DnaJ binding (like R18G) are penalised by the model because the arginine is evolutionarily conserved β€” but conserved because it binds DnaJ.

This insight shaped the final selection strategy:

Use ESM scores to identify novel untested candidates with high computational confidence, and use experimental data to validate or override those scores based on known biology.


With all evidence assembled, five mutations were selected spanning both protein regions:

Soluble Region Mutations (Positions 1–38)

P13L β€” Position 13, Proline β†’ Leucine

  • ESM Score: +0.10 | Lysis: Confirmed | Protein Level: Confirmed
  • Proline at position 13 creates a rigid backbone kink within the DnaJ-binding domain. Replacing it with leucine (flexible, hydrophobic) removes this constraint, potentially allowing the soluble domain to fold independently of DnaJ. Supported by both model and lab.

S15A β€” Position 15, Serine β†’ Alanine

  • ESM Score: +0.04 | Lysis: Confirmed | Protein Level: Confirmed
  • Serine at position 15 sits within the NRRRP arginine-rich DnaJ-binding motif. Its hydroxyl side chain is a candidate hydrogen-bonding contact point with DnaJ. Replacing it with alanine (no side chain beyond a methyl group) directly removes a potential DnaJ interaction site. Both ESM and lab confirm this is tolerated. Selected alongside P13L because the two mechanisms are complementary β€” P13L addresses backbone rigidity, S15A addresses the interaction surface.

Transmembrane Region Mutations (Positions 39–73)

Y39L β€” Position 39, Tyrosine β†’ Leucine

  • ESM Score: +2.24 | Lysis: Not yet tested
  • Position 39 is the first residue of the transmembrane domain β€” the boundary point where the protein transitions from soluble to membrane-spanning. Tyrosine is large and polar (hydroxyl group), which is chemically unusual at the start of a hydrophobic TM helix. Leucine is hydrophobic and small, making for a cleaner, sharper TM helix start. The ESM model strongly favors this change, and it ranked 3rd globally across the entire protein. The only tested mutation at this position (Y39H) failed β€” but histidine is charged and polar, making it incomparable to leucine. Selected as the highest-confidence novel TM candidate.

A45P β€” Position 45, Alanine β†’ Proline

  • ESM Score: +0.04 | Lysis: Confirmed | Protein Level: Confirmed
  • Introducing proline into a transmembrane helix creates a structural kink β€” a feature found in many natural pore-forming proteins and ion channels. This kink at position 45 (sitting centrally in the TM helix) may promote the conformational change needed to open the transmembrane pore. Supported by both the ESM model and direct experimental confirmation.

K50L β€” Position 50, Lysine β†’ Leucine

  • ESM Score: +2.56 (highest in entire protein) | Lysis: Not yet tested
  • Lysine (K) is a charged, hydrophilic amino acid β€” unusual to find it buried deep in a hydrophobic transmembrane helix. The ESM model assigns the highest score in the entire protein to replacing it with leucine (hydrophobic), which is thermodynamically much more compatible with a membrane environment. This substitution could improve membrane insertion efficiency, increase protein expression, or stabilize the TM assembly. It is acknowledged that four other K50 variants (K50E, K50N, K50I, K50Q) have failed in the lab, suggesting this position may be sensitive. However, K50L is specifically a hydrophobic substitution β€” chemically distinct from the charged/polar variants that failed β€” and its extremely high ESM score justifies testing it as a novel candidate.

Final Mutations

#MutationRegionESM ScoreLysisProteinRationale
1P13LSoluble+0.10βœ…βœ…Removes proline kink; enables DnaJ-independent folding
2S15ASoluble+0.04βœ…βœ…Removes DnaJ contact site within NRRRP motif
3Y39LTM+2.24❓—Sharpens TM helix entry; 3rd highest ESM score globally
4A45PTM+0.04βœ…βœ…Proline kink promotes pore-forming conformation
5K50LTM+2.56❓—Highest ESM score in protein; removes charged residue from TM core

AI Prompt used in this section for mutation selection: Given the provided mutations, could you explain the rationale behind each and why would each serve as potentially candidates?