Week 5 HW: Protein Design Part II

Human SOD1 sequence MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

After adding A4V mutation MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Therefore, produced peptides:

indexBinderPseudo Perplexity
1WLYVVAAVRWKX23.320599604199636
2WRYVAAAAAHKE8.96053025308908
3WLYVPAGLALWX13.021677157633269
4WLYYVVAVAHKX15.430388570774006
5FLYRWLPSRRGG11.545571242285833

##Part 2: Evaluating Binders with alpha fold3

The alpha fold results for some reason are not loading for me, despite multiple attempst and troubleshooting. Hence the results were analyzed with the help of Claude using PAE matrices

peptide 1 ipTM 0.38 cover image cover image The PAE matrix shows a uniformly mid-green inter-chain strip with no distinct dark patch, indicating no preferred binding site and the peptide appears to be floating without specific engagement.

peptide 2 ipTM 0.35 cover image cover image The inter-chain strip is mostly light green with a very faint darker region around residues 60–100, suggesting a weak, non-specific affinity toward the β-barrel region, though confidence is low.

peptide 3 ipTM 0.36 cover image cover image The inter-chain strip is the lightest and most uniform of all five, indicating the highest positional uncertainty. It appears to have the least defined interaction with SOD1.

peptide 4 ipTM 0.37 cover image cover image A slightly darker patch in the inter-chain strip around residues 1–30 hints at proximity to the N-terminal region where the A4V mutation sits, making this the most therapeutically interesting placement among the PepMLM peptides.

peptide 5 ipTM 0.41 cover image cover image Shows the darkest and most defined inter-chain strip overall, with a signal around residues 60–110 suggesting some affinity toward the β-barrel mid-region consistent with it being a known SOD1 binder and having the highest ipTM.

Part 3: Evavluating properties of generated peptides in Peptiverse

Peptide 1 WLYVVAAVRWKA cover image cover image

Peptide 2 WRYVAAAAAHKE cover image cover image

Peptide 3 WLYVPAGLALWA cover image cover image

Peptide 4 WLYYVVAVAHKA cover image cover image

Peptide 5 FLYRWLPSRRGG cover image cover image

All four peptides demonstrated favorable therapeutic profiles when evaluated through PeptiVerse, and outperformed FLYRWLPSRRGG in predicted binding affinity. Every peptide showed perfect solubility (1.000 probability) and was predicted to be non-hemolytic, confirming a safe baseline. In terms of binding affinity, Peptide 3 (WLYVPAGLALWA) emerged as the strongest binder with a medium binding score of 7.599 pKd/pKi, followed by Peptide 1 (WLYVVAAVRWKA) at 7.214. FLYRWLPSRRGG achieved a weak binding score of 5.968. This is a significant finding as it suggests PepMLM successfully generated peptides with stronger predicted affinity than an experimentally validated binder. Based on this analysis, Peptide 3 (WLYVPAGLALWA) remains the top candidate to advance. It has the highest predicted binding affinity, full solubility, low hemolytic risk, and a drug-like molecular weight of 1359.6 Da, making it the strongest overall therapeutic candidate from this screen pending AlphaFold3 structural confirmation.

Interpretation of PeptiVerse results

The generated peptides showed trade-offs between predicted binding affinity, therapeutic safety, and developability.

Peptide 7 (GKRYYYYKDKCF) showed the strongest predicted binding affinity (pKd = 9.123), making it the most promising binder from an interaction standpoint. However, it had a relatively low motif score (0.340), suggesting weaker alignment with the desired design motif.

Peptide 8 (VGTCYCIKKKKM) had the highest hemolysis probability (0.978), which makes it less attractive as a therapeutic candidate despite a reasonably strong predicted affinity (pKd = 7.123) and a strong motif score (0.730).

Peptide 9 (TKQCKFTRPQNE) had the strongest motif score (0.876), indicating good alignment with the desired interaction pattern, but its predicted binding affinity (pKd = 5.533) was lower than the best-performing candidates.

Overall, Peptide 7 appears strongest in terms of predicted affinity, while Peptide 9 may represent a more motif-consistent but weaker-binding alternative. Since all candidates showed high hemolysis probabilities, additional optimization would likely be required before therapeutic development.

Part 4: Optimized peptide generation with moPPIt

IndexPeptideHemolysisSolubilityAffinity (pKd)Motif Score
6GKCGKNEVHKHR0.9550.9175.6920.396
7GKRYYYYKDKCF0.9450.9179.1230.340
8VGTCYCIKKKKM0.9780.7507.1230.730
9TKQCKFTRPQNE0.9550.8335.5330.876

Overall, moPPIt gives more rational, multi-objective candidates anchored to a therapeutic hypothesis (binding the A4V site), while PepMLM provides broader sequence diversity without site or safety guidance.

Among the moPPIt candidates, GKRYYYYKDKCF (Peptide 7) is the strongest candidate to advance. It has by far the highest predicted binding affinity (9.12 pKd), a hemolysis score of 0.945 (non-hemolytic), and a solubility score of 0.917. Its motif score of 0.340 is the lowest among the four, suggesting it may not perfectly engage the exact residues targeted near position 4, but given that its affinity is dramatically higher than all other candidates from both tools, it warrants further structural and experimental investigation to determine where exactly it binds SOD1.

Part B skipped since optional

Part C: Final project L-Protein Mutants

Option 1: Mutagenesis

Attaching MSA output

cover image cover image cover image cover image

looking at the TM region in Image 2, almost every sequence ends with EAVIRTVTTLQQLLT. This stretch is extremely conserved, which means residues ~62–75 (VIRTVTTLQQLLT) are very risky to mutate.

And the L Protein mutation heatmap, cover image cover image

The heatmap x-axis follows the full L-protein sequence. Mapping positions to amino acids: M(1) E(2) T(3) R(4) F(5) P(6) Q(7) Q(8) S(9) Q(10)Q(11)T(12)P(13)A(14)S(15)T(16) N(17)R(18)R(19)R(20)P(21)F(22)K(23)H(24)E(25)D(26)Y(27)P(28)C(29)R(30)R(31)Q(32) Q(33)R(34)S(35)S(36)T(37)L(38)Y(39)V(40) | L(41)I(42)F(43)L(44)A(45)I(46)F(47)L(48) S(49)K(50)F(51)T(52)N(53)Q(54)L(55)L(56)L(57)S(58)L(59)L(60)E(61)A(62)V(63)I(64) R(65)T(66)V(67)T(68)T(69)L(70)Q(71)Q(72)L(73)L(74)T(75)

ESM Score vs. Experimental Data Correlation

To evaluate whether the ESM-based mutational scores capture real functional information, I cross-referenced the heatmap against the experimental L-protein mutant dataset from the spreadsheet. Positions such as those in the conserved EAVIRTVTTLQQLLT stretch of the TM domain (residues 62–75) consistently appear as dark columns in the ESM heatmap, indicating strong negative predicted fitness for any substitution. This aligns well with the MSA data, where these positions show near-zero variation across related phage sequences. Conversely, some positions in the soluble N-terminal domain (residues 1–40) show yellow-to-neutral scores at certain substitutions, suggesting the model predicts these changes are tolerable and consistent with the experimental observation that many soluble-domain mutations retain partial lysis activity.

The following 5 mutations were selected based on positive ESM LLR scores, MSA conservation analysis, and structural reasoning. Two mutations fall in the TM domain (residues 41–75) and three in the soluble N-terminal domain (residues 1–40). The mutations I chose to continue with in the Soluble Domain and Transmembrane domain are:

IndexPositionWildtype_AAMutation_AALLR Score
129CR2.3954
209SQ2.014
350KL2.5615
453NL1.8649
522FR1.6020

Alphafold multimer runs

8 chains of L-protein (including proposed mutations) separated by colons, total length 600 residues.

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

ModelpLDDTpTMipTM
Rank 127.80.1590.125
Rank 225.70.206~0.13
Rank 325.70.206~0.13
Rank 428.00.165
Rank 528.00.165

cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image

Structural Interpretation

All five ranked models show uniformly very low pLDDT scores (20–28, well below the <50 threshold). The PAE matrices are nearly uniformly red (~25–30 Å error) across all off-diagonal inter-chain blocks, with confidence only on the per-chain diagonal. This means the model cannot confidently place any chain relative to any other.

Despite the low confidence scores, the predicted structures display a biologically interesting pattern: in the Mol* viewer, helical secondary structure is visible at the center of the assembly, with disordered tails radiating outward in a sunburst arrangement. This is consistent with the pore-formation hypothesis for the L-protein. the TM helices converge at the central axis (as expected for a membrane pore), while the soluble N-terminal domains remain disordered and point outward into the cytoplasm. The per-position IDDT plot shows periodic peaks that correspond to the TM helix region of each chain, which is the only portion with marginally higher local confidence (~40–50).

Run 2: L-Protein + DnaJ CoFold

L-protein (mutant sequence, 75 residues, Chain A) + DnaJ (357 residues, Chain B), submitted as a two-chain heterodimer to ColabFold AlphaFold2 Multimer v3

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:MAKQDYYEILGVSKTAEEREIRKAYKRLAMKYHPDRNQGDKEAEAKFKEIKEAYEVLTDSQKRAAYDQYGHAAFEQGGMGGGGFGGGADFSDIFGDVFGDIFGGGRGRQRAARGADLRYNMELTLEEAVRGVTKEIRIPTLEECDVCHGSGAKPGTQPQTCPTCHGSGQVQMRQGFFAVQQTCPHCQGRGTLIKDPCNKCHGHGRVERSKTLSVKIPAGVDTGDRIRLAGEGEAGEHGAPAGDLYVQVQVKQHPIFEREGNNLYCEVPINFAMAALGGEIEVPTLDGRVKLKVPGETQTGKLFRMRGKGVKSVRGGAQGDLLCRVVVETPVGLNERQKQLLQELQESFGGPTGEHNSPRSKSFFDGVKKFFDDLTR

ModelpLDDTpTMNotes
Rank 170.10.527Best ranked by multimer metric
Rank 278.20.526Highest per-residue confidence
Rank 376.10.526Consistent with ranks 1–2
Rank 476.0~0.52Slightly variable L-protein placement
Rank 5~75~0.52Similar topology to rank 3–4

cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image cover image

Structural interpretation:

In contrast to the octamer run, the L-protein + DnaJ co-fold produces substantially higher confidence scores across all five models (pLDDT 70–78, pTM ~0.527), indicating that AlphaFold2 can form a meaningful structural prediction for this complex. This difference is expected: DnaJ is a well-characterised soluble protein with rich MSA coverage (~2000 sequences, Image 1), which anchors the prediction and allows confident inter-chain contact modeling.

The per-position IDDT plot reveals the key asymmetry of the complex: Chain A (L-protein, positions 0–75) consistently scores in the 20–50 range across all models while Chain B (DnaJ, positions 75–450) scores 80–95 throughout, well into the “confident” to “very high” range. This is biologically meaningful: the L-protein is a largely disordered, membrane-dependent protein that AF2 cannot confidently fold in isolation, while DnaJ is a structured chaperone that the model predicts with high accuracy. The L-protein’s low per-residue confidence does not invalidate the interaction prediction — it reflects the intrinsic disorder of the L-protein rather than a failure of the complex model.

All five ranked models show distinct blue (low error, ~0–10 Å) regions in the inter-chain quadrants, specifically, the L-protein (chain A, rows 0–75) shows confident predicted placement relative to the N-terminal J-domain region of DnaJ (approximately positions 100–250 in chain B). This is a strong signal: the model is confidently predicting that the L-protein contacts DnaJ, and that the interaction interface is localised rather than diffuse. Crucially, the contact region maps to L-protein residues in the soluble N-terminal domain (residues 1–40), not the TM domain — consistent with the published biological evidence that DnaJ interacts with the soluble domain of the L-protein (Chamakura et al., 2017).

The Mol* structure shows DnaJ folded as a large, confident beta-sheet and helix domain (blue/dark blue throughout), with the L-protein appearing as a short helix (red, low pLDDT) docked against DnaJ’s surface at the J-domain. The helical secondary structure of the L-protein’s TM region is partially preserved even in this soluble context, appearing as a compact helical element adjacent to the DnaJ interaction surface.

Relevance to proposed mutations:

Three of the five proposed mutations (S9Q, F22R, and C29R) fall directly within the soluble domain (residues 1–40) that the PAE matrix identifies as the predicted DnaJ contact region. This strongly supports their therapeutic rationale:

  • C29R introduces a positively charged arginine at a cysteine position within the predicted interface. This could either strengthen hydrophilic contacts with DnaJ or, more importantly, sterically and electrostatically disrupt the native interaction — potentially enabling DnaJ-independent folding by forcing the L-protein to adopt a stable conformation without chaperone assistance.

  • F22R replaces an aromatic residue with arginine at another interface-proximal position, similarly altering the electrostatic character of the binding surface.

  • S9Q lies at the N-terminal edge of the predicted contact zone; the glutamine substitution introduces new hydrogen-bonding capacity that could stabilize the soluble domain’s fold autonomously.

The two TM mutations (K50L and N53L) fall in Chain A positions beyond the confident inter-chain contact region, consistent with TM residues not participating in DnaJ binding — instead targeting membrane insertion efficiency independently of the DnaJ interaction.

The L-protein’s low per-residue pLDDT throughout means the exact contact geometry should be treated as a hypothesis rather than a reliable atomic model. AlphaFold2 lacks membrane context, so the TM domain is modeled as if soluble. Validation via co-immunoprecipitation or crosslinking mass spectrometry of the wildtype and mutant complexes would be required to confirm the predicted interface. A more reliable structural prediction for just the soluble domain co-folded with DnaJ’s J-domain (rather than full-length L-protein) could also be attempted, as this would focus modeling resources on the well-defined interaction region.