Week 5 HW: Protein Design Part II

Part 1: Generate Binders with PepMLM

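Changed the 5th character from A to V. This is the A4V mutation: SOD1 residue numbering conventionally omits the initiator methionine, so position 4 of the mature protein is the 5th character of the sequence below.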

>sp|P00441|SOD1_HUMAN_A4V Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2
MATK**V**VCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS
AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV
HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
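The substitution above can be reproduced programmatically. The following is a minimal sketch, not the method used for this homework: `apply_mutation` and its `offset` parameter are hypothetical names introduced for illustration, and the wild-type fragment is simply the first 10 residues of the sequence above with the edit reverted.

```python
import re


def apply_mutation(seq: str, mutation: str, offset: int = 0) -> str:
    """Apply a point mutation label such as 'A4V' to a protein sequence.

    `offset` is added to the position in the label to get the 1-based
    position in `seq` (use offset=1 when the label ignores the initiator Met).
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {mutation!r}")
    wild_type, mutant = match.group(1), match.group(3)
    position = int(match.group(2)) + offset
    index = position - 1  # convert 1-based position to 0-based index
    if not 0 <= index < len(seq):
        raise ValueError(f"position {position} is outside the sequence")
    if seq[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {seq[index]}"
        )
    return seq[:index] + mutant + seq[index + 1:]


# First 10 residues of wild-type SOD1, including the initiator Met.
wt_nterm = "MATKAVCVLK"
mutant_nterm = apply_mutation(wt_nterm, "A4V", offset=1)
print(mutant_nterm)
```

Checking the wild-type residue before substituting guards against the off-by-one error that the two numbering schemes invite.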

Sequence 5 TKGNAGSRLACG WRGDDSVDFEGR 18.929941

Part 2: Evaluate Binders with AlphaFold3

Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?

Protein-peptide sequence 1: SSTLRLFAQLRR, ipTM = 0.55, pTM = 0.67. The moderate ipTM of 0.55 suggests the peptide is surface-bound and loosely associated with the outer part of the beta-barrel. It is not buried, sits on the exterior surface of the AlphaFold model, and is not located near the N-terminus where the A4V mutation sits.

Protein-peptide sequence 2: GTGTASGALLLK, ipTM = 0.52, pTM = 0.68. The moderate ipTM of 0.52 suggests the peptide is surface-bound and loosely associated with the outer part of the beta-barrel. It is not buried, sits on the exterior surface of the AlphaFold model, and is not located near the N-terminus where the A4V mutation sits. Hydrophobic residues of the peptide align with hydrophobic residues of the protein in the complex, which may contribute to the ipTM.

Protein-peptide sequence 3: ATVGVVQVIICN, ipTM = 0.64, pTM = 0.71. The ipTM of 0.64 is the highest of the four complexes, indicating the most confident predicted interface, and the pTM of 0.71 indicates reasonable confidence in the overall A4V SOD1 fold. The peptide appears to wrap around the beta-barrel and is surface-bound. It does not directly touch the N-terminal A4V site; instead it contacts residues in the middle of the protein sequence.

Protein-peptide sequence 4: WRGDDSVDFEGR, ipTM = 0.56, pTM = 0.65. The ipTM is above 0.5, suggesting a moderately confident predicted interface. The peptide appears surface-bound, does not localise near the N-terminus, is not buried within the hydrophobic core, and targets the edge of the beta-barrel.

4. In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

The ipTM values across the four protein-peptide complexes range from 0.52 to 0.64, so all PepMLM-generated peptides exceed the 0.5 cut-off used here for a plausible protein-peptide interaction. Sequence 3 (ATVGVVQVIICN) is the standout performer with an ipTM of 0.64, the most confident predicted interface, compared with the more moderate scores of Sequence 1 (0.55) and Sequence 4 (0.56). This high-scoring peptide appears to match or exceed the typical performance of generic binders in this system, possibly because its hydrophobic VVQVII motif complements the destabilised environment of the A4V mutant SOD1. Sequence 2 recorded the lowest confidence at 0.52; it is still a predicted binder, but like the other peptides it is surface-bound and does not target the N-terminal A4V site.
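The comparison in this paragraph can be tabulated with a few lines of Python. This is only a sketch using the scores reported above; the 0.5 cut-off is the one used in this write-up, not an official AlphaFold threshold, and the variable names are illustrative.

```python
# ipTM / pTM values as reported for the four AlphaFold3 complexes above.
results = {
    "SSTLRLFAQLRR": {"iptm": 0.55, "ptm": 0.67},
    "GTGTASGALLLK": {"iptm": 0.52, "ptm": 0.68},
    "ATVGVVQVIICN": {"iptm": 0.64, "ptm": 0.71},
    "WRGDDSVDFEGR": {"iptm": 0.56, "ptm": 0.65},
}

CUTOFF = 0.5  # assumption: the cut-off used in this write-up

# Rank peptides from highest to lowest interface confidence.
ranked = sorted(results.items(), key=lambda item: item[1]["iptm"], reverse=True)
best_peptide = ranked[0][0]
above_cutoff = [peptide for peptide, scores in ranked if scores["iptm"] >= CUTOFF]

for peptide, scores in ranked:
    flag = "pass" if scores["iptm"] >= CUTOFF else "fail"
    print(f"{peptide}  ipTM={scores['iptm']:.2f}  pTM={scores['ptm']:.2f}  {flag}")
```

If the known binder's ipTM is available, it can be added to `results` as a fifth entry so that it is ranked alongside the generated peptides.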

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide: Paste the peptide sequence. Paste the A4V mutant SOD1 sequence in the target field.

Peptide sequence: SSTLRLFAQLRR
GTGTASGALLLK
ATVGVVQVIICN
WRGDDSVDFEGR

3. Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties? Choose one peptide you would advance and justify your decision briefly.

The peptides generally show a positive association between structural confidence and predicted affinity, though the relationship is not strictly linear. Peptide 3 (ATVGVVQVIICN), with the highest structural confidence (ipTM = 0.64), had the highest predicted binding affinity (6.204 pKd/pKi). By comparison, Peptide 2 (GTGTASGALLLK) had the lowest structural confidence (ipTM = 0.52) and the weakest predicted affinity (5.555 pKd/pKi). All four peptides are predicted to be soluble (1.000 probability) and non-hemolytic, with hemolysis probabilities well below 10%. While Peptide 3 offers the highest binding affinity, its greater hydrophobicity (GRAVY = 1.83) and near-neutral charge (-0.21) may make it less stable in aqueous environments than Peptide 1 (SSTLRLFAQLRR), which combines a moderate structural fit (ipTM = 0.55) with great solubility and a favourable cationic charge (+2.46).
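The hydrophobicity figure can be checked by hand. GRAVY is the mean Kyte-Doolittle hydropathy value over all residues; the sketch below assumes PeptiVerse reports the standard definition, which is consistent with the 1.83 quoted for Peptide 3. The function name `gravy` is illustrative.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide: str) -> float:
    """Return the grand average of hydropathy (mean Kyte-Doolittle value)."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in peptide) / len(peptide)


peptides = ["SSTLRLFAQLRR", "GTGTASGALLLK", "ATVGVVQVIICN", "WRGDDSVDFEGR"]
for peptide in peptides:
    print(f"{peptide}  GRAVY={gravy(peptide):.2f}")
```

Positive values indicate a net hydrophobic peptide, negative values a net hydrophilic one.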

I would advance Peptide 3 (ATVGVVQVIICN) for further development, as it has the highest predicted binding affinity and structural confidence.

Part 4: Generate Optimized Peptides with moPPIt

I tried to run this section, but the final step, generating samples.csv, throws the error below:

/content/moPPIt/flow_matching/path/path_sample.py:45: SyntaxWarning: invalid escape sequence '\s'
  x_t (Tensor): the sample along the path  :math:`X_t \sim p_t`.
/content/moPPIt/flow_matching/path/scheduler/scheduler.py:72: SyntaxWarning: invalid escape sequence '\s'
  SchedulerOutput: :math:`\alpha_t,\sigma_t,\frac{\partial}{\partial t}\alpha_t,\frac{\partial}{\partial t}\sigma_t`
/content/moPPIt/flow_matching/path/scheduler/scheduler.py:79: SyntaxWarning: invalid escape sequence '\k'
  Computes :math:`t` from :math:`\kappa_t`.
Traceback (most recent call last):
  File "/content/moPPIt/moppit.py", line 25, in <module>
    target_sequence = tokenizer(target, return_tensors='pt').to(device)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 800, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 417, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_164/1413049073.py in <cell line: 0>()
     38 ret = proc.wait()
     39 if ret != 0:
---> 40    raise RuntimeError(f"moo.py failed with code {ret}")

RuntimeError: moo.py failed with code 1
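The decisive line in the log is `AssertionError: Torch not compiled with CUDA enabled`, which PyTorch raises when code requests a CUDA device but the installed build has no CUDA support. Assuming that is the cause here, running the notebook on a GPU runtime may resolve it. Another option is to choose the device defensively, as in the sketch below; `pick_device` is a hypothetical helper and not part of moPPIt, and whether moPPIt runs acceptably on CPU is untested.

```python
def pick_device() -> str:
    """Return "cuda" when a usable GPU build of PyTorch is present, else "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed wherever the script currently hard-codes the device, for example in the `.to(device)` call shown in the traceback.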

Part C: Final Project: L-Protein Mutants

Q3 and Q4) The language model estimates whether a mutation is common in nature, whereas the experimental data measures lysis. For this particular protein sequence, the language model embeddings do not capture enough information to serve as a predictive tool on their own, without fine-tuning or specialised training on viral functional assays. The Evolutionary Index (EI) is usually computed with ESM-2, but it cannot reflect gain of function or the functional nuances that cause bacterial cell lysis. Protein levels can reflect structural integrity and conformational stability; the score there was 0.088, indicating that the language model's embeddings do not capture the structural and conformational constraints of the L-Protein. Statistical analysis gave a correlation coefficient of 0.0068, an extremely weak or non-existent linear relationship between the ESM-2 score and the lysis outcome. Furthermore, the mean score for mutations that cause lysis (-0.371) was very close to the mean for those that do not (-0.389), and a t-test produced a p-value of 0.958, meaning there is no statistically significant difference in ESM-2 scores between the two groups in the experimental L-Protein lysis data.
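The two statistics quoted above can be sketched with the standard library. The data below are hypothetical toy values, not the L-Protein dataset, and the function names are illustrative. The sketch computes the Pearson correlation between score and a 0/1 lysis label (the point-biserial correlation) and Welch's t statistic; turning the t statistic into a p-value needs a t-distribution, for example from SciPy, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical toy data: one ESM-2 score and one lysis label per mutant.
scores = [-0.5, -0.2, -0.4, -0.3, -0.6, -0.1]
lysis = [1, 0, 1, 0, 0, 1]

lytic = [s for s, label in zip(scores, lysis) if label == 1]
non_lytic = [s for s, label in zip(scores, lysis) if label == 0]

print(f"r = {pearson_r(scores, lysis):.3f}")
print(f"mean lytic = {mean(lytic):.3f}, mean non-lytic = {mean(non_lytic):.3f}")
print(f"Welch t = {welch_t(lytic, non_lytic):.3f}")
```

A correlation near zero together with nearly identical group means, as reported above, is the pattern expected when the score carries no information about the outcome.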

MSL protein structural breakdown:

Soluble region (N-terminal) typically spans residues 1–40, while the transmembrane (TM) region spans roughly residues 41–70.

Sites like the leucine (L) at position 10 or the cysteines (C) are often highly conserved across homologs.

Q5)

Q6) [Multimer Assembly Assignment]