Week 5 — Protein Design Part II

Part A: SOD1 Binder Peptide Design

Goal: Design short peptides that bind mutant SOD1 and then decide which ones are worth advancing toward therapy

Part 1: Generate Binders with PepMLM

The SOD1 Protein0 has base sequence:

MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

To obtain the functional and mutated protein we need to do the removal of the start codon Methionone and mutate the fourth A to a V.

ATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Using PepMLM in my edited copy of the notebook I got the following pepties of length 12:

[         Binder  Pseudo Perplexity
 0  WHYPAVXIEHKX          13.025725,
          Binder  Pseudo Perplexity
 0  WSYPATXIRHGK          17.076882,
          Binder  Pseudo Perplexity
 0  WSSYAVAAELKX           9.094557,
          Binder  Pseudo Perplexity
 0  WRYPVVXARWGX          10.105283]

I also computed the perplexity of the known SOD1 binding peptide ‘FLYRWLPSRRGG’

compute_pseudo_perplexity(model, tokenizer, protein_seq, 'FLYRWLPSRRGG')

np.float64(22.52521074663301)

It appears that the model like its peptides betters than the known binder.

Actually when I wne tot the next step I finally noticed every single one of the generated peptides had an X/unknown which alphafold barfs on. I fixed this by adjusting the top_k value to be top 9 to allow more choices besides X and got:

         Binder  Pseudo Perplexity
 0  WDNVGYAIYSGK          17.318647
 1  HDWVGQGIDQGE          26.359836
 2  ESYYDQAVDQLE          35.716432
 3  VSYPGQVVGHLP          34.004369]

These are more suprising than the other ones (which isn’t too surprising given the wider latidute of to pick less likely amino acids), but still decent and similar to the known binding protein.

Part 2: Evaluate Binders with AlphaFold3

Running alphfold agains these binders give:

  • Known Peptide - FLYRWLPSRRGG Results Peptide Known Binder Peptide Known Binder

Alphafold is confident in the shape of protein. It is intersting while that while alphafold does identify it as binding to the N-terminus it doesn’t feel very confident about that binding 0.25

  • Peptide Binder 0 - WDNVGYAIYSGK Results Peptide Binder 0 Peptide Binder 0

Alphafold is confident in the shape of protein. The binding certainity (ipTM) is only 0.62 also it is binding to the beta barallel not the N terminus.

  • Peptide Binder 1 - HDWVGQGIDQGE Results Peptide Binder 1 Peptide Binder 1

Alphafold is confident in the shape of protein but the binding certainity (ipTM) is only 0.5 also it is binding to the alpha helix not the N terminus.

  • Peptide Binder 2 - ESYYDQAVDQLE Results Peptide Binder 2 Peptide Binder 2

Alphafold is confident in the shape of protein. The binding certainity (ipTM) is low 0.42 and it is predicted to bind to the N-terminus.

  • Peptide Binder 3 - VSYPGQVVGHLP Results Peptide Binder 3 Peptide Binder 3

Alphafold is confident in the shape of protein. The binding certainity (ipTM) is only 0.4 also it is binding to the beta barallel not the N terminus.

Overall given that Peptide Binder 2 seems to be more confident in the binding to the N-Terminus than it is for the known peptide. Also looking at the grids for the various results it doesn’t seem like alphafold is really the confident for where the peptides are binding even for the known peptide?

Part 3: Evaluate Peptide Properties

Using Peptiverse I got the following for the basic properties

InputPropertyPredictionValueUnit
FLYRWLPSRRGG💦 Hydrophobicity (GRAVY)Non-hemolytic1507.7Probability
FLYRWLPSRRGG🔗 Binding AffinityWeak binding5.962pKd/pKi
FLYRWLPSRRGG📏 Length12aa
FLYRWLPSRRGG⚖️ Molecular Weight1507.7Da
FLYRWLPSRRGG⚡ Net Charge (pH 7)2.76
FLYRWLPSRRGG🎯 Isoelectric Point11.71pH
FLYRWLPSRRGG💦 Hydrophobicity (GRAVY)-0.71GRAVY
WDNVGYAIYSGK🩸 HemolysisNon-hemolytic0.050Probability}
WDNVGYAIYSGK🔗 Binding AffinityWeak binding5.994pKd/pKi
WDNVGYAIYSGK📏 Length12aa
WDNVGYAIYSGK⚖️ Molecular Weight1372.5Da
WDNVGYAIYSGK⚡ Net Charge (pH 7)-0.24
WDNVGYAIYSGK🎯 Isoelectric Point5.83pH
WDNVGYAIYSGK💦 Hydrophobicity (GRAVY)-0.46GRAVY
HDWVGQGIDQGE🩸 HemolysisNon-hemolytic0.043Probability
HDWVGQGIDQGE📏 Length12aa
HDWVGQGIDQGE⚡ Net Charge (pH 7)-3.14
HDWVGQGIDQGE🎯 Isoelectric Point4.31pH
HDWVGQGIDQGE💦 Hydrophobicity (GRAVY)-1.18GRAVY
HDWVGQGIDQGE🩸 HemolysisNon-hemolytic0.043Probability
HDWVGQGIDQGE📏 Length12aa
HDWVGQGIDQGE⚖️ Molecular Weight1340.4Da
HDWVGQGIDQGE⚡ Net Charge (pH 7)-3.14
HDWVGQGIDQGE🎯 Isoelectric Point4.31pH
HDWVGQGIDQGE💦 Hydrophobicity (GRAVY)-1.18GRAVY
VSYPGQVVGHLP🩸 HemolysisNon-hemolytic0.024Probability
VSYPGQVVGHLP📏 Length12aa
VSYPGQVVGHLP⚖️ Molecular Weight1252.4Da
VSYPGQVVGHLP⚡ Net Charge (pH 7)-0.18
VSYPGQVVGHLP🎯 Isoelectric Point6.71pH
VSYPGQVVGHLP💦 Hydrophobicity (GRAVY)0.30GRAVY
ESYYDQAVDQLE🩸 HemolysisNon-hemolytic0.052Probability
ESYYDQAVDQLE🔗 Binding AffinityWeak binding6.105pKd/pKi
ESYYDQAVDQLE📏 Length12aa
ESYYDQAVDQLE⚖️ Molecular Weight1459.5Da
ESYYDQAVDQLE⚡ Net Charge (pH 7)-4.15
ESYYDQAVDQLE🎯 Isoelectric Point4.05pH
ESYYDQAVDQLE💦 Hydrophobicity (GRAVY)-1.22GRAVY

Checking the binding affinity separatrely it doesn’t seem like any of them bond strongly including the known binding peptide.

InputPropertyPredictionValueUnit
FLYRWLPSRRGG🔗 Binding AffinityWeak binding5.962pKd/pKi
WDNVGYAIYSGK🔗 Binding AffinityWeak binding5.994pKd/pKi
HDWVGQGIDQGE🔗 Binding AffinityWeak binding5.216pKd/pKi
ESYYDQAVDQLE🔗 Binding AffinityWeak binding6.105pKd/pKi
VSYPGQVVGHLP🔗 Binding AffinityWeak binding5.524pKd/pKi

None of my peptides (and even the known binder?) seem like great candidates. Not sure if I am doing something wrong? Given this I guess I would go with ESYYDQAVDQLE for further work since it seems to be the closest binding candidate.

Part 4: Generate Optmized Peptides

I made a copy of the notebook and tried to do the optimization computations there. Unfortunately I ran into two problems I wasn’t able to resolve:

  • I wasn’t able to get the recommended GPU and could only get a T4
  • I repeatedly got (what appears to be an off by one error) pretty deep in the moPPit code. I tried running with a bunhc of different parameters. I did determine that the probably seems to be with something indexed by the sample parameters since the error message changes when I change the sample count. The error I am getting is below:
Target Motifs: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], device='cuda:0')
Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

NFE: 0:   0%|          | 0/0.9990000128746033 [00:00<?, ?it/s]Weight Vector: tensor([1., 1.], device='cuda:0')

NFE: 0:   0%|          | 0/0.9990000128746033 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/content/moPPIt/moppit.py", line 81, in <module>
    x_1 = solver.multi_guidance_sample(args=args, x_init=x_init, 
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/content/moPPIt/flow_matching/solver/discrete_solver.py", line 365, in multi_guidance_sample
    guided_u_t, pos_indices, cand_tokens, improvement_values, delta_S = guided_transition_scoring(x_t, u_t, w, score_models, t, w, args)
                                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content/moPPIt/flow_matching/utils/multi_guidance.py", line 69, in guided_transition_scoring
    improvement *= importance[count]
                   ~~~~~~~~~~^^^^^^^
IndexError: index 2 is out of bounds for dimension 0 with size 2
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_1400/1413049073.py in <cell line: 0>()
     38 ret = proc.wait()
     39 if ret != 0:
---> 40    raise RuntimeError(f"moo.py failed with code {ret}")

RuntimeError: moo.py failed with code 1

Part B: BRD4 Drug Discovery Platform Tutorial

Part 1 Structural Predictions

Results

CompoundBinding ConfidenceOptimization ScoreStructure Confidence
Hit0.760.260.90
Lead0.440.230.88
JQ10.96.0440.98

Discussion

  • Does Binding Confidence increase as you move from hit to clinical candidate? What would you expect, and why might it deviate?

I am not sure I understand the terminology but I am assuming that the order is Lead -> Hit -> Clinical Candidate (JQ1). As we go through this sequence I would expect binding condidence to increase in most scores. I can see situations where there could be experimental evidence that overrides the modeling results and perhaps give an inversion of scores?

  • Inspect the predicted binding pose for JQ1. Can you identify potential key binding interactions.

It looks like JQ1 is binding in the top/interior of the a barrel made for the 4 helix barrel that is the primary teririary structure in the protein.

• Compare the Optimization Scores. How do the scores compare for JQ1 vs the Lead.

The optimization score for JQ1 is higher.

Part 3 Generative Virtual Screen

Ran virtual screen with recommended parameters

Part 4 Analysis

I got no results that had both an optimization score > 0.4 and binding score > 0.8 which I think means I got no “High Confidence” candidates. If I relax that to an high binding OR high optimization I get

  • 2 compounds that binding scors of 0.8, with optimization scores of 0.27 and 0.34. Of these two SM-5S4NDA3E has the highest optimizatoin score and is my best candidate.
  • 12 candidates with an optimization score of greater than 0.4. The three best candidates are (the rest of binding condience of less than 0.7)
    • SM-279VMWMP has binding confindence of 0.78 and optimization score of 0.4
    • SM-QAZ4MM4W has binding condidence of 0.77 and optimization score 0.45
    • SM-2F86NWEQ has binding condence of 0.72 and optimization score 0.42
  • JQ1 appears to be a much better candidate it has Binding Confidence of 0.99 and Optimization Score 0.55.
  • The binding againstBRD2 for my top 4 candidates was still quite strong, between 0.8 and 0.77 so these binders are not very selective.

Part C: Final Group Project: L Protein Mutants

I did a variation of option 3 which was to generate random mutations using the information in this spreadsheet on mutants. However inspired by Professor’s Sahas mention of the constraints because the virus is encoding other proteins in different frameshifts, I did a search for the point mutations that preserve the structures of the proteins and put the results in this spreadsheet. The entire journey and the code that generates the final results is captured in this python code which is the downloaded python code from the Marimo notebook I used. Given we are trying to engineer this protein for phage therapy using modified MS2 (I think) it seems like you need to take this into account and either select mutations from this list or do your mutational analysis on both the lysis protein and frameshifted protein at the same time, which it doesn’t seem like is being done? In particular, most of the mutations in the spreadsheet are not in the list of mutations that preserve the frameshifted proteins.