Week 5 — Protein Design Part II

Part A: SOD1 Binder Peptide Design

Goal: Design short peptides that bind mutant SOD1 and then decide which ones are worth advancing toward therapy

Part 1: Generate Binders with PepMLM

The SOD1 Protein0 has base sequence:

MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

To obtain the functional and mutated protein we need to do the removal of the start codon Methionone and mutate the fourth A to a V.

ATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Using PepMLM in my edited copy of the notebook I got the following pepties of length 12:

[         Binder  Pseudo Perplexity
 0  WHYPAVXIEHKX          13.025725,
          Binder  Pseudo Perplexity
 0  WSYPATXIRHGK          17.076882,
          Binder  Pseudo Perplexity
 0  WSSYAVAAELKX           9.094557,
          Binder  Pseudo Perplexity
 0  WRYPVVXARWGX          10.105283]

I also computed the perplexity of the known SOD1 binding peptide ‘FLYRWLPSRRGG’

compute_pseudo_perplexity(model, tokenizer, protein_seq, 'FLYRWLPSRRGG')

np.float64(22.52521074663301)

It appears that the model like its peptides betters than the known binder.

Actually when I wne tot the next step I finally noticed every single one of the generated peptides had an X/unknown which alphafold barfs on. I fixed this by adjusting the top_k value to be top 9 to allow more choices besides X and got:

         Binder  Pseudo Perplexity
 0  WDNVGYAIYSGK          17.318647
 1  HDWVGQGIDQGE          26.359836
 2  ESYYDQAVDQLE          35.716432
 3  VSYPGQVVGHLP          34.004369]

These are more suprising than the other ones (which isn’t too surprising given the wider latidute of to pick less likely amino acids), but still decent and similar to the known binding protein.

Part 2: Evaluate Binders with AlphaFold3

Running alphfold agains these binders give:

Known Peptide - FLYRWLPSRRGG Results

Alphafold is confident in the shape of protein. It is intersting while that while alphafold does identify it as binding to the N-terminus it doesn’t feel very confident about that binding 0.25

Peptide Binder 0 - WDNVGYAIYSGK Results

Alphafold is confident in the shape of protein. The binding certainity (ipTM) is only 0.62 also it is binding to the beta barallel not the N terminus.

Peptide Binder 1 - HDWVGQGIDQGE Results

Alphafold is confident in the shape of protein but the binding certainity (ipTM) is only 0.5 also it is binding to the alpha helix not the N terminus.

Peptide Binder 2 - ESYYDQAVDQLE Results

Alphafold is confident in the shape of protein. The binding certainity (ipTM) is low 0.42 and it is predicted to bind to the N-terminus.

Peptide Binder 3 - VSYPGQVVGHLP Results

Alphafold is confident in the shape of protein. The binding certainity (ipTM) is only 0.4 also it is binding to the beta barallel not the N terminus.

Overall given that Peptide Binder 2 seems to be more confident in the binding to the N-Terminus than it is for the known peptide. Also looking at the grids for the various results it doesn’t seem like alphafold is really the confident for where the peptides are binding even for the known peptide?

Part 3: Evaluate Peptide Properties

Using Peptiverse I got the following for the basic properties

Input	Property	Prediction	Value	Unit
FLYRWLPSRRGG	💦 Hydrophobicity (GRAVY)	Non-hemolytic	1507.7	Probability
FLYRWLPSRRGG	🔗 Binding Affinity	Weak binding	5.962	pKd/pKi
FLYRWLPSRRGG	📏 Length		12	aa
FLYRWLPSRRGG	⚖️ Molecular Weight		1507.7	Da
FLYRWLPSRRGG	⚡ Net Charge (pH 7)		2.76
FLYRWLPSRRGG	🎯 Isoelectric Point		11.71	pH
FLYRWLPSRRGG	💦 Hydrophobicity (GRAVY)		-0.71	GRAVY
WDNVGYAIYSGK	🩸 Hemolysis	Non-hemolytic	0.050	Probability}
WDNVGYAIYSGK	🔗 Binding Affinity	Weak binding	5.994	pKd/pKi
WDNVGYAIYSGK	📏 Length		12	aa
WDNVGYAIYSGK	⚖️ Molecular Weight		1372.5	Da
WDNVGYAIYSGK	⚡ Net Charge (pH 7)		-0.24
WDNVGYAIYSGK	🎯 Isoelectric Point		5.83	pH
WDNVGYAIYSGK	💦 Hydrophobicity (GRAVY)		-0.46	GRAVY
HDWVGQGIDQGE	🩸 Hemolysis	Non-hemolytic	0.043	Probability
HDWVGQGIDQGE	📏 Length		12	aa
HDWVGQGIDQGE	⚡ Net Charge (pH 7)		-3.14
HDWVGQGIDQGE	🎯 Isoelectric Point		4.31	pH
HDWVGQGIDQGE	💦 Hydrophobicity (GRAVY)		-1.18	GRAVY
HDWVGQGIDQGE	🩸 Hemolysis	Non-hemolytic	0.043	Probability
HDWVGQGIDQGE	📏 Length		12	aa
HDWVGQGIDQGE	⚖️ Molecular Weight		1340.4	Da
HDWVGQGIDQGE	⚡ Net Charge (pH 7)		-3.14
HDWVGQGIDQGE	🎯 Isoelectric Point		4.31	pH
HDWVGQGIDQGE	💦 Hydrophobicity (GRAVY)		-1.18	GRAVY
VSYPGQVVGHLP	🩸 Hemolysis	Non-hemolytic	0.024	Probability
VSYPGQVVGHLP	📏 Length		12	aa
VSYPGQVVGHLP	⚖️ Molecular Weight		1252.4	Da
VSYPGQVVGHLP	⚡ Net Charge (pH 7)		-0.18
VSYPGQVVGHLP	🎯 Isoelectric Point		6.71	pH
VSYPGQVVGHLP	💦 Hydrophobicity (GRAVY)		0.30	GRAVY
ESYYDQAVDQLE	🩸 Hemolysis	Non-hemolytic	0.052	Probability
ESYYDQAVDQLE	🔗 Binding Affinity	Weak binding	6.105	pKd/pKi
ESYYDQAVDQLE	📏 Length		12	aa
ESYYDQAVDQLE	⚖️ Molecular Weight		1459.5	Da
ESYYDQAVDQLE	⚡ Net Charge (pH 7)		-4.15
ESYYDQAVDQLE	🎯 Isoelectric Point		4.05	pH
ESYYDQAVDQLE	💦 Hydrophobicity (GRAVY)		-1.22	GRAVY

Checking the binding affinity separatrely it doesn’t seem like any of them bond strongly including the known binding peptide.

Input	Property	Prediction	Value	Unit
FLYRWLPSRRGG	🔗 Binding Affinity	Weak binding	5.962	pKd/pKi
WDNVGYAIYSGK	🔗 Binding Affinity	Weak binding	5.994	pKd/pKi
HDWVGQGIDQGE	🔗 Binding Affinity	Weak binding	5.216	pKd/pKi
ESYYDQAVDQLE	🔗 Binding Affinity	Weak binding	6.105	pKd/pKi
VSYPGQVVGHLP	🔗 Binding Affinity	Weak binding	5.524	pKd/pKi

None of my peptides (and even the known binder?) seem like great candidates. Not sure if I am doing something wrong? Given this I guess I would go with ESYYDQAVDQLE for further work since it seems to be the closest binding candidate.

Part 4: Generate Optmized Peptides

I made a copy of the notebook and tried to do the optimization computations there. Unfortunately I ran into two problems I wasn’t able to resolve:

I wasn’t able to get the recommended GPU and could only get a T4
I repeatedly got (what appears to be an off by one error) pretty deep in the moPPit code. I tried running with a bunhc of different parameters. I did determine that the probably seems to be with something indexed by the sample parameters since the error message changes when I change the sample count. The error I am getting is below:

Target Motifs: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], device='cuda:0')
Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

NFE: 0:   0%|          | 0/0.9990000128746033 [00:00<?, ?it/s]Weight Vector: tensor([1., 1.], device='cuda:0')

NFE: 0:   0%|          | 0/0.9990000128746033 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/content/moPPIt/moppit.py", line 81, in <module>
    x_1 = solver.multi_guidance_sample(args=args, x_init=x_init, 
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/content/moPPIt/flow_matching/solver/discrete_solver.py", line 365, in multi_guidance_sample
    guided_u_t, pos_indices, cand_tokens, improvement_values, delta_S = guided_transition_scoring(x_t, u_t, w, score_models, t, w, args)
                                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content/moPPIt/flow_matching/utils/multi_guidance.py", line 69, in guided_transition_scoring
    improvement *= importance[count]
                   ~~~~~~~~~~^^^^^^^
IndexError: index 2 is out of bounds for dimension 0 with size 2
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_1400/1413049073.py in <cell line: 0>()
     38 ret = proc.wait()
     39 if ret != 0:
---> 40    raise RuntimeError(f"moo.py failed with code {ret}")

RuntimeError: moo.py failed with code 1

Part B: BRD4 Drug Discovery Platform Tutorial

Part 1 Structural Predictions

Results

Compound	Binding Confidence	Optimization Score	Structure Confidence
Hit	0.76	0.26	0.90
Lead	0.44	0.23	0.88
JQ1	0.96	.044	0.98

Discussion

Does Binding Confidence increase as you move from hit to clinical candidate? What would you expect, and why might it deviate?

I am not sure I understand the terminology but I am assuming that the order is Lead -> Hit -> Clinical Candidate (JQ1). As we go through this sequence I would expect binding condidence to increase in most scores. I can see situations where there could be experimental evidence that overrides the modeling results and perhaps give an inversion of scores?

Inspect the predicted binding pose for JQ1. Can you identify potential key binding interactions.

It looks like JQ1 is binding in the top/interior of the a barrel made for the 4 helix barrel that is the primary teririary structure in the protein.

• Compare the Optimization Scores. How do the scores compare for JQ1 vs the Lead.

The optimization score for JQ1 is higher.

Part 3 Generative Virtual Screen

Ran virtual screen with recommended parameters

Part 4 Analysis

I got no results that had both an optimization score > 0.4 and binding score > 0.8 which I think means I got no “High Confidence” candidates. If I relax that to an high binding OR high optimization I get

2 compounds that binding scors of 0.8, with optimization scores of 0.27 and 0.34. Of these two SM-5S4NDA3E has the highest optimizatoin score and is my best candidate.
12 candidates with an optimization score of greater than 0.4. The three best candidates are (the rest of binding condience of less than 0.7)
- SM-279VMWMP has binding confindence of 0.78 and optimization score of 0.4
- SM-QAZ4MM4W has binding condidence of 0.77 and optimization score 0.45
- SM-2F86NWEQ has binding condence of 0.72 and optimization score 0.42
JQ1 appears to be a much better candidate it has Binding Confidence of 0.99 and Optimization Score 0.55.
The binding againstBRD2 for my top 4 candidates was still quite strong, between 0.8 and 0.77 so these binders are not very selective.

Part C: Final Group Project: L Protein Mutants

I did a variation of option 3 which was to generate random mutations using the information in this spreadsheet on mutants. However inspired by Professor’s Sahas mention of the constraints because the virus is encoding other proteins in different frameshifts, I did a search for the point mutations that preserve the structures of the proteins and put the results in this spreadsheet. The entire journey and the code that generates the final results is captured in this python code which is the downloaded python code from the Marimo notebook I used. Given we are trying to engineer this protein for phage therapy using modified MS2 (I think) it seems like you need to take this into account and either select mutations from this list or do your mutational analysis on both the lysis protein and frameshifted protein at the same time, which it doesn’t seem like is being done? In particular, most of the mutations in the spreadsheet are not in the list of mutations that preserve the frameshifted proteins.