Week 5 HW: Protein Design II

Part A: SOD1 Binder Peptide Design (From Pranam)

Part 1: Generate Binders with PepMLM

Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.

The protein entry can be found on UniProt under the accession above. The wild-type protein sequence is:

>sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS
AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV
HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
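The A4V substitution can also be introduced programmatically. The following is a minimal sketch and not part of the original workflow: the helper name `apply_mutation` and its `offset` parameter are my own. It assumes the conventional SOD1 numbering, in which the initiator methionine is not counted, so "position 4" is the fifth residue of the UniProt sequence.

```python
# Wild-type human SOD1 (UniProt P00441), including the initiator Met.
SOD1_WT = (
    "MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS"
    "AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV"
    "HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ"
)


def apply_mutation(seq: str, wt: str, pos: int, mut: str, offset: int = 1) -> str:
    """Return seq with residue `pos` changed from `wt` to `mut`.

    `pos` is 1-based. `offset` is the number of leading residues skipped by
    the numbering scheme (assumed to be 1 here: the initiator Met).
    """
    idx = pos - 1 + offset
    if seq[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[idx]}")
    return seq[:idx] + mut + seq[idx + 1:]


SOD1_A4V = apply_mutation(SOD1_WT, "A", 4, "V")
print(SOD1_A4V[:10])  # MATKVVCVLK
```

The wild-type check inside the helper guards against an off-by-one error in the numbering: if the offset were wrong, the residue found would not be alanine and the call would fail instead of silently mutating the wrong position.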

Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:

I copied the Colab notebook to my personal drive. I also applied for access, although that no longer seems to be necessary.

Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison. Record the perplexity scores that indicate PepMLM’s confidence in the binders.

I ticked the single-sequence checkbox, pasted the protein sequence above, and gave the job the name “SOD1_fk_v1”. I then ran the model and generated several sequences. This took several runs, because most batches contained sequences with an “X”, which AlphaFold would later not accept. In the end I mixed and matched sequences from different batches. These are the sequences aa1 to aa4, with the reference sequence named aa0. The following table records each sequence and its perplexity score.
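Screening out sequences with an “X” can be automated. The sketch below is an illustration, not something taken from the PepMLM notebook; the function name `is_standard` is my own, and the third peptide in the example batch is a made-up sequence included only to show a rejection.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def is_standard(peptide: str) -> bool:
    """True if the peptide is non-empty and uses only standard residues."""
    return len(peptide) > 0 and set(peptide.upper()) <= STANDARD_AA


# The first two peptides are from this write-up; the third is a made-up
# example containing an "X".
batch = ["AHYGVLAAAVKWRRK", "WRYDPVTGRYAAKKA", "SRXDVYVGRVKARAK"]
accepted = [p for p in batch if is_standard(p)]
print(accepted)  # ['AHYGVLAAAVKWRRK', 'WRYDPVTGRYAAKKA']
```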

ID | Sequence | Pseudo Perplexity
aa0 | FLYRWLPSRRGG | -
aa1 | AHYGVLAAAVKWRRK | 15.4397
aa2 | SRYDVYVGRVKARAK | 18.3568
aa3 | WRYDPVTGRYAAKKA | 9.3430
aa4 | SWVPVYTAVVKLKRK | 20.8359
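For context, a pseudo-perplexity is commonly defined as the exponential of the mean negative log-likelihood that a masked language model assigns to each residue when that residue is masked, so lower values mean higher model confidence. The sketch below shows only this arithmetic. The log-probabilities are made-up numbers, and whether the PepMLM notebook computes its score in exactly this way is an assumption on my part.

```python
import math


def pseudo_perplexity(token_log_probs):
    """exp of the mean negative log-probability over the peptide residues."""
    if not token_log_probs:
        raise ValueError("need at least one log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Made-up example: a model that gives every residue a probability of 0.1.
example_log_probs = [math.log(0.1)] * 12
print(round(pseudo_perplexity(example_log_probs), 4))  # 10.0
```

Read this way, a score of about 9 (aa3) corresponds to the model being roughly as uncertain as if it were choosing among 9 equally likely residues at each position.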

Part 2: Evaluate Binders with AlphaFold3

Navigate to the AlphaFold Server: alphafoldserver.com

For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.

[Images: AlphaFold Server models for aa0, aa1, aa2, aa3 and aa4]

Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?

ID | Sequence | Pseudo Perplexity | ipTM | pTM | Binding Mode
aa0 | FLYRWLPSRRGG | - | 0.32 | 0.79 | Surface-bound
aa1 | AHYGVLAAAVKWRRK | 15.4397 | 0.43 | 0.82 | Surface-bound
aa2 | SRYDVYVGRVKARAK | 18.3568 | 0.46 | 0.87 | Surface-bound
aa3 | WRYDPVTGRYAAKKA | 9.3430 | 0.23 | 0.84 | Mostly surface-bound, though more buried than the others
aa4 | SWVPVYTAVVKLKRK | 20.8359 | 0.42 | 0.84 | Mostly surface-bound, and the most buried of the generated peptides

Binding site of each peptide:

aa0
- In the same region as the N-terminus, parallel to the beta barrel
- In the same region as the dimer interface (the dimer interface is where the N-terminus and C-terminus meet, or, put more simply, the beginning and the end of the protein)
- The peptide has the form of a strand

aa1
- On the opposite side of the protein from the N-terminus
- Wraps around the beta barrel from the side and the top

aa2
- Roughly 90 degrees away from the N-terminus, and therefore from the dimer interface
- Parallel to the beta barrel, though the peptide has the form of a C

aa3
- Roughly 90 degrees away from the N-terminus, and therefore from the dimer interface
- Perpendicular to the beta barrel
- The peptide has the form of a C, two beta sheets on either side, with the belly of the C pointing towards the protein

aa4
- Roughly 120 degrees away from the N-terminus, and therefore from the dimer interface
- Parallel to the beta barrel
- No regular shape, but wrapped into the protein in three dimensions

In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

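All generated peptides except aa3 exceed the ipTM of the reference binder aa0 (0.32): aa1, aa2 and aa4 score 0.43, 0.46 and 0.42, while aa3 scores 0.23. All pTM values of the generated complexes (0.82 to 0.87) exceed that of the reference complex (0.79). According to a discussion with the TA, values above 0.8 constitute good confidence that the overall structure is correct; the pTM values of the generated complexes meet this threshold, whereas none of the ipTM values do.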
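The comparison against the reference binder can be written down directly from the table. This is a small sketch using the ipTM and pTM values recorded above; the variable names are my own.

```python
# ipTM and pTM values from the AlphaFold Server results table.
results = {
    "aa0": {"iptm": 0.32, "ptm": 0.79},  # known binder FLYRWLPSRRGG
    "aa1": {"iptm": 0.43, "ptm": 0.82},
    "aa2": {"iptm": 0.46, "ptm": 0.87},
    "aa3": {"iptm": 0.23, "ptm": 0.84},
    "aa4": {"iptm": 0.42, "ptm": 0.84},
}
REFERENCE = "aa0"
ref = results[REFERENCE]

# Peptides whose score is strictly higher than the reference binder's.
better_iptm = [k for k, v in results.items()
               if k != REFERENCE and v["iptm"] > ref["iptm"]]
better_ptm = [k for k, v in results.items()
              if k != REFERENCE and v["ptm"] > ref["ptm"]]

print(better_iptm)  # ['aa1', 'aa2', 'aa4']
print(better_ptm)   # ['aa1', 'aa2', 'aa3', 'aa4']
```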

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:

Paste the peptide sequence. Paste the A4V mutant SOD1 sequence in the target field. Check the boxes: predicted binding affinity, solubility, hemolysis probability, net charge (pH 7), molecular weight.

[Screenshot: PeptiVerse results]

Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see.

All peptides show weak predicted binding affinity. This is somewhat expected from the AlphaFold data, as all generated peptides show only surface-level binding.

Do peptides with higher ipTM also show stronger predicted affinity?

No. According to a quick trendline analysis, the relationship is negative: peptides with higher ipTM tend to have lower predicted affinity. The affinity scores scatter around 6.27, with a standard deviation of around 200.
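A trend check like this can be reproduced with a Pearson correlation coefficient. The PeptiVerse affinities are only available in the screenshot, so the sketch below demonstrates the calculation on the pseudo-perplexity and ipTM values of aa1 to aa4 from the tables above; replacing `perplexity` with the four predicted affinities would give the correlation discussed here. The function name is my own, and with only four data points any such coefficient is a rough indication at best.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Values for aa1, aa2, aa3, aa4 (in that order) from the tables above.
perplexity = [15.4397, 18.3568, 9.3430, 20.8359]
iptm = [0.43, 0.46, 0.23, 0.42]

r = pearson(perplexity, iptm)
print(f"r = {r:.2f}")
```

A negative `r` for ipTM against affinity would confirm the negative trendline; a value near zero would mean the two predictions carry essentially independent information for this small sample.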

Are any strong binders predicted to be hemolytic or poorly soluble?

My predictions didn’t produce any strong binders, so I cannot answer this question. All predicted peptides are non-hemolytic (values range between 22 and 77) and soluble.

Which peptide best balances predicted binding and therapeutic properties?

None of my peptides binds strongly, but all of them have good therapeutic properties, so the sequence with the highest predicted binding affinity offers the best balance. This is aa3, which also has the lowest ipTM score.

Choose one peptide you would advance and justify your decision briefly.

I’d choose aa3. None of the sampled peptides has a strong predicted binding affinity and all of them have solid therapeutic properties, so I chose the peptide with the highest predicted binding affinity.

Part 4: Generate Optimized Peptides with moPPIt

Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

Open the moPPIt Colab linked from the HuggingFace moPPIt model card.
Make a copy and switch to a GPU runtime.
In the notebook:
Paste your A4V mutant SOD1 sequence.
Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
Set peptide length to 12 amino acids.
Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.

Unfortunately, I wasn’t able to complete this part because Google didn’t allow me to use an A100 or L4 GPU. I then tried with T4 GPUs, but the first cell ran in an infinite loop until my GPU credits ran out.

After generation, briefly describe how these moPPIt peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

See above

Part B: BRD4 Drug Discovery Platform Tutorial (Gabriele) [optional for Committed Listeners]

Part 0: Sign-up to Boltz Lab

I signed up for access to Boltz Lab, but as of writing this homework I have not received any credentials.

Part 1: Structural Predictions in the Sandbox

Part 2: Setting Up a BRD4 Design Project

Part 3: Running Your Virtual Screen

Part 4: Analysis and Discussion

Part C: Final Project: L-Protein Mutants

Option 1: Mutagenesis

Designing these mutants with good computational confidence is hard. It will show you limitations of some of the structure based models. Ultimately, you can pick various combinations of mutations and get lab results and then decide to pick the next round of mutations, but this assay will not be easy to run at scale in this class.

I copied the Colab notebook and worked through it to the best of my ability. Any experimental work will be done in the BioPunk Node at a later time, as discussed with Eliott Roth (our node leader).

Run this notebook to generate, for each position in the amino acid sequence, a “score” for what would happen to the protein if it were mutated to another amino acid. It can be positive or negative for the protein. We want to identify possible mutations that are “positive”. If you run this notebook, you will see a .csv file in the sidebar. You can download it and look at it in Google Sheets if that’s easier.

Running the Colab notebook gave me several outputs. Unfortunately, they were poorly documented, so all information should be treated with caution. The first run analyses the entire protein sequence and produces an exhaustive list of potential mutation sites:

Model 1

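Position | Wild_Type_AA | Mutation_AA | LLR_Score
50 | K | L | 2.561468
29 | C | R | 2.395427
39 | Y | L | 2.24178
29 | C | S | 2.04315
9 | S | Q | 2.014325
29 | C | Q | 1.997049
29 | C | P | 1.971029
29 | C | L | 1.960646
50 | K | I | 1.928801
53 | N | L | 1.864932
61 | E | L | 1.818098
52 | T | L | 1.813968
50 | K | F | 1.802069
29 | C | T | 1.797247
29 | C | K | 1.795878
5 | F | Q | 1.795244
5 | F | R | 1.659717
29 | C | A | 1.648656
27 | Y | R | 1.628061
22 | F | R | 1.602028
5 | F | P | 1.596891
50 | K | V | 1.594576
50 | K | S | 1.574557
5 | F | T | 1.559024
5 | F | S | 1.556417
45 | A | L | 1.539248
39 | Y | S | 1.517457
27 | Y | S | 1.497053
40 | V | L | 1.47763
27 | Y | L | 1.474637
22 | F | S | 1.423358
29 | C | E | 1.383281
39 | Y | A | 1.364999
29 | C | N | 1.362601
50 | K | A | 1.357795
29 | C | I | 1.344121
5 | F | L | 1.332615
17 | N | R | 1.323651
39 | Y | I | 1.320103
39 | Y | T | 1.302804
26 | D | R | 1.268762
29 | C | H | 1.246107
39 | Y | F | 1.245851
39 | Y | V | 1.24439
23 | K | R | 1.236555
25 | E | R | 1.22935
24 | H | R | 1.227779
50 | K | T | 1.222131
27 | Y | Q | 1.218851
27 | Y | T | 1.215567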

Model 2

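ID | Amino Acid | Position | LLR_Score
0 | L | 50 | 2.561468
1 | L | 39 | 2.24178
2 | I | 50 | 1.928801
3 | L | 53 | 1.864932
4 | L | 52 | 1.813968
5 | F | 50 | 1.802069
6 | V | 50 | 1.594576
7 | S | 50 | 1.574557
8 | L | 45 | 1.539248
9 | S | 39 | 1.517457
10 | L | 40 | 1.47763
11 | A | 39 | 1.364999
12 | A | 50 | 1.357795
13 | I | 39 | 1.320103
14 | T | 39 | 1.302804
15 | F | 39 | 1.245851
16 | V | 39 | 1.24439
17 | T | 50 | 1.222131
18 | L | 54 | 1.12086
19 | R | 39 | 1.064191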

Model 3

Position | Wild_Type_AA | Mutation_AA | LLR_Score
50 | K | L | 2.561468
29 | C | R | 2.395427
39 | Y | L | 2.24178
29 | C | S | 2.04315
9 | S | Q | 2.014325
29 | C | Q | 1.997049
29 | C | P | 1.971029
29 | C | L | 1.960646
50 | K | I | 1.928801
53 | N | L | 1.864932

Model 4

Position | Wild_Type_AA | Mutation_AA | LLR_Score
50 | K | L | 2.561468
29 | C | R | 2.395427
29 | C | R | 2.395427
39 | Y | L | 2.24178
29 | C | S | 2.04315
29 | C | S | 2.04315
9 | S | Q | 2.014325
9 | S | Q | 2.014325
29 | C | Q | 1.997049
29 | C | Q | 1.997049
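The score column in these outputs is labelled LLR, which I take to mean a log-likelihood ratio: the log-probability a protein language model assigns to the mutant residue at a position minus the log-probability it assigns to the wild-type residue. Under that reading, a positive value means the model finds the mutant residue more plausible than the wild type. The sketch below shows only the arithmetic; the function name and the probabilities are made up for illustration and are not taken from the notebook.

```python
import math


def llr(probs: dict, wt: str, mut: str) -> float:
    """Log-likelihood ratio of mutant vs. wild-type residue at one position."""
    return math.log(probs[mut]) - math.log(probs[wt])


# Hypothetical model probabilities at a single position (not notebook output).
probs_at_position = {"K": 0.02, "L": 0.26, "I": 0.14}

print(round(llr(probs_at_position, "K", "L"), 3))  # 2.565
print(llr(probs_at_position, "K", "K"))            # 0.0
```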

Here is the heatmap generated from the Colab notebook.

[Figure: Phage Mutation Heatmap]

Use the experimental data here. This dataset contains information about mutants of the L-Protein and their effect on lysis in the lab - L-Protein Mutants

I copied the CSV file from the website into a Google Sheet and into the Colab notebook.

First check, does the experimental data correlate with the scores from the notebook in (b)? This should give you a clue on how well these language embeddings capture information about this protein sequence.

Let’s first look at the experimental data. It lists the positions and mutations together with a binary flag indicating whether lysis occurred. The relevant rows are those with lysis = 1, of which there are 35. The next step is to check whether these positions appear among the mutation sites proposed by the model.

I’m unsure what the instructors mean by correlating the scores from the experimental data with the predicted data. The workflow described above is a comparison of target positions. As a next step, one can check whether the identified targets have a positive LLR score.
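One way to make this comparison concrete is to split the experimentally tested mutations by outcome and compare their predicted scores. The following is a minimal sketch under the assumption that the experimental sheet can be reduced to (position, mutant residue, lysis) rows. The LLR values are taken from the Model 1 table, but the three experimental rows are placeholders for illustration only, not real measurements, and the function name is my own.

```python
from statistics import mean

# LLR scores keyed by (position, mutant residue); values from the Model 1 table.
llr_scores = {
    (50, "L"): 2.561468,
    (29, "R"): 2.395427,
    (39, "L"): 2.24178,
    (53, "L"): 1.864932,
}

# PLACEHOLDER rows (position, mutant residue, lysis flag), NOT real data.
# Replace them with the rows of the experimental sheet.
experiment = [(50, "L", 1), (29, "R", 0), (39, "L", 1)]


def mean_llr_by_outcome(scores, rows):
    """Mean predicted LLR for lysis = 0 and lysis = 1 among scored mutations."""
    groups = {0: [], 1: []}
    for pos, mut, lysis in rows:
        if (pos, mut) in scores:  # skip mutations the model did not score
            groups[lysis].append(scores[(pos, mut)])
    return {flag: (mean(vals) if vals else None) for flag, vals in groups.items()}


print(mean_llr_by_outcome(llr_scores, experiment))
```

If the language model captured the biology well, the mean LLR of the lysis = 1 group should be clearly higher than that of the lysis = 0 group; similar means would suggest the scores carry little information about lysis.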

Using information about the effect of protein mutations at these sites - both the scores and the experimental data in the drive, come up with 5 mutations for each student along with how you came up with them and why you believe they would work. 2 of the variants you submit must have mutations in the transmembrane region (refer to notes above on what amino acid positions these are) and 2 of them must be in the soluble region. Remember that you can also use the pBLAST to see which residues are conserved and not mutate them if you want to. One easy way to generate sequence mutations could be to look for residue positions and mutations that have a positive mutational effect either in the experimental data or have a positive score from step 1, and pick a combination of those mutations.

ID | Model | Position | Wildtype | Mutation
1 | 1 | 61 | E | L
2 | 1 | 50 | K | F
3 | 2 | 53 | N | L
4 | 2 | 52 | T | L
5 | 3 | 53 | N | L
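The picks can be cross-checked against the Model 1 scores with a few lines of code. This is a sketch using only values from the tables above; the wild-type residue at position 52 (T) is taken from the Model 1 table, and the variable names are my own.

```python
# LLR scores from the Model 1 table, keyed by (position, wild type, mutant).
model1_llr = {
    (61, "E", "L"): 1.818098,
    (50, "K", "F"): 1.802069,
    (53, "N", "L"): 1.864932,
    (52, "T", "L"): 1.813968,
}

# Picks as (ID, position, wild type, mutant).
picks = [
    (1, 61, "E", "L"),
    (2, 50, "K", "F"),
    (3, 53, "N", "L"),
    (4, 52, "T", "L"),
    (5, 53, "N", "L"),
]

first_seen = {}   # mutation -> ID of the first pick that uses it
duplicates = []   # (later pick ID, earlier pick ID)
for pick_id, pos, wt, mut in picks:
    key = (pos, wt, mut)
    if key in first_seen:
        duplicates.append((pick_id, first_seen[key]))
    else:
        first_seen[key] = pick_id
    score = model1_llr.get(key)
    status = "positive" if score is not None and score > 0 else "check"
    print(f"pick {pick_id}: {wt}{pos}{mut}  LLR={score}  {status}")

print("duplicates:", duplicates)  # duplicates: [(5, 3)]
```

All five picks have a positive LLR in Model 1, but picks 3 and 5 are the same substitution (N53L) chosen from two different model outputs, so the list contains four distinct mutations.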

You can utilize Af2_Multimer to generate a Multimeric Assembly; you can do this by making your query sequence as. We want to do this because - A running hypothesis for how this protein functions is that it assembles to make a perforation in the bacterial membrane.

No Colab was provided for that. If applicable the colab from option 2 could be utilized. This can be submitted at a later point.

Use of Generative AI

Generative AI was used as a conceptual drafting aid during the development of this project. Specifically, it supported the structuring and refinement of complex ideas related to protein design and use of computational tools. It was instrumental in explaining the methods used in the colab notebook to generate mutations of the lysis protein, as well as clarifying the scientific concepts in the related reading. The AI was used to iteratively improve clarity of language and to explore alternative conceptual framings. All final judgments were made by the author. The link for the prompts and responses is attached in the repository.