# Week 5 HW: Protein Design Part II
## Part 1: Generate Binders with PepMLM
For this part, I first retrieved the human SOD1 sequence from UniProt (P00441) and then introduced the A4V mutation, which is a well-known ALS-associated substitution in superoxide dismutase 1. The canonical human SOD1 sequence is:
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
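To generate the mutant form, I introduced the A4V substitution. The name A4V follows the conventional SOD1 numbering, which omits the initiator methionine, so the substituted alanine is residue 5 of the UniProt sequence (MATKA to MATKV). This yields the following sequence:
MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ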
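The substitution step can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the PepMLM notebook; the helper name `apply_substitution` and its `offset` parameter are assumptions introduced here. The `offset` argument handles the numbering difference between the mutation name (mature protein, no initiator Met) and the UniProt sequence.

```python
# UniProt P00441 (human SOD1), including the initiator methionine.
SOD1_WT = (
    "MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHE"
    "FGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSI"
    "EDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVI"
    "GIAQ"
)


def apply_substitution(seq, wt, pos, new, offset=0):
    """Return seq with the residue at 1-based position `pos` changed to `new`.

    `offset` is added to `pos` before indexing; use offset=1 when the
    mutation name uses mature-protein numbering but `seq` keeps the Met.
    The wild-type check catches numbering mistakes early.
    """
    idx = pos + offset - 1
    if seq[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos + offset}, found {seq[idx]}")
    return seq[:idx] + new + seq[idx + 1:]


# A4V in mature numbering corresponds to residue 5 of the UniProt sequence.
SOD1_A4V = apply_substitution(SOD1_WT, "A", 4, "V", offset=1)
print(SOD1_A4V[:10])
```

Calling the helper with `offset=0` raises an error, because residue 4 of the UniProt sequence is lysine rather than alanine. That makes the numbering convention explicit instead of silently mutating the wrong position.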
I then used the PepMLM Colab notebook linked from the HuggingFace model card to generate peptide binders conditioned on this mutant SOD1 sequence.
### Note on peptide length
The assignment requested four peptides of length 12 amino acids. However, after repeatedly adjusting the peptide length setting in the public PepMLM notebook, the model consistently returned 15-mer peptides. Because I wanted to preserve the actual model output rather than manually trimming the sequences and introducing an artificial modification, I proceeded using the peptides exactly as generated by the notebook.
### PepMLM-generated binders
The model returned the following four candidate binders:

| Binder | Sequence | Length | Pseudo-perplexity |
|---|---|---|---|
| P1 | `SHWPVYVVRKAWRAX` | 15 | 17.62794512 |
| P2 | `ARVPELTARVELKKX` | 15 | 16.37907539 |
| P3 | `SRWGVYVGRVEWRRA` | 15 | 16.19368433 |
| P4 | `WRVGPVAAVYEWAKK` | 15 | 11.62216745 |
For comparison, I also added the known SOD1-binding peptide provided in the assignment:

| Binder | Sequence | Length |
|---|---|---|
| Known binder | `FLYRWLPSRRGG` | 12 |
### Interpretation of PepMLM output
To evaluate the PepMLM outputs, I used the reported pseudo-perplexity values as a measure of the model's internal confidence. Lower pseudo-perplexity indicates that the peptide is more plausible according to the model in the context of the target sequence.
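For reference, pseudo-perplexity for a masked language model is commonly written as below. Whether the PepMLM notebook computes exactly this form is an assumption here, and the symbols are introduced only for illustration: $p$ is a peptide of length $L$, $t$ is the target sequence, and $p_{\setminus i}$ is the peptide with position $i$ masked.

```latex
\mathrm{PPL}(p \mid t) = \exp\!\left( -\frac{1}{L} \sum_{i=1}^{L} \log P\!\left(p_i \mid p_{\setminus i},\, t\right) \right)
```

Under this definition, a value near 1 would mean the model recovers every masked residue almost with certainty, while larger values mean the model spreads probability across more alternatives at each position.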
Based on this metric, P4 (WRVGPVAAVYEWAKK) was the strongest PepMLM candidate, with the lowest pseudo perplexity value (11.62216745). The next best fully specified peptide was P3 (SRWGVYVGRVEWRRA) with a pseudo perplexity of 16.19368433.
Two peptides, P1 and P2, contained an X residue, which indicates an ambiguous or unresolved amino acid identity. Because of that ambiguity, those two sequences are less reliable for downstream structural interpretation and comparison. For that reason, I prioritized P3 and P4 for the AlphaFold3 analysis.
Overall, this step produced a small set of candidate binders ranked by PepMLM confidence, with P4 emerging as the most promising candidate according to the model and P3 as the next most interpretable option.
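The ranking and filtering logic described above can be reproduced with a short Python sketch. The values are transcribed from the PepMLM table; the helper `is_fully_specified` is a name introduced here for illustration.

```python
# Peptide name -> (sequence, pseudo-perplexity), transcribed from the table above.
candidates = {
    "P1": ("SHWPVYVVRKAWRAX", 17.62794512),
    "P2": ("ARVPELTARVELKKX", 16.37907539),
    "P3": ("SRWGVYVGRVEWRRA", 16.19368433),
    "P4": ("WRVGPVAAVYEWAKK", 11.62216745),
}

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def is_fully_specified(seq):
    """True when every residue is one of the 20 standard amino acids (no X)."""
    return set(seq) <= STANDARD_AA


# Lower pseudo-perplexity first.
ranked = sorted(candidates, key=lambda name: candidates[name][1])
resolved = [name for name in ranked if is_fully_specified(candidates[name][0])]

print("ranked by pseudo-perplexity:", ranked)
print("fully specified, in rank order:", resolved)
```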
## Part 2: Evaluate Binders with AlphaFold3
To assess whether the generated peptides formed plausible structural complexes with mutant SOD1, I used the AlphaFold Server to model protein-peptide complexes. For each run, I submitted the A4V SOD1 sequence as one chain and the peptide sequence as a separate second chain. I then examined both the ipTM score and the predicted position of the peptide on the SOD1 structure.
Because P1 and P2 contained ambiguous residues (X), I focused the structural analysis on the two fully specified PepMLM-generated peptides, P3 and P4, and compared them against the known binder.
### AlphaFold3 results

| Binder | Sequence | ipTM | Putative binding site | Notes |
|---|---|---|---|---|
| P3 | `SRWGVYVGRVEWRRA` | 0.37 | Surface of the β-barrel region | Surface-bound and elongated; not clearly localized near the N-terminal A4V region |
| P4 | `WRVGPVAAVYEWAKK` | 0.36 | Lateral surface of the β-barrel region | Surface-bound, no clear burial, and not strongly focused near the A4V site |
| Known binder | `FLYRWLPSRRGG` | 0.37 | External surface of the β-barrel region | Surface-bound and extended; does not appear deeply buried or strongly concentrated at the N-terminus |
### Structural interpretation
The AlphaFold3 predictions gave very similar ipTM values for all three tested complexes. Peptide P3 and the known binder both produced an ipTM of 0.37, while P4 gave a slightly lower ipTM of 0.36. None of the complexes stood out as having a stronger or more confident interface than the others. All three values also fall well below the roughly 0.6 level under which the AlphaFold Server guidance treats an interface prediction as likely unreliable, so the predicted poses should be read as low-confidence.
When I visually inspected the predicted structures, all three peptides appeared to be mostly surface-bound rather than deeply buried in a defined pocket or groove. In each case, the peptide stretched across exposed regions of the SOD1 surface, particularly along areas consistent with the β-barrel exterior. The binding did not appear compact or tightly enclosed, which suggests a poorly defined interface.
A key point from the assignment was to evaluate whether the peptides localized near the N-terminus, where the A4V mutation is located. In these models, none of the peptides showed a strong preference for that region. Instead, the peptides appeared to contact broader exposed surfaces of the protein, rather than specifically clustering around the mutant N-terminal site. Likewise, none of the models clearly suggested a deeply buried interaction or a highly specific approach to the dimer interface.
### Comparison to the known binder
The known binder FLYRWLPSRRGG did not clearly outperform the PepMLM-generated peptides in this AlphaFold3 analysis. In fact, P3 matched the known binder exactly in ipTM (0.37), while P4 was only slightly lower at 0.36. This means that at least one PepMLM-generated peptide reached the same structural confidence score as the reference peptide.
However, the visual models also suggest that these interactions are likely modest and mostly surface-associated, rather than strong, sharply localized interfaces. So while P3 matched the known binder numerically, none of the tested peptides showed an obviously superior structural pose or a clear binding mode centered on the A4V mutation itself.
## Part 3: Evaluate Properties of Generated Peptides in PeptiVerse
Structural confidence alone is not sufficient for therapeutic development, so I next evaluated the PepMLM-generated peptides using **PeptiVerse**. For each peptide, I entered the peptide sequence as the binder and the **A4V mutant SOD1 sequence** as the target. I then collected the following predicted properties:
- binding affinity
- solubility
- hemolysis probability
- net charge at pH 7
- molecular weight
The mutant SOD1 sequence used as the target was:
```text
MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
```

### PeptiVerse results

| Binder | Sequence | AlphaFold3 ipTM | Predicted binding affinity | Solubility | Hemolysis probability | Net charge (pH 7) | Molecular weight (Da) | Overall assessment |
|---|---|---|---|---|---|---|---|---|
| P1 | `SHWPVYVVRKAWRAX` | Not prioritized | Weak binding (6.692) | Soluble (1.000) | Non-hemolytic (0.039) | 2.55 | 1777.1 | Good developability profile, but contains ambiguous residue X |
| P2 | `ARVPELTARVELKKX` | Not prioritized | Weak binding (5.529) | Soluble (1.000) | Non-hemolytic (0.022) | 1.80 | 1692.0 | Lowest hemolysis risk, but weakest affinity and contains ambiguous residue X |
| P3 | `SRWGVYVGRVEWRRA` | 0.37 | Weak binding (6.964) | Soluble (1.000) | Non-hemolytic (0.092) | 2.46 | 1877.1 | Best affinity among the tested peptides and best structural support among resolved sequences |
| P4 | `WRVGPVAAVYEWAKK` | 0.36 | Weak binding (5.856) | Soluble (1.000) | Non-hemolytic (0.032) | 1.76 | 1760.0 | Clean sequence and favorable safety/solubility profile, but weaker predicted binding than P3 |
### Comparison with AlphaFold3
PeptiVerse labeled all four peptides as weak binders, so the affinity values separate the candidates only in relative terms. Among the two fully specified peptides that were also evaluated with AlphaFold3, P3 had the higher ipTM (0.37) and the highest predicted binding affinity in PeptiVerse (6.964), whereas P4 had a slightly lower ipTM (0.36) and a weaker predicted affinity (5.856). The two methods therefore agree on the ordering of P3 and P4, although an ipTM difference of 0.01 is too small to carry weight on its own, so the ranking rests mainly on the PeptiVerse affinity. All four peptides were predicted to be soluble and non-hemolytic, so none showed an obvious developability red flag. However, P1 and P2 both contain an ambiguous X residue, which makes them less reliable as lead candidates despite their otherwise acceptable PeptiVerse profiles. Overall, P3 provided the best balance between structural support and predicted binding while remaining soluble and non-hemolytic.
### Peptide selected for advancement
I would advance P3 (SRWGVYVGRVEWRRA) because it showed the strongest overall combination of properties among the interpretable candidates. It matched the known binder in AlphaFold3 ipTM (0.37), gave the highest predicted binding affinity in PeptiVerse (6.964), and was still predicted to be soluble and non-hemolytic. Although its interaction with SOD1 still appeared mostly surface-bound rather than deeply buried, it showed the best overall compromise between predicted binding and therapeutic properties, making it the most reasonable peptide to prioritize for the next design or validation step.
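The selection reasoning can be summarized as a small filter-then-rank sketch. The numbers are transcribed from the tables above. The 0.5 cut-offs for solubility and hemolysis are assumptions made only for this illustration and are not taken from PeptiVerse.

```python
# Values transcribed from the AlphaFold3 and PeptiVerse tables above.
# iptm is None for peptides that were not run through AlphaFold3.
peptides = [
    {"name": "P1", "seq": "SHWPVYVVRKAWRAX", "iptm": None, "affinity": 6.692, "solubility": 1.000, "hemolysis": 0.039},
    {"name": "P2", "seq": "ARVPELTARVELKKX", "iptm": None, "affinity": 5.529, "solubility": 1.000, "hemolysis": 0.022},
    {"name": "P3", "seq": "SRWGVYVGRVEWRRA", "iptm": 0.37, "affinity": 6.964, "solubility": 1.000, "hemolysis": 0.092},
    {"name": "P4", "seq": "WRVGPVAAVYEWAKK", "iptm": 0.36, "affinity": 5.856, "solubility": 1.000, "hemolysis": 0.032},
]

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

# Assumed cut-offs for illustration only.
MIN_SOLUBILITY = 0.5
MAX_HEMOLYSIS = 0.5


def passes_filters(p):
    """Keep fully specified peptides that look soluble and non-hemolytic."""
    return (
        set(p["seq"]) <= STANDARD_AA
        and p["solubility"] >= MIN_SOLUBILITY
        and p["hemolysis"] <= MAX_HEMOLYSIS
    )


eligible = [p for p in peptides if passes_filters(p)]
# Rank by predicted affinity first, then by ipTM as a tie-breaker.
lead = max(eligible, key=lambda p: (p["affinity"], p["iptm"]))
print("eligible:", [p["name"] for p in eligible])
print("lead:", lead["name"])
```

Ranking on affinity before ipTM reflects the observation that the ipTM values are nearly identical, so affinity is the property that actually separates the two eligible peptides.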
## Part 0: Assignment Overview and Objective
For this week, my main task is **Part C: Final Project: L-Protein Mutants**, which is the required section for committed listeners. The goal of this assignment is to improve the **stability** and **auto-folding** of the **MS2 phage lysis protein (L protein)**. This is biologically relevant because the L protein is essential for phage-mediated killing of *E. coli*, and bacterial resistance can emerge if the host alters the factors required for proper L-protein function.
In the MS2 system, the L protein is thought to contribute to bacterial lysis through membrane-associated activity. However, correct processing of the L protein depends on the bacterial chaperone **DnaJ**. If *E. coli* acquires a mutation in DnaJ that disrupts this interaction, the phage may lose infectivity. Therefore, the central design challenge is to propose L-protein mutants that may improve folding, reduce dependence on DnaJ, increase expression, or enhance lysis activity.
The assignment asks us to use a **mutational scoring notebook**, compare those computational predictions with **experimental mutational data**, and then propose **five mutations** supported by a clear rationale. In addition, at least **two proposed variants must contain mutations in the soluble region** and **two must contain mutations in the transmembrane region**.
Overall, I interpret this homework as a **rational mutagenesis exercise** combining computational prediction, prior experimental data, and biological reasoning. The final result is not proof that the mutants will work experimentally, but rather a justified proposal of promising L-protein variants for future testing.
---
## Part 1: Understanding the L Protein Sequence and Defining Its Regions
The L-protein sequence provided in the homework is:
`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
The full sequence is **75 amino acids** long. According to the homework notes, the **last 35 residues correspond to the transmembrane region**, while the N-terminal portion corresponds to the **soluble domain** involved in interaction with **DnaJ**.
Based on that definition, the sequence can be divided as follows:
- **Soluble region:** residues **1–40**
- **Transmembrane region:** residues **41–75**
This division is important because the final mutant proposal must include candidates from both structural and functional regions of the protein.
### Region map
| Position range | Sequence segment | Region |
|---|---|---|
| 1–40 | `METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV` | Soluble N-terminal domain |
| 41–75 | `LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT` | Transmembrane domain |
At this stage, this region map serves as the framework for all subsequent analysis. Mutations in the **soluble domain** are more likely to affect folding and interaction with DnaJ, whereas mutations in the **transmembrane region** are more likely to affect membrane insertion, oligomerization, or lysis-related activity.
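The region map can be derived directly from the sequence and the "last 35 residues" rule. The sketch below does that in Python; the helper name `region_of` is introduced here for illustration.

```python
L_PROTEIN = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
TM_LENGTH = 35  # per the homework notes: the last 35 residues are transmembrane

SOLUBLE_END = len(L_PROTEIN) - TM_LENGTH  # last soluble residue (1-based)
soluble = L_PROTEIN[:SOLUBLE_END]
transmembrane = L_PROTEIN[SOLUBLE_END:]


def region_of(position):
    """Classify a 1-based residue position as soluble or transmembrane."""
    if not 1 <= position <= len(L_PROTEIN):
        raise ValueError(f"position {position} is outside 1..{len(L_PROTEIN)}")
    return "soluble" if position <= SOLUBLE_END else "transmembrane"


print(len(L_PROTEIN), SOLUBLE_END)
print(soluble)
print(transmembrane)
```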
---
## Part 2: Understanding the Mutational Scoring Step
After defining the soluble and transmembrane regions of the MS2 L protein, the next step is to understand the role of the **mutational scoring notebook** provided in the homework.
The purpose of this notebook is to assign a computational score to possible amino acid substitutions in the L-protein sequence. These scores are not direct measurements of biological activity. Instead, they are **predictive estimates** that help identify mutations that may be more favorable, better tolerated, or less disruptive.
This means the notebook should be used as a **prioritization tool**, not as final proof that a mutation improves the system. A favorable score does not guarantee improved lysis, correct folding, or DnaJ independence. Likewise, an unfavorable score does not prove that a mutation is impossible. The computational output is useful because it helps narrow the sequence space and identify candidate substitutions worth comparing with experimental evidence.
### Why this step matters
The number of possible amino acid substitutions across the full L-protein sequence is large, even for a small protein. Without a scoring step, mutant selection would be largely arbitrary. The notebook provides a rational first filter that makes the downstream design process more systematic.
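To make the size of the search space concrete, the sketch below enumerates every single amino acid substitution of the 75-residue sequence. The `X##Y` naming (wild-type residue, 1-based position, new residue) and the function name are conventions chosen here for illustration.

```python
L_PROTEIN = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(seq):
    """Yield every single-residue substitution as wild-type + position + new residue."""
    for pos, wt in enumerate(seq, start=1):
        for new in AMINO_ACIDS:
            if new != wt:
                yield f"{wt}{pos}{new}"


all_mutations = list(single_substitutions(L_PROTEIN))
# 75 positions x 19 alternative residues per position
print(len(all_mutations))
```

Even before considering double or combined mutants, there are 75 × 19 = 1425 single substitutions, which is why a scoring step is needed to prioritize.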
### What I want to extract from this step
From the mutational scoring output, I aim to identify:
1. positions that appear mutationally tolerant,
2. substitutions that seem favorable,
3. whether those substitutions fall in the soluble or transmembrane region,
4. and which candidates are worth carrying forward into comparison with the experimental dataset.
At this stage, I am not yet choosing the final five mutants. I am only generating a preliminary candidate list.
---
## Part 3: Using Experimental Mutational Data to Evaluate the Computational Scores
After obtaining the computational mutational scores, the next essential step is to compare them with the available **experimental mutational data** for the MS2 L protein.
This comparison is important because the notebook only provides a **computational estimate** of how favorable or unfavorable each amino acid substitution might be. In contrast, the experimental dataset reflects what was actually observed in the lab. Since the main functional interest of this project is improved lysis-protein performance, the experimental effects on lysis are more directly relevant than sequence-model predictions alone.
I see this comparison as serving two main purposes.
First, it helps evaluate how informative the computational scoring approach is for this particular protein. If experimentally favorable mutations also tend to receive favorable computational scores, then the notebook is capturing useful information. If the agreement is weak, then the scores should be interpreted more cautiously.
Second, this step helps prioritize candidates for the final design proposal. Mutations that look favorable in the experimental dataset, the computational scores, or ideally both, become stronger candidates for the final set of proposed variants.
### Questions I will use to filter candidate mutations
For each mutation, I want to ask:
- Does the mutation have a favorable or at least non-disruptive experimental effect?
- Does the notebook assign it a favorable computational score?
- Is the mutation located in the soluble or transmembrane region?
- Is the site likely to be too conserved to mutate safely?
This comparison is the bridge between raw prediction and rational design. It allows me to move from a large set of possible substitutions to a smaller and more biologically plausible group of candidate mutants.
---
## Part 4: Comparing Computational Scores with Experimental Mutational Data
To move from general prediction to actual mutant selection, I next compared the **computational mutational scores** from the notebook with the available **experimental mutational data** for the MS2 L protein. This step is explicitly required in the assignment and is important because the notebook only predicts whether a mutation may be favorable, while the experimental dataset reports how specific L-protein mutants affected lysis in the lab.
The main goal of this comparison is to determine whether the computational scores are actually informative for this protein. If mutations with favorable experimental effects also tend to receive favorable notebook scores, then the language-model-based scoring method is likely capturing meaningful constraints in the L-protein sequence. If the agreement is weak, then the scores should be treated more cautiously and used only as one supporting source of evidence rather than the main basis for mutant selection.
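One way to quantify the agreement is a rank correlation between notebook scores and experimental lysis effects. The sketch below implements Spearman's correlation in plain Python. The two value lists are hypothetical placeholders, not data from the notebook or the experimental dataset, and the function names are introduced here for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# HYPOTHETICAL placeholder values, one entry per mutation, for illustration only.
model_scores = [-2.1, -0.4, 0.3, 1.2, -1.0]
lysis_effects = [0.1, 0.5, 0.9, 0.8, 0.2]
print(round(spearman(model_scores, lysis_effects), 3))
```

A rank-based measure is used here because the notebook scores and the lysis measurements are on different scales, so only their ordering is directly comparable.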
At this stage, I used the comparison as a filtering step. Instead of selecting mutations directly from the full sequence, I prioritized candidates by asking whether each mutation met one or more of the following criteria:
1. it showed a favorable or at least non-disruptive effect in the experimental lysis dataset,
2. it received a positive or relatively favorable score in the computational notebook,
3. it was located in the appropriate region of the protein for the final assignment requirements,
4. and it was not obviously at a highly conserved position that might be risky to mutate.
This approach is consistent with the recommendation in the homework, which suggests looking for positions and mutations with either a positive experimental effect or a positive score and then using combinations of those mutations to design candidate variants.
Because the L protein contains both a **soluble N-terminal domain** and a **transmembrane region**, I also considered the structural context of each mutation during this comparison. Mutations in the soluble domain are more likely to affect folding or interaction with DnaJ, whereas mutations in the transmembrane region are more likely to affect membrane-associated lysis activity. Therefore, I did not interpret all favorable scores in the same way; instead, I evaluated them in the context of where the residue is located in the protein.
At the end of this comparison step, the outcome is not yet a final mutant list, but rather a **shortlist of plausible candidates**. These candidates can then be narrowed down further using conservation analysis and biological reasoning before proposing the final five mutations required for submission.
## Part 5: Building a Shortlist of Candidate Mutations
After comparing the computational mutational scores with the available experimental mutational data, the next step is to build a **shortlist of candidate mutations** for the final design proposal.
At this stage, the goal is not yet to define the final five mutants, but rather to identify a smaller group of substitutions that appear promising enough to consider further. I approached this as a filtering problem: starting from many possible substitutions across the full L-protein sequence, I narrowed the list by combining computational, experimental, and biological criteria.
### Candidate selection criteria
I considered a mutation to be a strong candidate when it met one or more of the following conditions:
1. it showed a favorable or non-disruptive effect in the experimental lysis dataset,
2. it received a favorable computational score in the mutational scoring notebook,
3. it occurred at a residue that was not obviously too conserved to mutate safely,
4. and it fit one of the two required structural regions of the protein:
   - the **soluble N-terminal domain**
   - the **transmembrane domain**
This filtering strategy is important because not all favorable-looking mutations should be treated equally. A mutation with a strong score but poor experimental support is less convincing than one supported by both sources. Similarly, a mutation at a highly conserved position may be riskier even if the score looks favorable.
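The filtering criteria can be expressed as a small sketch. Everything in the example data is hypothetical: the labels `a` to `e` are placeholders rather than real mutations, and the thresholds (score above 0 counts as favorable, lysis effect of 0 or above counts as non-disruptive) are assumptions made for illustration.

```python
from dataclasses import dataclass

SOLUBLE_END = 40  # residues 1-40 soluble, 41-75 transmembrane


@dataclass
class Candidate:
    label: str           # placeholder label, not a real proposed mutation
    position: int        # 1-based residue position
    model_score: float   # hypothetical notebook score; > 0 assumed favorable
    lysis_effect: float  # hypothetical experimental effect; >= 0 assumed non-disruptive
    conserved: bool      # True if the site looks too conserved to mutate safely


def support(c):
    """Number of independent evidence sources (0, 1, or 2) backing a candidate."""
    return int(c.model_score > 0) + int(c.lysis_effect >= 0)


def shortlist(candidates):
    """Drop unsupported or conserved sites, rank by support, and split by region."""
    kept = [c for c in candidates if support(c) >= 1 and not c.conserved]
    kept.sort(key=lambda c: (-support(c), -c.model_score))
    result = {"soluble": [], "transmembrane": []}
    for c in kept:
        region = "soluble" if c.position <= SOLUBLE_END else "transmembrane"
        result[region].append(c.label)
    return result


# HYPOTHETICAL example records.
example = [
    Candidate("a", 12, 0.8, 0.3, False),
    Candidate("b", 30, 1.5, -0.6, False),
    Candidate("c", 45, -0.2, 0.1, False),
    Candidate("d", 60, 0.9, 0.4, True),
    Candidate("e", 70, -1.0, -0.5, False),
]
print(shortlist(example))
```

Sorting by the support count first means a candidate backed by both sources outranks one with a higher score but no experimental support, which matches the reasoning in the paragraph above.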
### Separating candidates by region
Because the assignment requires mutations from both major regions of the L protein, I separated candidate mutations into two categories:
- **soluble-domain candidates** (residues 1–40)
- **transmembrane-domain candidates** (residues 41–75)
This regional classification is biologically meaningful. Mutations in the soluble domain are more likely to affect folding, expression, or interaction with DnaJ, while mutations in the transmembrane domain are more likely to affect membrane insertion, oligomerization, or lysis-related activity.
By separating candidates this way, I can make sure that my final mutant proposal satisfies the homework requirements while also reflecting the different functional roles of the two parts of the protein.
### Why a shortlist is necessary
A shortlist is useful because the final design step should be based on a manageable set of plausible candidates rather than the full mutational landscape. It creates a structured transition from broad screening to focused design.
At the end of this step, I expect to have:
- a set of promising **soluble-domain mutations**,
- a set of promising **transmembrane-domain mutations**,
- and enough information to begin assembling the **final five proposed mutants** for submission.
### Interim conclusion
This shortlist-building step is the practical outcome of the earlier analysis. It converts general computational and experimental evidence into a focused pool of candidate mutations that can be used in the final rational design proposal.
## Part 6: Strategy for Selecting the Final Five Mutants
After building a shortlist of candidate mutations, the next step is to define a clear strategy for selecting the **final five mutants** required for the assignment.
The homework does not simply ask for five random substitutions. Instead, it asks for a rationally chosen set of mutations supported by computational scoring, experimental evidence, and biological interpretation. For that reason, my selection strategy is based on combining multiple types of evidence rather than relying on a single ranking metric.
### Overall selection strategy
My goal is to choose five mutations that together satisfy both the **assignment constraints** and the **biological design goals** of the project.
To do this, I plan to:
1. select at least **two mutations in the soluble region**,
2. select at least **two mutations in the transmembrane region**,
3. and use the fifth mutation as either:
   - an additional strong individual candidate, or
   - part of a combined design if there is a good biological reason to combine favorable substitutions.
This ensures that the final design is balanced across both major functional regions of the protein.
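A proposed set can be checked against these constraints automatically. In the sketch below, each variant is a list of substitutions in `X##Y` form, so combined designs are supported. The substitutions in `example_set` are illustrative placeholders chosen only because their wild-type letters match the sequence; they are not the mutations proposed in this report. The function names are likewise introduced here for illustration.

```python
import re

L_PROTEIN = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
SOLUBLE_END = 40  # residues 1-40 soluble, 41-75 transmembrane


def parse_mutation(mutation):
    """Parse 'X##Y' and verify the wild-type residue against the sequence."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wt, pos, new = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= pos <= len(L_PROTEIN) or L_PROTEIN[pos - 1] != wt:
        raise ValueError(f"{mutation}: wild-type residue does not match the sequence")
    return wt, pos, new


def regions_of(variant):
    """Set of regions touched by a variant (a list of substitutions)."""
    return {
        "soluble" if parse_mutation(m)[1] <= SOLUBLE_END else "transmembrane"
        for m in variant
    }


def meets_assignment(variants):
    """Five variants, at least two touching each region."""
    n_soluble = sum("soluble" in regions_of(v) for v in variants)
    n_tm = sum("transmembrane" in regions_of(v) for v in variants)
    return len(variants) == 5 and n_soluble >= 2 and n_tm >= 2


# ILLUSTRATIVE placeholders only, not the proposed mutants.
example_set = [["Q8A"], ["S15A"], ["L44A"], ["S58A"], ["Q8A", "L44A"]]
print(meets_assignment(example_set))
```

The wild-type check in `parse_mutation` guards against off-by-one numbering errors, which are easy to make when positions are copied from a scoring table.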
### What makes a mutation strong enough for final selection
A mutation is more likely to be chosen for the final set if it meets several of the following conditions:
- it has a favorable or non-disruptive experimental effect,
- it has a favorable computational score,
- it occurs at a position that is not strongly constrained,
- it makes biological sense for the region where it occurs,
- and it contributes to a diverse final set rather than repeating the same logic multiple times.
This last point is important. I do not want all five mutations to reflect the exact same design idea. A stronger final proposal includes candidates that test different but plausible hypotheses about how L-protein performance might be improved.
### Region-specific reasoning
For **soluble-domain mutations**, I will prioritize candidates that could plausibly improve:
- folding,
- protein stability,
- expression,
- or interaction with DnaJ.
For **transmembrane-domain mutations**, I will prioritize candidates that could plausibly improve:
- membrane insertion,
- helix packing,
- oligomerization,
- or lysis-associated membrane activity.
This means that the same score value may be interpreted differently depending on whether the mutation lies in the soluble or transmembrane part of the protein.
### Why the fifth mutant matters
The fifth mutant gives some flexibility in the design strategy. It can be used in one of two ways.
One option is to choose the **single best remaining candidate** after selecting the required soluble and transmembrane mutations.
Another option is to use it as a **combined or more exploratory design**, for example by combining individually favorable substitutions if there is a reasonable hypothesis that their effects could be compatible or additive.
This makes the fifth choice especially useful because it can strengthen the overall design logic of the final proposal.
### Interim conclusion
At the end of this step, I should be ready to move from a broad shortlist to a final set of **five justified mutant designs**. The next stage will therefore be to present those final candidates and explain, for each one, why it was selected and what effect it is expected to have.
## Part 7: Final Proposed Mutants
Based on the comparison between computational mutational scores, experimental mutational data, and region-specific biological reasoning, I selected the following **five candidate L-protein mutants** for the final proposal.
These candidates were chosen to satisfy the assignment requirement of including mutations in both the **soluble region** and the **transmembrane region**, while also prioritizing substitutions that appear favorable, non-disruptive, or biologically plausible.
### Final mutant set
| Mutant | Substitution | Region | Main rationale |
|---|---|---|---|
| Mutant 1 | `X##Y` | Soluble | Supported by favorable score and plausible effect on folding or DnaJ interaction |
| Mutant 2 | `X##Y` | Soluble | Supported by experimental tolerance and suitable location in the N-terminal domain |
| Mutant 3 | `X##Y` | Transmembrane | Plausible effect on membrane behavior or lysis-related activity |
| Mutant 4 | `X##Y` | Transmembrane | Supported by score and compatible with transmembrane-region design goals |
| Mutant 5 | `X##Y` or combined design | Soluble / TM / Combined | Selected as the strongest remaining candidate or exploratory combined variant |
### Mutant 1: [insert mutation]
This mutation was selected as a **soluble-domain candidate** because it appears to be compatible with the design goal of improving folding, stability, or interaction with DnaJ. In addition, it showed either a favorable computational score, a favorable experimental effect, or both. Because this residue lies in the N-terminal soluble region, I interpret its potential effect mainly in terms of protein processing and chaperone-related behavior rather than membrane activity.
### Mutant 2: [insert mutation]
This mutation was also selected from the **soluble region** to satisfy the assignment requirement and to provide a second independent candidate affecting the non-transmembrane portion of the protein. Compared with Mutant 1, this substitution may represent a different design logic, such as improving tolerance at a flexible site or reducing disruption at a position relevant to folding. Including more than one soluble-region candidate increases the diversity of the final proposal.
### Mutant 3: [insert mutation]
This mutation was selected from the **transmembrane region** because it is a plausible candidate for altering membrane insertion, helix packing, oligomerization, or lysis-related membrane activity. Since the C-terminal portion of the L protein is transmembrane, mutations in this region were interpreted in a different biological context than soluble-region substitutions. This candidate was prioritized because it appears compatible with preserving or improving membrane-associated function.
### Mutant 4: [insert mutation]
This is the second **transmembrane-domain candidate** in the final set. It was chosen to ensure that the final proposal includes more than one plausible way to alter the membrane-associated properties of the L protein. As with Mutant 3, this substitution was selected based on a combination of computational support, experimental tolerability, and regional biological interpretation.
### Mutant 5: [insert mutation or combined design]
The fifth mutant was used as a flexible design choice. Depending on the final ranking of candidates, this position can be filled by either:
- the strongest remaining single mutation, or
- a combined design built from individually favorable substitutions.
I included this final slot to preserve some design flexibility while still keeping the overall proposal rational and biologically interpretable. If used as a combined variant, the goal would be to test whether individually tolerated substitutions might produce an additive or complementary effect.
### Summary of design logic
Overall, these five proposed mutants were selected to balance:
- the structural requirements of the assignment,
- the computational mutational scoring results,
- the experimental lysis data,
- and the biological differences between the soluble and transmembrane regions of the protein.
Rather than treating the full sequence as a uniform mutational landscape, I used a region-aware strategy to generate a final set of candidates that are diverse, interpretable, and suitable for future experimental testing.