Week 5 HW: Protein Design Part II

## Part 1: Generate Binders with PepMLM

For this part, I first retrieved the human SOD1 sequence from UniProt (P00441) and then introduced the A4V mutation, which is a well-known ALS-associated substitution in superoxide dismutase 1. The canonical human SOD1 sequence is:

MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

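To generate the mutant form, I introduced the A4V substitution. SOD1 variant names conventionally omit the initiator methionine, so Ala4 is the fifth residue of the UniProt sequence (MATKA becomes MATKV). This yields the following sequence: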

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
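
The substitution step can be sketched in a few lines of Python. This is only an illustration of the numbering convention, not the procedure used in the notebook; the helper name `apply_mutation` and its `met_offset` parameter are assumptions introduced here.

```python
# Canonical human SOD1 (UniProt P00441), including the initiator Met.
SOD1_WT = (
    "MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHE"
    "FGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSI"
    "EDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVI"
    "GIAQ"
)


def apply_mutation(seq, mutation, met_offset=1):
    """Apply a substitution such as 'A4V' to a sequence.

    met_offset=1 assumes the mutation is numbered without the initiator
    Met while `seq` still contains it (the usual SOD1 convention).
    """
    wild_type, position, new = mutation[0], int(mutation[1:-1]), mutation[-1]
    index = position - 1 + met_offset  # convert to a 0-based index
    if seq[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {seq[index]}"
        )
    return seq[:index] + new + seq[index + 1:]


SOD1_A4V = apply_mutation(SOD1_WT, "A4V")
print(SOD1_A4V[:10])
```

The wild-type check guards against off-by-one errors: if the offset were wrong, the function would find K rather than A at the target position and raise an error instead of silently mutating the wrong residue.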


I then used the PepMLM Colab notebook linked from the HuggingFace model card to generate peptide binders conditioned on this mutant SOD1 sequence.

### Note on peptide length

The assignment requested four peptides of length 12 amino acids. However, after repeatedly adjusting the peptide length setting in the public PepMLM notebook, the model consistently returned 15-mer peptides. Because I wanted to preserve the actual model output rather than manually trimming the sequences and introducing an artificial modification, I proceeded using the peptides exactly as generated by the notebook.

### PepMLM-generated binders

The model returned the following four candidate binders:

| Binder | Sequence | Length | Pseudo perplexity |
|---|---|---:|---:|
| P1 | `SHWPVYVVRKAWRAX` | 15 | 17.62794512 |
| P2 | `ARVPELTARVELKKX` | 15 | 16.37907539 |
| P3 | `SRWGVYVGRVEWRRA` | 15 | 16.19368433 |
| P4 | `WRVGPVAAVYEWAKK` | 15 | 11.62216745 |

For comparison, I also added the known SOD1-binding peptide provided in the assignment:

| Binder | Sequence | Length |
|---|---|---:|
| Known binder | `FLYRWLPSRRGG` | 12 |

### Interpretation of PepMLM output

To evaluate the PepMLM outputs, I used the reported pseudo perplexity values as a measure of the model’s internal confidence. Lower pseudo perplexity indicates that the peptide is more plausible according to the model in the context of the target sequence.
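
For a masked language model, pseudo perplexity is commonly defined as the exponential of the negative mean log-probability the model assigns to each residue when that residue is masked. The sketch below shows only that arithmetic. It assumes the per-residue log-probabilities are already available, the function name `pseudo_perplexity` is hypothetical, and the notebook's own implementation may differ in detail.

```python
import math


def pseudo_perplexity(log_probs):
    """Return exp(-mean(log p_i)) for a list of per-residue log-probabilities."""
    if not log_probs:
        raise ValueError("need at least one log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


# Illustrative input only: a model that is completely uncertain and assigns
# probability 1/20 to every residue of a 12-mer.
uniform_guess = [math.log(1 / 20)] * 12
print(pseudo_perplexity(uniform_guess))
```

A model that guesses uniformly among the 20 amino acids therefore scores about 20, which gives a rough reference point for the values in the table: P4 sits clearly below that level, while P1 to P3 are only modestly better than an uninformed guess.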

Based on this metric, P4 (WRVGPVAAVYEWAKK) was the strongest PepMLM candidate, with the lowest pseudo perplexity value (11.62216745). The next best fully specified peptide was P3 (SRWGVYVGRVEWRRA) with a pseudo perplexity of 16.19368433.

Two peptides, P1 and P2, contained an X residue, which indicates an ambiguous or unresolved amino acid identity. Because of that ambiguity, those two sequences are less reliable for downstream structural interpretation and comparison. For that reason, I prioritized P3 and P4 for the AlphaFold3 analysis.
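
The triage described above (drop sequences with non-standard residues, then rank by pseudo perplexity) can be written down explicitly. This is a minimal sketch using the values from the table; the variable and function names are illustrative.

```python
# (binder, sequence, pseudo perplexity) as reported by the notebook.
CANDIDATES = [
    ("P1", "SHWPVYVVRKAWRAX", 17.62794512),
    ("P2", "ARVPELTARVELKKX", 16.37907539),
    ("P3", "SRWGVYVGRVEWRRA", 16.19368433),
    ("P4", "WRVGPVAAVYEWAKK", 11.62216745),
]

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def is_fully_specified(sequence):
    """True if every residue is one of the 20 standard amino acids."""
    return set(sequence) <= STANDARD_RESIDUES


# Keep only unambiguous sequences, lowest pseudo perplexity first.
resolved = sorted(
    (c for c in CANDIDATES if is_fully_specified(c[1])),
    key=lambda c: c[2],
)

for name, sequence, ppl in resolved:
    print(f"{name}\t{sequence}\t{ppl:.2f}")
```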

Overall, this step produced a small set of candidate binders ranked by PepMLM confidence, with P4 emerging as the most promising candidate according to the model and P3 as the next most interpretable option.

## Part 2: Evaluate Binders with AlphaFold3

To assess whether the generated peptides formed plausible structural complexes with mutant SOD1, I used the AlphaFold Server to model protein-peptide complexes. For each run, I submitted the A4V SOD1 sequence as one chain and the peptide sequence as a separate second chain. I then examined both the ipTM score and the predicted position of the peptide on the SOD1 structure.

Because P1 and P2 contained ambiguous residues (X), I focused the structural analysis on the two fully specified PepMLM-generated peptides, P3 and P4, and compared them against the known binder.

### AlphaFold3 results

| Binder | Sequence | ipTM | Putative binding site | Notes |
|---|---|---:|---|---|
| P3 | `SRWGVYVGRVEWRRA` | 0.37 | Surface of the β-barrel region | Surface-bound and elongated; not clearly localized near the N-terminal A4V region |
| P4 | `WRVGPVAAVYEWAKK` | 0.36 | Lateral surface of the β-barrel region | Surface-bound, no clear burial, and not strongly focused near the A4V site |
| Known binder | `FLYRWLPSRRGG` | 0.37 | External surface of the β-barrel region | Surface-bound and extended; does not appear deeply buried or strongly concentrated at the N-terminus |

### Structural interpretation

The AlphaFold3 predictions gave very similar ipTM values for all three tested complexes. Peptide P3 and the known binder both produced an ipTM of 0.37, while P4 gave a slightly lower ipTM of 0.36. This indicates that none of the complexes stood out as having a dramatically stronger or more confident interface than the others.
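
To put these numbers in context, the sketch below bins each ipTM value using the rule-of-thumb cutoffs given in the AlphaFold Server guidance (above 0.8 is treated as confident, below 0.6 as a likely failed prediction, with a grey zone in between). The cutoffs are passed as parameters because they are guidance rather than hard limits, and the function name `iptm_band` is an assumption made for this example.

```python
# ipTM values from the results table.
IPTM = {"P3": 0.37, "P4": 0.36, "Known binder": 0.37}


def iptm_band(value, low=0.6, high=0.8):
    """Classify an ipTM value with adjustable rule-of-thumb cutoffs."""
    if value > high:
        return "confident"
    if value >= low:
        return "grey zone"
    return "low confidence"


bands = {name: iptm_band(value) for name, value in IPTM.items()}
spread = max(IPTM.values()) - min(IPTM.values())

print(bands)
print(f"ipTM spread across complexes: {spread:.2f}")
```

Read this way, the three complexes are not just similar to each other; all of them fall in the low-confidence band, so the predicted poses are best treated as tentative.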

When I visually inspected the predicted structures, all three peptides appeared to be mostly surface-bound rather than deeply buried into a defined pocket or groove. In each case, the peptide stretched across exposed regions of the SOD1 surface, particularly along areas consistent with the β-barrel exterior. The binding did not appear highly compact or tightly enclosed, which suggests relatively modest interface definition.

A key point from the assignment was to evaluate whether the peptides localized near the N-terminus, where the A4V mutation is located. In these models, none of the peptides showed a strong preference for that region. Instead, the peptides appeared to contact broader exposed surfaces of the protein, rather than specifically clustering around the mutant N-terminal site. Likewise, none of the models clearly suggested a deeply buried interaction or a highly specific approach to the dimer interface.

### Comparison to the known binder

The known binder FLYRWLPSRRGG did not clearly outperform the PepMLM-generated peptides in this AlphaFold3 analysis. In fact, P3 matched the known binder exactly in ipTM (0.37), while P4 was only slightly lower at 0.36. This means that at least one PepMLM-generated peptide reached the same structural confidence score as the reference peptide.

However, the visual models also suggest that these interactions are likely modest and mostly surface-associated, rather than strong, sharply localized interfaces. So while P3 matched the known binder numerically, none of the tested peptides showed an obviously superior structural pose or a clear binding mode centered on the A4V mutation itself.


## Part 3: Evaluate Properties of Generated Peptides in PeptiVerse

Structural confidence alone is not sufficient for therapeutic development, so I next evaluated the PepMLM-generated peptides using **PeptiVerse**. For each peptide, I entered the peptide sequence as the binder and the **A4V mutant SOD1 sequence** as the target. I then collected the following predicted properties:

- binding affinity
- solubility
- hemolysis probability
- net charge at pH 7
- molecular weight

The mutant SOD1 sequence used as the target was:

```text
MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
```

### PeptiVerse results

| Binder | Sequence | AlphaFold3 ipTM | Predicted binding affinity | Solubility | Hemolysis probability | Net charge (pH 7) | Molecular weight (Da) | Overall assessment |
|---|---|---:|---|---|---|---:|---:|---|
| P1 | `SHWPVYVVRKAWRAX` | Not prioritized | Weak binding (6.692) | Soluble (1.000) | Non-hemolytic (0.039) | 2.55 | 1777.1 | Good developability profile, but contains ambiguous residue X |
| P2 | `ARVPELTARVELKKX` | Not prioritized | Weak binding (5.529) | Soluble (1.000) | Non-hemolytic (0.022) | 1.80 | 1692.0 | Lowest hemolysis risk, but weakest affinity and contains ambiguous residue X |
| P3 | `SRWGVYVGRVEWRRA` | 0.37 | Weak binding (6.964) | Soluble (1.000) | Non-hemolytic (0.092) | 2.46 | 1877.1 | Best affinity among the tested peptides and best structural support among resolved sequences |
| P4 | `WRVGPVAAVYEWAKK` | 0.36 | Weak binding (5.856) | Soluble (1.000) | Non-hemolytic (0.032) | 1.76 | 1760.0 | Clean sequence and favorable safety/solubility profile, but weaker predicted binding than P3 |

### Comparison with AlphaFold3

The PeptiVerse analysis showed that structural confidence alone was not sufficient to rank the peptides, but it did help identify the strongest overall candidate. Among the two fully specified peptides that were also evaluated with AlphaFold3, P3 had the highest predicted binding affinity in PeptiVerse (6.964) and an ipTM of 0.37, whereas P4 had a weaker predicted affinity (5.856) and an ipTM of 0.36. The two ipTM values are effectively tied, so the affinity prediction is what actually separates these candidates; the structural score is merely consistent with that ranking. PeptiVerse still classified all four peptides as weak binders, so the differences are relative rather than absolute.

At the same time, all four peptides were predicted to be soluble and non-hemolytic, so none showed an obvious developability red flag. However, P1 and P2 both contain an ambiguous X residue, which makes them less reliable as lead candidates despite their otherwise acceptable PeptiVerse profiles. Overall, P3 provided the best balance between structural support and predicted binding while remaining soluble and non-hemolytic.
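
As a sanity check on the tabulated properties, the molecular weight of a fully specified peptide can be recomputed directly from its sequence. The sketch below sums standard average residue masses and adds one water for the free termini. It is an independent cross-check written for this report, not PeptiVerse's own calculation, and it deliberately rejects sequences containing X.

```python
# Average residue masses in Da (amino acid minus one water).
AVG_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER = 18.0153  # added once for the free N- and C-termini


def molecular_weight(sequence):
    """Average molecular weight of an unmodified linear peptide, in Da."""
    unknown = set(sequence) - set(AVG_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(AVG_RESIDUE_MASS[aa] for aa in sequence) + WATER


for name, seq in [("P3", "SRWGVYVGRVEWRRA"), ("P4", "WRVGPVAAVYEWAKK")]:
    print(f"{name}: {molecular_weight(seq):.1f} Da")
```

For P3 and P4 this agrees with the PeptiVerse values in the table to within rounding, which suggests the sequences were entered correctly.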

### Peptide selected for advancement

I would advance P3 (SRWGVYVGRVEWRRA) because it showed the strongest overall combination of properties among the interpretable candidates. It matched the known binder in AlphaFold3 ipTM (0.37), gave the highest predicted binding affinity in PeptiVerse (6.964), and was still predicted to be soluble and non-hemolytic. Although its interaction with SOD1 still appeared mostly surface-bound rather than deeply buried, it showed the best overall compromise between predicted binding and therapeutic properties, making it the most reasonable peptide to prioritize for the next design or validation step.

## Part 0 — Assignment Overview and Objective

For this week, my main task is **Part C: Final Project: L-Protein Mutants**, which is the required section for committed listeners. The goal of this assignment is to improve the **stability** and **auto-folding** of the **MS2 phage lysis protein (L protein)**. This is biologically relevant because the L protein is essential for phage-mediated killing of *E. coli*, and bacterial resistance can emerge if the host alters the factors required for proper L-protein function.

In the MS2 system, the L protein is thought to contribute to bacterial lysis through membrane-associated activity. However, correct processing of the L protein depends on the bacterial chaperone **DnaJ**. If *E. coli* acquires a mutation in DnaJ that disrupts this interaction, the phage may lose infectivity. Therefore, the central design challenge is to propose L-protein mutants that may improve folding, reduce dependence on DnaJ, increase expression, or enhance lysis activity.

The assignment asks us to use a **mutational scoring notebook**, compare those computational predictions with **experimental mutational data**, and then propose **five mutations** supported by a clear rationale. In addition, at least **two proposed variants must contain mutations in the soluble region** and **two must contain mutations in the transmembrane region**.

Overall, I interpret this homework as a **rational mutagenesis exercise** combining computational prediction, prior experimental data, and biological reasoning. The final result is not proof that the mutants will work experimentally, but rather a justified proposal of promising L-protein variants for future testing.

---

## Part 1 — Understanding the L Protein Sequence and Defining Its Regions

The L-protein sequence provided in the homework is:

`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`

The full sequence is **75 amino acids** long. According to the homework notes, the **last 35 residues correspond to the transmembrane region**, while the N-terminal portion corresponds to the **soluble domain** involved in interaction with **DnaJ**.

Based on that definition, the sequence can be divided as follows:

- **Soluble region:** residues **1–40**
- **Transmembrane region:** residues **41–75**

This division is important because the final mutant proposal must include candidates from both structural and functional regions of the protein.
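
The split can be reproduced directly from the sequence. This minimal sketch takes the "last 35 residues" rule from the homework notes as given; the names `TM_LENGTH` and `region_of` are introduced here for illustration.

```python
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)
TM_LENGTH = 35  # per the homework notes: last 35 residues are transmembrane

soluble = L_PROTEIN[:-TM_LENGTH]
transmembrane = L_PROTEIN[-TM_LENGTH:]
tm_start = len(soluble) + 1  # first transmembrane position, 1-based


def region_of(position):
    """Return the region name for a 1-based residue position."""
    if not 1 <= position <= len(L_PROTEIN):
        raise ValueError(f"position {position} is outside the sequence")
    return "soluble" if position < tm_start else "transmembrane"


print(len(L_PROTEIN), len(soluble), len(transmembrane), tm_start)
```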

### Region map

| Position range | Sequence segment | Region |
|---|---|---|
| 1–40 | `METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV` | Soluble N-terminal domain |
| 41–75 | `LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT` | Transmembrane domain |

At this stage, this region map serves as the framework for all subsequent analysis. Mutations in the **soluble domain** are more likely to affect folding and interaction with DnaJ, whereas mutations in the **transmembrane region** are more likely to affect membrane insertion, oligomerization, or lysis-related activity.

---

## Part 2 — Understanding the Mutational Scoring Step

After defining the soluble and transmembrane regions of the MS2 L protein, the next step is to understand the role of the **mutational scoring notebook** provided in the homework.

The purpose of this notebook is to assign a computational score to possible amino acid substitutions in the L-protein sequence. These scores are not direct measurements of biological activity. Instead, they are **predictive estimates** that help identify mutations that may be more favorable, better tolerated, or less disruptive.

This means the notebook should be used as a **prioritization tool**, not as final proof that a mutation improves the system. A favorable score does not guarantee improved lysis, correct folding, or DnaJ independence. Likewise, an unfavorable score does not prove that a mutation is impossible. The computational output is useful because it helps narrow the sequence space and identify candidate substitutions worth comparing with experimental evidence.

### Why this step matters

The number of possible amino acid substitutions across the full L-protein sequence is large, even for a small protein. Without a scoring step, mutant selection would be largely arbitrary. The notebook provides a rational first filter that makes the downstream design process more systematic.
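
To make "large" concrete, the sketch below enumerates every single substitution in the 75-residue sequence and also counts the possible double mutants. It is a counting exercise only and does not reproduce anything the scoring notebook does internally.

```python
import math

L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(sequence):
    """Yield every single substitution as a string such as 'R18K'."""
    for position, wild_type in enumerate(sequence, start=1):
        for new in AMINO_ACIDS:
            if new != wild_type:
                yield f"{wild_type}{position}{new}"


singles = list(single_substitutions(L_PROTEIN))
# Choose 2 of the positions, then 19 alternatives at each of them.
n_doubles = math.comb(len(L_PROTEIN), 2) * 19 ** 2

print(len(singles), n_doubles)
```

Single substitutions alone number 75 × 19 = 1,425, and pairs already exceed one million, which is why a scoring filter is needed before any manual reasoning.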

### What I want to extract from this step

From the mutational scoring output, I aim to identify:

1. positions that appear mutationally tolerant,
2. substitutions that seem favorable,
3. whether those substitutions fall in the soluble or transmembrane region,
4. and which candidates are worth carrying forward into comparison with the experimental dataset.

At this stage, I am not yet choosing the final five mutants. I am only generating a preliminary candidate list.

---

## Part 3 — Using Experimental Mutational Data to Evaluate the Computational Scores

After obtaining the computational mutational scores, the next essential step is to compare them with the available **experimental mutational data** for the MS2 L protein.

This comparison is important because the notebook only provides a **computational estimate** of how favorable or unfavorable each amino acid substitution might be. In contrast, the experimental dataset reflects what was actually observed in the lab. Since the main functional interest of this project is improved lysis-protein performance, the experimental effects on lysis are more directly relevant than sequence-model predictions alone.

I see this comparison as serving two main purposes.

First, it helps evaluate how informative the computational scoring approach is for this particular protein. If experimentally favorable mutations also tend to receive favorable computational scores, then the notebook is capturing useful information. If the agreement is weak, then the scores should be interpreted more cautiously.

Second, this step helps prioritize candidates for the final design proposal. Mutations that look favorable in the experimental dataset, the computational scores, or ideally both, become stronger candidates for the final set of proposed variants.
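
The first purpose, checking how well the notebook scores agree with the measured effects, can be sketched as a rank correlation. The implementation below is a plain-Python Spearman coefficient with average ranks for ties. The two input lists are placeholder numbers chosen only to exercise the function; they are not values from the homework dataset.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation between two equal-length lists."""
    return pearson(average_ranks(x), average_ranks(y))


# PLACEHOLDER values for illustration only (not real scores or measurements).
model_scores = [-2.1, -0.4, 0.3, 1.2, -1.0]
lysis_effects = [-1.5, 0.4, -0.2, 0.9, -0.8]
print(f"Spearman rho = {spearman(model_scores, lysis_effects):.2f}")
```

A rank-based measure is used here because the notebook scores and the experimental readout are on different scales, so only their ordering is directly comparable.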

### Questions I will use to filter candidate mutations

For each mutation, I want to ask:

- Does the mutation have a favorable or at least non-disruptive experimental effect?
- Does the notebook assign it a favorable computational score?
- Is the mutation located in the soluble or transmembrane region?
- Is the site likely to be too conserved to mutate safely?

This comparison is the bridge between raw prediction and rational design. It allows me to move from a large set of possible substitutions to a smaller and more biologically plausible group of candidate mutants.

---

## Part 4 — Generate Optimized Peptides with moPPIt

After evaluating the PepMLM-generated peptides with AlphaFold3 and PeptiVerse, I used **moPPIt-v3** to perform a more controlled peptide design step. Unlike PepMLM, which samples plausible binders conditioned on the full target protein sequence, moPPIt allows multi-objective peptide generation with explicit optimization objectives such as affinity, motif binding, specificity, solubility, and hemolysis.

For this step, I used the **A4V mutant SOD1 sequence** as the target protein. Since the A4V mutation is located near the N-terminus, I selected **SOD1 residues 1–10** as the motif/specificity target region. The goal was to generate 12-amino-acid peptides that could preferentially interact with the N-terminal region of mutant SOD1 while maintaining favorable therapeutic properties.

### moPPIt generation setup

| Parameter | Value |
|---|---|
| Generation mode | De novo generation |
| Target protein | A4V mutant human SOD1 |
| Target region / motif positions | Residues 1–10 |
| Binder length | 12 amino acids |
| Number of samples | 10 |
| Objectives | Hemolysis, Solubility, Affinity, Motif, Specificity |

The generated peptides were saved as `moppit_samples.csv`.

### moPPIt-generated peptides

| Index | Peptide | Hemolysis | Solubility | Affinity | Motif | Specificity |
|---:|---|---:|---:|---:|---:|---:|
| 0 | `CTERQNVGVQQW` | 0.028 | 1.000 | 6.295 | 0.797 | 0.719 |
| 1 | `SCAPVQPESVYH` | 0.073 | 1.000 | 6.011 | 0.562 | 0.756 |
| 2 | `KSEPFVPECHTT` | 0.049 | 1.000 | 5.955 | 0.474 | 0.863 |
| 3 | `MIAGIYNQQKQK` | 0.035 | 0.995 | 5.467 | 0.801 | 0.675 |
| 4 | `QNPCGGLQKNFF` | 0.061 | 1.000 | 5.928 | 0.840 | 0.775 |
| 5 | `ARRTRMARRQRW` | 0.007 | 0.998 | 6.420 | 0.033 | 0.969 |
| 6 | `GYTGQFGACPFC` | 0.022 | 1.000 | 6.711 | 0.849 | 0.700 |
| 7 | `QTCGQGDGIFWI` | 0.032 | 0.995 | 6.378 | 0.733 | 0.612 |
| 8 | `PKPPRPPAHYCF` | 0.016 | 1.000 | 6.571 | 0.552 | 0.837 |
| 9 | `FAEYNPCNPPTL` | 0.054 | 1.000 | 6.031 | 0.758 | 0.800 |

The full moPPIt output was saved as [`moppit_samples.csv`](moppit_samples.csv).

### Interpretation of moPPIt results

The moPPIt-generated peptides differed from the PepMLM-generated peptides in two important ways.

First, moPPIt generated peptides of the intended length, **12 amino acids**, whereas the PepMLM notebook repeatedly returned 15-mer peptides in my run. Second, moPPIt allowed explicit optimization toward the N-terminal region of SOD1, whereas PepMLM generated binders conditioned on the overall target sequence without direct residue-level targeting.

Among the moPPIt candidates, I would prioritize **GYTGQFGACPFC**. This peptide showed the highest predicted affinity score among the generated candidates (**6.711**), strong motif binding (**0.849**), excellent solubility (**1.000**), and low predicted hemolysis (**0.022**). This makes it the best-balanced candidate from the moPPIt output.

A second strong candidate is **PKPPRPPAHYCF**, which showed high predicted affinity (**6.571**), low hemolysis (**0.016**), high solubility (**1.000**), and good specificity (**0.837**), although its motif score was lower than that of **GYTGQFGACPFC**.

The peptide **ARRTRMARRQRW** had high predicted affinity and specificity and the lowest hemolysis score, but it had a very low motif score (**0.033**) and is highly enriched in arginine. I would therefore treat it cautiously, because strongly cationic peptides can sometimes show nonspecific interactions, aggregation, or membrane-associated effects that may not translate into selective binding.
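
The prioritization above can be reproduced from the table with a simple filter-then-rank pass. The scores are copied from the moPPIt output, but the cutoffs (`min_motif`, `max_hemolysis`, `min_solubility`) are assumptions chosen for this sketch rather than thresholds defined by moPPIt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    peptide: str
    hemolysis: float
    solubility: float
    affinity: float
    motif: float
    specificity: float


SAMPLES = [
    Sample("CTERQNVGVQQW", 0.028, 1.000, 6.295, 0.797, 0.719),
    Sample("SCAPVQPESVYH", 0.073, 1.000, 6.011, 0.562, 0.756),
    Sample("KSEPFVPECHTT", 0.049, 1.000, 5.955, 0.474, 0.863),
    Sample("MIAGIYNQQKQK", 0.035, 0.995, 5.467, 0.801, 0.675),
    Sample("QNPCGGLQKNFF", 0.061, 1.000, 5.928, 0.840, 0.775),
    Sample("ARRTRMARRQRW", 0.007, 0.998, 6.420, 0.033, 0.969),
    Sample("GYTGQFGACPFC", 0.022, 1.000, 6.711, 0.849, 0.700),
    Sample("QTCGQGDGIFWI", 0.032, 0.995, 6.378, 0.733, 0.612),
    Sample("PKPPRPPAHYCF", 0.016, 1.000, 6.571, 0.552, 0.837),
    Sample("FAEYNPCNPPTL", 0.054, 1.000, 6.031, 0.758, 0.800),
]


def passes(sample, min_motif=0.5, max_hemolysis=0.1, min_solubility=0.9):
    """Apply the (assumed) minimum requirements before ranking."""
    return (
        sample.motif >= min_motif
        and sample.hemolysis <= max_hemolysis
        and sample.solubility >= min_solubility
    )


def arginine_fraction(peptide):
    """Fraction of residues that are arginine, as a crude cationic flag."""
    return peptide.count("R") / len(peptide)


ranked = sorted(
    (s for s in SAMPLES if passes(s)),
    key=lambda s: s.affinity,
    reverse=True,
)

for s in ranked[:3]:
    print(s.peptide, s.affinity, s.motif)
```

With these cutoffs the top two entries are the same two peptides selected above, and `ARRTRMARRQRW` is excluded by its motif score before its arginine content even needs to be considered.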

### Comparison with PepMLM peptides

Compared with the PepMLM-generated candidates, the moPPIt peptides were more directly aligned with the design goal of targeting the A4V-proximal N-terminal region of SOD1. PepMLM was useful for broad target-conditioned sampling, while moPPIt provided a more controlled multi-objective design strategy.

My best PepMLM candidate was **P3 (`SRWGVYVGRVEWRRA`)**, based on its AlphaFold3 ipTM value and PeptiVerse profile. However, the best moPPIt candidate, **GYTGQFGACPFC**, is shorter, has no ambiguous residues, was generated with explicit motif guidance toward residues 1–10, and showed a strong combined therapeutic-property profile.

Therefore, I would advance both candidates into the next evaluation round:

1. **PepMLM candidate:** `SRWGVYVGRVEWRRA`
2. **moPPIt candidate:** `GYTGQFGACPFC`

These would then be compared using AlphaFold3 or AlphaFold-Multimer, interface analysis, peptide developability assessment, and eventually experimental binding or aggregation assays.

### How I would evaluate moPPIt peptides before therapeutic advancement

Before advancing any peptide toward therapeutic development, I would perform several additional validation steps:

1. **Structural modeling:** Use AlphaFold3 or AlphaFold-Multimer to model the SOD1 A4V–peptide complex and verify whether the peptide binds near the N-terminal A4V region.
2. **Interface analysis:** Inspect whether the peptide forms a compact and plausible interface rather than a diffuse surface contact.
3. **Specificity testing:** Compare predicted binding against wild-type SOD1 and unrelated proteins to evaluate selectivity.
4. **Developability filtering:** Re-evaluate solubility, hemolysis, aggregation risk, net charge, and proteolytic stability.
5. **Experimental validation:** Test binding experimentally using biophysical methods such as fluorescence polarization, SPR, ITC, or pull-down assays.
6. **Functional assays:** Test whether the peptide reduces SOD1 aggregation, toxicity, or misfolding in relevant in vitro or cellular models.

Overall, moPPIt provided a useful second design layer by moving from target-conditioned sampling to multi-objective, motif-directed peptide optimization.

## Part 4 — Comparing Computational Scores with Experimental Mutational Data

To move from general prediction to actual mutant selection, I next compared the **computational mutational scores** from the notebook with the available **experimental mutational data** for the MS2 L protein. This step is explicitly required in the assignment and is important because the notebook only predicts whether a mutation may be favorable, while the experimental dataset reports how specific L-protein mutants affected lysis in the lab.

The main goal of this comparison is to determine whether the computational scores are actually informative for this protein. If mutations with favorable experimental effects also tend to receive favorable notebook scores, then the language-model-based scoring method is likely capturing meaningful constraints in the L-protein sequence. If the agreement is weak, then the scores should be treated more cautiously and used only as one supporting source of evidence rather than the main basis for mutant selection.

At this stage, I used the comparison as a filtering step. Instead of selecting mutations directly from the full sequence, I prioritized candidates by asking whether each mutation met one or more of the following criteria:

1. it showed a favorable or at least non-disruptive effect in the experimental lysis dataset,
2. it received a positive or relatively favorable score in the computational notebook,
3. it was located in the appropriate region of the protein for the final assignment requirements,
4. and it was not obviously at a highly conserved position that might be risky to mutate.

This approach is consistent with the recommendation in the homework, which suggests looking for positions and mutations with either a positive experimental effect or a positive score and then using combinations of those mutations to design candidate variants.

Because the L protein contains both a **soluble N-terminal domain** and a **transmembrane region**, I also considered the structural context of each mutation during this comparison. Mutations in the soluble domain are more likely to affect folding or interaction with DnaJ, whereas mutations in the transmembrane region are more likely to affect membrane-associated lysis activity. Therefore, I did not interpret all favorable scores in the same way; instead, I evaluated them in the context of where the residue is located in the protein.

At the end of this comparison step, the outcome is not yet a final mutant list, but rather a **shortlist of plausible candidates**. These candidates can then be narrowed down further using conservation analysis and biological reasoning before proposing the final five mutations required for submission.


## Part 5 — Building a Shortlist of Candidate Mutations

After comparing the computational mutational scores with the available experimental mutational data, the next step is to build a **shortlist of candidate mutations** for the final design proposal.

At this stage, the goal is not yet to define the final five mutants, but rather to identify a smaller group of substitutions that appear promising enough to consider further. I approached this as a filtering problem: starting from many possible substitutions across the full L-protein sequence, I narrowed the list by combining computational, experimental, and biological criteria.

### Candidate selection criteria

I considered a mutation to be a strong candidate when it met one or more of the following conditions:

1. it showed a favorable or non-disruptive effect in the experimental lysis dataset,
2. it received a favorable computational score in the mutational scoring notebook,
3. it occurred at a residue that was not obviously too conserved to mutate safely,
4. and it fit one of the two required structural regions of the protein:
   - the **soluble N-terminal domain**
   - the **transmembrane domain**

This filtering strategy is important because not all favorable-looking mutations should be treated equally. A mutation with a strong score but poor experimental support is less convincing than one supported by both sources. Similarly, a mutation at a highly conserved position may be riskier even if the score looks favorable.
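
One way to make this filter explicit is to count how many independent lines of evidence support each mutation and then group the survivors by region. The sketch below does that. The scoring rules (a notebook score above zero counts as favorable, a non-negative lysis effect counts as non-disruptive) are assumptions, and every numeric value in `EXAMPLES` is a placeholder rather than real data.

```python
from dataclasses import dataclass
from typing import Optional

TM_START = 41  # first transmembrane residue (last 35 of 75)


@dataclass
class Candidate:
    mutation: str                  # e.g. "R18K"
    model_score: float             # notebook score; > 0 assumed favorable
    lysis_effect: Optional[float]  # experimental effect; None if unmeasured
    conserved: bool                # True if the site looks strongly conserved


def evidence_points(c):
    """Count supporting criteria (0 to 3) for one candidate."""
    points = 0
    if c.lysis_effect is not None and c.lysis_effect >= 0:
        points += 1
    if c.model_score > 0:
        points += 1
    if not c.conserved:
        points += 1
    return points


def region(c):
    position = int(c.mutation[1:-1])
    return "soluble" if position < TM_START else "transmembrane"


def shortlist(candidates, min_points=2):
    """Group candidates with enough support by region, best first."""
    grouped = {"soluble": [], "transmembrane": []}
    for c in sorted(candidates, key=evidence_points, reverse=True):
        if evidence_points(c) >= min_points:
            grouped[region(c)].append(c.mutation)
    return grouped


# PLACEHOLDER records: the numbers are invented to exercise the code.
EXAMPLES = [
    Candidate("R18K", 0.4, 0.1, False),
    Candidate("C29S", 0.2, None, False),
    Candidate("L44P", -2.0, -1.0, True),
]
print(shortlist(EXAMPLES))
```

Treating an unmeasured mutation as "no experimental support" rather than as a failure keeps computationally favorable but untested substitutions on the list while still ranking doubly supported ones higher.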

### Separating candidates by region

Because the assignment requires mutations from both major regions of the L protein, I separated candidate mutations into two categories:

- **soluble-domain candidates** (residues 1–40)
- **transmembrane-domain candidates** (residues 41–75)

This regional classification is biologically meaningful. Mutations in the soluble domain are more likely to affect folding, expression, or interaction with DnaJ, while mutations in the transmembrane domain are more likely to affect membrane insertion, oligomerization, or lysis-related activity.

By separating candidates this way, I can make sure that my final mutant proposal satisfies the homework requirements while also reflecting the different functional roles of the two parts of the protein.

### Why a shortlist is necessary

A shortlist is useful because the final design step should be based on a manageable set of plausible candidates rather than the full mutational landscape. It creates a structured transition from broad screening to focused design.

At the end of this step, I expect to have:

- a set of promising **soluble-domain mutations**,
- a set of promising **transmembrane-domain mutations**,
- and enough information to begin assembling the **final five proposed mutants** for submission.

### Interim conclusion

This shortlist-building step is the practical outcome of the earlier analysis. It converts general computational and experimental evidence into a focused pool of candidate mutations that can be used in the final rational design proposal.

## Part 6 — Strategy for Selecting the Final Five Mutants

After building a shortlist of candidate mutations, the next step is to define a clear strategy for selecting the **final five mutants** required for the assignment.

The homework does not simply ask for five random substitutions. Instead, it asks for a rationally chosen set of mutations supported by computational scoring, experimental evidence, and biological interpretation. For that reason, my selection strategy is based on combining multiple types of evidence rather than relying on a single ranking metric.

### Overall selection strategy

My goal is to choose five mutations that together satisfy both the **assignment constraints** and the **biological design goals** of the project.

To do this, I plan to:

1. select at least **two mutations in the soluble region**,
2. select at least **two mutations in the transmembrane region**,
3. and use the fifth mutation as either:
   - an additional strong individual candidate, or
   - part of a combined design if there is a good biological reason to combine favorable substitutions.

This ensures that the final design is balanced across both major functional regions of the protein.
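
These constraints are easy to check mechanically before submission. The sketch below validates that each mutation names the correct wild-type residue and that the set meets the two-plus-two requirement. The function name `check_design` is illustrative, and the two transmembrane entries in the usage example (`L44I`, `A45L`) are placeholders used only to exercise the check, not proposed mutants.

```python
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)
TM_START = 41  # first transmembrane residue (last 35 of 75)


def parse_mutation(mutation):
    """Split 'R18K' into ('R', 18, 'K')."""
    return mutation[0], int(mutation[1:-1]), mutation[-1]


def check_design(mutations, sequence=L_PROTEIN, total=5, per_region=2):
    """Return (meets_constraints, counts_by_region) for a list of mutations."""
    counts = {"soluble": 0, "transmembrane": 0}
    for mutation in mutations:
        wild_type, position, _new = parse_mutation(mutation)
        if sequence[position - 1] != wild_type:
            raise ValueError(
                f"{mutation}: residue {position} is "
                f"{sequence[position - 1]}, not {wild_type}"
            )
        key = "soluble" if position < TM_START else "transmembrane"
        counts[key] += 1
    ok = (
        len(mutations) == total
        and counts["soluble"] >= per_region
        and counts["transmembrane"] >= per_region
    )
    return ok, counts


# L44I and A45L are PLACEHOLDERS to exercise the check, not design choices.
print(check_design(["R18K", "R19K", "C29S", "L44I", "A45L"]))
```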

### What makes a mutation strong enough for final selection

A mutation is more likely to be chosen for the final set if it meets several of the following conditions:

- it has a favorable or non-disruptive experimental effect,
- it has a favorable computational score,
- it occurs at a position that is not strongly constrained,
- it makes biological sense for the region where it occurs,
- and it contributes to a diverse final set rather than repeating the same logic multiple times.

This last point is important. I do not want all five mutations to reflect the exact same design idea. A stronger final proposal includes candidates that test different but plausible hypotheses about how L-protein performance might be improved.

### Region-specific reasoning

For **soluble-domain mutations**, I will prioritize candidates that could plausibly improve:
- folding,
- protein stability,
- expression,
- or interaction with DnaJ.

For **transmembrane-domain mutations**, I will prioritize candidates that could plausibly improve:
- membrane insertion,
- helix packing,
- oligomerization,
- or lysis-associated membrane activity.

This means that the same score value may be interpreted differently depending on whether the mutation lies in the soluble or transmembrane part of the protein.

### Why the fifth mutant matters

The fifth mutant gives some flexibility in the design strategy. It can be used in one of two ways.

One option is to choose the **single best remaining candidate** after selecting the required soluble and transmembrane mutations.

Another option is to use it as a **combined or more exploratory design**, for example by combining individually favorable substitutions if there is a reasonable hypothesis that their effects could be compatible or additive.

This makes the fifth choice especially useful because it can strengthen the overall design logic of the final proposal.

### Interim conclusion

At the end of this step, I should be ready to move from a broad shortlist to a final set of **five justified mutant designs**. The next stage will therefore be to present those final candidates and explain, for each one, why it was selected and what effect it is expected to have.



## Part 7 — Final Proposed Mutants

## Part C — Final Project: L-Protein Mutants

### Assignment objective

The objective of this part of the homework is to propose mutations in the **MS2 phage lysis protein (L protein)** that could improve its stability, auto-folding, or lysis-related activity.

This is relevant because the MS2 L protein is involved in bacterial lysis during the phage life cycle. The homework describes that the L protein is thought to form oligomers and integrate into the *E. coli* membrane to promote pore formation and cell lysis. It also highlights that proper processing of the L protein depends on the bacterial chaperone **DnaJ**, and that host resistance can emerge when DnaJ mutations impair this interaction.

Therefore, the design goal is to propose L-protein variants that may:

1. improve folding or stability,
2. reduce dependence on DnaJ-mediated processing,
3. preserve or enhance membrane-associated lysis activity,
4. and remain biologically plausible for downstream experimental testing.

Because I did not have a complete final scoring CSV and experimental mutation spreadsheet fully integrated into this documentation, I treated this part as a **rational mutagenesis proposal** based on the assignment-provided sequence, region definitions, biochemical properties, and design constraints. These candidates should be interpreted as hypotheses for future computational and experimental validation, not as experimentally confirmed improvements.

---

### L-protein sequence

The MS2 L-protein sequence provided in the assignment is:

```text
METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT
```

The full sequence is 75 amino acids long.

According to the homework notes, the L protein contains an N-terminal soluble region followed by a C-terminal transmembrane region. The last 35 residues correspond to the transmembrane segment, while the N-terminal portion is associated with DnaJ-related processing.

### Region map

| Region | Position range | Sequence |
| --- | --- | --- |
| Soluble N-terminal region | 1–40 | `METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV` |
| Transmembrane region | 41–75 | `LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT` |

This regional division is important because the assignment asks for mutations in both the soluble and transmembrane parts of the protein.
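The region boundaries can be checked directly against the sequence. The following is a minimal Python sketch: the variable names (`L_PROTEIN`, `TM_LENGTH`) are my own, and the split relies only on the homework's statement that the last 35 residues form the transmembrane segment.

```python
# MS2 L-protein sequence as provided in the assignment.
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)

# Assumption taken from the homework notes: the last 35 residues are transmembrane.
TM_LENGTH = 35

soluble = L_PROTEIN[:-TM_LENGTH]
transmembrane = L_PROTEIN[-TM_LENGTH:]

# Report 1-based position ranges to match the region map.
print(f"Full length: {len(L_PROTEIN)}")
print(f"Soluble 1-{len(soluble)}: {soluble}")
print(f"Transmembrane {len(soluble) + 1}-{len(L_PROTEIN)}: {transmembrane}")
```

Slicing from the C-terminal end keeps the split tied to the one boundary the homework actually specifies, rather than to a hard-coded position.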

### Design strategy

I selected candidate mutations using a region-aware rational design strategy.

For the soluble region, I prioritized mutations that could plausibly affect folding, chaperone dependence, local charge distribution, or chemical stability without completely disrupting the short N-terminal domain.

For the transmembrane region, I prioritized substitutions that preserve or increase membrane-compatible hydrophobicity while avoiding overly disruptive changes to the predicted membrane-associated segment.

The proposed mutations were selected to satisfy the assignment constraint of including:

- at least two mutations in the soluble region, and
- at least two mutations in the transmembrane region.

### Final proposed L-protein mutants

| Mutant | Substitution | Position | Region | Main design rationale |
| --- | --- | --- | --- | --- |
| 1 | `R18K` | 18 | Soluble | Conservative basic substitution; tests whether the N-terminal basic cluster can be altered while preserving positive charge |
| 2 | `R19K` | 19 | Soluble | Similar conservative change in the basic cluster; may modulate DnaJ-related interaction without eliminating charge |
| 3 | `C29S` | 29 | Soluble | Removes a cysteine that could contribute to unwanted oxidation or chemical instability while preserving polarity |
| 4 | `A45V` | 45 | Transmembrane | Conservative hydrophobic substitution that may increase local membrane-helix stability |
| 5 | `S49A` | 49 | Transmembrane | Removes a polar hydroxyl group from the transmembrane segment, potentially increasing hydrophobic compatibility |

This final set includes three soluble-region mutations and two transmembrane-region mutations, satisfying the region requirements while testing distinct design hypotheses.
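To make sure each substitution refers to the residue actually present at that position, the mutant sequences can be generated programmatically. This is a minimal sketch, not part of the assignment materials: the helper `apply_substitution` is a hypothetical name of my own, and it only handles single-letter point substitutions with 1-based numbering.

```python
import re

# MS2 L-protein sequence as provided in the assignment.
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)

def apply_substitution(seq: str, mutation: str) -> str:
    """Apply a point substitution such as 'R18K' (1-based position) to seq."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Cannot parse mutation: {mutation}")
    wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(seq):
        raise ValueError(f"Position {position} is outside the sequence")
    # Guard against numbering mistakes: the stated wild-type residue must match.
    if seq[position - 1] != wild_type:
        raise ValueError(
            f"{mutation}: expected {wild_type} at {position}, found {seq[position - 1]}"
        )
    return seq[: position - 1] + new + seq[position:]

MUTATIONS = ["R18K", "R19K", "C29S", "A45V", "S49A"]
mutants = {name: apply_substitution(L_PROTEIN, name) for name in MUTATIONS}

for name, seq in mutants.items():
    print(f"{name}\t{seq}")
```

The wild-type check is the useful part: an off-by-one error in position numbering raises an exception instead of silently producing the wrong mutant.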

### Mutant 1: `R18K`

R18K replaces arginine with lysine in the soluble N-terminal region.

This is a conservative substitution because both residues are positively charged at physiological pH. However, arginine and lysine differ in side-chain geometry, hydrogen-bonding capacity, and interaction patterns. Arginine has a guanidinium group capable of strong multidentate interactions, while lysine has a more flexible terminal amino group.

Because residues 18–20 form part of a basic cluster in the soluble region, this mutation tests whether the local positive charge can be preserved while subtly changing the interaction surface. This could be relevant if the N-terminal basic region contributes to DnaJ recognition or L-protein processing.

Expected effect:

- preserve net positive charge,
- modestly alter local interaction chemistry,
- avoid a highly disruptive substitution,
- potentially reduce strict dependence on a specific arginine-mediated contact.

### Mutant 2: `R19K`

R19K is another conservative arginine-to-lysine substitution in the same basic N-terminal cluster.

The rationale is similar to R18K, but targeting a neighboring residue allows the experiment to test whether different positions in the basic patch have different sensitivity. If one arginine is more important for folding or chaperone interaction than another, these two mutants may show distinct phenotypes.

Expected effect:

- maintain a basic residue at position 19,
- slightly alter side-chain geometry,
- test sensitivity of the basic cluster,
- potentially preserve folding while modifying DnaJ-associated recognition.

Because this mutation is conservative, it is less likely to catastrophically disrupt the soluble domain than substitutions that remove charge entirely.
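The claim that R18K and R19K preserve charge can be illustrated with a crude residue count. The sketch below is an approximation under stated assumptions: it counts Lys and Arg as +1 and Asp and Glu as −1, and ignores histidine, the termini, and any pKa shifts. The function names are my own.

```python
# Soluble N-terminal region (residues 1-40) of the MS2 L protein.
SOLUBLE_WT = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"

def simple_net_charge(seq: str) -> int:
    """Count (K + R) minus (D + E); ignores His, termini, and pKa effects."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative

def substitute(seq: str, position: int, new: str) -> str:
    """Replace the residue at a 1-based position."""
    return seq[: position - 1] + new + seq[position:]

variants = {
    "WT": SOLUBLE_WT,
    "R18K": substitute(SOLUBLE_WT, 18, "K"),
    "R19K": substitute(SOLUBLE_WT, 19, "K"),
}

for name, seq in variants.items():
    print(f"{name}: net charge {simple_net_charge(seq):+d}")
```

Under this count, the wild-type soluble region and both mutants come out at +5, so any phenotypic difference would have to come from side-chain geometry or hydrogen bonding rather than from total charge.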

### Mutant 3: `C29S`

C29S replaces cysteine with serine in the soluble region.

Cysteine can participate in oxidation chemistry, disulfide formation, or nonspecific reactivity depending on its environment. In a small phage protein, an exposed cysteine could potentially contribute to chemical instability or unwanted interactions. Serine is similar in size and polarity but lacks the thiol group, making it a common conservative replacement when the goal is to reduce cysteine-associated chemical liabilities.

Expected effect:

- reduce thiol-associated chemical instability,
- preserve a small polar side chain,
- potentially improve robustness of the soluble region,
- avoid a large change in side-chain volume.

This mutation is especially useful as a stability-oriented candidate rather than a direct membrane-activity mutation.

### Mutant 4: `A45V`

A45V is located in the transmembrane region and replaces alanine with valine.

Both alanine and valine are hydrophobic residues, but valine has a larger branched side chain. In a transmembrane segment, increasing hydrophobic packing can sometimes stabilize membrane-associated helices or alter helix-helix interactions.

Expected effect:

- preserve hydrophobic character,
- slightly increase side-chain volume,
- potentially improve local membrane-segment stability,
- avoid introducing charge or polarity into the membrane region.

Because this is a conservative hydrophobic substitution, it is a reasonable first transmembrane-region candidate.

### Mutant 5: `S49A`

S49A replaces serine with alanine in the transmembrane region.

Serine contains a polar hydroxyl group, whereas alanine is small and hydrophobic. Since residue 49 lies within the transmembrane region, replacing serine with alanine may increase local hydrophobic compatibility and reduce polar disruption within the membrane-spanning segment.

Expected effect:

- increase hydrophobicity of the transmembrane region,
- potentially improve membrane insertion or helix stability,
- preserve small side-chain size,
- test whether the polar serine at position 49 is required or dispensable.

This mutation is more exploratory than A45V because removing a polar residue could alter interactions or topology. However, it is still a relatively small substitution and therefore a reasonable candidate for testing.
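One way to put a number on the hydrophobicity argument for A45V and S49A is the mean Kyte-Doolittle hydropathy (GRAVY) of the transmembrane segment. The sketch below assumes that this scale is a reasonable proxy for membrane compatibility, which is my assumption and not something the assignment specifies. The helper names are my own.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Transmembrane region (residues 41-75) of the MS2 L protein.
TM_WT = "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
TM_START = 41  # full-length position of the first residue in TM_WT

def gravy(seq: str) -> float:
    """Mean Kyte-Doolittle hydropathy of a sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)

def substitute_tm(seq: str, wild_type: str, position: int, new: str) -> str:
    """Apply a substitution given in full-length numbering to the TM segment."""
    index = position - TM_START
    if seq[index] != wild_type:
        raise ValueError(f"Expected {wild_type} at {position}, found {seq[index]}")
    return seq[:index] + new + seq[index + 1:]

tm_variants = {
    "WT": TM_WT,
    "A45V": substitute_tm(TM_WT, "A", 45, "V"),
    "S49A": substitute_tm(TM_WT, "S", 49, "A"),
}

for name, seq in tm_variants.items():
    print(f"{name}: GRAVY = {gravy(seq):.3f}")
```

On this scale, Ala to Val adds 2.4 hydropathy units and Ser to Ala adds 2.6, so each single substitution shifts the 35-residue average by only about 0.07. That is consistent with treating both as modest changes rather than large shifts in membrane character.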

### Summary of proposed design logic

The five proposed mutations test complementary hypotheses:

| Design hypothesis | Mutations |
| --- | --- |
| Modulate the soluble basic cluster while preserving charge | `R18K`, `R19K` |
| Reduce chemical liability in the soluble region | `C29S` |
| Tune hydrophobic packing in the transmembrane region | `A45V` |
| Increase membrane compatibility by removing a polar side chain | `S49A` |

Together, these mutations explore both the soluble and membrane-associated regions of the L protein. The soluble mutations are aimed at folding, stability, and potential DnaJ-related processing, while the transmembrane mutations are aimed at membrane insertion and lysis-related activity.

### How I would evaluate these mutants experimentally

To determine whether these mutations improve the L protein, I would evaluate them in several steps:

1. **Expression test:** Confirm that each mutant L protein can be expressed.
2. **Stability / folding assessment:** Compare expression level, solubility, and degradation relative to wild-type L protein.
3. **DnaJ-dependence assay:** Test whether the mutant retains activity in conditions where DnaJ interaction is impaired.
4. **Membrane activity assay:** Evaluate whether transmembrane mutants alter membrane localization, pore formation, or lysis timing.
5. **Plaque assay:** Measure whether mutant MS2 phage shows altered infectivity, plaque size, or lysis efficiency.
6. **Combination testing:** If single mutants show beneficial effects, combine compatible mutations such as `R18K`/`C29S` or `A45V`/`S49A` and test whether effects are additive or disruptive.
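For the combination-testing step, the candidate double mutants can be enumerated ahead of time. The following is a minimal sketch with a hypothetical helper name (`apply_all`). It lists every pairwise combination of the five substitutions. Which pairs are actually worth building would depend on the single-mutant results.

```python
from itertools import combinations

# MS2 L-protein sequence as provided in the assignment.
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)

SINGLES = ["R18K", "R19K", "C29S", "A45V", "S49A"]

def apply_all(seq: str, mutations) -> str:
    """Apply several point substitutions (1-based, e.g. 'R18K') to one sequence."""
    residues = list(seq)
    for mutation in mutations:
        wild_type, position, new = mutation[0], int(mutation[1:-1]), mutation[-1]
        # Checking against the current residue also catches two edits at one site.
        if residues[position - 1] != wild_type:
            raise ValueError(f"{mutation}: found {residues[position - 1]} at {position}")
        residues[position - 1] = new
    return "".join(residues)

# All pairwise combinations, keyed like 'R18K/C29S'.
doubles = {
    "/".join(pair): apply_all(L_PROTEIN, pair)
    for pair in combinations(SINGLES, 2)
}

for name, seq in doubles.items():
    print(f"{name}\t{seq}")
```

Five single mutants give ten possible pairs, so a practical screen would likely start with the pairs named above and expand only if their effects look compatible.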

### Limitations

This proposal is based on rational mutagenesis and sequence-region interpretation. It does not prove that the mutants will improve L-protein function.

Important limitations include:

- The L protein is very short, so even small mutations may have large effects.
- Transmembrane proteins are difficult to model accurately with standard folding tools.
- DnaJ dependence may involve transient or context-dependent interactions that are hard to predict from sequence alone.
- Increasing hydrophobicity in the transmembrane region may improve membrane insertion, but it could also increase aggregation or toxicity.
- Conservative mutations may be safer but may produce only subtle phenotypes.
- Full validation requires experimental testing in *E. coli* and MS2 phage systems.

### Final conclusion

For this design round, I would prioritize C29S and A45V as the most balanced first candidates.

C29S is attractive because it may improve chemical stability in the soluble region without dramatically changing size or polarity. A45V is attractive because it is a conservative hydrophobic mutation in the transmembrane region and may improve membrane-segment packing without introducing a disruptive residue.

I would also keep R18K and R19K as useful probes of the N-terminal basic cluster and possible DnaJ-related recognition. Finally, S49A is a more exploratory transmembrane candidate that tests whether increasing hydrophobicity in the membrane segment improves or disrupts lysis-related activity.

Overall, these five mutations provide a rational, region-balanced set of L-protein variants for future computational filtering and experimental testing.
