Week 5 HW: Protein Design Part II

Part 1: Generate Binders with PepMLM


1. Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.

Superoxide Dismutase 1 (SOD1) is a human enzyme that plays a critical role in protecting cells from oxidative stress by catalyzing the conversion of superoxide radicals into oxygen and hydrogen peroxide. It is a small, 154-amino-acid protein that typically forms a stable homodimer and contains a β-βarrel core structure with metal cofactors, copper and zinc, essential for its catalytic activity. Mutations in SOD1, such as the A4V variant, are associated with familial amyotrophic lateral sclerosis (ALS), a neurodegenerative disorder. SOD1 is widely expressed in the cytoplasm and is a key model protein for studying protein folding, aggregation, and targeted protein degradation strategies.


Figure 1. Superoxide Dismutase 1 (SOD1) sequence, fasta file.

Figure 2. Superoxide Dismutase 1 (SOD1) sequence.

Mutation A4V was inserted:

  • Original:
    MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

  • Mutated:
    MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

2. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:


Figure 3. PepMLM Colab.

3. Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.

As the original script generated four peptides containing some invalid characters, I improved the script to produce only valid peptides without those characters. You can access the script cell here:

Additionally, I included a scritp to calculate the pseudo-perplexity of a known peptide.


The .cvs file gives the peptides and the Pseudo Perplexity values:

Pseudo-perplexity measures the model’s confidence in a peptide sequence given the target protein context. Lower values indicate that the peptide is more consistent with patterns of protein–peptide interactions learned during training, while higher values suggest lower confidence. In the table, WRSGAAGAAWWK has the lowest pseudo-perplexity (7.12), indicating the model is most confident in this generated peptide, whereas the known binder FLYRWLPSRRGG has the highest value (20.64), reflecting comparatively lower model confidence. The other generated peptides fall in between, showing moderate confidence according to the model.

4. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.

  • I open the test.csv file in excel, added a line and insert the SOD1-binding peptide FLYRWLPSRRGG. Record the perplexity scores that indicate PepMLM’s confidence in the binders.


Figure 6

PepMLM assigns lower pseudo-perplexity scores to several generated peptides compared with the known SOD1-binding peptide FLYRWLPSRRGG. The peptide WRSGAAGAAWWK shows the lowest perplexity (7.12), indicating that the model predicts it as the most probable binder among the candidates.


Part 2: Evaluate Binders with AlphaFold3


1. Navigate to the AlphaFold Server: alphafoldserver.com

2. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.


Figure 7

Figure 8

3. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?

The predicted binders 1–4 localize near residues 96–102 of the protein, a region that forms one of the parallel β-strands contributing to the β-sheet structure. In contrast, the known binder (binder 5) appears near residues 11–15. None of the binders are located close to the N-terminus or C-terminus of the protein. Additionally, the peptides do not appear tightly attached to the protein surface or inserted into a binding pocket, suggesting that they remain relatively exposed rather than strongly surface-bound or buried.

The predicted interaction confidence, measured by ipTM, is relatively low for all binders, with values ranging from 0.25 to 0.44. Binder 2 shows the highest interaction score (ipTM = 0.44), followed by binder 4 (0.39) and binder 1 (0.34), while binders 3 (0.28) and 5 (0.25) show lower confidence. These values indicate moderate to low confidence in the predicted peptide–protein interactions, suggesting that the binding poses may be weak or uncertain.


Figure 9

4. In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

The predicted complexes show low ipTM values (0.25–0.44), indicating weak confidence in protein–peptide interactions. The PepMLM-generated peptide WRSGAAGAAWWK has the highest ipTM score (0.44), exceeding that of the known binder FLYRWLPSRRGG (0.25), but none of the peptides appear to form a stable interaction with SOD1 in the predicted models.


Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse**


Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:

1. Paste the peptide sequence.

2. Paste the A4V mutant SOD1 sequence in the target field.

  • Check the boxes:
    • Predicted binding affinity
    • Solubility
    • Hemolysis probability
    • Net charge (pH 7)
    • Molecular weight


Figure 10

Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties?

Choose one peptide you would advance and justify your decision briefly.

The predicted binding affinities range from pKd/pKi values of ~5.4 to 6.9, indicating weak to moderate binding overall. The peptide WRSGAAGAAWWK shows both the highest predicted affinity (pKd 6.94) and the highest ipTM value (0.44), suggesting some agreement between the structural prediction from AlphaFold3 and the affinity prediction from Peptiverse. However, most peptides did not appear to form stable interactions with SOD1 in the structural models, remaining mostly surface-proximal rather than clearly bound. All peptides show good predicted solubility (score = 1) and very low hemolysis probabilities (<0.05), indicating favorable therapeutic properties. Among the candidates, WRSGAAGAAWWK (2) best balances predicted binding strength, structural confidence, and safety properties.

See all values from alphaFold web y pepTiVerse per peptide:

BinderPseudo PerplexityipTMpTMSolubilityHemolysisBinding Affinity (pKd/pKi)Length (aa)Molecular Weight (Da)Net Charge (pH 7)Isoelectric Point (pH)Hydrophobicity (GRAVY)
WRYGAAAVEHKK12.176569880.340.7510.0215.436121415.61.859.7-1
WRSGAAGAAWWK7.116553950.440.8410.0326.939121346.51.7611-0.46
WRYYAAGLAWKK14.146826340.280.7310.0246.735121512.82.7610-0.66
WLYYAAGARHKE18.403888210.390.8710.0295.887121464.60.858.5-0.82
FLYRWLPSRRGG20.635231270.250.8310.0475.968121507.72.7611.71-0.71

Notes:

  • ipTM score estimates the confidence of protein–peptide interactions. Values above 0.7 indicate a reliable interaction, values between 0.5 and 0.7 suggest a possible interaction, and values below 0.5 indicate that an interaction is unlikely.
  • Solubility is predicted as a binary value where 0 indicates not soluble and 1 indicates soluble.
  • Hemolysis probability predicts whether a peptide may damage red blood cells; 0 indicates non-hemolytic and 1 indicates hemolytic.
  • Binding affinity (pKd/pKi) reflects binding strength (−log10 of Kd or Ki), where higher values indicate stronger binding. Values <5 indicate very weak binding (>100 µM), 5–6 weak (~100 µM), 6–7 moderate (~10 µM), 7–8 good (~1 µM), and >8 strong binding (<100 nM).

Part 4: Generate Optimized Peptides with moPPIt


Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

1. Open the moPPit Colab linked from the HuggingFace moPPIt model card.

2. Make a copy and switch to a GPU runtime.


Figure 11

3. In the notebook:

  1. Paste your A4V mutant SOD1 sequence.
  2. Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
  3. Set peptide length to 12 amino acids.
  4. Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.

4. After generation, briefly describe how these moPPit peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

moPPIt Generated Peptides

BinderHemolysisSolubilityAffinityMotif
GGGRQKCFTLNM0.93490.756.57640.5096
SKKQKTITCELC0.96770.83337.21260.6561
KTCKKSFEKKQN0.97390.91676.20080.6707
GTTCIQGKKKDE0.97850.91676.37310.5156

The peptides generated with moPPIt differ from the PepMLM peptides because their generation was guided by multiple objectives defined in the notebook, including affinity, motif targeting, solubility, hemolysis, and specificity. In contrast, PepMLM generates peptides based mainly on sequence likelihood conditioned on the target protein sequence. As a result, the moPPIt peptides show moderate predicted affinities (pKd ≈ 6.2–7.2), relatively high motif scores (≈0.51–0.67), and good solubility predictions (0.75–0.92). The PepMLM peptides also show moderate affinities but were not optimized simultaneously for these therapeutic objectives during generation.

However, when one of the generated peptides was modeled with the SOD1 structure using AlphaFold, the peptide did not appear to form a stable interaction with the protein surface. This suggests that although the predicted affinity and motif scores are moderate, the structural predictions do not clearly support strong binding.

Comparison of PepMLM and moPPIt Generated Peptides

MethodBinderHemolysisSolubilityAffinity
PepMLMWRYGAAAVEHKK0.021Yes5.436
PepMLMWRSGAAGAAWWK0.032Yes6.939
PepMLMWRYYAAGLAWKK0.024Yes6.735
PepMLMWLYYAAGARHKE0.029Yes5.887
PepMLMFLYRWLPSRRGG0.047Yes5.968
moPPItGGGRQKCFTLNM0.93490.756.5764
moPPItSKKQKTITCELC0.96770.83337.2126
moPPItKTCKKSFEKKQN0.97390.91676.2008
moPPItGTTCIQGKKKDE0.97850.91676.3731

Before advancing any peptide toward clinical studies, further validation would be necessary. Additional structural modeling and docking analyses should confirm stable binding to the intended region of SOD1. Promising candidates should then be synthesized and tested experimentally through in vitro binding assays, hemolysis and cytotoxicity tests, and stability analyses to verify their predicted properties. Only peptides showing clear binding and favorable safety profiles would be considered for further development.


Part C: Final Project: L-Protein Mutants


High level summary: The objective of this assignment is to improve the stability and auto-folding of the lysis protein of a MS2-phage. This mechanism is key to the understanding of how phages can potentially solve antibiotic-resistance.


HTGAA 2026 – Protein Design II: Phage homework


1. Background Bacteriophages are viruses that infect bacteria. In this homework, I am working with MS2, a small RNA phage that infects Escherichia coli. My goal is to design mutants of the MS2 L-protein that can overcome a bacterial resistance mechanism.


2. The L-protein


Figure 12. From Chamakura et al. 2017.

--- The L-protein is the lysis protein of MS2. It is only 75 amino acids long and it kills the bacterial cell by disrupting the membrane. The protein has two main regions: - **Soluble domain** (positions 1–40): located outside the membrane. This region interacts with a bacterial chaperone called DnaJ. - **Transmembrane (TM) domain** (positions 41–75): inserted into the inner membrane of the bacterium. One important rule: positions 48 (L) and 49 (S) — the LS motif — are absolutely conserved and must never be mutated.

3. The DnaJ problem The L-protein needs DnaJ to work. DnaJ (UniProt P08622) is a bacterial chaperone from E. coli that physically interacts with the soluble domain of L-protein to activate lysis. Bacteria can mutate DnaJ to become resistant to the phage — the best-characterized resistance mutation is P330Q in the C-terminal domain of DnaJ. When this mutation is present, the interaction with L-protein is disrupted and the phage cannot lyse the cell.

To fight this resistance, I designed L-protein mutants — especially in the soluble domain — that might interact with DnaJ differently, making it harder for the bacterium to escape. To evaluate this, I used AF2-Multimer (ColabFold) to co-fold each L-protein mutant with two versions of DnaJ:

  • DnaJ wild-type: to check whether the mutation changes the natural interaction
  • DnaJ P330Q: to check whether the mutant can still interact with the resistant form of DnaJ

4. Reading the experimental data The paper by Chamakura et al. (2017) provides a mutational analysis of the L-protein. For each position, missense mutations were tested and scored for two things:

  • Lysis: does the mutant still lyse the cell? (1 = yes, 0 = no)
  • Protein levels: is the mutant protein stable and detectable? (1 = yes, 0 = no) A good candidate mutation must have Lysis = 1 and Protein = 1. This means the protein is stable and still functional. Mutations with Lysis = 0 are defective and not useful.

5. MS2 L-protein candidate mutations (Lysis = 1, Protein = 1) from L-protein mutants - excel To filter the candidates, I used the Excel file provided in the homework materials: L-protein Mutants. This file contains the experimental results from Chamakura et al. (2017), with each mutation scored for Lysis and Protein levels. From the full dataset of 35 functional mutations, I filtered for those with both Lysis = 1 and Protein = 1. This gave me 10 strong candidates to choose from.

Soluble domain (pos 1–40)

PositionAA changebp changeLysisProteinNote
13P → LC→T (pos 38)11
15S → AT→G (pos 43)11
18R → GA→G (pos 52)11RRR cluster (DnaJ binding site)
18R → IG→T (pos 53)11RRR cluster (DnaJ binding site)
30R → QG→A (pos 89)11
30R → LG→T (pos 89)11
31R → IG→T (pos 92)11

Transmembrane domain (pos 41–75)

PositionAA changebp changeLysisProteinNote
44L → PT→C (pos 131)11TM start
45A → PG→C (pos 133)11TM start
46I → FA→T (pos 136)11TM start

6. Conservation analysis with ClustalOmega To make sure our candidate mutations are well-chosen, I used BLAST to find similar L-protein sequences from other MS2 phage strains. I then aligned all the sequences using ClustalOmega to see which positions are conserved across evolution.

In the alignment, conserved positions are marked with * — these are amino acids that never change across strains, meaning they are probably essential for the protein to work. Variable positions have no symbol, meaning mutations there are better tolerated.

The results confirmed:

  • The RRR cluster (positions 18–20) shows variation across strains. One natural MS2 strain already has isoleucine (I) at position 18 instead of arginine — exactly our mutation R18 → I. This makes our choice even more justified: if nature already “tried” this mutation and the phage survived, it is a safe and reasonable design.
  • The transmembrane domain (positions 44–46) is conserved, with * symbols in the alignment. This confirms these positions are structurally important. However, the experimental data from Chamakura et al. (2017) shows that mutations L44 → P and A45 → P maintain Lysis = 1 and Protein = 1, so I can proceed with confidence.
  • Position 30 (R) is also well conserved, suggesting it plays an important role in the soluble domain.

Overall, the conservation analysis validates our 5 final mutations: they target functionally important sites, but the experimental data confirms they are compatible with lysis activity.

The alignment was generated using ClustalOmega with BLAST results from the homework materials.

Figura 13. ClustalOmega alignment of MS2 L-protein sequences Figura 13. ClustalOmega alignment of MS2 L-protein sequences Figure: Multiple sequence alignment of MS2 L-protein homologs. Asterisks () indicate fully conserved positions.*

Download alignment file (.aln)


7. The 5 final mutants From the 10 candidates, I selected 5 mutations based on their biological relevance and experimental validation. The central question guiding my design is: can we engineer an L-protein that still lyses bacteria even when DnaJ carries the resistance mutation P330Q?

DnaJ normally interacts with the soluble domain of L-protein to activate lysis. When DnaJ is mutated at P330, this interaction is disrupted. Our hypothesis is that mutations in the L-protein — especially in the region that contacts DnaJ — could change the interaction interface, allowing the phage to lyse resistant bacteria.

For the soluble domain, I chose two mutations at position 18, both in the RRR cluster (positions 18–20). This cluster is the main contact point between the L-protein and DnaJ. Replacing the positively charged arginine with a neutral amino acid may shift how DnaJ recognizes this region, potentially allowing interaction through a different surface that the P330Q mutation does not affect.

For the transmembrane domain, I chose positions 44 and 45, at the start of the TM region. Replacing leucine or alanine with proline breaks alpha-helices, changing the geometry of the TM region and potentially affecting membrane insertion independently of DnaJ.

For the free choice, I selected R30 → Q, also in the soluble domain. Position 30 is another arginine outside the RRR cluster. Mutating it tests whether the broader charged region — not just positions 18–20 — contributes to DnaJ recognition.

All 5 mutations have Lysis = 1 and Protein = 1 in the experimental data.

#MutationDomainStrategy
1R18 → GSolubleAlter DnaJ contact surface
2R18 → ISolubleAlter DnaJ contact surface
3L44 → PTransmembraneChange TM geometry
4A45 → PTransmembraneChange TM geometry
5R30 → QSoluble (free choice)Alter DnaJ contact surface

8. Genomic overlap analysis The MS2 genome is very small and compact. The L-protein gene overlaps with two other genes: the coat protein gene and the replicase gene (rep). This means that a mutation in the L-protein gene can also change an amino acid in one of these other proteins, because they share the same nucleotides but are read in a different reading frame.

This genomic structure is shown in the Benchling snapshot from Leverkus et al. (2023):

Figura 14. MS2 genome overlap Figura 14. MS2 genome overlap

Figure: The L-protein gene (lys gene, green) overlaps with the coat protein gene (cp gene, orange) and the replicase gene (rep gene, pink). Each row shows the amino acids decoded in that reading frame from the same nucleotide sequence. Source: Leverkus et al. (2023), PMC5446614.

To check whether our 5 mutations are safe for the other genes, I wrote a Python script (with AI assistance) that reads the MS2 genome (NC_001417.2), locates each mutation in the genome, and checks what happens in the overlapping reading frames. The notebook is available here: MS2 Overlap Analysis

MutationOverlapping geneCodon #OriginalMutatedEffect
R18 → GnoneNo overlap ✓
R18 → InoneNo overlap ✓
R30 → Qrep2TCG (S)TCA (S)Synonymous ✓
L44 → Prep16CCT (P)CCC (P)Synonymous ✓
A45 → Prep17CGC (R)CCC (P)Missense ⚠️

Four of the five mutations are safe for the overlapping genes. R18 → G and R18 → I fall in a region with no overlap at all. R30 → Q and L44 → P overlap with the rep gene but produce synonymous mutations — the amino acid in the replicase does not change. The only concern is A45 → P, which changes an arginine (R) to a proline (P) in the replicase (rep codon 17). This is a known limitation and would need to be tested experimentally.

Note: A synonymous mutation means the DNA sequence changes but the amino acid stays the same, because multiple codons can encode the same amino acid (e.g. CCT and CCC both code for Proline).


9. Python script The analysis was done in three cells in the MS2 Overlap Analysis Google Colab notebook

  • Cell 1 — Load genome and verify L-protein translation
codon_table = {
    'TTT':'F','TTC':','TTA':'L','TTG':'L','CTT':'L','CTC':'L','CTA':'L','CTG':'L',
    'ATT':'I','ATC':'I','ATA':'I','ATG':'M','GTT':'V','GTC':'V','GTA':'V','GTG':'V',
    'TCT':'S','TCC':'S','TCA':'S','TCG':'S','CCT':'P','CCC':'P','CCA':'P','CCG':'P',
    'ACT':'T','ACC':'T','ACA':'T','ACG':'T','GCT':'A','GCC':'A','GCA':'A','GCG':'A',
    'TAT':'Y','TAC':'Y','TAA':'*','TAG':'*','CAT':'H','CAC':'H','CAA':'Q','CAG':'Q',
    'AAT':'N','AAC':'N','AAA':'K','AAG':'K','GAT':'D','GAC':'D','GAA':'E','GAG':'E',
    'TGT':'C','TGC':'C','TGA':'*','TGG':'W','CGT':'R','CGC':'R','CGA':'R','CGG':'R',
    'AGT':'S','AGC':'S','AGA':'R','AGG':'R','GGT':'G','GGC':'G','GGA':'G','GGG':'G',
}
genome_raw = """>NC_001417.2 phage MS2 genome
GGGTGGGACCCCTTTCGG...."""  # paste full sequence here
lines = genome_raw.strip().split('\n')
if lines[0].startswith('>'):
    lines = lines[1:]
genome = ''.join('\n'.join(lines).split()).upper()
print(f"Genome length: {len(genome)} nt")  # expected: 3569
L_START    = 1678
COAT_START = 1335
REP_START  = 1761
l_gene = genome[L_START-1:L_START-1+228]
l_protein = ''.join(codon_table.get(l_gene[i:i+3], '?') for i in range(0, 225, 3))
print(f"L-protein: {l_protein}")
print(f"Match: {l_protein == 'METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT'}")
  • Cell 2 — Analyze genomic overlap for the 5 selected mutations
mutations = [
    ('R18→G', 52,  'A', 'G'),
    ('R18→I', 53,  'G', 'T'),
    ('R30→Q', 89,  'G', 'A'),
    ('L44→P', 131, 'T', 'C'),
    ('A45→P', 133, 'G', 'C'),
]
def analyze_overlap(genome, L_bp_pos, new_base, gene_start):
    genome_pos = L_START + L_bp_pos - 1
    pos_in_gene = genome_pos - gene_start
    codon_idx = pos_in_gene // 3
    pos_in_codon = pos_in_gene % 3
    codon_start = gene_start - 1 + codon_idx * 3
    orig_codon = genome[codon_start:codon_start+3]
    mut_codon = orig_codon[:pos_in_codon] + new_base + orig_codon[pos_in_codon+1:]
    orig_aa = codon_table.get(orig_codon, '?')
    mut_aa  = codon_table.get(mut_codon, '?')
    return codon_idx+1, orig_codon, orig_aa, mut_codon, mut_aa

print(f"{'Mutation':<10} {'Gene':<8} {'Codon':<8} {'Original':<12} {'Mutated':<12} {'Effect'}")
print("-" * 65)
for desc, L_bp, orig_base, new_base in mutations:
    genome_pos = L_START + L_bp - 1
    found = False
    if COAT_START <= genome_pos <= 1724:
        cn, oc, oa, mc, ma = analyze_overlap(genome, L_bp, new_base, COAT_START)
        effect = 'SYNONYMOUS ✓' if oa == ma else ('STOP ✗' if ma == '*' else 'missense')
        print(f"{desc:<10} {'coat':<8} {cn:<8} {oc+'('+oa+')':<12} {mc+'('+ma+')':<12} {effect}")
        found = True
    if REP_START <= genome_pos <= 3398:
        cn, oc, oa, mc, ma = analyze_overlap(genome, L_bp, new_base, REP_START)
        effect = 'SYNONYMOUS ✓' if oa == ma else ('STOP ✗' if ma == '*' else 'missense')
        print(f"{desc:<10} {'rep':<8} {cn:<8} {oc+'('+oa+')':<12} {mc+'('+ma+')':<12} {effect}")
        found = True
    if not found:
        print(f"{desc:<10} {'none':<8} {'—':<8} {'—':<12} {'—':<12} no overlap ✓")
  • Cell 3 — Generate mutant sequences and DnaJ P330Q for AlphaFold
import pandas as pd

original = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"

def mutate(seq, position, new_aa):
    seq = list(seq)
    seq[position - 1] = new_aa
    return ''.join(seq)

mutants = {
    'Wild-type': original,
    'R18G': mutate(original, 18, 'G'),
    'R18I': mutate(original, 18, 'I'),
    'R30Q': mutate(original, 30, 'Q'),
    'L44P': mutate(original, 44, 'P'),
    'A45P': mutate(original, 45, 'P'),
}

print("=== L-protein sequences for AlphaFold (Entity 1) ===\n")
for name, seq in mutants.items():
    print(f"{name}:\n{seq}\n")

# DnaJ P330Q (E. coli resistance mutation)
dnaj_wt = "MAKQDYYEILGVSKTAEEREIRKAYKRLAMKYHPDRNQGDKEAEAKFKEIKEAYEVLTDSQKRAAYDQYGHAAFEQGGMGGGGFGGGADFSDIFGDVFGDIFGGGRGRQRAARGADLRYNMELTLEEAVRGVTKEIRIPTLEECDVCHGSGAKPGTQPQTCPTCHGSGQVQMRQGFFAVQQTCPHCQGRGTLIKDPCNKCHGHGRVERSKTLSVKIPAGVDTGDRIRLAGEGEAGEHGAPAGDLYVQVQVKQHPIFEREGNNLYCEVPINFAMAALGGEIEVPTLDGRVKLKVPGETQTGKLFRMRGKGVKSVRGGAQGDLLCRVVVETPVGLNERQKQLLQELQESFGGPTGEHNSPRSKSFFDGVKKFFDDLTR"
dnaj_P330Q = mutate(dnaj_wt, 330, 'Q')

print("=== DnaJ sequences for AlphaFold (Entity 2) ===\n")
print(f"DnaJ wild-type (pos 330 = {dnaj_wt[329]}):\n{dnaj_wt}\n")
print(f"DnaJ P330Q (pos 330 = {dnaj_P330Q[329]}):\n{dnaj_P330Q}\n")

# Save all sequences as CSV
rows = []
for name, seq in mutants.items():
    rows.append({'Name': name, 'Type': 'L-protein', 'Sequence': seq})
rows.append({'Name': 'DnaJ_wt',    'Type': 'DnaJ', 'Sequence': dnaj_wt})
rows.append({'Name': 'DnaJ_P330Q', 'Type': 'DnaJ', 'Sequence': dnaj_P330Q})

df = pd.DataFrame(rows)
df.to_csv('all_sequences.csv', index=False)
print("Saved: all_sequences.csv")
print(df[['Name','Type']].to_string())

from google.colab import files
files.download('all_sequences.csv')

Open in Google Colab


10. PepMLM results — pseudo perplexity analysis

I used PepMLM-650M to generate peptide binders for each of the 5 mutant L-protein sequences. PepMLM is a protein language model that generates short peptides and scores them using pseudo perplexity (PPL) — a measure of how “natural” or plausible a peptide is given the target protein. A lower PPL means the peptide fits the protein surface better.

For each mutant, I generated 8 candidate binders (15 amino acids long) and selected the one with the lowest PPL. As a biological reference, I used the fragment TNRRRPFKHEDYPCR (positions 14–28 of the original L-protein), which contains the RRR cluster — the known DnaJ binding site identified by Chamakura et al. (2017). Since this is a real functional fragment, we expect it to have a low PPL, and we use it as a baseline.

MutantBest binderPseudo Perplexity
Reference (RRR cluster)TNRRRPFKHEDYPCR2.988
R30QSTLGLLLADLLAKLL8.471
R18ISTEGQQLADDLAFIL9.492
R18GTTEEQLQLDDEGFIF12.102
L44PTTWELQLLYDEGLLL15.185
A45PTTFGQLESRDGLLIL17.985

The reference fragment has the lowest PPL (2.988), consistent with it being a real, functional sequence. All mutant binders score higher, which is expected — the mutations alter the protein surface compared to the wild-type. Among the mutants, R30Q and R18I produce the most plausible binders (lowest PPL), suggesting their protein surface remains well-structured despite the mutation. A45P scores highest, consistent with it being the most structurally disruptive mutation — it also introduces a missense change in the rep gene (see Section 8).

Note: Pseudo perplexity is a computational score, not an experimental measurement. These results are used to rank and prioritize candidates, and need to be validated with structural prediction (AlphaFold) and experimental testing.

📓 PepMLM Pseudo-Perplexity Notebook


Script 📓 PepMLM Pseudo-Perplexity Notebook

The original PepMLM notebook was adapted to handle multiple protein sequences at once, reading them from a .csv file (one column called sequence) and generating binders for all 5 mutants in a single run. The key modification is a loop over all input sequences in the “Generate Valid Peptides” cell.

  • Cell 1 — Install packages
#@title Install Packages
! pip install Bio
! pip install transformers

from google.colab import files
import pandas as pd
from Bio import SeqIO
import io
  • Cell 2 — Inputs and parameters

The original notebook assumes a single sequence. We set single_sequence = False and upload the all_sequences.csv file generated in the overlap analysis notebook (Section 9). Parameters: peptide_length = 15, top_k = 3, num_binders = 8.

#@title Inputs and Parameters
single_sequence = False  #@param {type:"boolean"}
protein_seq = "MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ"  #@param {type:"string"}

jobname = "Lprotein_mutants"  #@param {type: "string"}

if single_sequence:
    protein_seq = protein_seq
else:
    uploaded = files.upload()
    key = list(uploaded.keys())[0]
    df = pd.read_csv(io.BytesIO(uploaded[key]), header=0)
    df['sequence'] = df['sequence'].str.strip()
    if list(df.columns) != ['sequence']:
        print('ERROR: improperly formatted file')
    protein_seq = df['sequence'].tolist()

import ipywidgets as widgets
from ipywidgets import Layout
from IPython.display import display
style = {'description_width': 'initial'}

num_binders = 8
top_k = 3
peptide_length = 15

def on_change(change):
    global num_binders
    if change['type'] == 'change' and change['name'] == 'value':
        num_binders = change['new']

def update_values(change):
    global top_k, peptide_length
    top_k = top_k_slider.value
    peptide_length = peptide_length_slider.value

peptide_length_slider = widgets.IntSlider(value=15, min=3, max=50, step=1, description='Peptide Length:', style=style)
top_k_slider = widgets.IntSlider(value=3, min=1, max=10, step=1, description='Top K Value:', style=style)
display(peptide_length_slider)
display(top_k_slider)
peptide_length_slider.observe(update_values, names='value')
top_k_slider.observe(update_values, names='value')

dropdown = widgets.Dropdown(options=[1, 2, 4, 8, 16, 32], value=8,
    description='Number of Binders', style=style)
display(dropdown)
dropdown.observe(on_change)
  • Cell 3 — Load model
#@title Load Model
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
from torch.distributions.categorical import Categorical
import numpy as np
import pandas as pd

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("ChatterjeeLab/PepMLM-650M")
model = AutoModelForMaskedLM.from_pretrained("ChatterjeeLab/PepMLM-650M").to(device)

def compute_pseudo_perplexity(model, tokenizer, protein_seq, binder_seq):
    sequence = protein_seq + binder_seq
    tensor_input = tokenizer.encode(sequence, return_tensors='pt').to(model.device)
    total_loss = 0
    for i in range(-len(binder_seq)-1, -1):
        masked_input = tensor_input.clone()
        masked_input[0, i] = tokenizer.mask_token_id
        labels = torch.full(tensor_input.shape, -100).to(model.device)
        labels[0, i] = tensor_input[0, i]
        with torch.no_grad():
            outputs = model(masked_input, labels=labels)
            total_loss += outputs.loss.item()
    avg_loss = total_loss / len(binder_seq)
    return np.exp(avg_loss)

def generate_peptide_for_single_sequence(protein_seq, peptide_length=15, top_k=3, num_binders=4):
    binders_with_ppl = []
    for _ in range(num_binders):
        masked_peptide = '<mask>' * int(peptide_length)
        input_sequence = protein_seq + masked_peptide
        inputs = tokenizer(input_sequence, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits
        mask_token_indices = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
        logits_at_masks = logits[0, mask_token_indices]
        top_k_logits, top_k_indices = logits_at_masks.topk(int(top_k), dim=-1)
        probabilities = torch.nn.functional.softmax(top_k_logits, dim=-1)
        predicted_indices = Categorical(probabilities).sample()
        predicted_token_ids = top_k_indices.gather(-1, predicted_indices.unsqueeze(-1)).squeeze(-1)
        generated_binder = tokenizer.decode(predicted_token_ids, skip_special_tokens=True).replace(' ', '')
        ppl_value = compute_pseudo_perplexity(model, tokenizer, protein_seq, generated_binder)
        binders_with_ppl.append([generated_binder, ppl_value])
    return binders_with_ppl

def generate_peptide(input_seqs, peptide_length=15, top_k=3, num_binders=4):
    if isinstance(input_seqs, str):
        binders = generate_peptide_for_single_sequence(input_seqs, peptide_length, top_k, num_binders)
        return pd.DataFrame(binders, columns=['Binder', 'Pseudo Perplexity'])
    elif isinstance(input_seqs, list):
        results = []
        for seq in input_seqs:
            binders = generate_peptide_for_single_sequence(seq, peptide_length, top_k, num_binders)
            for binder, ppl in binders:
                results.append([seq, binder, ppl])
        return pd.DataFrame(results, columns=['Input Sequence', 'Binder', 'Pseudo Perplexity'])
  • Cell 4 — Generate valid peptides (adapted for multiple sequences)

This is the key cell that was modified. The original notebook generates binders for a single sequence. Here we loop over all 5 mutant sequences and filter out any peptides containing ambiguous amino acid characters (B, J, O, U, X, Z).

# Generate Valid Peptides for multiple sequences
invalid_chars = set("BJOUXZ")
all_results = []

for seq in protein_seq:  # loop over each of the 5 mutant sequences
    valid_binders = []
    while len(valid_binders) < num_binders:
        peptide_df = generate_peptide(seq, peptide_length, top_k, num_binders)
        for _, row in peptide_df.iterrows():
            binder = row["Binder"]
            ppl = row["Pseudo Perplexity"]
            if not any(c in invalid_chars for c in binder):
                valid_binders.append([seq, binder, ppl])
            if len(valid_binders) == num_binders:
                break
    all_results.extend(valid_binders)

valid_df = pd.DataFrame(all_results, columns=["Input Sequence", "Binder", "Pseudo Perplexity"])
valid_df
  • Cell 5 — Add mutant names and sort by PPL
mutant_names = ['R18G', 'R18I', 'R30Q', 'L44P', 'A45P']
mutant_seqs = protein_seq  # list loaded from CSV

seq_to_name = dict(zip(mutant_seqs, mutant_names))
valid_df['Mutant'] = valid_df['Input Sequence'].map(seq_to_name)

sorted_df = valid_df[['Mutant', 'Binder', 'Pseudo Perplexity']].sort_values('Pseudo Perplexity')

print("=== All results sorted by Pseudo Perplexity ===")
print(sorted_df.to_string(index=False))

print("\n=== Best binder per mutant ===")
print(sorted_df.groupby('Mutant').first().to_string())
  • Cell 6 — Compare with reference binder and download results
## Compare best binders with reference binder (RRR cluster — known DnaJ binding site)

known_binder = "TNRRRPFKHEDYPCR"
ppl_known = compute_pseudo_perplexity(model, tokenizer, protein_seq[0], known_binder)

print("=== Reference binder (RRR cluster - DnaJ binding site) ===")
print(f"Sequence:          {known_binder}")
print(f"Pseudo Perplexity: {ppl_known:.3f}")

print("\n=== Best binder per mutant ===")
best = sorted_df.groupby('Mutant').first().reset_index()
for _, row in best.iterrows():
    marker = " ← better than reference" if row['Pseudo Perplexity'] < ppl_known else ""
    print(f"{row['Mutant']:<6} | {row['Binder']} | PPL: {row['Pseudo Perplexity']:.3f}{marker}")

print(f"\nReference | {known_binder} | PPL: {ppl_known:.3f}")

# Save and download final results
reference_row = pd.DataFrame({
    'Mutant': ['Reference (RRR cluster)'],
    'Binder': ['TNRRRPFKHEDYPCR'],
    'Pseudo Perplexity': [2.988]
})
best_df = sorted_df.groupby('Mutant').first().reset_index()
final_df = pd.concat([reference_row, best_df], ignore_index=True)
final_df = final_df.sort_values('Pseudo Perplexity').reset_index(drop=True)
print(final_df.to_string(index=False))

final_df.to_csv('Lprotein_best_binders.csv', index=False)
files.download('Lprotein_best_binders.csv')

11. Structural predictions with AF2-Multimer

We used ColabFold to run AF2-Multimer predictions for all 5 L-protein mutants co-folded with DnaJ. We ran two sets of jobs to answer two questions:

  • Set A: Does each mutant still interact with normal DnaJ (wild-type)?
  • Set B: Does any mutant interact better with the resistant DnaJ (P330Q)?

In ColabFold, both sequences are entered in the query_sequence field separated by :, which tells AF2-Multimer to model them as two chains of a complex. We used the full DnaJ sequence (376 aa, UniProt P08622) as chain B, and each L-protein variant (75 aa) as chain A.

The key confidence score for our analysis is ipTM (interface predicted TM-score, 0–1): it measures how confidently AlphaFold predicts the interface between the two proteins. Values below 0.5 indicate low interface confidence; values above 0.8 are considered reliable. We also report pTM (overall complex confidence) and pLDDT (per-residue confidence). See figure below:

Figura 15. Confidence score scales: pLDDT, ipTM and pTM Figura 15. Confidence score scales: pLDDT, ipTM and pTM

Important limitation: AF2-Multimer is not well-suited for membrane proteins. The L-protein transmembrane domain (positions 41–75) is predicted with low confidence (pLDDT < 50), which explains the very low ipTM values across all jobs. We use the wild-type result as a baseline and compare mutants relative to it, rather than interpreting the absolute values.


  • Set A — L-protein mutants vs DnaJ wild-type

Question: Do our mutations preserve the ability to interact with normal DnaJ?

MutantBest ipTMAvg ipTMpTM (avg)pLDDT (avg)
Wild-type (control)0.1610.1340.52375.6
R18G0.1600.1370.52375.7
R18I0.1600.1380.52375.7
R30Q0.1600.1340.52475.6
L44P0.1620.1350.52675.5
A45P0.1670.1370.52675.5

Answer: Yes. All mutants show ipTM scores essentially identical to the wild-type control. None of the mutations dramatically disrupt the predicted complex with DnaJ WT.

Download results — Set A (L-protein mutants vs DnaJ wild-type): WT · R18G · R18I · R30Q · L44P · A45P


  • Set B — L-protein mutants vs DnaJ P330Q (resistant)

Question: Does any mutant recover interaction with the resistant DnaJ?

MutantBest ipTMAvg ipTMvs WT baseline
Wild-type (baseline)0.1600.138
R18G0.1610.143↑ +0.001
R18I0.1610.141↑ +0.001
R30Q0.1580.136↓ −0.002
L44P0.1680.140↑ +0.008
A45P0.1670.140↑ +0.007

Answer: L44P and A45P show the highest ipTM scores against DnaJ P330Q — both above the wild-type baseline and consistent across all 5 predicted models. R18G and R18I show marginal improvement, and R30Q scores slightly below baseline.

To identify which residues are responsible for the predicted interaction, we used PyMOL to select all L-protein atoms within 5 Å of DnaJ. The iterate command loops over each selected atom and prints its residue number, residue name, and chain — giving us the exact list of contact residues at the interface.

  • L44P contact residues: His24, Asp26 (soluble domain) and Leu56 (TM domain)
  • A45P contact residues: Leu48, Thr52, Leu56 (TM domain only)
L44P + DnaJ P330Q
Figure 15. L44P + DnaJ P330Q — two contact zones: soluble domain (His24, Asp26) and TM domain (Leu56).

L44P shows two predicted contact zones with DnaJ P330Q: one in the soluble domain (His24, Asp26) and one in the transmembrane domain (Leu56). See Figure 16.

A45P contacts DnaJ exclusively through the transmembrane domain (Leu48, Thr52, Leu56), with no involvement of the soluble region. Both results suggest that proline mutations at positions 44 and 45 reposition the TM helix against DnaJ P330Q (See Figure 17). L44P additionally engages the soluble domain, potentially creating a more stable interaction surface.

A45P + DnaJ P330Q
Figure 17. A45P + DnaJ P330Q — single contact zone in TM domain (Leu48, Thr52, Leu56).

The following PyMOL script was used to visualize the interface and list the contact residues mentioned above:

bg_color white
hide everything

# DnaJ solid surface
show surface, chain B
color orange, chain B
set transparency, 0, chain B

# Contact zone on DnaJ in red
select interface_B, (chain B within 5 of chain A)
color red, interface_B

# L-protein cartoon
show cartoon, chain A
color cyan, chain A
set cartoon_tube_radius, 0.5

# L-protein interface residues as spheres
select interface_A, (chain A within 5 of chain B)
show spheres, interface_A
color yellow, interface_A
set sphere_scale, 0.6

set antialias, 2
bg_color white
zoom interface_A, 12

# List interface residues
iterate interface_A, print(resi, resn, chain)
\```

Download results — Set B (L-protein mutants vs DnaJ P330Q): WT · R18G · R18I · R30Q · L44P · A45P


12. Open Question: How do we define a “good” mutant?

A good mutant is one that can still kill bacteria — including resistant ones. Computationally, we use ipTM (predicted interface confidence with DnaJ) and pseudo-perplexity (how natural the sequence looks) as proxies. A mutant scores better if it has higher ipTM against resistant DnaJ P330Q than the wild-type, and lower pseudo-perplexity. By these criteria, L44P and A45P are our best candidates.


Conclusions

All five mutants maintain structural integrity when co-folded with wild-type DnaJ (Set A), confirming that none of the designed mutations disrupt the overall complex. This is a prerequisite for any mutant to be functional.

Against the resistant DnaJ P330Q (Set B), L44P and A45P stand out as the top candidates. Both show consistently higher ipTM scores across all 5 predicted models compared to the wild-type L-protein baseline. Notably, both mutations are located in the transmembrane domain (positions 44–45), suggesting that TM helix geometry may play a role in adapting to the P330Q change in DnaJ. PyMOL interface analysis confirms this: L44P contacts DnaJ P330Q at two zones (His24/Asp26 in the soluble domain and Leu56 in the TM domain), while A45P contacts exclusively through the TM domain (Leu48, Thr52, Leu56). By contrast, R30Q scores slightly below baseline, and R18G and R18I show only marginal improvement.

These are computational predictions with known limitations for membrane proteins. Experimental validation — such as phage lysis assays with resistant bacteria — would be required to confirm whether L44P or A45P actually overcome DnaJ P330Q resistance in vivo.


References

  1. Chamakura KR, Tran JS, Young R. (2017). MS2 Lysis of Escherichia coli Depends on Host Chaperone DnaJ. Journal of Bacteriology, 199(12). https://doi.org/10.1128/JB.00058-17

  2. Chamakura KR, Sham LT, Reed CA, Coleman S, Bernhardt TG, Young R. (2017). Mutational analysis of the MS2 lysis protein L. Microbiology, PMC5775895. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5775895/

  3. Mezhyrova J, Martin J, Börnsen C, Dötsch V, Frangakis AS, Morgner N, Bernhard F. (2023). In vitro characterization of the phage lysis protein MS2-L. Microbiome Research Reports, 2:28. https://doi.org/10.20517/mrr.2023.28

  4. Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. (2022). ColabFold: making protein folding accessible to all. Nature Methods, 19, 679–682. https://doi.org/10.1038/s41592-022-01488-1

  5. Chatterjee et al. PepMLM-650M — peptide binder generation using masked language models. GitHub: https://github.com/programmablebio/pepmlm

This homework was completed with the assistance of Claude AI (Anthropic).