Week 5 — Protein Design Part II

HTGAA Spring 2026 | Week 5 Homework

Designing peptide binders for A4V SOD1 and engineering MS2 L-protein mutants using protein language models and structural prediction.


Part A — SOD1 Binder Peptide Design

Target: Superoxide dismutase 1 (SOD1) carrying the A4V mutation (Ala→Val at residue 4), which causes familial ALS by destabilising the N-terminus and promoting toxic aggregation.

A4V Mutant SOD1 sequence used throughout:

MATVKCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSR
KHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN
AGSRLACGVIGIAQ

Part 1 — Generate Binders with PepMLM

PepMLM (Peptide Masked Language Model) was used to generate peptide binders conditioned on the A4V mutant SOD1 sequence. The model was run via the PepMLM-650M HuggingFace Colab, with peptide length set to 12 amino acids and 4 peptides generated.

Mask token handling: The generated peptides initially contained a trailing X mask token at position 12. The masked position was resolved by exhaustively scoring all 20 standard amino acids at that position in the context of the target sequence and selecting the highest-probability amino acid. Pseudo-perplexity was then recalculated for all complete 12-mers using the masked language modelling approach — masking each position sequentially and computing the log-probability of the true residue.
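The mask-filling and pseudo-perplexity steps described above can be sketched in a model-agnostic way. This is a minimal illustration, not the notebook code that was actually run: the `Scorer` callable is a hypothetical stand-in for the masked language model (in a real run it would mask the given position, condition on the SOD1 target, and return the model's probabilities), and both toy scorers exist only so the example runs without downloading weights.

```python
import math
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical interface: given a peptide and a 0-based position to mask,
# return P(amino acid) at that position. A real scorer would wrap the model.
Scorer = Callable[[str, int], Dict[str, float]]


def fill_masked_position(peptide: str, pos: int, scorer: Scorer) -> str:
    """Score all 20 amino acids at `pos` and keep the most probable one."""
    probs = scorer(peptide, pos)
    best = max(AMINO_ACIDS, key=lambda aa: probs[aa])
    return peptide[:pos] + best + peptide[pos + 1:]


def pseudo_perplexity(peptide: str, scorer: Scorer) -> float:
    """Mask each position in turn and exponentiate the mean negative
    log-probability assigned to the true residue."""
    total_log_prob = 0.0
    for pos, residue in enumerate(peptide):
        total_log_prob += math.log(scorer(peptide, pos)[residue])
    return math.exp(-total_log_prob / len(peptide))


def uniform_scorer(peptide: str, pos: int) -> Dict[str, float]:
    """Toy scorer: every amino acid equally likely."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def lysine_biased_scorer(peptide: str, pos: int) -> Dict[str, float]:
    """Toy scorer: 0.62 on lysine, 0.02 on each of the other 19 residues."""
    return {aa: (0.62 if aa == "K" else 0.02) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    print(fill_masked_position("WRSYAVGAALWX", 11, lysine_biased_scorer))
    print(pseudo_perplexity("WRSYAVGAALWK", uniform_scorer))
```

Under the uniform toy scorer every sequence has a pseudo-perplexity of 20, which gives a useful reference point: values below 20 mean the model is doing better than guessing uniformly over the 20 amino acids.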

Results

| # | Peptide sequence | Type | Pseudo-perplexity ↓ |
| --- | --- | --- | --- |
| 1 | WRSYAVGAALWK | PepMLM generated | 9.79 |
| 2 | WRYGAAAGEWWA | PepMLM generated | 11.19 |
| 3 | WRYPPTVVGHKD | PepMLM generated | 15.16 |
| 4 | KRSPVVAGEHKK | PepMLM generated | 18.27 |
| 5 | FLYRWLPSRRGG | Known binder (reference) | 20.42 |

Lower pseudo-perplexity indicates higher model confidence in the peptide sequence given the target; it is used here as a proxy for binding plausibility rather than a direct measure of affinity. All four PepMLM-generated peptides scored lower (better) than the known binder FLYRWLPSRRGG (20.42). Three of the four generated peptides begin with WR, suggesting PepMLM converged on tryptophan-arginine as a favoured N-terminal motif for engaging the SOD1 surface: W provides hydrophobic bulk and aromatic stacking, while R contributes electrostatic complementarity.


Part 2 — Evaluate Binders with AlphaFold3

Each peptide was co-folded with A4V mutant SOD1 as a two-chain complex on the AlphaFold Server. The ipTM (interface predicted TM-score) measures confidence in the predicted relative positioning of the two chains; a higher value indicates a more confident interface prediction, not necessarily stronger binding.

Results

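| Peptide | Type | ipTM ↑ | PAE min (Å) ↓ | Ranking score |
| --- | --- | --- | --- | --- |
| WRSYAVGAALWK | PepMLM | 0.58 | 4.12 / 5.28 | 0.67 |
| WRYPPTVVGHKD | PepMLM | 0.43 | 6.52 / 8.34 | 0.56 |
| WRYGAAAGEWWA | PepMLM | 0.37 | 7.49 / 8.40 | 0.51 |
| KRSPVVAGEHKK | PepMLM | 0.37 | 6.63 / 8.79 | 0.51 |
| FLYRWLPSRRGG | Known binder | 0.36 | 7.00 / 10.37 | 0.49 |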

PAE min values shown as SOD1→peptide / peptide→SOD1. No steric clashes detected in any model. All structures: fraction_disordered = 0.07.
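The metrics in the table can be collected programmatically from each job's summary-confidences JSON. The sketch below assumes that file exposes the keys `iptm`, `ranking_score`, `fraction_disordered`, `has_clash`, and `chain_pair_pae_min` (a chains × chains matrix); that SOD1 is chain 0 and the peptide is chain 1; and that row/column order corresponds to the SOD1→peptide / peptide→SOD1 convention used above. All three should be checked against the downloaded files. The inline JSON reuses the WRSYAVGAALWK numbers from this write-up, with the intra-chain entries left as null because they are not reported here.

```python
import json

# Stand-in for the contents of one summary-confidences file (assumed layout).
EXAMPLE_JSON = """
{
  "iptm": 0.58,
  "ranking_score": 0.67,
  "fraction_disordered": 0.07,
  "has_clash": 0.0,
  "chain_pair_pae_min": [[null, 4.12], [5.28, null]]
}
"""


def summarise_complex(summary: dict, target_idx: int = 0, peptide_idx: int = 1) -> dict:
    """Pull the interface metrics for a two-chain target/peptide complex."""
    pae = summary["chain_pair_pae_min"]
    return {
        "iptm": summary["iptm"],
        "ranking_score": summary["ranking_score"],
        # Assumed orientation: [row chain][column chain].
        "pae_min_target_to_peptide": pae[target_idx][peptide_idx],
        "pae_min_peptide_to_target": pae[peptide_idx][target_idx],
        "has_clash": bool(summary["has_clash"]),
        "fraction_disordered": summary["fraction_disordered"],
    }


row = summarise_complex(json.loads(EXAMPLE_JSON))

if __name__ == "__main__":
    for key, value in row.items():
        print(f"{key}: {value}")
```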

Analysis: AlphaFold3 predicts that WRSYAVGAALWK forms the most confident interface with A4V SOD1 (ipTM 0.58), substantially exceeding the known binder (0.36). The interface PAE of 4.12/5.28 Å is the tightest of all five peptides. In the predicted structure, WRSYAVGAALWK docks along the surface β-barrel near the N-terminal region where V4 sits, suggesting it may directly engage the destabilised N-terminus. WRYPPTVVGHKD ranks second (ipTM 0.43); WRYGAAAGEWWA and KRSPVVAGEHKK both score 0.37, comparable to the known binder. The known binder shows the weakest peptide→SOD1 interface PAE (10.37 Å), consistent with a loosely anchored docking pose.

WRSYAVGAALWK is the standout candidate — it outperforms the known binder on both pseudo-perplexity (9.79 vs 20.42) and ipTM (0.58 vs 0.36).


Part 3 — Evaluate Therapeutic Properties with PeptiVerse

Each PepMLM-generated peptide was evaluated on PeptiVerse for binding affinity, solubility, hemolysis probability, net charge, and molecular weight using the A4V SOD1 sequence as the target.

Results

| Peptide | Binding affinity | pKd/pKi | Solubility | Hemolysis (prob.) | Net charge | MW (Da) |
| --- | --- | --- | --- | --- | --- | --- |
| WRSYAVGAALWK | Medium | 7.032 | Soluble (1.000) | Non-hemolytic (0.025) | +1.76 | 1407.6 |
| WRYGAAAGEWWA | Weak | 6.986 | Soluble (1.000) | Non-hemolytic (0.091) | −0.24 | 1423.5 |
| WRYPPTVVGHKD | Weak | 4.802 | Soluble (1.000) | Non-hemolytic (0.021) | +0.85 | 1454.6 |
| KRSPVVAGEHKK | Weak | 5.355 | Soluble (1.000) | Non-hemolytic (0.011) | +2.85 | 1335.6 |

Analysis: All four peptides are predicted to be fully soluble and non-hemolytic. WRSYAVGAALWK is the only peptide classified as a medium binder (pKd 7.032), aligning with its superior AlphaFold3 ipTM. Notably, WRYPPTVVGHKD ranked 2nd structurally (ipTM 0.43) yet shows the weakest predicted affinity (pKd 4.802), suggesting its AF3 interface may not reflect strong binding energetics. All molecular weights fall within a reasonable therapeutic range (~1300–1455 Da).
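As a sanity check on the molecular weights, they can be recomputed from sequence alone. The sketch below assumes linear peptides with free, unmodified termini and uses standard average residue masses; how PeptiVerse itself computes MW is not stated here, so agreement to within about 0.1 Da is the most that should be expected.

```python
# Average residue masses in Da (amino acid minus one water), standard values.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini


def molecular_weight(peptide: str) -> float:
    """Average molecular weight of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in ("WRSYAVGAALWK", "WRYGAAAGEWWA", "WRYPPTVVGHKD", "KRSPVVAGEHKK"):
        print(pep, round(molecular_weight(pep), 2))
```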

Peptide selected to advance: WRSYAVGAALWK. This peptide consistently ranks first across all three evaluation layers — lowest pseudo-perplexity (9.79), highest ipTM (0.58) with tightest interface PAE, and strongest predicted binding affinity (pKd 7.032). It is fully soluble, non-hemolytic, and carries a mildly positive net charge (+1.76) that may aid electrostatic engagement with the SOD1 surface.
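The cross-layer comparison can be made explicit with a simple rank-sum over the three metrics reported above. This is one illustrative way to aggregate them, not a method prescribed by the assignment: each metric is weighted equally, and tied values share the better rank.

```python
# Values copied from the three results tables above.
METRICS = {
    "WRSYAVGAALWK": {"pseudo_perplexity": 9.79, "iptm": 0.58, "pkd": 7.032},
    "WRYGAAAGEWWA": {"pseudo_perplexity": 11.19, "iptm": 0.37, "pkd": 6.986},
    "WRYPPTVVGHKD": {"pseudo_perplexity": 15.16, "iptm": 0.43, "pkd": 4.802},
    "KRSPVVAGEHKK": {"pseudo_perplexity": 18.27, "iptm": 0.37, "pkd": 5.355},
}
HIGHER_IS_BETTER = {"pseudo_perplexity": False, "iptm": True, "pkd": True}


def rank_sum(metrics: dict, higher_is_better: dict) -> list:
    """Return (peptide, summed rank) pairs, best (lowest sum) first."""
    totals = {name: 0 for name in metrics}
    for metric, higher in higher_is_better.items():
        for name, values in metrics.items():
            v = values[metric]
            # Rank = 1 + number of peptides strictly better on this metric.
            better = sum(
                1 for other in metrics.values()
                if (other[metric] > v if higher else other[metric] < v)
            )
            totals[name] += better + 1
    return sorted(totals.items(), key=lambda item: item[1])


ranking = rank_sum(METRICS, HIGHER_IS_BETTER)

if __name__ == "__main__":
    for name, total in ranking:
        print(name, total)
```

With these inputs WRSYAVGAALWK ranks first on every metric, giving the minimum possible rank sum of 3.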


Part 4 — Generate Optimised Peptides with moPPIt

moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to simultaneously optimise binding affinity, hemolysis safety, non-fouling, and motif engagement at user-specified residue positions — a fundamentally different design paradigm from PepMLM’s sequence-conditioned sampling.

Parameters used:

  • Target: A4V mutant SOD1
  • Peptide length: 12 amino acids
  • Objectives enabled: Hemolysis, Affinity, Solubility, Motif
  • Motif positions: residues 1–6 (the N-terminal region containing the A4V mutation site)
  • Samples: 4

Results

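| # | Peptide | Hemolysis ↑ | Non-fouling ↑ | pKd ↑ | Motif score ↑ |
| --- | --- | --- | --- | --- | --- |
| 1 | KKQEYKEILTCR | 0.978 | 0.833 | 7.20 | 0.889 |
| 2 | EQKKQFKEYACN | 0.953 | 0.833 | 6.61 | 0.885 |
| 3 | FSKKRASYRQLC | 0.935 | 0.750 | 6.62 | 0.833 |
| 4 | RQKKKPGGKYFY | 0.965 | 0.833 | 6.58 | 0.737 |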

moPPIt vs PepMLM: While three of the four PepMLM peptides converged on WR N-termini, moPPIt generated peptides rich in charged residues (K, R, E), and three of its four peptides contain a single cysteine, possibly reflecting steering toward the charged N-terminal pocket of SOD1. moPPIt’s explicit motif guidance at residues 1–6 biases all generated peptides toward engaging the A4V mutation site directly, although the motif score is a prediction rather than a guarantee of contact. The top candidate KKQEYKEILTCR achieves pKd 7.20 and motif score 0.889.
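The compositional difference between the two peptide sets can be quantified directly from the sequences. The sketch below counts K, R, D and E as charged (histidine is deliberately excluded, which is a simplifying assumption) and reports the cysteine count per peptide.

```python
CHARGED = set("KRDE")  # assumption: histidine treated as uncharged

PEPMLM = ["WRSYAVGAALWK", "WRYGAAAGEWWA", "WRYPPTVVGHKD", "KRSPVVAGEHKK"]
MOPPIT = ["KKQEYKEILTCR", "EQKKQFKEYACN", "FSKKRASYRQLC", "RQKKKPGGKYFY"]


def charged_fraction(peptide: str) -> float:
    """Fraction of residues that are K, R, D or E."""
    return sum(aa in CHARGED for aa in peptide) / len(peptide)


def mean_charged_fraction(peptides: list) -> float:
    """Average charged fraction across a set of peptides."""
    return sum(charged_fraction(p) for p in peptides) / len(peptides)


if __name__ == "__main__":
    for label, peptides in (("PepMLM", PEPMLM), ("moPPIt", MOPPIT)):
        print(label, "mean charged fraction:", round(mean_charged_fraction(peptides), 3))
        for p in peptides:
            print("  ", p, round(charged_fraction(p), 3), "Cys:", p.count("C"))
```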

Pre-clinical advancement criteria: Before advancing to clinical studies, candidates would require: (1) experimental binding validation via SPR or ITC; (2) cellular toxicity assays; (3) protease stability profiling; (4) in vivo pharmacokinetic studies; (5) aggregation inhibition assays in SOD1-expressing neuronal cell lines. KKQEYKEILTCR (moPPIt) and WRSYAVGAALWK (PepMLM) would be advanced as primary candidates for head-to-head experimental comparison.


Part C — L-Protein Engineering

Objective: Engineer the MS2 bacteriophage lysis protein (L-protein) to overcome E. coli resistance. E. coli acquires resistance by mutating the DnaJ chaperone, preventing L-protein folding and function. The goal is to design variants that are DnaJ-independent or more efficient membrane lytics.

Wild-type L-protein sequence (UniProt P03609):

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Domain structure:

| Domain | Residues | Function |
| --- | --- | --- |
| Soluble domain | 1–41 | Interacts with DnaJ chaperone; key to resistance mechanism |
| Transmembrane domain | 42–75 | Inserts into membrane; forms pores that lyse bacterial cell |

Option 1 — ESM2 Log Likelihood Ratio Score-Guided Mutagenesis

Scoring methodology

ESM2 (facebook/esm2_t6_8M_UR50D) was used to compute Log Likelihood Ratios (LLRs) for all possible single amino acid substitutions at every position:

LLR(wt→mut, pos) = log P(mut | context) − log P(wt | context)

A positive LLR indicates the model considers the mutation more evolutionarily likely than the wild-type — a proxy for structural tolerance. The full scoring run produced 1,500 mutation scores (75 positions × 20 amino acids).
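The LLR definition translates into a few lines of code once per-position probabilities are available. The sketch below is model-agnostic: `position_probs` is a hypothetical input holding, for each position, the model's probability for each amino acid (in practice these would come from the ESM2 output at that position), and the toy three-residue example uses invented distributions purely to show the sign convention.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def llr_scores(wt_seq: str, position_probs: list) -> dict:
    """LLR for every substitution: log P(mut | context) - log P(wt | context).

    position_probs[i] maps amino acid -> probability at 0-based position i.
    Keys are mutation strings such as "T3L" (1-based positions). The wild-type
    "substitution" is included with LLR 0, giving len(wt_seq) * 20 scores.
    """
    scores = {}
    for i, wt in enumerate(wt_seq):
        probs = position_probs[i]
        for aa in AMINO_ACIDS:
            scores[f"{wt}{i + 1}{aa}"] = math.log(probs[aa]) - math.log(probs[wt])
    return scores


def peaked(favoured: str) -> dict:
    """Toy distribution: 0.62 on one residue, 0.02 on each of the other 19."""
    return {aa: (0.62 if aa == favoured else 0.02) for aa in AMINO_ACIDS}


# Toy example: the "model" agrees with wild type at positions 1 and 2,
# but prefers L over the wild-type T at position 3.
toy_scores = llr_scores("MET", [peaked("M"), peaked("E"), peaked("L")])

if __name__ == "__main__":
    print("T3L", round(toy_scores["T3L"], 3))  # positive: model prefers L
    print("M1A", round(toy_scores["M1A"], 3))  # negative: model prefers M
```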

ESM2 vs Experimental Data Correlation

| Metric | Value |
| --- | --- |
| Mutations matched between ESM and experimental data | 100 |
| Mean LLR for beneficial mutations (lysis=1) | −0.156 |
| Mean LLR for detrimental mutations (lysis=0) | −0.407 |
| Point-biserial correlation (r) | 0.098 |
| p-value | 0.331 (not significant) |

ESM2 LLR scores show weak, non-significant correlation with experimental lysis activity (r=0.098, p=0.33). Experimentally confirmed beneficial mutations such as R20W and L44P carry negative LLR scores, while some detrimental mutations (e.g. C29R, LLR=2.40) rank very highly. ESM2 captures evolutionary fitness in general protein families, not specifically lysis function — demonstrating a key limitation of sequence-only language models for small, atypical membrane proteins with few evolutionary homologues.
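For reference, the point-biserial correlation between a continuous score and a binary label can be computed without external libraries. The sketch below is a generic implementation, not the analysis script used here; the six score/label pairs are invented for illustration and are not taken from the L-protein dataset. The second function converts r into the t statistic on n − 2 degrees of freedom that underlies the significance test.

```python
import math


def point_biserial(scores: list, labels: list) -> float:
    """Point-biserial r between continuous scores and 0/1 labels."""
    n = len(scores)
    group1 = [s for s, y in zip(scores, labels) if y == 1]
    group0 = [s for s, y in zip(scores, labels) if y == 0]
    n1, n0 = len(group1), len(group0)
    mean1, mean0 = sum(group1) / n1, sum(group0) / n0
    mean_all = sum(scores) / n
    # Population standard deviation of all scores.
    sd_all = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    return (mean1 - mean0) / sd_all * math.sqrt(n1 * n0 / n ** 2)


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Invented illustrative data (not from the experimental dataset).
toy_llr = [0.5, -0.2, 1.0, -1.0, 0.1, -0.6]
toy_lysis = [1, 0, 1, 0, 1, 0]

if __name__ == "__main__":
    print("toy r:", round(point_biserial(toy_llr, toy_lysis), 3))
    # Values reported above: r = 0.098 over 100 matched mutations.
    print("t for r=0.098, n=100:", round(t_statistic(0.098, 100), 3))
```

Plugging in the reported r = 0.098 and n = 100 gives a t statistic just under 1, well short of the roughly 1.98 needed for two-sided significance at the 5% level with 98 degrees of freedom, which agrees with the non-significant p-value.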

Top ESM2 mutations by region

Soluble domain (positions 1–41):

| Mutation | Position | LLR score |
| --- | --- | --- |
| C29R | 29 | 2.395 |
| Y39L | 39 | 2.242 |
| S9Q | 9 | 2.014 |
| F5Q | 5 | 1.795 |
| Y27R | 27 | 1.628 |
| F22R | 22 | 1.602 |

Transmembrane domain (positions 42–75):

| Mutation | Position | LLR score |
| --- | --- | --- |
| K50L | 50 | 2.562 |
| N53L | 53 | 1.865 |
| E61L | 61 | 1.818 |
| T52L | 52 | 1.814 |
| A45L | 45 | 1.539 |
| Q71L | 71 | 1.126 |

Option 1 Mutants and AF2-Multimer Results

Five mutants were designed by selecting the highest-LLR positions per region (≥3 mutations each, 2 soluble, 2 TM, 1 mixed). Each was co-folded with DnaJ using AF2-Multimer (ColabFold).

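| Mutant | Mutations | Region | LLR scores | ipTM | pTM | pLDDT |
| --- | --- | --- | --- | --- | --- | --- |
| O1-M1 | C29R, Y39L, S9Q | Soluble | 2.40, 2.24, 2.01 | 0.170 | 0.530 | 75.29 |
| O1-M2 | F5Q, Y27R, F22R | Soluble | 1.80, 1.63, 1.60 | 0.140 | 0.520 | 75.03 |
| O1-M3 | K50L, N53L, E61L | TM | 2.56, 1.87, 1.82 | 0.150 | 0.520 | 76.10 |
| O1-M4 | T52L, A45L, Q71L | TM | 1.81, 1.54, 1.13 | 0.150 | 0.520 | 76.10 |
| O1-M5 | C29R, Y39L, K50L | Mixed | 2.40, 2.24, 2.56 | 0.160 | 0.530 | 75.31 |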

Mutant sequences:

| Mutant | Sequence |
| --- | --- |
| WT | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| O1-M1 | METRFPQQQQQTPASTNRRRPFKHEDYPRRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| O1-M2 | METRQPQQSQQTPASTNRRRPRKHEDRPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| O1-M3 | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLLAVIRTVTTLQQLLT |
| O1-M4 | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLLIFLSKFLNQLLLSLLEAVIRTVTTLLQLLT |
| O1-M5 | METRFPQQSQQTPASTNRRRPFKHEDYPRRRQQRSSTLLVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT |
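Mutant sequences like those above can be generated, and checked, by applying the mutation strings to the wild-type sequence. The helper below is a minimal sketch (the function name is illustrative); it refuses any mutation whose stated wild-type residue does not match the sequence, which catches off-by-one numbering errors before a sequence is submitted for folding.

```python
import re

WT = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"


def apply_mutations(wt_seq: str, mutations: list) -> str:
    """Apply substitutions written as e.g. 'C29R' (1-based positions)."""
    seq = list(wt_seq)
    for mut in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mut)
        if match is None:
            raise ValueError(f"Cannot parse mutation {mut!r}")
        wt_aa, pos, new_aa = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= pos <= len(wt_seq):
            raise ValueError(f"{mut}: position outside 1..{len(wt_seq)}")
        if wt_seq[pos - 1] != wt_aa:
            raise ValueError(
                f"{mut}: wild type has {wt_seq[pos - 1]} at position {pos}, not {wt_aa}"
            )
        seq[pos - 1] = new_aa
    return "".join(seq)


if __name__ == "__main__":
    print(apply_mutations(WT, ["C29R", "Y39L", "S9Q"]))   # O1-M1
    print(apply_mutations(WT, ["K50L", "N53L", "E61L"]))  # O1-M3
```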

Option 3 — Random Mutagenesis Guided by Experimental Data

Method

A Python function was written to generate random mutation combinations constrained to experimentally validated beneficial positions (lysis=1) from the L-Protein Mutants dataset. Of 139 experimental mutations, 35 showed beneficial lysis activity at 13 positions: 10 in the soluble domain and 3 in the TM region.

Validated beneficial positions:

| Region | Positions | Confirmed substitutions |
| --- | --- | --- |
| Soluble | 13, 15, 18, 19, 20, 23, 25, 26, 30, 31 | P13L, S15A, R18G/I, R19S/H, R20W/L, K23E, E25V/G/D, D26G, R30Q/L, R31I |
| TM | 44, 45, 46 | L44P, A45P, I46F |

The function ensures ≥3 mutations per mutant, avoids stop codon-inducing mutations, and satisfies the 2-soluble / 2-TM / 1-mixed regional requirement.
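A function of this kind might look like the sketch below. It is a reconstruction under stated assumptions, not the function actually used: the substitution whitelist is copied from the table above, the region plan (two soluble, one TM-only, one TM-focused, one mixed) and all names are illustrative, and a fixed seed is used so the output is reproducible. Because every substitution is drawn from an amino-acid whitelist, no stop codon can be introduced.

```python
import random

WT = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"

# Position -> experimentally beneficial substitutions (from the table above).
BENEFICIAL = {
    13: ["L"], 15: ["A"], 18: ["G", "I"], 19: ["S", "H"], 20: ["W", "L"],
    23: ["E"], 25: ["V", "G", "D"], 26: ["G"], 30: ["Q", "L"], 31: ["I"],
    44: ["P"], 45: ["P"], 46: ["F"],
}
TM_START = 42  # residues 42-75 form the transmembrane domain
SOLUBLE = [p for p in BENEFICIAL if p < TM_START]
TM = [p for p in BENEFICIAL if p >= TM_START]

# (label, number of soluble mutations, number of TM mutations).
# Only three TM positions are validated, so the second TM-leaning mutant
# borrows one soluble mutation.
PLAN = [("Soluble", 3, 0), ("Soluble", 3, 0), ("TM", 0, 3),
        ("TM-focused", 1, 2), ("Mixed", 2, 2)]


def sample_mutant(wt_seq: str, n_soluble: int, n_tm: int, rng: random.Random) -> list:
    """Pick distinct positions per region, then one validated substitution each."""
    chosen = rng.sample(SOLUBLE, n_soluble) + rng.sample(TM, n_tm)
    return [f"{wt_seq[p - 1]}{p}{rng.choice(BENEFICIAL[p])}" for p in sorted(chosen)]


def design_mutants(wt_seq: str, seed: int = 0) -> list:
    """Return five (label, mutation list) designs with no duplicates."""
    rng = random.Random(seed)
    designs, seen = [], set()
    for label, n_soluble, n_tm in PLAN:
        while True:
            mutations = tuple(sample_mutant(wt_seq, n_soluble, n_tm, rng))
            if mutations not in seen:
                break
        seen.add(mutations)
        designs.append((label, list(mutations)))
    return designs


if __name__ == "__main__":
    for label, mutations in design_mutants(WT, seed=1):
        print(label, ", ".join(mutations))
```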

Option 3 Mutants and AF2-Multimer Results

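| Mutant | Mutations | Region | ipTM | pTM | pLDDT |
| --- | --- | --- | --- | --- | --- |
| O3-M1 | P13L, R18G, E25G | Soluble | 0.160 | 0.530 | 75.03 |
| O3-M2 | R20W, K23E, R30Q | Soluble | 0.160 | 0.520 | 74.97 |
| O3-M3 | L44P, A45P, I46F | TM | 0.150 | 0.530 | 75.07 |
| O3-M4 | L44P, I46F, R19S | TM-focused | 0.180 | 0.540 | 75.16 |
| O3-M5 | S15A, E25V, L44P, A45P | Mixed | 0.180 | 0.540 | 74.93 |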

Mutant sequences:

| Mutant | Sequence |
| --- | --- |
| WT | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| O3-M1 | METRFPQQSQQTLASTNGRRPFKHGDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| O3-M2 | METRFPQQSQQTPASTNRRWPFEHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| O3-M3 | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFPPFFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| O3-M4 | METRFPQQSQQTPASTNRSRPFKHEDYPCRRQQRSSTLYVLIFPAFFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| O3-M5 | METRFPQQSQQTPAATNRRRPFKHVDYPCRRQQRSSTLYVLIFPPIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |

Option 1 vs Option 3 — Comparative Analysis

| Criterion | Option 1 (ESM-guided) | Option 3 (Experimental-guided) |
| --- | --- | --- |
| Mutation selection basis | ESM2 LLR (evolutionary likelihood) | Experimental lysis=1 validated mutations |
| Best ipTM achieved | 0.170 (O1-M1: C29R, Y39L, S9Q) | 0.180 (O3-M4 and O3-M5) |
| ESM-experimental correlation | Weak (r=0.098, p=0.33) | Not applicable |
| Coverage of sequence space | Full-sequence scan (all 75 positions) | Limited to 13 validated positions |
| Risk of harmful mutations | Higher: high LLR ≠ lysis activity | Lower: all mutations pre-validated |
| Best candidate to advance | O1-M1 (C29R, Y39L, S9Q) | O3-M5 (S15A, E25V, L44P, A45P) |

Option 3 produces marginally higher AF2-Multimer ipTM scores (best 0.180 vs 0.170). One plausible reason is that its mutations are drawn from positions experimentally confirmed to preserve or improve lysis function, so the protein is more likely to retain a folded, interaction-competent conformation. The difference is small, however, and all ipTM values are low (0.14–0.18), so it should be interpreted with caution. Option 1 provides a richer, sequence-wide map of evolutionary tolerance via the ESM2 heatmap, but its predictions do not correlate well with lysis phenotype for this atypical small membrane protein. The two approaches are complementary: Option 1 highlights evolutionarily permissive positions, while Option 3 grounds mutation selection in direct functional evidence.

Defining a “good” mutant: A good L-protein mutant must simultaneously satisfy four criteria: (1) computational: high pLDDT in the soluble domain, indicating stable folding; (2) mechanistic: an altered or weakened interface with DnaJ, reflecting chaperone independence; (3) functional: maintained TM helix propensity for efficient membrane insertion; and (4) experimental: clear plaques on DnaJ-mutant resistant E. coli strains in plaque assays, which is the definitive test for this project. O3-M5 (S15A, E25V, L44P, A45P) is the top candidate to advance because it addresses both design goals at once, combining soluble-domain mutations intended to reduce DnaJ dependency with TM mutations intended to alter membrane insertion geometry.


Assignment Summary

| Part | Tool used | Key result |
| --- | --- | --- |
| A-1 | PepMLM | WRSYAVGAALWK (perplexity 9.79) outperforms known binder (20.42) |
| A-2 | AlphaFold3 | WRSYAVGAALWK ipTM 0.58 vs known binder 0.36; tightest predicted interface |
| A-3 | PeptiVerse | All peptides soluble + non-hemolytic; WRSYAVGAALWK only medium binder (pKd 7.03) |
| A-4 | moPPIt | KKQEYKEILTCR: pKd 7.20, motif score 0.889; targets A4V site directly |
| C Option 1 | ESM2 + AF2-Multimer | Heatmap generated; weak ESM-experiment correlation (r=0.098); best: O1-M1 (ipTM 0.17) |
| C Option 3 | Experimental data + AF2-Multimer | Best candidates: O3-M4 and O3-M5 (both ipTM 0.18) |
| Overall | | SOD1 binder: WRSYAVGAALWK; L-protein: O3-M5 (S15A, E25V, L44P, A45P) |