Week 5 — Protein Design Part II

HTGAA Spring 2026 | Week 5 Homework

Designing peptide binders for A4V SOD1 and engineering MS2 L-protein mutants using protein language models and structural prediction.

Part A — SOD1 Binder Peptide Design

Target: Superoxide dismutase 1 (SOD1) carrying the A4V mutation (Ala→Val at residue 4), which causes familial ALS by destabilising the N-terminus and promoting toxic aggregation.

A4V Mutant SOD1 sequence used throughout:

MATVKCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSR
KHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN
AGSRLACGVIGIAQ

Part 1 — Generate Binders with PepMLM

PepMLM (Peptide Masked Language Model) was used to generate peptide binders conditioned on the A4V mutant SOD1 sequence. The model was run via the PepMLM-650M HuggingFace Colab, with peptide length set to 12 amino acids and 4 peptides generated.

Mask token handling: The generated peptides initially contained a trailing X mask token at position 12. The masked position was resolved by exhaustively scoring all 20 standard amino acids at that position in the context of the target sequence and selecting the highest-probability amino acid. Pseudo-perplexity was then recalculated for all complete 12-mers using the masked language modelling approach — masking each position sequentially and computing the log-probability of the true residue.
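The mask resolution and pseudo-perplexity steps described above can be sketched as follows. This is a minimal illustration, not the Colab's actual code: the `scorer` callable is a hypothetical stand-in for a PepMLM forward pass that masks one peptide position and returns log-probabilities for the 20 standard amino acids at that position.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed (hypothetical) scorer interface:
#   scorer(target, peptide, pos) -> {amino_acid: log P(amino_acid | context)}
# where `pos` is the 0-based peptide position that is masked.

def resolve_masked_position(target, peptide, pos, scorer):
    """Fill position `pos` with the amino acid the scorer rates most probable."""
    log_probs = scorer(target, peptide, pos)
    best = max(AMINO_ACIDS, key=lambda aa: log_probs[aa])
    return peptide[:pos] + best + peptide[pos + 1:]

def pseudo_perplexity(target, peptide, scorer):
    """exp(-mean log-probability of each true residue when it is masked)."""
    total = 0.0
    for pos, residue in enumerate(peptide):
        total += scorer(target, peptide, pos)[residue]
    return math.exp(-total / len(peptide))
```

Under a scorer that knows nothing (uniform over 20 residues) the pseudo-perplexity is 20, which gives a rough reference point for the values in the results table.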

Results

#	Peptide sequence	Type	Pseudo-perplexity ↓
1	`WRSYAVGAALWK`	PepMLM generated	9.79
2	`WRYGAAAGEWWA`	PepMLM generated	11.19
3	`WRYPPTVVGHKD`	PepMLM generated	15.16
4	`KRSPVVAGEHKK`	PepMLM generated	18.27
5	`FLYRWLPSRRGG`	Known binder (reference)	20.42

Lower pseudo-perplexity means the model assigns higher likelihood to the peptide given the target sequence; it is a proxy for binder plausibility, not a measured affinity. All four PepMLM-generated peptides score lower (better) than the known binder FLYRWLPSRRGG (20.42). Three of the four generated peptides begin with WR, suggesting PepMLM favoured tryptophan-arginine as an N-terminal motif for engaging the SOD1 surface: W could provide hydrophobic bulk and aromatic stacking, while R could contribute electrostatic complementarity.

Part 2 — Evaluate Binders with AlphaFold3

Each peptide was co-folded with A4V mutant SOD1 as a two-chain complex on the AlphaFold Server. The ipTM (interface predicted TM-score) measures confidence in the predicted relative placement of the two chains; a higher value indicates a more confident interface prediction, not necessarily tighter binding.

Results

Peptide	Type	ipTM ↑	PAE min (Å) ↓	Ranking score
`WRSYAVGAALWK`	PepMLM	0.58	4.12 / 5.28	0.67
`WRYPPTVVGHKD`	PepMLM	0.43	6.52 / 8.34	0.56
`WRYGAAAGEWWA`	PepMLM	0.37	7.49 / 8.40	0.51
`KRSPVVAGEHKK`	PepMLM	0.37	6.63 / 8.79	0.51
`FLYRWLPSRRGG`	Known binder	0.36	7.00 / 10.37	0.49

PAE min values shown as SOD1→peptide / peptide→SOD1. No steric clashes detected in any model. All structures: fraction_disordered = 0.07.
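The metrics in the table can be pulled from the per-model summary file that the AlphaFold Server includes in its download. The sketch below assumes the parsed JSON exposes the keys `iptm`, `ranking_score`, `chain_pair_pae_min`, `has_clash` and `fraction_disordered`, and that SOD1 is the first chain and the peptide the second; both assumptions should be checked against the actual files.

```python
def summarise_interface(summary, target_idx=0, peptide_idx=1):
    """Extract interface metrics from a parsed summary-confidences dict.

    `summary` is the dict obtained with json.load() from one summary file.
    The chain order (target first, peptide second) is an assumption.
    """
    pae_min = summary["chain_pair_pae_min"]
    return {
        "iptm": summary["iptm"],
        "ranking_score": summary["ranking_score"],
        "pae_min_target_to_peptide": pae_min[target_idx][peptide_idx],
        "pae_min_peptide_to_target": pae_min[peptide_idx][target_idx],
        "has_clash": bool(summary["has_clash"]),
        "fraction_disordered": summary["fraction_disordered"],
    }
```

Reading both off-diagonal entries of the chain-pair matrix is what yields the two PAE values per peptide reported above.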

Analysis: AlphaFold3 predicts that WRSYAVGAALWK forms the most confident interface with A4V SOD1 (ipTM 0.58), substantially exceeding the known binder (0.36). Its interface PAE minima of 4.12/5.28 Å are the lowest of all five peptides. In the predicted structure, WRSYAVGAALWK docks along the β-barrel surface near the N-terminal region where V4 sits, suggesting it may directly engage the destabilised N-terminus. WRYPPTVVGHKD ranks second (ipTM 0.43); WRYGAAAGEWWA and KRSPVVAGEHKK both score 0.37, comparable to the known binder. The known binder has the highest peptide→SOD1 PAE minimum (10.37 Å), consistent with a loosely anchored docking pose.

WRSYAVGAALWK is the standout candidate — it outperforms the known binder on both pseudo-perplexity (9.79 vs 20.42) and ipTM (0.58 vs 0.36).

Part 3 — Evaluate Therapeutic Properties with PeptiVerse

Each PepMLM-generated peptide was evaluated on PeptiVerse for binding affinity, solubility, hemolysis probability, net charge, and molecular weight using the A4V SOD1 sequence as the target.

Results

Peptide	Binding affinity	pKd/pKi	Solubility	Hemolysis (prob.)	Net charge	MW (Da)
`WRSYAVGAALWK`	Medium	7.032	Soluble (1.000)	Non-hemolytic (0.025)	+1.76	1407.6
`WRYGAAAGEWWA`	Weak	6.986	Soluble (1.000)	Non-hemolytic (0.091)	−0.24	1423.5
`WRYPPTVVGHKD`	Weak	4.802	Soluble (1.000)	Non-hemolytic (0.021)	+0.85	1454.6
`KRSPVVAGEHKK`	Weak	5.355	Soluble (1.000)	Non-hemolytic (0.011)	+2.85	1335.6

Analysis: All four peptides are predicted to be soluble and non-hemolytic (hemolysis probability ≤ 0.091). WRSYAVGAALWK is the only peptide classified as a medium binder (pKd 7.032), aligning with its superior AlphaFold3 ipTM. Notably, WRYPPTVVGHKD ranked 2nd structurally (ipTM 0.43) yet shows the weakest predicted affinity (pKd 4.802), suggesting its AF3 interface may not reflect strong binding energetics. All molecular weights fall within a reasonable range for a peptide therapeutic (about 1335–1455 Da).
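The molecular weights in the table can be cross-checked by hand. The sketch below sums standard average residue masses and adds one water, assuming free N- and C-termini and no modifications; how PeptiVerse computes its value internally is not stated here, so small rounding differences are possible.

```python
# Average residue masses in Da (amino acid minus one water).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # added once for the free termini

def peptide_mw(sequence):
    """Average molecular weight of an unmodified linear peptide, in Da."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS

for pep in ("WRSYAVGAALWK", "WRYGAAAGEWWA", "WRYPPTVVGHKD", "KRSPVVAGEHKK"):
    print(pep, round(peptide_mw(pep), 1))
```

With these masses, all four peptides come out within 0.1 Da of the tabulated values.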

Peptide selected to advance: WRSYAVGAALWK. This peptide consistently ranks first across all three evaluation layers — lowest pseudo-perplexity (9.79), highest ipTM (0.58) with tightest interface PAE, and strongest predicted binding affinity (pKd 7.032). It is fully soluble, non-hemolytic, and carries a mildly positive net charge (+1.76) that may aid electrostatic engagement with the SOD1 surface.
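The selection logic can be made explicit with a small consensus check over the three reported metrics. This is only a sketch of the reasoning; the numbers are copied from the tables above, and the helper names are illustrative.

```python
# Metrics copied from the PepMLM, AlphaFold3 and PeptiVerse tables.
CANDIDATES = {
    "WRSYAVGAALWK": {"ppl": 9.79, "iptm": 0.58, "pkd": 7.032},
    "WRYGAAAGEWWA": {"ppl": 11.19, "iptm": 0.37, "pkd": 6.986},
    "WRYPPTVVGHKD": {"ppl": 15.16, "iptm": 0.43, "pkd": 4.802},
    "KRSPVVAGEHKK": {"ppl": 18.27, "iptm": 0.37, "pkd": 5.355},
}

# True means higher is better; pseudo-perplexity is better when lower.
HIGHER_IS_BETTER = {"ppl": False, "iptm": True, "pkd": True}

def best_on(metric, candidates=CANDIDATES):
    """Name of the candidate with the best value for one metric."""
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(candidates, key=lambda name: candidates[name][metric])

def first_on_all_metrics(candidates=CANDIDATES):
    """Candidates that rank first on every metric (may be empty)."""
    winners = {best_on(metric, candidates) for metric in HIGHER_IS_BETTER}
    return sorted(winners) if len(winners) == 1 else []

print(first_on_all_metrics())
```

An empty result would signal that the three layers disagree and a weighted trade-off is needed instead.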

Part 4 — Generate Optimised Peptides with moPPIt

moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to simultaneously optimise binding affinity, hemolysis safety, non-fouling, and motif engagement at user-specified residue positions — a fundamentally different design paradigm from PepMLM’s sequence-conditioned sampling.

Parameters used:

Target: A4V mutant SOD1
Peptide length: 12 amino acids
Objectives enabled: Hemolysis, Affinity, Solubility, Motif
Motif positions: residues 1–6 (the N-terminal region containing the A4V mutation site)
Samples: 4

Results

#	Peptide	Hemolysis ↑	Non-fouling ↑	pKd ↑	Motif score ↑
1	`KKQEYKEILTCR`	0.978	0.833	7.20	0.889
2	`EQKKQFKEYACN`	0.953	0.833	6.61	0.885
3	`FSKKRASYRQLC`	0.935	0.750	6.62	0.833
4	`RQKKKPGGKYFY`	0.965	0.833	6.58	0.737

moPPIt vs PepMLM: While PepMLM repeatedly produced WR N-termini (three of four peptides), moPPIt generated peptides rich in charged residues (K, R, E account for 4 to 6 of the 12 positions in each peptide), and three of the four contain a single cysteine. This composition may reflect steering toward the N-terminal region of SOD1 specified as the motif. moPPIt’s explicit motif guidance at residues 1–6 biases all generated peptides toward the A4V mutation site, although the motif score is a model prediction rather than proof of engagement. The top candidate KKQEYKEILTCR achieves pKd 7.20 and motif score 0.889.
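The compositional contrast between the two peptide sets can be checked with a short tally. The sketch counts charged residues (K, R, D, E) and cysteines per peptide; treating histidine as uncharged is a simplifying assumption.

```python
CHARGED = set("KRDE")  # histidine deliberately excluded (assumption)

def composition(peptide):
    """Return (number of charged residues, number of cysteines)."""
    charged = sum(1 for aa in peptide if aa in CHARGED)
    return charged, peptide.count("C")

MOPPIT = ["KKQEYKEILTCR", "EQKKQFKEYACN", "FSKKRASYRQLC", "RQKKKPGGKYFY"]
PEPMLM = ["WRSYAVGAALWK", "WRYGAAAGEWWA", "WRYPPTVVGHKD", "KRSPVVAGEHKK"]

for label, peptides in (("moPPIt", MOPPIT), ("PepMLM", PEPMLM)):
    for pep in peptides:
        charged, cys = composition(pep)
        print(f"{label}\t{pep}\tcharged={charged}\tcys={cys}")
```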

Pre-clinical advancement criteria: Before advancing to clinical studies, candidates would require: (1) experimental binding validation via SPR or ITC; (2) cellular toxicity assays; (3) protease stability profiling; (4) in vivo pharmacokinetic studies; (5) aggregation inhibition assays in SOD1-expressing neuronal cell lines. KKQEYKEILTCR (moPPIt) and WRSYAVGAALWK (PepMLM) would be advanced as primary candidates for head-to-head experimental comparison.

Part C — L-Protein Engineering

Objective: Engineer the MS2 bacteriophage lysis protein (L-protein) to overcome E. coli resistance. E. coli acquires resistance by mutating the DnaJ chaperone, preventing L-protein folding and function. The goal is to design variants that are DnaJ-independent or more efficient membrane lytics.

Wild-type L-protein sequence (UniProt P03609):

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Domain structure:

Domain	Residues	Function
Soluble domain	1–41	Interacts with DnaJ chaperone; key to resistance mechanism
Transmembrane domain	42–75	Inserts into membrane; forms pores that lyse bacterial cell

Option 1 — ESM2 Log Likelihood Ratio Score-Guided Mutagenesis

Scoring methodology

ESM2 (facebook/esm2_t6_8M_UR50D) was used to compute Log Likelihood Ratios (LLRs) for all possible single amino acid substitutions at every position:

LLR(wt→mut, pos) = log P(mut | context) − log P(wt | context)

A positive LLR indicates the model considers the mutation more evolutionarily likely than the wild-type residue, which serves as a proxy for structural tolerance. The full scoring run produced 1,500 scores (75 positions × 20 amino acids), including the 75 wild-type-to-wild-type entries, whose LLR is 0 by definition.
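The LLR definition above translates directly into code once per-position log-probabilities are available. The sketch below assumes a hypothetical `log_probs` list, one dict per residue mapping each amino acid to log P(aa | context); producing it from ESM2 (for example by masking each position in turn) is left outside the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def llr_table(wt_seq, log_probs):
    """Map 'X12Y'-style names to log P(mut | context) - log P(wt | context)."""
    if len(log_probs) != len(wt_seq):
        raise ValueError("need one log-probability dict per residue")
    table = {}
    for i, wt in enumerate(wt_seq):
        for mut in AMINO_ACIDS:
            table[f"{wt}{i + 1}{mut}"] = log_probs[i][mut] - log_probs[i][wt]
    return table

def top_mutations(table, start, end, n):
    """Highest-LLR true substitutions with 1-based position in [start, end]."""
    hits = [
        (name, score)
        for name, score in table.items()
        if name[0] != name[-1] and start <= int(name[1:-1]) <= end
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)[:n]
```

For the 75-residue L-protein, `top_mutations(table, 1, 41, 6)` and `top_mutations(table, 42, 75, 6)` would give per-domain rankings of the kind tabulated in this section.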

ESM2 vs Experimental Data Correlation

Metric	Value
Mutations matched between ESM and experimental data	100
Mean LLR for beneficial mutations (lysis=1)	−0.156
Mean LLR for detrimental mutations (lysis=0)	−0.407
Point-biserial correlation (r)	0.098
p-value	0.331 (not significant)

ESM2 LLR scores show weak, non-significant correlation with experimental lysis activity (r=0.098, p=0.33). Experimentally confirmed beneficial mutations such as R20W and L44P carry negative LLR scores, while some detrimental mutations (e.g. C29R, LLR=2.40) rank very highly. ESM2 captures evolutionary fitness in general protein families, not specifically lysis function — demonstrating a key limitation of sequence-only language models for small, atypical membrane proteins with few evolutionary homologues.
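The correlation statistic can be reproduced without external libraries. The sketch below computes the point-biserial coefficient (numerically identical to Pearson's r with the lysis label coded 0/1) and the corresponding t statistic; the matched LLR/lysis pairs themselves are not included here, so only the reported r and n are plugged in.

```python
import math

def point_biserial(scores, labels):
    """Point-biserial r between continuous scores and 0/1 labels."""
    n = len(scores)
    if n != len(labels) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(scores) / n
    mean_y = sum(labels) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, labels))
    var_x = sum((x - mean_x) ** 2 for x in scores)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores and labels must both vary")
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t value for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(round(t_statistic(0.098, 100), 2))
```

For r = 0.098 and n = 100 the t value is about 0.97, well below the roughly 1.98 needed for p < 0.05 at 98 degrees of freedom, in line with the non-significant result reported above.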

Top ESM2 mutations by region

Soluble domain (positions 1–41):

Mutation	Position	LLR score
C29R	29	2.395
Y39L	39	2.242
S9Q	9	2.014
F5Q	5	1.795
Y27R	27	1.628
F22R	22	1.602

Transmembrane domain (positions 42–75):

Mutation	Position	LLR score
K50L	50	2.562
N53L	53	1.865
E61L	61	1.818
T52L	52	1.814
A45L	45	1.539
Q71L	71	1.126

Option 1 Mutants and AF2-Multimer Results

Five mutants were designed by selecting the highest-LLR positions per region (≥3 mutations each, 2 soluble, 2 TM, 1 mixed). Each was co-folded with DnaJ using AF2-Multimer (ColabFold).

Mutant	Mutations	Region	LLR scores	ipTM	pTM	pLDDT
O1-M1	C29R, Y39L, S9Q	Soluble	2.40, 2.24, 2.01	0.170	0.530	75.29
O1-M2	F5Q, Y27R, F22R	Soluble	1.80, 1.63, 1.60	0.140	0.520	75.03
O1-M3	K50L, N53L, E61L	TM	2.56, 1.87, 1.82	0.150	0.520	76.10
O1-M4	T52L, A45L, Q71L	TM	1.81, 1.54, 1.13	0.150	0.520	76.10
O1-M5	C29R, Y39L, K50L	Mixed	2.40, 2.24, 2.56	0.160	0.530	75.31

Mutant sequences:

Mutant	Sequence
WT	`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
O1-M1	`METRFPQQQQQTPASTNRRRPFKHEDYPRRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
O1-M2	`METRQPQQSQQTPASTNRRRPRKHEDRPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
O1-M3	`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLLAVIRTVTTLQQLLT`
O1-M4	`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLLIFLSKFLNQLLLSLLEAVIRTVTTLLQLLT`
O1-M5	`METRFPQQSQQTPASTNRRRPFKHEDYPRRRQQRSSTLLVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT`
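Mutant sequences like those above can be generated, and checked against the wild type, with a small helper. This is an illustrative sketch rather than the notebook code used for the homework; it rejects any mutation whose stated wild-type residue does not match the sequence.

```python
import re

WT = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"

def apply_mutations(wt_seq, mutations):
    """Apply substitutions written like 'C29R' (1-based) to a sequence."""
    seq = list(wt_seq)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= pos <= len(seq):
            raise ValueError(f"{mutation}: position outside sequence")
        if seq[pos - 1] != ref:
            raise ValueError(f"{mutation}: found {seq[pos - 1]} at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)

print(apply_mutations(WT, ["C29R", "Y39L", "S9Q"]))
```

The reference-residue check also catches two mutations aimed at the same position, because the second one no longer sees the wild-type residue.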

Option 3 — Random Mutagenesis Guided by Experimental Data

Method

A Python function was written to generate random mutation combinations constrained to experimentally validated beneficial positions (lysis=1) from the L-Protein Mutants dataset. Of 139 experimental mutations, 35 showed beneficial lysis activity at 13 positions: 10 in the soluble domain and 3 in the TM region.

Validated beneficial positions:

Region	Positions	Confirmed substitutions
Soluble	13, 15, 18, 19, 20, 23, 25, 26, 30, 31	P13L, S15A, R18G/I, R19S/H, R20W/L, K23E, E25V/G/D, D26G, R30Q/L, R31I
TM	44, 45, 46	L44P, A45P, I46F

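The function ensures ≥3 mutations per mutant, avoids stop codon-inducing mutations, and follows the 2-soluble / 2-TM / 1-mixed regional requirement as closely as the data allow: only three TM positions are validated, each with a single substitution, so the second TM mutant (O3-M4) pairs two TM substitutions with one soluble-domain substitution.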
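A sampler of this kind might look like the sketch below. It is a reconstruction under stated assumptions, not the function actually used: the `BENEFICIAL` table is transcribed from the validated positions listed above, the TM boundary is taken as residue 42, and the stop-codon filter simply drops any `*` substitution (none occur in this table).

```python
import random

# position -> (wild-type residue, validated substitutions)
BENEFICIAL = {
    13: ("P", "L"), 15: ("S", "A"), 18: ("R", "GI"), 19: ("R", "SH"),
    20: ("R", "WL"), 23: ("K", "E"), 25: ("E", "VGD"), 26: ("D", "G"),
    30: ("R", "QL"), 31: ("R", "I"),
    44: ("L", "P"), 45: ("A", "P"), 46: ("I", "F"),
}
TM_START = 42  # first transmembrane residue (assumed domain boundary)

def sample_mutant(region, n_mut, rng):
    """Draw n_mut (>= 3) validated substitutions at distinct positions."""
    if n_mut < 3:
        raise ValueError("each mutant needs at least 3 mutations")
    soluble = [p for p in BENEFICIAL if p < TM_START]
    tm = [p for p in BENEFICIAL if p >= TM_START]
    if region == "soluble":
        positions = rng.sample(soluble, n_mut)
    elif region == "tm":
        positions = rng.sample(tm, n_mut)  # only 3 TM positions exist
    elif region == "mixed":
        n_tm = rng.randint(1, min(len(tm), n_mut - 1))
        positions = rng.sample(tm, n_tm) + rng.sample(soluble, n_mut - n_tm)
    else:
        raise ValueError(f"unknown region {region!r}")
    mutations = []
    for pos in sorted(positions):
        wt, alts = BENEFICIAL[pos]
        allowed = [aa for aa in alts if aa != "*"]  # never introduce a stop
        mutations.append(f"{wt}{pos}{rng.choice(allowed)}")
    return mutations

rng = random.Random(0)  # fixed seed so a run can be repeated
print(sample_mutant("mixed", 4, rng))
```

Because the TM pool holds exactly three positions, `sample_mutant("tm", 3, rng)` always returns L44P, A45P, I46F, which is why a second, distinct TM-only triple cannot be drawn from this dataset.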

Option 3 Mutants and AF2-Multimer Results

Mutant	Mutations	Region	ipTM	pTM	pLDDT
O3-M1	P13L, R18G, E25G	Soluble	0.160	0.530	75.03
O3-M2	R20W, K23E, R30Q	Soluble	0.160	0.520	74.97
O3-M3	L44P, A45P, I46F	TM	0.150	0.530	75.07
O3-M4	L44P, I46F, R19S	TM-focused	0.180	0.540	75.16
O3-M5	S15A, E25V, L44P, A45P	Mixed	0.180	0.540	74.93

Mutant sequences:

Mutant	Sequence
WT	`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
O3-M1	`METRFPQQSQQTLASTNGRRPFKHGDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
O3-M2	`METRFPQQSQQTPASTNRRWPFEHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
O3-M3	`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFPPFFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
O3-M4	`METRFPQQSQQTPASTNRSRPFKHEDYPCRRQQRSSTLYVLIFPAFFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
O3-M5	`METRFPQQSQQTPAATNRRRPFKHVDYPCRRQQRSSTLYVLIFPPIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`

Option 1 vs Option 3 — Comparative Analysis

Criterion	Option 1 (ESM-guided)	Option 3 (Experimental-guided)
Mutation selection basis	ESM2 LLR (evolutionary likelihood)	Experimental lysis=1 validated mutations
Best ipTM achieved	0.170 (O1-M1: C29R, Y39L, S9Q)	0.180 (O3-M4 and O3-M5)
ESM-experimental correlation	Weak (r=0.098, p=0.33)	Not applicable
Coverage of sequence space	Full-sequence scan (all 75 positions)	Limited to 13 validated positions
Risk of harmful mutations	Higher — high LLR ≠ lysis activity	Lower — all mutations pre-validated
Best candidate to advance	O1-M1 (C29R, Y39L, S9Q)	O3-M5 (S15A, E25V, L44P, A45P)

Option 3 produces marginally higher AF2-Multimer ipTM values (best 0.18 vs 0.17). One plausible reason is that its mutations are drawn from positions experimentally confirmed to preserve or improve lysis function, so the protein is more likely to retain a folded, interaction-competent conformation; however, the gap is small and every ipTM in both sets is below 0.2, so neither set yields a confident predicted interface with DnaJ. Option 1 provides a richer, sequence-wide map of evolutionary tolerance via the ESM2 heatmap, but its predictions do not correlate well with lysis phenotype for this atypical small membrane protein. The two approaches are complementary: Option 1 highlights evolutionarily permissive positions, while Option 3 grounds mutation selection in direct functional evidence.

Defining a “good” mutant: A good L-protein mutant must simultaneously satisfy four criteria: (1) computational: high pLDDT in the soluble domain, indicating stable folding; (2) mechanistic: an altered or weakened interface with DnaJ, reflecting chaperone independence; (3) functional: maintained TM helix propensity for efficient membrane insertion; and (4) experimental: clear plaques on DnaJ-mutant resistant E. coli strains in plaque assays, which is the definitive test this project addresses. O3-M5 (S15A, E25V, L44P, A45P) is the top candidate to advance: it addresses both design goals at once by combining soluble-domain mutations that may reduce DnaJ dependency with TM mutations that may alter membrane insertion geometry.
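Criterion (1) needs a per-domain pLDDT rather than the whole-complex average reported in the tables. A minimal sketch, assuming the predicted structure is a PDB file with pLDDT stored in the B-factor column (as ColabFold writes it) and that the L-protein is chain A, which depends on the input order:

```python
def mean_plddt(pdb_lines, chain, start, end):
    """Mean CA pLDDT for residues start..end (1-based, inclusive) of one chain."""
    values = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":   # atom name, columns 13-16
            continue
        if line[21] != chain:             # chain ID, column 22
            continue
        if start <= int(line[22:26]) <= end:   # residue number, columns 23-26
            values.append(float(line[60:66]))  # B-factor, columns 61-66
    if not values:
        raise ValueError("no matching CA atoms found")
    return sum(values) / len(values)
```

For example, `mean_plddt(open(path).read().splitlines(), "A", 1, 41)` would give the soluble-domain value and residues 42 to 75 the TM value, where `path` is a placeholder for one of the predicted model files.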

Assignment Summary

Part	Tool used	Key result
A-1	PepMLM	`WRSYAVGAALWK` (perplexity 9.79) outperforms known binder (20.42)
A-2	AlphaFold3	`WRSYAVGAALWK` ipTM 0.58 vs known binder 0.36; tightest predicted interface
A-3	PeptiVerse	All peptides soluble + non-hemolytic; `WRSYAVGAALWK` only medium binder (pKd 7.03)
A-4	moPPIt	`KKQEYKEILTCR` — pKd 7.20, motif score 0.889; targets A4V site directly
C Option 1	ESM2 + AF2-Multimer	Heatmap generated; weak ESM-experiment correlation (r=0.098); best: O1-M1 (ipTM 0.17)
C Option 3	Experimental data + AF2-Multimer	Best candidates: O3-M4 and O3-M5 (both ipTM 0.18)
Overall	—	SOD1 binder: `WRSYAVGAALWK`; L-protein: O3-M5 (S15A, E25V, L44P, A45P)