Week 5 Review: Protein Design Part II

Week 5 — Protein Design II

AI-driven peptide and protein engineering, worked end-to-end on two targets.

TL;DR
Tool stack for peptide design: PepMLM (generate) → AlphaFold3 (validate) → PeptiVerse (triage) → moPPIt (re-target). Each tool catches a failure the others miss.
Target 1: SOD1-A4V (ALS). PepMLM alone produces mode-collapsed peptides that all dock at the wrong AF3 default surface. moPPIt with motif guidance produces target-aware chemistry. Advance: B3 PAEKWFVFWHPT (sub-µM predicted Kd, dimer-interface targeted).
Target 2: MS2 L-protein. ESM-style saturation scan vs random vs experiment-led picks. Big finding: language-model preference (LLR) and experimental lysis function are essentially uncorrelated (r = +0.007). Four of the model’s five top picks sit at positions where every tested substitution kills lysis.
Meta-lesson: Unsupervised protein language models predict sequence plausibility, not function. On under-represented protein families they can be actively misleading.

Course: HTGAA Spring 2026 · Lecture (Mar 3): Gabriele Corso, Pranam Chatterjee — Protein Design Part II · Author: Fiona C (Committed Listener BioPunk Node)

The tool stack at a glance

flowchart LR
    T[Target sequence] --> P[PepMLM<br/>generate plausible binders<br/>perplexity score]
    P --> A[AlphaFold3<br/>co-fold target+peptide<br/>ipTM, PAE, pLDDT, pose]
    A --> V[PeptiVerse<br/>developability triage<br/>solubility, hemolysis, Kd]
    V --> D{Site OK?<br/>Developable?}
    D -- no, redirect --> M[moPPIt<br/>site-targeted re-generation<br/>multi-objective guided]
    M --> A
    D -- yes --> L[Lead candidate<br/>for wet-lab]
    style P fill:#e0f2fe
    style A fill:#fef3c7
    style V fill:#cfe9d4
    style M fill:#e9d5ff

Tool	Purpose	Output
PepMLM (ChatterjeeLab, ESM-2 fine-tune)	Generate plausible binders from target sequence	12-mer peptides + pseudo-perplexity
AlphaFold3 (AlphaFold Server)	Predict protein-peptide complex structure	ipTM, pTM, PAE matrix, per-residue pLDDT, 5-model ensemble
PeptiVerse (ChatterjeeLab Space)	Therapeutic-property classification	pKd, solubility, hemolysis, MW, charge, GRAVY
moPPIt + MOG-DFM (ChatterjeeLab)	Site-targeted multi-objective peptide generation	Top-N from ~100 sampled trajectories per run
ESM saturation scan (ESM-2)	Predict effect of every possible single mutation	LLR score for each (position, mutation) pair

Concept sidebars

Peptide vs small molecule — they’re different modalities, not just different sizes

A peptide can sit in the same MW range as a small-molecule drug, but in drug discovery the two are separate modality classes.

	Small molecule	Peptide
Built from	Arbitrary organic synthesis	Polymerised α-amino acids
Lipinski compliance	Typically yes (≤500 Da)	Typically no (multiple violations)
Interface area buried	~300–500 Å²	~800–2000 Å²
Target topology	Deep hydrophobic pocket	Flat / shallow PPI surface
Oral bioavailability	Usually yes	Usually no (gut proteases)
CNS penetration	Often possible	Hard without engineering
Example this week	JQ1 / BRD4 (457 Da, Part B)	FLYRWLPSRRGG / SOD1 (1506 Da, Part A)

Rule of thumb: target topology dictates modality. Deep pocket → small molecule. Flat surface → peptide. Both modalities appear in Week 5 for exactly this reason.

Perplexity in one worked example

Perplexity = exponentiated mean negative log-likelihood. Read as: the effective number of equally-likely choices the model is hedging between at each position.
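
Written out, for a sequence of length L where p_i is the probability the model assigns to the observed residue at position i (the symbols are introduced here for illustration and are not from the lecture):

```latex
\mathrm{PPL}(x) = \exp\!\left(-\frac{1}{L}\sum_{i=1}^{L}\ln p_i\right)
```

A uniform model over the 20 standard amino acids has p_i = 1/20 at every position, which gives PPL = 20. That is where the random baseline comes from.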

For a toy 5-mer PEPTI with per-position probabilities P=0.20, E=0.15, P=0.25, T=0.10, I=0.05:

Step	Calc	Value
Joint likelihood	0.20 × 0.15 × 0.25 × 0.10 × 0.05	3.75 × 10⁻⁵
Sum of log-likelihoods	ln(p₁) + … + ln(p₅)	−10.19
Per-residue mean NLL	10.19 / 5	2.04
Perplexity	exp(2.04)	7.7

So the model is hedging across the equivalent of ~7.7 amino acids per position. Versus the random baseline of 20, that’s real information; versus a strong binder (PPL 2–3), it’s mediocre.
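
The table above can be reproduced in a few lines. This is a minimal sketch using only the toy probabilities given in this section; the function name `perplexity` is illustrative, not part of any tool's API.

```python
import math

def perplexity(probs):
    """Exponentiated mean negative natural-log likelihood (base e)."""
    nll = [-math.log(p) for p in probs]
    return math.exp(sum(nll) / len(nll))

# Per-position probabilities for the toy 5-mer PEPTI (from the text).
toy = [0.20, 0.15, 0.25, 0.10, 0.05]

joint = math.prod(toy)                  # joint likelihood
sum_log = sum(math.log(p) for p in toy) # sum of log-likelihoods
ppl = perplexity(toy)
uniform_ppl = perplexity([1 / 20] * 5)  # random baseline over 20 amino acids

print(f"joint={joint:.3g}  sum_log={sum_log:.2f}  ppl={ppl:.2f}  uniform={uniform_ppl:.1f}")
```

Swapping `math.log` for `math.log2` (and `math.exp` for `2 **`) gives the same perplexity; mixing the two bases does not, which is the point of the base-e warning in this section.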

Watch: ESM-2 / PepMLM use base-e. Some NLP literature uses base-2 (“bits per character”). Don’t mix.

Masked LMs like PepMLM report pseudo-perplexity — mask each position one at a time, predict the actual residue from the rest. Interpretation is the same.
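
In the same illustrative notation, pseudo-perplexity replaces each p_i with the masked-position probability, where x with position i removed denotes the sequence with residue i masked:

```latex
\mathrm{PPPL}(x) = \exp\!\left(-\frac{1}{L}\sum_{i=1}^{L}\ln p\!\left(x_i \mid x_{\setminus i}\right)\right)
```

For a target-conditioned model such as PepMLM, the conditioning context would presumably also include the target sequence. That is an assumption about the setup, not something this review states.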

AF3 metrics for protein-peptide complexes

Metric	What it measures	Threshold
pLDDT (per-residue)	Local backbone confidence (0–100)	>90 very high · 70–90 confident · 50–70 low · <50 likely disordered
pTM (global)	Whole-structure accuracy (0–1)	>0.5 fold correct
ipTM (interface)	Interface accuracy (0–1)	>0.8 high · 0.6–0.8 grey · <0.6 likely wrong
PAE (pairwise)	Expected error in Å between residue pairs	<5 Å between interface residues = confident pose

Critical short-peptide caveat. Standard ipTM cutoffs were calibrated against protein-protein complexes. For a 12-mer vs a 154-aa target, ipTM is systematically biased downward (Dunbrack, bioRxiv 2025, the ipSAE analysis). A 12-mer with ipTM 0.5 is not automatically junk. Use the literature benchmark’s ipTM as the calibration anchor, not a universal threshold.

moPPIt vs PepMLM

PepMLM samples binders plausible against the target as a whole. It can’t be told where to bind. moPPIt’s Multi-Objective Guided Discrete Flow Matching (MOG-DFM) adds:

Motif guidance — specify which target residues to engage.
Multi-objective optimization — affinity, solubility, motif specificity simultaneously, during sampling (not as a post-hoc filter).
Pareto-front output — multiple candidates at different trade-off points, not a single “best.”

Worked example 1 — SOD1-A4V peptide therapeutics

Why this target

	Detail
Disease	Familial ALS (most aggressive SOD1 variant)
Mutation	Ala → Val at residue 4 (mature numbering); position 5 in UniProt P00441
Survival from symptom onset	~1.4 years for A4V vs 3–5 yr for ALS overall
Mechanism	Toxic gain-of-function — A4V destabilizes monomer and dimer interface, drives aggregation. Not loss of dismutase activity.
Therapeutic surfaces	A4V site (β1) or dimer interface (~residues 51–54 + 114–116)
Why peptide modality	Surface is flat + shallow; ~80% nonpolar dimer interface — no deep pocket for small molecules

flowchart TD
    WT[Native SOD1 homodimer<br/>Cu/Zn bound, Tm > 90°C] --> M[A4V monomer<br/>destabilized N-terminus]
    M --> R[Disulfide-reduced<br/>demetalated apo monomer]
    R --> O[Misfolded oligomers / trimers]
    O --> F[Insoluble fibrils]
    F --> D[Motor neuron death]
    style WT fill:#cfe9d4
    style F fill:#fecaca
    style D fill:#fecaca

A useful binder must engage the A4V site itself or the dimer interface. Anything else is therapeutically irrelevant, no matter how good the prediction looks.

Stage 1 — PepMLM generation

Default sampling on SOD1-A4V produced four 12-mer peptides. The literature benchmark is scored alongside them for comparison.

#	Sequence	PPL	Issue
1	`WRYGVYAVAHKX`	10.72	`X` ambiguity code at position 12
2	`WHYYAYAAAHKX`	10.70	`X` ambiguity code at position 12
3	`WHYPAAAVRLWX`	12.76	`X` ambiguity code at position 12
4	`WHYGAAAVRLKE`	11.76	clean
Benchmark	`FLYRWLPSRRGG`	~20.64 (prior student run, proxy)	clean

Three failure modes immediately:

Pitfall 1 — Mode collapse. All four peptides share residues at positions 1 (W), 3 (Y), 7 (A), and 11 (K, in three of four). Peptides 3 and 4 differ at only 3 of 12 positions. ESM-2’s training distribution under-represents this target; the model defaults to a generic aromatic-cationic anchor pattern.

Pitfall 2 — X ambiguity codes. X is the IUPAC code for “any/unknown amino acid”, not a real residue. ESM-2’s tokenizer includes X, and when probability mass is spread thinly across residues, X can win the argmax. We substituted G (matching the benchmark’s GG terminus) for downstream use and flag the substitution transparently.

Pitfall 3 — Higher-than-textbook PPLs. Textbook “PPL ~ 2–5 = good binder” doesn’t apply here. The whole distribution sits at 10–20 for this target. Read PPL relative to benchmark, not against universal thresholds.
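
Pitfalls 1 and 2 are both easy to check programmatically. The sketch below uses the four Stage 1 sequences verbatim; the helper names (`hamming`, `replace_ambiguous`) are illustrative, and G as the substitute is simply the choice described above, not a PepMLM default.

```python
from itertools import combinations

# Stage 1 PepMLM outputs, copied from the table above.
peptides = {
    "P1": "WRYGVYAVAHKX",
    "P2": "WHYYAYAAAHKX",
    "P3": "WHYPAAAVRLWX",
    "P4": "WHYGAAAVRLKE",
}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))

# Pairwise Hamming distances: small values indicate mode collapse.
pairwise = {(i, j): hamming(peptides[i], peptides[j])
            for i, j in combinations(peptides, 2)}

# 1-based positions where all four peptides carry the same residue.
conserved = [pos for pos, column in enumerate(zip(*peptides.values()), start=1)
             if len(set(column)) == 1]

def replace_ambiguous(seq, substitute="G"):
    """Return (cleaned sequence, 1-based positions where X was replaced)."""
    flagged = [i for i, aa in enumerate(seq, start=1) if aa == "X"]
    return seq.replace("X", substitute), flagged

cleaned = {name: replace_ambiguous(seq) for name, seq in peptides.items()}

for pair, dist in sorted(pairwise.items(), key=lambda kv: kv[1]):
    print(pair, dist)
print("conserved positions:", conserved)
print(cleaned)
```

Keeping the flagged positions next to the cleaned sequence makes the substitution auditable later, which matters because a G at position 12 is an editorial choice rather than a model output.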

Stage 2 — AlphaFold3 validation

Five jobs on alphafoldserver.com (chain A = SOD1-A4V, chain B = peptide). Top model per job:

Peptide	ipTM (top)	ipTM range	Peptide pLDDT	Interface PAE median (Å)	Top contact site	A4V engaged?
P1 `WRYGVYAVAHKG`	0.49	0.29–0.49	45.9	9.70	Residues 138, 142, 144	No
P2 `WHYYAYAAAHKG`	0.40	0.19–0.40	51.4	10.30	Residues 138, 142, 144	No
P3 `WHYPAAAVRLWG`	0.28	0.18–0.28	41.6	15.45	Residues 138, 142, 143	No
P4 `WHYGAAAVRLKE`	0.35	0.21–0.35	42.8	13.70	Residues 138, 63, 137	No
Benchmark	0.25	0.18–0.25	38.3	16.20	Residues 138, 142, 143	No

flowchart LR
    All[All 5 peptides<br/>including benchmark] --> Site[Top contacts:<br/>residues 138, 142, 143, 144<br/>= electrostatic loop, back face]
    All --> A4V[A4V site, residue 5<br/>max contact prob 0.00 - 0.01<br/>min PAE 11.7 - 14.8 Å]
    style Site fill:#fef3c7
    style A4V fill:#fecaca

Pitfall 4 — AF3 default-site convergence. When AF3 can’t find a strong anchor, it deposits poorly-anchored peptides on whichever surface is geometrically convenient. For SOD1-A4V that’s the C-terminal electrostatic loop. Five different peptides converging there — including the literature benchmark — means AF3 doesn’t have a confident pose for any of them and is showing us its default surface.
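
One way to make the convergence visible is to count how often each residue shows up in the top-contact lists. The sketch below transcribes the "Top contact site" column from the Stage 2 table; the `intended_site` set combines the A4V cluster (UniProt 2–10) and dimer-interface residues (51–54, 114–116) quoted elsewhere in this review, and treating those ranges as the on-target definition is an assumption made for illustration.

```python
from collections import Counter

# Top-contact residues per AF3 job, transcribed from the Stage 2 table.
top_contacts = {
    "P1": {138, 142, 144},
    "P2": {138, 142, 144},
    "P3": {138, 142, 143},
    "P4": {138, 63, 137},
    "Benchmark": {138, 142, 143},
}

# Assumed on-target residues: A4V cluster plus dimer interface.
intended_site = set(range(2, 11)) | set(range(51, 55)) | set(range(114, 117))

# How many of the five jobs touch each residue.
residue_counts = Counter(r for contacts in top_contacts.values() for r in contacts)

# Residues contacted in every job, and whether any job hits the intended site.
shared_by_all = set.intersection(*top_contacts.values())
on_target = {name: bool(contacts & intended_site)
             for name, contacts in top_contacts.items()}

print(residue_counts.most_common())
print("shared by all:", shared_by_all)
print("on target:", on_target)
```

When unrelated peptides share the same contact residues and none of them overlaps the intended site, that pattern is the signal to move to site-targeted generation rather than to re-rank the existing set.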

Stage 3 — PeptiVerse developability

All five passed developability (soluble at 1.000 probability, hemolysis < 0.05). But the binding-affinity ranking disagrees with AF3: PeptiVerse’s top pick (P3) is AF3’s lowest-ranked generated peptide.

Peptide	PepMLM PPL rank	AF3 ipTM rank	PeptiVerse pKd	Approx Kd
P1 `WRYGVYAVAHKG`	2	1	5.483	~3.3 µM
P2 `WHYYAYAAAHKG`	1	2	5.496	~3.2 µM
P3 `WHYPAAAVRLWG`	4	4	5.995	~1.0 µM
P4 `WHYGAAAVRLKE`	3	3	5.457	~3.5 µM
Benchmark	5	5	5.555	~2.8 µM

Pitfall 5 — Cross-model disagreement. PepMLM says P2 best. AF3 says P1 best. PeptiVerse says P3 best. The “best” peptide depends on which model you trust most. Spearman correlation between AF3 ipTM and PeptiVerse pKd across the four generated peptides is negative (ρ = −0.4), although with n = 4 that is a direction, not a statistically meaningful estimate.
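
The Kd column and the rank correlation can both be recomputed from the table values. This is a small sketch with a hand-rolled Spearman (valid only when there are no ties, which holds here); the helper names are illustrative.

```python
def kd_micromolar(pkd):
    """Convert pKd (= -log10 of Kd in molar) to Kd in micromolar."""
    return 10 ** (-pkd) * 1e6

def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rho via the sum-of-squared-rank-differences formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Values for the four generated peptides, from the Stage 2 and Stage 3 tables.
iptm = {"P1": 0.49, "P2": 0.40, "P3": 0.28, "P4": 0.35}
pkd = {"P1": 5.483, "P2": 5.496, "P3": 5.995, "P4": 5.457}

names = list(iptm)
rho = spearman([iptm[n] for n in names], [pkd[n] for n in names])

for n in names:
    print(n, f"Kd ~ {kd_micromolar(pkd[n]):.1f} uM")
print(f"Spearman rho (ipTM vs pKd) = {rho:.2f}")
```

With only four points, a single swapped pair moves rho by 0.2, so the sign is about all this number can support.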

Stage 4 — moPPIt site-targeted re-generation

Two parallel runs with explicit motif guidance:

Run	Motif positions	Strategy	Mean predicted Kd (from mean pKd)
A	UniProt 2–10 (A4V cluster)	Engage the destabilization site directly	~1.4 µM
B	UniProt 51–54 + 114–116 (dimer interface)	Lock the native dimer	~140 nM

Run A — A4V cluster (5 samples):

#	Sequence	pKd	Motif	Charge	Cys	Aromatics
A1	`CTSGVNVGPGGP`	6.086	0.571	0	C@1	—
A2	`ADSENCAPSSVH`	5.888	0.552	−2	C@6	—
A3	`PSEKFCVKKHTT`	5.853	0.652	+2	C@6	F
A4	`MFAGIKNKEQQT`	5.455	0.743	+1	—	F
A5	`QGKCKFKQFNPV`	5.957	0.805	+3	C@4	2F
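
To illustrate what a Pareto front means for these outputs, the sketch below applies a post-hoc non-dominated filter to the Run A table on two of its columns (pKd and motif score, both treated as "higher is better"). This only illustrates the concept: moPPIt applies its objectives during sampling, and its actual objective set and selection logic are not reproduced here.

```python
# (pKd, motif score) per Run A candidate, copied from the table above.
run_a = {
    "A1": (6.086, 0.571),
    "A2": (5.888, 0.552),
    "A3": (5.853, 0.652),
    "A4": (5.455, 0.743),
    "A5": (5.957, 0.805),
}

def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Names of candidates that no other candidate dominates."""
    return sorted(
        name for name, score in candidates.items()
        if not any(dominates(other, score)
                   for other_name, other in candidates.items()
                   if other_name != name)
    )

front = pareto_front(run_a)
print("Pareto-optimal on (pKd, motif):", front)
```

On these two objectives alone, A1 (best pKd) and A5 (best motif score) are the non-dominated pair; adding objectives such as solubility could enlarge the front.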

Run B — Dimer interface (3 samples):

#	Sequence	pKd	Motif	Charge	Aromatics	Modality
B1	`CTAVLNVGLEWC`	6.393	0.827	−1	1W	Flanking Cys (possible macrocycle)
B2	`GLLAFYFYYLWF`	7.720 (~19 nM)	0.831	0	7	Extreme hydrophobic (developability flag)
B3	`PAEKWFVFWHPT`	6.480	0.771	~0	4	Balanced
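
The run-level Kd values quoted for Runs A and B match the Kd implied by the mean pKd (a geometric mean of Kd), not the arithmetic mean of the individual Kd values. A quick check, using only the pKd columns above:

```python
from statistics import mean

# pKd values copied from the Run A and Run B tables.
run_a_pkd = [6.086, 5.888, 5.853, 5.455, 5.957]
run_b_pkd = [6.393, 7.720, 6.480]

def kd_from_mean_pkd(pkds):
    """Kd (molar) implied by the mean pKd, i.e. the geometric mean of Kd."""
    return 10 ** (-mean(pkds))

def arithmetic_mean_kd(pkds):
    """Arithmetic mean of the individual Kd values (molar)."""
    return mean([10 ** (-p) for p in pkds])

a_um = kd_from_mean_pkd(run_a_pkd) * 1e6   # micromolar
b_nm = kd_from_mean_pkd(run_b_pkd) * 1e9   # nanomolar
b_arith_nm = arithmetic_mean_kd(run_b_pkd) * 1e9

print(f"Run A: ~{a_um:.2f} uM")
print(f"Run B: ~{b_nm:.0f} nM (geometric), ~{b_arith_nm:.0f} nM (arithmetic)")
```

The two averages diverge most for Run B because B2 is an outlier on the log scale, so it is worth stating which one a summary number uses.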

Key finding — target-aware chemistry. Run A (mixed basic/hydrophobic target) is compositionally diverse: charges span −2 to +3, frequent Cys, low aromatic content. Run B (flat hydrophobic interface, ~80% nonpolar in literature) is aromatic-rich, electrostatically neutral, with one candidate (B1) showing flanking Cys consistent with a macrocyclic design (though this is a hypothesis, not validated — see caveat below). moPPIt’s guidance produces chemistry appropriate to target-surface biology in a way PepMLM’s unconditional sampling cannot.

Caveat on B1. Flanking Cys at positions 1 and 12 could form an intramolecular disulfide and fold into a macrocycle. But it could also be coincidental (n=3 samples), a classifier-learned pattern without intentional macrocyclization, or a sampling accident. Validating the macrocyclic form requires Boltz-2 or RoseTTAFold All-Atom (not AF3 — the AlphaFold Server doesn’t support intramolecular peptide disulfides).

Caveat on B2. Predicted pKd 7.72 is the workspace headline number, but 7 of 12 residues are aromatic. PeptiVerse’s solubility classifier (1.000) is almost certainly out-of-distribution for this composition. In wet-lab practice, peptides this hydrophobic tend not to dissolve; they self-aggregate and bind non-specifically. B2 is a teaching example of why predicted Kd is not the only metric.
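
A composition check of this kind takes a few lines. The sketch below computes GRAVY with the standard Kyte-Doolittle hydropathy scale and counts F/W/Y residues; the GRAVY > 1.0 flag is the rule of thumb from the pitfalls cheat sheet in this review, not a validated solubility cutoff, and the function names are illustrative.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")  # His is not counted, matching the tables above

def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def aromatic_count(seq):
    return sum(aa in AROMATIC for aa in seq)

def composition_flag(seq, gravy_cutoff=1.0):
    """Flag sequences whose GRAVY exceeds the rule-of-thumb cutoff."""
    return gravy(seq) > gravy_cutoff

for name, seq in {"B2": "GLLAFYFYYLWF", "B3": "PAEKWFVFWHPT"}.items():
    print(name, f"GRAVY={gravy(seq):+.2f}", f"aromatics={aromatic_count(seq)}",
          "FLAG" if composition_flag(seq) else "ok")
```

B2 is flagged and B3 is not, which is consistent with treating B2's solubility score with suspicion while keeping B3 as the balanced candidate.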

Advance recommendations

flowchart LR
    Workspace[12 candidate peptides] --> P[Primary: B3<br/>PAEKWFVFWHPT<br/>balanced, sub-µM, dimer-interface]
    Workspace --> Alt[Alternate: B1<br/>CTAVLNVGLEWC<br/>macrocycle hypothesis]
    Workspace --> A4V[A4V site: A5<br/>QGKCKFKQFNPV<br/>best Run A profile]
    style P fill:#cfe9d4
    style Alt fill:#fef3c7
    style A4V fill:#e0f2fe

Worked example 2 — MS2 L-protein engineering

Why this target

	Detail
Protein	MS2 bacteriophage lysis protein, 75 aa
Sequence (UniProt P03609)	`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
Domain split	Soluble (1–40, DnaJ-interacting) + TM (41–75, pore-forming)
Engineering goal	Improve stability + auto-folding; break dependence on E. coli DnaJ chaperone
Therapeutic rationale	DnaJ-independent L-protein would overcome the most common E. coli resistance mechanism, expanding phage-therapy spectrum

Three pick strategies, head-to-head

We ran three strategies for picking 5 mutations each, then cross-scored against the Chamakura experimental lysis dataset (n=59 unique mutations with measured lysis 0/1).

flowchart TD
    Q[Pick mutations for engineering] --> R[1. Random mutagenesis<br/>option3 python script<br/>seed=42, 2-4 substitutions]
    Q --> L[2. LLR-informed only<br/>ESM saturation scan<br/>top 1% by LLR]
    Q --> E[3. Experiment-led<br/>filter for lysis=1<br/>rank by LLR within set]
    R --> RR[Mean LLR: -1.04<br/>One variant breaks initiator Met<br/>Lysis outcome: unknown for these]
    L --> LR[Mean LLR: +2.15<br/>All in top 1% of landscape<br/>~0 percent likely lysis preservation]
    E --> ER[Mean LLR: +0.18<br/>Lysis preserved: 100 percent<br/>by construction]
    style RR fill:#fef3c7
    style LR fill:#fecaca
    style ER fill:#cfe9d4

Strategy	Picks	Mean LLR	Lysis preservation
Random	various 2–4 mutation combos, incl. M1N (initiator break)	−1.04	unknown for these
LLR-informed (model only)	C29S, Y39L, K50L, N53L, S9Q	+2.15 (top 1% of landscape)	~0% expected: four of five picks sit at positions where every tested substitution kills lysis; S9Q has no data
Experiment-led	E25G, K23E, A45P, I46F, D26G	+0.18 (mediocre)	100% by construction
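
For orientation, a saturation scan scores every single substitution, usually as LLR = log p(mutant) − log p(wild type) at the mutated position. That is the common definition and is assumed here, since the Colab's exact scoring is not shown. The sketch below does not run ESM-2; it only enumerates the mutation space for the L-protein sequence and sanity-checks that each named pick matches the wild-type residue at that position.

```python
# MS2 L-protein sequence as given in the "Why this target" table.
WT = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLL"
      "EAVIRTVTTLQQLLT")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def saturation_mutations(seq):
    """All single substitutions, named like 'K50L' (1-based positions)."""
    return [f"{wt}{pos}{mut}"
            for pos, wt in enumerate(seq, start=1)
            for mut in AMINO_ACIDS if mut != wt]

def matches_wild_type(mutation, seq):
    """Check that the 'from' residue in a mutation name matches the sequence."""
    wt, pos = mutation[0], int(mutation[1:-1])
    return seq[pos - 1] == wt

all_mutations = saturation_mutations(WT)

# Picks named in the strategy table (M1N is the random-strategy initiator break).
picks = ["C29S", "Y39L", "K50L", "N53L", "S9Q",
         "E25G", "K23E", "A45P", "I46F", "D26G", "M1N"]
pick_ok = {m: matches_wild_type(m, WT) for m in picks}

print(len(WT), "residues,", len(all_mutations), "single substitutions")
print(pick_ok)
```

Checking mutation names against the reference sequence before scoring catches off-by-one numbering errors, which are easy to introduce when switching between mature and UniProt numbering.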

The big finding

ESM-2 LLR and experimental lysis function have essentially zero correlation for L-protein:

Metric	Value
Pearson r (LLR vs lysis 0/1)	+0.007
AUC (discrimination)	0.476 (marginally below the 0.5 chance level)
Mean LLR, lysis-preserving (n=19)	−0.371
Mean LLR, lysis-killing (n=40)	−0.389
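
For readers who want to recompute these metrics on their own export, here is a minimal sketch of both statistics. With a 0/1 outcome, Pearson r is the point-biserial correlation, and AUC is the probability that a randomly chosen lysis-preserving mutation outscores a randomly chosen lysis-killing one. The toy arrays are HYPOTHETICAL placeholders, not the Chamakura data.

```python
import math

def pearson(x, y):
    """Pearson correlation; with binary y this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def auc(scores, labels):
    """ROC AUC via pairwise comparison; ties count as half a win."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# HYPOTHETICAL toy values, for illustration only (not real measurements).
toy_llr = [2.1, 1.4, 0.3, 0.1, -0.4, -1.2]
toy_lysis = [0, 0, 1, 1, 0, 1]

print(f"r = {pearson(toy_llr, toy_lysis):+.3f}")
print(f"AUC = {auc(toy_llr, toy_lysis):.3f}")
```

An AUC near 0.5 together with r near 0 means the score carries essentially no ranking information for the outcome, which is the situation reported in the table above.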

Statistical-power note: at n=59, the 95% CI on r is approximately ±0.26 (Fisher z). The data are consistent with anything from r ≈ −0.25 to r ≈ +0.26, so a modest correlation in either direction cannot be ruled out. The qualitative point, that LLR is not informative enough for confident pick selection on L-protein, holds regardless.
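
The interval comes from the Fisher z-transform: z = atanh(r), with standard error 1/√(n − 3), mapped back through tanh. A short sketch (the function name is illustrative):

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r via Fisher's z."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo, hi = fisher_ci(r=0.007, n=59)
print(f"95% CI on r: [{lo:+.2f}, {hi:+.2f}]")
```

The half-width shrinks only with the square root of n, so roughly four times as many measured mutations would be needed to halve this interval.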

Quartile breakdown (59 unique mutations sorted by LLR):

Quartile	LLR range	Lysis preserved	Rate
Q1 (top 25% LLR)	+0.33 to +2.40	3 / 14	21%
Q2 (middle-high)	−0.14 to +0.31	8 / 14	57% ← best
Q3 (middle-low)	−0.77 to −0.17	2 / 14	14% ← worst
Q4 (bottom 25%)	−5.26 to −0.79	6 / 17	35%

The top-LLR quartile preserves lysis in only 21% of cases, below the 32% overall rate (19/59) and far below Q2’s 57%; only Q3 (14%) is lower. High LLR does not buy function here. One plausible reading: functionally essential residues are positions where the WT looks “unusual” to the model precisely because the unusual residue is doing necessary work. The model says “change it”; the experiment says “don’t.”
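
The rates are simple to recompute from the counts in the quartile table, which also gives the overall base rate for comparison:

```python
# (lysis preserved, total) per LLR quartile, copied from the table above.
quartiles = {"Q1": (3, 14), "Q2": (8, 14), "Q3": (2, 14), "Q4": (6, 17)}

rates = {q: kept / total for q, (kept, total) in quartiles.items()}
total_kept = sum(kept for kept, _ in quartiles.values())
total_n = sum(total for _, total in quartiles.values())
base_rate = total_kept / total_n

lowest = min(rates, key=rates.get)
highest = max(rates, key=rates.get)

for q, rate in rates.items():
    print(f"{q}: {rate:.0%}")
print(f"overall: {total_kept}/{total_n} = {base_rate:.0%}")
print("lowest:", lowest, "highest:", highest)
```

If LLR tracked function, the rates would fall steadily from Q1 to Q4; instead they are non-monotonic, with the highest rate in Q2 and the lowest in Q3.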

What our top LLR picks would have done

Cross-checking each LLR-informed pick against the experimental neighborhood:

LLR pick	LLR	Nearest experimental data	Result
K50L	+2.56 (#1 overall)	K50 → E, I, N, Q all tested	All 4 kill lysis
N53L	+1.87	N53 → D, H, I, K, Q, S all tested	All 6 kill lysis
C29S	+2.04	C29 → R tested	Kills lysis
Y39L	+2.24	Y39 → H tested	Kills lysis
S9Q	+2.01	no data at position 9	unknown

The model’s highest-confidence picks land on residues whose unusual identity is doing functional work.
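
The cross-check itself is a lookup by position. The sketch below encodes only what the table above states (which substitutions were tested at each position, all scored as lysis = 0); the dictionary layout and function name are illustrative.

```python
# position -> {tested substitution: lysis outcome (0 = killed)}, from the table.
experimental = {
    50: {"E": 0, "I": 0, "N": 0, "Q": 0},
    53: {"D": 0, "H": 0, "I": 0, "K": 0, "Q": 0, "S": 0},
    29: {"R": 0},
    39: {"H": 0},
}

llr_picks = ["K50L", "N53L", "C29S", "Y39L", "S9Q"]

def neighborhood_summary(pick, data):
    """Summarize experimental outcomes at the same position as a pick."""
    pos = int(pick[1:-1])
    tested = data.get(pos, {})
    if not tested:
        return "unknown (no data at this position)"
    killed = sum(1 for outcome in tested.values() if outcome == 0)
    return f"{killed}/{len(tested)} tested substitutions kill lysis"

summary = {pick: neighborhood_summary(pick, experimental) for pick in llr_picks}
for pick, text in summary.items():
    print(pick, "->", text)
```

Note the inference being made: none of the picked substitutions was tested directly, so the conclusion rests on other substitutions at the same position, which is suggestive rather than conclusive.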

The meta-lesson

Unsupervised protein language models predict sequence plausibility, not biological function. For under-represented protein families (phage proteins, membrane proteins, anything outside UniRef50’s bulk training distribution), they can be actively misleading, and the cost is highest at the top of their confidence distribution, where picks are actually made. Use them as one signal among many, weight experimental data heavily when available, and never trust the top picks blindly on an unfamiliar target.

Why ESM-2 fails on L-protein specifically

Phage protein — under-represented in UniRef50
Membrane-active — PLMs are notoriously weak on membrane proteins
Many WT residues are “unusual” by general-protein statistics precisely because they’re doing functional work (single free Cys at 29, Lys mid-TM at 50, Arg-rich N-terminus)
The measured function (E. coli lysis via DnaJ chaperone + membrane insertion) requires correct folding and downstream context the model can’t see

Pitfalls cheat sheet

Pitfall	Where it bit us	How to diagnose	How to fix
PepMLM mode collapse on under-represented target	Stage 1: peptide pairs share 5–9 of 12 positions; positions 1, 3, and 7 are identical in all four	Hamming-distance histogram across the generated set	Higher top_k or temperature; generate more, pick diverse
`X` ambiguity-code in PepMLM output	3 of 4 SOD1 peptides ended in X	Look for `X` in the sequence string	Substitute (G is the safest default); document transparently
AF3 default-site convergence	All 5 SOD1 peptides docked at same wrong patch	Multiple input peptides converging to same predicted contacts	Site-targeted re-generation via moPPIt
Short-peptide ipTM bias	12-mer ipTMs all 0.25–0.49, below universal threshold	ipTM significantly lower than benchmark for known binders	Calibrate against benchmark; consider ipSAE (Dunbrack 2025)
Cross-model disagreement	PepMLM, AF3, PeptiVerse rank candidates differently	Rank-correlation across model outputs	Multi-model ensembling; cross-validate against wet-lab when possible
PLM likelihood ≠ function on under-represented family	L-protein r = +0.007 with lysis	Compute LLR-vs-experiment correlation; quartile preservation rates	Use experimental data as primary filter; LLR as tiebreaker
PeptiVerse out-of-distribution for extreme compositions	B2 with 7 aromatics scored as soluble	GRAVY > 1.0 or other extreme physicochemical profile	Cross-check with Tango/Zyggregator/AGGRESCAN for aggregation

Recommended reading

Paper	DOI	Why
PepMLM — Chen, T., Quinn, Z., Dumas, M., Vincoff, S., Chatterjee, P., et al. Target sequence-conditioned design of peptide binders using masked language modeling. Nat Biotechnol 2025	10.1038/s41587-025-02761-2	The tool used in Stage 1
AlphaFold3 — Abramson, J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024	10.1038/s41586-024-07487-w	The tool used in Stage 2
MS2 L-protein DnaJ dependence — Chamakura, K. R., et al. MS2 lysis of Escherichia coli depends on host chaperone DnaJ. J Bacteriol 2017	PMC5446614	The mechanistic basis for Part C; companion paper PMC5775895 is the L-Protein Mutants experimental dataset
SOD1 dimer destabilization — Broom, H. R., et al. Destabilization of the dimer interface is a common consequence of diverse ALS-associated mutations in metal-free SOD1. Protein Science 2015	10.1002/pro.2803	Why SOD1 dimer interface is the relevant therapeutic surface

Course resources

Resource	URL	Notes
PepMLM-650M model card	huggingface.co/ChatterjeeLab/PepMLM-650M	Linked Colab + HF Space demo. License MIT.
AlphaFold Server	alphafoldserver.com	Free AF3 web interface. Limitation: doesn’t support intramolecular polymer covalent bonds — use Boltz-2 for cyclized peptide validation.
PeptiVerse Space	huggingface.co/spaces/ChatterjeeLab/PeptiVerse	Batch input supported — one peptide per line, target pasted once.
moPPIt model card	huggingface.co/ChatterjeeLab/moPPIt	Colab requires GPU runtime. Motif positions accept range or comma syntax.
ESM saturation Colab (Part C)	colab.research.google.com	Produces 1,425-substitution CSV for a 75-mer.
