Week 5 Review: Protein Design Part II

Week 5: Protein Design II

AI-driven peptide and protein engineering, worked end-to-end on two targets.

TL;DR

  • Tool stack for peptide design: PepMLM (generate) → AlphaFold3 (validate) → PeptiVerse (triage) → moPPIt (re-target). Each tool catches a failure the others miss.
  • Target 1: SOD1-A4V (ALS). PepMLM alone produces mode-collapsed peptides that all dock at the wrong AF3 default surface. moPPIt with motif guidance produces target-aware chemistry. Advance: B3 PAEKWFVFWHPT (sub-µM predicted Kd, dimer-interface targeted).
  • Target 2: MS2 L-protein. ESM-style saturation scan vs random vs experiment-led picks. Big finding: language-model preference and experimental lysis function correlate at only r = +0.007. The model’s top picks would have destroyed function.
  • Meta-lesson: Unsupervised protein language models predict sequence plausibility, not function. On under-represented protein families they can be actively misleading.

Course: HTGAA Spring 2026 · Lecture (Mar 3): Protein Design Part II (Gabriele Corso, Pranam Chatterjee) · Author: Fiona C (Committed Listener BioPunk Node)


The tool stack at a glance

flowchart LR
    T[Target sequence] --> P[PepMLM<br/>generate plausible binders<br/>perplexity score]
    P --> A[AlphaFold3<br/>co-fold target+peptide<br/>ipTM, PAE, pLDDT, pose]
    A --> V[PeptiVerse<br/>developability triage<br/>solubility, hemolysis, Kd]
    V --> D{Site OK?<br/>Developable?}
    D -- no, redirect --> M[moPPIt<br/>site-targeted re-generation<br/>multi-objective guided]
    M --> A
    D -- yes --> L[Lead candidate<br/>for wet-lab]
    style P fill:#e0f2fe
    style A fill:#fef3c7
    style V fill:#cfe9d4
    style M fill:#e9d5ff

Tool | Purpose | Output
PepMLM (ChatterjeeLab, ESM-2 fine-tune) | Generate plausible binders from target sequence | 12-mer peptides + pseudo-perplexity
AlphaFold3 (AlphaFold Server) | Predict protein-peptide complex structure | ipTM, pTM, PAE matrix, per-residue pLDDT, 5-model ensemble
PeptiVerse (ChatterjeeLab Space) | Therapeutic-property classification | pKd, solubility, hemolysis, MW, charge, GRAVY
moPPIt + MOG-DFM (ChatterjeeLab) | Site-targeted multi-objective peptide generation | Top-N from ~100 sampled trajectories per run
ESM saturation scan (ESM-2) | Predict effect of every possible single mutation | LLR score for each (position, mutation) pair

Concept sidebars

Peptide vs small molecule: different modalities, not just different sizes

A peptide can sit in the same MW range as a small-molecule drug, but in drug discovery the two are separate modality classes.

Property | Small molecule | Peptide
Built from | Arbitrary organic synthesis | Polymerised α-amino acids
Lipinski compliance | Typically yes (≤500 Da) | Typically no (multiple violations)
Interface area buried | ~300–500 Å² | ~800–2000 Å²
Target topology | Deep hydrophobic pocket | Flat / shallow PPI surface
Oral bioavailability | Usually yes | Usually no (gut proteases)
CNS penetration | Often possible | Hard without engineering
Example this week | JQ1 / BRD4 (457 Da, Part B) | FLYRWLPSRRGG / SOD1 (1506 Da, Part A)

Rule of thumb: target topology dictates modality. Deep pocket → small molecule. Flat surface → peptide. Both modalities appear in Week 5 for exactly this reason.


Perplexity in one worked example

Perplexity = exponentiated mean negative log-likelihood. Read as: the effective number of equally-likely choices the model is hedging between at each position.

For a toy 5-mer PEPTI with per-position probabilities P=0.20, E=0.15, P=0.25, T=0.10, I=0.05:

Step | Calc | Value
Joint likelihood | 0.20 × 0.15 × 0.25 × 0.10 × 0.05 | 3.75 × 10⁻⁵
Sum of log-likelihoods | ln(p₁) + … + ln(p₅) | −10.19
Per-residue mean NLL | 10.19 / 5 | 2.04
Perplexity | exp(2.04) | 7.7

So the model is hedging across the equivalent of ~7.7 amino acids per position. Versus the random baseline of 20, that’s real information; versus a strong binder (PPL 2–3), it’s mediocre.
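
The arithmetic in the table can be reproduced in a few lines. This is a minimal sketch using only the toy probabilities from the text; the function name `perplexity` is illustrative and not part of PepMLM or ESM-2.

```python
import math

# Per-position probabilities for the toy 5-mer PEPTI, copied from the text.
probs = [0.20, 0.15, 0.25, 0.10, 0.05]


def perplexity(probabilities, base=math.e):
    """Exponentiated mean negative log-likelihood, using one log base throughout."""
    nll = [-math.log(p, base) for p in probabilities]
    mean_nll = sum(nll) / len(nll)
    return base ** mean_nll


joint = math.prod(probs)                    # joint likelihood of the peptide
mean_nll = -math.log(joint) / len(probs)    # natural-log mean NLL per residue
ppl = perplexity(probs)

print(f"joint={joint:.3g}  mean NLL={mean_nll:.2f}  PPL={ppl:.1f}")

# Perplexity itself does not depend on the base as long as the log and the
# exponentiation use the same one. Mixing bases is what breaks comparisons.
print(f"PPL via base 2: {perplexity(probs, base=2):.1f}")
print(f"bits per residue: {math.log2(ppl):.2f}")
```

The last two lines illustrate the base-e versus base-2 warning: the mean NLL changes with the base (nats versus bits), the perplexity does not.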

Watch: ESM-2 / PepMLM use base-e. Some NLP literature uses base-2 (“bits per character”). Don’t mix.

Masked LMs like PepMLM report pseudo-perplexity: mask each position one at a time and predict the actual residue from the rest. Interpretation is the same.


AF3 metrics for protein-peptide complexes

Metric | What it measures | Threshold
pLDDT (per-residue) | Local backbone confidence (0–100) | >90 confident · 70–90 OK · <50 disordered
pTM (global) | Whole-structure accuracy (0–1) | >0.5 fold correct
ipTM (interface) | Interface accuracy (0–1) | >0.8 high · 0.6–0.8 grey · <0.6 likely wrong
PAE (pairwise) | Expected error in Å between residue pairs | <5 Å between interface residues = confident pose

Critical short-peptide caveat. Standard ipTM cutoffs were calibrated against protein-protein complexes. For a 12-mer vs a 154-aa target, ipTM is systematically biased downward (Stein & Dunbrack, bioRxiv 2025, the ipSAE analysis). A 12-mer with ipTM 0.5 is not auto-junk. Use the literature benchmark’s ipTM as the calibration anchor, not a universal threshold.
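
One way to make "calibrate against the benchmark" concrete is a small triage helper. This is a sketch under stated assumptions: the class `Af3Result`, the function `triage`, and the three labels are hypothetical, and the two example records reuse the top-model ipTM and interface PAE values reported in the Stage 2 table of this write-up.

```python
from dataclasses import dataclass


@dataclass
class Af3Result:
    name: str
    iptm: float
    interface_pae: float  # median interface PAE, in angstroms


def triage(candidate: Af3Result, benchmark: Af3Result) -> str:
    """Label a short-peptide prediction relative to a benchmark run."""
    # Standard protein-protein cutoffs from the metrics table.
    if candidate.iptm >= 0.8 and candidate.interface_pae < 5.0:
        return "confident by standard cutoffs"
    # Otherwise compare against the known binder run under the same settings.
    if (candidate.iptm > benchmark.iptm
            and candidate.interface_pae < benchmark.interface_pae):
        return "beats benchmark: inspect pose"
    return "no better than benchmark"


benchmark = Af3Result("FLYRWLPSRRGG", iptm=0.25, interface_pae=16.20)
p1 = Af3Result("WRYGVYAVAHKG", iptm=0.49, interface_pae=9.70)

print(p1.name, "->", triage(p1, benchmark))
```

Beating the benchmark only says the score is not junk by short-peptide standards; it says nothing about whether the pose sits on the intended site, which still has to be checked from the contacts.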


moPPIt vs PepMLM

PepMLM samples binders plausible against the target as a whole. It can’t be told where to bind. moPPIt’s Multi-Objective Guided Discrete Flow Matching (MOG-DFM) adds:

  1. Motif guidance: specify which target residues to engage.
  2. Multi-objective optimization: affinity, solubility, motif specificity simultaneously, during sampling (not as a post-hoc filter).
  3. Pareto-front output: multiple candidates at different trade-off points, not a single “best.”
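
To illustrate what a Pareto front means in point 3, the sketch below filters a set of candidates down to the non-dominated ones. It is not moPPIt’s sampler (which applies the objectives during generation); it is a plain post-hoc filter, and the peptide names and scores are hypothetical.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, score in candidates.items():
        others = (s for other, s in candidates.items() if other != name)
        if not any(dominates(s, score) for s in others):
            front.append(name)
    return front


# Hypothetical scores: (affinity pKd, motif score, solubility), higher is better.
candidates = {
    "pep_a": (7.7, 0.83, 0.20),  # best affinity, poor solubility
    "pep_b": (6.5, 0.77, 0.90),  # balanced
    "pep_c": (6.4, 0.70, 0.80),  # worse than pep_b on every objective
    "pep_d": (6.4, 0.83, 0.95),  # best solubility
}

print(pareto_front(candidates))
```

Here `pep_c` drops out because `pep_b` beats it on all three objectives, while the other three survive as different trade-off points.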

Worked example 1: SOD1-A4V peptide therapeutics

Why this target

Aspect | Detail
Disease | Familial ALS (most aggressive SOD1 variant)
Mutation | Ala → Val at residue 4 (mature numbering); position 5 in UniProt P00441
Survival from symptom onset | ~1.4 years for A4V vs 3–5 yr for ALS overall
Mechanism | Toxic gain-of-function: A4V destabilizes monomer and dimer interface, drives aggregation. Not loss of dismutase activity.
Therapeutic surfaces | A4V site (β1) or dimer interface (~residues 51–54 + 114–116)
Why peptide modality | Surface is flat + shallow; ~80% nonpolar dimer interface; no deep pocket for small molecules

flowchart TD
    WT[Native SOD1 homodimer<br/>Cu/Zn bound, Tm > 90°C] --> M[A4V monomer<br/>destabilized N-terminus]
    M --> R[Disulfide-reduced<br/>demetalated apo monomer]
    R --> O[Misfolded oligomers / trimers]
    O --> F[Insoluble fibrils]
    F --> D[Motor neuron death]
    style WT fill:#cfe9d4
    style F fill:#fecaca
    style D fill:#fecaca

A useful binder must engage the A4V site itself or the dimer interface. Anything else is therapeutically irrelevant, no matter how good the prediction looks.

Stage 1: PepMLM generation

Default sampling on SOD1-A4V produced four 12-mer peptides; the literature benchmark is listed alongside for comparison.

# | Sequence | PPL | Issue
1 | WRYGVYAVAHKX | 10.72 | X ambiguity code at position 12
2 | WHYYAYAAAHKX | 10.70 | X ambiguity code at position 12
3 | WHYPAAAVRLWX | 12.76 | X ambiguity code at position 12
4 | WHYGAAAVRLKE | 11.76 | clean
Benchmark | FLYRWLPSRRGG | ~20.64 (prior student run, proxy) | clean

Three failure modes show up immediately:

Pitfall 1: Mode collapse. All four peptides share W at position 1, Y at position 3, and A at position 7; three of four also share K at position 11. Peptides 3 and 4 differ at only 3 of 12 positions. ESM-2’s training distribution under-represents this target; the model defaults to a generic aromatic-cationic anchor pattern.

Pitfall 2: X ambiguity codes. X is the IUPAC “any/unknown amino acid” code, not a real residue. ESM-2’s tokenizer carries X; when probability mass is spread, X wins the argmax. For downstream use, X was replaced with G (matching the benchmark’s GG terminus), and the substitution is flagged transparently.

Pitfall 3: Higher-than-textbook PPLs. Textbook “PPL ~ 2–5 = good binder” doesn’t apply here. The whole distribution sits at 10–20 for this target. Read PPL relative to benchmark, not against universal thresholds.
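
The first two pitfalls are easy to check mechanically. The sketch below computes pairwise Hamming distances and fully conserved positions for the four generated peptides, then applies the X → G substitution while recording which sequences were changed. Variable and function names are illustrative.

```python
from itertools import combinations

# The four PepMLM outputs from the Stage 1 table.
peptides = {
    "P1": "WRYGVYAVAHKX",
    "P2": "WHYYAYAAAHKX",
    "P3": "WHYPAAAVRLWX",
    "P4": "WHYGAAAVRLKE",
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Mode-collapse check: small pairwise distances and many conserved columns.
pairwise = {(i, j): hamming(peptides[i], peptides[j])
            for i, j in combinations(peptides, 2)}
conserved = [pos for pos, column in enumerate(zip(*peptides.values()), start=1)
             if len(set(column)) == 1]

# Ambiguity-code check: substitute X with G and keep a record of what changed.
flagged = [name for name, seq in peptides.items() if "X" in seq]
cleaned = {name: seq.replace("X", "G") for name, seq in peptides.items()}

print("pairwise Hamming:", pairwise)
print("positions identical in all four:", conserved)
print("X substituted in:", flagged)
```

For a larger generated set, a histogram of the `pairwise` values gives the same diagnosis at a glance.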

Stage 2: AlphaFold3 validation

Five jobs on alphafoldserver.com (chain A = SOD1-A4V, chain B = peptide). Top model per job:

Peptide | ipTM (top) | ipTM range | Peptide pLDDT | Interface PAE median (Å) | Top contact site | A4V engaged?
P1 WRYGVYAVAHKG | 0.49 | 0.29–0.49 | 45.9 | 9.70 | Residues 138, 142, 144 | No
P2 WHYYAYAAAHKG | 0.40 | 0.19–0.40 | 51.4 | 10.30 | Residues 138, 142, 144 | No
P3 WHYPAAAVRLWG | 0.28 | 0.18–0.28 | 41.6 | 15.45 | Residues 138, 142, 143 | No
P4 WHYGAAAVRLKE | 0.35 | 0.21–0.35 | 42.8 | 13.70 | Residues 138, 63, 137 | No
Benchmark | 0.25 | 0.18–0.25 | 38.3 | 16.20 | Residues 138, 142, 143 | No

flowchart LR
    All[All 5 peptides<br/>including benchmark] --> Site[Top contacts:<br/>residues 138, 142, 143, 144<br/>= electrostatic loop, back face]
    All --> A4V[A4V site, residue 5<br/>max contact prob 0.00 - 0.01<br/>min PAE 11.7 - 14.8 Å]
    style Site fill:#fef3c7
    style A4V fill:#fecaca

Pitfall 4: AF3 default-site convergence. When AF3 can’t find a strong anchor, it deposits poorly-anchored peptides on whichever surface is geometrically convenient. For SOD1-A4V that’s the C-terminal electrostatic loop. Five different peptides converging there, including the literature benchmark, means AF3 doesn’t have a confident pose for any of them and is showing us its default surface.
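
Convergence like this can be flagged automatically by tallying the top contact residues across jobs. The sketch below uses the contact sites from the Stage 2 table; treating UniProt position 5 as the A4V site follows the flowchart above, and the variable names are illustrative.

```python
from collections import Counter

# Top contact residues per AF3 job, from the Stage 2 table.
top_contacts = {
    "P1": {138, 142, 144},
    "P2": {138, 142, 144},
    "P3": {138, 142, 143},
    "P4": {138, 63, 137},
    "Benchmark": {138, 142, 143},
}
target_site = {5}  # A4V position in UniProt numbering

# How many peptides touch each residue.
counts = Counter(res for sites in top_contacts.values() for res in sites)
shared_by_all = sorted(res for res, n in counts.items()
                       if n == len(top_contacts))
engaging_target = [name for name, sites in top_contacts.items()
                   if sites & target_site]

print("contact frequency:", counts.most_common())
print("residues contacted by every peptide:", shared_by_all)
print("peptides engaging the A4V site:", engaging_target)
```

A residue contacted by every unrelated input, combined with an empty target-engagement list, is the signature of a default surface rather than a real pose.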

Stage 3: PeptiVerse developability

All five passed developability (soluble with predicted probability 1.000, hemolysis < 0.05). But the predicted binding-affinity ranking disagrees with AF3’s:

Peptide | PepMLM PPL rank | AF3 ipTM rank | PeptiVerse pKd | Approx Kd
P1 WRYGVYAVAHKG | 2 | 1 | 5.483 | ~3.3 µM
P2 WHYYAYAAAHKG | 1 | 2 | 5.496 | ~3.2 µM
P3 WHYPAAAVRLWG | 4 | 4 | 5.995 | ~1.0 µM
P4 WHYGAAAVRLKE | 3 | 3 | 5.457 | ~3.5 µM
Benchmark | 5 | 5 | 5.555 | ~2.8 µM

Pitfall 5: Cross-model disagreement. PepMLM says P2 best. AF3 says P1 best. PeptiVerse says P3 best. The “best” peptide depends on which model you trust most. Across the four generated peptides, AF3 ipTM and PeptiVerse pKd are negatively correlated (Spearman ρ = −0.40 from the values in the tables above; with n = 4 this shows a direction, not a statistically significant result).
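
The rank correlation can be recomputed directly from the Stage 2 and Stage 3 tables. This is a minimal standard-library sketch of the Spearman formula for untied data; the helper names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# P1..P4 in order: top-model ipTM (Stage 2) and PeptiVerse pKd (Stage 3).
iptm = [0.49, 0.40, 0.28, 0.35]
pkd = [5.483, 5.496, 5.995, 5.457]

rho = spearman(iptm, pkd)
print(f"Spearman rho (ipTM vs pKd, n={len(iptm)}): {rho:+.2f}")
```

With only four points, the sign is the useful part: the two models do not agree on ordering, and no significance should be read into the magnitude.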

Stage 4: moPPIt site-targeted re-generation

Two parallel runs with explicit motif guidance:

Run | Motif positions | Strategy | Mean predicted Kd
A | UniProt 2–10 (A4V cluster) | Engage the destabilization site directly | ~1.4 µM
B | UniProt 51–54 + 114–116 (dimer interface) | Lock the native dimer | ~140 nM

Run A โ€” A4V cluster (5 samples):

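# | Sequence | pKd | Motif | Charge | Cys | Aromatics
A1 | CTSGVNVGPGGP | 6.086 | 0.571 | 0 | C@1 | none
A2 | ADSENCAPSSVH | 5.888 | 0.552 | −2 | C@6 | none
A3 | PSEKFCVKKHTT | 5.853 | 0.652 | +2 | C@6 | F
A4 | MFAGIKNKEQQT | 5.455 | 0.743 | +1 | none | F
A5 | QGKCKFKQFNPV | 5.957 | 0.805 | +3 | C@4 | 2F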

Run B โ€” Dimer interface (3 samples):

# | Sequence | pKd | Motif | Charge | Aromatics | Modality
B1 | CTAVLNVGLEWC | 6.393 | 0.827 | −1 | 1W | Flanking Cys (possible macrocycle)
B2 | GLLAFYFYYLWF | 7.720 (~19 nM) | 0.831 | 0 | 7 | Extreme hydrophobic (developability flag)
B3 | PAEKWFVFWHPT | 6.480 | 0.771 | ~0 | 4 | Balanced
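
The Kd figures quoted for both runs follow from pKd = −log10(Kd in molar). The sketch below converts the tabulated pKd values and shows that the per-run "mean predicted Kd" is reproduced by converting the mean pKd (a geometric mean of the Kd values), which is an inference from the numbers rather than something the tools state. Function names are illustrative.

```python
def pkd_to_kd_molar(pkd):
    """Convert pKd to Kd in mol/L."""
    return 10 ** (-pkd)


def format_kd(kd_molar):
    """Human-readable Kd in uM or nM."""
    if kd_molar >= 1e-6:
        return f"{kd_molar * 1e6:.1f} uM"
    return f"{kd_molar * 1e9:.0f} nM"


def kd_at_mean_pkd(pkds):
    """Kd corresponding to the arithmetic mean of the pKd values."""
    return pkd_to_kd_molar(sum(pkds) / len(pkds))


run_a = [6.086, 5.888, 5.853, 5.455, 5.957]  # A1..A5
run_b = [6.393, 7.720, 6.480]                # B1..B3

for label, pkds in (("Run A", run_a), ("Run B", run_b)):
    per_peptide = ", ".join(format_kd(pkd_to_kd_molar(p)) for p in pkds)
    print(f"{label}: {per_peptide}  |  mean: {format_kd(kd_at_mean_pkd(pkds))}")
```

Averaging in pKd space keeps one very tight predicted binder (B2 here) from being swamped by weaker ones, but it also means the Run B mean is pulled strongly by that single candidate.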

Key finding: target-aware chemistry. Run A (mixed basic/hydrophobic target) is compositionally diverse: charges span −2 to +3, frequent Cys, low aromatic content. Run B (flat hydrophobic interface, ~80% nonpolar in literature) is aromatic-rich and electrostatically near-neutral (charges −1 to 0), with one candidate (B1) showing flanking Cys consistent with a macrocyclic design (this is a hypothesis, not validated; see caveat below). moPPIt’s guidance produces chemistry appropriate to target-surface biology in a way PepMLM’s site-agnostic sampling cannot.

Caveat on B1. Flanking Cys at positions 1 and 12 could form an intramolecular disulfide and fold into a macrocycle. But it could also be coincidental (n=3 samples), a classifier-learned pattern without intentional macrocyclization, or a sampling accident. Validating the macrocyclic form requires Boltz-2 or RoseTTAFold All-Atom (not AF3: the AlphaFold Server doesn’t support intramolecular peptide disulfides).

Caveat on B2. Predicted pKd 7.72 is the workspace headline number, but 7 of 12 residues are aromatic. PeptiVerse’s solubility classifier (1.000) is almost certainly out-of-distribution for this composition. In wet-lab practice, peptides this hydrophobic tend not to dissolve; they self-aggregate and bind non-specifically. B2 is a teaching example of why predicted Kd is not the only metric.
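
A cheap sanity check for compositions like B2 is to compute aromatic count, net charge, and GRAVY directly from the sequence. The sketch below uses the standard Kyte-Doolittle hydropathy scale; the net-charge rule is a simplification (K/R = +1, D/E = −1, His and termini ignored), and the function names are illustrative.

```python
# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aromatic_count(seq):
    """Number of F, W, and Y residues."""
    return sum(res in "FWY" for res in seq)


def net_charge(seq):
    """Simplified net charge: +1 per K/R, -1 per D/E."""
    return sum(res in "KR" for res in seq) - sum(res in "DE" for res in seq)


def gravy(seq):
    """Grand average of hydropathy (mean Kyte-Doolittle value)."""
    return sum(KYTE_DOOLITTLE[res] for res in seq) / len(seq)


for name, seq in (("B1", "CTAVLNVGLEWC"),
                  ("B2", "GLLAFYFYYLWF"),
                  ("B3", "PAEKWFVFWHPT")):
    print(f"{name} {seq}: aromatics={aromatic_count(seq)} "
          f"charge={net_charge(seq):+d} GRAVY={gravy(seq):+.2f}")
```

B2 comes out with a GRAVY above 1.0 and zero net charge, the kind of profile that warrants a second opinion from a dedicated aggregation predictor before trusting a solubility score.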

Advance recommendations

flowchart LR
    Workspace[12 candidate peptides] --> P[Primary: B3<br/>PAEKWFVFWHPT<br/>balanced, sub-µM, dimer-interface]
    Workspace --> Alt[Alternate: B1<br/>CTAVLNVGLEWC<br/>macrocycle hypothesis]
    Workspace --> A4V[A4V site: A5<br/>QGKCKFKQFNPV<br/>best Run A profile]
    style P fill:#cfe9d4
    style Alt fill:#fef3c7
    style A4V fill:#e0f2fe

Worked example 2: MS2 L-protein engineering

Why this target

Aspect | Detail
Protein | MS2 bacteriophage lysis protein, 75 aa
Sequence (UniProt P03609) | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT
Domain split | Soluble (1–40, DnaJ-interacting) + TM (41–75, pore-forming)
Engineering goal | Improve stability + auto-folding; break dependence on E. coli DnaJ chaperone
Therapeutic rationale | DnaJ-independent L-protein would overcome the most common E. coli resistance mechanism, expanding phage-therapy spectrum

Three pick strategies, head-to-head

We ran three strategies for picking 5 mutations each, then cross-scored against the Chamakura experimental lysis dataset (n=59 unique mutations with measured lysis 0/1).

flowchart TD
    Q[Pick mutations for engineering] --> R[1. Random mutagenesis<br/>option3 python script<br/>seed=42, 2-4 substitutions]
    Q --> L[2. LLR-informed only<br/>ESM saturation scan<br/>top 1% by LLR]
    Q --> E[3. Experiment-led<br/>filter for lysis=1<br/>rank by LLR within set]
    R --> RR[Mean LLR: -1.04<br/>One variant breaks initiator Met<br/>Lysis outcome: unknown for these]
    L --> LR[Mean LLR: +2.15<br/>All in top 1% of landscape<br/>~0 percent likely lysis preservation]
    E --> ER[Mean LLR: +0.18<br/>Lysis preserved: 100 percent<br/>by construction]
    style RR fill:#fef3c7
    style LR fill:#fecaca
    style ER fill:#cfe9d4

Strategy | Picks | Mean LLR | Lysis preservation
Random | various 2–4 mutation combos, incl. M1N (initiator break) | −1.04 | unknown for these
LLR-informed (model only) | C29S, Y39L, K50L, N53L, S9Q | +2.15 (top 1% of landscape) | ~0% likely; 4 of 5 picks sit at positions where neighbor experiments kill lysis (position 9 is untested)
Experiment-led | E25G, K23E, A45P, I46F, D26G | +0.18 (mediocre) | 100% by construction
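
Before scoring any pick, it helps to confirm that each mutation code actually matches the wild-type sequence and to enumerate the full single-substitution landscape. This is a minimal sketch; `parse_mutation` and the other names are illustrative, and the sequence is the UniProt P03609 entry quoted above.

```python
import re

WT = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKF"
      "TNQLLLSLLEAVIRTVTTLQQLLT")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_mutation(code, wild_type=WT):
    """Parse a code like 'K50L' and verify the wild-type residue (1-based)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"malformed mutation code: {code}")
    wt_res, pos, mut_res = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= pos <= len(wild_type):
        raise ValueError(f"{code}: position outside 1..{len(wild_type)}")
    if wild_type[pos - 1] != wt_res:
        raise ValueError(f"{code}: wild type has {wild_type[pos - 1]} at {pos}")
    return wt_res, pos, mut_res


# Every possible single substitution: 19 alternatives at each position.
saturation = [f"{wt_res}{pos}{aa}"
              for pos, wt_res in enumerate(WT, start=1)
              for aa in AMINO_ACIDS if aa != wt_res]

llr_picks = ["C29S", "Y39L", "K50L", "N53L", "S9Q"]
experiment_picks = ["E25G", "K23E", "A45P", "I46F", "D26G"]

for code in llr_picks + experiment_picks + ["M1N"]:
    print(code, parse_mutation(code))
print("length:", len(WT), " single substitutions:", len(saturation))
```

The same parser makes it easy to reject random variants that touch position 1, which is how the initiator-Met break noted in the table could be caught up front.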

The big finding

ESM-2 LLR and experimental lysis function have essentially zero correlation for L-protein:

Metric | Value
Pearson r (LLR vs lysis 0/1) | +0.007
AUC (discrimination) | 0.476 (below chance)
Mean LLR, lysis-preserving (n=19) | −0.371
Mean LLR, lysis-killing (n=40) | −0.389
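
For readers unfamiliar with AUC in this setting: it is the probability that a randomly chosen lysis-preserving mutation has a higher LLR than a randomly chosen lysis-killing one, so 0.5 is chance. The sketch below computes it by direct pair counting. The score lists are hypothetical toy values, not the Chamakura data.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy LLRs for illustration only.
llr_preserving = [0.5, -0.2, -1.0]
llr_killing = [0.4, -0.1, -0.9, 1.2]

toy_auc = auc(llr_preserving, llr_killing)
print(f"toy AUC = {toy_auc:.3f}")
```

An AUC at or slightly below 0.5, as reported in the table, means ranking mutations by LLR separates functional from non-functional variants no better than shuffling them.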

Statistical-power note: at n=59, the 95% CI on r is approximately ±0.26 (Fisher z). The data are consistent with anything from a small inverse correlation to a small positive correlation. The qualitative point (LLR is not informative enough for confident-pick selection on L-protein) holds regardless.
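
The ±0.26 figure comes from the Fisher z-transform, which can be checked in a few lines. This is a minimal sketch using the standard formula (standard error 1/sqrt(n − 3) in z space); the function name is illustrative.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r via Fisher's z."""
    z = math.atanh(r)                # transform r to z space
    se = 1.0 / math.sqrt(n - 3)      # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


lower, upper = fisher_ci(r=0.007, n=59)
print(f"r = +0.007, n = 59 -> 95% CI [{lower:+.3f}, {upper:+.3f}]")
```

Because r is so close to zero here, the interval is nearly symmetric; for correlations far from zero the back-transformed bounds become visibly asymmetric.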

Quartile breakdown (59 unique mutations sorted by LLR):

Quartile | LLR range | Lysis preserved | Rate
Q1 (top 25% LLR) | +0.33 to +2.40 | 3 / 14 | 21%
Q2 (middle-high) | −0.14 to +0.31 | 8 / 14 | 57% ← best
Q3 (middle-low) | −0.77 to −0.17 | 2 / 14 | 14% ← worst
Q4 (bottom 25%) | −5.26 to −0.79 | 6 / 17 | 35%

The top-LLR quartile preserves lysis in only 21% of cases, below the 32% overall rate (19/59) and far below Q2’s 57%; only Q3 (14%) is lower. Functionally essential residues are positions where the WT looks “unusual” to the model precisely because the unusual residue is doing necessary work. The model says “change it”; the experiment says “don’t.”
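
The per-quartile rates and the overall baseline can be recomputed from the counts in the table. This is a small sketch with illustrative variable names; the counts are copied from the quartile breakdown.

```python
# (lysis-preserving count, total mutations) per LLR quartile, from the table.
quartiles = {
    "Q1": (3, 14),
    "Q2": (8, 14),
    "Q3": (2, 14),
    "Q4": (6, 17),
}

rates = {q: kept / total for q, (kept, total) in quartiles.items()}
total_kept = sum(kept for kept, _ in quartiles.values())
total_n = sum(total for _, total in quartiles.values())
overall = total_kept / total_n

for q, rate in rates.items():
    marker = "above" if rate > overall else "below"
    print(f"{q}: {rate:.0%} ({marker} overall {overall:.0%})")

lowest = min(rates, key=rates.get)
highest = max(rates, key=rates.get)
print(f"lowest: {lowest}, highest: {highest}")
```

If LLR tracked function, the rates would fall steadily from Q1 to Q4; instead they bounce around the baseline with no monotonic trend.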

What our top LLR picks would have done

Cross-checking each LLR-informed pick against the experimental neighborhood:

LLR pick | LLR | Nearest experimental data | Result
K50L | +2.56 (#1 overall) | K50 → E, I, N, Q all tested | All 4 kill lysis
N53L | +1.87 | N53 → D, H, I, K, Q, S all tested | All 6 kill lysis
C29S | +2.04 | C29 → R tested | Kills lysis
Y39L | +2.24 | Y39 → H tested | Kills lysis
S9Q | +2.01 | no data at position 9 | unknown

The model’s highest-confidence picks land on residues whose unusual identity is doing functional work.
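
The cross-check in the table amounts to a same-position lookup against the experimental set. The sketch below encodes only the neighbor outcomes listed above (it is not the full 59-mutation dataset), and the dictionary layout and function name are illustrative.

```python
# position -> {substituted residue: lysis preserved?}, from the table above.
experimental_lysis = {
    50: {"E": False, "I": False, "N": False, "Q": False},
    53: {"D": False, "H": False, "I": False, "K": False,
         "Q": False, "S": False},
    29: {"R": False},
    39: {"H": False},
}

llr_picks = ["K50L", "N53L", "C29S", "Y39L", "S9Q"]


def neighbor_verdict(pick):
    """Summarize what other substitutions at the same position did to lysis."""
    position = int(pick[1:-1])  # 'K50L' -> 50
    outcomes = experimental_lysis.get(position)
    if outcomes is None:
        return "unknown (no data at this position)"
    killed = sum(1 for preserved in outcomes.values() if not preserved)
    return f"{killed} of {len(outcomes)} tested substitutions kill lysis"


for pick in llr_picks:
    print(f"{pick}: {neighbor_verdict(pick)}")
```

Neighbor evidence is indirect, since the exact pick was not tested, but a position where every tested substitution abolishes function is a poor place to spend a design slot.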

The meta-lesson

Unsupervised protein language models predict sequence plausibility, not biological function. For under-represented protein families (phage proteins, membrane proteins, anything outside UniRef50’s bulk training distribution), they can be actively misleading, and they mislead most at the top of their confidence distribution. Use them as one signal among many, weight experimental data heavily when available, and never trust the top picks blindly on an unfamiliar target.

Why ESM-2 fails on L-protein specifically

  • Phage protein: under-represented in UniRef50
  • Membrane-active: PLMs are notoriously weak on membrane proteins
  • Many WT residues are “unusual” by general-protein statistics precisely because they’re doing functional work (single free Cys at 29, Lys mid-TM at 50, Arg-rich N-terminus)
  • The measured function (E. coli lysis via DnaJ chaperone + membrane insertion) requires correct folding and downstream context the model can’t see

Pitfalls cheat sheet

Pitfall | Where it bit us | How to diagnose | How to fix
PepMLM mode collapse on under-represented target | Stage 1: every pair of peptides shared 5–9 of 12 positions | Hamming-distance histogram across the generated set | Higher top_k or temperature; generate more, pick diverse
X ambiguity-code in PepMLM output | 3 of 4 SOD1 peptides ended in X | Look for X in the sequence string | Substitute (G is the safest default); document transparently
AF3 default-site convergence | All 5 SOD1 peptides docked at same wrong patch | Multiple input peptides converging to same predicted contacts | Site-targeted re-generation via moPPIt
Short-peptide ipTM bias | 12-mer ipTMs all 0.25–0.49, below universal threshold | ipTM significantly lower than benchmark for known binders | Calibrate against benchmark; consider ipSAE (Dunbrack 2025)
Cross-model disagreement | PepMLM, AF3, PeptiVerse rank candidates differently | Rank-correlation across model outputs | Multi-model ensembling; cross-validate against wet-lab when possible
Language model ≠ function on under-represented family | L-protein r = +0.007 with lysis | Compute LLR-vs-experiment correlation; quartile preservation rates | Use experimental data as primary filter; LLR as tiebreaker
PeptiVerse out-of-distribution for extreme compositions | B2 with 7 aromatics scored as soluble | GRAVY > 1.0 or other extreme physicochemical profile | Cross-check with Tango/Zyggregator/AGGRESCAN for aggregation

Paper | DOI / PMC ID | Why
PepMLM: Chen, T., Quinn, Z., Dumas, M., Vincoff, S., Chatterjee, P., et al. Target sequence-conditioned design of peptide binders using masked language modeling. Nat Biotechnol 2025 | 10.1038/s41587-025-02761-2 | The tool used in Stage 1
AlphaFold3: Abramson, J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024 | 10.1038/s41586-024-07487-w | The tool used in Stage 2
MS2 L-protein DnaJ dependence: Chamakura, K. R., et al. Viral protein antibiotic inhibits lipid II flippase activity. J Bacteriol 2017 | PMC5446614 | The mechanistic basis for Part C; companion paper PMC5775895 is the L-Protein Mutants experimental dataset
SOD1 dimer destabilization: Broom, H. R., et al. Destabilization of the dimer interface is a common consequence of diverse ALS-associated mutations in metal-free SOD1. Protein Science 2015 | 10.1002/pro.2803 | Why SOD1 dimer interface is the relevant therapeutic surface

Course resources

Resource | URL | Notes
PepMLM-650M model card | huggingface.co/ChatterjeeLab/PepMLM-650M | Linked Colab + HF Space demo. License MIT.
AlphaFold Server | alphafoldserver.com | Free AF3 web interface. Limitation: doesn’t support intramolecular polymer covalent bonds; use Boltz-2 for cyclized peptide validation.
PeptiVerse Space | huggingface.co/spaces/ChatterjeeLab/PeptiVerse | Batch input supported: one peptide per line, target pasted once.
moPPIt model card | huggingface.co/ChatterjeeLab/moPPIt | Colab requires GPU runtime. Motif positions accept range or comma syntax.
ESM saturation Colab (Part C) | colab.research.google.com | Produces 1,425-substitution CSV for a 75-mer.