Week 5 Review: Protein Design Part II
Week 5: Protein Design II
AI-driven peptide and protein engineering, worked end-to-end on two targets.
TL;DR
- Tool stack for peptide design: PepMLM (generate) → AlphaFold3 (validate) → PeptiVerse (triage) → moPPIt (re-target). Each tool catches a failure the others miss.
- Target 1: SOD1-A4V (ALS). PepMLM alone produces mode-collapsed peptides that all dock at the same wrong default surface in AF3. moPPIt with motif guidance produces target-aware chemistry. Advance: B3 PAEKWFVFWHPT (sub-µM predicted Kd, dimer-interface targeted).
- Target 2: MS2 L-protein. ESM-style saturation scan vs random vs experiment-led picks. Big finding: language-model preference and experimental lysis function are essentially uncorrelated (r = +0.007). The model’s top picks would have destroyed function.
- Meta-lesson: Unsupervised protein language models predict sequence plausibility, not function. On under-represented protein families they can be actively misleading.
Course: HTGAA Spring 2026 · Lecture (Mar 3): Gabriele Corso, Pranam Chatterjee, Protein Design Part II · Author: Fiona C (Committed Listener BioPunk Node)
The tool stack at a glance
flowchart LR
T[Target sequence] --> P[PepMLM<br/>generate plausible binders<br/>perplexity score]
P --> A[AlphaFold3<br/>co-fold target+peptide<br/>ipTM, PAE, pLDDT, pose]
A --> V[PeptiVerse<br/>developability triage<br/>solubility, hemolysis, Kd]
V --> D{Site OK?<br/>Developable?}
D -- no, redirect --> M[moPPIt<br/>site-targeted re-generation<br/>multi-objective guided]
M --> A
D -- yes --> L[Lead candidate<br/>for wet-lab]
style P fill:#e0f2fe
style A fill:#fef3c7
style V fill:#cfe9d4
style M fill:#e9d5ff

| Tool | Purpose | Output |
|---|---|---|
| PepMLM (ChatterjeeLab, ESM-2 fine-tune) | Generate plausible binders from target sequence | 12-mer peptides + pseudo-perplexity |
| AlphaFold3 (AlphaFold Server) | Predict protein-peptide complex structure | ipTM, pTM, PAE matrix, per-residue pLDDT, 5-model ensemble |
| PeptiVerse (ChatterjeeLab Space) | Therapeutic-property classification | pKd, solubility, hemolysis, MW, charge, GRAVY |
| moPPIt + MOG-DFM (ChatterjeeLab) | Site-targeted multi-objective peptide generation | Top-N from ~100 sampled trajectories per run |
| ESM saturation scan (ESM-2) | Predict effect of every possible single mutation | LLR score for each (position, mutation) pair |
Concept sidebars
Peptide vs small molecule: they’re different modalities, not just different sizes
A peptide can sit in the same MW range as a small-molecule drug, but in drug discovery the two are separate modality classes.
| | Small molecule | Peptide |
|---|---|---|
| Built from | Arbitrary organic synthesis | Polymerised α-amino acids |
| Lipinski compliance | Typically yes (≤500 Da) | Typically no (multiple violations) |
| Interface area buried | ~300–500 Å² | ~800–2000 Å² |
| Target topology | Deep hydrophobic pocket | Flat / shallow PPI surface |
| Oral bioavailability | Usually yes | Usually no (gut proteases) |
| CNS penetration | Often possible | Hard without engineering |
| Example this week | JQ1 / BRD4 (457 Da, Part B) | FLYRWLPSRRGG / SOD1 (1506 Da, Part A) |
Rule of thumb: target topology dictates modality. Deep pocket → small molecule. Flat surface → peptide. Both modalities appear in Week 5 for exactly this reason.
Perplexity in one worked example
Perplexity = exponentiated mean negative log-likelihood. Read as: the effective number of equally-likely choices the model is hedging between at each position.
For a toy 5-mer PEPTI with per-position probabilities P=0.20, E=0.15, P=0.25, T=0.10, I=0.05:
| Step | Calc | Value |
|---|---|---|
| Joint likelihood | 0.20 × 0.15 × 0.25 × 0.10 × 0.05 | 3.75 × 10⁻⁵ |
| Sum of log-likelihoods | ln(p₁) + … + ln(p₅) | −10.19 |
| Per-residue mean NLL | 10.19 / 5 | 2.04 |
| Perplexity | exp(2.04) | 7.7 |
So the model is hedging across the equivalent of ~7.7 amino acids per position. Versus the random baseline of 20, that’s real information; versus a strong binder (PPL 2–3), it’s mediocre.
Watch: ESM-2 / PepMLM use base-e. Some NLP literature uses base-2 (“bits per character”). Don’t mix.
Masked LMs like PepMLM report pseudo-perplexity: mask each position one at a time and predict the actual residue from the rest. Interpretation is the same.
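The worked PEPTI example can be reproduced in a few lines. This is a minimal sketch using only the toy probabilities from the table; the function name `perplexity` is illustrative and is not part of PepMLM or ESM-2.

```python
import math


def perplexity(probs):
    """Exponentiated mean negative log-likelihood, using natural logs (base e)."""
    nll = [-math.log(p) for p in probs]
    return math.exp(sum(nll) / len(nll))


# Toy per-position probabilities for the 5-mer PEPTI (P, E, P, T, I).
pepti_probs = [0.20, 0.15, 0.25, 0.10, 0.05]

joint = math.prod(pepti_probs)                    # joint likelihood
mean_nll = -math.log(joint) / len(pepti_probs)    # per-residue mean NLL
ppl = perplexity(pepti_probs)

# Reference point: uniform guessing over 20 amino acids.
uniform_ppl = perplexity([1 / 20] * len(pepti_probs))

print(f"joint={joint:.2e}  mean NLL={mean_nll:.2f}  PPL={ppl:.1f}  uniform PPL={uniform_ppl:.1f}")
```

Swapping `math.log` for `math.log2` (and `math.exp` for `2 **`) gives the same perplexity; the base only matters when the mean NLL itself is reported.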
AF3 metrics for protein-peptide complexes
| Metric | What it measures | Threshold |
|---|---|---|
| pLDDT (per-residue) | Local backbone confidence (0–100) | >90 confident · 70–90 OK · <50 disordered |
| pTM (global) | Whole-structure accuracy (0–1) | >0.5 fold correct |
| ipTM (interface) | Interface accuracy (0–1) | >0.8 high · 0.6–0.8 grey · <0.6 likely wrong |
| PAE (pairwise) | Expected error in Å between residue pairs | <5 Å between interface residues = confident pose |
Critical short-peptide caveat. Standard ipTM cutoffs were calibrated against protein-protein complexes. For a 12-mer vs a 154-aa target, ipTM is systematically biased downward (Stein & Dunbrack, bioRxiv 2025; the ipSAE analysis). A 12-mer with ipTM 0.5 is not auto-junk. Use the literature benchmark’s ipTM as the calibration anchor, not a universal threshold.
moPPIt vs PepMLM
PepMLM samples binders plausible against the target as a whole. It can’t be told where to bind. moPPIt’s Multi-Objective Guided Discrete Flow Matching (MOG-DFM) adds:
- Motif guidance: specify which target residues to engage.
- Multi-objective optimization: affinity, solubility, and motif specificity are optimized simultaneously during sampling (not as a post-hoc filter).
- Pareto-front output: multiple candidates at different trade-off points, not a single “best.”
Worked example 1: SOD1-A4V peptide therapeutics
Why this target
| Detail | |
|---|---|
| Disease | Familial ALS (most aggressive SOD1 variant) |
| Mutation | Ala → Val at residue 4 (mature numbering); position 5 in UniProt P00441 |
| Survival from symptom onset | ~1.4 years for A4V vs 3–5 yr for ALS overall |
| Mechanism | Toxic gain-of-function: A4V destabilizes the monomer and the dimer interface and drives aggregation. Not loss of dismutase activity. |
| Therapeutic surfaces | A4V site (β1) or dimer interface (~residues 51–54 + 114–116) |
| Why peptide modality | Surface is flat + shallow; ~80% nonpolar dimer interface, so no deep pocket for small molecules |
flowchart TD
WT[Native SOD1 homodimer<br/>Cu/Zn bound, Tm > 90°C] --> M[A4V monomer<br/>destabilized N-terminus]
M --> R[Disulfide-reduced<br/>demetalated apo monomer]
R --> O[Misfolded oligomers / trimers]
O --> F[Insoluble fibrils]
F --> D[Motor neuron death]
style WT fill:#cfe9d4
style F fill:#fecaca
style D fill:#fecaca

A useful binder must engage the A4V site itself or the dimer interface. Anything else is therapeutically irrelevant, no matter how good the prediction looks.
Stage 1: PepMLM generation
Default sampling on SOD1-A4V produced four 12-mer peptides plus the literature benchmark.
| # | Sequence | PPL | Issue |
|---|---|---|---|
| 1 | WRYGVYAVAH**KX** | 10.72 | X ambiguity code at position 12 |
| 2 | WHYYAYAAAH**KX** | 10.70 | X ambiguity code at position 12 |
| 3 | WHYPAAAVRL**WX** | 12.76 | X ambiguity code at position 12 |
| 4 | WHYGAAAVRL**KE** | 11.76 | clean |
| Benchmark | FLYRWLPSRRGG | ~20.64 (prior student run, proxy) | clean |
Three failure modes immediately:
Pitfall 1: Mode collapse. All four peptides share residues at positions 1 (W), 3 (Y), 7 (A), and 11 (K, in three of four). Peptides 3 and 4 differ at only 3 of 12 positions. ESM-2’s training distribution under-represents this target; the model defaults to a generic aromatic-cationic anchor pattern.
Pitfall 2: X ambiguity codes. X is the IUPAC “any/unknown amino acid” code, not a real residue. ESM-2’s tokenizer carries X; when probability mass is spread, X wins the argmax. For downstream use, X was substituted with G (matching the benchmark’s GG terminus), and the substitution is flagged transparently.
Pitfall 3: Higher-than-textbook PPLs. Textbook “PPL ~ 2–5 = good binder” doesn’t apply here. The whole distribution sits at 10–20 for this target. Read PPL relative to benchmark, not against universal thresholds.
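Mode collapse and stray X tokens are both easy to check programmatically. The sketch below uses only the four Stage 1 sequences; the helper name `hamming` and the variable names are illustrative, not part of any PepMLM tooling.

```python
from itertools import combinations

# Raw Stage 1 PepMLM outputs, before any X -> G substitution.
peptides = {
    "P1": "WRYGVYAVAHKX",
    "P2": "WHYYAYAAAHKX",
    "P3": "WHYPAAAVRLWX",
    "P4": "WHYGAAAVRLKE",
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Pairwise Hamming distances: small values across the set suggest mode collapse.
pairwise = {
    (i, j): hamming(peptides[i], peptides[j])
    for i, j in combinations(peptides, 2)
}

# 1-based positions where every peptide carries the same residue.
conserved = [
    pos
    for pos, column in enumerate(zip(*peptides.values()), start=1)
    if len(set(column)) == 1
]

# Peptides that contain the X ambiguity code and need manual handling.
has_x = [name for name, seq in peptides.items() if "X" in seq]

for pair, dist in pairwise.items():
    print(pair, dist)
print("fully conserved positions:", conserved)
print("contain X:", has_x)
```

A histogram of the `pairwise` values is the diagnostic suggested in the cheat sheet; with only six pairs a printed list is enough.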
Stage 2: AlphaFold3 validation
Five jobs on alphafoldserver.com (chain A = SOD1-A4V, chain B = peptide). Top model per job:
| Peptide | ipTM (top) | ipTM range | Peptide pLDDT | Interface PAE median (Å) | Top contact site | A4V engaged? |
|---|---|---|---|---|---|---|
| P1 WRYGVYAVAHKG | 0.49 | 0.29–0.49 | 45.9 | 9.70 | Residues 138, 142, 144 | No |
| P2 WHYYAYAAAHKG | 0.40 | 0.19–0.40 | 51.4 | 10.30 | Residues 138, 142, 144 | No |
| P3 WHYPAAAVRLWG | 0.28 | 0.18–0.28 | 41.6 | 15.45 | Residues 138, 142, 143 | No |
| P4 WHYGAAAVRLKE | 0.35 | 0.21–0.35 | 42.8 | 13.70 | Residues 138, 63, 137 | No |
| Benchmark | 0.25 | 0.18–0.25 | 38.3 | 16.20 | Residues 138, 142, 143 | No |
flowchart LR
All[All 5 peptides<br/>including benchmark] --> Site[Top contacts:<br/>residues 138, 142, 143, 144<br/>= electrostatic loop, back face]
All --> A4V[A4V site, residue 5<br/>max contact prob 0.00 - 0.01<br/>min PAE 11.7 - 14.8 Å]
style Site fill:#fef3c7
style A4V fill:#fecaca

Pitfall 4: AF3 default-site convergence. When AF3 can’t find a strong anchor, it deposits poorly-anchored peptides on whichever surface is geometrically convenient. For SOD1-A4V that’s the C-terminal electrostatic loop. Five different peptides converging there, including the literature benchmark, means AF3 doesn’t have a confident pose for any of them and is showing us its default surface.
Stage 3: PeptiVerse developability
All five passed developability (soluble at 1.000 probability, hemolysis < 0.05). But the binding-affinity ranking disagrees with AF3:
| Peptide | PepMLM PPL rank | AF3 ipTM rank | PeptiVerse pKd | Approx Kd |
|---|---|---|---|---|
| P1 WRYGVYAVAHKG | 2 | 1 | 5.483 | ~3.3 µM |
| P2 WHYYAYAAAHKG | 1 | 2 | 5.496 | ~3.2 µM |
| P3 WHYPAAAVRLWG | 4 | 4 | 5.995 | ~1.0 µM |
| P4 WHYGAAAVRLKE | 3 | 3 | 5.457 | ~3.5 µM |
| Benchmark | 5 | 5 | 5.555 | ~2.8 µM |
Pitfall 5: Cross-model disagreement. PepMLM says P2 best. AF3 says P1 best. PeptiVerse says P3 best. The “best” peptide depends on which model you trust most. The Spearman correlation between AF3 ipTM and PeptiVerse pKd across the four generated peptides is negative (ρ = −0.4), though with n = 4 that is suggestive at most.
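Both the pKd-to-Kd conversion and the rank correlation between models can be checked by hand. The sketch below uses the ipTM and pKd values from the Stage 2 and Stage 3 tables; the Spearman helper is a plain-Python implementation that assumes no tied values (true for these four peptides), and all function names are illustrative.

```python
def kd_micromolar(pkd):
    """Convert pKd (= -log10 of Kd in molar) to Kd in micromolar."""
    return 10 ** (-pkd) * 1e6


def ranks(values):
    """1-based ascending ranks; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman(x, y):
    """Spearman rho via the rank-difference formula (valid when there are no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


names = ["P1", "P2", "P3", "P4"]
iptm = [0.49, 0.40, 0.28, 0.35]      # AF3 top-model ipTM
pkd = [5.483, 5.496, 5.995, 5.457]   # PeptiVerse predicted pKd

for name, value in zip(names, pkd):
    print(f"{name}: pKd {value:.3f} -> Kd ~{kd_micromolar(value):.1f} uM")

rho = spearman(iptm, pkd)
print(f"Spearman rho (ipTM vs pKd, n=4): {rho:+.2f}")
```

With real tied scores, a tie-aware implementation such as `scipy.stats.spearmanr` would be the safer choice.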
Stage 4: moPPIt site-targeted re-generation
Two parallel runs with explicit motif guidance:
| Run | Motif positions | Strategy | Mean predicted Kd (from mean pKd) |
|---|---|---|---|
| A | UniProt 2–10 (A4V cluster) | Engage the destabilization site directly | ~1.4 µM |
| B | UniProt 51–54 + 114–116 (dimer interface) | Lock the native dimer | ~140 nM |
Run A: A4V cluster (5 samples):
| # | Sequence | pKd | Motif | Charge | Cys | Aromatics |
|---|---|---|---|---|---|---|
| A1 | CTSGVNVGPGGP | 6.086 | 0.571 | 0 | C@1 | none |
| A2 | ADSENCAPSSVH | 5.888 | 0.552 | −2 | C@6 | none |
| A3 | PSEKFCVKKHTT | 5.853 | 0.652 | +2 | C@6 | F |
| A4 | MFAGIKNKEQQT | 5.455 | 0.743 | +1 | none | F |
| A5 | QGKCKFKQFNPV | 5.957 | 0.805 | +3 | C@4 | 2F |
Run B: Dimer interface (3 samples):
| # | Sequence | pKd | Motif | Charge | Aromatics | Modality |
|---|---|---|---|---|---|---|
| B1 | CTAVLNVGLEWC | 6.393 | 0.827 | −1 | 1W | Flanking Cys (possible macrocycle) |
| B2 | GLLAFYFYYLWF | 7.720 (~19 nM) | 0.831 | 0 | 7 | Extreme hydrophobic (developability flag) |
| B3 | PAEKWFVFWHPT | 6.480 | 0.771 | ~0 | 4 | Balanced |
Key finding: target-aware chemistry. Run A (mixed basic/hydrophobic target) is compositionally diverse: charges span −2 to +3, frequent Cys, low aromatic content. Run B (flat hydrophobic interface, ~80% nonpolar in literature) is aromatic-rich and electrostatically neutral, with one candidate (B1) showing flanking Cys consistent with a macrocyclic design (a hypothesis, not validated; see caveat below). moPPIt’s guidance produces chemistry appropriate to target-surface biology in a way PepMLM’s unconditional sampling cannot.
Caveat on B1. Flanking Cys at positions 1 and 12 could form an intramolecular disulfide and fold into a macrocycle. But it could also be coincidental (n=3 samples), a classifier-learned pattern without intentional macrocyclization, or a sampling accident. Validating the macrocyclic form requires Boltz-2 or RoseTTAFold All-Atom (not AF3: the AlphaFold Server doesn’t support intramolecular peptide disulfides).
Caveat on B2. Predicted pKd 7.72 is the workspace headline number, but 7 of 12 residues are aromatic. PeptiVerse’s solubility classifier (1.000) is almost certainly out-of-distribution for this composition. In wet-lab practice, peptides this hydrophobic tend not to dissolve; they self-aggregate and bind non-specifically. B2 is a teaching example of why predicted Kd is not the only metric.
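A quick composition profile makes the B2 red flag visible without any model. The sketch below is an independent sanity check, not PeptiVerse’s method. It assumes the standard Kyte-Doolittle hydropathy scale for GRAVY, counts only F, W and Y as aromatic, and uses a crude net charge (K/R = +1, D/E = −1, with His and the termini ignored). The function name `profile` is illustrative.

```python
# Kyte-Doolittle hydropathy values (standard scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")  # assumption: His is not counted as aromatic


def profile(seq):
    """Crude physicochemical profile of a peptide sequence."""
    return {
        "net_charge": sum(seq.count(r) for r in "KR") - sum(seq.count(r) for r in "DE"),
        "aromatics": sum(1 for r in seq if r in AROMATIC),
        "gravy": sum(KYTE_DOOLITTLE[r] for r in seq) / len(seq),
        "cys_positions": [i for i, r in enumerate(seq, start=1) if r == "C"],
    }


run_b = {"B1": "CTAVLNVGLEWC", "B2": "GLLAFYFYYLWF", "B3": "PAEKWFVFWHPT"}
profiles = {name: profile(seq) for name, seq in run_b.items()}

for name, p in profiles.items():
    flag = "  <- check aggregation" if p["gravy"] > 1.0 else ""
    print(f"{name}: charge {p['net_charge']:+d}, aromatics {p['aromatics']}, "
          f"GRAVY {p['gravy']:.2f}, Cys at {p['cys_positions']}{flag}")
```

The GRAVY > 1.0 cutoff here is the rule of thumb from the pitfalls cheat sheet, not a validated solubility threshold.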
Advance recommendations
flowchart LR
Workspace[12 candidate peptides] --> P[Primary: B3<br/>PAEKWFVFWHPT<br/>balanced, sub-µM, dimer-interface]
Workspace --> Alt[Alternate: B1<br/>CTAVLNVGLEWC<br/>macrocycle hypothesis]
Workspace --> A4V[A4V site: A5<br/>QGKCKFKQFNPV<br/>best Run A profile]
style P fill:#cfe9d4
style Alt fill:#fef3c7
style A4V fill:#e0f2fe

Worked example 2: MS2 L-protein engineering
Why this target
| Detail | |
|---|---|
| Protein | MS2 bacteriophage lysis protein, 75 aa |
| Sequence (UniProt P03609) | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| Domain split | Soluble (1–40, DnaJ-interacting) + TM (41–75, pore-forming) |
| Engineering goal | Improve stability + auto-folding; break dependence on E. coli DnaJ chaperone |
| Therapeutic rationale | DnaJ-independent L-protein would overcome the most common E. coli resistance mechanism, expanding phage-therapy spectrum |
Three pick strategies, head-to-head
We ran three strategies for picking 5 mutations each, then cross-scored against the Chamakura experimental lysis dataset (n=59 unique mutations with measured lysis 0/1).
flowchart TD
Q[Pick mutations for engineering] --> R[1. Random mutagenesis<br/>option3 python script<br/>seed=42, 2-4 substitutions]
Q --> L[2. LLR-informed only<br/>ESM saturation scan<br/>top 1% by LLR]
Q --> E[3. Experiment-led<br/>filter for lysis=1<br/>rank by LLR within set]
R --> RR[Mean LLR: -1.04<br/>One variant breaks initiator Met<br/>Lysis outcome: unknown for these]
L --> LR[Mean LLR: +2.15<br/>All in top 1% of landscape<br/>~0 percent likely lysis preservation]
E --> ER[Mean LLR: +0.18<br/>Lysis preserved: 100 percent<br/>by construction]
style RR fill:#fef3c7
style LR fill:#fecaca
style ER fill:#cfe9d4

| Strategy | Picks | Mean LLR | Lysis preservation |
|---|---|---|---|
| Random | various 2–4 mutation combos, incl. M1N (initiator break) | −1.04 | unknown for these |
| LLR-informed (model only) | C29S, Y39L, K50L, N53L, S9Q | +2.15 (top 1% of landscape) | ~0% likely: 4 of 5 picks sit at positions where every tested substitution kills lysis (S9Q has no data) |
| Experiment-led | E25G, K23E, A45P, I46F, D26G | +0.18 (mediocre) | 100% by construction |
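Before scoring anything, it helps to confirm that each pick is a valid substitution against the wild-type sequence and to enumerate the full single-mutant landscape the saturation scan covers. The sketch below does only that bookkeeping; it does not call ESM-2, and the helper names `parse_mutation` and `saturation_set` are illustrative.

```python
# Wild-type MS2 L-protein (UniProt P03609), 75 aa, as listed in the target table.
WT = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_mutation(label, wt=WT):
    """Parse a label like 'K50L' into (position, wt_residue, new_residue).

    Raises ValueError if the stated wild-type residue does not match the sequence.
    """
    wt_res, pos, new_res = label[0], int(label[1:-1]), label[-1]
    if not 1 <= pos <= len(wt) or wt[pos - 1] != wt_res:
        raise ValueError(f"{label}: wild-type residue mismatch at position {pos}")
    return pos, wt_res, new_res


def saturation_set(wt=WT):
    """Every single substitution: 19 alternatives at each position."""
    return [
        f"{res}{pos}{alt}"
        for pos, res in enumerate(wt, start=1)
        for alt in AMINO_ACIDS
        if alt != res
    ]


picks = {
    "LLR-informed": ["C29S", "Y39L", "K50L", "N53L", "S9Q"],
    "Experiment-led": ["E25G", "K23E", "A45P", "I46F", "D26G"],
}
parsed = {
    strategy: [parse_mutation(label) for label in labels]
    for strategy, labels in picks.items()
}

all_single = saturation_set()
print(len(WT), "residues ->", len(all_single), "single substitutions")
```

Each entry of `all_single` would then receive one LLR score from the language model, which is where a per-mutation CSV comes from.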
The big finding
ESM-2 LLR and experimental lysis function have essentially zero correlation for L-protein:
| Metric | Value |
|---|---|
| Pearson r (LLR vs lysis 0/1) | +0.007 |
| AUC (discrimination) | 0.476 (below chance) |
| Mean LLR, lysis-preserving (n=19) | −0.371 |
| Mean LLR, lysis-killing (n=40) | −0.389 |
Statistical-power note: at n=59, the 95% CI on r is approximately ±0.26 (Fisher z). The data are consistent with anywhere from a small inverse correlation to a small positive correlation. The qualitative point (LLR is not informative enough for confident-pick selection on L-protein) holds regardless.
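The confidence interval in the power note follows from the Fisher z-transform. A minimal sketch, assuming the usual normal approximation with standard error 1/sqrt(n − 3); the function name `fisher_ci` is illustrative.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)                 # transform r to z-space
    se = 1.0 / math.sqrt(n - 3)       # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


ci_low, ci_high = fisher_ci(r=0.007, n=59)
print(f"r = +0.007, n = 59 -> 95% CI [{ci_low:+.2f}, {ci_high:+.2f}]")
```

Because lysis is a 0/1 outcome, this Pearson r is a point-biserial correlation, so the interval is a rough guide rather than an exact bound.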
Quartile breakdown (59 unique mutations sorted by LLR):
| Quartile | LLR range | Lysis preserved | Rate |
|---|---|---|---|
| Q1 (top 25% LLR) | +0.33 to +2.40 | 3 / 14 | 21% |
| Q2 (middle-high) | −0.14 to +0.31 | 8 / 14 | 57% (best) |
| Q3 (middle-low) | −0.77 to −0.17 | 2 / 14 | 14% (worst) |
| Q4 (bottom 25%) | −5.26 to −0.79 | 6 / 17 | 35% |
The top-LLR quartile preserves lysis in only 21% of cases, below the 32% overall rate (19 / 59) and below the bottom quartile (35%); only Q3 is lower. Preservation does not rise with LLR. Functionally essential residues are positions where the WT looks “unusual” to the model precisely because the unusual residue is doing necessary work. The model says “change it”; the experiment says “don’t.”
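The quartile rates can be recomputed directly from the counts in the table, which also gives the overall base rate to compare each quartile against. This sketch uses only the tabulated counts; the variable names are illustrative.

```python
# (lysis-preserving, total) per LLR quartile, from the quartile table.
quartiles = {
    "Q1": (3, 14),
    "Q2": (8, 14),
    "Q3": (2, 14),
    "Q4": (6, 17),
}

rates = {q: kept / total for q, (kept, total) in quartiles.items()}
total_kept = sum(kept for kept, _ in quartiles.values())
total_n = sum(total for _, total in quartiles.values())
overall = total_kept / total_n

for q, rate in rates.items():
    direction = "below" if rate < overall else "above"
    print(f"{q}: {rate:.0%} ({direction} the overall {overall:.0%})")

# A model that tracked function would show rates falling from Q1 to Q4.
monotonic = all(rates[a] >= rates[b] for a, b in [("Q1", "Q2"), ("Q2", "Q3"), ("Q3", "Q4")])
print("rates decrease monotonically with LLR rank:", monotonic)
```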
What our top LLR picks would have done
Cross-checking each LLR-informed pick against the experimental neighborhood:
| LLR pick | LLR | Nearest experimental data | Result |
|---|---|---|---|
| K50L | +2.56 (#1 overall) | K50 → E, I, N, Q all tested | All 4 kill lysis |
| N53L | +1.87 | N53 → D, H, I, K, Q, S all tested | All 6 kill lysis |
| C29S | +2.04 | C29 → R tested | Kills lysis |
| Y39L | +2.24 | Y39 → H tested | Kills lysis |
| S9Q | +2.01 | no data at position 9 | unknown |
The model’s highest-confidence picks land on residues whose unusual identity is doing functional work.
The meta-lesson
Unsupervised protein language models predict sequence plausibility, not biological function. For under-represented protein families (phage proteins, membrane proteins, anything outside UniRef50’s bulk training distribution), they can be actively misleading, and the top of their confidence distribution is where they mislead most. Use them as one signal among many, weight experimental data heavily when available, and never trust the top picks blindly on an unfamiliar target.
Why ESM-2 fails on L-protein specifically
- Phage protein: under-represented in UniRef50
- Membrane-active: PLMs are notoriously weak on membrane proteins
- Many WT residues are “unusual” by general-protein statistics precisely because they’re doing functional work (single free Cys at 29, Lys mid-TM at 50, Arg-rich N-terminus)
- The measured function (E. coli lysis via DnaJ chaperone + membrane insertion) requires correct folding and downstream context the model can’t see
Pitfalls cheat sheet
| Pitfall | Where it bit us | How to diagnose | How to fix |
|---|---|---|---|
| PepMLM mode collapse on under-represented target | Stage 1: every pair of generated peptides shared 5–9 of 12 positions | Hamming-distance histogram across the generated set | Higher top_k or temperature; generate more, pick diverse |
| X ambiguity code in PepMLM output | 3 of 4 SOD1 peptides ended in X | Look for X in the sequence string | Substitute (G is the safest default); document transparently |
| AF3 default-site convergence | All 5 SOD1 peptides docked at same wrong patch | Multiple input peptides converging to same predicted contacts | Site-targeted re-generation via moPPIt |
| Short-peptide ipTM bias | 12-mer ipTMs all 0.25–0.49, below universal threshold | ipTM significantly lower than benchmark for known binders | Calibrate against benchmark; consider ipSAE (Dunbrack 2025) |
| Cross-model disagreement | PepMLM, AF3, PeptiVerse rank candidates differently | Rank-correlation across model outputs | Multi-model ensembling; cross-validate against wet-lab when possible |
| LLM ≠ function on under-represented family | L-protein r = +0.007 with lysis | Compute LLR-vs-experiment correlation; quartile preservation rates | Use experimental data as primary filter; LLR as tiebreaker |
| PeptiVerse out-of-distribution for extreme compositions | B2 with 7 aromatics scored as soluble | GRAVY > 1.0 or other extreme physicochemical profile | Cross-check with Tango/Zyggregator/AGGRESCAN for aggregation |
Recommended reading
| Paper | DOI | Why |
|---|---|---|
| PepMLM: Chen, T., Quinn, Z., Dumas, M., Vincoff, S., Chatterjee, P., et al. Target sequence-conditioned design of peptide binders using masked language modeling. Nat Biotechnol 2025 | 10.1038/s41587-025-02761-2 | The tool used in Stage 1 |
| AlphaFold3: Abramson, J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024 | 10.1038/s41586-024-07487-w | The tool used in Stage 2 |
| MS2 L-protein DnaJ dependence: Chamakura, K. R., et al. Viral protein antibiotic inhibits lipid II flippase activity. J Bacteriol 2017 | PMC5446614 | The mechanistic basis for Part C; companion paper PMC5775895 is the L-Protein Mutants experimental dataset |
| SOD1 dimer destabilization: Broom, H. R., et al. Destabilization of the dimer interface is a common consequence of diverse ALS-associated mutations in metal-free SOD1. Protein Science 2015 | 10.1002/pro.2803 | Why SOD1 dimer interface is the relevant therapeutic surface |
Course resources
| Resource | URL | Notes |
|---|---|---|
| PepMLM-650M model card | huggingface.co/ChatterjeeLab/PepMLM-650M | Linked Colab + HF Space demo. License MIT. |
| AlphaFold Server | alphafoldserver.com | Free AF3 web interface. Limitation: doesn’t support intramolecular polymer covalent bonds; use Boltz-2 for cyclized peptide validation. |
| PeptiVerse Space | huggingface.co/spaces/ChatterjeeLab/PeptiVerse | Batch input supported โ one peptide per line, target pasted once. |
| moPPIt model card | huggingface.co/ChatterjeeLab/moPPIt | Colab requires GPU runtime. Motif positions accept range or comma syntax. |
| ESM saturation Colab (Part C) | colab.research.google.com | Produces 1,425-substitution CSV for a 75-mer. |