Week 5 HW: Protein Design Part II
🧬 Week 5: Protein Design Part II
Part A: SOD1 Binder Peptide Design
Part 1: Generate Binders with PepMLM
Step 1: SOD1 A4V Mutant Sequence
The human SOD1 sequence was retrieved from UniProt P00441 (154 aa). The A4V mutation was introduced at mature position 4 (UniProt position 5), changing Alanine to Valine:
Context: WT: MATKAVCVLK... → A4V: MATKVVCVLK...
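The substitution can be sketched as a small helper that checks the expected wild-type residue before replacing it. This is a minimal illustration, not the procedure actually used: the function name `apply_point_mutation` is hypothetical, and only the first ten residues of the UniProt P00441 sequence are used, with UniProt numbering (initiator Met = position 1).

```python
def apply_point_mutation(sequence: str, position: int, wild_type: str, mutant: str) -> str:
    """Return `sequence` with the residue at 1-based `position` replaced by `mutant`.

    Raises ValueError if the position is out of range or if the residue found
    there is not the expected `wild_type`, which guards against numbering slips
    (e.g. mature vs. UniProt numbering).
    """
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# First ten residues of human SOD1 (UniProt numbering, Met1 included).
WT_PREFIX = "MATKAVCVLK"

# A4V in mature numbering corresponds to UniProt position 5.
a4v_prefix = apply_point_mutation(WT_PREFIX, position=5, wild_type="A", mutant="V")
print(a4v_prefix)
```

Passing position 4 instead of 5 raises an error here, because UniProt position 4 is a lysine; that is the kind of off-by-one the wild-type check is meant to catch.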
Step 2: PepMLM Peptide Generation
I ran the PepMLM-650M Colab notebook with the SOD1 A4V sequence, generating 8 candidate peptides of length 12. The model outputs a pseudo-perplexity score for each peptide; lower values indicate higher model confidence that the peptide is a plausible binder for the target.
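To make the score concrete, here is a minimal sketch of how a pseudo-perplexity is usually defined for a masked language model: each peptide residue is masked in turn, the model's log-probability of the true residue is recorded, and the score is the exponential of the negative mean. The per-residue probabilities below are made-up toy values, and the exact computation inside the PepMLM notebook may differ in detail.

```python
import math


def pseudo_perplexity(token_log_probs):
    """Exponential of the negative mean log-probability of the true residues.

    `token_log_probs[i]` is assumed to be log p(residue_i | everything else),
    obtained by masking position i alone and reading off the model's output.
    """
    if not token_log_probs:
        raise ValueError("need at least one log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy 12-residue peptides (hypothetical probabilities, not PepMLM output).
confident = pseudo_perplexity([math.log(0.5)] * 12)   # model assigns p = 0.5 per residue
uncertain = pseudo_perplexity([math.log(0.05)] * 12)  # model assigns p = 0.05 per residue

print(f"confident peptide: {confident:.2f}")
print(f"uncertain peptide: {uncertain:.2f}")
```

A uniform guess over the 20 standard amino acids would give a pseudo-perplexity of 20, which is a useful reference point when reading the scores in the table.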
Full PepMLM-650M Output (8 peptides)
| # | Peptide Sequence | Pseudo Perplexity | Notes |
|---|---|---|---|
| 1 | WRVGATGVAHKX | 7.18 | Best score; X at pos 12, replaced with K |
| 2 | WLYGPVGLAHKX | 8.55 | X at pos 12 |
| 3 | WLYGPVAVAWWX | 9.37 | X at pos 12 |
| 4 | WHYGAVVAEWKK | 10.54 | Clean sequence |
| 5 | HLYYAAALRHKX | 14.75 | X at pos 12 |
| 6 | HLYYATALRHKX | 14.78 | X at pos 12 |
| 7 | WLYPAAAVRHWK | 18.69 | Clean sequence |
| 8 | WRYPPVVVAWWE | 18.72 | Clean sequence |
Note: Five peptides contained an unknown residue ‘X’ at position 12. This is a known PepMLM artifact where the final position mask is not fully resolved. For the top-scoring peptide (WRVGATGVAHKX), I replaced X → K (lysine) based on the pattern of other peptides ending in K/KK, which is consistent with cationic residues at C-termini aiding solubility and target engagement.
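The clean-up described in the note can be sketched as follows. The helper name `resolve_unknown` is hypothetical, and defaulting to lysine is the manual choice made above, not something PepMLM prescribes.

```python
# The eight PepMLM outputs from the table above.
generated = [
    "WRVGATGVAHKX", "WLYGPVGLAHKX", "WLYGPVAVAWWX", "WHYGAVVAEWKK",
    "HLYYAAALRHKX", "HLYYATALRHKX", "WLYPAAAVRHWK", "WRYPPVVVAWWE",
]


def resolve_unknown(peptide: str, replacement: str = "K") -> str:
    """Replace every unresolved 'X' with a chosen residue (lysine by default)."""
    return peptide.replace("X", replacement)


# Peptides that still contain an unresolved position.
with_unknown = [p for p in generated if "X" in p]
print(len(with_unknown), "peptides contain X")

top_candidate = resolve_unknown(generated[0])
print(top_candidate)
```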
Selected Peptides for Downstream Evaluation
I selected 4 peptides spanning the perplexity range, plus the known SOD1 binder as a reference:
| # | Peptide Sequence (12 aa) | Perplexity | Source |
|---|---|---|---|
| 1 | WRVGATGVAHKK | 7.18 | PepMLM generated (X→K) |
| 2 | WHYGAVVAEWKK | 10.54 | PepMLM generated |
| 3 | WLYPAAAVRHWK | 18.69 | PepMLM generated |
| 4 | WRYPPVVVAWWE | 18.72 | PepMLM generated |
| 5 | FLYRWLPSRRGG | N/A | Known SOD1 binder (reference) |
Perplexity interpretation: Lower pseudo perplexity indicates higher model confidence that the peptide is a plausible binder. Our best peptide (WRVGATGVAHKK, 7.18) shows the strongest model confidence, while WRYPPVVVAWWE (18.72) is the weakest. Values below ~10 are generally promising. Including peptides across the range lets us test whether PepMLM’s confidence score correlates with AlphaFold3 structural predictions and PeptiVerse binding affinity estimates.
Part 2: Evaluate Binders with AlphaFold3
Each peptide was modeled as a two-chain complex with SOD1 A4V on AlphaFold3 Server. Five separate jobs were submitted (one per peptide), each containing the full 154 aa SOD1 A4V sequence as Chain A and the 12 aa peptide as Chain B.
| Peptide | ipTM | pTM | Binding Location | Notes |
|---|---|---|---|---|
| WRVGATGVAHKK | 0.56 | 0.88 | Extended along β-barrel surface | Best ipTM; exceeds known binder |
| WHYGAVVAEWKK | 0.32 | 0.75 | Helical, near β-barrel top | Forms short helix; low interface confidence |
| WLYPAAAVRHWK | 0.31 | 0.75 | Extended, partial contact | Low interface confidence |
| WRYPPVVVAWWE | 0.23 | 0.77 | Extended, loose association | Worst ipTM; poor interface |
| FLYRWLPSRRGG | 0.32 | 0.82 | Extended along surface | Known binder reference |
ipTM interpretation: The interface predicted Template Modeling score (ipTM) measures confidence in the predicted protein-peptide interface. Values above 0.7 indicate confident binding; 0.5–0.7 is moderate; below 0.5 is low confidence. The pTM score reflects overall fold confidence for the complex.
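The thresholds above can be applied to the table with a few lines of code. This is a sketch using the cut-offs stated in this write-up (0.5 and 0.7); they are a working rule of thumb here, not an official AlphaFold3 classification, and the function name is hypothetical.

```python
def classify_iptm(iptm: float) -> str:
    """Bin an ipTM value using this write-up's cut-offs (0.5 and 0.7)."""
    if iptm > 0.7:
        return "confident"
    if iptm >= 0.5:
        return "moderate"
    return "low"


# ipTM values from the AlphaFold3 results table.
iptm_scores = {
    "WRVGATGVAHKK": 0.56,
    "WHYGAVVAEWKK": 0.32,
    "WLYPAAAVRHWK": 0.31,
    "WRYPPVVVAWWE": 0.23,
    "FLYRWLPSRRGG": 0.32,  # known binder reference
}

labels = {peptide: classify_iptm(score) for peptide, score in iptm_scores.items()}
for peptide, label in labels.items():
    print(f"{peptide}: {iptm_scores[peptide]:.2f} ({label})")
```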
Analysis
WRVGATGVAHKK stands out as the best candidate with an ipTM of 0.56. It is the only peptide in the moderate-confidence range and scores substantially higher than all others, including the known SOD1 binder FLYRWLPSRRGG (ipTM = 0.32). It also has the lowest (best) PepMLM perplexity (7.18), and ipTM falls as perplexity rises across the four generated peptides. This suggests that PepMLM’s confidence metric tracks structural binding quality, although four data points are too few to establish the relationship.
The remaining three PepMLM-generated peptides (ipTM 0.23–0.32) showed low binding confidence, comparable to or below the known binder. Interestingly, the known binder also scored low (0.32), which may reflect that short linear peptides are inherently challenging for AlphaFold3 to model with high confidence; the true binding mode may involve conformational selection or induced fit not captured by static prediction.
All complexes showed high pTM values (0.75–0.88), indicating that AlphaFold3 confidently predicts the overall fold, which is dominated by the SOD1 β-barrel, regardless of the peptide partner. The peptide chains generally showed lower per-residue confidence (yellow/orange coloring in the 3D viewer), consistent with the flexibility expected of short unstructured peptides.
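The relationship between PepMLM perplexity and ipTM discussed in this analysis can be quantified with a Spearman rank correlation. The sketch below is a plain-Python implementation that assumes no tied values (true for these four generated peptides); with only four points the coefficient is descriptive and carries little statistical weight.

```python
def ranks(values):
    """Return 1-based ascending ranks; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for paired samples without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# The four PepMLM-generated peptides, in the order of the tables above.
perplexity = [7.18, 10.54, 18.69, 18.72]
iptm = [0.56, 0.32, 0.31, 0.23]

rho = spearman_rho(perplexity, iptm)
print(f"Spearman rho = {rho:.2f}")
```

A strongly negative coefficient is the expected direction, since lower perplexity and higher ipTM both indicate a better candidate.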
Part 3: Evaluate Properties with PeptiVerse
Each peptide was evaluated using PeptiVerse with all supported property predictions enabled (Solubility, Permeability, Hemolysis, Non-Fouling, Half-Life). Binding Affinity prediction requires the target protein sequence in a separate input field.
| Peptide | Solubility | Permeability | Hemolysis Prob. | Non-Fouling | Half-Life (h) | Net Charge | MW (Da) |
|---|---|---|---|---|---|---|---|
| WRVGATGVAHKK | 1.000 | 0.355 | 0.027 | 0.302 | 0.292 | +2.85 | 1309.5 |
| WHYGAVVAEWKK | 1.000 | 0.084 | 0.032 | 0.283 | 0.479 | +0.85 | 1473.7 |
| WLYPAAAVRHWK | 1.000 | 0.720 | 0.027 | 0.356 | 0.387 | +1.85 | 1497.7 |
| WRYPPVVVAWWE | 1.000 | 0.375 | 0.230 | 0.178 | 0.381 | −0.23 | 1587.8 |
| FLYRWLPSRRGG | 1.000 | 0.862 | 0.047 | 0.666 | 0.310 | +2.76 | 1507.7 |
Analysis & Peptide Selection
All five peptides are predicted fully soluble (probability 1.0), which is encouraging for therapeutic development. Key differences emerge in other properties:
Hemolysis: Four peptides show very low hemolysis probability (≤0.047), indicating safety for blood contact. However, WRYPPVVVAWWE has an elevated hemolysis probability of 0.230, likely due to its high hydrophobic content (multiple W, V, P residues) and net negative charge, which may promote membrane disruption.
Permeability: The known binder FLYRWLPSRRGG shows the highest permeability (0.862), followed by WLYPAAAVRHWK (0.720). High permeability is desirable for intracellular targets like SOD1. The other three peptides are predicted non-permeable (<0.4).
Non-Fouling: Only the known binder FLYRWLPSRRGG is predicted non-fouling (0.666), meaning it resists non-specific protein adsorption, an important property for in vivo use. All PepMLM-generated peptides score below 0.36.
Half-Life: All peptides show short predicted half-lives (0.29–0.48 h), typical for unmodified linear peptides. WHYGAVVAEWKK has the longest at 0.479 h.
Selected Peptide for Advancement
Based on PeptiVerse analysis, I select WLYPAAAVRHWK as the most promising PepMLM-generated candidate. Rationale: (1) highest membrane permeability among generated peptides (0.720), critical since SOD1 is an intracellular target; (2) very low hemolysis risk (0.027); (3) full solubility; (4) moderate positive charge (+1.85) favorable for cellular uptake. Although its PepMLM perplexity was higher (18.69), the therapeutic property profile is superior to the lower-perplexity candidates. This choice trades structural confidence for permeability: its AlphaFold3 ipTM (0.31) is well below that of WRVGATGVAHKK (0.56), so a final ranking should weigh both.
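The selection logic above amounts to a filter followed by a sort, which can be sketched as below. The class name, the function name, and the 0.05 hemolysis cut-off are assumptions chosen for illustration; the numbers come from the PeptiVerse table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideProfile:
    sequence: str
    permeability: float
    hemolysis: float
    generated: bool  # True for PepMLM outputs, False for the reference binder


profiles = [
    PeptideProfile("WRVGATGVAHKK", 0.355, 0.027, True),
    PeptideProfile("WHYGAVVAEWKK", 0.084, 0.032, True),
    PeptideProfile("WLYPAAAVRHWK", 0.720, 0.027, True),
    PeptideProfile("WRYPPVVVAWWE", 0.375, 0.230, True),
    PeptideProfile("FLYRWLPSRRGG", 0.862, 0.047, False),
]


def shortlist(candidates, max_hemolysis=0.05):
    """Keep generated peptides under the hemolysis cut-off, best permeability first."""
    safe = [p for p in candidates if p.generated and p.hemolysis <= max_hemolysis]
    return sorted(safe, key=lambda p: p.permeability, reverse=True)


ranked = shortlist(profiles)
for profile in ranked:
    print(profile.sequence, profile.permeability, profile.hemolysis)
```

Ranking on a single property after a hard filter is the simplest possible scheme; a weighted score that also includes ipTM would change the order, so the cut-offs are worth stating explicitly.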
Part 4: Generate Optimized Peptides with moPPIt
I used moPPIt-v3 via the Colab notebook with a GPU runtime (T4). moPPIt uses flow matching with multi-objective property guidance to generate peptides optimized for several therapeutic properties simultaneously.
Settings used: Target protein = SOD1 A4V (154 aa), Binder length = 12, Num_Samples = 3, Objectives: Hemolysis (guidance scale 1), Non-Fouling (1), Solubility (1), Affinity (1).
| # | moPPIt Peptide | Length | Sequence Characteristics |
|---|---|---|---|
| 1 | GLTTEEEFLRWR | 12 | Net negative charge; Glu-rich mid-section; Trp/Arg-containing C-terminus (W, R) |
| 2 | GDLLRELWEGET | 12 | Mixed charged residues (R, D, E); single Trp; acidic C-terminus |
| 3 | LEQKLKSTETQV | 12 | Balanced charge (K, E); polar-rich (Q, S, T); no aromatics |
Comparison: PepMLM vs moPPIt
The moPPIt and PepMLM peptides show notably different sequence characteristics, reflecting their different generation strategies:
Charge profiles: PepMLM peptides tend toward positive charge (e.g., WRVGATGVAHKK at +2.85, WLYPAAAVRHWK at +1.85), driven by Lys/Arg/His residues. moPPIt peptides are more charge-balanced or net negative (GLTTEEEFLRWR has three Glu residues), likely reflecting the solubility and non-fouling optimization objectives which favor charged, hydrophilic sequences.
Aromatic content: PepMLM peptides are Trp-heavy (six of the eight generated peptides start with W and the other two with H), while moPPIt peptides use aromatics more sparingly; LEQKLKSTETQV has none at all. This is consistent with moPPIt’s hemolysis minimization, since aromatic/hydrophobic residues can promote membrane disruption.
Sequence diversity: moPPIt produces more polar, hydrophilic sequences (Glu, Gln, Thr, Ser) compared to PepMLM’s hydrophobic-rich outputs (Val, Ala, Pro). This trade-off may improve solubility and reduce hemolysis at the cost of membrane permeability, a consideration for intracellular targets like SOD1.
Design philosophy: PepMLM samples plausible binders conditioned on the entire target sequence (language-model perplexity), while moPPIt uses multi-objective optimization with explicit property guidance. PepMLM captures natural binding motifs; moPPIt biases sequences toward user-specified therapeutic properties.
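The charge and aromatic comparisons above can be checked with a simple residue count. This sketch uses a deliberately crude integer charge (K/R = +1, D/E = −1, His and termini ignored), so it only approximates the fractional PeptiVerse net charges in the property table; the helper names are hypothetical.

```python
POSITIVE = set("KR")
NEGATIVE = set("DE")
AROMATIC = set("FWY")


def simple_charge(peptide: str) -> int:
    """Integer charge estimate: +1 per K/R, -1 per D/E (His and termini ignored)."""
    return sum(r in POSITIVE for r in peptide) - sum(r in NEGATIVE for r in peptide)


def aromatic_count(peptide: str) -> int:
    """Number of Phe, Trp, and Tyr residues."""
    return sum(r in AROMATIC for r in peptide)


pepmlm = ["WRVGATGVAHKK", "WHYGAVVAEWKK", "WLYPAAAVRHWK", "WRYPPVVVAWWE"]
moppit = ["GLTTEEEFLRWR", "GDLLRELWEGET", "LEQKLKSTETQV"]

for label, peptides in (("PepMLM", pepmlm), ("moPPIt", moppit)):
    for peptide in peptides:
        print(f"{label}  {peptide}  charge={simple_charge(peptide):+d}  "
              f"aromatics={aromatic_count(peptide)}")
```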
How would you evaluate these peptides before clinical advancement?
Before advancing to clinical studies, I would evaluate moPPIt peptides through: (1) structural validation via AlphaFold3 or molecular dynamics to confirm binding pose, (2) PeptiVerse therapeutic property screening to compare against PepMLM candidates, (3) in vitro binding assays (SPR or ITC) against recombinant SOD1 A4V, (4) aggregation inhibition assays using ThT fluorescence, (5) cell-based assays in SOD1 A4V-expressing motor neuron models, (6) stability and pharmacokinetic profiling, and (7) peptide modifications (stapling, cyclization, D-amino acid substitution) to improve metabolic stability.
Part B: BRD4 Drug Discovery Platform Tutorial
Part C: L-Protein Mutant Design (Phage Lysis Protein)
Background: L-Protein Structure & Function
| Property | Details |
|---|---|
| UniProt | P03609 |
| Length | 75 amino acids |
| Soluble domain | Residues 1–40 (interacts with DnaJ chaperone) |
| Transmembrane domain | Residues 41–75 (forms pores in E. coli membrane) |
| Function | Forms oligomeric pores in E. coli inner membrane, causing lysis |
| Resistance mechanism | E. coli mutates DnaJ to prevent L-protein folding |
Step 1: ESM2 Deep Mutational Scanning
I ran ESM2 (150M parameter model) masked marginal scoring on the full L-protein sequence. For each position, the model predicts how every possible amino acid substitution would affect the protein’s fitness. Positive scores indicate mutations predicted to be beneficial (more likely in the evolutionary landscape); negative scores indicate deleterious mutations.
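The scoring rule can be sketched independently of the model. In masked marginal scoring, each position is masked in turn and a substitution is scored as the log-probability of the mutant residue minus that of the wild-type residue at the masked position. The function below assumes those log-probabilities have already been obtained from a masked language model such as ESM2; the two-residue sequence and its probabilities are toy values, not ESM2 output.

```python
import math


def masked_marginal_scores(wild_type, position_log_probs):
    """Score every substitution as log p(mutant) - log p(wild type).

    `position_log_probs[i]` maps amino-acid letters to the log-probability the
    model assigns at position i when that position alone is masked.
    Keys of the result use the usual notation, e.g. 'K1L'.
    """
    if len(wild_type) != len(position_log_probs):
        raise ValueError("one log-probability table is needed per residue")
    scores = {}
    for position, (wt, log_probs) in enumerate(zip(wild_type, position_log_probs), start=1):
        for residue, log_prob in log_probs.items():
            if residue != wt:
                scores[f"{wt}{position}{residue}"] = log_prob - log_probs[wt]
    return scores


# Toy example over a three-letter alphabet (hypothetical probabilities).
toy_sequence = "KC"
toy_log_probs = [
    {"K": math.log(0.1), "L": math.log(0.6), "C": math.log(0.3)},
    {"K": math.log(0.2), "L": math.log(0.2), "C": math.log(0.6)},
]

scores = masked_marginal_scores(toy_sequence, toy_log_probs)
for mutation, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{mutation}: {score:+.2f}")
```

A positive score therefore means the model finds the mutant residue more likely than the wild-type one in that context, which is a statement about sequence plausibility and only indirectly about function.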
ESM2 Mutational Landscape Heatmap
The heatmap below shows the ESM2 log-likelihood ratio for every possible single point mutation. Green = beneficial, Red = deleterious, White = neutral.
| Rank | Mutation | ESM2 Score | Domain | Interpretation |
|---|---|---|---|---|
| 1 | K50L | +3.50 | TM | Replace charged Lys with hydrophobic Leu in membrane |
| 2 | C29R | +3.01 | Soluble | Eliminate reactive cysteine, add positive charge |
| 3 | K50P | +2.95 | TM | Break helix at charged position |
| 4 | C29P | +2.94 | Soluble | Constrain backbone, eliminate thiol |
| 5 | K50I | +2.92 | TM | Hydrophobic replacement at K50 |
| 6 | K50F | +2.76 | TM | Aromatic hydrophobic at K50 |
| 7 | K50V | +2.71 | TM | Small hydrophobic at K50 |
| 8 | C29Q | +2.69 | Soluble | Polar replacement for Cys |
| 9 | N53L | +2.61 | TM | Replace polar Asn with hydrophobic Leu |
| 10 | S9Q | +2.54 | Soluble | Improve N-terminal stability |
Key Observations from ESM2 Scoring
Position C29 (Soluble domain): The cysteine at position 29 is the most mutable residue in the soluble domain. Nearly every substitution scores positively, suggesting this Cys may cause problems, possibly non-productive disulfide bonds that require DnaJ for resolution. Replacing C29 could enable DnaJ-independent folding.
Position K50 (TM domain): The lysine at position 50 is the highest-scoring position overall. A charged lysine in the middle of a transmembrane helix is energetically unfavorable. Replacing it with hydrophobic residues (L, I, V, F) dramatically improves the ESM2 score, suggesting better membrane insertion and pore stability.
Position N53 (TM domain): Another polar residue in the TM domain that scores well when replaced with hydrophobic amino acids, consistent with improved membrane compatibility.
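The observations above group the top-ranked mutations by position and domain. A minimal sketch of that bookkeeping follows, using the domain boundaries from the background table (residues 1–40 soluble, 41–75 transmembrane) and the ten scores listed earlier; the helper names are hypothetical.

```python
import re

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(mutation: str):
    """Split e.g. 'K50L' into (wild_type, position, mutant)."""
    match = MUTATION_PATTERN.match(mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def domain_of(mutation: str) -> str:
    """Assign a domain from the residue number (1-40 soluble, 41-75 TM)."""
    _, position, _ = parse_mutation(mutation)
    if 1 <= position <= 40:
        return "Soluble"
    if 41 <= position <= 75:
        return "TM"
    raise ValueError(f"position {position} is outside the 75-residue L-protein")


# Top ten ESM2 scores from the table above.
top_mutations = {
    "K50L": 3.50, "C29R": 3.01, "K50P": 2.95, "C29P": 2.94, "K50I": 2.92,
    "K50F": 2.76, "K50V": 2.71, "C29Q": 2.69, "N53L": 2.61, "S9Q": 2.54,
}

# Best-scoring substitution at each position.
best_per_position = {}
for mutation, score in top_mutations.items():
    wild_type, position, _ = parse_mutation(mutation)
    site = f"{wild_type}{position}"
    if site not in best_per_position or score > best_per_position[site][1]:
        best_per_position[site] = (mutation, score)

for site, (mutation, score) in best_per_position.items():
    print(f"{site} ({domain_of(mutation)}): {mutation} {score:+.2f}")
```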
Step 2: Correlation with Experimental Data
The experimental dataset (L-Protein Mutants) contains known mutations and their effect on lysis. Key observations from comparing ESM2 predictions with experimental results:
ESM2 vs Experimental Correlation
The ESM2 language model scores partially correlate with experimental lysis data, but with important caveats. The model captures general protein fitness (foldability, evolutionary plausibility) rather than the specific functional property of lysis. Mutations that improve membrane insertion (K50 replacements) are correctly identified as beneficial. However, the model may miss functional interactions specific to pore formation or DnaJ binding, since these are not directly encoded in evolutionary sequence statistics. This means ESM2 is a useful first-pass filter but should be combined with structural predictions (AF2-Multimer) and experimental validation.
Step 3: Five Designed L-Protein Mutants
Based on ESM2 scoring, experimental data, and mechanistic reasoning, I designed 5 mutant variants. Per the requirements: 2 have mutations in the soluble domain, 2 in the transmembrane domain, and 1 combines both.
Mutant 1 (Soluble Domain): C29R + S9Q
| Mutation | ESM2 Score | Domain |
|---|---|---|
| C29R | +3.01 | Soluble |
| S9Q | +2.54 | Soluble |
| Combined score | +5.56 |
Rationale: C29R eliminates the reactive cysteine that may cause non-productive disulfide bonds requiring DnaJ-assisted folding. The arginine replacement adds a positive charge that could enhance electrostatic interactions in the soluble domain. S9Q improves local stability near the N-terminus. Together, these aim to enable DnaJ-independent folding.
Mutant 2 (Soluble Domain): C29P + Y39L
| Mutation | ESM2 Score | Domain |
|---|---|---|
| C29P | +2.94 | Soluble |
| Y39L | +2.11 | Soluble |
| Combined score | +5.05 |
Rationale: C29P replaces cysteine with proline, which constrains backbone flexibility and completely eliminates thiol reactivity. Proline may create a structural turn that promotes autonomous folding. Y39L at the soluble/TM boundary replaces a bulky aromatic with a hydrophobic leucine, potentially improving the transition from soluble to membrane-embedded regions.
Mutant 3 (TM Domain): K50L + N53L
| Mutation | ESM2 Score | Domain |
|---|---|---|
| K50L | +3.50 | TM |
| N53L | +2.61 | TM |
| Combined score | +6.12 |
Rationale: K50L is the single highest-scoring mutation overall: replacing a charged lysine with hydrophobic leucine in the TM domain should dramatically improve membrane insertion thermodynamics. N53L similarly replaces a polar asparagine with leucine. Together, these create a more hydrophobic TM helix that should insert into the membrane more efficiently, enhancing pore formation speed and potentially enabling faster lysis before the host can acquire resistance.
Mutant 4 (TM Domain): K50I + A45L
| Mutation | ESM2 Score | Domain |
|---|---|---|
| K50I | +2.92 | TM |
| A45L | +1.46 | TM |
| Combined score | +4.38 |
Rationale: K50I provides an alternative hydrophobic replacement at the critical K50 position. Isoleucine is a β-branched amino acid that packs differently than leucine, which may alter pore geometry. A45L increases hydrophobicity earlier in the TM helix. This variant tests whether different hydrophobic amino acids at position 50 produce different lysis kinetics.
Mutant 5 (Combined): C29R + K50L + N53L
| Mutation | ESM2 Score | Domain |
|---|---|---|
| C29R | +3.01 | Soluble |
| K50L | +3.50 | TM |
| N53L | +2.61 | TM |
| Combined score | +9.13 |
Rationale: This triple mutant combines the best soluble domain mutation (C29R) with the best TM domain mutations (K50L, N53L). It attacks both resistance mechanisms simultaneously: C29R enables DnaJ-independent folding (overcoming chaperone resistance), while K50L + N53L enhance membrane pore formation (faster kill before resistance emerges). This is the most ambitious design with the highest combined ESM2 score of +9.13.
Summary of All 5 Mutants
| Mutant | Mutations | Domain | ESM2 Score | Design Goal |
|---|---|---|---|---|
| 1 | C29R, S9Q | Soluble | +5.56 | DnaJ-independent folding |
| 2 | C29P, Y39L | Soluble | +5.05 | Autonomous folding via backbone constraint |
| 3 | K50L, N53L | TM | +6.12 | Faster membrane insertion & pore formation |
| 4 | K50I, A45L | TM | +4.38 | Alternative pore geometry |
| 5 | C29R, K50L, N53L | Both | +9.13 | Dual-mechanism: folding + lysis enhancement |
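The combined scores in the summary are sums of the single-mutation ESM2 scores, which can be reproduced as sketched below. Two caveats apply: adding single-site scores assumes the mutations act independently (no epistasis), and summing the rounded two-decimal values can differ from the reported totals by about 0.01, presumably because the originals were summed before rounding.

```python
# Single-mutation ESM2 scores as reported in the per-mutant tables.
single_scores = {
    "C29R": 3.01, "S9Q": 2.54, "C29P": 2.94, "Y39L": 2.11,
    "K50L": 3.50, "N53L": 2.61, "K50I": 2.92, "A45L": 1.46,
}

designs = {
    1: ["C29R", "S9Q"],
    2: ["C29P", "Y39L"],
    3: ["K50L", "N53L"],
    4: ["K50I", "A45L"],
    5: ["C29R", "K50L", "N53L"],
}

# Combined scores as reported in the summary table.
reported = {1: 5.56, 2: 5.05, 3: 6.12, 4: 4.38, 5: 9.13}

# Additive approximation: sum the single-site scores for each design.
combined = {
    mutant: round(sum(single_scores[m] for m in mutations), 2)
    for mutant, mutations in designs.items()
}

for mutant, total in combined.items():
    gap = abs(total - reported[mutant])
    print(f"Mutant {mutant}: summed {total:+.2f}, reported {reported[mutant]:+.2f}, gap {gap:.2f}")
```

A more faithful estimate for multi-site designs would rescore each full mutant sequence with the model instead of adding single-site values.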
Step 4: AlphaFold2-Multimer Structural Validation
Open-Ended Question
How do you define how “good” or effective mutants are?
Evaluating L-protein mutant effectiveness requires a multi-level approach:
Computationally: (1) ESM2 log-likelihood scores capture evolutionary plausibility; positive scores suggest the mutation is compatible with the protein fold. (2) AF2-Multimer pLDDT and pTM scores assess structural confidence of the oligomeric pore assembly. (3) Co-folding with DnaJ can predict whether mutations reduce DnaJ dependency.
Experimentally: (1) The primary readout is the plaque assay: does the phage with the mutant L-protein still form plaques on E. coli? Larger or more abundant plaques indicate more effective lysis. (2) Lysis timing assays measure how quickly cells lyse after phage infection. (3) Testing against DnaJ-mutant E. coli strains specifically evaluates resistance-breaking ability. (4) Expression levels can be measured by Western blot.
A truly “good” mutant would maintain or improve lysis efficiency on wild-type E. coli while also lysing DnaJ-mutant strains that resist wild-type MS2.
AI Disclosure
I used Claude (Anthropic) to help with: formatting and structuring this homework page, interpreting PepMLM perplexity scores, analyzing PeptiVerse therapeutic property predictions, comparing PepMLM vs moPPIt peptide generation approaches, and spelling/grammar clean-up. All external tool runs (PepMLM, AlphaFold3, PeptiVerse, moPPIt, ESM2) were performed by me; Claude assisted with result interpretation and documentation.
HTGAA Spring 2026 · Week 5 Homework · Protein Design Part II · Constantin · Committed Listener