Week 5 HW: Protein Design Part II

🧬 Week 5: Protein Design Part II

HTGAA Spring 2026 · Constantin · Committed Listener

Part A: SOD1 Binder Peptide Design

**Background:** Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme. The A4V mutation causes one of the most aggressive forms of familial ALS by destabilizing the N-terminus and promoting toxic aggregation. Our goal is to design short peptides that bind mutant SOD1 and evaluate their therapeutic potential.

Part 1: Generate Binders with PepMLM

Step 1: SOD1 A4V Mutant Sequence

The human SOD1 sequence was retrieved from UniProt P00441 (154 aa). A4V is named using mature-protein numbering, which omits the initiator methionine, so the substitution falls at position 5 of the UniProt sequence, changing alanine to valine:

>SOD1_A4V | UniProt:P00441 | A4V familial ALS mutant
MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS
AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV
HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Context: WT: MATKAVCVLK... → A4V: MATKVVCVLK...
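The numbering offset is easy to get wrong, so here is a minimal sketch of the substitution. The helper name `apply_point_mutation` and its `offset` parameter are my own illustration and not part of any tool used here; the wild-type prefix is the first ten residues of P00441.

```python
def apply_point_mutation(seq: str, mutation: str, offset: int = 0) -> str:
    """Apply a mutation such as 'A4V' to seq.

    offset converts the mutation's numbering to 1-based sequence numbering,
    e.g. offset=1 when the mutation is named without the initiator Met.
    """
    wt, pos, mt = mutation[0], int(mutation[1:-1]) + offset, mutation[-1]
    idx = pos - 1  # convert 1-based position to 0-based index
    if seq[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[idx]}")
    return seq[:idx] + mt + seq[idx + 1:]


wt_prefix = "MATKAVCVLK"  # first ten residues of wild-type SOD1 (with Met1)
a4v_prefix = apply_point_mutation(wt_prefix, "A4V", offset=1)
print(a4v_prefix)  # MATKVVCVLK
```

The wild-type check is the useful part: calling the helper with `offset=0` fails, because position 4 of the UniProt sequence is a lysine rather than an alanine.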

Step 2: PepMLM Peptide Generation

I ran the PepMLM-650M Colab notebook with the SOD1 A4V sequence, generating 8 candidate peptides of length 12. The model outputs a pseudo-perplexity score for each peptide; lower values indicate higher model confidence that the peptide is a plausible binder for the target.
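For intuition, pseudo-perplexity for a masked language model is commonly defined as the exponential of the negative mean log-probability of each residue when it is masked in turn. The sketch below assumes PepMLM's score follows that form and that the per-position log-probabilities have already been obtained from the model; the function name is hypothetical and this is not PepMLM's own code.

```python
import math


def pseudo_perplexity(masked_log_probs):
    """exp of the negative mean log-probability over peptide positions.

    masked_log_probs[i] is assumed to be log p(residue_i | everything else),
    obtained by masking position i and reading off the model's prediction.
    """
    if not masked_log_probs:
        raise ValueError("need at least one position")
    return math.exp(-sum(masked_log_probs) / len(masked_log_probs))


# Toy check: a model that is uniform over the 20 amino acids at every
# position gives a pseudo-perplexity of 20 for a 12-mer.
uniform = [math.log(1 / 20)] * 12
print(round(pseudo_perplexity(uniform), 2))  # 20.0
```

On this scale a score of 7.18 means the model is, on average, about as uncertain as if it were choosing among roughly seven equally likely residues per position.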

Full PepMLM-650M Output (8 peptides)

| # | Peptide Sequence | Pseudo-Perplexity | Notes |
|---|---|---|---|
| 1 | WRVGATGVAHKX | 7.18 | Best score; X at pos 12 → replaced with K |
| 2 | WLYGPVGLAHKX | 8.55 | X at pos 12 |
| 3 | WLYGPVAVAWWX | 9.37 | X at pos 12 |
| 4 | WHYGAVVAEWKK | 10.54 | Clean sequence |
| 5 | HLYYAAALRHKX | 14.75 | X at pos 12 |
| 6 | HLYYATALRHKX | 14.78 | X at pos 12 |
| 7 | WLYPAAAVRHWK | 18.69 | Clean sequence |
| 8 | WRYPPVVVAWWE | 18.72 | Clean sequence |

Note: Five peptides contained an unknown residue ‘X’ at position 12, which appears to be a generation artifact in which the final masked position is not resolved to a standard amino acid. For the top-scoring peptide (WRVGATGVAHKX), I replaced X with K (lysine), following the other outputs that end in K or KK; a cationic C-terminal residue is also plausible for solubility and target engagement. Because the reported pseudo-perplexity (7.18) was computed for the X-containing sequence, it is only an approximation for WRVGATGVAHKK.

Selected Peptides for Downstream Evaluation

I selected 4 peptides spanning the perplexity range, plus the known SOD1 binder as a reference:

| # | Peptide Sequence (12 aa) | Perplexity | Source |
|---|---|---|---|
| 1 | WRVGATGVAHKK | 7.18 | PepMLM generated (X→K) |
| 2 | WHYGAVVAEWKK | 10.54 | PepMLM generated |
| 3 | WLYPAAAVRHWK | 18.69 | PepMLM generated |
| 4 | WRYPPVVVAWWE | 18.72 | PepMLM generated |
| 5 | FLYRWLPSRRGG | n/a | Known SOD1 binder (reference) |

Perplexity interpretation: Lower pseudo-perplexity indicates higher model confidence that the peptide is a plausible binder. Our best peptide (WRVGATGVAHKK, 7.18) has the strongest model confidence, while WRYPPVVVAWWE (18.72) has the weakest. As a rough rule of thumb I treat values below ~10 as promising, though this cutoff is a heuristic rather than a validated threshold. Including peptides across the range lets us test whether PepMLM’s confidence score correlates with AlphaFold3 structural predictions and PeptiVerse binding affinity estimates.

Part 2: Evaluate Binders with AlphaFold3

Each peptide was modeled as a two-chain complex with SOD1 A4V on the AlphaFold3 Server. Five separate jobs were submitted (one per peptide), each containing the full 154 aa SOD1 A4V sequence as Chain A and the 12 aa peptide as Chain B.

| Peptide | ipTM | pTM | Binding Location | Notes |
|---|---|---|---|---|
| WRVGATGVAHKK | 0.56 | 0.88 | Extended along β-barrel surface | Best ipTM; exceeds known binder |
| WHYGAVVAEWKK | 0.32 | 0.75 | Helical, near β-barrel top | Forms short helix; low interface confidence |
| WLYPAAAVRHWK | 0.31 | 0.75 | Extended, partial contact | Low interface confidence |
| WRYPPVVVAWWE | 0.23 | 0.77 | Extended, loose association | Worst ipTM; poor interface |
| FLYRWLPSRRGG | 0.32 | 0.82 | Extended along surface | Known binder reference |

ipTM interpretation: The interface predicted Template Modeling score (ipTM) measures confidence in the predicted protein-peptide interface. As a working guide, values above 0.7 suggest a confident interface, 0.5–0.7 is moderate, and below 0.5 is low confidence. The pTM score reflects overall fold confidence for the whole complex.
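The banding can be applied mechanically to the five jobs. This is a small sketch using the working thresholds from this write-up (they are my cutoffs, not an official AlphaFold definition), with the hypothetical helper name `iptm_band`.

```python
def iptm_band(iptm: float) -> str:
    """Map an ipTM value to the confidence bands used in this write-up."""
    if iptm > 0.7:
        return "confident"
    if iptm >= 0.5:
        return "moderate"
    return "low"


iptm_scores = {
    "WRVGATGVAHKK": 0.56,
    "WHYGAVVAEWKK": 0.32,
    "WLYPAAAVRHWK": 0.31,
    "WRYPPVVVAWWE": 0.23,
    "FLYRWLPSRRGG": 0.32,  # known binder reference
}

bands = {peptide: iptm_band(score) for peptide, score in iptm_scores.items()}
for peptide, band in bands.items():
    print(f"{peptide}  ipTM={iptm_scores[peptide]:.2f}  {band}")
```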

Analysis

WRVGATGVAHKK stands out as the best candidate with an ipTM of 0.56: it is the only peptide in the moderate-confidence range and scores substantially higher than all others, including the known SOD1 binder FLYRWLPSRRGG (ipTM = 0.32). It also has the lowest (best) PepMLM perplexity (7.18), and across the four generated peptides ipTM falls as perplexity rises. With only four data points, this is suggestive rather than conclusive evidence that PepMLM’s confidence metric predicts structural binding quality.
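To put a number on the perplexity/ipTM relationship, a Spearman rank correlation can be computed for the four PepMLM peptides. The sketch below is plain Python and assumes tie-free data, which holds for these values; the function names are my own.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Values for the four PepMLM-generated peptides, in the same order.
perplexity = [7.18, 10.54, 18.69, 18.72]
iptm = [0.56, 0.32, 0.31, 0.23]

rho = spearman(perplexity, iptm)
print(rho)  # -1.0
```

The ranking is perfectly inverse, but with four peptides there are only 24 possible orderings, so a perfect inverse ranking would arise by chance 1 time in 24. More peptides would be needed before treating the trend as established.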

The remaining three PepMLM-generated peptides (ipTM 0.23–0.32) showed low binding confidence, comparable to or below the known binder. Interestingly, the known binder also scored low (0.32), which may reflect that short linear peptides are inherently challenging for AlphaFold3 to model with high confidence; the true binding mode may involve conformational selection or induced fit not captured by a static prediction.

All complexes showed high pTM values (0.75–0.88). Because pTM is computed over the whole complex and SOD1 contributes 154 of its 166 residues, this mainly indicates that AlphaFold3 confidently predicts the SOD1 β-barrel fold regardless of the peptide partner. The peptide chains generally showed lower per-residue confidence (yellow/orange coloring in the 3D viewer), consistent with the flexibility expected of short unstructured peptides.

Part 3: Evaluate Properties with PeptiVerse

Each peptide was evaluated using PeptiVerse with all supported property predictions enabled (Solubility, Permeability, Hemolysis, Non-Fouling, Half-Life). Binding Affinity prediction requires the target protein sequence in a separate input field.

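| Peptide | Solubility | Permeability | Hemolysis Prob. | Non-Fouling | Half-Life (h) | Net Charge | MW (Da) |
|---|---|---|---|---|---|---|---|
| WRVGATGVAHKK | 1.000 | 0.355 | 0.027 | 0.302 | 0.292 | +2.85 | 1309.5 |
| WHYGAVVAEWKK | 1.000 | 0.084 | 0.032 | 0.283 | 0.479 | +0.85 | 1473.7 |
| WLYPAAAVRHWK | 1.000 | 0.720 | 0.027 | 0.356 | 0.387 | +1.85 | 1497.7 |
| WRYPPVVVAWWE | 1.000 | 0.375 | 0.230 | 0.178 | 0.381 | −0.23 | 1587.8 |
| FLYRWLPSRRGG | 1.000 | 0.862 | 0.047 | 0.666 | 0.310 | +2.76 | 1507.7 |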

Analysis & Peptide Selection

All five peptides are predicted fully soluble (probability 1.0), which is encouraging for therapeutic development. Key differences emerge in other properties:

Hemolysis: Four peptides show very low hemolysis probability (≤0.047), suggesting a low risk of red blood cell lysis. WRYPPVVVAWWE, however, has an elevated hemolysis probability of 0.230, most likely because of its high hydrophobic content (three Trp, three Val, two Pro), which may promote membrane disruption.

Permeability: The known binder FLYRWLPSRRGG shows the highest permeability (0.862), followed by WLYPAAAVRHWK (0.720). High permeability is desirable for intracellular targets like SOD1. The other three peptides are predicted non-permeable (<0.4).

Non-Fouling: Only the known binder FLYRWLPSRRGG is predicted non-fouling (0.666), meaning it resists non-specific protein adsorption, an important property for in vivo use. All PepMLM-generated peptides score below 0.36.

Half-Life: All peptides show short predicted half-lives (0.29–0.48 h), typical for unmodified linear peptides. WHYGAVVAEWKK has the longest at 0.479 h.

Selected Peptide for Advancement

Based on PeptiVerse analysis, I select WLYPAAAVRHWK as the most promising PepMLM-generated candidate. Rationale: (1) highest membrane permeability among generated peptides (0.720), critical since SOD1 is an intracellular target; (2) very low hemolysis risk (0.027); (3) full solubility; (4) moderate positive charge (+1.85) favorable for cellular uptake. Although its PepMLM perplexity was higher (18.69), the therapeutic property profile is superior to the lower-perplexity candidates. The main caveat is its low AlphaFold3 interface confidence (ipTM 0.31 in Part 2, versus 0.56 for WRVGATGVAHKK), so a final ranking should weigh the property profile against structural confidence.
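The selection logic can be written as a simple filter-then-rank step over the PeptiVerse values for the four generated peptides. The hemolysis cutoff of 0.05 is an illustrative choice of mine, not a PeptiVerse default, and the variable names are hypothetical.

```python
# peptide: (permeability, hemolysis probability), copied from the table above
peptiverse = {
    "WRVGATGVAHKK": (0.355, 0.027),
    "WHYGAVVAEWKK": (0.084, 0.032),
    "WLYPAAAVRHWK": (0.720, 0.027),
    "WRYPPVVVAWWE": (0.375, 0.230),
}
MAX_HEMOLYSIS = 0.05  # illustrative safety cutoff (assumption)

# Step 1: drop peptides whose predicted hemolysis exceeds the cutoff.
safe = {p: v for p, v in peptiverse.items() if v[1] <= MAX_HEMOLYSIS}

# Step 2: among the remaining peptides, pick the most permeable one.
best = max(safe, key=lambda p: safe[p][0])
print(best)  # WLYPAAAVRHWK
```

Ranking on a different primary criterion, for example ipTM, would favor WRVGATGVAHKK instead, so the outcome depends on which property is treated as limiting.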

Part 4: Generate Optimized Peptides with moPPIt

I used moPPIt-v3 via the Colab notebook with a GPU runtime (T4). moPPIt uses flow matching with multi-objective property guidance to generate peptides optimized for several therapeutic properties simultaneously.

Settings used: Target protein = SOD1 A4V (154 aa), Binder length = 12, Num_Samples = 3, Objectives: Hemolysis (guidance scale 1), Non-Fouling (1), Solubility (1), Affinity (1).

| # | moPPIt Peptide | Length | Sequence Characteristics |
|---|---|---|---|
| 1 | GLTTEEEFLRWR | 12 | Net negative charge; Glu-rich mid-section; aromatic/cationic C-terminus (W, R) |
| 2 | GDLLRELWEGET | 12 | Mixed charged residues (R, D, E); Trp for binding; acidic C-terminus |
| 3 | LEQKLKSTETQV | 12 | Balanced charge (K, E); polar-rich; no aromatics |

Comparison: PepMLM vs moPPIt

The moPPIt and PepMLM peptides show notably different sequence characteristics, reflecting their different generation strategies:

Charge profiles: PepMLM peptides tend toward positive charge (e.g., WRVGATGVAHKK at +2.85, WLYPAAAVRHWK at +1.85), driven by Lys/Arg/His residues. moPPIt peptides are more charge-balanced or net negative (GLTTEEEFLRWR has three Glu residues), likely reflecting the solubility and non-fouling optimization objectives which favor charged, hydrophilic sequences.

Aromatic content: PepMLM peptides are Trp-heavy (all four selected peptides start with W, and the remaining generated peptides start with W or H), while moPPIt peptides use aromatics more sparingly; LEQKLKSTETQV has none at all. This is consistent with moPPIt’s hemolysis minimization, since aromatic/hydrophobic residues can promote membrane disruption.

Sequence diversity: moPPIt produces more polar, hydrophilic sequences (Glu, Gln, Thr, Ser) compared to PepMLM’s hydrophobic-rich outputs (Val, Ala, Pro). This trade-off may improve solubility and reduce hemolysis at the cost of membrane permeability, a consideration for intracellular targets like SOD1.

Design philosophy: PepMLM samples plausible binders conditioned on the entire target sequence (language-model perplexity), while moPPIt uses multi-objective optimization with explicit property guidance. PepMLM captures natural binding motifs; moPPIt biases sequences toward user-specified therapeutic properties.
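The charge and aromatic comparisons can be checked with a few lines of Python. This is a rough sketch: the integer charge counts only Lys/Arg and Asp/Glu, ignoring His and the termini, so it will not reproduce the fractional PeptiVerse values (which presumably come from a pKa-based calculation), and the aromatic set F/W/Y is my own grouping.

```python
AROMATIC = set("FWY")  # grouping chosen for this sketch; His is excluded


def integer_charge(peptide: str) -> int:
    """Lys/Arg count +1, Asp/Glu count -1; His and the termini are ignored."""
    positive = sum(aa in "KR" for aa in peptide)
    negative = sum(aa in "DE" for aa in peptide)
    return positive - negative


def aromatic_fraction(peptide: str) -> float:
    """Fraction of residues that are Phe, Trp or Tyr."""
    return sum(aa in AROMATIC for aa in peptide) / len(peptide)


pepmlm = ["WRVGATGVAHKK", "WHYGAVVAEWKK", "WLYPAAAVRHWK", "WRYPPVVVAWWE"]
moppit = ["GLTTEEEFLRWR", "GDLLRELWEGET", "LEQKLKSTETQV"]

for label, peptides in (("PepMLM", pepmlm), ("moPPIt", moppit)):
    for p in peptides:
        print(f"{label}  {p}  charge={integer_charge(p):+d}  "
              f"aromatic={aromatic_fraction(p):.2f}")
```

Under these simplifications the PepMLM peptides come out at 0 to +3 and the moPPIt peptides at -3 to 0, with a lower aromatic fraction for the moPPIt set.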

How would you evaluate these peptides before clinical advancement?

Before advancing to clinical studies, I would evaluate moPPIt peptides through: (1) structural validation via AlphaFold3 or molecular dynamics to confirm binding pose, (2) PeptiVerse therapeutic property screening to compare against PepMLM candidates, (3) in vitro binding assays (SPR or ITC) against recombinant SOD1 A4V, (4) aggregation inhibition assays using ThT fluorescence, (5) cell-based assays in SOD1 A4V-expressing motor neuron models, (6) stability and pharmacokinetic profiling, and (7) peptide modifications (stapling, cyclization, D-amino acid substitution) to improve metabolic stability.

Part B: BRD4 Drug Discovery Platform Tutorial

**Note:** Part B (Boltz Lab BRD4 tutorial) is marked **Optional** for Committed Listeners. This section is skipped.

Part C: L-Protein Mutant Design (Phage Lysis Protein)

**Objective:** Improve the stability and autofolding of the MS2 phage lysis protein (L-protein) to overcome E. coli resistance. We want mutations that either (1) make the L-protein fold independently of the DnaJ chaperone, (2) achieve faster/more efficient lysis, or (3) increase expression. We use ESM2 protein language model scoring combined with experimental data to design 5 mutant variants.

Background: L-Protein Structure & Function

| Property | Details |
|---|---|
| UniProt | P03609 |
| Length | 75 amino acids |
| Soluble domain | Residues 1–40 (interacts with DnaJ chaperone) |
| Transmembrane domain | Residues 41–75 (forms pores in E. coli membrane) |
| Function | Forms oligomeric pores in E. coli inner membrane, causing lysis |
| Resistance mechanism | E. coli mutates DnaJ to prevent L-protein folding |

L-Protein: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT
           |<-------- Soluble domain (1-40) -------->||<------ TM domain (41-75) ------|

Step 1: ESM2 Deep Mutational Scanning

I ran ESM2 (150M parameter model) masked marginal scoring on the full L-protein sequence. For each position, the residue is masked and the model’s predicted likelihood of every possible substitution is compared with that of the wild-type residue. Positive scores mean the model considers the substitution more likely than the wild type, which I use as a proxy for a beneficial mutation; negative scores point to likely deleterious mutations.
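The score itself is a log-likelihood ratio: log p(mutant residue) minus log p(wild-type residue) at the masked position. The sketch below shows only that arithmetic on made-up probabilities; the model forward pass that would produce `masked_log_probs` is omitted, and the function name and data layout are assumptions of mine rather than the notebook's code.

```python
import math


def masked_marginal_scores(wt_seq, masked_log_probs):
    """Return {mutation_name: score} for every single substitution.

    masked_log_probs[i] maps each amino acid to log p(aa | sequence with
    position i masked). Producing it requires a model forward pass per
    position, which is not shown here.
    """
    scores = {}
    for i, wt in enumerate(wt_seq):
        for aa, logp in masked_log_probs[i].items():
            if aa != wt:
                # Positive when the model prefers the mutant over wild type.
                scores[f"{wt}{i + 1}{aa}"] = logp - masked_log_probs[i][wt]
    return scores


# Toy two-residue example with made-up probabilities (not real ESM2 output).
toy_seq = "KC"
toy_log_probs = [
    {"K": math.log(0.1), "L": math.log(0.6), "C": math.log(0.3)},
    {"K": math.log(0.2), "L": math.log(0.3), "C": math.log(0.5)},
]
scores = masked_marginal_scores(toy_seq, toy_log_probs)
print(round(scores["K1L"], 2))  # 1.79
```

Read this way, a score of +3.50 means the model rates the mutant residue about e^3.5 (roughly 33 times) more likely than the wild-type residue at that position.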

ESM2 Mutational Landscape Heatmap

The heatmap below shows the ESM2 log-likelihood ratio for every possible single point mutation. Green = beneficial, Red = deleterious, White = neutral.

| Rank | Mutation | ESM2 Score | Domain | Interpretation |
|---|---|---|---|---|
| 1 | K50L | +3.50 | TM | Replace charged Lys with hydrophobic Leu in membrane |
| 2 | C29R | +3.01 | Soluble | Eliminate reactive cysteine, add positive charge |
| 3 | K50P | +2.95 | TM | Break helix at charged position |
| 4 | C29P | +2.94 | Soluble | Constrain backbone, eliminate thiol |
| 5 | K50I | +2.92 | TM | Hydrophobic replacement at K50 |
| 6 | K50F | +2.76 | TM | Aromatic hydrophobic at K50 |
| 7 | K50V | +2.71 | TM | Small hydrophobic at K50 |
| 8 | C29Q | +2.69 | Soluble | Polar replacement for Cys |
| 9 | N53L | +2.61 | TM | Replace polar Asn with hydrophobic Leu |
| 10 | S9Q | +2.54 | Soluble | Improve N-terminal stability |

Key Observations from ESM2 Scoring

Position C29 (Soluble domain): Cysteine 29 is the most mutable residue in the soluble domain, with nearly every substitution scoring positively. One possible explanation is that this Cys forms non-productive disulfide bonds that require DnaJ for resolution, in which case replacing C29 could enable DnaJ-independent folding. This is a hypothesis; ESM2 scores alone cannot establish it.

Position K50 (TM domain): The lysine at position 50 is the highest-scoring position overall. A charged lysine within a transmembrane helix is energetically unfavorable, and replacing it with hydrophobic residues (L, I, V, F) markedly improves the ESM2 score, suggesting better membrane insertion and pore stability.

Position N53 (TM domain): Another polar residue in the TM domain that scores well when replaced with hydrophobic amino acids, consistent with improved membrane compatibility.

Step 2: Correlation with Experimental Data

The experimental dataset (L-Protein Mutants) contains known mutations and their effect on lysis. Key observations from comparing ESM2 predictions with experimental results:

ESM2 vs Experimental Correlation

The ESM2 language model scores partially correlate with experimental lysis data, but with important caveats. The model captures general protein fitness (foldability, evolutionary plausibility) rather than the specific functional property of lysis. Mutations that improve membrane insertion (K50 replacements) are correctly identified as beneficial. However, the model may miss functional interactions specific to pore formation or DnaJ binding, since these are not directly encoded in evolutionary sequence statistics. This means ESM2 is a useful first-pass filter but should be combined with structural predictions (AF2-Multimer) and experimental validation.

Step 3: Five Designed L-Protein Mutants

Based on ESM2 scoring, experimental data, and mechanistic reasoning, I designed 5 mutant variants. Per the requirements: 2 have mutations in the soluble domain, 2 in the transmembrane domain, and 1 combines both.

Mutant 1 (Soluble Domain): C29R + S9Q

| Mutation | ESM2 Score | Domain |
|---|---|---|
| C29R | +3.01 | Soluble |
| S9Q | +2.54 | Soluble |
| Combined score | +5.56 | |

WT:  METRFPQQ**S**QQTPASTNRRRPFKHEDYP**C**RRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT
M1:  METRFPQQ**Q**QQTPASTNRRRPFKHEDYP**R**RRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Rationale: C29R eliminates the reactive cysteine that may cause non-productive disulfide bonds requiring DnaJ-assisted folding. The arginine replacement adds a positive charge that could enhance electrostatic interactions in the soluble domain. S9Q improves local stability near the N-terminus. Together, these aim to enable DnaJ-independent folding.

Mutant 2 (Soluble Domain): C29P + Y39L

| Mutation | ESM2 Score | Domain |
|---|---|---|
| C29P | +2.94 | Soluble |
| Y39L | +2.11 | Soluble |
| Combined score | +5.05 | |

WT:  METRFPQQSQQTPASTNRRRPFKHEDYP**C**RRQQRSSTL**Y**VLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT
M2:  METRFPQQSQQTPASTNRRRPFKHEDYP**P**RRQQRSSTL**L**VLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Rationale: C29P replaces cysteine with proline, which constrains backbone flexibility and completely eliminates thiol reactivity. Proline may create a structural turn that promotes autonomous folding. Y39L at the soluble/TM boundary replaces a bulky aromatic with a hydrophobic leucine, potentially improving the transition from soluble to membrane-embedded regions.

Mutant 3 (TM Domain): K50L + N53L

| Mutation | ESM2 Score | Domain |
|---|---|---|
| K50L | +3.50 | TM |
| N53L | +2.61 | TM |
| Combined score | +6.12 | |

WT:  METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLS**K**FT**N**QLLLSLLEAVIRTVTTLQQLLT
M3:  METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLS**L**FT**L**QLLLSLLEAVIRTVTTLQQLLT

Rationale: K50L is the single highest-scoring mutation overall: replacing a charged lysine with hydrophobic leucine in the TM domain should substantially improve membrane insertion thermodynamics. N53L similarly replaces a polar asparagine with leucine. Together, these create a more hydrophobic TM helix that should insert into the membrane more efficiently, enhancing pore formation speed and potentially enabling faster lysis before the host can acquire resistance.

Mutant 4 (TM Domain): K50I + A45L

| Mutation | ESM2 Score | Domain |
|---|---|---|
| K50I | +2.92 | TM |
| A45L | +1.46 | TM |
| Combined score | +4.38 | |

WT:  METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFL**A**IFLS**K**FTNQLLLSLLEAVIRTVTTLQQLLT
M4:  METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFL**L**IFLS**I**FTNQLLLSLLEAVIRTVTTLQQLLT

Rationale: K50I provides an alternative hydrophobic replacement at the critical K50 position. Isoleucine is a β-branched amino acid that packs differently than leucine, which may alter pore geometry. A45L increases hydrophobicity earlier in the TM helix. This variant tests whether different hydrophobic amino acids at position 50 produce different lysis kinetics.

Mutant 5 (Combined): C29R + K50L + N53L

| Mutation | ESM2 Score | Domain |
|---|---|---|
| C29R | +3.01 | Soluble |
| K50L | +3.50 | TM |
| N53L | +2.61 | TM |
| Combined score | +9.13 | |

WT:  METRFPQQSQQTPASTNRRRPFKHEDYP**C**RRQQRSSTLYVLIFLAIFLS**K**FT**N**QLLLSLLEAVIRTVTTLQQLLT
M5:  METRFPQQSQQTPASTNRRRPFKHEDYP**R**RRQQRSSTLYVLIFLAIFLS**L**FT**L**QLLLSLLEAVIRTVTTLQQLLT

Rationale: This triple mutant combines the best soluble domain mutation (C29R) with the best TM domain mutations (K50L, N53L). It targets both resistance mechanisms simultaneously: C29R is intended to enable DnaJ-independent folding (overcoming chaperone resistance), while K50L + N53L are intended to enhance membrane pore formation (faster kill before resistance emerges). This is the most ambitious design, with the highest combined ESM2 score of +9.13.
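The five variants above can be regenerated from the wild-type sequence with a small helper that checks the wild-type residue before substituting, which guards against off-by-one numbering errors. The function name `mutate` is my own; the sequence and mutation lists are taken from this page.

```python
import re

WT_L = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQ"
        "LLLSLLEAVIRTVTTLQQLLT")


def mutate(seq, mutations):
    """Apply substitutions such as 'C29R' (1-based positions) to seq."""
    residues = list(seq)
    for m in mutations:
        wt, pos, mt = re.fullmatch(r"([A-Z])(\d+)([A-Z])", m).groups()
        idx = int(pos) - 1
        if residues[idx] != wt:
            raise ValueError(f"{m}: found {residues[idx]} at position {pos}")
        residues[idx] = mt
    return "".join(residues)


designs = {
    "M1": ["C29R", "S9Q"],
    "M2": ["C29P", "Y39L"],
    "M3": ["K50L", "N53L"],
    "M4": ["K50I", "A45L"],
    "M5": ["C29R", "K50L", "N53L"],
}
mutants = {name: mutate(WT_L, muts) for name, muts in designs.items()}
for name, seq in mutants.items():
    print(name, seq)
```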

Summary of All 5 Mutants

| Mutant | Mutations | Domain | ESM2 Score | Design Goal |
|---|---|---|---|---|
| 1 | C29R, S9Q | Soluble | +5.56 | DnaJ-independent folding |
| 2 | C29P, Y39L | Soluble | +5.05 | Autonomous folding via backbone constraint |
| 3 | K50L, N53L | TM | +6.12 | Faster membrane insertion & pore formation |
| 4 | K50I, A45L | TM | +4.38 | Alternative pore geometry |
| 5 | C29R, K50L, N53L | Both | +9.13 | Dual-mechanism: folding + lysis enhancement |
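The combined scores are sums of single-mutation scores, which assumes the mutations act independently (no epistasis). A quick sketch of that arithmetic from the rounded table values is below; the variable names are hypothetical.

```python
# Single-mutation ESM2 scores as listed (rounded to two decimals).
single_scores = {
    "C29R": 3.01, "S9Q": 2.54, "C29P": 2.94, "Y39L": 2.11,
    "K50L": 3.50, "N53L": 2.61, "K50I": 2.92, "A45L": 1.46,
}
designs = {
    "M1": ["C29R", "S9Q"],
    "M2": ["C29P", "Y39L"],
    "M3": ["K50L", "N53L"],
    "M4": ["K50I", "A45L"],
    "M5": ["C29R", "K50L", "N53L"],
}

# Additive model: the combined score is the sum of the single scores.
totals = {name: round(sum(single_scores[m] for m in muts), 2)
          for name, muts in designs.items()}
for name, total in totals.items():
    print(f"{name}: {total:+.2f}")
```

Summing the rounded values gives +5.55, +5.05, +6.11, +4.38 and +9.12. Three of these differ from the listed combined scores by 0.01, presumably because the listed values were summed before rounding. Either way, an additive score is only a first approximation, since mutations at nearby positions (such as K50 and N53) may interact.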

Step 4: AlphaFold2-Multimer Structural Validation

**Note:** AF2-Multimer structural validation is an optional extension step. The sequences below are provided for future validation using the [AF2-Multimer Colab notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb). To model pore assembly, fold each mutant as an 8-chain oligomer by pasting its sequence 8 times separated by colons; to probe chaperone dependence, co-fold one copy of the mutant with DnaJ, again separated by a colon. Examples for Mutant 3 and Mutant 1:
# Mutant 3 (K50L, N53L) as 8-mer for pore assembly prediction:
METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQLLLSLLEAVIRTVTTLQQLLT
# Mutant 1 (C29R, S9Q) co-folded with DnaJ for interaction analysis:
METRFPQQQQQTPASTNRRRPFKHEDYPRRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:MAKQDYYEILGVSKTAEEREIRKAYKRLAMKYHPDRNQGDKEAEAKFKEIKEAYEVLTDSQKRAAYDQYGHAAFEQGGMGGGGFGGGADFSDIFGDVFGDIFGGGRGRQRAARGADLRYNMELTLEEAVRGVTKEIRIPTLEECDVCHGSGAKPGTQPQTCPTCHGSGQVQMRQGFFAVQQTCPHCQGRGTLIKDPCNKCHGHGRVERSKTLSVKIPAGVDTGDRIRLAGEGEAGEHGAPAGDLYVQVQVKQHPIFEREGNNLYCEVPINFAMAALGGEIEVPTLDGRVKLKVPGETQTGKLFRMRGKGVKSVRGGAQGDLLCRVVVETPVGLNERQKQLLQELQESFGGPTGEHNSPRSKSFFDGVKKFFDDLTR
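Building these colon-separated query strings by hand is error-prone, so a short helper may be useful. This is a sketch that only assembles the text to paste into the notebook's sequence field; the function names are my own and it does not call ColabFold.

```python
def homo_oligomer(seq: str, copies: int) -> str:
    """Join identical chains with ':' as the chain separator."""
    return ":".join([seq] * copies)


def hetero_complex(*chains: str) -> str:
    """Join different chains (e.g. L-protein mutant and DnaJ) with ':'."""
    return ":".join(chains)


# Mutant 3 (K50L, N53L) sequence from this page.
m3 = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTLQ"
      "LLLSLLEAVIRTVTTLQQLLT")

query = homo_oligomer(m3, 8)
chains = query.split(":")
print(len(chains), len(query))  # 8 607
```

The same `hetero_complex` helper covers the DnaJ co-folding case by passing the mutant sequence and the DnaJ sequence as two arguments.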

Open-Ended Question

How do you define how “good” or effective mutants are?

Evaluating L-protein mutant effectiveness requires a multi-level approach:

Computationally: (1) ESM2 log-likelihood scores capture evolutionary plausibility; positive scores suggest the mutation is compatible with the protein fold. (2) AF2-Multimer pLDDT and pTM scores assess structural confidence of the oligomeric pore assembly. (3) Co-folding with DnaJ can predict whether mutations reduce DnaJ dependency.

Experimentally: (1) The primary readout is the plaque assay: does the phage with the mutant L-protein still form plaques on E. coli? Larger or more abundant plaques indicate more effective lysis. (2) Lysis timing assays measure how quickly cells lyse after phage infection. (3) Testing against DnaJ-mutant E. coli strains specifically evaluates resistance-breaking ability. (4) Expression levels can be measured by Western blot.

A truly “good” mutant would maintain or improve lysis efficiency on wild-type E. coli while also lysing DnaJ-mutant strains that resist wild-type MS2.


AI Disclosure

I used Claude (Anthropic) to help with: formatting and structuring this homework page, interpreting PepMLM perplexity scores, analyzing PeptiVerse therapeutic property predictions, comparing PepMLM vs moPPIt peptide generation approaches, and spelling/grammar clean-up. All external tool runs (PepMLM, AlphaFold3, PeptiVerse, moPPIt, ESM2) were performed by me; Claude assisted with result interpretation and documentation.

