Week 5 HW: Protein Design Part II

Part 1: Generate Binders with PepMLM

Sequence Preparation

Retrieved WT SOD1 from UniProt P00441 and introduced A4V (position 5 in full sequence, position 4 in mature protein after Met cleavage):

WT:  M A T K A V C V L K ...
A4V: M A T K V V C V L K ...
             ^
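The substitution and its numbering can be checked with a few lines of code. This is a minimal sketch that uses only the ten residues shown above; `apply_mutation` and its `offset` parameter are hypothetical helpers written for illustration, not part of any tool used in this homework.

```python
def apply_mutation(seq: str, mutation: str, offset: int = 0) -> str:
    """Apply a point mutation such as 'A4V' to a sequence.

    offset shifts the mutation's numbering onto seq: offset=1 maps
    mature-protein numbering (initiator Met removed) onto a sequence
    that still starts with Met.
    """
    wt, pos, mut = mutation[0], int(mutation[1:-1]), mutation[-1]
    idx = pos + offset - 1  # 0-based index into seq
    if seq[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos + offset}, found {seq[idx]}")
    return seq[:idx] + mut + seq[idx + 1:]


wt_prefix = "MATKAVCVLK"  # first 10 residues of WT SOD1, as shown above
a4v_prefix = apply_mutation(wt_prefix, "A4V", offset=1)
print(a4v_prefix)  # MATKVVCVLK
```

The wild-type check inside the function guards against off-by-one numbering errors, which are easy to make when switching between mature and full-length coordinates.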

PepMLM Results

Model: PepMLM-650M (ESM-2 based, Colab, T4 GPU) | Parameters: length = 12, num_binders = 4, top_k = 3

Rank	Peptide	Pseudo-Perplexity
1	`WRYPAVAAEHWK`	12.41
2	`WRYPAVAAEHWE`	13.54
3	`WRYYPAGVAWKE`	14.19
4	`WLYPVAVLRWKX`	17.70
ref	`FLYRWLPSRRGG`	— (known SOD1 binder)

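Three of the four designs share a WRY N-terminal motif (aromatic, cationic, aromatic); the fourth begins with WLY. The best peptide (pseudo-perplexity = 12.41) and the known binder both have aromatic-rich N-termini and positively charged residues, which may indicate electrostatic complementarity with SOD1’s surface, although PepMLM scores alone provide no structural evidence for this.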

Part 2: Evaluate Binders with Boltz-2

Method

Each peptide was modeled in complex with SOD1-A4V using Boltz-2 (via Amina CLI), a diffusion-based structure prediction model that reports an interface confidence score (ipTM) analogous to the one from AlphaFold-Multimer. SOD1-A4V was chain A and the peptide was chain B.

Note: Peptide 4 contained an ambiguous X residue; this was replaced with Ala for structure prediction.
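The substitution of the ambiguous residue can be sketched as follows. `replace_nonstandard` is an illustrative helper, and using Ala as the default substitute simply mirrors the choice described in the note.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def replace_nonstandard(peptide: str, substitute: str = "A") -> str:
    """Replace any residue outside the 20 standard amino acids."""
    return "".join(aa if aa in STANDARD_AA else substitute for aa in peptide.upper())


print(replace_nonstandard("WLYPVAVLRWKX"))  # WLYPVAVLRWKA
```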

Results

Peptide	Pseudo-Perplexity	ipTM	pLDDT	Confidence	pDockQ2
`WRYPAVAAEHWK`	12.41	0.894	93.6	0.928	0.458
`WLYPVAVLRWKA`	17.70	0.769	93.7	0.904	0.308
`WRYYPAGVAWKE`	14.19	0.753	91.3	0.881	0.195
`FLYRWLPSRRGG` (known)	—	0.675	92.8	0.877	0.095
`WRYPAVAAEHWE`	13.54	0.530	93.1	0.851	0.042

Analysis

The top PepMLM design (WRYPAVAAEHWK) scores substantially higher than the known binder (ipTM = 0.894 vs 0.675) and sits above the 0.8 level generally treated as a confident interface prediction. Three of the four PepMLM peptides exceed the known binder’s ipTM.

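The best-scoring peptide also had the lowest PepMLM pseudo-perplexity (12.41), so the language model and the structure predictor agree on the top candidate. The agreement does not extend to the rest of the ranking: peptide 4 has the highest pseudo-perplexity (17.70) but the second-highest ipTM (0.769), and peptide 2 (WRYPAVAAEHWE), which differs from peptide 1 by a single residue (K→E at position 12), drops sharply in ipTM (0.894 → 0.530). The latter suggests that the C-terminal positive charge is important for the predicted interaction, although differences between individual predictions can also reflect model noise.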
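How well pseudo-perplexity tracks ipTM can be quantified with a rank correlation over the four designs. The sketch below uses the values from the results table and a hand-written Spearman formula that assumes no ties; with only four points it is descriptive, not a significance test. Because lower perplexity and higher ipTM are both "better", perfect agreement would give a coefficient of −1.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# peptide -> (pseudo-perplexity, ipTM), copied from the results table
designs = {
    "WRYPAVAAEHWK": (12.41, 0.894),
    "WRYPAVAAEHWE": (13.54, 0.530),
    "WRYYPAGVAWKE": (14.19, 0.753),
    "WLYPVAVLRWKA": (17.70, 0.769),
}
KNOWN_BINDER_IPTM = 0.675

perplexity = [ppl for ppl, _ in designs.values()]
iptm = [score for _, score in designs.values()]
rho = spearman(perplexity, iptm)
above_known = [pep for pep, (_, score) in designs.items() if score > KNOWN_BINDER_IPTM]

print(f"Spearman rho = {rho:.2f}")  # Spearman rho = -0.20
print(f"{len(above_known)} of {len(designs)} designs exceed the known binder")
```

A coefficient of −0.20 indicates that, beyond the top design, the two rankings are only weakly related.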

Predicted structures are available as PDB files in C2_boltz2/ for visualization.

Part 3: Evaluate Therapeutic Properties with PeptiVerse

Method

All peptides (4 PepMLM designs + known binder) were submitted to PeptiVerse with SOD1-A4V as the target. Properties evaluated: binding affinity (pKd), solubility, hemolysis, molecular weight, net charge (pH 7).

Results

Peptide	ipTM (Boltz-2)	Binding Affinity (pKd)	Solubility	Hemolysis (prob)	Net Charge (pH 7)	MW (Da)
`WRYPAVAAEHWK`	0.894	5.44	Soluble	0.023	+0.85	1513.7
`WRYPAVAAEHWE`	0.530	5.58	Soluble	0.035	−1.14	1514.6
`WRYYPAGVAWKE`	0.753	5.95	Soluble	0.022	+0.77	1525.7
`WLYPVAVLRWKA`	0.769	6.48	Soluble	0.043	+1.76	1501.8
`FLYRWLPSRRGG` (known)	0.675	5.97	Soluble	0.047	+2.76	1507.7

Analysis

All peptides are predicted soluble and non-hemolytic (all < 5% probability), which is encouraging for therapeutic development. Binding affinities are all in the “weak binding” range (pKd 5.4–6.5), which is typical for short linear peptides.
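To put the pKd range in more familiar units, the values can be converted to dissociation constants. The sketch assumes PeptiVerse reports pKd as −log10 of Kd in mol/L, which is the usual definition but is not confirmed here; `pkd_to_kd_micromolar` is an illustrative helper.

```python
def pkd_to_kd_micromolar(pkd: float) -> float:
    """Convert pKd (assumed to be -log10 of Kd in mol/L) to Kd in micromolar."""
    return 10 ** (-pkd) * 1e6


for pkd in (5.44, 6.48):  # lowest and highest predicted pKd in the table
    print(f"pKd {pkd:.2f} -> Kd ~ {pkd_to_kd_micromolar(pkd):.2f} uM")
```

Under that assumption the predicted affinities span roughly 0.3 to 3.6 µM.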

Interestingly, ipTM and predicted affinity do not fully agree. The best structural binder (WRYPAVAAEHWK, ipTM = 0.894) has the lowest predicted affinity (pKd = 5.44), while peptide 4 (WLYPVAVLRWKA) has the highest affinity (pKd = 6.48) but only moderate ipTM (0.769). This reflects the fact that PeptiVerse predicts affinity from sequence alone, while Boltz-2 evaluates structural complementarity — the two metrics capture different aspects of binding.

Recommended peptide to advance: WRYPAVAAEHWK. It has the strongest structural evidence for binding (ipTM = 0.894, well above the 0.8 confidence threshold), excellent safety profile (non-hemolytic, soluble, near-neutral charge), and the lowest PepMLM perplexity (12.41). While its sequence-predicted affinity is modest, the high ipTM and pDockQ2 (0.458) suggest a well-defined binding interface that may translate better to experimental validation than affinity predictions alone.

Part 4: Generate Optimized Peptides with moPPIt

Method

moPPIt-v3 uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific target residues while optimizing multiple therapeutic properties simultaneously. Unlike PepMLM (which samples plausible binders from sequence alone), moPPIt lets you specify where to bind and what properties to optimize.

Parameters: Target = SOD1-A4V, length = 12, motif positions = 1–10 (N-terminal region near A4V), objectives = Hemolysis, Solubility, Motif (weight = 1.0 each), GPU = L4

Results

Peptide	Hemolysis Score	Solubility Score	Motif Score
`DTKVKCGGNTQW`	0.968	0.833	0.803
`GCFEKTTGKTQD`	0.971	0.917	0.769
`KTGGKTQKITWH`	0.962	0.833	0.757
`TDTIRYKRQADE`	0.974	0.833	0.664

(Scores closer to 1.0 = better for hemolysis/solubility; motif score reflects binding to target residues 1–10.)

PepMLM vs moPPIt Comparison

Property	PepMLM Peptides	moPPIt Peptides
Design strategy	Sequence-conditioned sampling	Multi-objective guided flow matching
Composition	Aromatic-rich (WRY motifs), hydrophobic	Charged/polar (K, T, D, E), hydrophilic
Target awareness	Whole protein (implicit)	Specific residues 1–10 (explicit)
Therapeutic optimization	None (binding only)	Hemolysis, solubility optimized jointly

The moPPIt peptides are strikingly different in character: they are dominated by polar, charged, and small residues (Lys, Thr, Gln, Asp, Glu, Gly) rather than the aromatic-heavy composition of the PepMLM designs. This likely reflects the multi-objective optimization steering away from hydrophobic residues (which can cause hemolysis and poor solubility) toward safer, more soluble compositions.
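The compositional difference can be made concrete by counting residue classes in each set. This is a rough sketch: the residue groupings (`AROMATIC`, `CHARGED`) are simple conventions chosen for illustration, histidine is left out of the charged set, and the PepMLM peptides are taken as originally generated (including the X in peptide 4).

```python
AROMATIC = set("FWY")
CHARGED = set("DEKR")  # His omitted: mostly neutral at pH 7

PEPMLM = ["WRYPAVAAEHWK", "WRYPAVAAEHWE", "WRYYPAGVAWKE", "WLYPVAVLRWKX"]
MOPPIT = ["DTKVKCGGNTQW", "GCFEKTTGKTQD", "KTGGKTQKITWH", "TDTIRYKRQADE"]


def count_residues(peptides, residue_set):
    """Count residues across all peptides that belong to residue_set."""
    return sum(aa in residue_set for pep in peptides for aa in pep)


def fraction(peptides, residue_set):
    """Fraction of all residues in the peptide set that belong to residue_set."""
    return count_residues(peptides, residue_set) / sum(len(p) for p in peptides)


for name, peptides in (("PepMLM", PEPMLM), ("moPPIt", MOPPIT)):
    print(f"{name}: aromatic {fraction(peptides, AROMATIC):.2f}, "
          f"charged {fraction(peptides, CHARGED):.2f}")
```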

How to Evaluate Before Clinical Advancement

Before advancing any peptide toward clinical studies, one would need to:

Binding validation: Surface plasmon resonance (SPR) or isothermal titration calorimetry (ITC) to measure actual binding affinity
Structural confirmation: Cryo-EM or X-ray crystallography of the peptide-SOD1 complex
Selectivity testing: Ensure binding to mutant A4V-SOD1 over wild-type (to avoid disrupting normal SOD1 function)
Cell-based assays: Test whether peptide-E3 ligase fusions can degrade mutant SOD1 in cellular models
Stability and pharmacokinetics: Serum stability, half-life, and cell permeability measurements
In vivo efficacy: Animal models of SOD1-ALS (e.g., SOD1-G93A transgenic mice)

Part C: Phage Lysis Protein Design Challenge

Background

The MS2 phage L-protein (75 residues) lyses E. coli by forming membrane pores. E. coli can resist by mutating its DnaJ chaperone, preventing L-protein folding. Our goal: design L-protein mutants that reduce DnaJ dependence while maintaining lysis activity.

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT
|_____ Soluble domain (1-39) _________||___ Transmembrane (40-75) ________|
         (interacts with DnaJ)                (forms membrane pores)
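The domain boundaries and the charged residues referred to later can be read directly off the sequence. The sketch below uses the 1–39 / 40–75 split stated above; the variable names are illustrative.

```python
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)

soluble = L_PROTEIN[:39]        # residues 1-39
transmembrane = L_PROTEIN[39:]  # residues 40-75

# 1-based positions of basic, acidic, and cysteine residues in the soluble domain
basic = [(i, aa) for i, aa in enumerate(soluble, start=1) if aa in "KR"]
acidic = [(i, aa) for i, aa in enumerate(soluble, start=1) if aa in "DE"]
cysteines = [i for i, aa in enumerate(L_PROTEIN, start=1) if aa == "C"]

print(len(L_PROTEIN), len(soluble), len(transmembrane))
print("basic:", basic)
print("acidic:", acidic)
print("cysteines:", cysteines)
```

The soluble domain carries eight basic residues against three acidic ones, and the protein contains a single cysteine at position 29.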

Approach: Option 2 — Boltz-2 Co-folding + DMS Analysis

Step 1: Deep Mutational Scan (ESM-1v)

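Tool: amina run esm1v -m dms with 5-model ensemble (1,500 scores across 75 positions, i.e. 75 × 20 amino acids; this count presumably includes the wild-type residue at each position, which leaves 1,425 true substitutions)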
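The size of the scan follows from simple enumeration. The sketch below lists every single-residue substitution of the L-protein in the notation used later in this section; `single_substitutions` is an illustrative helper and does not reproduce how the tool itself builds its mutation list.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)


def single_substitutions(seq):
    """Yield every single-residue substitution in 'K23E' notation."""
    for pos, wt in enumerate(seq, start=1):
        for mut in AMINO_ACIDS:
            if mut != wt:
                yield f"{wt}{pos}{mut}"


substitutions = list(single_substitutions(L_PROTEIN))
print(len(substitutions))                 # 1425 = 75 x 19
print(len(L_PROTEIN) * len(AMINO_ACIDS))  # 1500 = 75 x 20
```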

DMS vs Experimental Data Correlation

Cross-referencing the DMS scores with published experimental mutation data (Chamakura et al., 2017):

Experimentally validated lysis-positive mutations in the soluble domain:

Position	Change	Lysis (1 = yes, 0 = no)	Protein level (1 = maintained, 0 = not)	DMS Score	DMS Assessment
13	P→L	1	1	−0.37	Tolerated
15	S→A	1	1	+0.01	Favorable
18	R→G	1	1	−0.92	Tolerated
18	R→I	1	1	−1.17	Tolerated
23	K→E	1	0	+0.58	Favorable
25	E→G	1	0	+0.23	Favorable
25	E→V	1	0	+0.04	Favorable
25	E→D	1	0	+0.15	Favorable
26	D→G	1	0	+0.14	Favorable
30	R→Q	1	1	−0.46	Tolerated
30	R→L	1	1	−0.35	Tolerated
31	R→I	1	1	−1.23	Tolerated
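A quick consistency check on the table: each listed wild-type residue should match the L-protein sequence at that position, and the assessment labels should follow from the sign of the DMS score. The rule "score > 0 means Favorable, otherwise Tolerated" is inferred from the table itself and is an assumption, not a documented threshold.

```python
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)

# (position, wild type, mutant, DMS score, assessment), copied from the table
ROWS = [
    (13, "P", "L", -0.37, "Tolerated"), (15, "S", "A", 0.01, "Favorable"),
    (18, "R", "G", -0.92, "Tolerated"), (18, "R", "I", -1.17, "Tolerated"),
    (23, "K", "E", 0.58, "Favorable"), (25, "E", "G", 0.23, "Favorable"),
    (25, "E", "V", 0.04, "Favorable"), (25, "E", "D", 0.15, "Favorable"),
    (26, "D", "G", 0.14, "Favorable"), (30, "R", "Q", -0.46, "Tolerated"),
    (30, "R", "L", -0.35, "Tolerated"), (31, "R", "I", -1.23, "Tolerated"),
]


def assess(score: float) -> str:
    """Assumed labelling rule: positive score is Favorable, else Tolerated."""
    return "Favorable" if score > 0 else "Tolerated"


wrong_residue = [row for row in ROWS if L_PROTEIN[row[0] - 1] != row[1]]
wrong_label = [row for row in ROWS if assess(row[3]) != row[4]]
print("residue mismatches:", wrong_residue)
print("label mismatches:", wrong_label)
```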

Critical discrepancy at position 29 (Cys): the DMS model scores C29 as the most mutable position (avg = +2.45), yet experimentally C29R abolishes both lysis and protein expression. The language model misses this. C29 is the only cysteine in the L-protein, so it cannot form an intramolecular disulfide; it may instead matter for folding, stability, or an intermolecular contact. This highlights the importance of cross-validating computational predictions with experimental data.

Most tolerant positions (DMS): 29 (but experimentally essential!), 27, 24, 5, 23, 22, 39, 25
Most conserved positions (DMS): 1 (Met, start codon), 38, 31, 19, 30, 18, 20, 11

Step 2: Boltz-2 Co-fold (L-protein + DnaJ)

Tool: amina run boltz2 with L-protein (chain A, 75 aa) + DnaJ (chain B, 376 aa)

Results: ipTM = 0.165, pLDDT = 71.5, pDockQ2 = 0.009

The very low ipTM is consistent with the assignment’s warning that folding models struggle with this system. The L-protein has a disordered soluble domain and a transmembrane region, neither of which is predicted well in isolation. The prediction places the soluble domain (residues 1–39) near DnaJ’s N-terminal J-domain, but at ipTM = 0.165 and pDockQ2 = 0.009 this placement should not be read as evidence for a specific binding mode. It is also worth noting that the J-domain is canonically the region that engages DnaK, while substrate binding is usually attributed to DnaJ’s C-terminal domains.

Step 3: Designed Mutations

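Based on combining (1) DMS-favorable scores, (2) experimentally validated lysis-positive mutations, and (3) avoidance of experimentally essential positions, we propose 5 L-protein variants. Where DMS and experiment disagree (for example at positions 18, 30, and 31, which DMS ranks as conserved but which tolerate substitution experimentally), the experimental result was given priority: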

Variant	Mutations	Region	Rationale
V1	K23E + E25G + D26G	Soluble	All three individually maintain lysis, although all three have Protein = 0 in the table above. Remodels the charge pattern of positions 23–26: the K23 cation becomes an anionic Glu while E25 and D26 are neutralized, so the net charge of this stretch stays at −1, and the two Gly add flexibility. All DMS-favorable.
V2	R18G + R20L + K23E	Soluble	Removes three positive charges (and adds one negative) in the Arg/Lys-rich stretch (18–23), which is the predicted DnaJ interaction surface. R18G and K23E are experimentally lysis-positive; R20L is not in the validated table above and sits at a DMS-conserved position, so it is the riskiest substitution. May reduce DnaJ dependence by altering the chaperone recognition motif.
V3	S15A + E25G + R30Q	Soluble	Combines the most conservative experimentally validated mutations across different subregions. S15A and R30Q both maintain lysis AND protein levels, minimizing risk.
V4	P13L + R18I + E25D	Soluble	P13L disrupts a potential turn structure; R18I replaces a charged Arg with hydrophobic Ile; E25D is a conservative acidic→acidic swap. All lysis-positive experimentally.
V5	R19S + K23E + D26G	Soluble	Targets the cationic cluster (R19, K23) that likely mediates DnaJ binding. K23E and D26G are experimentally lysis-positive; R19S is not in the validated table above and sits at a DMS-conserved position. Replacing Arg/Lys with neutral/acidic residues may enable DnaJ-independent folding while maintaining the downstream lysis machinery.

Design principles:

All individual mutations except R20L (V2) and R19S (V5) are experimentally validated as lysis-positive in the table above
Most mutations target the Arg/Lys-rich region (positions 18–26) that likely mediates DnaJ recognition
No mutations at position 29 (Cys — essential despite DMS scores) or position 1 (Met — start codon)
Each variant has ≥3 mutations for meaningful charge/surface remodeling
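The design principles can be checked mechanically against the variant table. The sketch below flags any mutation that is absent from the validated lysis-positive table and computes the formal net charge change of each variant. The charge model is a simplification: K and R count as +1, D and E as −1, and everything else, including His, as 0.

```python
# Lysis-positive substitutions copied from the experimental table above
VALIDATED = {
    "P13L", "S15A", "R18G", "R18I", "K23E", "E25G",
    "E25V", "E25D", "D26G", "R30Q", "R30L", "R31I",
}

VARIANTS = {
    "V1": ["K23E", "E25G", "D26G"],
    "V2": ["R18G", "R20L", "K23E"],
    "V3": ["S15A", "E25G", "R30Q"],
    "V4": ["P13L", "R18I", "E25D"],
    "V5": ["R19S", "K23E", "D26G"],
}

CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # simplified side-chain charges at pH 7


def charge_change(mutations):
    """Formal net charge change: sum of (mutant charge - wild-type charge)."""
    return sum(CHARGE.get(m[-1], 0) - CHARGE.get(m[0], 0) for m in mutations)


def unlisted(mutations):
    """Mutations that do not appear in the validated lysis-positive table."""
    return [m for m in mutations if m not in VALIDATED]


for name, muts in VARIANTS.items():
    print(f"{name}: net charge change {charge_change(muts):+d}, "
          f"not in validated table: {unlisted(muts)}")
```

Under this charge model V1 and V3 are charge-neutral overall, V2 shifts the soluble domain by −4, and only R20L and R19S fall outside the validated set.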

Tools Used

Step	Tool	Details
Deep mutational scan	`amina run esm1v -m dms`	5-model ensemble, 1500 mutations
Co-fold L-protein + DnaJ	`amina run boltz2`	2-chain complex, diffusion model
Experimental cross-validation	Published data	Chamakura et al., 2017