Week 5 HW: Protein Design Part 2

Part A: SOD1 A4V Peptide Binder Design

Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state it forms a stable homodimer and binds copper and zinc. Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). The A4V mutation (alanine to valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.

The goal here is to design short peptides that bind mutant SOD1, then decide which ones are worth advancing toward therapy, using three models: PepMLM, PeptiVerse, and moPPIt.

Generate four 12-mer binders with PepMLM and record perplexity scores

Four 12-residue peptides were generated using PepMLM-650M conditioned on the SOD1 A4V mutant sequence, alongside the known binder FLYRWLPSRRGG.

Peptide ID	Sequence	Source	Perplexity
1	`WRYYVAAVRWGE`	generated	21.23
2	`WRSPPVGVEHKA`	generated	22.21
3	`WLYYPVGAELKE`	generated	16.06
4	`WHSGVVVLALKA`	generated	13.84
5	`FLYRWLPSRRGG`	known binder	20.64

Lower pseudo-perplexity indicates higher model confidence. Peptide 4 (WHSGVVVLALKA, PPL = 13.84) shows the highest PepMLM confidence, followed by Peptide 3 (WLYYPVGAELKE, PPL = 16.06). Both outperform the known binder (PPL = 20.64), suggesting the model considers them plausible binders. All four generated peptides begin with Trp (W), suggesting a strong N-terminal preference for aromatic anchoring to SOD1.

Evaluate binders with AlphaFold3

All five peptide–SOD1 complexes were submitted to AlphaFold Server (fold date: 2026-03-09). Each job modeled the SOD1 A4V monomer (154 residues, chain A) with one 12-mer peptide (chain B). Results are stored in peptides/af3_results/.

Peptide	ipTM (best)	Binding Location	Surface/Buried	Notes
`WRYYVAAVRWGE`	0.31	Dimer interface / β-barrel	Surface-bound	PAE 9.07 Å, moderate confidence
`WRSPPVGVEHKA`	0.36	Extended surface groove	Surface-bound	Second-best ipTM, extended conformation
`WLYYPVGAELKE`	0.24	β-barrel region	Surface-bound	PAE 10.81 Å, lowest confidence
`WHSGVVVLALKA`	0.48	Dimer interface pocket	Partially buried	Best model: PAE 4.97 Å, well-defined binding
`FLYRWLPSRRGG`	0.31	β-barrel / dimer interface	Surface-bound	Known binder, PAE 8.60 Å

ipTM values range from 0.24 to 0.48 across the five complexes. While all fall below the 0.6 threshold typically considered high-confidence for protein–peptide interactions, they show meaningful differentiation among candidates.

Peptide 4 (WHSGVVVLALKA, ipTM = 0.48) clearly stands out: its ipTM exceeds the known binder FLYRWLPSRRGG (0.31) by 55%, and its PAE of 4.97 Å is roughly half that of the next-best model, indicating a well-resolved binding pose at the dimer interface pocket. It is also the only one predicted to be partially buried, suggesting tighter engagement.

Peptide 2 (WRSPPVGVEHKA, ipTM = 0.36) ranks second structurally, adopting an extended conformation along a surface groove. Peptides 1 and 5 tie at ipTM = 0.31, with Peptide 1 localizing to the dimer interface / β-barrel region and Peptide 5 (the known binder) similarly positioned. Peptide 3 (WLYYPVGAELKE, ipTM = 0.24) has the weakest structural prediction despite its moderate PepMLM perplexity (16.06), with a high PAE (10.81 Å) indicating uncertain binding geometry.

Notably, none of the five peptides bind near the N-terminus where the A4V mutation resides (position 4). All predicted binding sites localize to the dimer interface or β-barrel region, suggesting these peptides may act through general fold stabilization or dimer modulation rather than direct mutation-site engagement.

Evaluate therapeutic properties with PeptiVerse

Peptide	Source	PPL	Binding Affinity (pKd)	Solubility	Hemolysis	Net Charge (pH 7)	MW (Da)
`WRYYVAAVRWGE`	generated	21.23	7.021 (Medium)	1.000	0.093	+0.77	1555.7
`WRSPPVGVEHKA`	generated	22.21	4.826 (Weak)	1.000	0.013	+0.85	1362.5
`WLYYPVGAELKE`	generated	16.06	5.722 (Weak)	1.000	0.033	-1.23	1467.7
`WHSGVVVLALKA`	generated	13.84	6.055 (Weak)	1.000	0.079	+0.85	1279.5
`FLYRWLPSRRGG`	known binder	20.64	5.968 (Weak)	1.000	0.047	+2.76	1507.7

ipTM vs. PeptiVerse affinity. AlphaFold3 structural confidence and PeptiVerse-predicted affinity disagree on the top candidate. Peptide 4 (WHSGVVVLALKA) dominates structurally (ipTM = 0.48, PAE = 4.97 Å) but has only moderate predicted affinity (pKd = 6.055, “Weak”). Conversely, Peptide 1 (WRYYVAAVRWGE) has the best PeptiVerse affinity (pKd = 7.021, “Medium binding”) but an unremarkable ipTM of 0.31. This divergence likely reflects the fact that PeptiVerse predicts binding strength from sequence features while AF3 models 3D structural complementarity. The two views are complementary.

PepMLM perplexity vs. ipTM. These two metrics show better agreement. Peptide 4 ranks first in both (PPL = 13.84, ipTM = 0.48), supporting its candidacy from two independent perspectives. The correlation is imperfect: Peptide 3 ranks second by PepMLM (PPL = 16.06) but last by AF3 (ipTM = 0.24), so low perplexity does not guarantee a well-resolved pose.

Therapeutic safety. All five peptides are predicted to be fully soluble (probability = 1.000) and non-hemolytic (all below 0.10). No candidates present safety red flags. Peptide 2 (WRSPPVGVEHKA) has the lowest hemolysis risk (0.013) but also the weakest binding (pKd = 4.826).

Physicochemical properties. Net charges range from -1.23 to +2.76 at pH 7, all within reasonable bounds for cell-penetrating peptides. The known binder has the highest positive charge (+2.76), consistent with its arginine-rich C-terminus. Molecular weights are in the 1280–1556 Da range, typical for 12-mers.

Top candidate to advance: Peptide 4 (WHSGVVVLALKA), with Peptide 1 (WRYYVAAVRWGE) as a strong alternative.

Peptide 4 has the best PepMLM confidence (PPL = 13.84) and the best AlphaFold3 structural prediction by a wide margin (ipTM = 0.48, PAE = 4.97 Å). Two independent methods (sequence-based PepMLM and structure-based AF3) agree that this peptide has the most credible interaction with SOD1. Its predicted binding at the dimer interface pocket, where it is partially buried, suggests a geometrically specific interaction rather than nonspecific surface adhesion. While its PeptiVerse-predicted affinity is moderate (pKd = 6.055), the structural evidence from AF3 provides stronger support for a real binding event. It is fully soluble, non-hemolytic (0.079), and has the lowest molecular weight (1279.5 Da) among the candidates.

Peptide 1 (WRYYVAAVRWGE) remains a compelling alternative: it has the strongest predicted binding affinity (pKd = 7.021, the only “Medium binding” peptide), excellent safety properties, and a moderate ipTM (0.31). If PeptiVerse affinity predictions are weighted more heavily than AF3 structural models, Peptide 1 would be the preferred choice.

For experimental validation, both peptides merit testing: Peptide 4 as the structurally favored lead, Peptide 1 as the affinity-favored alternative.

Generate optimized peptides with moPPIt

The moPPIt model (discrete flow matching with multi-objective gradient guidance) was used to generate 11 peptides targeting the SOD1 A4V mutant. Target motifs were set to residues 1–15 (N-terminus, near the A4V mutation) and residues 49–54 (dimer interface near the EFGDN loop). Peptide length was 12 amino acids. Objective weights were [1, 1, 1, 4, 4, 2], so affinity and motif specificity were prioritized 4×. Results are in peptides/moPPIt/sod1_moppit_results.csv.

Peptide	Hemolysis	Non-Fouling	Half-Life	Affinity	Motif	Specificity
`QKRRLLSLPVFK`	0.902	0.602	0.80	6.00	0.478	0.622
`YPPCAYYWQATD`	0.929	0.587	3.42	7.10	0.563	0.686
`SIVKTGVTFLTK`	0.920	0.186	1.81	6.38	0.584	0.699
`PPLIHRWYAATM`	0.922	0.321	3.49	6.30	0.444	0.660
`EEQVVKRIKVGP`	0.953	0.736	0.68	6.54	0.580	0.679
`CVQNKKPTFLII`	0.911	0.497	1.56	6.14	0.668	0.647
`LKKKIREFLKLG`	0.952	0.561	1.16	6.19	0.512	0.660
`YDPLPCAWTPTH`	0.935	0.726	2.69	6.57	0.482	0.699
`KPFVFFAKTEIM`	0.932	0.130	1.41	6.25	0.589	0.538
`PTWVIETKKKFR`	0.979	0.611	2.30	5.73	0.609	0.667
`GPKGWTGKQCFI`	0.888	0.711	2.07	7.00	0.474	0.635

Hemolysis: probability of being non-hemolytic (higher is safer). Affinity: predicted binding score (higher is stronger). Motif: fraction of binding at target residues (higher means more on-target).

All 11 peptides show high predicted hemolysis scores (0.89–0.98), indicating low hemolytic risk. Affinity predictions span 5.73 to 7.10, with YPPCAYYWQATD (7.10) and GPKGWTGKQCFI (7.00) showing the strongest predicted binding. Half-lives vary considerably (0.68–3.49 hours), with PPLIHRWYAATM (3.49 h) and YPPCAYYWQATD (3.42 h) the most stable.

Top moPPIt candidates

Category	Peptide	Highlights
Highest affinity	`YPPCAYYWQATD`	Affinity 7.10, half-life 3.42, specificity 0.686
Best motif targeting	`CVQNKKPTFLII`	Strongest on-target binding (motif 0.668)
Best therapeutic profile	`EEQVVKRIKVGP`	Highest non-hemolytic (0.953), best non-fouling (0.736), strong affinity (6.54)
Best overall balance	`YDPLPCAWTPTH`	Affinity 6.57, non-fouling 0.726, half-life 2.69, specificity 0.699

Comparison to PepMLM peptides

Design philosophy. PepMLM generates peptides via masked language modeling conditioned on the target sequence: it learns what peptide “looks right” next to SOD1 based on evolutionary patterns. moPPIt uses discrete flow matching with explicit multi-objective gradient guidance: it actively optimizes for binding affinity, motif specificity, and therapeutic properties simultaneously.
Binding specificity. PepMLM peptides are generated without any notion of where on SOD1 they should bind. moPPIt peptides are explicitly guided toward residues 1–15 and 49–54 via the BindEvaluator motif score, with a specificity penalty that discourages off-target binding.
Sequence composition. PepMLM peptides all start with W (tryptophan), suggesting a strong bias for aromatic N-terminal anchors. moPPIt peptides are more diverse: no single residue dominates, and compositions vary based on the objective trade-offs the sampler explores.
Affinity. moPPIt’s highest-affinity peptide (YPPCAYYWQATD, 7.10) is comparable to PepMLM’s best (WRYYVAAVRWGE, 7.02 via PeptiVerse). However, moPPIt consistently produces peptides in the 6.0–7.1 range, while PepMLM has more variance (4.8–7.0), suggesting moPPIt’s affinity guidance is effective.
Solubility tradeoff. PepMLM peptides all have perfect predicted solubility (1.000). Some moPPIt peptides sacrifice solubility (SIVKTGVTFLTK non-fouling = 0.186, KPFVFFAKTEIM = 0.130) in favor of higher affinity. This reflects the multi-objective nature: aggressive affinity optimization can push sequences toward hydrophobic compositions.

Evaluation before clinical advancement

In silico

Molecular dynamics simulations of peptide–SOD1 complexes (starting from AF3 structures) to assess binding stability.
Binding free energy calculations (MM/PBSA or MM/GBSA) for ranking.
Aggregation prediction (AGGRESCAN, TANGO).

In vitro

Surface plasmon resonance (SPR) or isothermal titration calorimetry (ITC) to measure actual Kₓ against A4V SOD1.
Hemolysis assay with human red blood cells.
Serum stability to validate half-life predictions.
ThT fluorescence and aggregation assays to test whether the peptide inhibits A4V SOD1 aggregation.

Cell-based

Cell viability (MTT/MTS) to confirm non-cytotoxicity.
Cell-penetrating peptide assessment, since SOD1 is cytosolic.
Co-immunoprecipitation to confirm peptide–SOD1 interaction in cellular context.

In vivo preclinical

Pharmacokinetics (bioavailability, clearance, tissue distribution).
Efficacy testing in the SOD1-G93A transgenic ALS mouse model.
Standard safety pharmacology panel.

The key bottleneck for peptide therapeutics is typically delivery (cell penetration plus proteolytic stability), not binding affinity. Strategies to address this include D-amino acid substitution, cyclization, stapling, and conjugation to cell-penetrating peptide motifs.

Part B: BRD4 Drug Discovery with Boltz Lab

Tutorial designed by Geoffrey Smith, Boltz Lab.

Target: BRD4 (Bromodomain-containing protein 4), an epigenetic reader and validated oncology target. BRD4 is a member of the BET (Bromodomain and Extra-Terminal) family. It recognizes acetylated lysine residues on histone tails and recruits transcriptional machinery to gene promoters, driving expression of oncogenes including c-Myc. Dysregulated BRD4 activity is implicated in haematological malignancies, solid tumours, and inflammatory disease.

Reference: Filippakopoulos P. et al. Selective inhibition of BET bromodomains. Nature 468, 1067–1073 (2010). Crystal structure: PDB 3MXF.

Compound progression (Hit → Lead → Candidate)

Stage	Compound	SMILES
Hit	Stripped Back Core	`CC1C2C(=C(SC=2NCCN=1)C)C`
Lead	Triazole + Acid	`O=C(C[C@@H]1N=C(C)C2C(=C(SC=2N2C1=NN=C2C)C)C)O`
Candidate	(+)-JQ1	`O=C(C[C@H]1C2=NN=C(N2C3=C(C(C4=CC=C(C=C4)Cl)=N1)C(C)=C(S3)C)C)OC(C)(C)C`

Boltz-2 metrics

Metric	Range	Meaning	Trust threshold
Binding Confidence	0–1	How confidently Boltz-2 places the ligand in the binding site.	> 0.7 reliable; > 0.8 high confidence
Optimization Score	0–1	Relative affinity ranking for a congeneric series.	Use for relative ranking
Structure Confidence	0–1	Confidence in the predicted structure.	> 0.8 high confidence

All three metrics need to be high to trust a prediction.

Run Boltz-2 predictions for the Hit, Lead, and JQ1

Compound	Binding Confidence	Optimization Score	Structure Confidence
Hit	0.43	0.22	0.93
Lead	0.74	0.27	0.98
JQ1	0.96	0.44	0.98

Does Binding Confidence increase from hit to clinical candidate?

Yes, Binding Confidence increases monotonically across the drug discovery progression: Hit (0.43) → Lead (0.74) → JQ1 (0.96). This is exactly what we would expect: each optimization stage adds chemical features that improve shape complementarity and specific interactions with the BRD4 acetyl-lysine binding pocket. The Hit (stripped back core) contains only the minimal thienodiazepine scaffold with no substituents to make specific contacts, so Boltz-2 has low confidence in placing it. The Lead adds a triazole and carboxylic acid that mimic the acetyl-lysine pharmacophore, roughly doubling the Binding Confidence. JQ1 adds the chlorophenyl group and tert-butyl ester, filling the WPF shelf and ZA channel of the bromodomain pocket and pushing Binding Confidence to 0.96, well above the 0.8 high-confidence threshold.

Structure Confidence is high for all three compounds (0.93–0.98), indicating that the protein structure itself is well-predicted regardless of the ligand. This makes sense since BRD4 is a well-characterized, rigid globular domain.

Inspect the predicted binding pose for JQ1

JQ1 scores 0.96 Binding Confidence with 0.98 Structure Confidence, indicating a highly reliable predicted pose. Key binding interactions, expected from the known crystal structure (PDB 3MXF):

The triazole ring and methyl group occupy the acetyl-lysine recognition site, forming a hydrogen bond with the conserved asparagine (N140) in the BC loop. This is the hallmark interaction of BET bromodomain inhibitors.
The chlorophenyl ring packs against the WPF shelf (W81, P82, F83), providing hydrophobic anchoring.
The tert-butyl ester group extends into the ZA channel, contributing additional hydrophobic contacts and shape complementarity.
The thienodiazepine core sits at the mouth of the pocket, bridging the ZA and BC loops.

Compare the Optimization Scores

The Optimization Scores track the same progression: Hit (0.22) → Lead (0.27) → JQ1 (0.44). JQ1’s score (0.44) is roughly 63% higher than the Lead’s (0.27), reflecting the substantial affinity gain from adding the chlorophenyl and tert-butyl ester groups. The Hit-to-Lead jump is more modest (0.22 → 0.27, ~23%), consistent with the triazole and acid adding some specific contacts but not yet achieving full pocket occupancy.

Using the categorization thresholds, JQ1 falls squarely in the high confidence binder range (Binding Confidence > 0.80, Opt. Score > 0.40). The Lead sits at moderate confidence (0.74, 0.27, both within the 0.65–0.80 and 0.25–0.40 ranges). The Hit falls in the low confidence / non-binder category (0.43, 0.22), consistent with its role as an unoptimized screening hit.

1K virtual screen

A design project was created in Boltz Lab using PDB 3MXF (BRD4 bromodomain 1 co-crystallized with JQ1) as the structural template. JQ1 was specified as the molecular probe to define the acetyl-lysine binding pocket. The platform automatically detected the binding site from the JQ1 co-crystal pose, identifying key pocket residues: the WPF shelf (W81, P82, F83), BC loop (N140), and ZA channel. Project ID: VS-BRD4WO-5P52.

A virtual screen of 993 AI-designed small molecules was generated from the Enamine REAL chemical space with Drug-Like filtering. All compounds were scored by Boltz-2 against the BRD4 binding pocket.

Score distributions across the library

Metric	Min	Max	Mean
Binding Confidence	0.07	0.85	0.30
Optimization Score	0.00	0.48	0.23
Structure Confidence	> 0.84	> 0.96	~0.92

The vast majority of compounds cluster at low Binding Confidence (< 0.40), consistent with the expectation that random chemical space sampling yields few genuine binders. Structure Confidence remains high throughout (> 0.84), indicating that the protein structure predictions are reliable regardless of ligand quality.

Top 5 compounds by Binding Confidence

Rank	ID	Binding Confidence	Opt. Score	SMILES
1	SM-AQ8GBD73	0.85	0.35	`Cc1cc(-c2cc(C)c(Cl)c(C)c2)cc(C)c1O`
2	SM-VP5CRXFK	0.84	0.25	`CN1Cc2c(NC(=O)c3cccnc3)cccc2C1=O`
3	SM-2MZLAGQT	0.80	0.48	`Cc1nc2c(cc1C(=O)Nc1cnn(CC(C)(C)O)c1C)c(C)nn2C`
4	SM-G95H15CR	0.76	0.20	`CCC(=O)N(C)c1ccc2c(c1)CN(C)C2`
5	SM-1ASUYQAA	0.74	0.34	`CCN(C(=O)C(C)C)c1ccc(Cl)cc1F`

Categorize the results and benchmark against JQ1

Category	Criteria	Count	% of library
High confidence binders	BC > 0.80, OS > 0.40	1	0.1%
Moderate confidence	BC 0.65–0.80, OS 0.25–0.40	13	1.3%
Low confidence / non-binders	BC < 0.65, OS < 0.25	979	98.6%

The reference compounds validate the scoring system:

Compound	Category
JQ1	High confidence binder (0.96 / 0.44)
Lead	Moderate confidence (0.74 / 0.27)
Hit	Low confidence (0.43 / 0.22)

The sole high-confidence AI hit:

ID	Binding Confidence	Opt. Score	Structure Confidence	SMILES
SM-2MZLAGQT	0.80	0.48	0.92	`Cc1nc2c(cc1C(=O)Nc1cnn(CC(C)(C)O)c1C)c(C)nn2C`

SM-2MZLAGQT contains a pyridazine-pyrazole core with multiple methyl groups and an amide linker to a neopentyl alcohol. Structurally distinct from JQ1, but it shares nitrogen-rich heterocyclic character.

How does JQ1 rank alongside the AI-generated library?

JQ1 scores BC = 0.96, OS = 0.44, substantially outperforming every AI-generated compound on Binding Confidence. By BC alone, JQ1 ranks #1 by a wide margin (0.96 vs. the next-best AI compound SM-AQ8GBD73 at 0.85). No AI-generated molecule approaches JQ1’s level of binding confidence.

However, SM-2MZLAGQT (the only high-confidence AI hit) achieves a higher Optimization Score (0.48) than JQ1 (0.44). This is notable: the Optimization Score reflects relative affinity ranking within a congeneric series, and SM-2MZLAGQT’s higher OS suggests it may achieve comparable or slightly better binding affinity despite lower structural confidence in its predicted pose.

Compound	BC Rank	OS Rank	BC	OS
JQ1 (benchmark)	1	2	0.96	0.44
SM-2MZLAGQT	4	1	0.80	0.48
SM-AQ8GBD73	2	6	0.85	0.35
SM-VP5CRXFK	3	—	0.84	0.25

JQ1 is not the top compound by Optimization Score, but it dominates Binding Confidence. This is expected: JQ1 is a highly optimized clinical candidate with known high-affinity binding to BRD4, whereas the AI compounds are generated from general chemical space without iterative medicinal chemistry optimization.

How do the top scoring binders compare in binding pose to JQ1?

The top-scoring AI compound SM-2MZLAGQT (Cc1nc2c(cc1C(=O)Nc1cnn(CC(C)(C)O)c1C)c(C)nn2C) contains a fused pyridazine-pyrazole bicyclic core decorated with methyl groups and an amide-linked pyrazole bearing a neopentyl alcohol. Compared with JQ1’s thienodiazepine scaffold:

Shared pharmacophoric features

Both molecules feature nitrogen-rich heterocyclic cores capable of occupying the acetyl-lysine recognition site and forming hydrogen bonds with N140.
Multiple methyl substituents in both compounds provide hydrophobic contacts with the pocket walls.
Both have molecular weights in the drug-like range (SM-2MZLAGQT ~314 Da vs. JQ1 ~457 Da).

Key structural differences

JQ1 uses a thienodiazepine (7-membered ring with sulfur), while SM-2MZLAGQT uses a pyridazine-pyrazole (two fused 6+5 rings with nitrogen).
JQ1’s chlorophenyl group fills the WPF shelf. SM-2MZLAGQT lacks an equivalent aromatic group, which may explain its lower Binding Confidence.
JQ1’s tert-butyl ester extends into the ZA channel; SM-2MZLAGQT’s neopentyl alcohol (CC(C)(C)O) may partially mimic this interaction but with a hydroxyl instead of an ester.
SM-2MZLAGQT is more compact and lacks the extended hydrophobic features that give JQ1 its high shape complementarity.

The second-highest BC compound, SM-AQ8GBD73 (Cc1cc(-c2cc(C)c(Cl)c(C)c2)cc(C)c1O), is a simple biaryl phenol with chlorine and methyl substitution, structurally much simpler than JQ1. Its high BC (0.85) but moderate OS (0.35) suggests it may sit in the pocket with good shape complementarity but lack the specific pharmacophoric interactions (N140 hydrogen bond, ZA channel occupancy) that drive high affinity.

Selectivity analysis: BRD4 vs. BRD2

This analysis was not performed. A selectivity screen against BRD2 (PDB 5UEN) would require re-running the top-scoring compounds from the BRD4 screen against the BRD2 bromodomain structure and comparing Binding Confidence and Optimization Scores across the two targets. Compounds scoring highly for BRD4 but poorly for BRD2 would indicate selectivity, a desirable property for reducing off-target effects, since BRD4 and BRD2 share highly conserved acetyl-lysine binding pockets. JQ1 itself is a pan-BET inhibitor (binds BRD2, BRD3, and BRD4), so identifying BRD4-selective compounds from the AI screen would represent a potential advantage over the benchmark.

Resources

Resource	Link
Boltz Lab Platform	docs.boltz.bio
Key BRD4 paper	Filippakopoulos P. et al. Nature 468, 1067–1073 (2010)
JQ1 PDB structure	rcsb.org/structure/3MXF

Part C: Phage Lysis Protein Design Challenge

L-Protein (Lysis Protein), 75 residues:

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Region	Residues	Role
Soluble domain	1–40	Interacts with DnaJ.
Transmembrane domain	41–75	Drives lysis activity.

Engineering goals:

DnaJ independence. L-protein folds and functions without requiring DnaJ.
Faster or more efficient lysis. Reduces the window for E. coli to acquire resistance.
Higher L-protein expression. Increases the amount of functional protein produced.

Approach. ESM-2 mutational scanning, experimental mutant data from PMC5775895, and conservation analysis via pBLAST + ClustalOmega were integrated to design five mutant L-protein sequences.

Generate mutational effect scores with ESM-2

The ESM-2 protein language model (650M parameters) was run on the 75-residue L-protein sequence. For each position, all 19 alternative amino acid substitutions were scored by computing the log-likelihood ratio (LLR = mutant log-probability minus wild-type log-probability). Results are saved in ms2/mutation_scores.csv (1,425 mutations across 75 positions).

Metric	Value
Total mutations scored	1,425
Positions	75
Soluble region (1–40)	760 mutations
Transmembrane region (41–75)	665 mutations
Positive LLR (predicted beneficial)	400 (28.1%)
Negative LLR (predicted deleterious)	1,025 (71.9%)

Top 10 highest-scoring substitutions:

Mutation	LLR	Region
C29R	+3.64	Soluble
K50P	+3.56	TM
C29P	+3.17	Soluble
C29Q	+3.06	Soluble
C29S	+3.04	Soluble
K50L	+2.96	TM
C29K	+2.76	Soluble
C29L	+2.74	Soluble
C29A	+2.55	Soluble
C29T	+2.52	Soluble

Two positions dominate the positive LLR landscape: C29 (cysteine at position 29 in the soluble domain) and K50 (lysine at position 50 in the TM domain). ESM-2 strongly prefers substituting the cysteine at position 29, likely because free cysteines are rare in most proteins and the model considers them destabilizing. K50 scores highly because the model views a charged residue in a hydrophobic TM context as unfavorable. The most strongly disfavored mutations are all at the initiator methionine (M1).

Review the experimental mutant data

Experimental mutant data was obtained from PMC5775895 and is stored in ms2/L-Protein Mutants - Sheet1.csv. The dataset contains 139 entries representing 82 unique mutations across 49 positions in the L-protein.

Category	Count
Total entries	139
Unique mutations	82
Missense mutations	100 (59 unique)
Stop codon mutations	39
Missense with lysis = 1 (functional)	35 (19 unique)
Missense with lysis = 0 (non-functional)	65 (40 unique)

Soluble domain (residues 1–40). This region is remarkably tolerant of mutation. Substitutions at R18, R19, R20 (the arginine-rich region) all retain lysis activity despite dramatically changing the charge profile (R18G, R18I, R19H, R19S, R20L, R20W; all lysis = 1). Positions 23 (K→E) and 25 (E→V, E→G, E→D) are also fully tolerant. Notable exceptions: M1 (initiator Met, essential), P6L (lysis = 0), Q8L (lysis = 0), Y39H (lysis = 0). C29R retains lysis, and C29 itself appears to be non-essential for function despite moderate conservation.

Transmembrane domain (residues 41–75). Far less tolerant. Most substitutions abolish lysis. K50 is functionally critical: all four tested substitutions (K50E, K50I, K50N, K50Q) show lysis = 0, yet the protein is still expressed (protein level = 1 for most), indicating that K50 is required for the lysis mechanism itself, not for protein stability. Proline substitutions in the TM helix are generally lethal (L48P, L56P, L57P, L60P all lysis = 0). Rare functional TM mutations include L44P and A45P; prolines at the TM boundary are tolerated, possibly because they sit at the helix-membrane interface. Positions 49–53 (S49, K50, F51, T52, N53) form a particularly intolerant stretch.

Does the experimental data correlate with the language model scores?

The ESM-2 LLR scores show no meaningful correlation with experimental lysis outcomes.

Test	Result
Point-biserial correlation	rₓ₋ = -0.041, p = 0.757
Mann–Whitney U	U = 421, p = 0.511
Mean LLR for lysis = 1 (functional)	-0.560
Mean LLR for lysis = 0 (non-functional)	-0.433

The correlation is essentially zero and far from statistical significance. If anything, the slight negative trend (functional mutations have marginally lower LLR) contradicts the expected direction. The Mann–Whitney U test confirms that the LLR distributions for functional and non-functional mutations are not distinguishable.

Of the 59 matched mutations, ESM-2 predictions agree with experiment in approximately 30 cases (roughly 50%), no better than random.

What does this say about ESM-2 for the L-protein?

ESM-2’s evolutionary signal does not capture the functional constraints of the L-protein. Several factors explain this:

Extreme sequence rarity. The L-protein is a 75-residue protein encoded by an overlapping reading frame in the MS2 genome. It has very few homologs in sequence databases (only 2–3 close relatives, fr and M12, plus a handful of distantly related levivirus lysis proteins). ESM-2 was trained on millions of sequences, and its effectiveness depends on having sufficient evolutionary depth. The L-protein’s shallow phylogenetic tree gives the model little signal to leverage.
Unusual evolutionary constraints. Because the lysis gene overlaps the coat protein and replicase genes, its evolution is constrained by the reading frames of two other genes. The selective pressures captured in ESM-2’s training reflect these overlapping constraints, not the intrinsic functional requirements of the L-protein itself.
Non-standard function. The L-protein is a single-pass transmembrane toxin whose function (membrane disruption) may not follow the same structure–function relationships ESM-2 captures well for globular enzymes.

The protein-level correlation is equally absent (r = 0.039, p = 0.768), confirming that ESM-2 does not predict expression or stability for this protein either.

Where does the model succeed and where does it fail?

Where ESM-2 succeeds

Strongly deleterious mutations at conserved positions. M1I and M1T (LLR = -6.13 and -5.63) are correctly predicted as non-functional. The initiator methionine is universally conserved and essential. Similarly, I42N (LLR = -1.43, lysis = 0) and I46N (LLR = -1.43, lysis = 0) in the transmembrane domain are correctly identified; replacing hydrophobic residues with polar asparagine disrupts TM helix packing.
Proline substitutions in the TM helix. L48P (LLR = -2.31), L56P (LLR = -1.22), L56H (LLR = -2.11), L57P (LLR = -0.42), and L60P (LLR = -0.84) all correctly receive negative LLR and experimentally show no lysis. ESM-2 recognizes that proline is incompatible with alpha-helical transmembrane segments.

Where ESM-2 fails

The arginine-rich soluble region (R18, R19, R20). R18G (LLR = -1.02), R18I (-1.37), R19H (-1.03), R19S (-0.30), R20L (-0.23), and R20W (-2.30) are all predicted deleterious, yet every one permits lysis. This is because the soluble N-terminal domain (residues 1–40) is largely dispensable for lysis activity; the amino-terminal half of the protein can tolerate extensive mutation as long as the transmembrane domain is intact. ESM-2 cannot distinguish “conserved for overlapping gene constraints” from “conserved for L-protein function.”
Position K50 in the TM domain. K50E (LLR = +0.50), K50I (+2.41), K50N (+0.86), and K50Q (+0.78) all receive positive or near-positive LLR scores, yet all four experimentally show no lysis. K50 is a charged “snorkeling” lysine in the TM domain that is apparently critical for membrane disruption. ESM-2 interprets this unusual charged residue in a hydrophobic context as unfavorable, when in fact it is functionally essential.
The failure pattern is region-dependent. Per-region analysis shows a slight positive trend in the soluble domain (rₓ₋ = +0.134) but a slight negative trend in the transmembrane domain (rₓ₋ = -0.166). ESM-2 is marginally better at predicting outcomes in the soluble domain but actively misleading in the transmembrane domain, likely because the functional rules for single-pass TM toxins differ from the evolutionary patterns in ESM-2’s training set.

Conservation analysis via pBLAST + ClustalOmega

A pBLAST search of the L-protein sequence identified 10 levivirus lysis protein homologs: fr (CAA33137), M12 (AAF19634), GA (CAA27498), JP34 (AAA72211), KU1 (AAF67675), BZ13 (ACT66727), Hgal1 (YP007237174), C1 (YP007237128), PP7 (NP042306), and PRR1 (YP717670). These were aligned with ClustalOmega and conservation scores were computed per position (ms2/conservation_scores.csv, ms2/alignment.fasta).

The alignment spans 11 sequences (MS2 L-protein + 10 homologs). Not all sequences cover every position; the N-terminal and C-terminal regions have variable sequence coverage (2–11 sequences per position).

Highly conserved positions (conservation ≥ 0.80):

Position	Residue	Conservation	Shannon Entropy	Region
1	M	1.00	0.00	Soluble
2	E	1.00	0.00	Soluble
3	T	1.00	0.00	Soluble
4	R	1.00	0.00	Soluble
9	S	0.80	0.72	Soluble
12	T	0.80	0.72	Soluble
29	C	0.82	0.68	Soluble
46	I	0.82	0.87	TM
48	L	0.82	0.87	TM
64	I	0.88	0.54	TM
69	T	0.88	0.54	TM
70	L	0.88	0.54	TM
73	L	1.00	0.00	TM
75	T	1.00	0.00	TM

The first four residues (METR) are universally conserved across all homologs. C29 (conservation = 0.82) is notable as the only cysteine in the protein and is highly conserved despite ESM-2 strongly favoring its substitution, highlighting a disconnect between evolutionary conservation and model preferences.

Highly variable positions (conservation ≤ 0.30):

Position	Residue	Conservation	Most common AA	Region
6	P	0.20	P	Soluble
17	N	0.30	M	Soluble
18	R	0.18	G	Soluble
19	R	0.09	L	Soluble
25	E	0.27	K	Soluble
26	D	0.18	E	Soluble
28	P	0.27	L	Soluble
30	R	0.18	S	Soluble
37	T	0.27	R	Soluble
41	L	0.27	W	TM
43	F	0.27	A	TM
50	K	0.30	D	TM
53	N	0.30	S	TM
56	L	0.30	S	TM
74	L	0.29	P	TM

The soluble domain (positions 1–40) shows a gradient: the first four residues are perfectly conserved, then conservation drops substantially in the R18–R20 arginine-rich region (0.09–0.38) and the E25–P28 stretch (0.18–0.27). The transmembrane domain (positions 41–75) has a mix of well-conserved structural residues (I46, L48, I64, T69, L70, L73, T75) and highly variable positions (L41, F43, K50, N53, L56), suggesting that TM helix geometry is maintained but specific side chains can vary.

Design 5 mutant variants

The variants below were selected by integrating three data sources: ESM-2 LLR scores (predicted mutational effect), conservation analysis (10 levivirus lysis protein homologs aligned via ClustalOmega), and experimental lysis data (59 characterized mutations). Selection criteria: positive LLR, non-conserved position (conservation < 0.8), and experimentally supported where available.

Variant 1: L-K23E

Item	Value
Full mutant sequence	`METRFPQQSQQTPASTNRRRPFEHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
Region	Soluble (position 23)
Mutation	K23 → E (lysine to glutamate)
Language model score	+0.289 (predicted beneficial)
Experimental support	Lysis = 1 (functional); Protein level = 0 (not detected by Western blot)
Conservation status	0.545 (moderately variable; Shannon entropy 2.05)
Criteria met	3/3 (positive LLR, non-conserved, experimentally supported)
Rationale	Charge reversal (positive K to negative E) in the soluble domain’s basic region near the DnaJ interaction interface. Position 23 is moderately conserved but shows high entropy (2.05), indicating tolerance for diverse amino acids across levivirus lysis proteins. The K→E substitution replaces the most common residue at this position with a negatively charged alternative, potentially altering the electrostatic interaction surface with DnaJ. Experimentally confirmed to retain lysis activity.
Target goal	DnaJ independence. Charge reversal at the chaperone interaction surface may weaken DnaJ binding while the protein retains lysis function through an alternative folding pathway.

Variant 2: L-E25G

Item	Value
Full mutant sequence	`METRFPQQSQQTPASTNRRRPFKHGDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
Region	Soluble (position 25)
Mutation	E25 → G (glutamate to glycine)
Language model score	+0.251 (predicted beneficial)
Experimental support	Lysis = 1 (functional); Protein level = 0
Conservation status	0.273 (highly variable; most common AA at this position is K, not E)
Criteria met	3/3
Rationale	Position 25 is poorly conserved (0.273); across the 11-sequence alignment this site shows K, E, A, I, R, D, and others, indicating minimal functional constraint. The E→G substitution removes a bulky charged side chain and introduces maximum backbone flexibility. Experimentally confirmed functional.
Target goal	Higher expression. Glycine at this unconstrained position may improve co-translational folding efficiency and reduce dependence on chaperone-assisted folding.

Variant 3: L-K50P

Item	Value
Full mutant sequence	`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSPFTNQLLLSLLEAVIRTVTTLQQLLT`
Region	Transmembrane (position 50)
Mutation	K50 → P (lysine to proline)
Language model score	+3.561 (highest LLR of all candidates)
Experimental support	No direct data for K50P. Caution: K50E, K50I, K50N, and K50Q all show lysis = 0, indicating K50 may be functionally essential.
Conservation status	0.300 (variable; most common AA at this position is D)
Criteria met	2/3 (positive LLR, non-conserved; no direct experimental data)
Rationale	ESM-2 assigns the highest LLR to this mutation because K50 is a charged residue in a hydrophobic TM context, and the model strongly prefers hydrophobic alternatives. However, this represents a known ESM-2 blind spot: K50 appears to be a functionally critical “snorkeling” lysine whose charge is required for membrane disruption. This variant is included as a hypothesis-testing candidate: if K50P retains lysis, it would demonstrate that the helix-breaking property of proline can substitute for the charge-based mechanism.
Target goal	Faster / more efficient lysis. If functional, the proline-induced helix kink could create a more aggressive membrane disruption geometry. This is the highest-risk, highest-reward variant.

Variant 4: L-K50L

Item	Value
Full mutant sequence	`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT`
Region	Transmembrane (position 50)
Mutation	K50 → L (lysine to leucine)
Language model score	+2.956 (second highest LLR)
Experimental support	No direct data for K50L. Same caution as Variant 3: four other K50 substitutions are non-functional.
Conservation status	0.300 (variable)
Criteria met	2/3 (positive LLR, non-conserved)
Rationale	Leucine is the most common residue in alpha-helical TM segments and represents the “default” hydrophobic substitution. Unlike the proline in Variant 3, leucine maintains helix geometry. This variant tests whether the loss of K50’s charge alone abolishes lysis or whether the specific chemistry of K50E/I/N/Q is what fails. Together, Variants 3 and 4 test two hypotheses: (3) can a structural perturbation compensate for charge loss; (4) is any uncharged residue tolerated?
Target goal	Faster / more efficient lysis. If the TM domain can function with a fully hydrophobic helix, this would indicate that membrane insertion efficiency can compensate for the loss of charge-mediated disruption.

Variant 5: L-E25V

Item	Value
Full mutant sequence	`METRFPQQSQQTPASTNRRRPFKHVDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
Region	Soluble (position 25)
Mutation	E25 → V (glutamate to valine)
Language model score	+0.152 (predicted mildly beneficial)
Experimental support	Lysis = 1 (functional); Protein level = 0
Conservation status	0.273 (highly variable)
Criteria met	3/3
Rationale	Same position as Variant 2 (E25G) but with a different substitution strategy. While E25G maximizes flexibility, E25V introduces a branched hydrophobic side chain. This provides a paired comparison at a known-tolerant position: flexibility (G) vs. hydrophobicity (V). Position 25 is adjacent to the conserved C29 (conservation = 0.818), so mutations here probe the boundary between the variable N-terminal region and the more constrained core. Both E25G and E25V are experimentally confirmed functional.
Target goal	DnaJ independence. Replacing the charged glutamate with hydrophobic valine at the soluble-domain surface creates a local hydrophobic patch that may reduce the protein’s requirement for DnaJ-mediated folding assistance.

Summary

Variant	Mutation	Region	LLR	Conservation	Exp. Lysis	Target Goal
1	K23E	Soluble	+0.289	0.545	Yes	DnaJ independence
2	E25G	Soluble	+0.251	0.273	Yes	Higher expression
3	K50P	TM	+3.561	0.300	No data*	Faster lysis
4	K50L	TM	+2.956	0.300	No data*	Faster lysis
5	E25V	Soluble	+0.152	0.273	Yes	DnaJ independence

* Other K50 substitutions (E, I, N, Q) experimentally show no lysis.

Caveats

K50 risk. Variants 3 and 4 target position K50, where 4/4 tested mutations are non-functional. These are hypothesis-testing variants, not safe bets. Lower-risk TM alternatives include L44P (lysis = 1, LLR = -1.84) or A45P (lysis = 1, LLR = -0.43), though these have negative ESM-2 scores.
Position redundancy. The design includes two mutations at position 25 and two at position 50. This enables paired comparisons (flexibility vs. hydrophobicity at pos 25; helix-breaking vs. helix-maintaining at pos 50) but reduces position diversity.
ESM-2 limitations for L-protein. As documented in the correlation analysis, ESM-2 LLR scores do not predict lysis outcomes for this protein (rₓ₋ = -0.041). The conservation analysis and experimental data were therefore weighted more heavily in the final selection.