Designing Peptides Against Mutant SOD1
Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. The A4V mutation — Alanine → Valine at residue 4 — causes one of the most aggressive forms of familial ALS by destabilizing the N-terminus and promoting toxic protein aggregation. The challenge: design short peptides that bind the mutant SOD1 and could serve as leads toward therapy.
The computational pipeline used here — language model generation, structural validation with AlphaFold3, multi-property therapeutic screening — directly maps to the protein engineering logic needed for Aim 2 of Füzi Poiesis. Before engineering SQR and PhoA for halotolerance in Lake Budi's saline conditions, the same workflow (generate candidates → validate structure → screen therapeutic/functional properties) would be required to identify which mutations are tolerated and which disrupt function.
Peptide Generation with PepMLM
The human SOD1 sequence (UniProt P00441) was retrieved and the A4V mutation introduced. Using the PepMLM-650M model conditioned on the mutant sequence, four 12-amino-acid binder peptides were generated; three are listed in the table below alongside the known SOD1-binding peptide FLYRWLPSRRGG, which was added as a reference control. Pseudo-perplexity scores indicate the model's confidence in each generated sequence: lower is better.
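A minimal sketch of how a point mutation such as A4V can be introduced programmatically. It assumes the usual SOD1 convention in which residue numbering skips the initiator methionine, so A4 sits at position 5 of the UniProt entry; the `offset` parameter and the 10-residue fragment are illustrative assumptions, and the wild-type check exists to catch a numbering mismatch before anything is generated.

```python
import re


def apply_mutation(sequence: str, mutation: str, offset: int = 0) -> str:
    """Apply a point mutation such as 'A4V' to a protein sequence.

    offset is the number of leading residues that the mutation numbering
    skips (1 if the numbering ignores the initiator methionine).
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation string: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1 + offset  # convert to a 0-based string index
    if not 0 <= index < len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    if sequence[index] != wild_type:
        raise ValueError(
            f"Expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# Illustrative N-terminal fragment only, assumed to match the start of P00441.
# Replace it with the full sequence downloaded from UniProt.
n_term = "MATKAVCVLK"
mutated = apply_mutation(n_term, "A4V", offset=1)
print(mutated)
```

With `offset=0` the same call raises an error, because position 4 of the methionine-containing fragment is a lysine rather than an alanine.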
| Peptide | Sequence | Pseudo-Perplexity | Origin |
|---|---|---|---|
| Best AI design | WRYGPVALALWK | 12.71 | PepMLM-650M |
| AI design 2 | WHYYVVALALKK | 18.49 | PepMLM-650M |
| AI design 3 | WHYYVVAAAHKE | 19.47 | PepMLM-650M |
| Reference control | FLYRWLPSRRGG | — | Known SOD1 binder |
WRYGPVALALWK achieved the lowest pseudo-perplexity (12.71), meaning PepMLM assigns it the highest likelihood as a binder for this target. A lower pseudo-perplexity indicates a closer fit to the distribution of peptide binders the model learned, conditioned on the mutant SOD1 sequence; it is not a direct measure of binding.
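For reference, pseudo-perplexity is commonly defined as the exponential of the negative mean log-probability a masked language model assigns to each true residue when that residue is masked. The sketch below shows only that arithmetic; it assumes the per-residue natural-log probabilities have already been obtained from the model, and the input values are made up for illustration rather than taken from PepMLM.

```python
import math
from typing import Sequence


def pseudo_perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(-mean log p) over the masked positions (natural logarithms)."""
    if not token_log_probs:
        raise ValueError("At least one log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical inputs for a 12-residue peptide, not real model output.
# A model that spreads probability evenly over 20 amino acids:
uniform = [math.log(1 / 20)] * 12
# A model that gives each true residue probability 0.5:
confident = [math.log(0.5)] * 12

print(pseudo_perplexity(uniform))
print(pseudo_perplexity(confident))
```

Read this way, a score of 12.71 corresponds to the model being about as uncertain as a uniform choice among roughly 13 residues per position.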
Structural Validation — AlphaFold3
Each peptide was submitted to the AlphaFold3 Server alongside the A4V mutant SOD1 sequence as separate chains, modeling the protein-peptide complex. The ipTM score measures inter-chain confidence — how certain the model is about the relative positioning of protein and peptide.
The known control peptide (FLYRWLPSRRGG) achieved an ipTM of 0.35 versus 0.24 for the PepMLM-generated WRYGPVALALWK, so the AI-generated design did not match the structural confidence of the reference binder. Both values are low in absolute terms (an ipTM below about 0.6 is generally read as an unreliable interface prediction), so the comparison is relative rather than evidence of confident binding. In both models the peptide sits on the protein surface rather than being partially buried, and neither peptide localizes near the N-terminus where the A4V mutation sits, engages the β-barrel region, or approaches the dimer interface.
The discrepancy between perplexity rank (WRYGPVALALWK was most confident) and ipTM rank (control outperformed it) highlights a key limitation of sequence-only models: PepMLM optimizes for sequence plausibility given the target, not for structural complementarity with a specific binding site.
Therapeutic Property Screening — PeptiVerse
Structural confidence alone is insufficient for therapeutic development. PeptiVerse was used to evaluate the therapeutic property profile of each PepMLM-generated peptide — predicted binding affinity, solubility, hemolysis probability, net charge, and molecular weight.
Comparing PeptiVerse predictions with the AlphaFold3 structural models shows at most a partial correlation between structural confidence and predicted affinity. WRYGPVALALWK (ipTM 0.24) has one of the highest predicted affinities at 6.238 pKd, and WHYYVVALALKK slightly exceeds it at 6.275 pKd. No peptide was classified as hemolytic or poorly soluble: all show a solubility probability of 1.000 and a hemolysis probability below 0.07.
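To put the pKd values above on a more familiar scale, the sketch below converts them to dissociation constants. It assumes the usual definition pKd = -log10(Kd in mol/L); whether PeptiVerse reports affinity in exactly these units is an assumption here.

```python
def pkd_to_kd_molar(pkd: float) -> float:
    """Convert pKd to Kd in mol/L, assuming pKd = -log10(Kd)."""
    return 10.0 ** (-pkd)


# Predicted affinities quoted in the text.
predicted = {"WRYGPVALALWK": 6.238, "WHYYVVALALKK": 6.275}

kd_nanomolar = {name: pkd_to_kd_molar(pkd) * 1e9 for name, pkd in predicted.items()}
for name, kd in kd_nanomolar.items():
    print(f"{name}: pKd {predicted[name]} -> Kd of about {kd:.0f} nM")
```

Under that assumption both peptides fall in the sub-micromolar range (roughly 0.5 to 0.6 µM), and the difference between them is small compared with the uncertainty expected of a sequence-based predictor.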
The peptide selected for advancement is WRYGPVALALWK: it has the best ipTM among the AI-generated designs, a predicted affinity (6.238 pKd) within 0.04 pKd units of the top-scoring WHYYVVALALKK, maximum solubility (1.000), and very low hemolysis risk (0.05). This balance between binding confidence and therapeutic safety profile makes it the most viable candidate for further optimization.
Controlled Design with moPPIt
moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and simultaneously optimize binding and therapeutic properties. Unlike PepMLM, which samples plausible binders conditioned only on the target sequence, moPPIt allows explicit control over the binding site and multiple objectives at once.
Targeting residues 1–10 near the A4V mutation, moPPIt generated the peptide KCIEGFVPKGTE with scores of 6.264 for affinity, 0.957 for solubility, and 0.646 for hemolysis, plus two additional optimization metrics of 0.654 each. The fundamental difference from PepMLM lies in the design methodology: PepMLM samples candidates from a learned sequence distribution without site-specific guidance, while moPPIt steers generation toward a defined epitope and therapeutic property targets simultaneously.
Before advancing KCIEGFVPKGTE to any further stage, stepwise validation would be required: AlphaFold3 structural prediction to check whether ipTM exceeds the 0.24 baseline, PeptiVerse re-screening of hemolysis (the moPPIt score of 0.646 is far above the values below 0.07 seen for the PepMLM designs), and in vitro stability tests to evaluate resistance to protease degradation in serum.
The moPPIt approach represents a qualitatively different design philosophy — from statistical sampling to multi-objective optimization — that maps directly to the kind of rational engineering needed for proteins operating in complex biological environments like Lake Budi's anoxic sediment.
ESM Language Model Analysis of MS2 Lysis Protein
The lysis protein (L-protein) of bacteriophage MS2 is essential for host cell lysis, a mechanism relevant to understanding how phages could be used against antibiotic-resistant bacteria. The objective: use ESM protein language model scores to predict which mutations at each position are likely to be tolerated or to improve stability and folding, then cross-validate against experimental lysis data.
A positive correlation between empirical data and ESM computational scores is confirmed. Box and scatter plots from the notebook show that experimentally successful mutations (Lysis = 1) consistently cluster in the higher LLR score ranges. Position 39 analysis is particularly clear: all variations at that position received positive LLR values, with substitutions to Alanine (0.007531) and Aspartic Acid (0.007593) being the most prominent — both mutations with documented lytic activity in the experimental CSV.
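One simple way to check the trend described above is to compare the mean LLR of mutations that lyse with the mean LLR of those that do not. The sketch below assumes the notebook output can be reduced to (LLR, Lysis) pairs; the values shown are made-up placeholders, not entries from the experimental CSV.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_llr_by_outcome(records: Iterable[Tuple[float, int]]) -> Dict[int, float]:
    """Group LLR scores by lysis label (0 or 1) and return each group's mean."""
    groups: Dict[int, List[float]] = {0: [], 1: []}
    for llr, lysis in records:
        groups[lysis].append(llr)
    return {label: mean(values) for label, values in groups.items() if values}


# Placeholder data for illustration only: (LLR, Lysis) pairs.
toy_records = [(0.8, 1), (0.3, 1), (0.1, 1), (-1.2, 0), (-0.4, 0), (-2.0, 0)]

summary = mean_llr_by_outcome(toy_records)
print(summary)
```

A higher mean for the Lysis = 1 group is consistent with the box plots, although a formal test on the real data would be needed before treating the difference as significant.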
This supports the view that ESM captures much of the structural and functional tolerance of the protein: the evolutionary information encoded in millions of natural sequences allows the model to flag likely viable mutations before any experiment is run.
Five L-Protein Variants — Design Rationale
Five variants were designed by integrating experimental data (Lysis=1, Protein Levels=1) with ESM score analysis and structural region information. The design strategy prioritizes conservation of key residues identified by BLASTp while introducing changes at positions with documented positive mutational effects.
AlphaFold2-Multimer — Oligomeric Assembly
The working hypothesis is that L-protein does not operate in isolation but assembles oligomerically to create a perforation in the bacterial membrane. AlphaFold2-Multimer was used to test whether Variant 5 (P13L / I46F) can form a stable multimeric pore structure, using the homooligomer format (sequence separated by colons).
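The homooligomer input mentioned above is just the chain sequence repeated and joined with colons. A minimal sketch, in which the copy number and the three-residue stand-in sequence are hypothetical placeholders (the text does not state how many copies were modeled):

```python
def homooligomer_query(sequence: str, copies: int) -> str:
    """Build a colon-separated multimer query from one chain sequence."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return ":".join([sequence] * copies)


# Hypothetical stand-in; substitute the Variant 5 (P13L / I46F) sequence.
placeholder_chain = "MKT"
query = homooligomer_query(placeholder_chain, copies=3)
print(query)
```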
The multimer prediction generated an assembly, but with ipTM ≈ 0.12 and pLDDT < 50 throughout. The PAE maps (predominantly red) confirm that the model detects no strong inter-chain interactions. This result does not disprove the pore hypothesis: physical stabilization of the pore likely depends on the specific lipid environment of the bacterial membrane, which the prediction does not model.
The discrepancy between experimental success (Lysis=1) and low in silico confidence is itself informative: it suggests that L-protein's lytic mechanism is environmentally dependent in a way that current structure prediction tools cannot fully capture without membrane context. This is a genuine limitation of the computational approach, honestly documented here.
This week introduced a core tension in computational protein engineering: models trained on sequence statistics (ESM, PepMLM) and structure prediction tools (AlphaFold) each capture different layers of biological reality. Neither alone is sufficient — the most productive workflow integrates both with experimental validation, exactly the pipeline that Aim 2 of Füzi Poiesis will require before introducing any engineered strain into Lake Budi's environment.
AlphaFold3, PeptiVerse & ESM Analysis
PepMLM vs moPPIt — Comparative Analysis
The two design strategies represent fundamentally different computational philosophies. PepMLM samples binders from a learned sequence distribution conditioned on the full target sequence, without site-specific guidance; moPPIt steers generation toward defined residues while simultaneously optimizing multiple therapeutic objectives.
| Dimension | PepMLM-650M | moPPIt (MOG-DFM) |
|---|---|---|
| Design strategy | Statistical sampling from sequence distribution | Multi-objective guided discrete flow matching |
| Site specificity | None — conditioned on full target sequence | Explicit — residues 1–10 targeted (A4V site) |
| Best candidate | WRYGPVALALWK (pseudo-perplexity 12.71) | KCIEGFVPKGTE |
| ipTM (AlphaFold3) | 0.24 | Not submitted (moPPIt output only) |
| Binding affinity (pKd) | 6.238 (WRYGPVALALWK) | 6.264 |
| Solubility | 1.000 (fully soluble) | 0.957 |
| Hemolysis risk | 0.050 (non-hemolytic) | 0.646 (elevated risk) |
| Key limitation | No binding site guidance — perplexity ≠ structural fit | High hemolysis probability limits therapeutic viability |
PepMLM produced the more therapeutically viable candidate (WRYGPVALALWK): lower hemolysis risk, full solubility, and the only structural confidence value available for comparison (ipTM 0.24; KCIEGFVPKGTE was not submitted to AlphaFold3). moPPIt's controlled targeting of the A4V mutation site came at the cost of elevated hemolysis (0.646), which would preclude therapeutic use without further optimization. The discrepancy between sequence-level confidence (perplexity) and structural confidence (ipTM), with WRYGPVALALWK ranked best by perplexity but below the control by ipTM, reveals a fundamental limitation of language model-based design: statistical plausibility of the sequence does not guarantee structural complementarity to a defined binding site.
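The comparison above amounts to a multi-property filter, which can be sketched as follows. The property values come from the table; the cutoffs (`min_affinity`, `min_solubility`, `max_hemolysis`) are hypothetical thresholds chosen for illustration and are not defined by PeptiVerse or moPPIt.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    sequence: str
    affinity_pkd: float
    solubility: float
    hemolysis: float


def passes_screen(
    candidate: Candidate,
    min_affinity: float = 6.0,    # hypothetical cutoff
    min_solubility: float = 0.9,  # hypothetical cutoff
    max_hemolysis: float = 0.5,   # hypothetical cutoff
) -> bool:
    """Return True only if every property clears its threshold."""
    return (
        candidate.affinity_pkd >= min_affinity
        and candidate.solubility >= min_solubility
        and candidate.hemolysis <= max_hemolysis
    )


candidates: List[Candidate] = [
    Candidate("WRYGPVALALWK", 6.238, 1.000, 0.050),  # PepMLM-650M
    Candidate("KCIEGFVPKGTE", 6.264, 0.957, 0.646),  # moPPIt
]

shortlist = [c.sequence for c in candidates if passes_screen(c)]
print(shortlist)
```

Treating each property as a hard filter rather than a weighted score makes the trade-off explicit: KCIEGFVPKGTE has the higher predicted affinity but is excluded on hemolysis alone.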
ESM Analysis — L-protein Mutants
Part B applies ESM-based sequence analysis to the L-protein of bacteriophage MS2 — a 75-amino-acid lysis protein that causes bacterial cell death by disrupting membrane integrity. The L-protein is one of the most minimal functional proteins known, making it an ideal system for studying sequence-function landscapes: almost every position has been experimentally characterized for its effect on lysis activity.
The ESM2 log-likelihood ratio (LLR) heatmap reveals the mutational tolerance landscape of L-protein across all 75 positions. Yellow signals (positive LLR) indicate tolerated substitutions — positions where the evolutionary model predicts the mutation is consistent with functional protein sequences. Blue/teal signals (near-zero LLR) indicate neutral positions. Dark signals (strongly negative LLR) indicate positions under strong negative selection, where any substitution is expected to disrupt function.
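Each cell of such a heatmap is typically a log-likelihood ratio: the log-probability the model assigns to the mutant residue minus the log-probability of the wild-type residue at the same position. The sketch below shows that calculation for a single position, assuming the per-position probabilities have already been extracted from the model; the probabilities used are invented for illustration, and the notebook's exact scoring scheme (for example masked versus unmasked marginals) is not specified in the text.

```python
import math
from typing import Mapping


def log_likelihood_ratio(
    position_probs: Mapping[str, float], wild_type: str, mutant: str
) -> float:
    """LLR = log p(mutant) - log p(wild type) at one sequence position."""
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Invented probabilities for one position, not real ESM2 output.
probs_at_position = {"A": 0.40, "D": 0.40, "V": 0.10, "P": 0.01}

llr_neutral = log_likelihood_ratio(probs_at_position, "A", "D")
llr_mild = log_likelihood_ratio(probs_at_position, "A", "V")
llr_severe = log_likelihood_ratio(probs_at_position, "A", "P")
print(llr_neutral, llr_mild, llr_severe)
```

An LLR near zero means the model finds the mutant about as plausible as the wild type, while increasingly negative values correspond to the dark cells of the heatmap.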
The heatmap shows a clear structural signature: the central region (approximately positions 25–60), which corresponds to the hydrophobic transmembrane domain of L-protein responsible for membrane anchoring and pore formation, is highly conserved across most substitution types — consistent with the stringent hydrophobicity requirements of membrane-embedded segments. The N-terminal soluble region (positions 1–25) shows comparatively higher mutational tolerance, with several positions accepting hydrophobic-to-hydrophobic substitutions without predicted loss of function.
This analysis guided the selection of the five engineered variants submitted to AlphaFold3 and tested computationally for lysis activity: P13L (soluble region, high tolerance), G14S/A15G (flexible linker), L44P/A45P (transmembrane boundary), I46F (hydrophobic core reinforcement), and the combined P13L/I46F variant targeting both functional domains simultaneously.
As in the oligomeric assembly test described earlier, AlphaFold2-Multimer was run on Variant 5 (P13L / I46F) in homooligomer format to test whether L-protein can form a stable multimeric pore. The prediction returned ipTM ≈ 0.12 and pLDDT < 50 throughout, and the PAE maps were predominantly red across all 5 ranked models, so the model detects no strong inter-chain interactions in a membrane-free context.
This does not disprove the pore hypothesis. All five variants are recorded as Lysis=1 in the experimental data, so the gap between experimental success and low in silico confidence suggests that the lytic mechanism depends on the lipid environment of the bacterial membrane, which current structure prediction tools do not capture. This limitation is documented here as a constraint of the computational approach.
The same tension between sequence-based models and structure prediction applies to Aim 2 of Füzi Poiesis before any engineered protein is introduced into Lake Budi's environment. SQR and PhoA engineering for halotolerance will follow the sequence used here: ESM LLR screening to identify tolerated positions → AlphaFold3 structural validation → PeptiVerse or equivalent functional screening → wet-lab validation in Halomonas elongata under Budi salinity conditions (5–15 g/L NaCl).