Week 5 — Protein Design Part II · Nicolas Escobar Fierro

Part A — SOD1 Binder Design

Designing Peptides Against Mutant SOD1

Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. The A4V mutation — Alanine → Valine at residue 4 — causes one of the most aggressive forms of familial ALS by destabilizing the N-terminus and promoting toxic protein aggregation. The challenge: design short peptides that bind the mutant SOD1 and could serve as leads toward therapy.

Connection to Füzi Poiesis

The computational pipeline used here — language model generation, structural validation with AlphaFold3, multi-property therapeutic screening — directly maps to the protein engineering logic needed for Aim 2 of Füzi Poiesis. Before engineering SQR and PhoA for halotolerance in Lake Budi's saline conditions, the same workflow (generate candidates → validate structure → screen therapeutic/functional properties) would be required to identify which mutations are tolerated and which disrupt function.

Part A · Step 1

Peptide Generation with PepMLM

The human SOD1 sequence (UniProt P00441) was retrieved and the A4V mutation introduced. Using the PepMLM-650M model conditioned on the mutant sequence, four 12-amino-acid binder peptides were generated; the table below lists three of them alongside the reference control. The known SOD1-binding peptide FLYRWLPSRRGG was added as that control. Pseudo-perplexity scores indicate the model's confidence in a sequence: lower is better.

Peptide	Sequence	Pseudo-Perplexity	Origin
Best AI design	WRYGPVALALWK	12.71	PepMLM-650M
AI design 2	WHYYVVALALKK	18.49	PepMLM-650M
AI design 3	WHYYVVAAAHKE	19.47	PepMLM-650M
Reference control	FLYRWLPSRRGG	—	Known SOD1 binder

Key result

WRYGPVALALWK achieved the lowest pseudo-perplexity (12.71), meaning PepMLM assigned it the highest confidence as a plausible binder among the generated designs. Lower perplexity reflects a better fit to the target-conditioned peptide distribution learned by the model; it measures sequence plausibility and is not a direct estimate of binding.
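To make the score concrete, the sketch below computes a pseudo-perplexity from per-residue log-probabilities, using the usual masked-language-model definition (the exponential of the negative mean log-probability of each true residue when it is masked). Whether PepMLM applies exactly this normalization is an assumption, and the input probabilities are hypothetical values, not PepMLM output.

```python
import math

def pseudo_perplexity(token_log_probs):
    """Exponential of the negative mean log-probability over masked positions."""
    if not token_log_probs:
        raise ValueError("need at least one position")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-residue probabilities for a 12-mer (NOT PepMLM output).
# A model guessing uniformly over 20 amino acids at every position:
uniform = [math.log(1 / 20)] * 12
# A model assigning probability 0.5 to the true residue at every position:
confident = [math.log(0.5)] * 12

print(pseudo_perplexity(uniform))    # close to 20
print(pseudo_perplexity(confident))  # close to 2
```

Read this way, a score of 12.71 corresponds roughly to a model that is as uncertain as if it were choosing among about 13 equally likely residues at each position.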

Part A · Step 2

Structural Validation — AlphaFold3

Each peptide was submitted to the AlphaFold3 Server alongside the A4V mutant SOD1 sequence as separate chains, modeling the protein-peptide complex. The ipTM score measures inter-chain confidence — how certain the model is about the relative positioning of protein and peptide.

Control peptide FLYRWLPSRRGG · ipTM 0.35 · pTM 0.83 · surface-bound, no N-terminus localization

AI peptide WRYGPVALALWK — ipTM 0.24 · pTM 0.67 · surface-bound, no N-terminus localization

Structural interpretation

Neither peptide localizes near the A4V mutation site

The known control peptide (FLYRWLPSRRGG) achieved a higher ipTM (0.35) than the PepMLM-generated WRYGPVALALWK (0.24), so the AI-generated design did not match the structural confidence of the reference binder. Both values are low in absolute terms, so neither predicted interface should be read as confident. Both peptides sit on the protein surface rather than being partially buried, and neither localizes near the N-terminus where the A4V mutation sits, engages the β-barrel region, or approaches the dimer interface.

The discrepancy between perplexity rank (WRYGPVALALWK was most confident) and ipTM rank (control outperformed it) highlights a key limitation of sequence-only models: PepMLM optimizes for sequence plausibility given the target, not for structural complementarity with a specific binding site.
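One way to make "localizes near the mutation site" checkable is a minimum-distance test between peptide atoms and residue 4 of SOD1. The sketch below uses toy coordinates, not values from the AlphaFold3 models; the 8 Å cutoff and the function names are assumptions chosen for illustration.

```python
import math

def min_distance(coords_a, coords_b):
    """Smallest Euclidean distance between any pair of points from two sets."""
    return min(math.dist(a, b) for a in coords_a for b in coords_b)

def is_near_site(peptide_coords, site_coords, cutoff=8.0):
    """True if any peptide atom lies within `cutoff` angstroms of the site."""
    return min_distance(peptide_coords, site_coords) <= cutoff

# Toy coordinates in angstroms (hypothetical, for illustration only).
site_residue_4 = [(0.0, 0.0, 0.0)]
peptide_far = [(30.0, 0.0, 0.0), (33.8, 0.0, 0.0)]
peptide_near = [(5.0, 0.0, 0.0), (8.8, 0.0, 0.0)]

print(is_near_site(peptide_far, site_residue_4))   # False
print(is_near_site(peptide_near, site_residue_4))  # True
```

With real models, the coordinate lists would come from the predicted structure file for the peptide chain and for residue 4 of the SOD1 chain.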

Part A · Step 3

Therapeutic Property Screening — PeptiVerse

Structural confidence alone is insufficient for therapeutic development. PeptiVerse was used to evaluate the therapeutic property profile of each PepMLM-generated peptide — predicted binding affinity, solubility, hemolysis probability, net charge, and molecular weight.

PeptiVerse output for WRYGPVALALWK — solubility 1.000, hemolysis 0.050, binding affinity 6.238 pKd, MW 1459.7 Da

Multi-property analysis

WRYGPVALALWK — selected candidate

Comparing PeptiVerse predictions with AlphaFold3 structural models suggests only a partial correlation between structural confidence and predicted affinity. WRYGPVALALWK (ipTM 0.24) has one of the highest predicted affinities at 6.238 pKd, and WHYYVVALALKK slightly exceeds it at 6.275 pKd. No peptide was classified as hemolytic or poorly soluble: all show a solubility probability of 1.000 and a hemolysis probability below 0.07.

The peptide selected for advancement is WRYGPVALALWK: it combines the best ipTM among the AI-generated designs (0.24, still below the control's 0.35), strong predicted affinity (6.238 pKd), maximum solubility (1.000), and very low hemolysis risk (0.05). The balance between binding confidence and therapeutic safety profile makes it the most viable candidate for further optimization.
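Two of the simpler PeptiVerse descriptors can be cross-checked by hand. The sketch below recomputes molecular weight from standard average residue masses and GRAVY from the Kyte-Doolittle hydropathy scale; it is an independent calculation under those standard tables, not PeptiVerse's own code, and the function names are illustrative.

```python
# Average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
# Kyte-Doolittle hydropathy values.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
WATER_MASS = 18.0153  # one water added back for the free termini

def molecular_weight(peptide):
    """Average molecular weight of a linear, unmodified peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS

def gravy(peptide):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(HYDROPATHY[aa] for aa in peptide) / len(peptide)

peptide = "WRYGPVALALWK"
print(f"MW    = {molecular_weight(peptide):.2f} Da")
print(f"GRAVY = {gravy(peptide):.2f}")
```

For WRYGPVALALWK this gives a mass of about 1459.76 Da and a GRAVY of about 0.16, in line with the PeptiVerse values of 1459.7 Da and 0.16 reported for this peptide.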

Part A · Step 4

Controlled Design with moPPIt

moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and simultaneously optimize binding and therapeutic properties. Unlike PepMLM, which samples plausible binders conditioned only on the target sequence, moPPIt allows explicit control over the binding site and multiple objectives at once.

moPPIt result

Targeting residues 1–10 near the A4V mutation, moPPIt generated the peptide KCIEGFVPKGTE with scores of 6.264 for affinity, 0.957 for solubility, and 0.646 for hemolysis, plus two additional optimization metrics that both scored 0.654. The fundamental difference from PepMLM lies in the design methodology: PepMLM samples candidates from a learned sequence distribution without site-specific guidance, while moPPIt steers generation toward a defined epitope and therapeutic property targets simultaneously.

Comparative evaluation

PepMLM vs moPPIt — sampling vs controlled design

Before advancing KCIEGFVPKGTE to any further stage, hierarchical validation would be required: AlphaFold3 structural confirmation to verify that ipTM exceeds the 0.24 baseline, a PeptiVerse safety check on hemolysis (already elevated at 0.646 in the moPPIt output), and in vitro metabolic stability tests to evaluate resistance to protease degradation in serum.

The moPPIt approach represents a qualitatively different design philosophy — from statistical sampling to multi-objective optimization — that maps directly to the kind of rational engineering needed for proteins operating in complex biological environments like Lake Budi's anoxic sediment.

Part C — L-Protein Mutants

ESM Language Model Analysis of MS2 Lysis Protein

The lysis protein (L-protein) of bacteriophage MS2 is essential for host cell lysis, a mechanism relevant to understanding how phages might be used against antibiotic-resistant bacteria. The objective: use ESM protein language model scores to predict which mutations at each position are likely to be tolerated or to improve stability and folding, then cross-validate against experimental lysis data.

ESM2: predicted effects of mutations on the L-protein sequence (LLR). Yellow = tolerated (positive LLR), blue/teal = neutral (near-zero LLR), dark = deleterious (strongly negative LLR). Full sequence METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT.

Model validation

Experimental data correlates with ESM scores

The experimental data correlate positively with the ESM scores. Box and scatter plots from the notebook show that experimentally successful mutations (Lysis = 1) cluster in the higher LLR score ranges. Position 39 is a particularly clear case: all substitutions at that position received positive LLR values, with Alanine (0.007531) and Aspartic Acid (0.007593) the most prominent, and both mutations have documented lytic activity in the experimental CSV.

This supports the view that ESM's learned representations capture much of the structural and functional tolerance of the protein: the evolutionary information encoded in millions of natural sequences lets the model prioritize mutations that are likely to be viable before any experiment is run.
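The comparison between lytic and non-lytic mutations can be summarized with a rank-based statistic. The sketch below computes the probability that a randomly chosen lytic mutation has a higher LLR than a randomly chosen non-lytic one (equivalent to an AUC). The LLR values are toy numbers invented for illustration and are not taken from the notebook or the experimental CSV.

```python
def mean(values):
    return sum(values) / len(values)

def auc(positive_scores, negative_scores):
    """Probability that a positive outscores a negative; ties count as half."""
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))

# Toy LLR values (hypothetical, NOT from the experimental data).
llr_lytic = [0.8, 0.3, 0.1, -0.2]       # mutations with Lysis = 1
llr_nonlytic = [-1.5, -0.9, 0.2, -2.1]  # mutations with Lysis = 0

print("mean LLR, lytic:    ", mean(llr_lytic))
print("mean LLR, non-lytic:", mean(llr_nonlytic))
print("AUC:", auc(llr_lytic, llr_nonlytic))  # 0.875
```

An AUC of 0.5 would mean the LLR carries no information about lysis, and values approaching 1.0 would mean lytic mutations almost always score higher.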

Part C — Proposed Variants

Five L-Protein Variants — Design Rationale

Five strategic variants were designed by integrating experimental data (Lysis=1, Protein Levels=1) with ESM score analysis and structural region information. The design strategy prioritizes conservation of key residues identified by BLASTp while introducing changes at positions with documented positive mutational effects.

Variant 1

P13L / S15A

Soluble Region · N-terminal

Two nearby residues whose individual substitutions each showed complete success (Lysis=1, Protein=1) in the experimental CSV. Removing the proline rigidity at position 13 and replacing serine with alanine at position 15 aims to create a more stable helix in the soluble region.

Variant 2

R18G / R19S

Soluble Region · Charge Stability

Hotspot identified at positions 18 and 19 in the experimental data. Replacing bulky arginine residues with glycine and serine is intended to improve chain flexibility before the membrane region and to facilitate proper anchoring during assembly.

Variant 3

L44P / A45P

Transmembrane Region · Hydrophobicity

Although proline is unusual in transmembrane helices, experimental data show L-protein is extremely tolerant at positions 44 and 45. Tests the flexibility limit within the membrane while maintaining lytic capacity observed in individual assays.

Variant 4

I46F

Transmembrane Region · Hydrophobic Core

Phenylalanine is a bulky, hydrophobic amino acid. Position 46 accepts this change with Lysis=1, Protein=1. The substitution is expected to strengthen van der Waals interactions within the lipid bilayer, stabilizing the transmembrane domain and potentially improving pore geometry.

Variant 5

P13L / I46F

Mixed · Soluble + Transmembrane

Combination of the best individual candidates from each region. Since both function independently (Lysis=1), their combination aims to optimize both external solubility and internal membrane anchoring simultaneously.
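The five variants can be generated programmatically from the wild-type sequence given above. The sketch below parses substitution strings such as P13L, checks that the stated wild-type residue is actually present at that position, and applies the change; the helper name and error messages are illustrative choices.

```python
import re

L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)

def apply_mutations(sequence, mutations):
    """Apply substitutions like 'P13L' (1-based positions) to a sequence."""
    residues = list(sequence)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"{mutation}: position outside the sequence")
        if residues[position - 1] != wild_type:
            raise ValueError(
                f"{mutation}: sequence has {residues[position - 1]} at {position}"
            )
        residues[position - 1] = new
    return "".join(residues)

VARIANTS = {
    "Variant 1": ["P13L", "S15A"],
    "Variant 2": ["R18G", "R19S"],
    "Variant 3": ["L44P", "A45P"],
    "Variant 4": ["I46F"],
    "Variant 5": ["P13L", "I46F"],
}

variant_seqs = {name: apply_mutations(L_PROTEIN, muts) for name, muts in VARIANTS.items()}
for name, seq in variant_seqs.items():
    print(name, seq)
```

The wild-type check is the useful part: it catches off-by-one numbering errors before a mislabeled sequence is submitted to a structure predictor.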

Part C — Pore Hypothesis

AlphaFold2-Multimer — Oligomeric Assembly

The working hypothesis is that L-protein does not operate in isolation but assembles oligomerically to create a perforation in the bacterial membrane. AlphaFold2-Multimer was used to test whether Variant 5 (P13L / I46F) can form a stable multimeric pore structure, using the homooligomer format (sequence separated by colons).
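The colon-separated homooligomer input mentioned above can be assembled as follows. This is a minimal sketch: the number of copies is a hypothetical choice, since the oligomeric state used in the run is not stated here, and the helper name is illustrative.

```python
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)

# Variant 5 (P13L / I46F): 1-based positions 13 and 46 are indices 12 and 45.
VARIANT_5 = L_PROTEIN[:12] + "L" + L_PROTEIN[13:45] + "F" + L_PROTEIN[46:]

def homooligomer_query(sequence, copies):
    """Join identical chains with colons, one chain per copy."""
    if copies < 2:
        raise ValueError("an oligomer needs at least two copies")
    return ":".join([sequence] * copies)

COPIES = 4  # hypothetical oligomeric state, chosen only for illustration
query = homooligomer_query(VARIANT_5, COPIES)
print(query)
```

Because the oligomeric state of the pore is itself part of the hypothesis, running the same query with several values of `COPIES` and comparing ipTM across them would be a natural extension.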

AlphaFold2-Multimer results for L-protein Variant 5

AF2-Multimer — Variant 5 pore prediction. PAE maps (red = high error), pLDDT < 50, ipTM ≈ 0.12. Low confidence across all 5 ranked models.

Result interpretation

In silico pore unstable — membrane context required

The multimer prediction generated an assembly, but with ipTM ≈ 0.12 and pLDDT < 50 throughout. The PAE maps (predominantly red) confirm that the model detects no confident inter-chain interactions. This result does not disprove the pore hypothesis: stabilization of the pore plausibly depends on the specific lipid environment of the bacterial membrane, which the prediction does not model.

The discrepancy between experimental success (Lysis=1) and low in silico confidence is itself informative: it suggests that L-protein's lytic mechanism is environmentally dependent in a way that current structure prediction tools cannot fully capture without membrane context. This is a genuine limitation of the computational approach, honestly documented here.

Reflection

This week introduced a core tension in computational protein engineering: models trained on sequence statistics (ESM, PepMLM) and structure prediction tools (AlphaFold) each capture different layers of biological reality. Neither alone is sufficient — the most productive workflow integrates both with experimental validation, exactly the pipeline that Aim 2 of Füzi Poiesis will require before introducing any engineered strain into Lake Budi's environment.

Part A · Visual Evidence

AlphaFold3, PeptiVerse & ESM Analysis

AlphaFold3 · Control peptide FLYRWLPSRRGG bound to A4V SOD1 · ipTM 0.35 · pTM 0.83

AlphaFold3 · AI design WRYGPVALALWK bound to A4V SOD1 · ipTM 0.24 · pTM 0.67

PeptiVerse · WRYGPVALALWK · Solubility 1.000 · Hemolysis 0.050 · Binding affinity 6.238 pKd · MW 1459.7 Da · Net charge 1.76 · pI 9.99 · GRAVY 0.16

Part A · Synthesis

PepMLM vs moPPIt — Comparative Analysis

The two design strategies represent fundamentally different computational philosophies. PepMLM samples from the statistical distribution of SOD1-interacting peptides without site-specific guidance; moPPIt algorithmically steers generation toward defined residues while simultaneously optimizing multiple therapeutic objectives.

Dimension	PepMLM-650M	moPPIt (MOG-DFM)
Design strategy	Statistical sampling from sequence distribution	Multi-objective guided discrete flow matching
Site specificity	None — conditioned on full target sequence	Explicit — residues 1–10 targeted (A4V site)
Best candidate	WRYGPVALALWK (pseudo-perplexity 12.71)	KCIEGFVPKGTE
ipTM (AlphaFold3)	0.24	Not submitted (moPPIt output only)
Binding affinity (pKd)	6.238 (WRYGPVALALWK)	6.264
Solubility	1.000 (fully soluble)	0.957
Hemolysis risk	0.050 (non-hemolytic)	0.646 (elevated risk)
Key limitation	No binding site guidance — perplexity ≠ structural fit	High hemolysis probability limits therapeutic viability

Key insight

PepMLM produced the more therapeutically viable candidate (WRYGPVALALWK): lower hemolysis risk, full solubility, and a modest structural confidence (ipTM 0.24; the moPPIt peptide was not submitted to AlphaFold3). moPPIt's controlled targeting of the A4V mutation site came at the cost of elevated hemolysis (0.646), which would preclude therapeutic use without further optimization. The discrepancy between sequence-level confidence (perplexity) and structural confidence (ipTM), with WRYGPVALALWK ranked best by perplexity but below the control by ipTM, reveals a fundamental limitation of language model-based design: statistical plausibility of the sequence does not guarantee structural complementarity to a defined binding site.
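The trade-off in the table can be expressed as a Pareto comparison over the three shared properties. The sketch below uses the reported values for the two candidates; the hemolysis and solubility thresholds in `passes_filter` are assumptions for illustration, not cutoffs defined by PeptiVerse or moPPIt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    affinity_pkd: float  # higher is better
    solubility: float    # higher is better
    hemolysis: float     # lower is better

def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    no_worse = (
        a.affinity_pkd >= b.affinity_pkd
        and a.solubility >= b.solubility
        and a.hemolysis <= b.hemolysis
    )
    better = (
        a.affinity_pkd > b.affinity_pkd
        or a.solubility > b.solubility
        or a.hemolysis < b.hemolysis
    )
    return no_worse and better

def passes_filter(c, max_hemolysis=0.5, min_solubility=0.5):
    """Hypothetical safety gate; both thresholds are illustrative assumptions."""
    return c.hemolysis <= max_hemolysis and c.solubility >= min_solubility

pepmlm = Candidate("WRYGPVALALWK", 6.238, 1.000, 0.050)
moppit = Candidate("KCIEGFVPKGTE", 6.264, 0.957, 0.646)

print(dominates(pepmlm, moppit), dominates(moppit, pepmlm))  # False False
print(passes_filter(pepmlm), passes_filter(moppit))          # True False
```

Neither peptide dominates the other, because the moPPIt design has a marginally higher predicted affinity; the PepMLM design is preferred only once a safety gate on hemolysis is applied.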

Part B · MS2 Bacteriophage

ESM Analysis — L-protein Mutants

Part B applies ESM-based sequence analysis to the L-protein of bacteriophage MS2 — a 75-amino-acid lysis protein that causes bacterial cell death by disrupting membrane integrity. The L-protein is one of the most minimal functional proteins known, making it an ideal system for studying sequence-function landscapes: almost every position has been experimentally characterized for its effect on lysis activity.

ESM LLR heatmap for MS2 L-protein — 75 positions, all 20 amino acid substitutions. Yellow = tolerated (positive LLR), blue/teal = neutral, dark = deleterious. Strong conservation in transmembrane region positions 35-60.

ESM2 · MS2 L-protein · Log-likelihood ratio heatmap · All single amino acid substitutions · 75 positions × 20 amino acids

Heatmap Interpretation

Sequence-function landscape of MS2 L-protein

The ESM2 log-likelihood ratio (LLR) heatmap reveals the mutational tolerance landscape of L-protein across all 75 positions. Yellow signals (positive LLR) indicate tolerated substitutions — positions where the evolutionary model predicts the mutation is consistent with functional protein sequences. Blue/teal signals (near-zero LLR) indicate neutral positions. Dark signals (strongly negative LLR) indicate positions under strong negative selection, where any substitution is expected to disrupt function.
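Each cell of such a heatmap is a log-likelihood ratio: the log-probability the model assigns to the mutant residue at a position minus the log-probability it assigns to the wild-type residue. The sketch below shows that calculation for one position; the probabilities are hypothetical values, not ESM2 output, and the function name is illustrative.

```python
import math

def llr_for_position(residue_probs, wild_type):
    """LLR of each substitution: log p(mutant) - log p(wild type)."""
    wt_log_prob = math.log(residue_probs[wild_type])
    return {
        aa: math.log(p) - wt_log_prob
        for aa, p in residue_probs.items()
        if aa != wild_type
    }

# Hypothetical model probabilities at one position with wild-type Y
# (a truncated distribution, NOT real ESM2 output).
probs = {"Y": 0.25, "F": 0.30, "A": 0.26, "D": 0.05}

for aa, score in sorted(llr_for_position(probs, "Y").items(), key=lambda kv: -kv[1]):
    label = "tolerated" if score > 0 else "disfavored"
    print(f"Y->{aa}: LLR = {score:+.3f} ({label})")
```

A positive LLR means the model considers the mutant residue more likely than the wild type in that sequence context, which is what the yellow cells represent.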

The heatmap shows a clear structural signature: the central region (approximately positions 25–60), which corresponds to the hydrophobic transmembrane domain of L-protein responsible for membrane anchoring and pore formation, is highly conserved across most substitution types — consistent with the stringent hydrophobicity requirements of membrane-embedded segments. The N-terminal soluble region (positions 1–25) shows comparatively higher mutational tolerance, with several positions accepting hydrophobic-to-hydrophobic substitutions without predicted loss of function.

This analysis guided the selection of the five engineered variants: P13L/S15A (N-terminal soluble region), R18G/R19S (soluble region, charge stability), L44P/A45P (transmembrane region), I46F (hydrophobic core reinforcement), and the combined P13L/I46F variant targeting both regions simultaneously. Variant 5 (P13L/I46F) was then submitted to AlphaFold2-Multimer to test oligomeric assembly.


Reflection — Connecting to Füzi Poiesis

Applied to Aim 2 of Füzi Poiesis, SQR and PhoA engineering for halotolerance will follow the same sequence: ESM LLR screening to identify tolerated positions → AlphaFold3 structural validation → PeptiVerse or equivalent functional screening → wet-lab validation in Halomonas elongata under Budi salinity conditions (5–15 g/L NaCl).