Week 5 HW: Protein Design Part II

This week we learned how cutting-edge AI and protein language models are used to design functional proteins and peptides “in silico”.


Part A: SOD1 Binder Peptide Design


Part 1: Generate Binders with PepMLM

Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.

SOD1 with A4V

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
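The mutation step can be sketched in Python. This is a minimal illustration, assuming the usual SOD1 convention in which residue numbers refer to the mature protein (initiator methionine removed), so A4 is the fifth residue of the UniProt sequence. The helper `apply_point_mutation` and its `offset` parameter are hypothetical names, not part of any library.

```python
# Canonical human SOD1 (UniProt P00441), including the initiator Met.
SOD1_WT = (
    "MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS"
    "AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV"
    "HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ"
)


def apply_point_mutation(seq, mutation, offset=0):
    """Apply a mutation such as 'A4V'.

    offset shifts the numbering: offset=1 means position 1 is the residue
    after the initiator Met (assumed mature-protein numbering).
    """
    wt, pos, new = mutation[0], int(mutation[1:-1]), mutation[-1]
    idx = pos - 1 + offset
    if seq[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[idx]}")
    return seq[:idx] + new + seq[idx + 1:]


SOD1_A4V = apply_point_mutation(SOD1_WT, "A4V", offset=1)
print(SOD1_A4V[:10])
```

The wild-type check inside the helper guards against an off-by-one in the numbering: with `offset=0` the call fails because position 4 of the full sequence is lysine, not alanine.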


Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card, generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison. Record the perplexity scores that indicate PepMLM’s confidence in the binders (lower perplexity means higher confidence).


Part 2: Evaluate Binders with AlphaFold 3

Navigate to the AlphaFold Server: alphafoldserver.com. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried? In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.
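Submitting five jobs by hand is error-prone, so the inputs can be prepared as a batch file. The sketch below assumes the JSON job layout documented for AlphaFold Server uploads (a list of jobs with `name`, `modelSeeds`, and `sequences` made of `proteinChain` entries); verify it against the server's current documentation before use. Using two SOD1 copies to model the homodimer is an assumption taken from the workflow described later in this writeup, and `build_jobs` is a hypothetical helper name.

```python
import json

SOD1_A4V = (
    "MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS"
    "AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV"
    "HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ"
)
PEPTIDES = ["FLYRWLPSRRGG", "WRYVAAAIARKK", "WRYVAYALRWGE",
            "KRYYWVAVARAA", "HRYVAAAVKWKK"]


def build_jobs(target_seq, peptides, target_copies=2):
    """Return one job per peptide: target chain(s) plus a single peptide chain."""
    jobs = []
    for pep in peptides:
        jobs.append({
            "name": f"SOD1_A4V_{pep}",
            "modelSeeds": [],  # empty list: let the server choose the seed
            "sequences": [
                {"proteinChain": {"sequence": target_seq, "count": target_copies}},
                {"proteinChain": {"sequence": pep, "count": 1}},
            ],
            "dialect": "alphafoldserver",
            "version": 1,
        })
    return jobs


jobs = build_jobs(SOD1_A4V, PEPTIDES)
print(json.dumps(jobs[0], indent=2)[:200])
```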

Perplexity

| Sequence | Perplexity |
| --- | --- |
| FLYRWLPSRRGG | 21.42 |
| WRYVAAAIARKK | 14.24 |
| WRYVAYALRWGE | 26.03 |
| KRYYWVAVARAA | 12.95 |
| HRYVAAAVKWKK | 16.60 |
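For reference, perplexity is conventionally the exponentiated average negative log-likelihood of the sequence. The form below is the standard masked-language-model version for a peptide of length N conditioned on a target t; exactly how PepMLM masks positions when scoring is not described here, so the conditioning notation is an assumption.

```latex
\mathrm{PPL}(x_{1:N} \mid t) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(x_i \mid x_{\setminus i},\, t\right)\right)
```

Lower values mean the model finds the peptide more plausible for the target. By this measure KRYYWVAVARAA (12.95) is the most confident generation and WRYVAYALRWGE (26.03) the least, with the known binder at 21.42 in between.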

Peptide Observations


FLYRWLPSRRGG ⭐ Known Binder — Control

  • ipTM: 0.89 | pTM: 0.92
  • Distance to A4V: 22.285 Å
  • This is the known SOD1-binding peptide and serves as the baseline for all comparisons. The peptide is in the general vicinity of A4V. All PepMLM-generated peptides are evaluated against its ipTM of 0.89, pTM of 0.92, and distance of 22.285 Å.

WRYVAAAIARKK

  • ipTM: 0.85 | pTM: 0.89
  • Distance to A4V: 18.541 Å — 3.744 Å closer than the known binder
  • It sits close to the dimer region and engages the β-barrel.
  • Its ipTM approaches but does not exceed that of the known binder. The peptide appears to be a well-formed, high-confidence β-barrel binder that intersects the surface and is partially buried.

Figure: WRYVAAAIARKK Distance

Figure: WRYVAAAIARKK Overlapping Surface


HRYVAAAVKWKK

  • ipTM: 0.88 | pTM: 0.91
  • Distance to A4V: 27.536 Å — 5.251 Å farther than the known binder
  • Engages the β-barrel region but does not localize near the N-terminus.
  • Appears partially buried. ipTM is just below the known binder at 0.88.

Figure: HRYVAAAVKWKK

Figure: HRYVAAAVKWKK showing surface incursion


WRYVAYALRWGE

  • ipTM: 0.68 | pTM: 0.79
  • Distance to A4V: 12.875 Å — 9.410 Å closer than the known binder
  • Approaches the A4V location (27.152 Å) rather than the dimer interface. Surface-bound and considered near A4V. Confidence is lower than for the known binder, and the peptide shows a partially folded structure.
    • The peptide folds into a secondary structure upon binding rather than remaining a random flexible chain.
    • This is called induced folding, or folding upon binding, and is a hallmark of meaningful peptide-protein interactions.
    • The helix formation suggests the peptide is responding to the local environment of the SOD1 surface.

Figure: WRYVAYALRWGE Cartoon

Figure: WRYVAYALRWGE Surface


KRYYWVAVARAA

  • ipTM: 0.89 | pTM: 0.92
  • Distance to A4V: 17.228 Å — 5.057 Å closer than the known binder
  • Does not localize near the N-terminus; it sits near the middle of Chain 1.
  • Engages the surface mid-chain and does not approach the dimer interface.
  • Surface-bound: the clipping view shows no intrusions.
  • ipTM matches the known binder exactly at 0.89, and its distance of 17.228 Å places it closer to A4V than the control.

Figure: KRYYWVAVARAA surface

Figure: KRYYWVAVARAA peptide with distance to A4V

Figure: KRYYWVAVARAA distance (Closeup)


ipTM Summary and Comparison to Known Binder

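| Peptide | Role | ipTM | pTM | Distance to A4V (Å) | Near A4V? |
| --- | --- | --- | --- | --- | --- |
| FLYRWLPSRRGG | ⭐ Known binder (control) | 0.89 | 0.92 | 22.285 | Vicinity |
| WRYVAAAIARKK | PepMLM generated | 0.85 | 0.89 | 18.541 | Vicinity |
| HRYVAAAVKWKK | PepMLM generated | 0.88 | 0.91 | 27.536 | Far |
| WRYVAYALRWGE | PepMLM generated | 0.68 | 0.79 | 12.875 | Near |
| KRYYWVAVARAA | PepMLM generated | 0.89 | 0.92 | 17.228 | Vicinity |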

The ipTM values across the five peptides (the known binder plus four PepMLM-generated candidates) range from 0.68 to 0.89, indicating generally high predicted confidence in the binding interfaces. Using FLYRWLPSRRGG (ipTM 0.89, distance 22.285 Å) as the known binder control, one generated peptide, KRYYWVAVARAA, matches the control ipTM exactly at 0.89, HRYVAAAVKWKK comes close at 0.88, and none exceeds it. However, high ipTM alone does not confirm therapeutic relevance; proximity to the A4V site also matters. WRYVAYALRWGE carries the lowest ipTM at 0.68 yet achieves the closest proximity to the A4V mutation site at 12.875 Å, 9.410 Å closer than the known binder, and is the only peptide that shows induced folding near the target. This combination of near-vicinity binding and structural reorganization makes it the most therapeutically interesting candidate despite its lower confidence score, and suggests it warrants further optimization to strengthen the binding pose while maintaining its proximity to the A4V site.
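The comparison to the control is simple arithmetic and can be reproduced from the summary table. A minimal sketch, with the values copied from that table and `compare_to_control` as a hypothetical helper name:

```python
CONTROL = "FLYRWLPSRRGG"

# peptide -> (ipTM, distance to A4V in Å), copied from the summary table
results = {
    "FLYRWLPSRRGG": (0.89, 22.285),
    "WRYVAAAIARKK": (0.85, 18.541),
    "HRYVAAAVKWKK": (0.88, 27.536),
    "WRYVAYALRWGE": (0.68, 12.875),
    "KRYYWVAVARAA": (0.89, 17.228),
}


def compare_to_control(results, control):
    """Return peptide -> (ipTM difference, Å closer to A4V than the control)."""
    c_iptm, c_dist = results[control]
    return {
        pep: (round(iptm - c_iptm, 2), round(c_dist - dist, 3))
        for pep, (iptm, dist) in results.items()
        if pep != control
    }


deltas = compare_to_control(results, CONTROL)
# Sort so the peptide closest to A4V comes first.
for pep, (d_iptm, closer) in sorted(deltas.items(), key=lambda kv: -kv[1][1]):
    print(f"{pep}: ipTM {d_iptm:+.2f} vs control, {closer:+.3f} Å closer to A4V")
```

A negative "closer" value means the peptide sits farther from A4V than the control, as is the case for HRYVAAAVKWKK.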

First Pass Analysis and Candidate Selection

What I found in AlphaFold 3 was that my initial peptides were primarily surface binding with varying levels of proximity to the A4V sequence location near the homodimer. WRYVAYALRWGE was not the highest scoring, but was closest to the target and demonstrated induced folding — organizing into a helical secondary structure upon binding rather than remaining flexible, which is a hallmark of meaningful peptide-protein interaction.

Higher ipTM scores did not consistently predict stronger binding affinity or closer proximity to the A4V site. FLYRWLPSRRGG and KRYYWVAVARAA matched the highest ipTM at 0.89 but were farther from the mutation site, while WRYVAYALRWGE at 0.68 was structurally the most relevant.

Selected Candidate

The peptide chosen to advance from this first pass was WRYVAYALRWGE. Despite a hemolysis probability of 0.104 — approximately 2x the known binder control — its induced folding behavior near the A4V site was the deciding factor. The structural response to the local SOD1 environment, combined with its closest proximity to the mutation site at 12.875 Å, outweighed the moderate hemolysis risk at this stage of evaluation. Further analysis via MoPPIT would follow to explore whether higher affinity candidates could be generated with a safer therapeutic profile.

Figure: Properties

Part 4: Generate Optimized Peptides with moPPIt

Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

Open the moPPIt Colab linked from the HuggingFace moPPIt model card. Make a copy and switch to a GPU runtime. In the notebook:

  1. Paste your A4V mutant SOD1 sequence.
  2. Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
  3. Set peptide length to 12 amino acids.
  4. Enable motif and affinity guidance (and solubility/hemolysis guidance if available).
  5. Generate peptides.

After generation, briefly describe how these moPPIt peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

Control / Known Binder Reference: FLYRWLPSRRGG (known binder control | ipTM 0.89 | Binding pKd 5.938 | Hemolysis 0.047). All MoPPIT-generated peptides are evaluated relative to this baseline.


Numeric Summary

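| Peptide | Hemolysis | Solubility | Binding (pKd) | Motif Score |
| --- | --- | --- | --- | --- |
| FLYRWLPSRRGG (control) | 0.047 | 1.000 | 5.938 | n/a |
| RTCGLIETKKQT | 0.982 | 0.833 | 6.298 | 0.693 |
| KKTKTGKFCKQN | 0.977 | 0.917 | 5.715 | 0.755 |
| IKCGNKFKKKYH | 0.957 | 0.833 | 7.713 | 0.632 |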

Property-by-Property Analysis

Binding Affinity (pKd/pKi) — Strongest Property

Two of the three MoPPIT peptides exceed the control baseline: RTCGLIETKKQT by a small margin and IKCGNKFKKKYH by a wide one.

| Peptide | pKd | vs Control |
| --- | --- | --- |
| FLYRWLPSRRGG (control) | 5.938 | baseline |
| RTCGLIETKKQT | 6.298 | +0.360 |
| KKTKTGKFCKQN | 5.715 | −0.223 |
| IKCGNKFKKKYH | 7.713 | +1.775 |

IKCGNKFKKKYH shows the highest binding affinity of any peptide evaluated in this entire session — exceeding the control by +1.775 pKd units and exceeding the best PepMLM candidate (WRYVAYALRWGE, 6.980) by +0.733. This is a notable result. KKTKTGKFCKQN is the only MoPPIT peptide that falls below the control baseline.
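To put the pKd differences on a concentration scale, the values can be converted to dissociation constants. This sketch assumes the predictor reports pKd as −log10 of Kd in mol/L, which is the usual definition but is not confirmed by the tool output shown here; the function names are illustrative.

```python
def pkd_to_kd_molar(pkd):
    """Convert pKd to Kd in mol/L, assuming pKd = -log10(Kd)."""
    return 10 ** (-pkd)


def fold_tighter(pkd_a, pkd_b):
    """How many times lower the Kd of peptide A is than that of peptide B."""
    return 10 ** (pkd_a - pkd_b)


pkd = {
    "FLYRWLPSRRGG (control)": 5.938,
    "RTCGLIETKKQT": 6.298,
    "KKTKTGKFCKQN": 5.715,
    "IKCGNKFKKKYH": 7.713,
}

for name, value in pkd.items():
    kd_nm = pkd_to_kd_molar(value) * 1e9  # mol/L -> nM
    ratio = fold_tighter(value, pkd["FLYRWLPSRRGG (control)"])
    print(f"{name}: Kd ≈ {kd_nm:.1f} nM, {ratio:.1f}x vs control")
```

On that assumption, a pKd of 7.713 corresponds to a Kd of roughly 19 nM against about 1.2 µM for the control, a difference of close to 60-fold. These are model predictions rather than measured affinities.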


Hemolysis Probability — Critical Liability

This is the most significant finding in the MoPPIT dataset and represents a serious concern for all three peptides.

| Peptide | Hemolysis | vs Control | Flag |
| --- | --- | --- | --- |
| FLYRWLPSRRGG (control) | 0.047 | baseline | Safe |
| RTCGLIETKKQT | 0.982 | ~21x control | ⚠️ Critical |
| KKTKTGKFCKQN | 0.977 | ~21x control | ⚠️ Critical |
| IKCGNKFKKKYH | 0.957 | ~20x control | ⚠️ Critical |

All three MoPPIT peptides show hemolysis probabilities approaching 1.0, dramatically higher than the known binder control and well above any therapeutically acceptable threshold. This is likely driven by their cationic, lysine-rich sequences (KK and KKK motifs), which can disrupt negatively charged cell membranes through electrostatic attraction. This critical liability would need to be resolved before any of these peptides could be considered viable candidates.
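A quick composition check helps test the lysine explanation. The sketch below counts lysines and computes a crude charge (K and R minus D and E, ignoring histidine and the termini), which is only a rough proxy and not the net charge reported by the property predictor.

```python
CONTROL = "FLYRWLPSRRGG"

# peptide -> predicted hemolysis probability, copied from the table above
hemolysis = {
    "FLYRWLPSRRGG": 0.047,
    "RTCGLIETKKQT": 0.982,
    "KKTKTGKFCKQN": 0.977,
    "IKCGNKFKKKYH": 0.957,
}


def lysine_count(seq):
    return seq.count("K")


def crude_charge(seq):
    """(K + R) - (D + E); a rough proxy that ignores His and the termini."""
    return sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")


for pep, prob in hemolysis.items():
    fold = prob / hemolysis[CONTROL]
    print(f"{pep}: K={lysine_count(pep)}, crude charge={crude_charge(pep):+d}, "
          f"hemolysis {prob:.3f} ({fold:.1f}x control)")
```

Run on these sequences, the count gives five lysines each for KKTKTGKFCKQN and IKCGNKFKKKYH but only two for RTCGLIETKKQT, whose crude charge is lower than the control's. Lysine density alone may therefore not fully account for that peptide's score.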


Solubility

| Peptide | Solubility | vs Control |
| --- | --- | --- |
| FLYRWLPSRRGG (control) | 1.000 | baseline |
| RTCGLIETKKQT | 0.833 | below control |
| KKTKTGKFCKQN | 0.917 | below control |
| IKCGNKFKKKYH | 0.833 | below control |

All MoPPIT peptides fall below the control solubility of 1.000. While none are insoluble, the reduction in solubility relative to the PepMLM candidates is worth noting — particularly for RTCGLIETKKQT and IKCGNKFKKKYH at 0.833.


Motif Position Score

| Peptide | Motif Score | Interpretation |
| --- | --- | --- |
| RTCGLIETKKQT | 0.693 | Moderate motif complementarity |
| KKTKTGKFCKQN | 0.755 | Highest motif complementarity |
| IKCGNKFKKKYH | 0.632 | Lowest motif complementarity |

KKTKTGKFCKQN shows the strongest motif complementarity to the SOD1 target despite having a below-control binding affinity. This suggests the peptide is well-positioned relative to the SOD1 binding motif but may lack the side chain contacts needed to translate motif recognition into strong affinity. IKCGNKFKKKYH presents an interesting inversion — lowest motif score but highest affinity — suggesting its binding may be driven by non-specific electrostatic contacts rather than precise motif engagement.


Comparative Assessment — MoPPIT vs PepMLM

| Property | Best PepMLM (WRYVAYALRWGE) | Best MoPPIT (IKCGNKFKKKYH) |
| --- | --- | --- |
| Binding pKd | 6.980 | 7.713 |
| Hemolysis | 0.104 | 0.957 |
| Solubility | 0.999 | 0.833 |
| Distance to A4V | 12.875 Å | not yet evaluated |
| Motif Score | not available | 0.632 |
| Induced folding | yes | not yet evaluated |

MoPPIT generates peptides with superior raw binding affinity but at the cost of dramatically elevated hemolysis risk. PepMLM candidates show more balanced profiles with safer hemolysis values and demonstrated structural proximity to the A4V site.


Overall Candidate Assessment

| Peptide | Affinity > Control? | Hemolysis Safe? | Solubility | Motif | Verdict |
| --- | --- | --- | --- | --- | --- |
| FLYRWLPSRRGG | baseline | yes | 1.000 | n/a | Control |
| RTCGLIETKKQT | yes (+0.360) | ⚠️ critical | 0.833 | 0.693 | Needs redesign |
| KKTKTGKFCKQN | no (−0.223) | ⚠️ critical | 0.917 | 0.755 | Needs redesign |
| IKCGNKFKKKYH | yes (+1.775) | ⚠️ critical | 0.833 | 0.632 | High potential, high risk |

Key Takeaway

IKCGNKFKKKYH has the highest predicted binding affinity of any peptide evaluated in this session (pKd 7.713), making it a structurally interesting lead. However, its hemolysis probability of 0.957 makes it unsuitable in its current form. The immediate optimization priority for all three MoPPIT peptides is reducing cationic character — specifically reducing lysine density — to bring hemolysis probability into a safe range while preserving the affinity advantage. AlphaFold structural evaluation of these peptides against the A4V SOD1 dimer would be the recommended next step to assess whether the affinity advantage translates to meaningful proximity to the mutation site.

Additional Investigation

Objective: Identify and resolve hemolysis liability in the highest-affinity MoPPIT peptide while preserving binding affinity to SOD1 A4V.


Stage 1 — Problem Identified

The three MoPPIT-generated peptides showed critically elevated hemolysis probabilities of 0.957–0.982 — approximately 20x the known binder control (FLYRWLPSRRGG, 0.047). The cause was identified as lysine-rich sequences — high cationic density causing electrostatic attraction to and disruption of negatively charged cell membranes.

| Peptide | Hemolysis | Status |
| --- | --- | --- |
| FLYRWLPSRRGG (control) | 0.047 | Safe |
| RTCGLIETKKQT | 0.982 | ⚠️ Critical |
| KKTKTGKFCKQN | 0.977 | ⚠️ Critical |
| IKCGNKFKKKYH | 0.957 | ⚠️ Critical |

Despite the hemolysis liability, IKCGNKFKKKYH was selected for optimization because it showed the highest binding affinity of any peptide in the entire session at pKd 7.713 — exceeding the known binder control by +1.775 units.


Stage 2 — Substitution Strategy Designed

Three variants were designed by targeting the five lysines at positions 2, 6, 8, 9, 10:

I  K  C  G  N  K  F  K  K  K  Y  H
1  2  3  4  5  6  7  8  9  10 11 12

| Variant | Sequence | Substitutions | Strategy |
| --- | --- | --- | --- |
| Original | IKCGNKFKKKYH | none | Control baseline |
| Variant 1 | IQCGNKFKQQYH | K2→Q, K9→Q, K10→Q | Moderate K→Q reduction |
| Variant 2 | IQCGNQFQKNYH | K2→Q, K6→Q, K8→Q, K10→N | Aggressive K→Q reduction |
| Variant 3 | IKCGNEFKKEYH | K6→E, K10→E | Charge balancing with glutamate |
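Substitution lists are easy to mislabel by one position, so it is worth deriving them directly from the sequences. A minimal sketch, with `list_substitutions` as a hypothetical helper:

```python
PARENT = "IKCGNKFKKKYH"
VARIANTS = {
    "Variant 1": "IQCGNKFKQQYH",
    "Variant 2": "IQCGNQFQKNYH",
    "Variant 3": "IKCGNEFKKEYH",
}


def list_substitutions(parent, variant):
    """Return substitutions such as 'K6E' using 1-based positions."""
    if len(parent) != len(variant):
        raise ValueError("sequences must have equal length")
    return [
        f"{p}{i}{v}"
        for i, (p, v) in enumerate(zip(parent, variant), start=1)
        if p != v
    ]


for name, seq in VARIANTS.items():
    subs = list_substitutions(PARENT, seq)
    remaining = seq.count("K")
    print(f"{name}: {', '.join(subs)} (lysines remaining: {remaining})")
```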

Stage 3 — Results

All three variants reached safe hemolysis probabilities (0.035–0.037), slightly below the known binder control (0.047). Binding affinity, however, diverged by strategy.

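| Peptide | Hemolysis | pKd | Net Charge | pI | Classification |
| --- | --- | --- | --- | --- | --- |
| IKCGNKFKKKYH | 0.957 | 7.713 | 4.83 | 10.03 | Medium binding |
| IQCGNKFKQQYH | 0.037 | 6.255 | 1.83 | 9.20 | Weak binding |
| IQCGNQFQKNYH | 0.037 | 6.165 | 0.84 | 8.21 | Weak binding |
| IKCGNEFKKEYH | 0.035 | 7.227 | 0.84 | 8.16 | Medium binding |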

Stage 4 — Key Finding

K→E substitution (glutamate) outperformed K→Q substitution (glutamine) for preserving binding affinity. Variant 3 lost only 0.486 pKd units versus ~1.5 units lost by the Q-substitution variants, possibly because glutamate can form new complementary contacts with the SOD1 surface rather than simply removing charge.

Variant 3 also achieved a net charge of 0.84 and pI of 8.16 — the most physiologically favorable profile of all variants and comparable to the best PepMLM candidate WRYVAYALRWGE (charge 0.77).

Substitution Strategy Comparison

| Strategy | Hemolysis Resolved? | Affinity Retained? | Charge Reduced? | Verdict |
| --- | --- | --- | --- | --- |
| K→Q moderate (Variant 1) | yes | partial (−1.458) | yes | Weak |
| K→Q aggressive (Variant 2) | yes | partial (−1.548) | best | Weak |
| K→E charge balance (Variant 3) | yes | best (−0.486) | best | Lead |

Outcome

IKCGNEFKKEYH emerged as the optimized lead — retaining medium binding classification (pKd 7.227), achieving full hemolysis safety (0.035), and carrying a charge profile (0.84) and pI (8.16) that favor target selectivity over non-specific membrane disruption.

In this peptide series, the K→E glutamate substitution resolved the predicted cationic hemolysis liability with the smallest loss of predicted binding affinity.


Visualization:

  1. Submit IKCGNEFKKEYH to AlphaFold Server against the A4V SOD1 homodimer to evaluate structural proximity to the A4V mutation site.
  2. Compare ipTM and distance to A4V against the best PepMLM candidate WRYVAYALRWGE (12.875 Å) to determine which pipeline produces the stronger structural result.
  3. If structural proximity is confirmed, consider a fourth generation of optimization targeting further charge refinement while monitoring affinity retention.

Figure: IKCGNEFKKEYH Non-Hemolytic - AlphaFold

Figure: IKCGNEFKKEYH Non-Hemolytic Surface

Figure: IKCGNEFKKEYH Non-Hemolytic Illustration


Part C: Final Project: L-Protein Mutants

High-level summary: The objective of this assignment is to improve the stability and auto-folding of the lysis protein of the MS2 phage. Understanding this mechanism is key to understanding how phages could help address antibiotic resistance.

Context & Motivation

The L protein of bacteriophage MS2 is a 74–75 amino acid lysis protein whose stability and auto-folding are critical to understanding how phages can solve antibiotic resistance. The CNN phage therapy case (Strathdee/Patterson) provided real-world context — phage therapy saved a life against Acinetobacter baumannii when all antibiotics failed, underlining why understanding phage lysis mechanisms matters.

“It’s estimated that by 2050, 10 million people per year — that’s one person every three seconds — is going to be dying from a superbug infection.” — Steffanie Strathdee, UC San Diego


Step 1 — Understanding the Problem

  • Established that MS2 encodes 4 proteins: Maturation (A), Coat (CP), Lysis (L), Replicase (Rep)
  • Located L protein on genome: NC_001417 nt 1678–1902
  • Identified the core challenge: L gene overlaps CP and Rep simultaneously
  • Within the overlap regions, any nucleotide mutation in L is also a mutation in a neighboring reading frame

Figure: Overlapping Frames
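The overlap constraint can be illustrated by translating one stretch of DNA in two frames and changing a single base. The fragment below is a short illustrative sequence chosen so that frame 0 reads METRFPQQ; it is not taken from NC_001417, and the real L, CP, and Rep frame offsets are not modeled. The helper names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate complete codons starting at the given frame offset."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def substitute(dna, index, base):
    """Return dna with the base at a 0-based index replaced."""
    return dna[:index] + base + dna[index + 1:]


fragment = "ATGGAAACCCGATTCCCTCAGCAA"  # illustrative only
silent_in_frame0 = substitute(fragment, 5, "G")  # GAA -> GAG, both Glu

for label, dna in [("original", fragment), ("A6G", silent_in_frame0)]:
    print(label, translate(dna, 0), translate(dna, 1))
```

The substitution is synonymous in frame 0 (GAA and GAG both encode glutamate) but changes lysine to arginine in frame 1, which is why a mutation that looks neutral for one gene can still alter an overlapping one.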


Step 2 — Sequence Acquisition

  • Retrieved wildtype L protein sequence (74 aa)
  • Dataset-validated all 32 experimentally constrained positions against the wildtype
  • Attempted live fetch from UniProt (P03609) — network restricted
  • Reconstructed sequence from published Fiers 1976 data + dataset ground truth
  • Downloaded all 4 MS2 protein sequences as FASTA file

Wildtype L Protein (74 aa)

>P03609|Lysis_protein_L|NC_001417:1678-1902
METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK
FTNQLLLSSLCLVFVTSATKQQLS

All 4 MS2 Proteins

| Accession | Protein | Length | Gene |
| --- | --- | --- | --- |
| P03589 | Maturation protein A | 393 aa | mat |
| P69141 | Coat protein CP | 129 aa | cp |
| P03609 | Lysis protein L | 75 aa | lys |
| P00585 | Replicase beta subunit | 544 aa | rep |

Step 3 — ORF Analysis

Mapped all three reading frames across the L gene window (nt 1620–1950):

GENE MAP (NC_001417):

  ════════════════════════╗
  Coat Protein (CP)       ║ STOP
  (nt 1335 → 1724)        ╚══════════════════════════════════════
                                    ╔══════════════════════════════
                                    ║ Replicase (Rep)
                    ╔═══════════════╩══════════════════════════╗
                    ║          L PROTEIN (LYSIS)               ║ STOP
                    ╚══════════════════════════════════════════╝

Three Overlap Zones

| Zone | Nucleotides | L aa | Risk | Notes |
| --- | --- | --- | --- | --- |
| Zone 1: CP ∩ L | nt 1678–1724 | aa 1–15 | HIGH ⚠ | Affects coat protein |
| Free Zone: L only | nt 1725–1760 | aa 16–28 | LOW ✓ | Safest mutation window |
| Zone 2: L ∩ Rep | nt 1761–1902 | aa 29–75 | HIGH ⚠ | Affects replicase |
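The mapping from nucleotide coordinate to L codon is plain integer arithmetic, and it is worth checking which codons sit on a zone boundary. This sketch takes the L start (nt 1678) and the zone coordinates from the text as given; the function names are illustrative.

```python
L_START, L_END = 1678, 1902  # L gene coordinates as stated above

# (zone name, first nt, last nt), copied from the zone table
ZONES = [
    ("Zone 1 (CP ∩ L)", 1678, 1724),
    ("Free Zone (L only)", 1725, 1760),
    ("Zone 2 (L ∩ Rep)", 1761, 1902),
]


def l_codon(nt):
    """1-based L codon number containing a genome nucleotide position."""
    if not L_START <= nt <= L_END:
        raise ValueError(f"nt {nt} is outside the L gene")
    return (nt - L_START) // 3 + 1


def zone_for_nt(nt):
    for name, first, last in ZONES:
        if first <= nt <= last:
            return name
    raise ValueError(f"nt {nt} is outside all zones")


def zones_for_codon(codon):
    """Zones touched by the three nucleotides of an L codon, in order."""
    first = L_START + 3 * (codon - 1)
    return list(dict.fromkeys(zone_for_nt(nt) for nt in range(first, first + 3)))


for codon in (15, 16, 28, 29, 44):
    print(codon, zones_for_codon(codon))
```

With these coordinates, codons 16 and 28 each straddle two zones, so the amino acid ranges in the table are approximate at the boundaries. The 225 nt span holds 75 codons, consistent with 74 residues plus a stop codon.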

Step 4 — Domain Mapping

aa:  1                    ~40  ~41                      74
     |─────────────────────|────|─────────────────────────|
     [════ N-terminal soluble ════][═══ Transmembrane domain ═══]
           (aa 1–40)                      (aa 41–74)
           DISPENSABLE for lysis          ESSENTIAL for lysis
           DnaJ chaperone binding         Membrane insertion + pore formation

Key Domain Properties

Soluble Domain (aa 1–40)

  • Dispensable for lysis function
  • Primary DnaJ chaperone interaction site
  • Spans Zone 1 and Free Zone
  • Best target for stability mutations

Transmembrane Domain (aa 41–74)

  • Essential for lysis
  • Drives membrane insertion
  • Forms oligomeric pores in cell envelope
  • Entirely within Zone 2 (Rep overlap)
  • Hydrophobic core must be preserved

TM Boundary Zone (aa 38–46) — Key Target

The soluble→TM junction is the most tractable region for stability improvement:

aa:  38   39   40   41   42   43   44   45   46   47
      L    Y    V    L    I    F    L    A    I    F
                     |────────────────────────────────
                     TM domain starts
                                    ↑    ↑    ↑
                                   L44  A45  I46  ← lysis=1 mutations here

Step 5 — Mutation Dataset Analysis

Parsed 57 unique non-stop mutations from experimental dataset.

Dataset Summary

Each mutation annotated with:

  • Nucleotide position and base pair change
  • Amino acid position and substitution
  • Lysis activity (0/1)
  • Protein expression level (0/1/ND)

TM Region Mutations

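| Mutation | aa pos | Lysis | Protein | Notes |
| --- | --- | --- | --- | --- |
| L44P | 44 | 1 | 1 | Proline kink at TM entry |
| A45P | 45 | 1 | 1 | Proline kink at TM entry |
| I46F | 46 | 1 | 1 | Aromatic anchor |
| I46N | 46 | 0 | 0 | Polar, abolishes both |
| F47Y | 47 | 0 | 1 | Protein made, no lysis |
| L48P | 48 | 0 | 1 | Protein made, no lysis |
| K50E/N/I | 50 | 0 | 1 | Charge changes, no lysis |
| V63E | 63 | 0 | 1 | Protein made, no lysis |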

Key finding: Only L44P, A45P, I46F in the TM region retain lysis=1


Step 6 — Mutation Combination Generator

Built Python pipeline to generate random 2-residue combinations with three strategies:

  • random — any valid pair from full pool
  • lysis_positive — both mutations have lysis=1
  • mixed — at least one lysis+ and one lysis- per combo
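The three strategies can be sketched as filters over all valid pairs. This is a minimal reconstruction under stated assumptions, not the pipeline actually used: the mutation pool is a small subset copied from the TM-region table, "valid" is taken to mean the two mutations hit different residues, and the function names and fixed seed are illustrative.

```python
import itertools
import random

# mutation -> lysis activity (0/1); subset copied from the TM-region table
MUTATIONS = {
    "L44P": 1, "A45P": 1, "I46F": 1,
    "I46N": 0, "F47Y": 0, "L48P": 0, "V63E": 0,
}


def position(mutation):
    return int(mutation[1:-1])


def valid_pairs(pool):
    """All 2-mutation combinations that touch two different residues."""
    return [
        pair for pair in itertools.combinations(sorted(pool), 2)
        if position(pair[0]) != position(pair[1])
    ]


def generate(pool, strategy="random", n=3, seed=0):
    pairs = valid_pairs(pool)
    if strategy == "lysis_positive":
        pairs = [p for p in pairs if all(pool[m] == 1 for m in p)]
    elif strategy == "mixed":
        pairs = [p for p in pairs if {pool[m] for m in p} == {0, 1}]
    elif strategy != "random":
        raise ValueError(f"unknown strategy: {strategy}")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return rng.sample(pairs, min(n, len(pairs)))


for strategy in ("random", "lysis_positive", "mixed"):
    print(strategy, generate(MUTATIONS, strategy))
```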

Generated Mutants

| Mutant | Combo | Lysis | Notes |
| --- | --- | --- | --- |
| MUT1 | S15A + K50I | 1/0 | Mixed: epistasis candidate |
| MUT2 | M1T + L56H | 0/0 | Both abolish lysis: stability study |
| MUT3 | Y39H + V40E | 0/0 | Adjacent: soluble→TM junction |

MUT1 Sequence (S15A + K50I)

>MS2_L_MUT1|S15A_K50I|lysis=1/0|prot=1/1
METRFPQQSQQTPAATNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSI
FTNQLLLSSLCLVFVTSATKQQLS

MUT2 Sequence (M1T + L56H)

>MS2_L_MUT2|M1T_L56H|lysis=0/0|prot=0/1
TETRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK
FTNQLHLSSLCLVFVTSATKQQLS

MUT3 Sequence (Y39H + V40E)

>MS2_L_MUT3|Y39H_V40E|lysis=0/0|prot=0/0
METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLHELIFLAIFLSK
FTNQLLLSSLCLVFVTSATKQQLS
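The mutant records above can be regenerated, and checked, from the wildtype sequence. The sketch uses the 74-residue wildtype exactly as given in this writeup; `mutate` and `to_fasta` are hypothetical helpers, and the 50-character line width simply mirrors the records shown.

```python
# Wildtype L protein as listed in this writeup (74 aa).
WT_L = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK"
    "FTNQLLLSSLCLVFVTSATKQQLS"
)


def mutate(seq, mutations):
    """Apply substitutions such as ['S15A', 'K50I'], checking each wildtype residue."""
    residues = list(seq)
    for m in mutations:
        wt, pos, new = m[0], int(m[1:-1]), m[-1]
        if residues[pos - 1] != wt:
            raise ValueError(f"{m}: found {residues[pos - 1]} at position {pos}")
        residues[pos - 1] = new
    return "".join(residues)


def to_fasta(header, seq, width=50):
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{header}\n" + "\n".join(lines)


combos = {
    "MUT1": ["S15A", "K50I"],
    "MUT2": ["M1T", "L56H"],
    "MUT3": ["Y39H", "V40E"],
}
for name, muts in combos.items():
    print(to_fasta(f"MS2_L_{name}|{'_'.join(muts)}", mutate(WT_L, muts)))
```

Because each substitution is validated against the wildtype residue, a mislabeled position raises an error instead of silently producing the wrong sequence.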

Step 7 — Evaluation Pipeline

Note: in this section, Claude AI recommended a Python-based scoring system. It is optional but informs the pipeline.

6-Gate Scoring Framework

INPUT: Mutation combination (2+ residues)
│
├─► GATE 1: Dataset Filters
│         ├── Lysis activity (0/1)
│         ├── Protein expression level (0/1)
│         └── Stop codon in combo?
│
├─► GATE 2: Sequence Checks
│         ├── GC content (40–60%)
│         ├── Isoelectric point (pI)
│         ├── GRAVY hydrophobicity score
│         └── Rare codons at N-terminus?
│
├─► GATE 3: ORF Overlap Check
│         ├── Coat protein frame intact?
│         ├── Replicase frame intact?
│         └── No new stop codons in overlapping frames?
│
├─► GATE 4: Structure Prediction
│         ├── pLDDT ≥ wildtype (ESMFold / AF2)
│         ├── pTM > 0.5
│         └── TM helix (aa 41–74) intact?
│
├─► GATE 5: Stability Scoring
│         ├── ΔΔG ≤ 0 kcal/mol (FoldX / mCSM)
│         └── Solubility score (CamSol)
│
└─► GATE 6: Multimer Check
          ├── ipTM ≥ wildtype (AF2_Multimer)
          └── TM domain interface contacts preserved?
                     │
               SHORTLIST: Top 2–5 candidates
                     │
               WET LAB VALIDATION
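One of the Gate 2 checks, the GRAVY score, needs nothing beyond the sequence. The sketch below uses the Kyte-Doolittle hydropathy scale and the domain boundaries stated in this writeup (aa 1–40 soluble, aa 41–74 transmembrane); the pass/fail rule in `tm_still_hydrophobic` is a hypothetical threshold for illustration, not part of the framework above.

```python
# Kyte-Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Wildtype L protein as listed in this writeup (74 aa).
WT_L = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK"
    "FTNQLLLSSLCLVFVTSATKQQLS"
)


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KD[aa] for aa in seq) / len(seq)


def split_domains(seq, tm_start=41):
    """Split into (soluble, transmembrane) using a 1-based TM start position."""
    return seq[:tm_start - 1], seq[tm_start - 1:]


def tm_still_hydrophobic(mutant, wildtype=WT_L, max_drop=0.3):
    """Hypothetical gate: TM-domain GRAVY may not fall more than max_drop."""
    return gravy(split_domains(mutant)[1]) >= gravy(split_domains(wildtype)[1]) - max_drop


soluble, tm = split_domains(WT_L)
print(f"soluble GRAVY {gravy(soluble):.2f}, TM GRAVY {gravy(tm):.2f}")
```

Scoring the two domains separately matters here because a whole-protein average would hide a loss of hydrophobicity in the transmembrane segment behind the strongly hydrophilic N-terminal region.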

Key Metrics Ranked by Complexity

| Rank | Metric | Tool | Complexity |
| --- | --- | --- | --- |
| 1–3 | Lysis, protein level, stop codons | Dataset | Instant |
| 4–9 | GC, MW, pI, GRAVY, cysteines, codons | BioPython / ExPASy | Minutes |
| 10–15 | ORF check, secondary structure, solubility | Benchling / PSIPRED | Minutes |
| 16–20 | pLDDT, pTM, TM helix, clashes | ESMFold / AF2 | Hours |
| 21–23 | ΔΔG, membrane insertion energy | FoldX / mCSM | Hours |
| 24–28 | Conservation, epistasis, wet lab | Leviviridae MSA + bench | Days |

Step 8 — TM Mutation Shortlist (Top 5 Candidates)

| Rank | Combo | Rationale |
| --- | --- | --- |
| 1 | L44P + A45P | Both lysis=1, prot=1. Double proline at TM entry |
| 2 | L44P + I46F | Both lysis=1, prot=1. Proline kink + aromatic anchor |
| 3 | A45P + I46F | Both lysis=1, prot=1. Classic TM stabilization |
| 4 | I46F + S49T | Mixed lysis: epistatic rescue candidate |
| 5 | L44P + N53S | TM entry + core: rescue test |

Step 9 — Structural Analysis Tools

| Tool | Purpose | Gate |
| --- | --- | --- |
| Benchling | ORF-safe mutation design | Gate 3 |
| ESMFold / AF2 | Structure prediction | Gate 4 |
| ChimeraX | 3D visualization, residue swapping | Gate 4 |
| FoldX / mCSM | ΔΔG stability scoring | Gate 5 |
| AF2_Multimer | Oligomeric assembly prediction | Gate 6 |
| ProteinMPNN | AI-guided sequence redesign | Design |
| QuikChange | Wet lab site-directed mutagenesis | Synthesis |

Key Biological Insights

  • The L protein’s overlapping reading frames are the primary constraint on mutation design
  • The Free Zone (aa 16–28) is the only region where mutations affect L protein alone
  • The TM boundary (aa 44–46) is the most promising target — lysis=1 mutations exist there
  • L protein functions as an oligomer — monomer folding alone is insufficient
  • DnaJ chaperone interaction with the soluble domain is critical for proper folding
  • The C-terminal TM domain drives both membrane insertion and pore formation

Outstanding Steps

[ ] Run ESMFold on all 5 candidate mutants → get pLDDT scores
[ ] Run FoldX / mCSM → get ΔΔG for each candidate
[ ] Run AF2_Multimer → check dimer ipTM scores
[ ] Run Benchling ORF check → verify CP and Rep frames intact
[ ] Rank and select final top 5 candidates
[ ] Synthesize top 2 in wet lab → SDS-PAGE + lysis assay

Appendix: Pipeline Summary with Key AI-Generated Prompts (Claude - Sonnet 4.6)


Stage 1 — Sequence Retrieval and Mutation Introduction

The session began with retrieving the canonical human SOD1 sequence from UniProt (P00441) and introducing the A4V point mutation, which replaces alanine with valine at position 4 of the mature protein. This established the disease-relevant target sequence for all downstream analysis. Key concepts clarified included the numbering convention between the full canonical sequence and the mature processed form, and the biological significance of A4V as the most common fALS-linked SOD1 variant in North America.

Key prompts:

“Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.” “What is the A4V mutation?” “What is a homodimer?”


Stage 2 — Conceptual Grounding

Before moving to computational tools, foundational questions established the biological framework: the structural location of A4V at β-strand 1 near the dimer interface, and the therapeutic rationale for designing a peptide binder — to intercept misfolded A4V SOD1 at the aggregation-prone hydrophobic surface exposed by dimer destabilization.

Key prompts:

“Remind, where is the actual critical region of A4V?” “Summarize what the fundamental purpose is to take the mutated protein and add a binder sequence.”


Stage 3 — AlphaFold Server Workflow

The AlphaFold Server workflow was established: inputting two copies of the A4V SOD1 sequence as Entity 1 to model the native homodimer, and the peptide as Entity 2. The distinction between protein chains and small molecule ligands was clarified. The rationale for five ranked models per job was explained, and the rank_0 CIF file was identified as the correct starting point.

Key prompts:

“In AlphaFold Server how to add a peptide to a protein sequence?” “When evaluating in AlphaFold should I be using one strand of the SOD1 sequence or two, to show the mutant form?” “Each export from AlphaFold includes 4 CIF files. Why?”


Stage 4 — ChimeraX Structural Evaluation

The built-in AlphaFold viewer was identified as insufficient for detailed analysis, leading to adoption of ChimeraX. A core command vocabulary was developed iteratively through troubleshooting: chain coloring, secondary structure coloring, residue labeling to landmark A4V, surface generation with transparency to assess binding depth, and distance measurement to quantify proximity to residue 4. Common errors were resolved including chain specification syntax, atom ambiguity, electrostatic surface cap persistence, and model number conflicts.

Key prompts:

“Is it better to evaluate in AlphaFold or in another visual program to get to these answers?” “When loading a model into ChimeraX and evaluating, summarize key questions for evaluating visually.” “Summarize how to best answer the questions, and what ChimeraX visualization will work best.”


Stage 5 — Peptide Observation and Scoring

Five peptides (the known binder plus four PepMLM-generated candidates) were evaluated across AlphaFold confidence metrics (ipTM, pTM) and ChimeraX structural observations (distance to A4V, structural feature engagement, binding depth). A key insight emerged: WRYVAYALRWGE, with the lowest ipTM (0.68), showed the closest proximity to A4V (12.875 Å) and was the only peptide to show induced folding, a hallmark of meaningful peptide-protein interaction. FLYRWLPSRRGG (ipTM 0.89, distance 22.285 Å) was established as the known binder control baseline.

Key prompts:

“What if a lower ipTM has a closer proximity to the A4V location?” “In this case the peptide starts to show a helical fold.” “Update to final summary — FLYRWLPSRRGG is the control or known SOD1-binding peptide.”


Stage 6 — Physicochemical Property Analysis (PepMLM)

PepMLM peptide physicochemical properties were analyzed relative to FLYRWLPSRRGG across seven dimensions: solubility, hemolysis, binding affinity (pKd/pKi), molecular weight, net charge, isoelectric point, and hydrophobicity (GRAVY). Three peptides exceeded the control binding affinity. WRYVAYALRWGE showed the highest pKd (6.980), lowest net charge (0.77), and lowest pI (8.59) — the most favorable selectivity profile.

Key prompts:

“Analyze the results.” (physicochemical data pasted) “Revise the plot — make FLYRWLPSRRGG the baseline and first value, color in gray bar.” “Format the detailed analysis as a markdown file.”


Stage 7 — MoPPIT Peptide Generation and Analysis

Three MoPPIT-generated peptides were introduced and analyzed. All three showed critically elevated hemolysis probabilities (0.957–0.982, ~20x control) driven by lysine-rich sequences. Despite this, IKCGNKFKKKYH was identified as highest-affinity peptide of the entire session at pKd 7.713. Motif position scores were introduced as an additional evaluation dimension.

Key prompts:

“What is a motif position?” “Graph the following MoPPIT generated peptide binders.” (data pasted) “Format the analysis of MoPPIT data in Hugo markdown format.”


Stage 8 — Hemolysis Resolution

The hemolysis liability of IKCGNKFKKKYH was addressed through systematic lysine substitution. Three variants were designed and screened. The K→E glutamate substitution strategy (Variant 3: IKCGNEFKKEYH) outperformed K→Q substitution — retaining medium binding classification (pKd 7.227), reducing net charge to 0.84, and achieving full hemolysis safety (0.035) comparable to the known binder control.

Key prompts:

“Summarize hemolysis probability and what we may do to resolve.” “Recommend three peptides derived from IKCGNKFKKKYH that might lower hemolysis.” “Here are the results of an attempt to lower hemolysis.” (variant data pasted)


Stage 9 — Synthesis and Outputs

All findings were compiled into structured Hugo markdown deliverables: peptide binding observations, numeric summary tables, ipTM vs distance scatter plot, six-panel physicochemical bar chart, MoPPIT analysis, hemolysis resolution pipeline summary, and this appendix. Two lead candidates emerged from separate pipelines for further structural validation.

Key prompts:

“Plot the data points in a visual graphic, highlighting the likely candidate.” “Download summary of the attempt to achieve hemolysis safety in Hugo markdown format.” “Revise the hemolysis summary in Hugo markdown format.”


Distilled Conclusion

Two lead candidates emerged from this session across two separate peptide generation pipelines:

WRYVAYALRWGE (PepMLM) — closest structural proximity to A4V (12.875 Å), highest PepMLM binding affinity (pKd 6.980), induced folding behavior upon binding, and the most favorable charge selectivity profile (net charge 0.77). Recommended for AlphaFold dimer evaluation and structural confirmation.

IKCGNEFKKEYH (MoPPIT, optimized) — highest binding affinity of the full session after hemolysis optimization (pKd 7.227, down from 7.713 for the parent IKCGNKFKKKYH), predicted hemolysis probability reduced to 0.035, net charge 0.84, pI 8.16. Glutamate substitution (K→E) proved the superior strategy over glutamine substitution (K→Q), reducing charge with only a modest loss of predicted affinity.

The recommended next step for both candidates is AlphaFold Server structural evaluation against the A4V SOD1 homodimer, followed by distance-to-A4V measurement in ChimeraX to determine which pipeline produces the more structurally relevant binder.
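The distance-to-A4V measurement can also be scripted as a cross-check on the interactive ChimeraX reading. The sketch below is a minimal example under stated assumptions: the structure is available as PDB-format text (AlphaFold Server exports mmCIF, so a conversion step would be needed), SOD1 is chain A and the peptide is chain B, and "distance" means the minimum atom-to-atom distance between residue 4 and the peptide. The convention behind the distances reported in this write-up is not stated, so values may differ. The helper names and the toy coordinates are hypothetical.

```python
import math


def pdb_line(serial, name, resname, chain, resseq, x, y, z):
    """Build one fixed-column PDB ATOM record (toy data for the demo only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(pdb_text):
    """Return (chain, residue number, atom name, xyz) for every ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((line[21], int(line[22:26]), line[12:16].strip(), xyz))
    return atoms


def min_distance_to_residue(atoms, target_chain, target_resseq, peptide_chain):
    """Minimum atom-atom distance between one residue and a whole chain."""
    target = [a[3] for a in atoms
              if a[0] == target_chain and a[1] == target_resseq]
    peptide = [a[3] for a in atoms if a[0] == peptide_chain]
    if not target or not peptide:
        raise ValueError("target residue or peptide chain not found")
    return min(math.dist(p, q) for p in target for q in peptide)


# Toy coordinates, not taken from any predicted structure.
toy_pdb = "\n".join([
    pdb_line(1, "CA", "VAL", "A", 4, 0.0, 0.0, 0.0),
    pdb_line(2, "CA", "TRP", "B", 1, 3.0, 4.0, 0.0),
    pdb_line(3, "CA", "ARG", "B", 2, 10.0, 0.0, 0.0),
])
distance = min_distance_to_residue(parse_atoms(toy_pdb), "A", 4, "B")
print(f"{distance:.3f} Å")
```

Applying the same function to both lead candidates would put the two pipelines on a common distance definition before they are compared.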


Part 3 — Footnote Attributions

Databases and Sequence Resources

  1. UniProt — Human SOD1 canonical sequence (P00441 / SODC_HUMAN). UniProt Consortium. UniProt: the Universal Protein knowledgebase. Nucleic Acids Research. https://www.uniprot.org/uniprotkb/P00441

Structure Prediction

  1. AlphaFold Server — Structure prediction of SOD1 A4V homodimer and peptide complexes. Abramson J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 2024. https://alphafoldserver.com

  2. AlphaFold confidence metrics (ipTM / pTM) — Evans R, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2022. https://doi.org/10.1101/2021.10.04.463034

Molecular Visualization

  1. UCSF ChimeraX — Pettersen EF, et al. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Science, 2021. https://www.rbvi.ucsf.edu/chimerax/

Peptide Design and Generation

  1. PepMLM — Peptide design via masked language modeling. Chen T, et al. PepMLM: Target Sequence-Conditioned Generation of Peptide Binders via Masked Language Modeling. arXiv, 2023. https://arxiv.org/abs/2310.03842

  2. MoPPIT — Motif-based peptide-protein interaction tool. Source to be confirmed from course materials.

Physicochemical Property Prediction

  1. Solubility prediction — Peptide solubility probability scoring. Source dependent on tool used for property screening — confirm from course pipeline documentation.

  2. Hemolysis prediction — Peptide hemolysis probability scoring. Likely derived from HemoPI or an equivalent hemolysis prediction server; confirm from course pipeline documentation. Chaudhary K, et al. A web server and mobile app for computing hemolytic potency of peptides. Scientific Reports, 2016. https://webs.iiitd.edu.in/raghava/hemopi/

  3. Binding affinity (pKd/pKi) — Peptide binding affinity prediction. Source dependent on tool used — confirm from course pipeline documentation.

  4. GRAVY score (hydrophobicity) — Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 1982. 157(1):105–132.

  5. Isoelectric point (pI) prediction — Bjellqvist B, et al. The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis, 1993.

Disease and Biology Context

  1. SOD1 and fALS — Rosen DR, et al. Mutations in Cu/Zn superoxide dismutase gene are associated with familial amyotrophic lateral sclerosis. Nature, 1993. 362:59–62.

  2. A4V mutation and ALS — Cudkowicz ME, et al. Epidemiology of mutations in superoxide dismutase in amyotrophic lateral sclerosis. Annals of Neurology, 1997. 41(2):210–221.

  3. SOD1 misfolding and aggregation — Banci L, et al. Atomic-resolution monitoring of protein maturation in live human cells by NMR. Nature Chemical Biology, 2013.

  4. Induced folding / folding upon binding — Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nature Reviews Molecular Cell Biology, 2005. 6:197–208.

Structural Biology Concepts

  1. Greek key β-barrel topology — Richardson JS. The anatomy and taxonomy of protein structure. Advances in Protein Chemistry, 1981. 34:167–339.

  2. Protein distance thresholds and contact definition — Keskin O, et al. Principles of protein-protein interactions. Chemical Reviews, 2008. 108(4):1225–1244.

  3. Lysine-mediated membrane disruption and hemolysis — Brogden KA. Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nature Reviews Microbiology, 2005. 3:238–250.


Additional References

  • Fiers W. et al. (1976) Complete nucleotide sequence of bacteriophage MS2 RNA. Nature 260:500–507
  • Kastelein R.A. et al. (1982) Lysis gene expression of RNA phage MS2. Nature 295:35–41
  • Beremand & Blumenthal (1979) Overlapping genes in RNA phage. Cell 18:257–266
  • PMC10688784 — In vitro characterization of the phage lysis protein MS2-L
  • PMC554659 — The amino terminal half of the MS2-coded lysis protein is dispensable
  • CNN Phage Therapy Article — Strathdee/Patterson case study