Week 5 HW: Protein Design Part II

This week we learned how cutting-edge AI and protein language models are used to design functional proteins and peptides “in silico”.

Part A: SOD1 Binder Peptide Design

Part 1: Generate Binders with PepMLM

Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.

SOD1 with A4V

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
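The substitution step can be sketched in a few lines of Python. This is a minimal sketch, not code from the PepMLM notebook: the function name `substitute_mature` is hypothetical, and it assumes the usual SOD1 convention that variant positions are counted on the mature protein (initiator methionine not counted), so position 4 falls at 0-based index 4 of the UniProt string.

```python
# Canonical human SOD1 (UniProt P00441), written without spaces.
WILDTYPE_SOD1 = (
    "MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS"
    "AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV"
    "HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ"
)


def substitute_mature(seq, ref, pos, alt):
    """Apply a substitution whose position is numbered on the mature protein.

    Assumption: seq still carries the initiator Met at index 0, so mature
    position N sits at 0-based index N.
    """
    index = pos
    if seq[index] != ref:
        raise ValueError(f"expected {ref} at mature position {pos}, found {seq[index]}")
    return seq[:index] + alt + seq[index + 1:]


A4V_SOD1 = substitute_mature(WILDTYPE_SOD1, "A", 4, "V")
print(A4V_SOD1[:10])
```

Checking the reference residue before substituting guards against the most common mistake here, which is counting position 4 from the methionine and landing on the lysine instead.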

Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card, generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison. Record the perplexity scores that indicate PepMLM’s confidence in the binders.

Part 2: Evaluate Binders with AlphaFold Server

Navigate to the AlphaFold Server: alphafoldserver.com. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried? In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

Perplexity

Sequence	Perplexity
FLYRWLPSRRGG	21.42
WRYVAAAIARKK	14.24
WRYVAYALRWGE	26.03
KRYYWVAVARAA	12.95
HRYVAAAVKWKK	16.60
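To make the perplexity column easier to read, the sketch below shows how a perplexity is obtained from per-residue log-probabilities and ranks the recorded values. It assumes the PepMLM score is the exponential of the mean negative log-likelihood over the peptide residues, so lower values mean the model finds the peptide more plausible for this target; the function name `perplexity` is illustrative and is not taken from the notebook.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over the scored residues."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Perplexities recorded in the table above.
recorded = {
    "FLYRWLPSRRGG": 21.42,
    "WRYVAAAIARKK": 14.24,
    "WRYVAYALRWGE": 26.03,
    "KRYYWVAVARAA": 12.95,
    "HRYVAAAVKWKK": 16.60,
}

# Sort from most to least confident (lowest perplexity first).
ranked = sorted(recorded, key=recorded.get)
for peptide in ranked:
    print(f"{peptide}\t{recorded[peptide]:.2f}")
```

On these numbers the known binder ranks fourth of five, which is a reminder that perplexity measures how typical a peptide looks to the model rather than how well it binds.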

Peptide Observations

FLYRWLPSRRGG ⭐ Known Binder — Control

ipTM: 0.89 | pTM: 0.92
Distance to A4V: 22.285 Å
This is the known SOD1-binding peptide and serves as the baseline for all comparisons. The peptide is in the general vicinity of A4V. All PepMLM-generated peptides are evaluated against its ipTM of 0.89, pTM of 0.92, and distance of 22.285 Å.

WRYVAAAIARKK

ipTM: 0.85 | pTM: 0.89
Distance to A4V: 18.541 Å — 3.744 Å closer than the known binder
It sits near the dimer region and engages the β-barrel.
This peptide is closer to the dimer region and approaches, but does not exceed, the ipTM value of the known binder. It is very likely a well-formed β-barrel binder that intersects the surface and is partially buried.

WRYVAAAIARKK Distance

WRYVAAAIARKK Overlapping Surface

HRYVAAAVKWKK

ipTM: 0.88 | pTM: 0.91
Distance to A4V: 27.536 Å — 5.251 Å farther than the known binder
Engages the β-barrel region but does not localize near the N-terminus.
Appears partially buried. At 0.88, the ipTM is just below the known binder's 0.89.

HRYVAAAVKWKK

HRYVAAAVKWKK showing surface incursion

WRYVAYALRWGE

ipTM: 0.68 | pTM: 0.79
Distance to A4V: 12.875 Å — 9.410 Å closer than the known binder
Approaching the A4V location (27.152 Å), not the dimer interface. Surface bound. Considered near the A4V location. Lower confidence than the known binder, and the peptide shows a partially folded structure.
- The peptide is folding into a secondary structure upon binding rather than remaining as a random flexible chain.
- This is called induced folding or folding upon binding — a hallmark of meaningful peptide-protein interactions.
- The helix formation suggests the peptide is responding to the local environment of the SOD1 surface.

WRYVAYALRWGE Cartoon

WRYVAYALRWGE Surface

KRYYWVAVARAA

ipTM: 0.89 | pTM: 0.92
Distance to A4V: 17.228 Å — 5.057 Å closer than the known binder
Does not localize near the N-terminus; it sits near the middle of Chain 1.
Engages the surface in the middle of the chain and does not approach the dimer interface.
Surface bound — clipping view shows no intrusions.
ipTM matches the known binder exactly at 0.89, with a distance of 17.228 Å placing it closer to the target vicinity of A4V than the control.

KRYYWVAVARAA surface

KRYYWVAVARAA peptide with distance to A4V

KRYYWVAVARAA distance (Closeup)

ipTM Summary and Comparison to Known Binder

Peptide	Role	ipTM	pTM	Distance to A4V (Å)	Near A4V?
FLYRWLPSRRGG	⭐ Known binder (control)	0.89	0.92	22.285	Vicinity
WRYVAAAIARKK	PepMLM generated	0.85	0.89	18.541	Vicinity
HRYVAAAVKWKK	PepMLM generated	0.88	0.91	27.536	Far
WRYVAYALRWGE	PepMLM generated	0.68	0.79	12.875	Near
KRYYWVAVARAA	PepMLM generated	0.89	0.92	17.228	Vicinity

The ipTM values across the four PepMLM-generated peptides range from 0.68 to 0.89, so most of the predicted complexes carry high interface confidence. Against the known binder control FLYRWLPSRRGG (ipTM 0.89, distance 22.285 Å), KRYYWVAVARAA matches the control ipTM exactly at 0.89 and HRYVAAAVKWKK comes close at 0.88; no generated peptide exceeds it. However, high ipTM alone does not confirm therapeutic relevance, because proximity to the A4V site matters equally. WRYVAYALRWGE carries the lowest ipTM at 0.68 yet achieves the closest proximity to the A4V mutation site at 12.875 Å (9.410 Å closer than the known binder) and is the only peptide that shows induced folding near the target. This combination of near-vicinity binding and structural reorganization makes it the most therapeutically interesting candidate despite its lower confidence score, and suggests it warrants further optimization to strengthen the binding pose while maintaining its proximity to the A4V site.
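The comparison against the control can be reproduced directly from the summary table. The sketch below is only bookkeeping over the recorded values; the function name `compare_to_control` is hypothetical, and a positive distance gain means the peptide sits closer to A4V than the known binder.

```python
CONTROL = "FLYRWLPSRRGG"

# (ipTM, pTM, distance to A4V in Å), copied from the summary table.
results = {
    "FLYRWLPSRRGG": (0.89, 0.92, 22.285),
    "WRYVAAAIARKK": (0.85, 0.89, 18.541),
    "HRYVAAAVKWKK": (0.88, 0.91, 27.536),
    "WRYVAYALRWGE": (0.68, 0.79, 12.875),
    "KRYYWVAVARAA": (0.89, 0.92, 17.228),
}


def compare_to_control(table, control=CONTROL):
    """Return (peptide, ipTM delta, Å closer to A4V) sorted by proximity gain."""
    ref_iptm, _ref_ptm, ref_dist = table[control]
    rows = []
    for peptide, (iptm, _ptm, dist) in table.items():
        if peptide == control:
            continue
        rows.append((peptide, round(iptm - ref_iptm, 2), round(ref_dist - dist, 3)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


for peptide, d_iptm, closer in compare_to_control(results):
    print(f"{peptide}\tipTM {d_iptm:+.2f}\t{closer:+.3f} Å closer")
```

Sorting by proximity rather than ipTM puts WRYVAYALRWGE first and HRYVAAAVKWKK last, which is the ordering the candidate selection relies on.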

First Pass Analysis and Candidate Selection

In AlphaFold 3, my initial peptides were primarily surface-bound, with varying proximity to the A4V position near the homodimer interface. WRYVAYALRWGE was not the highest scoring, but it was closest to the target and showed induced folding: it organized into a helical secondary structure upon binding rather than remaining flexible, which is a hallmark of meaningful peptide-protein interaction.

Higher ipTM scores did not consistently predict stronger binding affinity or closer proximity to the A4V site. FLYRWLPSRRGG and KRYYWVAVARAA matched the highest ipTM at 0.89 but were farther from the mutation site, while WRYVAYALRWGE at 0.68 was structurally the most relevant.

Selected Candidate

The peptide chosen to advance from this first pass was WRYVAYALRWGE. Despite a hemolysis probability of 0.104 — approximately 2x the known binder control — its induced folding behavior near the A4V site was the deciding factor. The structural response to the local SOD1 environment, combined with its closest proximity to the mutation site at 12.875 Å, outweighed the moderate hemolysis risk at this stage of evaluation. Further analysis via MoPPIT would follow to explore whether higher affinity candidates could be generated with a safer therapeutic profile.

Part 4: Generate Optimized Peptides with moPPIt

Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

Open the moPPIt Colab linked from the HuggingFace moPPIt model card. Make a copy and switch to a GPU runtime. In the notebook: Paste your A4V mutant SOD1 sequence. Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch). Set peptide length to 12 amino acids. Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides. After generation, briefly describe how these moPPIt peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

Control / Known Binder Reference: FLYRWLPSRRGG (PepMLM control | ipTM 0.89 | Binding pKd 5.938 | Hemolysis 0.047) All MoPPIT-generated peptides are evaluated relative to this baseline.

Numeric Summary

Peptide	Hemolysis	Solubility	Binding (pKd)	Motif Score
FLYRWLPSRRGG (control)	0.047	1.000	5.938	—
RTCGLIETKKQT	0.982	0.833	6.298	0.693
KKTKTGKFCKQN	0.977	0.917	5.715	0.755
IKCGNKFKKKYH	0.957	0.833	7.713	0.632

Property-by-Property Analysis

Binding Affinity (pKd/pKi) — Strongest Property

Two of the three MoPPIT peptides exceed the control baseline: RTCGLIETKKQT modestly and IKCGNKFKKKYH by a wide margin.

Peptide	pKd	vs Control
FLYRWLPSRRGG (control)	5.938	baseline
RTCGLIETKKQT	6.298	+0.360 above control
KKTKTGKFCKQN	5.715	−0.223 below control
IKCGNKFKKKYH	7.713	+1.775 above control

IKCGNKFKKKYH shows the highest binding affinity of any peptide evaluated in this entire session — exceeding the control by +1.775 pKd units and exceeding the best PepMLM candidate (WRYVAYALRWGE, 6.980) by +0.733. This is a notable result. KKTKTGKFCKQN is the only MoPPIT peptide that falls below the control baseline.
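To put the pKd differences on a more intuitive scale, the sketch below converts them to dissociation constants. It assumes the predictor's score is the negative base-10 logarithm of a Kd in molar units; the text labels the column pKd/pKi, so treat the conversion as illustrative. Both function names are hypothetical.

```python
def pkd_to_kd_molar(pkd):
    """Convert pKd to Kd in mol/L, assuming pKd = -log10(Kd)."""
    return 10 ** (-pkd)


def fold_tighter(pkd, pkd_reference):
    """How many times smaller the Kd is relative to the reference."""
    return 10 ** (pkd - pkd_reference)


# Predicted pKd values from the table above.
predicted = {
    "FLYRWLPSRRGG (control)": 5.938,
    "RTCGLIETKKQT": 6.298,
    "KKTKTGKFCKQN": 5.715,
    "IKCGNKFKKKYH": 7.713,
}

control_pkd = predicted["FLYRWLPSRRGG (control)"]
for peptide, pkd in predicted.items():
    kd_nm = pkd_to_kd_molar(pkd) * 1e9
    print(f"{peptide}\tKd ≈ {kd_nm:.0f} nM\t{fold_tighter(pkd, control_pkd):.1f}x vs control")
```

Under that assumption each pKd unit is a tenfold change in Kd, so the +1.775 gap corresponds to roughly a 60-fold tighter predicted Kd, while +0.360 is only a little over twofold.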

Hemolysis Probability — Critical Liability

This is the most significant finding in the MoPPIT dataset and represents a serious concern for all three peptides.

Peptide	Hemolysis	vs Control	Flag
FLYRWLPSRRGG (control)	0.047	baseline	Safe
RTCGLIETKKQT	0.982	~21x control	⚠️ Critical
KKTKTGKFCKQN	0.977	~21x control	⚠️ Critical
IKCGNKFKKKYH	0.957	~20x control	⚠️ Critical

All three MoPPIT peptides show hemolysis probabilities approaching 1.0 — dramatically higher than the PepMLM control and well above any therapeutically acceptable threshold. This is likely driven by their highly cationic, lysine-rich sequences (KK and KKK motifs) which are known to disrupt negatively charged cell membranes through electrostatic attraction. This is a critical liability that would need to be resolved before any of these peptides could be considered viable candidates.
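The cationic-sequence explanation can be checked roughly by counting charged residues. The sketch below uses a deliberately crude integer charge (Lys and Arg as +1, Asp and Glu as −1) and ignores histidine, the termini, and pH, so it will not reproduce the net charge values reported by a property predictor. The function name `charge_profile` is hypothetical.

```python
PEPTIDES = ["FLYRWLPSRRGG", "RTCGLIETKKQT", "KKTKTGKFCKQN", "IKCGNKFKKKYH"]


def charge_profile(seq):
    """Count basic and acidic residues and return a crude integer net charge."""
    lys = seq.count("K")
    arg = seq.count("R")
    acidic = seq.count("D") + seq.count("E")
    return {"K": lys, "R": arg, "acidic": acidic, "net": lys + arg - acidic}


for peptide in PEPTIDES:
    profile = charge_profile(peptide)
    print(f"{peptide}\tK={profile['K']}\tR={profile['R']}\t"
          f"D+E={profile['acidic']}\tnet={profile['net']:+d}")
```

On these sequences the count gives five lysines each for KKTKTGKFCKQN and IKCGNKFKKKYH but only two for RTCGLIETKKQT, so lysine density alone may not account for every flagged peptide.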

Solubility

Peptide	Solubility	vs Control
FLYRWLPSRRGG (control)	1.000	baseline
RTCGLIETKKQT	0.833	below control
KKTKTGKFCKQN	0.917	below control
IKCGNKFKKKYH	0.833	below control

All MoPPIT peptides fall below the control solubility of 1.000. While none are insoluble, the reduction in solubility relative to the PepMLM candidates is worth noting — particularly for RTCGLIETKKQT and IKCGNKFKKKYH at 0.833.

Motif Position Score

Peptide	Motif Score	Interpretation
RTCGLIETKKQT	0.693	Moderate motif complementarity
KKTKTGKFCKQN	0.755	Highest motif complementarity
IKCGNKFKKKYH	0.632	Lowest motif complementarity

KKTKTGKFCKQN shows the strongest motif complementarity to the SOD1 target despite having a below-control binding affinity. This suggests the peptide is well-positioned relative to the SOD1 binding motif but may lack the side chain contacts needed to translate motif recognition into strong affinity. IKCGNKFKKKYH presents an interesting inversion — lowest motif score but highest affinity — suggesting its binding may be driven by non-specific electrostatic contacts rather than precise motif engagement.

Comparative Assessment — MoPPIT vs PepMLM

Property	Best PepMLM (WRYVAYALRWGE)	Best MoPPIT (IKCGNKFKKKYH)
Binding pKd	6.980	7.713
Hemolysis	0.104	0.957
Solubility	0.999	0.833
Distance to A4V	12.875 Å	not yet evaluated
Motif Score	not available	0.632
Induced folding	yes	not yet evaluated

MoPPIT generates peptides with superior raw binding affinity but at the cost of dramatically elevated hemolysis risk. PepMLM candidates show more balanced profiles with safer hemolysis values and demonstrated structural proximity to the A4V site.

Overall Candidate Assessment

Peptide	Affinity > Control?	Hemolysis Safe?	Solubility	Motif	Verdict
FLYRWLPSRRGG	baseline	yes	1.000	—	Control
RTCGLIETKKQT	yes (+0.360)	⚠️ critical	0.833	0.693	Needs redesign
KKTKTGKFCKQN	no (−0.223)	⚠️ critical	0.917	0.755	Needs redesign
IKCGNKFKKKYH	yes (+1.775)	⚠️ critical	0.833	0.632	High potential, high risk

Key Takeaway

IKCGNKFKKKYH has the highest predicted binding affinity of any peptide evaluated in this session (pKd 7.713), making it a structurally interesting lead. However, its hemolysis probability of 0.957 makes it unsuitable in its current form. The immediate optimization priority for all three MoPPIT peptides is reducing cationic character — specifically reducing lysine density — to bring hemolysis probability into a safe range while preserving the affinity advantage. AlphaFold structural evaluation of these peptides against the A4V SOD1 dimer would be the recommended next step to assess whether the affinity advantage translates to meaningful proximity to the mutation site.

Additional Investigation

Objective: Identify and resolve hemolysis liability in the highest-affinity MoPPIT peptide while preserving binding affinity to SOD1 A4V.

Stage 1 — Problem Identified

The three MoPPIT-generated peptides showed critically elevated hemolysis probabilities of 0.957–0.982 — approximately 20x the known binder control (FLYRWLPSRRGG, 0.047). The cause was identified as lysine-rich sequences — high cationic density causing electrostatic attraction to and disruption of negatively charged cell membranes.

Peptide	Hemolysis	Status
FLYRWLPSRRGG (control)	0.047	Safe
RTCGLIETKKQT	0.982	⚠️ Critical
KKTKTGKFCKQN	0.977	⚠️ Critical
IKCGNKFKKKYH	0.957	⚠️ Critical

Despite the hemolysis liability, IKCGNKFKKKYH was selected for optimization because it showed the highest binding affinity of any peptide in the entire session at pKd 7.713 — exceeding the known binder control by +1.775 units.

Stage 2 — Substitution Strategy Designed

Three variants were designed by targeting the five lysines at positions 2, 6, 8, 9, 10:

I  K  C  G  N  K  F  K  K  K  Y  H
1  2  3  4  5  6  7  8  9  10 11 12

Variant	Sequence	Substitutions	Strategy
Original	IKCGNKFKKKYH	—	Control baseline
Variant 1	IQCGNKFKQQYH	K2→Q, K9→Q, K10→Q	Moderate K→Q reduction
Variant 2	IQCGNQFQKNYH	K2→Q, K6→Q, K8→Q, K10→N	Aggressive K→Q/N reduction
Variant 3	IKCGNEFKKEYH	K6→E, K10→E	Charge balancing with glutamate
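Substitution lists are easy to mis-number by hand, so it helps to derive them from the sequences. The sketch below compares each variant to the parent peptide position by position (1-based); the function name `list_substitutions` is hypothetical.

```python
PARENT = "IKCGNKFKKKYH"

VARIANTS = {
    "Variant 1": "IQCGNKFKQQYH",
    "Variant 2": "IQCGNQFQKNYH",
    "Variant 3": "IKCGNEFKKEYH",
}


def list_substitutions(parent, variant):
    """Return substitutions such as 'K2→Q' for every position that differs."""
    if len(parent) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        f"{ref}{pos}→{alt}"
        for pos, (ref, alt) in enumerate(zip(parent, variant), start=1)
        if ref != alt
    ]


for name, sequence in VARIANTS.items():
    print(name, sequence, ", ".join(list_substitutions(PARENT, sequence)))
```

Running the same comparison also shows how many lysines each variant retains, which matters when interpreting the affinity differences between strategies.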

Stage 3 — Results

All three variants achieved hemolysis safety (0.035–0.037) — matching the known binder control. However binding affinity diverged significantly by strategy.

Peptide	Hemolysis	pKd	Net Charge	pI	Classification
IKCGNKFKKKYH (original)	0.957	7.713	4.83	10.03	Medium binding
IQCGNKFKQQYH	0.037	6.255	1.83	9.20	Weak binding
IQCGNQFQKNYH	0.037	6.165	0.84	8.21	Weak binding
IKCGNEFKKEYH	0.035	7.227	0.84	8.16	Medium binding

Stage 4 — Key Finding

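K→E substitution (glutamate) outperformed K→Q substitution (glutamine) for preserving binding affinity. Variant 3 lost only 0.486 pKd units versus ~1.5 units lost by the Q-substitution variants. One possible explanation is that glutamate can form new complementary contacts with the SOD1 surface rather than simply removing charge, although Variant 3 also retains three of the five lysines (Variants 1 and 2 retain two and one), so the two effects are not separated here.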

Variant 3 also achieved a net charge of 0.84 and pI of 8.16 — the most physiologically favorable profile of all variants and comparable to the best PepMLM candidate WRYVAYALRWGE (charge 0.77).

Substitution Strategy Comparison

Strategy	Hemolysis Resolved?	Affinity Retained?	Charge Reduced?	Verdict
K→Q moderate (Variant 1)	yes	partial (−1.458)	yes	Weak
K→Q aggressive (Variant 2)	yes	partial (−1.548)	best	Weak
K→E charge balance (Variant 3)	yes	best (−0.486)	best	Lead

Outcome

IKCGNEFKKEYH emerged as the optimized lead — retaining medium binding classification (pKd 7.227), achieving full hemolysis safety (0.035), and carrying a charge profile (0.84) and pI (8.16) that favor target selectivity over non-specific membrane disruption.

In this peptide series, the K→E glutamate substitution strategy resolved the predicted cationic hemolysis liability with the smallest loss of predicted binding affinity (0.486 pKd units).

Visualization:

Submit IKCGNEFKKEYH to AlphaFold Server against the A4V SOD1 homodimer to evaluate structural proximity to the A4V mutation site.
Compare ipTM and distance to A4V against the best PepMLM candidate WRYVAYALRWGE (12.875 Å) to determine which pipeline produces the stronger structural result.
If structural proximity is confirmed, consider a fourth generation of optimization targeting further charge refinement while monitoring affinity retention.

IKCGNEFKKEYH Non-Hemolytic - AlphaFold

IKCGNEFKKEYH Non-Hemolytic Surface

IKCGNEFKKEYH Non-Hemolytic Illustration

Part C: Final Project: L-Protein Mutants

High level summary: The objective of this assignment is to improve the stability and auto-folding of the lysis protein of the MS2 phage. This mechanism is key to understanding how phages could help address antibiotic resistance.

Context & Motivation

The L protein of bacteriophage MS2 is a 74–75 amino acid lysis protein whose stability and auto-folding are relevant to understanding how phages could help address antibiotic resistance. The CNN phage therapy case (Strathdee/Patterson) provided real-world context: phage therapy saved a life against Acinetobacter baumannii when all antibiotics failed, underlining why understanding phage lysis mechanisms matters.

“It’s estimated that by 2050, 10 million people per year — that’s one person every three seconds — is going to be dying from a superbug infection.” — Steffanie Strathdee, UC San Diego

Step 1 — Understanding the Problem

Established that MS2 encodes 4 proteins: Maturation (A), Coat (CP), Lysis (L), Replicase (Rep)
Located L protein on genome: NC_001417 nt 1678–1902
Identified the core challenge: L gene overlaps CP and Rep simultaneously
Any nucleotide mutation in L is also a mutation in a neighboring reading frame

Overlapping Frames
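The overlap constraint can be made concrete by asking which codon a given nucleotide belongs to in each reading frame. The sketch below is a minimal illustration using coordinates quoted in this write-up (L at nt 1678–1902, coat protein at nt 1335–1724, replicase starting at nt 1761); the replicase end is not given here and is left open. The function name `codon_hits` is hypothetical.

```python
# Gene coordinates on NC_001417 as quoted in this write-up.
# None means the end coordinate is not stated here.
FRAMES = {
    "CP": (1335, 1724),
    "L": (1678, 1902),
    "Rep": (1761, None),
}


def codon_hits(nt):
    """Map a genome position to (codon number, position within codon) per gene."""
    hits = {}
    for gene, (start, end) in FRAMES.items():
        if nt >= start and (end is None or nt <= end):
            offset = nt - start
            hits[gene] = (offset // 3 + 1, offset % 3 + 1)
    return hits


for position in (1700, 1740, 1800):
    print(position, codon_hits(position))
```

A position that returns two genes changes a codon in both, and because the frames are offset, a wobble position in one gene is usually a first or second codon position in the other.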

Step 2 — Sequence Acquisition

Retrieved wildtype L protein sequence (74 aa)
Dataset-validated all 32 experimentally constrained positions against the wildtype
Attempted live fetch from UniProt (P03609) — network restricted
Reconstructed sequence from published Fiers 1976 data + dataset ground truth
Downloaded all 4 MS2 protein sequences as FASTA file

Wildtype L Protein (74 aa)

>P03609|Lysis_protein_L|NC_001417:1678-1902
METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK
FTNQLLLSSLCLVFVTSATKQQLS

All 4 MS2 Proteins

Accession	Protein	Length	Gene
P03589	Maturation protein A	393 aa	mat
P69141	Coat protein CP	129 aa	cp
P03609	Lysis protein L	75 aa	lys
P00585	Replicase beta subunit	544 aa	rep

Step 3 — ORF Analysis

Mapped all three reading frames across the L gene window (nt 1620–1950):

GENE MAP (NC_001417):

  ════════════════════════╗
  Coat Protein (CP)       ║ STOP
  (nt 1335 → 1724)        ╚══════════════════════════════════════
                                    ╔══════════════════════════════
                                    ║ Replicase (Rep)
                    ╔═══════════════╩══════════════════════════╗
                    ║          L PROTEIN (LYSIS)               ║ STOP
                    ╚══════════════════════════════════════════╝

Three Overlap Zones

Zone	Nucleotides	L aa	Risk	Notes
Zone 1 — CP ∩ L	nt 1678–1724	aa 1–15	HIGH ⚠	Affects coat protein
Free Zone — L only	nt 1725–1760	aa 16–28	LOW ✓	Safest mutation window
Zone 2 — L ∩ Rep	nt 1761–1902	aa 29–75	HIGH ⚠	Affects replicase

Step 4 — Domain Mapping

aa:  1                    ~40  ~41                      74
     |─────────────────────|────|─────────────────────────|
     [════ N-terminal soluble ════][═══ Transmembrane domain ═══]
           (aa 1–40)                      (aa 41–74)
           DISPENSABLE for lysis          ESSENTIAL for lysis
           DnaJ chaperone binding         Membrane insertion + pore formation

Key Domain Properties

Soluble Domain (aa 1–40)

Dispensable for lysis function
Primary DnaJ chaperone interaction site
Spans Zone 1, the Free Zone, and the start of Zone 2 (aa 29–40)
Best target for stability mutations

Transmembrane Domain (aa 41–74)

Essential for lysis
Drives membrane insertion
Forms oligomeric pores in cell envelope
Entirely within Zone 2 (Rep overlap)
Hydrophobic core must be preserved

TM Boundary Zone (aa 38–46) — Key Target

The soluble→TM junction is the most tractable region for stability improvement:

aa:  38   39   40   41   42   43   44   45   46   47
      L    Y    V    L    I    F    L    A    I    F
                     |────────────────────────────────
                     TM domain starts
                                    ↑    ↑    ↑
                                   L44  A45  I46  ← lysis=1 mutations here

Step 5 — Mutation Dataset Analysis

Parsed 57 unique non-stop mutations from experimental dataset.

Dataset Summary

Each mutation annotated with:

Nucleotide position and base pair change
Amino acid position and substitution
Lysis activity (0/1)
Protein expression level (0/1/ND)

TM Region Mutations

Mutation	aa pos	Lysis	Protein	Notes
L44P	44	1	1	Proline kink — TM entry
A45P	45	1	1	Proline kink — TM entry
I46F	46	1	1	Aromatic anchor
I46N	46	0	0	Polar — abolishes both
F47Y	47	0	1	Protein made, no lysis
L48P	48	0	1	Protein made, no lysis
K50E/N/I	50	0	1	Charge changes — no lysis
V63E	63	0	1	Protein made, no lysis

Key finding: Only L44P, A45P, I46F in the TM region retain lysis=1

Step 6 — Mutation Combination Generator

Built Python pipeline to generate random 2-residue combinations with three strategies:

random — any valid pair from full pool
lysis_positive — both mutations have lysis=1
mixed — at least one lysis+ and one lysis- per combo
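The three strategies can be sketched as filters over all valid pairs. This is an illustrative reconstruction, not the pipeline that was actually run: the pool below holds only the single-residue TM mutations from the table above, pairs at the same residue position are excluded as an assumption, and the names `DATASET`, `position`, and `generate_combos` are hypothetical.

```python
import itertools
import random

# Mutation -> lysis activity, taken from the TM region table above.
DATASET = {
    "L44P": 1, "A45P": 1, "I46F": 1,
    "I46N": 0, "F47Y": 0, "L48P": 0, "V63E": 0,
}


def position(mutation):
    """Residue number of a mutation written like 'L44P'."""
    return int(mutation[1:-1])


def generate_combos(pool, strategy="random", n=3, seed=0):
    """Sample up to n two-residue combinations under the chosen strategy."""
    pairs = [
        (a, b)
        for a, b in itertools.combinations(sorted(pool), 2)
        if position(a) != position(b)  # two changes at one residue cannot coexist
    ]
    if strategy == "lysis_positive":
        pairs = [p for p in pairs if all(pool[m] == 1 for m in p)]
    elif strategy == "mixed":
        pairs = [p for p in pairs if {pool[m] for m in p} == {0, 1}]
    elif strategy != "random":
        raise ValueError(f"unknown strategy: {strategy}")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return rng.sample(pairs, min(n, len(pairs)))


for strategy in ("random", "lysis_positive", "mixed"):
    print(strategy, generate_combos(DATASET, strategy))
```

Seeding the generator means the same combinations come back on every run, which makes it possible to cite a specific mutant set in the write-up.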

Generated Mutants

Mutant	Combo	Lysis	Notes
MUT1	S15A + K50I	1/0	Mixed — epistasis candidate
MUT2	M1T + L56H	0/0	Both abolish lysis — stability study
MUT3	Y39H + V40E	0/0	Adjacent — soluble→TM junction

MUT1 Sequence (S15A + K50I)

>MS2_L_MUT1|S15A_K50I|lysis=1/0|prot=1/1
METRFPQQSQQTPAATNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSI
FTNQLLLSSLCLVFVTSATKQQLS

MUT2 Sequence (M1T + L56H)

>MS2_L_MUT2|M1T_L56H|lysis=0/0|prot=0/1
TETRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK
FTNQLHLSSLCLVFVTSATKQQLS

MUT3 Sequence (Y39H + V40E)

>MS2_L_MUT3|Y39H_V40E|lysis=0/0|prot=0/0
METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLHELIFLAIFLSK
FTNQLLLSSLCLVFVTSATKQQLS
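The mutant sequences can be regenerated from the wildtype so that each one is checked against its label. The sketch below uses the 74-residue wildtype exactly as listed in this write-up and verifies the reference residue before every substitution; the function name `apply_mutations` is hypothetical.

```python
# Wildtype L protein as listed in this write-up (74 aa).
WILDTYPE_L = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK"
    "FTNQLLLSSLCLVFVTSATKQQLS"
)

MUTANTS = {
    "MUT1": ["S15A", "K50I"],
    "MUT2": ["M1T", "L56H"],
    "MUT3": ["Y39H", "V40E"],
}


def apply_mutations(seq, mutations):
    """Apply substitutions written like 'S15A' (1-based positions)."""
    residues = list(seq)
    for mutation in mutations:
        ref, pos, alt = mutation[0], int(mutation[1:-1]), mutation[-1]
        if residues[pos - 1] != ref:
            raise ValueError(f"{mutation}: found {residues[pos - 1]} at position {pos}")
        residues[pos - 1] = alt
    return "".join(residues)


for name, mutations in MUTANTS.items():
    print(f">{name}|{'_'.join(mutations)}")
    print(apply_mutations(WILDTYPE_L, mutations))
```

Because the function refuses to substitute when the reference residue does not match, an off-by-one position or a wrong wildtype fails loudly instead of producing a plausible-looking sequence.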

Step 7 — Evaluation Pipeline

Note: in this section, Claude AI recommended a Python-based scoring system. It is optional, but it informs the pipeline.

6-Gate Scoring Framework

INPUT: Mutation combination (2+ residues)
│
├─► GATE 1: Dataset Filters
│         ├── Lysis activity (0/1)
│         ├── Protein expression level (0/1)
│         └── Stop codon in combo?
│
├─► GATE 2: Sequence Checks
│         ├── GC content (40–60%)
│         ├── Isoelectric point (pI)
│         ├── GRAVY hydrophobicity score
│         └── Rare codons at N-terminus?
│
├─► GATE 3: ORF Overlap Check
│         ├── Coat protein frame intact?
│         ├── Replicase frame intact?
│         └── No new stop codons in overlapping frames?
│
├─► GATE 4: Structure Prediction
│         ├── pLDDT ≥ wildtype (ESMFold / AF2)
│         ├── pTM > 0.5
│         └── TM helix (aa 41–74) intact?
│
├─► GATE 5: Stability Scoring
│         ├── ΔΔG ≤ 0 kcal/mol (FoldX sign; mCSM reports stabilizing as ΔΔG ≥ 0)
│         └── Solubility score (CamSol)
│
└─► GATE 6: Multimer Check
          ├── ipTM ≥ wildtype (AF2_Multimer)
          └── TM domain interface contacts preserved?
                     │
               SHORTLIST: Top 2–5 candidates
                     │
               WET LAB VALIDATION
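As one concrete instance of the Gate 2 sequence checks, the sketch below computes a GRAVY score as the mean Kyte-Doolittle hydropathy over a sequence and applies it to the two domains as defined in this write-up (aa 1–40 and aa 41–74). It stands in for what a tool such as BioPython or ExPASy ProtParam would report; the function name `gravy` is hypothetical.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Wildtype L protein as listed in this write-up (74 aa).
WILDTYPE_L = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK"
    "FTNQLLLSSLCLVFVTSATKQQLS"
)
SOLUBLE_DOMAIN = WILDTYPE_L[:40]   # aa 1-40
TM_DOMAIN = WILDTYPE_L[40:]        # aa 41-74


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


print(f"soluble domain GRAVY: {gravy(SOLUBLE_DOMAIN):+.2f}")
print(f"TM domain GRAVY:      {gravy(TM_DOMAIN):+.2f}")
```

Scoring the two domains separately is more informative than a whole-protein value here, since a mutation that lowers hydrophobicity in the TM segment is a different kind of risk from one in the soluble N-terminus.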

Key Metrics Ranked by Complexity

Rank	Metric	Tool	Complexity
1–3	Lysis, protein level, stop codons	Dataset	Instant
4–9	GC, MW, pI, GRAVY, cysteines, codons	BioPython / ExPASy	Minutes
10–15	ORF check, secondary structure, solubility	Benchling / PSIPRED	Minutes
16–20	pLDDT, pTM, TM helix, clashes	ESMFold / AF2	Hours
21–23	ΔΔG, membrane insertion energy	FoldX / mCSM	Hours
24–28	Conservation, epistasis, wet lab	Leviviridae MSA + bench	Days

Step 8 — TM Mutation Shortlist (Top 5 Candidates)

Rank	Combo	Rationale
1	L44P + A45P	Both lysis=1, prot=1. Double proline at TM entry
2	L44P + I46F	Both lysis=1, prot=1. Proline kink + aromatic anchor
3	A45P + I46F	Both lysis=1, prot=1. Classic TM stabilization
4	I46F + S49T	Mixed lysis — epistatic rescue candidate
5	L44P + N53S	TM entry + core — rescue test

Step 9 — Structural Analysis Tools

Tool	Purpose	Gate
Benchling	ORF-safe mutation design	Gate 3
ESMFold / AF2	Structure prediction	Gate 4
ChimeraX	3D visualization, residue swapping	Gate 4
FoldX / mCSM	ΔΔG stability scoring	Gate 5
AF2_Multimer	Oligomeric assembly prediction	Gate 6
ProteinMPNN	AI-guided sequence redesign	Design
QuikChange	Wet lab site-directed mutagenesis	Synthesis

Key Biological Insights

The L protein’s overlapping reading frames are the primary constraint on mutation design
The Free Zone (aa 16–28) is the only region where mutations affect L protein alone
The TM boundary (aa 44–46) is the most promising target — lysis=1 mutations exist there
L protein functions as an oligomer — monomer folding alone is insufficient
DnaJ chaperone interaction with the soluble domain is critical for proper folding
The C-terminal TM domain drives both membrane insertion and pore formation

Outstanding Steps

[ ] Run ESMFold on all 5 candidate mutants → get pLDDT scores
[ ] Run FoldX / mCSM → get ΔΔG for each candidate
[ ] Run AF2_Multimer → check dimer ipTM scores
[ ] Run Benchling ORF check → verify CP and Rep frames intact
[ ] Rank and select final top 5 candidates
[ ] Synthesize top 2 in wet lab → SDS-PAGE + lysis assay

Appendix: Pipeline Summary with Key AI-Generated Prompts (Claude - Sonnet 4.6)

Stage 1 — Sequence Retrieval and Mutation Introduction

The session began with retrieving the canonical human SOD1 sequence from UniProt (P00441) and introducing the A4V point mutation, which replaces alanine with valine at position 4 of the mature protein. This established the disease-relevant target sequence for all downstream analysis. Key concepts clarified included the numbering convention between the full canonical sequence and the mature processed form, and the biological significance of A4V as the most common fALS-linked SOD1 variant in North America.

Key prompts:

“Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.” “What is the A4V mutation?” “What is a homodimer?”

Stage 2 — Conceptual Grounding

Before moving to computational tools, foundational questions established the biological framework: the structural location of A4V at β-strand 1 near the dimer interface, and the therapeutic rationale for designing a peptide binder — to intercept misfolded A4V SOD1 at the aggregation-prone hydrophobic surface exposed by dimer destabilization.

Key prompts:

“Remind, where is the actual critical region of A4V?” “Summarize what the fundamental purpose is to take the mutated protein and add a binder sequence.”

Stage 3 — AlphaFold Server Workflow

The AlphaFold Server workflow was established: inputting two copies of the A4V SOD1 sequence as Entity 1 to model the native homodimer, and the peptide as Entity 2. The distinction between protein chains and small molecule ligands was clarified. The rationale for five ranked models per job was explained, and the rank_0 CIF file was identified as the correct starting point.

Key prompts:

“In AlphaFold Server how to add a peptide to a protein sequence?” “When evaluating in AlphaFold should I be using one strand of the SOD1 sequence or two, to show the mutant form?” “Each export from AlphaFold includes 4 CIF files. Why?”

Stage 4 — ChimeraX Structural Evaluation

The built-in AlphaFold viewer was identified as insufficient for detailed analysis, leading to adoption of ChimeraX. A core command vocabulary was developed iteratively through troubleshooting: chain coloring, secondary structure coloring, residue labeling to landmark A4V, surface generation with transparency to assess binding depth, and distance measurement to quantify proximity to residue 4. Common errors were resolved including chain specification syntax, atom ambiguity, electrostatic surface cap persistence, and model number conflicts.

Key prompts:

“Is it better to evaluate in AlphaFold or in another visual program to get to these answers?” “When loading a model into ChimeraX and evaluating, summarize key questions for evaluating visually.” “Summarize how to best answer the questions, and what ChimeraX visualization will work best.”

Stage 5 — Peptide Observation and Scoring

Four PepMLM-generated peptides and the known binder control were evaluated across AlphaFold confidence metrics (ipTM, pTM) and ChimeraX structural observations (distance to A4V, structural feature engagement, binding depth). A key insight emerged: WRYVAYALRWGE, the peptide with the lowest ipTM (0.68), showed the closest proximity to A4V (12.875 Å) and was the only one to show induced folding, a hallmark of meaningful peptide-protein interaction. FLYRWLPSRRGG (ipTM 0.89, distance 22.285 Å) was established as the known binder control baseline.

Key prompts:

“What if a lower ipTM has a closer proximity to the A4V location?” “In this case the peptide starts to show a helical fold.” “Update to final summary — FLYRWLPSRRGG is the control or known SOD1-binding peptide.”

Stage 6 — Physicochemical Property Analysis (PepMLM)

PepMLM peptide physicochemical properties were analyzed relative to FLYRWLPSRRGG across seven dimensions: solubility, hemolysis, binding affinity (pKd/pKi), molecular weight, net charge, isoelectric point, and hydrophobicity (GRAVY). Three peptides exceeded the control binding affinity. WRYVAYALRWGE showed the highest pKd (6.980), lowest net charge (0.77), and lowest pI (8.59) — the most favorable selectivity profile.

Key prompts:

“Analyze the results.” (physicochemical data pasted) “Revise the plot — make FLYRWLPSRRGG the baseline and first value, color in gray bar.” “Format the detailed analysis as a markdown file.”

Stage 7 — MoPPIT Peptide Generation and Analysis

Three MoPPIT-generated peptides were introduced and analyzed. All three showed critically elevated hemolysis probabilities (0.957–0.982, ~20x control) driven by lysine-rich sequences. Despite this, IKCGNKFKKKYH was identified as the highest-affinity peptide of the entire session at pKd 7.713. Motif position scores were introduced as an additional evaluation dimension.

Key prompts:

“What is a motif position?” “Graph the following MoPPIT generated peptide binders.” (data pasted) “Format the analysis of MoPPIT data in Hugo markdown format.”

Stage 8 — Hemolysis Resolution

The hemolysis liability of IKCGNKFKKKYH was addressed through systematic lysine substitution. Three variants were designed and screened. The K→E glutamate substitution strategy (Variant 3: IKCGNEFKKEYH) outperformed K→Q substitution — retaining medium binding classification (pKd 7.227), reducing net charge to 0.84, and achieving full hemolysis safety (0.035) comparable to the known binder control.

Key prompts:

“Summarize hemolysis probability and what we may do to resolve.” “Recommend three peptides derived from IKCGNKFKKKYH that might lower hemolysis.” “Here are the results of an attempt to lower hemolysis.” (variant data pasted)

Stage 9 — Synthesis and Outputs

All findings were compiled into structured Hugo markdown deliverables: peptide binding observations, numeric summary tables, ipTM vs distance scatter plot, six-panel physicochemical bar chart, MoPPIT analysis, hemolysis resolution pipeline summary, and this appendix. Two lead candidates emerged from separate pipelines for further structural validation.

Key prompts:

“Plot the data points in a visual graphic, highlighting the likely candidate.” “Download summary of the attempt to achieve hemolysis safety in Hugo markdown format.” “Revise the hemolysis summary in Hugo markdown format.”

Distilled Conclusion

Two lead candidates emerged from this session across two separate peptide generation pipelines:

WRYVAYALRWGE (PepMLM): closest structural proximity to A4V (12.875 Å), highest PepMLM binding affinity (pKd 6.980), induced folding behavior upon binding, and the most favorable charge selectivity profile (net charge 0.77). Recommended for AlphaFold dimer evaluation and structural confirmation.

IKCGNEFKKEYH (MoPPIT, optimized): highest binding affinity among the hemolysis-safe peptides in this session (pKd 7.227), full hemolysis safety achieved (0.035), net charge 0.84, pI 8.16. In this series, glutamate substitution (K→E) reduced charge with less affinity loss than glutamine substitution (K→Q).

The recommended next step for both candidates is AlphaFold Server structural evaluation against the A4V SOD1 homodimer, followed by distance-to-A4V measurement in ChimeraX to determine which pipeline produces the more structurally relevant binder.

Part 3 — Footnote Attributions

Databases and Sequence Resources

UniProt — Human SOD1 canonical sequence (P00441 / SODC_HUMAN). UniProt Consortium. UniProt: the Universal Protein knowledgebase. Nucleic Acids Research. https://www.uniprot.org/uniprotkb/P00441

Structure Prediction

AlphaFold Server — Structure prediction of SOD1 A4V homodimer and peptide complexes. Abramson J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 2024. https://alphafoldserver.com
AlphaFold confidence metrics (ipTM / pTM) — Evans R, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2022. https://doi.org/10.1101/2021.10.04.463034

Molecular Visualization

UCSF ChimeraX — Pettersen EF, et al. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Science, 2021. https://www.rbvi.ucsf.edu/chimerax/

Peptide Design and Generation

PepMLM — Peptide design via masked language modeling. Chen T, et al. PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling. arXiv, 2023. https://arxiv.org/abs/2310.03842
MoPPIT — Motif-based peptide-protein interaction tool. Source to be confirmed from course materials.

Physicochemical Property Prediction

Solubility prediction — Peptide solubility probability scoring. Source dependent on tool used for property screening — confirm from course pipeline documentation.
Hemolysis prediction — Peptide hemolysis probability scoring. Likely derived from HemoPI or equivalent hemolysis prediction server; confirm from course pipeline documentation. Chaudhary K, et al. A web server and mobile app for computing hemolytic potency of peptides. Scientific Reports, 2016. https://webs.iiitd.edu.in/raghava/hemopi/
Binding affinity (pKd/pKi) — Peptide binding affinity prediction. Source dependent on tool used — confirm from course pipeline documentation.
GRAVY score (hydrophobicity) — Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 1982. 157(1):105–132.
Isoelectric point (pI) prediction — Bjellqvist B, et al. The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis, 1993.

Disease and Biology Context

SOD1 and fALS — Rosen DR, et al. Mutations in Cu/Zn superoxide dismutase gene are associated with familial amyotrophic lateral sclerosis. Nature, 1993. 362:59–62.
A4V mutation and ALS — Cudkowicz ME, et al. Epidemiology of mutations in superoxide dismutase in amyotrophic lateral sclerosis. Annals of Neurology, 1997. 41(2):210–221.
SOD1 misfolding and aggregation — Banci L, et al. Atomic-resolution monitoring of protein maturation in live human cells by NMR. Nature Chemical Biology, 2013.
Induced folding / folding upon binding — Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nature Reviews Molecular Cell Biology, 2005. 6:197–208.

Structural Biology Concepts

Greek key β-barrel topology — Richardson JS. The anatomy and taxonomy of protein structure. Advances in Protein Chemistry, 1981. 34:167–339.
Protein distance thresholds and contact definition — Keskin O, et al. Principles of protein-protein interactions. Chemical Reviews, 2008. 108(4):1225–1244.
Lysine-mediated membrane disruption and hemolysis — Brogden KA. Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nature Reviews Microbiology, 2005. 3:238–250.

Additional References

Fiers W. et al. (1976) Complete nucleotide sequence of bacteriophage MS2 RNA. Nature 260:500–507
Kastelein R.A. et al. (1982) Lysis gene expression of RNA phage MS2. Nature 295:35–41
Beremand & Blumenthal (1979) Overlapping genes in RNA phage. Cell 18:257–266
PMC10688784 — In vitro characterization of the phage lysis protein MS2-L
PMC554659 — The amino terminal half of the MS2-coded lysis protein is dispensable
CNN Phage Therapy Article — Strathdee/Patterson case study