Week 4 HW: Protein Design Part I

Part A: Conceptual Questions (Shuguang Zhang)

Q1: How many molecules of amino acids do you take with a piece of 500 grams of meat?

Meat is roughly 25% protein by weight, so 500g of meat contains ~125g of protein. The average amino acid has a molecular weight of ~110 Da (daltons), i.e. ~110 g/mol. So 125g ÷ 110 g/mol ≈ 1.14 mol of amino acid residues. Multiply by Avogadro’s number (6.022 × 10²³): roughly 6.8 × 10²³ amino acid molecules, just over one mole.
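The estimate above can be reproduced in a few lines. This is only a sketch of the same arithmetic; the 25% protein fraction and the 110 g/mol average residue mass are the rough figures used in the answer, not measured values.

```python
AVOGADRO = 6.022e23               # molecules per mole
meat_g = 500.0
protein_fraction = 0.25           # rough estimate used in the answer
residue_mass_g_per_mol = 110.0    # average amino acid residue mass

protein_g = meat_g * protein_fraction
moles = protein_g / residue_mass_g_per_mol
molecules = moles * AVOGADRO

print(f"{protein_g:.0f} g protein = {moles:.2f} mol = {molecules:.2e} amino acids")
# 125 g protein = 1.14 mol = 6.84e+23 amino acids
```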

Q2: Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Digestion breaks dietary proteins down into individual amino acids via proteases in the stomach and small intestine. These free amino acids enter the bloodstream as a generic pool; they carry no “cow identity.” Ribosomes then reassemble them into human proteins according to instructions from your own DNA.

Q3: Why are there only 20 natural amino acids?

The 20 amino acids cover the key physicochemical properties needed for protein function: hydrophobic, hydrophilic, charged, aromatic, small, large, flexible (glycine), rigid (proline), with minimal redundancy. The triplet genetic code (61 codons → 20 amino acids) is already near its error-tolerance optimum; adding more amino acids would mean sacrificing the codon redundancy that buffers against point mutations. And once the translation machinery (tRNA synthetases, ribosome) was locked in for 20, adding a 21st would require co-evolving multiple components simultaneously.
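The 61-codon figure can be checked by enumerating all triplets and removing the three stop codons of the standard genetic code. The sketch below only counts codons; it does not model error tolerance.

```python
from itertools import product

BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}   # stop codons of the standard genetic code

all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
sense_codons = [c for c in all_codons if c not in STOP_CODONS]

print(len(all_codons), len(sense_codons))   # 64 61
# Average number of codons per amino acid if 61 codons encode 20 residues
print(round(len(sense_codons) / 20, 2))     # 3.05
```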

Q4: Can you make other non-natural amino acids? Design some new amino acids.

Yes. Non-natural amino acids (nnAAs) are routinely made and incorporated into proteins, either by chemical synthesis (solid-phase peptide synthesis) or by engineering orthogonal tRNA/synthetase pairs to read a stop codon as a nnAA (amber suppression). Examples: p-azidophenylalanine (adds a click-chemistry handle for bioconjugation), α-aminoisobutyric acid (forces helical folding, used in peptide drugs), and trifluoroleucine (fluorinated side chain that resists proteolytic degradation, useful for extending therapeutic half-life).

Q5: Where did amino acids come from before enzymes that make them, and before life started?

Amino acids form spontaneously under prebiotic conditions, no enzymes needed. Key sources: (1) Miller-Urey synthesis: electric discharges through early-Earth atmospheres produce amino acids (the original 1952 experiment yielded glycine, alanine, aspartic acid). (2) Meteorites: the Murchison meteorite contains 70+ amino acids, confirming extraterrestrial abiotic synthesis. (3) Hydrothermal vents: high-temperature reactions between CO₂, H₂, and NH₃. (4) Strecker synthesis: aldehydes + ammonia + HCN → amino acids, all plausible on the early Earth.

Q6: If you make an α-helix using D-amino acids, what handedness would you expect?

Left-handed. Natural L-amino acids form right-handed α-helices because of how the side chain sits relative to the backbone: right-handed dihedrals (φ ≈ −57°, ψ ≈ −47°) avoid steric clashes. D-amino acids are the mirror image at Cα, so the favoured angles flip (φ ≈ +57°, ψ ≈ +47°), producing a left-handed helix. The entire structure is just the mirror image of a normal α-helix.
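The sign flip follows from geometry: reflecting any set of atoms in a mirror plane negates every dihedral angle. A minimal sketch with four arbitrary points (not real backbone coordinates) illustrates this; the `dihedral` and `mirror` helpers are written here for illustration only.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))
    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def mirror(point):
    """Reflect a point through the yz-plane (x -> -x)."""
    x, y, z = point
    return (-x, y, z)

# Four arbitrary points with a dihedral of about +90 degrees
pts = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
original = dihedral(*pts)
mirrored = dihedral(*(mirror(p) for p in pts))
print(original, mirrored)   # equal magnitude, opposite sign
```

Applied to every φ and ψ along a backbone, the same reflection turns (−57°, −47°) into (+57°, +47°).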

Q7: Can you discover additional helices in proteins?

Yes. Beyond the common α-helix and 3₁₀-helix, other forms exist: the π-helix (wider, rarer, found as short insertions in ~15% of proteins), the polyproline II helix (left-handed, extended, common in collagen), and the collagen triple helix. New helical geometries could be accessed by using non-natural amino acids that explore regions of the Ramachandran plot forbidden to the 20 canonical ones.

Q8: Why are most molecular helices right-handed?

Because L-amino acids dominate biology. The Cα stereochemistry of L-amino acids means right-handed backbone dihedrals avoid steric clashes between the side chain and the preceding carbonyl oxygen; left-handed conformations are energetically penalised. The dominance of L-amino acids over D-amino acids is itself likely a frozen accident from early life: once one chirality was selected, everything downstream inherited the preference.

Q9: Why do β-sheets tend to aggregate?

β-sheets have “sticky edges”: their edge strands expose unsatisfied backbone NH and C=O groups that can hydrogen-bond with strands from other molecules, extending the sheet. Unlike α-helices (where all backbone H-bonds are satisfied internally), β-sheets are inherently open-ended. On top of that, sheets stack face-to-face via hydrophobic side chains. This combination of edge H-bonding and hydrophobic stacking drives formation of extended multi-layered aggregates, with amyloid fibrils being the extreme case.

Part B: Protein Analysis and Visualization — PDB 1EDG

1. Protein Description and Selection

Protein: Endoglucanase A (CelCCA), a bacterial cellulase enzyme
Organism: Ruminiclostridium cellulolyticum (formerly Clostridium cellulolyticum), strain H10
UniProt: P17901 (GUNA_RUMCH)
EC Number: 3.2.1.4

What it does: Endoglucanase A cleaves internal beta-1,4-glucosidic bonds in cellulose — the most abundant organic polymer on Earth. It’s one of three enzyme types needed to fully break down cellulose into glucose. The catalytic mechanism is a retaining double-displacement (Koshland mechanism), using two conserved glutamic acid residues: Glu170 (proton donor) and Glu307 (nucleophile), separated by 5.5 angstroms.

Why selected: Cellulases are central to biofuel production, textile processing, and the global carbon cycle. Understanding how nature designs enzymes to break down cellulose is directly relevant to protein engineering — we might redesign these enzymes for improved thermostability, altered substrate specificity, or integration into industrial biocatalytic workflows. This connects well to the HTGAA course themes of designing biological systems with practical applications.

2. Amino Acid Sequence

The crystallized catalytic domain is 380 residues (the full-length precursor is 475 residues including a signal peptide and C-terminal dockerin domain):

MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG
IKTTKQMIDAIKQKGFNTVRIPVSWHPHVSGSDYKISDVWMNRVQEVVNYCIDNKMYVIL
NTHHDVDKVKGYFPSSQYMASSKKYITSVWAQIAARFANYDEHLIFEGMNEPRLVGHANEW
WPELTNSDVVDSINCINQLNQDFVNTVRATGGKNASRYLMCPGYVASPDGATNDYFRMPND
ISGNNNKIIVSVHAYCPWNFAGLAMADGGTNAWNINDSKDQSEVTWFMDNIYNKYTSRGIP
VIIGECGAVDKNNLKTRVEYMSYYVAQAKARGILCILWDNNNFSGTGELFGFFDRRSCQFK
FPEIIDGMVKYAFGLIN

3. Sequence Length and Amino Acid Frequency

Length: 380 residues (catalytic domain)
Molecular weight: ~43.1 kDa

Amino Acid	Count	Frequency
N (Asn)	41	10.8%
I (Ile)	29	7.6%
G (Gly)	28	7.4%
V (Val)	26	6.8%
D (Asp)	25	6.6%
S (Ser)	24	6.3%
A (Ala)	23	6.1%
K (Lys)	21	5.5%
Y (Tyr)	19	5.0%
L (Leu)	18	4.7%
F (Phe)	18	4.7%
T (Thr)	18	4.7%
P (Pro)	14	3.7%
M (Met)	13	3.4%
R (Arg)	13	3.4%
E (Glu)	13	3.4%
Q (Gln)	12	3.2%
W (Trp)	11	2.9%
H (His)	7	1.8%
C (Cys)	7	1.8%

Most frequent: Asparagine (N), 41 residues (10.8%)
Least frequent: Cysteine (C) and Histidine (H), 7 residues each (1.8%)

Composition summary:

Hydrophobic residues (A,I,L,M,F,W,V,P): 152 (40.0%)
Hydrophilic residues (D,E,K,R,H,N,Q,S,T): 174 (45.8%)
Aromatic residues (F,W,Y): 48 (12.6%)

The unusually high asparagine content (10.8% vs ~4% average) likely reflects the enzyme’s need for hydrogen-bonding residues in the active site cleft to interact with cellulose substrate hydroxyl groups.
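Counts and group totals like those above can be computed with a short script. The sketch below uses the same residue groupings as the composition summary; the function names are illustrative, and only the first line of the sequence is used as demo input.

```python
from collections import Counter

# Residue groups as defined in the composition summary
HYDROPHOBIC = set("AILMFWVP")
HYDROPHILIC = set("DEKRHNQST")
AROMATIC = set("FWY")

def composition(seq):
    """Map each residue to (count, percent), most frequent first."""
    total = len(seq)
    return {aa: (n, round(100 * n / total, 1))
            for aa, n in Counter(seq).most_common()}

def group_fraction(seq, group):
    """Return (count, percent) of residues in seq that belong to group."""
    n = sum(1 for aa in seq if aa in group)
    return n, round(100 * n / len(seq), 1)

# First line of the catalytic-domain sequence, used only as a demo
fragment = "MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG"
for aa, (count, pct) in composition(fragment).items():
    print(aa, count, pct)
print("aromatic:", group_fraction(fragment, AROMATIC))
```

Passing the full 380-residue sequence instead of `fragment` gives the table for the whole domain.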

4. Sequence Homologs

Using UniProt reference clusters:

UniRef90: 7 members (>=90% identity) — close homologs within order Eubacteriales
UniRef50: 11 members (>=50% identity) — broader Eubacteriales homologs

Using RCSB PDB sequence similarity (>=30% identity, E-value < 0.001):

50 polymer entity hits in the PDB at this sequence-similarity threshold

The broader Glycosyl Hydrolase Family 5 contains over 57,000 sequences in UniProt (Swiss-Prot + TrEMBL), showing this is an extremely widespread enzyme family across bacteria, archaea, fungi, and plants.

5. Protein Family

CAZy family: Glycoside Hydrolase Family 5 (GH5), Clan GH-A
Pfam: PF00150 (Cellulase, glycosyl hydrolase family 5)
InterPro: IPR001547 (Glycoside hydrolase, family 5)

GH5 is one of the largest glycoside hydrolase families, containing enzymes with diverse activities including endoglucanase, beta-mannanase, exo-1,3-glucanase, xylanase, and endoglycoceramidase. All share the (alpha/beta)8 TIM barrel fold and retaining catalytic mechanism.

The full-length protein also contains a dockerin domain (Pfam PF00404, residues 409-474), which anchors it to the cellulosome — a large multi-enzyme complex on the bacterial cell surface that coordinates cellulose degradation.

6. RCSB Structure Page

PDB ID: 1EDG
Deposition date: July 7, 1995
Release date: August 17, 1996
Current version: 1.3 (last revised February 7, 2024)

Structure Quality

Metric	Value	Assessment
Resolution	1.60 Å	Excellent (< 2.0 Å)
R-work	0.191	Good
R-free	0.220	Good
Ramachandran outliers	0.26%	Very good
Clash score	5.39	Acceptable
Data completeness	98.0%	Very good

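This is a high-quality structure. At 1.60 Å resolution, individual non-hydrogen atoms and well-ordered water molecules are clearly defined in the electron density, although hydrogen atoms are generally not visible (that typically requires roughly 1.0 Å or better). The R-factors and validation metrics are all within acceptable ranges.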

Method: X-ray diffraction
Space group: P 2(1) 2(1) 2(1) (orthorhombic)
Data collected: November 1994 at EMBL/DESY Hamburg synchrotron

Primary citation: Ducros V et al. (1995) “Crystal structure of the catalytic domain of a bacterial cellulase belonging to family 5.” Structure 3:939-949. PMID: 8535787

7. Other Molecules in the Structure

Ligands/ions/cofactors: None. The structure was solved in the apo (unbound) form.

Water molecules: 375 solvent atoms resolved in the crystal structure.

Disulfide bonds: None (0)
Cis-peptides: 1

The absence of bound substrate or inhibitor means this structure represents the resting state of the enzyme. The active site cleft is empty and accessible.

8. Structure Classification Family

CATH classification:

3.20.20.80 — Glycosidases
- Class 3: Alpha-Beta proteins
- Architecture 20: Alpha-Beta Barrel
- Topology 20: TIM Barrel
- Homologous Superfamily 80: Glycosidases

SCOP classification:

Fold: c.1 — TIM beta/alpha-barrel
Superfamily: c.1.8 — (Trans)glycosidases

The (alpha/beta)8 TIM barrel is the most common enzyme fold in nature, found in ~10% of all known enzyme structures. It consists of 8 alternating beta-strands and alpha-helices forming a barrel, with the active site located at the C-terminal ends of the beta-strands.

9. 3D Visualization (PyMOL)

All visualizations were generated using PyMOL (command-line mode). The PDB file was fetched directly from RCSB and solvent was removed before visualization.

9a. Cartoon, Ribbon, and Ball-and-Stick Representations

Figure panels: Cartoon | Ribbon | Ball and Stick

Rainbow coloring from N-terminus (blue) to C-terminus (red) shows the polypeptide chain threading through the TIM barrel architecture.

9b. Color by Secondary Structure

Color key: Red = alpha-helices | Yellow = beta-sheets | Green = loops/coils

Analysis: The protein has significantly more helical content than sheet content, even though the barrel contains eight helices and eight strands. This is characteristic of the TIM barrel fold:

8 large alpha-helices (red) form the outer ring of the barrel
8 smaller beta-strands (yellow) form the inner barrel core
Extensive loop regions (green) connect the secondary structure elements and form the active site cleft at the top of the barrel

The PyMOL coloring confirms: 1,294 atoms in helices vs 279 atoms in sheets — roughly a 4.6:1 ratio of helix to sheet content.

9c. Color by Residue Type

Color key: Orange = hydrophobic (A,V,L,I,M,F,W,P) | Green = polar uncharged (S,T,Y,N,Q,C,G) | Blue = positively charged (K,R,H) | Red = negatively charged (D,E)

Analysis:

Hydrophobic residues (orange) are concentrated in the protein interior, packed between the beta-strands and alpha-helices of the TIM barrel — this is the hydrophobic core that stabilizes the fold
Polar residues (green) are distributed throughout but are especially abundant on the surface and in the active site cleft
Charged residues (blue = positive, red = negative) are predominantly on the protein surface, as expected — they interact with solvent and contribute to protein solubility
The active site region shows a mix of polar and charged residues, consistent with the enzyme’s need to bind and hydrolyze the polar cellulose substrate

9d. Surface View — Binding Pockets

Analysis: The surface view clearly reveals a prominent elongated cleft running across one face of the protein. This is the substrate-binding groove where cellulose chains bind for hydrolysis. The cleft is:

Deep and elongated — characteristic of endoglucanases, which must accommodate a long polysaccharide chain
Lined with polar (green) and charged (red, blue) residues — providing hydrogen bonds and electrostatic interactions to grip the cellulose substrate
Flanked by aromatic residues — tryptophan and tyrosine side chains form stacking interactions with the sugar rings (a hallmark of carbohydrate-active enzymes)

The catalytic residues Glu170 and Glu307 sit at the base of this cleft, positioned to cleave the glycosidic bond via the retaining mechanism.

The transparent surface overlay shows how the TIM barrel architecture creates the active site cleft at the C-terminal face of the barrel.

PyMOL Commands Used

# Load and clean
fetch 1EDG, async=0
remove solvent

# Cartoon view (rainbow N→C)
hide all; show cartoon
spectrum count, rainbow
ray 1200, 900; png cartoon_view.png, dpi=150

# Ribbon view
hide all; show ribbon
spectrum count, rainbow
ray 1200, 900; png ribbon_view.png, dpi=150

# Ball and stick
hide all; show sticks; show spheres
set sphere_scale, 0.25; set stick_radius, 0.1
spectrum count, rainbow
ray 1200, 900; png ball_and_stick_view.png, dpi=150

# Secondary structure coloring
hide all; show cartoon
color red, ss h        # helices
color yellow, ss s     # sheets
color green, ss l+''   # loops
ray 1200, 900; png secondary_structure.png, dpi=150

# Residue type coloring
hide all; show cartoon
color orange, resn ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO  # hydrophobic
color green, resn SER+THR+TYR+ASN+GLN+CYS+GLY        # polar
color blue, resn LYS+ARG+HIS                          # positive
color red, resn ASP+GLU                                # negative
ray 1200, 900; png residue_type.png, dpi=150

# Surface view
hide all; show surface
# (same residue type coloring as above)
ray 1200, 900; png surface_view.png, dpi=150

# Surface + cartoon overlay
hide all; show cartoon; color gray80
show surface; set transparency, 0.7
ray 1200, 900; png surface_cartoon_overlay.png, dpi=150

Part C: Using ML-Based Protein Design Tools — PDB 1EDG

All experiments were run using the Amina CLI (v0.2.6), which provides cloud-hosted access to the same models used in the Colab notebook (ESM-1v, ESM2, ESMFold, ProteinMPNN). Our target protein is Endoglucanase A (1EDG) from R. cellulolyticum — a 380-residue cellulase with a TIM barrel fold.

C1. Protein Language Modeling

Deep Mutational Scan (ESM-1v)

Tool: amina run esm1v --mode dms with 5-model ensemble
What it does: For each of the 380 positions, the model masks that residue and scores all 20 possible amino acids, producing 7,600 mutation scores. More negative scores indicate mutations that would be more damaging to protein function.

Score statistics: Mean = -3.30, Std = 3.41, Range = [-15.4, +5.1]
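How a per-mutation score and a per-position average can arise is sketched below. It assumes the usual masked-marginal log-odds (log-probability of the mutant minus that of the wild type at the masked position); whether the Amina CLI computes its scores in exactly this way is an assumption, and the logits are made-up toy values restricted to four residues.

```python
import math

def log_softmax(logits):
    """Convert raw per-residue logits into log-probabilities."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {aa: v - log_z for aa, v in logits.items()}

def mutation_scores(logits, wild_type):
    """Log-odds of each residue versus the wild type at one masked position."""
    logp = log_softmax(logits)
    return {aa: logp[aa] - logp[wild_type] for aa in logp}

def position_average(scores, wild_type):
    """Mean score over all non-wild-type substitutions at a position."""
    mutants = [s for aa, s in scores.items() if aa != wild_type]
    return sum(mutants) / len(mutants)

# Hypothetical logits for a position where Glu (E) is strongly preferred
toy_logits = {"E": 6.0, "D": 1.0, "Q": 0.5, "A": -2.0}
scores = mutation_scores(toy_logits, "E")
print(scores)
print(position_average(scores, "E"))   # strongly negative: a conserved position
```

Under this definition the wild-type residue always scores 0, so a strongly negative position average means nearly every substitution is disfavoured.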

Most Conserved Positions (hardest to mutate)

Rank	Position	Residue	Avg Score	Significance
1	307	Glu (E)	-11.91	Catalytic nucleophile
2	79	Arg (R)	-11.19	Structural role in barrel
3	169	Asn (N)	-10.83	Adjacent to catalytic Glu170
4	254	His (H)	-10.73	Active site geometry
5	170	Glu (E)	-10.63	Catalytic proton donor
6	156	Phe (F)	-10.21	Aromatic stacking in substrate cleft
7	149	Trp (W)	-10.13	Substrate binding platform

Key observation: The two catalytic glutamic acids (Glu307 and Glu170) rank #1 and #5 respectively as the most conserved positions — the language model independently identifies the active site residues purely from evolutionary sequence patterns, without any structural information. The surrounding residues (Asn169, Phe156, Trp149) are also highly conserved, reflecting their roles in positioning the catalytic machinery and binding the cellulose substrate through aromatic stacking interactions.

Most Tolerant Positions (easiest to mutate)

Position	Residue	Avg Score	Interpretation
376	Phe (F)	+1.94	C-terminal, surface-exposed
287	Trp (W)	+1.60	Solvent-exposed, non-functional
266	Met (M)	+1.40	Surface loop region
86	Pro (P)	+1.31	Flexible loop

These tolerant positions are on the protein surface, far from the active site — consistent with the expectation that surface residues under less evolutionary constraint can accept substitutions more readily.

Latent Space Analysis (ESM2 Embeddings)

Tool: amina run esm2-embedding with ESM2-8M model
Dataset: 500 random sequences from the SCOP structural domain database (astral-scopedom-seqres-gd-sel-gs-bib-40-2.08) plus our 1EDG sequence (501 total)

Clustering results: HDBSCAN identified 7 clusters, with 1EDG assigned to Cluster 5 (the largest cluster, 262 members, probability = 0.60).

Analysis: The t-SNE plot shows that protein sequences form neighborhoods based on structural/functional similarity in the ESM2 embedding space. Our 1EDG (labeled “1EDG_CelCCA”) sits within the large central cluster. Because this cluster holds more than half of the dataset (262 of 501 sequences), it is too broad to correspond to a single fold; it may be enriched in alpha/beta proteins such as TIM barrels, but that would need to be checked against the SCOP class labels. The moderate cluster probability (0.60) suggests 1EDG sits near the boundary of this neighborhood, which is plausible given that GH5 cellulases, while sharing the TIM barrel fold, have distinct sequence features compared to other TIM barrel enzymes.

C2. Protein Folding (ESMFold)

Folding the Native Sequence

Tool: amina run esmfold on the native 1EDG sequence (380 residues)

Results:

Mean pLDDT: 0.93 (very high confidence)
Backbone RMSD vs crystal structure: 1.27 Å

The pLDDT plot shows near-perfect confidence (>0.9) across most of the protein, with brief dips around:

Residues ~245-250: A flexible loop region connecting beta-strands in the barrel
Residues ~350-360: A surface-exposed turn
Residues ~375-380: The C-terminus (expected lower confidence at chain termini)
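Dips like these can be located automatically from the per-residue pLDDT values. The sketch below assumes pLDDT on a 0 to 1 scale, as reported above, and uses a toy list rather than the real ESMFold output; the function name and the 0.9 threshold are illustrative choices.

```python
def low_confidence_regions(plddt, threshold=0.9, min_len=1):
    """Return 1-based (start, end) ranges where pLDDT stays below threshold."""
    regions, start = [], None
    for i, value in enumerate(plddt, start=1):
        if value < threshold:
            if start is None:
                start = i              # a low-confidence stretch begins
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Close a stretch that runs to the end of the chain
    if start is not None and len(plddt) - start + 1 >= min_len:
        regions.append((start, len(plddt)))
    return regions

toy_plddt = [0.95, 0.96, 0.85, 0.80, 0.93, 0.97, 0.70]   # hypothetical values
print(low_confidence_regions(toy_plddt))   # [(3, 4), (7, 7)]
```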

ESMFold vs Crystal Structure Overlay

Gray = crystal structure (1EDG, 1.60 Å X-ray), Blue = ESMFold prediction

The predicted structure matches the crystal structure remarkably well (RMSD = 1.27 Å). The overall TIM barrel topology, helix positions, and most loop conformations are correctly predicted. The largest deviations occur in the flexible loop regions identified by the lower pLDDT scores; these are likely flexible in the real protein, and such loops are often poorly resolved even in crystal structures.
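For reference, RMSD is the root of the mean squared distance between matched atoms. The sketch below assumes the two coordinate sets are already superposed and listed in the same atom order; the optimal superposition step (for example the Kabsch algorithm) that an RMSD tool normally performs first is not shown, and the coordinates are toy values.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples, pre-superposed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy example: every atom displaced by 1 unit along x
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
print(rmsd(reference, shifted))   # 1.0
```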

Is the structure resilient to mutations? The TIM barrel is one of the most robust protein folds in nature. Moderate mutations (especially in surface loops and non-core positions) would be expected to preserve the overall fold, while mutations to the hydrophobic core or conserved structural positions (identified by the DMS scan) would likely destabilize it.

C3. Protein Generation (ProteinMPNN + ESMFold)

Inverse Folding with ProteinMPNN

Tool: amina run proteinmpnn on the cleaned 1EDG crystal structure
Parameters: Chain A, 8 sequences, temperature = 0.1 (near-greedy sampling), vanilla model

Results:

Rank	Score (lower is better)	Sequence Recovery	Key Observation
1 (best)	0.677	55.3%	Best scoring design
2	0.695	55.3%	Similar to rank 1
3	0.695	53.2%
4	0.704	53.2%
5	0.706	54.5%
6	0.708	51.3%
7	0.709	53.2%
8 (worst)	0.721	48.9%	Most divergent

Mean sequence recovery: 53.1% — meaning ProteinMPNN independently recovered ~53% of the native amino acids purely from the backbone geometry. This is typical for a well-folded globular protein.
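The ranking and the mean recovery can be recomputed directly from the table values. In this sketch the (score, recovery) pairs are copied from the results table and ranking assumes a lower score is better, as the table itself orders them.

```python
from statistics import mean

# (score, sequence recovery %) pairs copied from the results table
designs = [(0.677, 55.3), (0.695, 55.3), (0.695, 53.2), (0.704, 53.2),
           (0.706, 54.5), (0.708, 51.3), (0.709, 53.2), (0.721, 48.9)]

ranked = sorted(designs, key=lambda d: d[0])      # lowest score first
mean_recovery = mean(recovery for _, recovery in designs)

print(f"best score: {ranked[0][0]}, mean recovery: {mean_recovery:.1f}%")
# best score: 0.677, mean recovery: 53.1%
```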

Predicted vs Original Sequence Comparison

Aligning the best ProteinMPNN design (rank 1) against the native sequence:

Native:  MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG...
Design:  AYDPSLIPDLNIPQKPIPDNEAMRFVKSLRLGWNLGNTFDANSGSNIKNKLDYETAKAG...
Match:   .YD.SLIP.L.IPQK.IP.N..M.FVK.LRLGWNLGNTFDA..G.NI.N.LDYET...G
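A match line of this kind, and the resulting percent identity, can be generated for any two equal-length sequences. The sketch below assumes no gaps, which holds here because ProteinMPNN keeps the backbone length fixed; the function names are illustrative and the demo uses only the first 20 residues shown above.

```python
def match_line(native, design):
    """Show the residue where the sequences agree and '.' where they differ."""
    if len(native) != len(design):
        raise ValueError("sequences must have the same length (no gaps)")
    return "".join(n if n == d else "." for n, d in zip(native, design))

def percent_identity(native, design):
    """Percentage of positions with identical residues."""
    matches = sum(1 for c in match_line(native, design) if c != ".")
    return 100 * matches / len(native)

# First 20 residues of the native and rank-1 designed sequences
native = "MYDASLIPNLQIPQKNIPNN"
design = "AYDPSLIPDLNIPQKPIPDN"
print(match_line(native, design))         # .YD.SLIP.L.IPQK.IP.N
print(percent_identity(native, design))   # 70.0
```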

Patterns in the designed sequence:

Core residues are highly conserved: Hydrophobic core positions (Val, Leu, Ile, Phe) are almost always recovered, reflecting their structural importance
Active site residues recovered: The catalytic Glu residues and surrounding aromatic platforms are retained
Surface residues diverge most: Charged and polar surface residues (Lys, Asp, Glu, Asn) show the most substitutions — consistent with the DMS tolerance analysis
Glycines conserved: Gly residues in tight turns and barrel connections are strongly recovered, as they occupy positions requiring the unique conformational flexibility of glycine

Refolding the Designed Sequence with ESMFold

Tool: amina run esmfold on the best ProteinMPNN sequence (rank 1)

Results:

Mean pLDDT: 0.92 (very high — the designed sequence is predicted to fold confidently)
Backbone RMSD vs original crystal: 0.79 Å

Gray = crystal structure (1EDG), Green = ESMFold prediction of ProteinMPNN-designed sequence

The ProteinMPNN-designed sequence (with only ~55% identity to the native) folds into essentially the same structure as the original protein (RMSD = 0.79 Å). Remarkably, this RMSD is even lower than that of the ESMFold prediction for the native sequence (1.27 Å), suggesting that ProteinMPNN designs may be optimized for a more “ideal” version of the backbone geometry.

Summary Table

Comparison	RMSD (Å)	pLDDT
ESMFold (native seq) vs Crystal	1.27	0.93
ESMFold (ProteinMPNN seq) vs Crystal	0.79	0.92

Tools Used (Amina CLI)

All computations ran on AminoAnalytica’s cloud infrastructure via the Amina CLI:

# C1: Deep Mutational Scan (5-model ensemble, 7600 mutations scored)
amina run esm1v -m dms -s "<sequence>" -n 5 -j 1edg_dms -o ./C1_dms/

# C1: Latent Space Embeddings (501 sequences, ESM2-8M + t-SNE + HDBSCAN)
amina run esm2-embedding -f embedding_input.fasta -m 8M -j 1edg_embed -o ./C1_embeddings/

# C2: Structure Prediction (ESMFold)
amina run esmfold -s "<sequence>" -j 1edg_fold -o ./C2_esmfold/

# C2: RMSD Comparison
amina run simple-rmsd --reference crystal.pdb --mobile esmfold.pdb -o ./C2_esmfold/

# C3: Inverse Folding (ProteinMPNN, 8 sequences, T=0.1)
amina run proteinmpnn --pdb 1edg_cleaned.pdb --chains A -n 8 --temperature 0.1 -o ./C3_proteinmpnn/

# C3: Refold designed sequence
amina run esmfold -s "<mpnn_sequence>" -j mpnn_refold -o ./C3_refold/
amina run simple-rmsd --reference crystal.pdb --mobile mpnn_refold.pdb -o ./C3_refold/

Structural overlays were visualized in PyMOL.

Part D: Group Brainstorm on Bacteriophage Engineering

Primary: Enhance the lytic activity (toxicity) of the MS2 L protein.
Secondary: Improve protein stability of engineered variants.

Approach

We target the C-terminal functional domain of L — specifically residues in and around the conserved LS dipeptide motif identified by Chamakura et al. (2017) — while preserving membrane insertion and oligomerisation capacity, which Mezhyrova et al. (2023) showed is essential for the pore-forming lysis mechanism.

A key insight from the literature is that truncating L’s N-terminal domain removes its dependency on the host chaperone DnaJ and accelerates lysis by ~20 minutes. Rather than crude truncation, we use computational redesign to engineer a shorter, less basic Domain 1 that bypasses the DnaJ regulatory brake while keeping the protein well-folded and the C-terminal lytic domain intact.

Proposed Pipeline

L sequence
  │
  ▼
ESMFold ──────────────► Baseline monomer structure
  │
  ▼
ESM2 (log-likelihood) ► Score all single-point mutants
  │                      (avoid LS motif & TM core)
  ▼
ProteinMPNN ──────────► Redesign N-terminal domain
  │                      (shorter, less basic, DnaJ-independent)
  ▼
AlphaFold-Multimer ───► Validate oligomeric assembly
  │                      + check L–DnaJ complex disruption
  ▼
FoldSeek ─────────────► Confirm no convergence on
  │                      known problematic folds
  ▼
Ranked candidate list
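The mutant-scoring step needs a list of single-point mutants that skips protected positions (the LS motif and the TM core). One way to enumerate them is sketched below; the sequence and the protected positions are hypothetical placeholders, not the real MS2 L sequence or motif coordinates, and scoring each mutant with ESM2 is left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_point_mutants(seq, protected=frozenset()):
    """Yield (label, mutant sequence) pairs; positions are 1-based."""
    for i, wild_type in enumerate(seq, start=1):
        if i in protected:
            continue                      # leave protected positions untouched
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                yield f"{wild_type}{i}{aa}", seq[:i - 1] + aa + seq[i:]

toy_seq = "MKTLSAV"       # hypothetical placeholder, NOT the MS2 L sequence
protected = {4, 5}        # positions of the "LS" pair in the toy sequence
mutants = list(single_point_mutants(toy_seq, protected))
print(len(mutants))       # 95 = 5 mutable positions x 19 substitutions
print(mutants[0])         # ('M1A', 'AKTLSAV')
```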

Tools & Justification

Tool	Role	Why It Helps
ESMFold	Predict baseline L monomer structure	Fast single-chain prediction; no experimental structure exists for L
ESM2	Score mutant fitness via log-likelihood ratios	Captures evolutionary fitness signals even for short, poorly characterised proteins
ProteinMPNN	Redesign the dispensable N-terminal domain	Generates sequences respecting backbone geometry of the functional C-terminal half
AlphaFold-Multimer	Predict L homo-oligomers and L–DnaJ complex	Validates that redesigned variants still form oligomeric assemblies needed for membrane disruption
FoldSeek	Structure-based similarity search on final designs	Sanity check that designs don’t converge on known toxic or off-target folds

Potential Pitfalls

No solved structure for L. It is a small (75 aa) membrane-associated peptide — all structural predictions are low-confidence and may not capture the membrane-inserted conformation accurately. Nanodisc or membrane-mimetic modelling is beyond the scope of these tools.
Unknown lytic target. The actual molecular target of L in the bacterial cell envelope has never been identified. We are optimising against proxies (oligomerisation propensity, DnaJ interaction, stability) rather than the true mechanism of action. A design that scores well computationally may not translate to improved lysis in vivo.