Week 4 HW: Protein Design Part I
Part A: Conceptual Questions (Shuguang Zhang)
Q1: How many molecules of amino acids do you take with a piece of 500 grams of meat?
Meat is roughly 25% protein by weight, so 500g of meat contains ~125g of protein. The average amino acid has a molecular weight of ~110 Da (daltons), i.e. ~110 g/mol. So 125g ÷ 110 g/mol ≈ 1.14 mol of amino acid residues. Multiply by Avogadro’s number (6.022 × 10²³): roughly 6.8 × 10²³ amino acid molecules, just over one mole.
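The estimate above can be reproduced in a few lines. This is only a back-of-envelope sketch: the 25% protein fraction and the 110 g/mol average residue mass are the rough figures assumed in the answer, not measured values.

```python
# Back-of-envelope estimate of amino acid residues in 500 g of meat.
# Assumed inputs (rough figures from the answer above, not measurements):
AVOGADRO = 6.022e23          # molecules per mole
MEAT_MASS_G = 500.0
PROTEIN_FRACTION = 0.25      # meat is roughly 25% protein by weight
AVG_RESIDUE_MASS = 110.0     # g/mol, average amino acid residue

protein_mass_g = MEAT_MASS_G * PROTEIN_FRACTION
moles = protein_mass_g / AVG_RESIDUE_MASS
molecules = moles * AVOGADRO

print(f"{protein_mass_g:.0f} g protein = {moles:.2f} mol = {molecules:.1e} residues")
```

Changing the protein fraction to 20% or 30% shifts the result proportionally, so the answer stays on the order of 10²³ to 10²⁴ molecules.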
Q2: Why do humans eat beef but do not become a cow, eat fish but do not become fish?
Digestion breaks dietary proteins down into individual amino acids via proteases in the stomach and small intestine. These free amino acids enter the bloodstream as a generic pool; they carry no “cow identity.” Ribosomes then reassemble them into human proteins according to instructions from your own DNA.
Q3: Why are there only 20 natural amino acids?
The 20 amino acids cover the key physicochemical properties needed for protein function: hydrophobic, hydrophilic, charged, aromatic, small, large, flexible (glycine), rigid (proline), with minimal redundancy. The triplet genetic code (61 codons → 20 amino acids) is already near its error-tolerance optimum; adding more amino acids would mean sacrificing the codon redundancy that buffers against point mutations. And once the translation machinery (tRNA synthetases, ribosome) was locked in for 20, adding a 21st would require co-evolving multiple components simultaneously.
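The codon redundancy mentioned above can be counted directly from the standard genetic code. The sketch below builds the codon table from the NCBI translation table 1 string and tallies how many codons encode each amino acid; the variable names are illustrative only.

```python
from collections import Counter

# NCBI translation table 1 (standard genetic code), codons ordered with
# bases T, C, A, G at positions 1, 2 and 3; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codon_table = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Count how many codons encode each amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in codon_table.values() if aa != "*")

print("sense codons:", sum(degeneracy.values()))
print("amino acids:", len(degeneracy))
for aa, n in sorted(degeneracy.items(), key=lambda item: (-item[1], item[0])):
    print(aa, n)
```

Leucine, serine and arginine each have six codons while methionine and tryptophan have one, which is the redundancy that buffers many point mutations.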
Q4: Can you make other non-natural amino acids? Design some new amino acids.
Yes. Non-natural amino acids (nnAAs) are routinely made and incorporated into proteins, either by chemical synthesis (solid-phase peptide synthesis) or by engineering orthogonal tRNA/synthetase pairs to read a stop codon as a nnAA (amber suppression). Examples: p-azidophenylalanine (adds a click-chemistry handle for bioconjugation), α-aminoisobutyric acid (forces helical folding, used in peptide drugs), and trifluoroleucine (fluorinated side chain that resists proteolytic degradation, useful for extending therapeutic half-life).
Q5: Where did amino acids come from before enzymes that make them, and before life started?
Amino acids form spontaneously under prebiotic conditions, no enzymes needed. Key sources: (1) Miller-Urey synthesis: electric discharges through early-Earth atmospheres produce amino acids (the original 1952 experiment yielded glycine, alanine, aspartic acid). (2) Meteorites: the Murchison meteorite contains 70+ amino acids, confirming extraterrestrial abiotic synthesis. (3) Hydrothermal vents: high-temperature reactions between CO₂, H₂, and NH₃. (4) Strecker synthesis: aldehydes + ammonia + HCN → amino acids, all plausible on the early Earth.
Q6: If you make an α-helix using D-amino acids, what handedness would you expect?
Left-handed. Natural L-amino acids form right-handed α-helices because of how the side chain sits relative to the backbone: right-handed dihedrals (φ ≈ −57°, ψ ≈ −47°) avoid steric clashes. D-amino acids are the mirror image at Cα, so the favoured angles flip (φ ≈ +57°, ψ ≈ +47°), producing a left-handed helix. The entire structure is just the mirror image of a normal α-helix.
Q7: Can you discover additional helices in proteins?
Yes. Beyond the common α-helix and 3₁₀-helix, other forms exist: the π-helix (wider, rarer, found as short insertions in ~15% of proteins), the polyproline II helix (left-handed, extended, common in collagen), and the collagen triple helix. New helical geometries could be accessed by using non-natural amino acids that explore regions of the Ramachandran plot forbidden to the 20 canonical ones.
Q8: Why are most molecular helices right-handed?
Because L-amino acids dominate biology. The Cα stereochemistry of L-amino acids means right-handed backbone dihedrals avoid steric clashes between the side chain and the preceding carbonyl oxygen; left-handed conformations are energetically penalised. The dominance of L-amino acids over D-amino acids is itself likely a frozen accident from early life: once one chirality was selected, everything downstream inherited the preference.
Q9: Why do β-sheets tend to aggregate?
β-sheets have “sticky edges”: their edge strands expose unsatisfied backbone NH and C=O groups that can hydrogen-bond with strands from other molecules, extending the sheet. Unlike α-helices (where all backbone H-bonds are satisfied internally), β-sheets are inherently open-ended. On top of that, sheets stack face-to-face via hydrophobic side chains. This combination of edge H-bonding and hydrophobic stacking drives formation of extended multi-layered aggregates, with amyloid fibrils being the extreme case.
Part B: Protein Analysis and Visualization — PDB 1EDG
1. Protein Description and Selection
Protein: Endoglucanase A (CelCCA), a bacterial cellulase enzyme
Organism: Ruminiclostridium cellulolyticum (formerly Clostridium cellulolyticum), strain H10
UniProt: P17901 (GUNA_RUMCH)
EC Number: 3.2.1.4
What it does: Endoglucanase A cleaves internal beta-1,4-glucosidic bonds in cellulose — the most abundant organic polymer on Earth. It’s one of three enzyme types needed to fully break down cellulose into glucose. The catalytic mechanism is a retaining double-displacement (Koshland mechanism), using two conserved glutamic acid residues: Glu170 (proton donor) and Glu307 (nucleophile), separated by 5.5 angstroms.
Why selected: Cellulases are central to biofuel production, textile processing, and the global carbon cycle. Understanding how nature designs enzymes to break down cellulose is directly relevant to protein engineering — we might redesign these enzymes for improved thermostability, altered substrate specificity, or integration into industrial biocatalytic workflows. This connects well to the HTGAA course themes of designing biological systems with practical applications.
2. Amino Acid Sequence
The crystallized catalytic domain is 380 residues (the full-length precursor is 475 residues including a signal peptide and C-terminal dockerin domain):
3. Sequence Length and Amino Acid Frequency
Length: 380 residues (catalytic domain)
Molecular weight: ~43.1 kDa
| Amino Acid | Count | Frequency |
|---|---|---|
| N (Asn) | 41 | 10.8% |
| I (Ile) | 29 | 7.6% |
| G (Gly) | 28 | 7.4% |
| V (Val) | 26 | 6.8% |
| D (Asp) | 25 | 6.6% |
| S (Ser) | 24 | 6.3% |
| A (Ala) | 23 | 6.1% |
| K (Lys) | 21 | 5.5% |
| Y (Tyr) | 19 | 5.0% |
| L (Leu) | 18 | 4.7% |
| F (Phe) | 18 | 4.7% |
| T (Thr) | 18 | 4.7% |
| P (Pro) | 14 | 3.7% |
| M (Met) | 13 | 3.4% |
| R (Arg) | 13 | 3.4% |
| E (Glu) | 13 | 3.4% |
| Q (Gln) | 12 | 3.2% |
| W (Trp) | 11 | 2.9% |
| H (His) | 7 | 1.8% |
| C (Cys) | 7 | 1.8% |
Most frequent: Asparagine (N), 41 residues (10.8%)
Least frequent: Cysteine (C) and Histidine (H), 7 residues each (1.8%)
Composition summary:
- Hydrophobic residues (A,I,L,M,F,W,V,P): 152 (40.0%)
- Hydrophilic residues (D,E,K,R,H,N,Q,S,T): 174 (45.8%)
- Aromatic residues (F,W,Y): 48 (12.6%)
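The composition summary follows directly from the frequency table. As a quick consistency check, the sketch below recomputes the group totals from the tabulated counts; the residue groupings are the ones used in the summary above.

```python
# Residue counts copied from the frequency table above.
counts = {
    "N": 41, "I": 29, "G": 28, "V": 26, "D": 25, "S": 24, "A": 23,
    "K": 21, "Y": 19, "L": 18, "F": 18, "T": 18, "P": 14, "M": 13,
    "R": 13, "E": 13, "Q": 12, "W": 11, "H": 7, "C": 7,
}
# Residue groupings as used in the composition summary.
groups = {
    "hydrophobic": "AILMFWVP",
    "hydrophilic": "DEKRHNQST",
    "aromatic": "FWY",
}

total = sum(counts.values())

def group_summary(members):
    """Return (count, percent of total) for a string of one-letter codes."""
    n = sum(counts[aa] for aa in members)
    return n, 100.0 * n / total

summary = {name: group_summary(members) for name, members in groups.items()}

print("length:", total)
for name, (n, pct) in summary.items():
    print(f"{name}: {n} ({pct:.1f}%)")
```

Starting from the raw sequence instead of the table, `collections.Counter(sequence)` would give the same `counts` dictionary.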
The unusually high asparagine content (10.8% vs ~4% average) likely reflects the enzyme’s need for hydrogen-bonding residues in the active site cleft to interact with cellulose substrate hydroxyl groups.
4. Sequence Homologs
Using UniProt reference clusters:
- UniRef90: 7 members (>=90% identity) — close homologs within order Eubacteriales
- UniRef50: 11 members (>=50% identity) — broader Eubacteriales homologs
Using RCSB PDB sequence similarity (>=30% identity, E-value < 0.001):
- 50 polymer entity hits in the PDB that meet these sequence-similarity thresholds
The broader Glycosyl Hydrolase Family 5 contains over 57,000 sequences in UniProt (Swiss-Prot + TrEMBL), showing this is an extremely widespread enzyme family across bacteria, archaea, fungi, and plants.
5. Protein Family
CAZy family: Glycoside Hydrolase Family 5 (GH5), Clan GH-A
Pfam: PF00150, Cellulase (glycosyl hydrolase family 5)
InterPro: IPR001547, Glycoside hydrolase, family 5
GH5 is one of the largest glycoside hydrolase families, containing enzymes with diverse activities including endoglucanase, beta-mannanase, exo-1,3-glucanase, xylanase, and endoglycoceramidase. All share the (alpha/beta)8 TIM barrel fold and retaining catalytic mechanism.
The full-length protein also contains a dockerin domain (Pfam PF00404, residues 409-474), which anchors it to the cellulosome — a large multi-enzyme complex on the bacterial cell surface that coordinates cellulose degradation.
6. RCSB Structure Page
PDB ID: 1EDG
Deposition date: July 7, 1995
Release date: August 17, 1996
Current version: 1.3 (last revised February 7, 2024)
Structure Quality
| Metric | Value | Assessment |
|---|---|---|
| Resolution | 1.60 Å | Excellent (< 2.0 Å) |
| R-work | 0.191 | Good |
| R-free | 0.220 | Good |
| Ramachandran outliers | 0.26% | Very good |
| Clash score | 5.39 | Acceptable |
| Data completeness | 98.0% | Very good |
This is a high-quality structure. At 1.60 Å resolution, individual non-hydrogen atoms and side-chain conformations are well defined, although hydrogen atoms are generally not visible at this resolution. The R-factors and validation metrics are all within acceptable ranges.
Method: X-ray diffraction
Space group: P 2(1) 2(1) 2(1) (orthorhombic)
Data collected: November 1994 at EMBL/DESY Hamburg synchrotron
Primary citation: Ducros V et al. (1995) “Crystal structure of the catalytic domain of a bacterial cellulase belonging to family 5.” Structure 3:939-949. PMID: 8535787
7. Other Molecules in the Structure
Ligands/ions/cofactors: None. The structure was solved in the apo (unbound) form.
Water molecules: 375 solvent atoms resolved in the crystal structure.
Disulfide bonds: None (0)
Cis-peptides: 1
The absence of bound substrate or inhibitor means this structure represents the resting state of the enzyme. The active site cleft is empty and accessible.
8. Structure Classification Family
CATH classification:
- 3.20.20.80 — Glycosidases
- Class 3: Alpha-Beta proteins
- Architecture 20: Alpha-Beta Barrel
- Topology 20: TIM Barrel
- Homologous Superfamily 80: Glycosidases
SCOP classification:
- Fold: c.1 — TIM beta/alpha-barrel
- Superfamily: c.1.8 — (Trans)glycosidases
The (alpha/beta)8 TIM barrel is the most common enzyme fold in nature, found in ~10% of all known enzyme structures. It consists of 8 alternating beta-strands and alpha-helices forming a barrel, with the active site located at the C-terminal ends of the beta-strands.
9. 3D Visualization (PyMOL)
All visualizations were generated using PyMOL (command-line mode). The PDB file was fetched directly from RCSB and solvent was removed before visualization.
9a. Cartoon, Ribbon, and Ball-and-Stick Representations
| Cartoon | Ribbon | Ball and Stick |
|---|---|---|
Rainbow coloring from N-terminus (blue) to C-terminus (red) shows the polypeptide chain threading through the TIM barrel architecture.
9b. Color by Secondary Structure
Color key: Red = alpha-helices | Yellow = beta-sheets | Green = loops/coils
Analysis: The protein has significantly more helices than sheets. This is characteristic of the TIM barrel fold:
- 8 large alpha-helices (red) form the outer ring of the barrel
- 8 smaller beta-strands (yellow) form the inner barrel core
- Extensive loop regions (green) connect the secondary structure elements and form the active site cleft at the top of the barrel
The PyMOL coloring confirms: 1,294 atoms in helices vs 279 atoms in sheets — roughly a 4.6:1 ratio of helix to sheet content.
9c. Color by Residue Type
Color key: Orange = hydrophobic (A,V,L,I,M,F,W,P) | Green = polar uncharged (S,T,Y,N,Q,C,G) | Blue = positively charged (K,R,H) | Red = negatively charged (D,E)
Analysis:
- Hydrophobic residues (orange) are concentrated in the protein interior, packed between the beta-strands and alpha-helices of the TIM barrel — this is the hydrophobic core that stabilizes the fold
- Polar residues (green) are distributed throughout but are especially abundant on the surface and in the active site cleft
- Charged residues (blue = positive, red = negative) are predominantly on the protein surface, as expected — they interact with solvent and contribute to protein solubility
- The active site region shows a mix of polar and charged residues, consistent with the enzyme’s need to bind and hydrolyze the polar cellulose substrate
9d. Surface View — Binding Pockets
Analysis: The surface view clearly reveals a prominent elongated cleft running across one face of the protein. This is the substrate-binding groove where cellulose chains bind for hydrolysis. The cleft is:
- Deep and elongated — characteristic of endoglucanases, which must accommodate a long polysaccharide chain
- Lined with polar (green) and charged (red, blue) residues — providing hydrogen bonds and electrostatic interactions to grip the cellulose substrate
- Flanked by aromatic residues — tryptophan and tyrosine side chains form stacking interactions with the sugar rings (a hallmark of carbohydrate-active enzymes)
The catalytic residues Glu170 and Glu307 sit at the base of this cleft, positioned to cleave the glycosidic bond via the retaining mechanism.
The transparent surface overlay shows how the TIM barrel architecture creates the active site cleft at the C-terminal face of the barrel.
PyMOL Commands Used
Part C: Using ML-Based Protein Design Tools — PDB 1EDG
All experiments were run using the Amina CLI (v0.2.6), which provides cloud-hosted access to the same models used in the Colab notebook (ESM-1v, ESM2, ESMFold, ProteinMPNN). Our target protein is Endoglucanase A (1EDG) from R. cellulolyticum — a 380-residue cellulase with a TIM barrel fold.
C1. Protein Language Modeling
Deep Mutational Scan (ESM-1v)
Tool: amina run esm1v --mode dms with 5-model ensemble
What it does: For each of the 380 positions, the model masks that residue and scores all 20 possible amino acids, producing 7,600 mutation scores. More negative scores indicate mutations that would be more damaging to protein function.
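A minimal sketch of how per-position averages could be derived from such a scan is shown below. It assumes the scores are available as a mapping from position to per-amino-acid scores and that the "Avg Score" reported later is the mean over all substitutions at a position; neither the data layout nor the toy values come from the Amina CLI output.

```python
from statistics import mean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def rank_positions(scores):
    """Rank positions from least to most mutation-tolerant.

    `scores` maps a 1-based position to a dict {amino_acid: score};
    this layout is an assumption, not the Amina CLI output format.
    Returns a list of (position, average score), most negative first.
    """
    averages = {pos: mean(per_aa.values()) for pos, per_aa in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1])

# Toy scores for three hypothetical positions (not real ESM-1v output).
toy = {
    1: {aa: -1.0 for aa in AMINO_ACIDS},
    2: {aa: -10.0 for aa in AMINO_ACIDS},   # intolerant, e.g. a catalytic site
    3: {aa: 1.5 for aa in AMINO_ACIDS},     # tolerant, e.g. a surface loop
}

ranking = rank_positions(toy)
n_scores = 380 * len(AMINO_ACIDS)   # positions x amino acids in the full scan
print(ranking[0], ranking[-1], n_scores)
```

The first entry of `ranking` is the hardest position to mutate and the last is the most tolerant, which is the ordering used for the two tables that follow.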
Score statistics: Mean = -3.30, Std = 3.41, Range = [-15.4, +5.1]
Most Conserved Positions (hardest to mutate)
| Rank | Position | Residue | Avg Score | Significance |
|---|---|---|---|---|
| 1 | 307 | Glu (E) | -11.91 | Catalytic nucleophile |
| 2 | 79 | Arg (R) | -11.19 | Structural role in barrel |
| 3 | 169 | Asn (N) | -10.83 | Adjacent to catalytic Glu170 |
| 4 | 254 | His (H) | -10.73 | Active site geometry |
| 5 | 170 | Glu (E) | -10.63 | Catalytic proton donor |
| 6 | 156 | Phe (F) | -10.21 | Aromatic stacking in substrate cleft |
| 7 | 149 | Trp (W) | -10.13 | Substrate binding platform |
Key observation: The two catalytic glutamic acids (Glu307 and Glu170) rank #1 and #5 respectively as the most conserved positions — the language model independently identifies the active site residues purely from evolutionary sequence patterns, without any structural information. The surrounding residues (Asn169, Phe156, Trp149) are also highly conserved, reflecting their roles in positioning the catalytic machinery and binding the cellulose substrate through aromatic stacking interactions.
Most Tolerant Positions (easiest to mutate)
| Position | Residue | Avg Score | Interpretation |
|---|---|---|---|
| 376 | Phe (F) | +1.94 | C-terminal, surface-exposed |
| 287 | Trp (W) | +1.60 | Solvent-exposed, non-functional |
| 266 | Met (M) | +1.40 | Surface loop region |
| 86 | Pro (P) | +1.31 | Flexible loop |
These tolerant positions are on the protein surface, far from the active site — consistent with the expectation that surface residues under less evolutionary constraint can accept substitutions more readily.
Latent Space Analysis (ESM2 Embeddings)
Tool: amina run esm2-embedding with ESM2-8M model
Dataset: 500 random sequences from the SCOP structural domain database (astral-scopedom-seqres-gd-sel-gs-bib-40-2.08) plus our 1EDG sequence (501 total)
Clustering results: HDBSCAN identified 7 clusters, with 1EDG assigned to Cluster 5 (the largest cluster, 262 members, probability = 0.60).
Analysis: The t-SNE plot shows that protein sequences form neighborhoods based on structural/functional similarity in the ESM2 embedding space. Our 1EDG (labeled “1EDG_CelCCA”) sits within the large central cluster. Because this cluster holds 262 of the 501 sequences, it is probably a broad grouping that includes alpha/beta barrel proteins rather than a TIM-barrel-specific neighborhood. The moderate cluster probability (0.60) suggests 1EDG sits near the boundary of this neighborhood, which makes sense given that GH5 cellulases, while sharing the TIM barrel fold, have distinct sequence features compared to other TIM barrel enzymes.
C2. Protein Folding (ESMFold)
Folding the Native Sequence
Tool: amina run esmfold on the native 1EDG sequence (380 residues)
Results:
- Mean pLDDT: 0.93 (very high confidence)
- Backbone RMSD vs crystal structure: 1.27 Å
The pLDDT plot shows near-perfect confidence (>0.9) across most of the protein, with brief dips around:
- Residues ~245-250: A flexible loop region connecting beta-strands in the barrel
- Residues ~350-360: A surface-exposed turn
- Residues ~375-380: The C-terminus (expected lower confidence at chain termini)
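Dips like the ones listed above can be located programmatically from a per-residue pLDDT trace. The sketch below is one simple way to do it, assuming pLDDT values on a 0-1 scale in a plain list; the values are synthetic and the 0.9 threshold is an illustrative choice, not a setting taken from the ESMFold run.

```python
def low_confidence_regions(plddt, threshold=0.9):
    """Return 1-based inclusive (start, end) runs where pLDDT < threshold."""
    regions = []
    start = None
    for i, value in enumerate(plddt, start=1):
        if value < threshold and start is None:
            start = i                       # a low-confidence run begins
        elif value >= threshold and start is not None:
            regions.append((start, i - 1))  # the run ended at the previous residue
            start = None
    if start is not None:                   # run extends to the C-terminus
        regions.append((start, len(plddt)))
    return regions

# Synthetic per-residue pLDDT values on a 0-1 scale (not the real ESMFold output).
toy_plddt = [0.95] * 10 + [0.80] * 3 + [0.96] * 5 + [0.70] * 2

mean_plddt = sum(toy_plddt) / len(toy_plddt)
regions = low_confidence_regions(toy_plddt)
print(round(mean_plddt, 2), regions)
```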
ESMFold vs Crystal Structure Overlay
Gray = crystal structure (1EDG, 1.60 Å X-ray), Blue = ESMFold prediction
The predicted structure matches the crystal structure remarkably well (RMSD = 1.27 Å). The overall TIM barrel topology, the helix positions, and most loop conformations are correctly predicted. The largest deviations occur in the flexible loop regions identified by the lower pLDDT scores. These loops are genuinely flexible in the real protein and are often poorly resolved even in crystal structures.
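For reference, the RMSD quoted above is the root-mean-square distance between matched backbone atoms after the two structures are superposed. The sketch below shows only the final distance calculation and assumes the superposition has already been done (for example by the alignment step of a structure viewer); the coordinates are toy values, not taken from 1EDG.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD in the input units between two equal-length lists of (x, y, z).

    Assumes the two structures are already superposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy coordinates in angstroms for three atoms (not taken from 1EDG).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]

toy_rmsd = rmsd(reference, predicted)
print(toy_rmsd)
```

Because every deviation is squared before averaging, a few badly placed loop atoms can dominate the value even when the barrel core overlays closely.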
Is the structure resilient to mutations? The TIM barrel is one of the most robust protein folds in nature. Moderate mutations (especially in surface loops and non-core positions) would be expected to preserve the overall fold, while mutations to the hydrophobic core or conserved structural positions (identified by the DMS scan) would likely destabilize it.
C3. Protein Generation (ProteinMPNN + ESMFold)
Inverse Folding with ProteinMPNN
Tool: amina run proteinmpnn on the cleaned 1EDG crystal structure
Parameters: Chain A, 8 sequences, temperature = 0.1 (near-greedy sampling), vanilla model
Results:
| Rank | Score | Sequence Recovery | Key Observation |
|---|---|---|---|
| 1 (best) | 0.677 | 55.3% | Best scoring design |
| 2 | 0.695 | 55.3% | Similar to rank 1 |
| 3 | 0.695 | 53.2% | |
| 4 | 0.704 | 53.2% | |
| 5 | 0.706 | 54.5% | |
| 6 | 0.708 | 51.3% | |
| 7 | 0.709 | 53.2% | |
| 8 (worst) | 0.721 | 48.9% | Most divergent |
Mean sequence recovery: 53.1% — meaning ProteinMPNN independently recovered ~53% of the native amino acids purely from the backbone geometry. This is typical for a well-folded globular protein.
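Sequence recovery is simply the fraction of positions at which the designed residue equals the native one. The sketch below defines that calculation on a made-up pair of short sequences and then recomputes the mean from the per-design values in the table; the toy sequences are hypothetical and unrelated to 1EDG.

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where the designed residue matches native."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length (fixed-backbone design)")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

# Toy example with made-up 10-residue sequences (not the 1EDG sequences).
toy_recovery = sequence_recovery("MKTAYIAKQR", "MKSAYLAKQE")

# Per-design recoveries (%) copied from the results table above.
recoveries = [55.3, 55.3, 53.2, 53.2, 54.5, 51.3, 53.2, 48.9]
mean_recovery = sum(recoveries) / len(recoveries)

print(f"toy recovery: {toy_recovery:.0%}")
print(f"mean recovery: {mean_recovery:.1f}%")
```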
Predicted vs Original Sequence Comparison
Aligning the best ProteinMPNN design (rank 1) against the native sequence:
Patterns in the designed sequence:
- Core residues are highly conserved: Hydrophobic core positions (Val, Leu, Ile, Phe) are almost always recovered, reflecting their structural importance
- Active site residues recovered: The catalytic Glu residues and surrounding aromatic platforms are retained
- Surface residues diverge most: Charged and polar surface residues (Lys, Asp, Glu, Asn) show the most substitutions — consistent with the DMS tolerance analysis
- Glycines conserved: Gly residues in tight turns and barrel connections are strongly recovered, as they occupy positions requiring the unique conformational flexibility of glycine
Refolding the Designed Sequence with ESMFold
Tool: amina run esmfold on the best ProteinMPNN sequence (rank 1)
Results:
- Mean pLDDT: 0.92 (very high — the designed sequence is predicted to fold confidently)
- Backbone RMSD vs original crystal: 0.79 Å
Gray = crystal structure (1EDG), Green = ESMFold prediction of ProteinMPNN-designed sequence
The ProteinMPNN-designed sequence (with only ~55% identity to the native) folds into essentially the same structure as the original protein (RMSD = 0.79 Å). Remarkably, this RMSD is even lower than the ESMFold prediction of the native sequence (1.27 Å), suggesting that ProteinMPNN designs may be optimized for a more “ideal” version of the backbone geometry.
Summary Table
| Comparison | RMSD (Å) | pLDDT |
|---|---|---|
| ESMFold (native seq) vs Crystal | 1.27 | 0.93 |
| ESMFold (ProteinMPNN seq) vs Crystal | 0.79 | 0.92 |
Tools Used (Amina CLI)
All computations ran on AminoAnalytica’s cloud infrastructure via the Amina CLI:
Structural overlays were visualized in PyMOL.
Part D: Group Brainstorm on Bacteriophage Engineering
Primary: Enhance the lytic activity (toxicity) of the MS2 L protein.
Secondary: Improve protein stability of engineered variants.
Approach
We target the C-terminal functional domain of L — specifically residues in and around the conserved LS dipeptide motif identified by Chamakura et al. (2017) — while preserving membrane insertion and oligomerisation capacity, which Mezhyrova et al. (2023) showed is essential for the pore-forming lysis mechanism.
A key insight from the literature is that truncating L’s N-terminal domain removes its dependency on the host chaperone DnaJ and accelerates lysis by ~20 minutes. Rather than crude truncation, we use computational redesign to engineer a shorter, less basic Domain 1 that bypasses the DnaJ regulatory brake while retaining a full-length, well-folded protein.
Proposed Pipeline
Tools & Justification
| Tool | Role | Why It Helps |
|---|---|---|
| ESMFold | Predict baseline L monomer structure | Fast single-chain prediction; no experimental structure exists for L |
| ESM2 | Score mutant fitness via log-likelihood ratios | Captures evolutionary fitness signals even for short, poorly characterised proteins |
| ProteinMPNN | Redesign the dispensable N-terminal domain | Generates sequences respecting backbone geometry of the functional C-terminal half |
| AlphaFold-Multimer | Predict L homo-oligomers and L–DnaJ complex | Validates that redesigned variants still form oligomeric assemblies needed for membrane disruption |
| FoldSeek | Structure-based similarity search on final designs | Sanity check that designs don’t converge on known toxic or off-target folds |
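The log-likelihood-ratio scoring listed for ESM2 can be illustrated with a small sketch. It assumes the model's predicted amino acid probabilities at a masked position are already available as a dictionary; the probabilities below are hypothetical and the function name is illustrative, not part of any ESM API.

```python
import math

def log_likelihood_ratio(probs, wild_type, mutant):
    """Score a substitution as log p(mutant) - log p(wild type).

    `probs` maps amino acid -> model probability at the masked position.
    Negative values mean the model prefers the wild-type residue.
    """
    return math.log(probs[mutant]) - math.log(probs[wild_type])

# Hypothetical probabilities at one masked position (not real ESM2 output).
toy_probs = {"L": 0.60, "S": 0.30, "A": 0.10}

llr_l_to_s = log_likelihood_ratio(toy_probs, wild_type="L", mutant="S")
llr_s_to_l = log_likelihood_ratio(toy_probs, wild_type="S", mutant="L")
print(round(llr_l_to_s, 3), round(llr_s_to_l, 3))
```

Candidate variants could then be filtered by keeping only substitutions whose score is at or above zero, although the right cutoff for a 75-residue membrane peptide would need to be checked experimentally.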
Potential Pitfalls
No solved structure for L. It is a small (75 aa) membrane-associated peptide — all structural predictions are low-confidence and may not capture the membrane-inserted conformation accurately. Nanodisc or membrane-mimetic modelling is beyond the scope of these tools.
Unknown lytic target. The actual molecular target of L in the bacterial cell envelope has never been identified. We are optimising against proxies (oligomerisation propensity, DnaJ interaction, stability) rather than the true mechanism of action. A design that scores well computationally may not translate to improved lysis in vivo.