HTGAA Spring 2026 · Week 4
March 3, 2026

Protein Design Part I

Amino acid conceptual foundations, protein structure visualization, machine learning design tools, and bacteriophage engineering brainstorm.

PyMOL ESM2 ESMFold Group Assignment
Part A

Conceptual Questions on Amino Acids & Protein Structure

Selected 9 questions from Shuguang Zhang's foundational amino acid and protein structure curriculum. Responses below address amino acid chemistry, structural biology, and evolutionary considerations.

Question 1
How many molecules of amino acids do you take with a piece of 500 grams of meat?
Assuming meat is ~20% protein by weight and each amino acid residue averages ~110 Daltons (accounting for the water lost on peptide bond formation), a 500 g serving contains approximately 100 g of protein. This equals roughly 0.91 moles of amino acids (100 g ÷ 110 g/mol), or approximately 5.5 × 10²³ amino acid molecules. Biologically, after proteolysis essentially all of these are absorbed as free amino acids or short di- and tripeptides; the ~9 essential amino acids must come from the diet, while the others can also be synthesized de novo, and any surplus is used for energy.
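The arithmetic can be checked with a few lines of Python. This is a minimal sketch under the same assumptions as the answer (20% protein by weight, ~110 Da per residue); the function name and default values are illustrative, not measured data.

```python
AVOGADRO = 6.022e23  # molecules per mole

def amino_acid_count(meat_g, protein_fraction=0.20, residue_mass_da=110.0):
    """Return (moles, molecules) of amino acid residues in a serving of meat."""
    protein_g = meat_g * protein_fraction      # grams of protein
    moles = protein_g / residue_mass_da        # g / (g/mol) = mol
    return moles, moles * AVOGADRO

moles, molecules = amino_acid_count(500)
print(f"{moles:.3f} mol of residues, {molecules:.2e} molecules")
```

Changing `protein_fraction` or `residue_mass_da` shows how sensitive the estimate is to those two assumptions; the order of magnitude stays the same.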
Question 2
Why do humans eat beef but do not become a cow, eat fish but do not become fish?
Amino acids are monomeric building blocks that carry no species-specific information. Once ingested, proteins are depolymerized into free amino acids, stripped of their sequence context. The biological identity of an organism is encoded in its genome, which specifies the amino acid sequences of its proteome, not in the chemical identity of individual amino acids. Humans reassemble ingested amino acids into human proteins via translation of human mRNA. The "instruction manual" (DNA) determines the output, not the chemical material.
Question 3
Why are there only 20 natural amino acids?
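The 20 canonical amino acids represent an evolutionary compromise: enough diversity to support complex protein function (different chemical properties, hydrophobicity, charge, polarity) while remaining few enough for efficient genetic encoding. A doublet code would provide only 4² = 16 codons, too few for 20 amino acids plus a stop signal, whereas the triplet code provides 4³ = 64 codons, enough to encode all 20 with redundancy. This "sweet spot" is thought to have emerged more than 3.5 billion years ago and became locked in by the translation machinery (tRNAs, aminoacyl-tRNA synthetases, and the ribosome), because reassigning a codon would alter nearly every protein at once. The code is not strictly closed: selenocysteine and pyrrolysine are incorporated as the 21st and 22nd amino acids in some organisms by recoding stop codons. Evolution appears to have selected for robustness and speed over maximal chemical diversity.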
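The codon-counting argument is simple enough to verify directly. The sketch below assumes a four-letter nucleotide alphabet and treats "enough" as at least 21 distinct signals (20 amino acids plus one stop); the function name is illustrative.

```python
def codon_capacity(codon_length, alphabet_size=4):
    """Number of distinct codons for a given codon length."""
    return alphabet_size ** codon_length

REQUIRED_SIGNALS = 21  # 20 amino acids + at least one stop signal

for length in (1, 2, 3, 4):
    total = codon_capacity(length)
    verdict = "enough" if total >= REQUIRED_SIGNALS else "too few"
    print(f"codon length {length}: {total} codons ({verdict})")
```

Triplets are the shortest codons that clear the threshold, which is why the 43 spare codons show up as redundancy rather than as extra amino acids.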
Question 4
Can you make other non-natural amino acids? Design some new amino acids.
Yes. Non-canonical amino acids (ncAAs) can be engineered into proteins via amber stop codon suppression (UAG → ncAA) or through cell-free protein synthesis. Examples include para-azidophenylalanine (pAzF) for click chemistry, p-benzoyl-L-phenylalanine (pBpa) for photocrosslinking, and β-amino acids with modified backbone geometry. For Füzi Poiesis, a halophilic amino acid (e.g., a highly negatively charged glutamate analog with an extended side chain) could stabilize SQR in high-salinity Lake Budi conditions by strengthening the protein's hydration shell. The challenge: the translation machinery evolved for 20 amino acids, so ncAA incorporation requires an orthogonal tRNA/aminoacyl-tRNA synthetase pair, and yields are often reduced because suppression competes with normal termination at the stop codon.
Question 5
Where did amino acids come from before enzymes that make them, and before life started?
Prebiotic amino acids arose through abiotic synthesis: the Miller-Urey experiment (performed in 1952, published in 1953) showed that methane, ammonia, water vapor, and hydrogen in a reducing atmosphere, stimulated by electrical discharge, spontaneously form amino acids (glycine, alanine, aspartate). Meteorites also carry amino acids of extraterrestrial origin, formed for example by radiolysis of ices. Before life, amino acids could accumulate in the primordial ocean. The first self-replicating molecules (likely RNA, given its catalytic and genetic properties) used these free amino acids as building blocks. This is the "RNA World" hypothesis: catalytic RNAs (ribozymes) encoded and catalyzed their own replication, later acquiring the ability to synthesize proteins.
Question 6
If you make an α-helix using D-amino acids, what handedness would you expect?
A left-handed helix. The α-helix handedness is determined by the stereochemistry of the α-carbon, not by inherent helical geometry. L-amino acids (natural; S-configuration for every chiral residue except cysteine, which is R under the priority rules) produce right-handed helices (α-RH). D-amino acids (their mirror images) do the opposite: they produce left-handed helices (α-LH), with the signs of the Φ (phi) and Ψ (psi) dihedral angles inverted. Biologically, left-handed α-helices are almost unknown in natural proteins because ribosomal protein synthesis uses L-amino acids exclusively (homochirality). This has been experimentally verified: D-protein analogs fold into mirror-image structures and are resistant to natural proteases (a property exploited in drug design).
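The mirror-image argument can be illustrated numerically. The sketch below builds an idealized helical trace of Cα-like points (3.6 residues per turn, 1.5 Å rise, 2.3 Å radius are textbook α-helix values; the point model itself is a simplification) and measures handedness from the sign of the scalar triple product of consecutive displacement vectors. Reflecting the coordinates, which is what swapping L- for D-residues does to the whole structure, flips the sign.

```python
import math

def helix_points(n=12, residues_per_turn=3.6, rise=1.5, radius=2.3):
    """Idealized right-handed helical trace along +z (one point per residue)."""
    step = 2 * math.pi / residues_per_turn
    return [(radius * math.cos(i * step), radius * math.sin(i * step), rise * i)
            for i in range(n)]

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def handedness(points):
    """+1 for a right-handed trace, -1 for a left-handed one."""
    total = 0.0
    for p0, p1, p2, p3 in zip(points, points[1:], points[2:], points[3:]):
        u, v, w = sub(p1, p0), sub(p2, p1), sub(p3, p2)
        total += dot(cross(u, v), w)   # signed volume; sign follows the twist
    return 1 if total > 0 else -1

def mirror(points):
    """Reflect through the yz-plane (x -> -x), i.e. take the mirror image."""
    return [(-x, y, z) for x, y, z in points]

l_helix = helix_points()
print(handedness(l_helix), handedness(mirror(l_helix)))
```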
Question 7
Can you discover additional helices in proteins?
Yes, beyond the canonical α-helix (Pauling-Corey, 1951), several non-standard helices exist: the π-helix (~4.4 residues per turn; rare, usually seen as short segments within α-helices), the 3₁₀-helix (3 residues per turn, often found at helix termini), and the polyproline II helix (left-handed, found in unfolded regions and collagen). Novel helices could be engineered via computational design (e.g., modifying Φ/Ψ angles via noncanonical amino acids or circular permutation). For Füzi Poiesis, engineering a "salt-bridge helix" (Glu and Lys residues spaced i, i+4 so that opposite charges pair along the helix surface) could stabilize SQR in high ionic strength environments; this is a design strategy layered on the α-helix rather than a new helix type.
Question 8
Why are most molecular helices right-handed?
Because life is homochiral: the amino acids used in ribosomal protein synthesis are L-enantiomers, and the α-helix's right-handedness is a direct consequence of L-amino acid stereochemistry. One proposed origin is parity violation (the weak nuclear force gives L-amino acids a minuscule energetic advantage), although the effect is extremely small and its role remains speculative; a chance initial imbalance that was later amplified is an alternative explanation. Once life committed to L-amino acids, any organism building proteins from D-amino acids would be incompatible with the existing machinery and outcompeted. The universality of L-amino acids makes right-handed helices the default. This is most likely a frozen accident of prebiotic chemistry rather than an inherent physical optimum.
Question 9
Why do β-sheets tend to aggregate?
β-sheets are inherently prone to aggregation because of their extended backbone geometry. Unlike α-helices, whose backbone hydrogen bonds are satisfied locally within the helix (residue i to i+4), β-strands hydrogen-bond to other strands, so the edge strands of a sheet carry unpaired backbone N-H and C=O groups that can pair with a strand from another molecule. Sheets can also stack face-to-face through their side chains, as in the cross-β structure of amyloid fibrils. Both processes can be thermodynamically favorable: new strand-strand hydrogen bonds form, and burying hydrophobic faces releases ordered water (ΔS > 0). Evolutionarily, this property is exploited constructively (e.g., β-barrels in outer membrane proteins, which close the sheet on itself so that no edge is exposed) but is also the root cause of amyloid diseases (Alzheimer's Aβ, prion diseases), where misfolded sheets nucleate runaway aggregation.
Part B

Protein Structure Analysis: Protein G (PhiX174)

Protein G from bacteriophage PhiX174 was selected for detailed structural and computational analysis. As one of the smallest autonomous proteins (55 residues), it serves as a paradigm for understanding protein folding stability, interactive binding, and evolutionary constraint in the context of phage biology.

Protein Selection Rationale

PhiX174 was the first DNA genome to be fully sequenced (1977) and remains a central model organism in molecular biology. Protein G, its major spike protein, is essential for virion assembly. Understanding Protein G's biophysics directly informs Füzi Poiesis bacteriophage engineering: SQR and PhoA must fold independently while maintaining catalytic activity in a synthetic halophilic environment.

Part B · PyMOL Visualizations
Protein G structure representations

Protein G (55 aa, PhiX174 major spike protein) was visualized using PyMOL in multiple representations: cartoon (for secondary structure), ribbon, line, spheres (by residue type), and molecular surface. Each representation reveals different aspects of the protein's three-dimensional organization.

PyMOL Lines Representation
Lines representation showing backbone trace of Protein G. Useful for identifying secondary structure elements and overall fold topology.
PyMOL Multiple Representations
Combined visualization: cartoon (secondary structure), ribbon, and line representations of Protein G and neighboring chains in the 2BPA complex.
PyMOL Spheres by Residue Type
Space-filling model colored by residue type: blue (basic: Arg, Lys), red (acidic: Asp, Glu), green (hydrophobic: Leu, Ile, Val, Phe). Core hydrophobic patches visible in green; charged residues on surface.
PyMOL Surface Colored by Residue Type
Molecular surface colored by residue type. Surface-accessible acidic residues (red) and basic residues (blue) define the electrostatic landscape; buried hydrophobic core (green) maintains structural stability.
3am PyMOL Session with Ab-Soul
3am session: Protein G analysis with The Law (Ab-Soul, Mac Miller, Rapsody) playing in the background. This is how the work actually happens.
Structural Insights from Visualization
What PyMOL revealed about Protein G

Protein G's 55-residue sequence folds into a compact bundle with a central hydrophobic core (visible as green in the space-filling model). The molecule contains both α-helical regions (visible as thick cylinders in cartoon mode) and more extended regions that interact with neighboring proteins in the 2BPA complex. Charged residues (red/blue) cluster on the surface, providing the protein with hydrophilicity and solubility. The backbone line trace reveals the overall topology: a twisted structure with no obvious β-sheets in this view, sitting within the viral capsid as a structural spike protein.
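The residue-type coloring described in this section could be reproduced with a short PyMOL script. The sketch below only generates the command-language (.pml) text, so it runs without PyMOL installed; `fetch`, `hide`, `show`, `color`, and the `resn`/`chain` selectors are standard PyMOL commands, while the chain identifier `A` is a placeholder assumption that should be replaced with whichever chain holds Protein G in the loaded 2BPA entry.

```python
def residue_type_script(pdb_id="2bpa", chain_id="A"):
    """Build PyMOL .pml lines for a residue-type coloring view.

    chain_id is a placeholder: check the chain IDs of the loaded structure.
    """
    sel = f"chain {chain_id}"
    groups = {
        "blue": "ARG+LYS",             # basic
        "red": "ASP+GLU",              # acidic
        "green": "LEU+ILE+VAL+PHE",    # hydrophobic
    }
    lines = [
        f"fetch {pdb_id}",
        "hide everything",
        f"show cartoon, {sel}",
        f"color grey80, {sel}",        # neutral base color for other residues
    ]
    for color, resnames in groups.items():
        lines.append(f"color {color}, {sel} and resn {resnames}")
    lines.append(f"show surface, {sel}")
    return "\n".join(lines)

print(residue_type_script())
```

Saving the printed text to a file and opening it in PyMOL (File → Run Script) applies the view; swapping `show surface` for `show spheres` gives the space-filling version.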

Part C

Machine Learning Design Tools: ESM2, ESMFold, AlphaFold

C1 · Protein Language Modeling
Deep Mutational Scanning with ESM2

ESM-1v (Meta's protein language model) was used to generate a zero-shot deep mutational scan across Protein G's 55-residue sequence. The model computes log-likelihood ratios for all possible single-point mutations, producing a heatmap of mutation tolerance. Blue regions (low likelihood) indicate residues constrained by evolutionary pressure; orange/red regions indicate positions tolerant to substitution.
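To make the log-likelihood ratio concrete: for each position, the score of a substitution is log p(mutant) minus log p(wild-type) under the model. The sketch below is a minimal stand-alone version that takes per-position logits as plain lists; in a real scan those logits would come from the language model, and the toy logits here are invented purely for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]

def llr_scan(sequence, logits_per_position):
    """Return one dict per position mapping amino acid -> log-likelihood ratio.

    logits_per_position[i] holds 20 logits ordered as AMINO_ACIDS.
    """
    scan = []
    for wt, logits in zip(sequence, logits_per_position):
        logp = log_softmax(logits)
        wt_logp = logp[AMINO_ACIDS.index(wt)]
        scan.append({aa: lp - wt_logp for aa, lp in zip(AMINO_ACIDS, logp)})
    return scan

def mean_tolerance(sequence, scan):
    """Average LLR over the 19 substitutions at each position."""
    return [sum(v for aa, v in row.items() if aa != wt) / 19
            for wt, row in zip(sequence, scan)]

# Toy input (hypothetical): the wild-type residue gets logit 2.0, all others 0.0.
toy_seq = "MT"
toy_logits = [[2.0 if aa == wt else 0.0 for aa in AMINO_ACIDS] for wt in toy_seq]
scan = llr_scan(toy_seq, toy_logits)
print(scan[1]["A"], mean_tolerance(toy_seq, scan))
```

Strongly negative rows correspond to the constrained positions in the heatmap; rows near zero correspond to tolerant positions.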

Asimov Kernel Mutation Scan Heatmap
ESM2 Deep Mutational Scan: Protein G (PhiX174). Green (tolerated) → Blue (constrained). Position-by-residue log-likelihood landscape.
Protein G Deep Mutational Scan
ESM-1v Heatmap: Protein G (PhiX174). White (no mutation), teal/blue (constrained positions), orange/red (tolerated substitutions). Log-likelihood ratio scale.
Deep Mutational Scan Interpretation
Sequence constraint landscape

Key Pattern: The N-terminal region (residues 1-20) shows consistently low tolerance (blue), indicating core structural importance. Residue T3 (threonine at position 3) shows near-zero tolerance for any substitution, which suggests a critical role, for example as a hydrogen-bonding partner or a tightly packed interface residue. In contrast, surface-exposed residues (e.g., positions 40-50) tolerate multiple substitutions, particularly to other hydrophilic residues.

Specific Example: Position 45 (Glu) tolerates substitution to Asp (both acidic) and Lys (charge reversal is tolerated), but not to Pro or Gly. This suggests position 45 is constrained by polarity and backbone conformation rather than by a specific charge: any hydrophilic side chain is accepted, but residues that perturb backbone geometry are not. This pattern is typical of solvent-exposed residues within regular secondary structure.

For Füzi Poiesis: A similar scan of SQR (sulfide:quinone oxidoreductase) would identify positions constrained by cofactor and substrate binding (FAD, quinone) versus positions available for halotolerance mutations. This is critical for Aim 2: engineering SQR for Lake Budi's high-salinity environment requires understanding which positions can be mutated to improve salt stability without disrupting catalytic function.

Latent Space Analysis
t-SNE embedding of Protein G variants

A sequence dataset of Protein G homologs and variants was embedded into 3D space using t-SNE (t-distributed stochastic neighbor embedding), with points colored by experimental fitness scores. The resulting topology reveals the protein's sequence space landscape.

t-SNE Embedding Protein G
t-SNE 3D embedding of Protein G sequence variants. Color gradient yellow (high fitness) to purple (low fitness). The continuous cloud structure indicates a smooth fitness landscape with no sharp isolated fitness peaks.
t-SNE Interpretation
Landscape topology and design implications

The t-SNE embedding suggests a continuous, unimodal fitness landscape for Protein G: there are no isolated high-fitness islands disconnected from the rest of sequence space. Because t-SNE preserves local neighborhoods better than global distances, this reading is qualitative. If it holds, it is favorable for protein engineering: adaptive walks through sequence space (iterative mutagenesis + selection) should be able to move from most starting points toward higher-fitness variants without encountering impassable fitness valleys.
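The adaptive-walk idea can be tested directly on a variant table, independent of any embedding. The sketch below is a minimal greedy walk, assuming a dictionary of equal-length sequences mapped to fitness values; the two toy landscapes are hypothetical and exist only to contrast a smooth case with a rugged one.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def adaptive_walk(start, fitness):
    """Greedy walk: step to the fittest single-mutation neighbor until stuck."""
    path = [start]
    current = start
    while True:
        neighbors = [s for s in fitness if hamming(s, current) == 1]
        best = max(neighbors, key=fitness.get, default=current)
        if fitness[best] <= fitness[current]:
            return path                      # local (or global) peak reached
        current = best
        path.append(current)

def reaches_global_peak(fitness):
    """For each variant, report whether a greedy walk ends at the global peak."""
    peak = max(fitness, key=fitness.get)
    return {s: adaptive_walk(s, fitness)[-1] == peak for s in fitness}

# Hypothetical two-site landscapes over the alphabet {A, B}.
smooth = {"AA": 1.0, "AB": 2.0, "BA": 2.0, "BB": 3.0}
rugged = {"AA": 3.0, "AB": 1.0, "BA": 1.0, "BB": 2.0}

print(reaches_global_peak(smooth))
print(reaches_global_peak(rugged))
```

In the rugged example the variant `BB` is a local peak separated from the global peak by a fitness valley, which is the situation a smooth landscape avoids.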

The high-fitness variants (yellow, top of the cloud) are not clustered in a single tight region but are distributed across a gradient, suggesting multiple independent paths to high fitness exist. This is consistent with the evolutionary plasticity of external scaffolding proteins, which often tolerate significant sequence variation as long as the procapsid interaction interface is maintained.

For Füzi Poiesis, a similar analysis of SQR and PhoA sequence space would identify which positions can be mutated to improve halotolerance (for Lake Budi salinity conditions) without disrupting catalytic activity—a critical design question for Aim 2 chassis engineering.

C2

Protein Folding Validation with ESMFold

ESMFold structure prediction applied to wild-type Protein G from PhiX174. Confidence scores (pLDDT) indicate per-residue prediction reliability across the 55-residue sequence.

C2 · Protein Folding Validation
ESMFold Prediction Confidence Analysis

Wild-type Protein G was folded using ESMFold, a fast structure prediction model built on the ESM-2 language model. The output includes per-residue confidence scores (pLDDT) ranging from 0-100, where higher values indicate greater confidence in the predicted local structure.
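Structure predictors commonly write pLDDT into the B-factor column of the output PDB file, so the per-residue scores can be read back with a small fixed-column parser. The sketch below assumes that convention and a 0-100 scale (some exports use 0-1, which would need rescaling); the helper `ca_line` only fabricates toy records for the example and is not part of any library.

```python
def plddt_per_residue(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor of each CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resseq = int(line[22:26])          # PDB columns 23-26
            scores[resseq] = float(line[60:66])  # PDB columns 61-66
    return scores

def low_confidence(scores, cutoff=70.0):
    """Residue numbers whose pLDDT falls below the cutoff."""
    return sorted(pos for pos, value in scores.items() if value < cutoff)

def ca_line(resseq, plddt):
    """Format a minimal CA ATOM record (toy data for the example only)."""
    return ("ATOM  {:5d} {:<4s}{:1s}{:>3s} {:1s}{:4d}{:1s}   "
            "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}").format(
                resseq, " CA", "", "GLY", "A", resseq, "",
                0.0, 0.0, 0.0, 1.0, plddt)

toy_pdb = "\n".join([ca_line(1, 91.5), ca_line(2, 55.0), ca_line(3, 83.2)])
scores = plddt_per_residue(toy_pdb)
print(scores, low_confidence(scores))
```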

ESMFold pLDDT Heatmap
ESMFold prediction confidence heatmap for Protein G. Residues (y-axis) vs protein sequence positions (x-axis). Color scale: green (high confidence, pLDDT > 80) to purple (low confidence, pLDDT < 40).
Clean ESMFold Confidence Visualization
Cleaned visualization of Protein G confidence scores. The predominantly green coloring (high confidence) across the entire sequence indicates ESMFold predicts a stable, well-defined structure for wild-type Protein G.
C2 Interpretation
Structural Stability & Folding Robustness

Key Finding: Protein G achieves consistently high pLDDT scores (green, >80) across the majority of residues, indicating that ESMFold is highly confident in the predicted fold. This is expected for a well-characterized viral protein with a conserved fold across homologs.

Structural Implications: The absence of low-confidence regions (blue/purple) suggests the protein has no inherently disordered or flexible domains. For bacteriophage engineering (Aim 2 of Füzi Poiesis), this means Protein G is a stable scaffold: mutations that maintain core hydrophobic packing are likely to preserve structure and function.

For Halotolerance Engineering: SQR and PhoA, being much larger proteins (>400 aa each) than Protein G (55 aa), may have flexible terminal regions. A similar ESMFold analysis of those targets would identify positions available for halotolerance mutations (high pLDDT core, low pLDDT periphery = more mutation tolerance at edges).

Part D

Group Brainstorm: Bacteriophage L Protein Engineering

📄 Group Assignment Document

Complete group proposal including pipeline schematic, tool justification, and pitfall analysis:

📋 View on Google Drive:
Open Group Assignment (Bacteriophage L Protein Stabilization)

Group assignment in collaboration with synthetic biology cohort. Objective: design a computational pipeline to engineer the bacteriophage L protein (lysis protein) for increased stability, higher titers, or increased toxicity. Proposal includes three-step computational workflow, pitfall analysis, and visual pipeline schematic.

Group Primary Goal

Increased L Protein Stability via DnaJ-independence. Strategy: use ESM2 to identify and disrupt chaperone-recognition motifs (exposed hydrophobic patches), validate structure preservation with ESMFold, and confirm reduced binding affinity to DnaJ using AlphaFold-Multimer.

Step 1: Sequence Scanning
ESM2 Mutational Analysis & Motif Disruption

Use ESM-1v deep mutational scan to identify constrained vs. tolerant positions across the L protein sequence. Map known DnaJ-binding motifs (typically hydrophobic patches: Leu/Ile-rich regions) and propose polar/hydrophilic substitutions (Leu → Asp, Ile → Ser) that disrupt chaperone recognition while remaining sequence-viable according to ESM2 likelihood scores.
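The motif-mapping step can be prototyped with a sliding window before any model is run. The sketch below flags windows rich in hydrophobic residues and proposes the Leu → Asp and Ile → Ser substitutions named above; the window size, threshold, residue set, and the toy sequence are all illustrative assumptions, not the real L protein or a validated DnaJ motif definition.

```python
SUBSTITUTIONS = {"L": "D", "I": "S"}   # polar replacements proposed in the text

def hydrophobic_windows(seq, window=5, min_count=4, hydrophobic="LIVF"):
    """Return (1-based start, segment) for windows rich in hydrophobic residues."""
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        if sum(aa in hydrophobic for aa in segment) >= min_count:
            hits.append((start + 1, segment))
    return hits

def propose_mutations(seq, hits):
    """List substitutions (e.g. 'L5D') for Leu/Ile inside flagged windows."""
    positions = set()
    for start, segment in hits:
        for offset, aa in enumerate(segment):
            if aa in SUBSTITUTIONS:
                positions.add(start + offset)
    return [f"{seq[p - 1]}{p}{SUBSTITUTIONS[seq[p - 1]]}" for p in sorted(positions)]

toy_seq = "MKTSLLIVLGEEK"   # hypothetical sequence for illustration
hits = hydrophobic_windows(toy_seq)
print(hits)
print(propose_mutations(toy_seq, hits))
```

Each proposed substitution would then be checked against the language-model scores, keeping only those the model still rates as plausible.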

Step 2: Structural Validation
ESMFold High-Throughput Filtering

Top candidate sequences from Step 1 are folded with ESMFold. Filter variants by: (1) mean pLDDT > 70 (on the 0-100 scale), (2) backbone RMSD to the wild-type prediction < 2 Å, (3) no major loss of predicted secondary structure. This ensures mutations disrupt chaperone binding without collapsing protein structure. ESMFold is substantially faster than AlphaFold2 (reported as up to ~60× on short sequences, since it needs no multiple sequence alignment), enabling rapid screening of 50-100 variants in a single compute session.
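The filtering criteria translate into a few lines of code once each variant has a mean pLDDT and a coordinate set. The sketch below assumes pLDDT on the 0-100 scale (so the confidence cutoff is 70) and that mutant and wild-type coordinates have already been superposed, for example with a Kabsch alignment, before RMSD is computed; the `Variant` record and the toy coordinates are hypothetical.

```python
import math
from dataclasses import dataclass

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length, already superposed coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

@dataclass
class Variant:
    name: str
    mean_plddt: float   # 0-100 scale
    rmsd_to_wt: float   # in angstroms

def passes_filter(variant, min_plddt=70.0, max_rmsd=2.0):
    """Keep variants that are confidently folded and close to wild-type."""
    return variant.mean_plddt > min_plddt and variant.rmsd_to_wt < max_rmsd

# Toy backbone traces (hypothetical coordinates).
wild_type = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
mutant = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]

candidates = [
    Variant("L5D", mean_plddt=84.0, rmsd_to_wt=rmsd(wild_type, mutant)),
    Variant("I7S", mean_plddt=62.0, rmsd_to_wt=0.8),
]
print([v.name for v in candidates if passes_filter(v)])
```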

Step 3: Interaction Modeling
AlphaFold-Multimer: L Protein + DnaJ Complex

For top-ranked variants from Step 2, model the L protein-DnaJ complex using AlphaFold-Multimer. Compare predicted interface contacts, Predicted Aligned Error (PAE) across the interface, and interface confidence scores (e.g., ipTM) between wild-type and mutant. Prioritize variants showing significantly weakened or absent DnaJ interactions while maintaining stable L protein folding.
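One simple way to compare predicted interfaces is to count residue pairs that fall within a distance cutoff across the two chains. The sketch below assumes one representative coordinate per residue (for example Cα) and an 8 Å cutoff, a common but arbitrary choice; all coordinates are invented for illustration and do not come from a real prediction.

```python
import math

def count_contacts(chain_a, chain_b, cutoff=8.0):
    """Number of residue pairs (one from each chain) within the cutoff distance."""
    return sum(1 for a in chain_a for b in chain_b if math.dist(a, b) <= cutoff)

# Hypothetical per-residue coordinates (angstroms).
dnaj_site = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
l_wild_type = [(0.0, 6.0, 0.0), (4.0, 6.0, 0.0)]     # sits close to DnaJ
l_mutant = [(0.0, 15.0, 0.0), (4.0, 15.0, 0.0)]      # displaced from DnaJ

wt_contacts = count_contacts(l_wild_type, dnaj_site)
mut_contacts = count_contacts(l_mutant, dnaj_site)
print(wt_contacts, mut_contacts)
```

A drop in contact count is only suggestive; it should be read together with the model's own confidence metrics for the interface.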

L Protein Stabilization Pipeline
Complete computational pipeline for L protein engineering: ESM2 sequence scanning → variant design → ESMFold structural filtering → Boltz-1 complex modeling. Potential pitfalls noted: folding ≠ function, overlapping reading frames, membrane context, altered interactions.
Pitfall #1: Overlapping Reading Frames

Bacteriophage genomes are compact; L protein-encoding DNA may share sequence space with other genes or regulatory elements in alternative reading frames. Targeted mutations could silently disrupt neighboring genes or produce non-functional L mRNA. Genomic foundation models (Evo) could assess this but are computationally prohibitive for rapid screening.
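A quick check for this pitfall is to translate the region in all three forward frames before and after a nucleotide change and see which frames are affected. The sketch below uses the standard genetic code and a made-up 9-nucleotide sequence; it ignores the reverse strand and regulatory elements, so it is a first-pass screen only.

```python
from itertools import product

BASES = "TCAG"
AMINOS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINOS)}

def translate(dna, frame):
    """Translate complete codons of a DNA string starting at offset `frame`."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(frame, len(dna) - 2, 3))

def frames_changed(dna, position, new_base):
    """Report, per forward frame, whether a point mutation alters the protein."""
    mutant = dna[:position] + new_base + dna[position + 1:]
    return {frame: translate(dna, frame) != translate(mutant, frame)
            for frame in range(3)}

toy_dna = "ATGCTGGAT"                      # hypothetical; frame 0 reads M-L-D
print(frames_changed(toy_dna, 5, "A"))     # CTG -> CTA is silent in frame 0
```

In this toy case the change is synonymous in frame 0 but introduces a stop codon in frame 1 and a missense change in frame 2, which is exactly how a "safe" edit to one gene can damage an overlapping one.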

Pitfall #2: Stability ≠ Function

Structural stability (good fold, high pLDDT) does not guarantee biological function. Lytic activity requires membrane insertion, oligomerization, and catalytic geometry—all context-dependent factors not captured by ESMFold or even complex modeling. Completely abolishing DnaJ interaction might prevent proper membrane delivery. Experimental validation (lysis assays, cell lysis kinetics) is essential downstream.

Bacteriophage L Protein Engineering Pipeline
Group Pipeline Schematic: ESM2 sequence scan → ESMFold structural filtering → AlphaFold-Multimer complex modeling. Key pitfalls: genomic constraints and function-vs-stability trade-offs.
Tools Survey

Protein Engineering Toolkit

This week's lab included a systematic survey of computational tools used in protein design. The following tools were explored and documented:

PyMOL
3D molecular visualization. Cartoon, ribbon, surface, spheres, lines representations. Chain selection and labeling. Used for structural analysis of PhiX174 Protein G and 2BPA.
Asimov Kernel
AI-powered protein design platform. Mutation scan heatmaps, model score prediction across all substitutions. Used for Protein G fitness landscape mapping.
ESM-1v
Meta's protein language model. Log-likelihood ratio computation for deep mutational scanning. Zero-shot fitness prediction without experimental data.
t-SNE
Dimensionality reduction for sequence space visualization. 3D embedding of protein variant clouds. Fitness landscape topology analysis.
AlphaFold2
Structure prediction from sequence. High-confidence predictions for proteins without experimental structures. Surveyed for Füzi Poiesis protein targets.
RFdiffusion
Diffusion-based de novo protein backbone generation. Surveyed for potential application to design novel scaffolds. Not used in current Füzi Poiesis Aim 1 scope.
Reflection

What Protein G taught me

Protein G was not a protein I chose for strategic reasons—I chose it because PhiX174 is one of the great objects in the history of molecular biology, and I wanted to understand something about it deeply rather than superficially. I read about it, took notes by hand, and worked through the structure in PyMOL at 3am with Ab-Soul playing. That is documented here honestly.

What I learned that transferred directly to Füzi Poiesis: the deep mutational scan showed me that fitness landscapes are rarely binary. Most positions in Protein G tolerate some substitutions—the question is which ones, and by how much. The same question applies to SQR from Rhodobacter capsulatus and PhoA from E. coli K-12: these proteins evolved in organisms with very different physicochemical environments from Lake Budi. Before engineering them for halotolerant expression, understanding which positions are constrained (blue in the heatmap) and which are tolerant (orange) would tell me where I can introduce stabilizing mutations without disrupting catalytic function.

The t-SNE landscape was conceptually the most important result: a continuous fitness landscape means directed evolution is tractable. A rugged landscape with isolated peaks would mean engineering SQR for halotolerance requires a lucky jump—but a smooth landscape means systematic mutagenesis and selection can navigate there step by step. That distinction matters for Aim 2 experimental design.