Conceptual Questions on Amino Acids & Protein Structure
Nine questions selected from Shuguang Zhang's foundational amino acid and protein structure curriculum. The responses below address amino acid chemistry, structural biology, and evolutionary considerations.
Protein Structure Analysis: Protein G (PhiX174)
Protein G from bacteriophage PhiX174 was selected for detailed structural and computational analysis. As one of the smallest autonomous proteins (55 residues), it serves as a compact model for studying folding stability, binding interactions, and evolutionary constraint in the context of phage biology.
PhiX174 was the first DNA genome to be fully sequenced (1977) and remains a central model system in molecular biology. Protein G, its maturation protease, is essential for virion assembly. Understanding Protein G's biophysics directly informs Füzi Poiesis bacteriophage engineering: SQR and PhoA must fold independently while maintaining catalytic activity in a synthetic halophilic environment.
Protein G (55 aa, PhiX174 maturation protease) was visualized using PyMOL in multiple representations: cartoon (for secondary structure), ribbon, line, spheres (by residue type), and molecular surface. Each representation reveals different aspects of the protein's three-dimensional organization.
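The representations listed above could be reproduced with a short PyMOL command script. The sketch below only builds that script as text in Python, so it runs without PyMOL installed. The file name protein_g.pdb, the object name protG, the output image names, and the residue-class color groups are all assumptions made for illustration, not the settings used for the original figures.

```python
# Builds a PyMOL command script (.pml) that renders one image per representation.
# Hypothetical names: protein_g.pdb, protG, and the PNG file names.

REPRESENTATIONS = ["cartoon", "ribbon", "lines", "spheres", "surface"]

# Assumed residue classes: hydrophobic = green, acidic = red, basic = blue.
RESIDUE_COLORS = {
    "green": "ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO",
    "red": "ASP+GLU",
    "blue": "LYS+ARG+HIS",
}


def build_pml(pdb_path="protein_g.pdb", obj="protG"):
    """Return the text of a PyMOL script covering every representation."""
    lines = [f"load {pdb_path}, {obj}", f"color white, {obj}"]
    for color, resn in RESIDUE_COLORS.items():
        lines.append(f"color {color}, {obj} and resn {resn}")
    for rep in REPRESENTATIONS:
        # Show exactly one representation at a time, then save an image.
        lines.append(f"hide everything, {obj}")
        lines.append(f"show {rep}, {obj}")
        lines.append(f"png {obj}_{rep}.png")
    return "\n".join(lines) + "\n"


script = build_pml()
print(script)
```

Saving the printed text to a file and running it inside PyMOL (for example with the `@` command) would produce five images, one per representation, using the same coloring throughout.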
Protein G's 55-residue sequence folds into a compact bundle with a central hydrophobic core (green in the space-filling model). The molecule contains both α-helical regions (thick cylinders in cartoon mode) and more extended regions that contact neighboring proteins in the 2BPA complex. Charged residues (red/blue) cluster on the surface, which keeps the protein hydrophilic and soluble. The backbone line trace shows the overall topology: a twisted structure with no obvious β-sheets, consistent with its role as a maturation protease in a viral coat.
Machine Learning Design Tools: ESM2, ESMFold, AlphaFold
ESM-1v (Meta's protein language model) was used to generate a zero-shot deep mutational scan across Protein G's 55-residue sequence. The model computes log-likelihood ratios for all possible single-point mutations, producing a heatmap of mutation tolerance. Blue regions (low likelihood) indicate residues constrained by evolutionary pressure; orange/red regions indicate positions tolerant to substitution.
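The log-likelihood ratio behind each heatmap cell can be written as log p(mutant) minus log p(wild type) at a given position. The sketch below shows only that arithmetic on made-up numbers: the four-residue sequence is a placeholder (not the Protein G sequence), and the per-position log-probabilities are random stand-ins for what a masked language model such as ESM-1v would return.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Convert raw scores into log-probabilities that sum to 1 in probability space."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def llr_scan(sequence, log_probs):
    """Return one dict per position: amino acid -> log p(aa) - log p(wild type)."""
    scan = []
    for wt, row in zip(sequence, log_probs):
        scan.append({aa: row[aa] - row[wt] for aa in AMINO_ACIDS})
    return scan


# Toy inputs (assumptions): placeholder sequence and random model outputs.
rng = random.Random(0)
sequence = "MTEK"
log_probs = []
for _ in sequence:
    lp = log_softmax([rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS])
    log_probs.append(dict(zip(AMINO_ACIDS, lp)))

scan = llr_scan(sequence, log_probs)

# Mean LLR over the 19 substitutions: more negative means less tolerant.
mean_llr = [
    sum(v for aa, v in row.items() if aa != wt) / 19
    for wt, row in zip(sequence, scan)
]
for pos, (wt, score) in enumerate(zip(sequence, mean_llr), start=1):
    print(f"{wt}{pos}: mean LLR = {score:.2f}")
```

By construction the wild-type residue scores exactly zero at every position, so negative values mark substitutions the model considers less likely than the wild type.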
Key Pattern: The N-terminal region (residues 1-20) shows consistently low tolerance (blue), indicating core structural importance. Residue T3 (threonine at position 3) shows near-zero tolerance for any substitution, which suggests it is either a critical interaction partner or a tightly packed core residue. In contrast, surface-exposed residues (e.g., positions 40-50) tolerate multiple substitutions, particularly to other hydrophilic residues.
Specific Example: Position 45 (Glu) tolerates substitution to Asp (also acidic) and to Lys (charge reversal), but not to Pro or Gly. Because even a charge reversal is tolerated while the two residues that most perturb backbone conformation are not, position 45 appears to be constrained by local backbone geometry rather than by side-chain charge. This pattern is typical of solvent-exposed positions within regular secondary structure, where most polar side chains are acceptable.
For Füzi Poiesis: A similar scan of SQR (sulfide:quinone oxidoreductase) would distinguish positions constrained by cofactor binding (FAD, heme) from positions available for halotolerance mutations. This is critical for Aim 2: engineering SQR for Lake Budi's high-salinity environment requires knowing which positions can be mutated to improve salt stability without disrupting catalytic function.
A sequence dataset of Protein G homologs and variants was embedded into 3D space using t-SNE (t-distributed stochastic neighbor embedding), with points colored by experimental fitness scores. The resulting topology reveals the protein's sequence space landscape.
The t-SNE embedding suggests a continuous, unimodal fitness landscape for Protein G: no isolated high-fitness islands appear disconnected from the rest of sequence space. Because t-SNE preserves local neighborhoods rather than global distances, this reading is qualitative. If it holds, it is favorable for protein engineering, since adaptive walks through sequence space (iterative mutagenesis + selection) could move from any starting point toward higher-fitness variants without crossing impassable fitness valleys.
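An adaptive walk of the kind described here can be sketched as a greedy loop: score every single-point mutant, keep the best one, and stop when no mutant improves fitness. The landscape below is a toy additive model with random per-site scores, which is an assumption chosen because it has no fitness valleys by construction; it is not fitted to Protein G data, and the starting sequence is a placeholder.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_additive_landscape(length, seed=0):
    """Toy landscape: each site contributes an independent random score per residue."""
    rng = random.Random(seed)
    return [{aa: rng.random() for aa in AMINO_ACIDS} for _ in range(length)]


def fitness(seq, landscape):
    """Fitness is the sum of per-site scores (no epistasis in this toy model)."""
    return sum(site[aa] for aa, site in zip(seq, landscape))


def adaptive_walk(start, landscape):
    """Greedy walk: accept the best single mutant until none improves fitness."""
    current = start
    path = [current]
    while True:
        best, best_fit = current, fitness(current, landscape)
        for i in range(len(current)):
            for aa in AMINO_ACIDS:
                if aa == current[i]:
                    continue
                mutant = current[:i] + aa + current[i + 1:]
                f = fitness(mutant, landscape)
                if f > best_fit:
                    best, best_fit = mutant, f
        if best == current:
            return path
        current = best
        path.append(current)


landscape = make_additive_landscape(length=6, seed=0)
path = adaptive_walk("AAAAAA", landscape)
optimum = "".join(max(site, key=site.get) for site in landscape)
for step, seq in enumerate(path):
    print(step, seq, round(fitness(seq, landscape), 3))
```

On an additive landscape the walk always ends at the global optimum. On a rugged landscape with epistasis the same loop can stall at a local peak, which is the failure mode a unimodal landscape would rule out.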
The high-fitness variants (yellow, top of the cloud) are not clustered in a single tight region but are distributed across a gradient, suggesting multiple independent paths to high fitness exist. This is consistent with the evolutionary plasticity of external scaffolding proteins, which often tolerate significant sequence variation as long as the procapsid interaction interface is maintained.
For Füzi Poiesis, a similar analysis of SQR and PhoA sequence space would identify which positions can be mutated to improve halotolerance (for Lake Budi salinity conditions) without disrupting catalytic activity—a critical design question for Aim 2 chassis engineering.
Protein Folding Validation with ESMFold
ESMFold structure prediction applied to wild-type Protein G from PhiX174. Confidence scores (pLDDT) indicate per-residue prediction reliability across the 55-residue sequence.
Wild-type Protein G was folded using ESMFold, a fast structure prediction model built on the ESM-2 language model. The output includes per-residue confidence scores (pLDDT) ranging from 0 to 100, where higher values indicate greater confidence in the predicted local structure.
Key Finding: Protein G achieves consistently high pLDDT scores (green, >80) across the majority of residues, indicating that ESMFold is confident in the predicted fold. This is expected for a well-characterized viral protein with a conserved fold across homologs.
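A summary such as "most residues score above 80" can be computed from the predicted structure file, since ESMFold writes pLDDT into the B-factor column of its PDB output. The sketch below reads that column from Cα atoms and reports the fraction above a threshold. The four-residue PDB text is generated in the code purely as a toy input; its residues and scores are invented and are not the Protein G prediction.

```python
def toy_ca_line(serial, resname, resseq, plddt):
    """Build one fixed-width PDB ATOM record for a CA atom (toy coordinates)."""
    return (
        f"ATOM  {serial:>5}  CA  {resname:>3} A{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def plddt_per_residue(pdb_text):
    """Return {(chain, residue number): pLDDT} read from CA atoms.

    PDB columns are fixed width: atom name 13-16, chain 22,
    residue number 23-26, B-factor (here pLDDT) 61-66.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            resseq = int(line[22:26])
            scores[(chain, resseq)] = float(line[60:66])
    return scores


def fraction_confident(scores, threshold=80.0):
    """Fraction of residues whose pLDDT is strictly above the threshold."""
    return sum(s > threshold for s in scores.values()) / len(scores)


# Toy input (assumption): four invented residues with invented scores.
toy_pdb = "\n".join(
    [
        toy_ca_line(1, "MET", 1, 91.2),
        toy_ca_line(2, "THR", 2, 85.0),
        toy_ca_line(3, "GLU", 3, 62.5),
        toy_ca_line(4, "LYS", 4, 88.8),
    ]
)

scores = plddt_per_residue(toy_pdb)
print(scores)
print(f"fraction above 80: {fraction_confident(scores):.2f}")
```

Reading only Cα atoms gives one score per residue, which is enough here because the value stored in the B-factor column is a per-residue confidence.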
Structural Implications: The absence of low-confidence regions (blue/purple) suggests the protein has no inherently disordered or flexible domains. For bacteriophage engineering (Aim 2 of Füzi Poiesis), this suggests Protein G is a stable scaffold: mutations that maintain core hydrophobic packing are likely to preserve structure and function. pLDDT is a confidence estimate rather than a stability measurement, so this conclusion should be checked experimentally.
For Halotolerance Engineering: SQR and PhoA, being much larger proteins (>400 aa each) than Protein G (55 aa), may have flexible terminal regions. A similar ESMFold analysis of those targets could help prioritize positions for halotolerance mutations: a high-pLDDT core is likely to be structurally constrained, while low-pLDDT peripheral regions are more likely to tolerate substitution.
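One simple way to flag candidate flexible termini is to count how many consecutive residues at each end fall below a pLDDT cutoff. The sketch below does that on an invented score list. The cutoff of 70 is an assumed convention, the scores are not from SQR or PhoA, and low pLDDT is only a hint of flexibility, not a measurement of mutation tolerance.

```python
def flexible_termini(plddt, cutoff=70.0):
    """Return (n_term, c_term): lengths of the low-confidence runs at each end.

    A run ends at the first residue whose pLDDT reaches the cutoff.
    The C-terminal run never overlaps residues already counted at the N-terminus.
    """
    n_term = 0
    while n_term < len(plddt) and plddt[n_term] < cutoff:
        n_term += 1
    c_term = 0
    while c_term < len(plddt) - n_term and plddt[-1 - c_term] < cutoff:
        c_term += 1
    return n_term, c_term


# Toy scores (assumption): a confident core flanked by low-confidence ends.
toy_scores = [45.0, 52.0, 68.0, 88.0, 92.0, 90.0, 85.0, 60.0, 55.0]
n_term, c_term = flexible_termini(toy_scores)
core = toy_scores[n_term:len(toy_scores) - c_term]
print(f"N-terminal low-confidence residues: {n_term}")
print(f"C-terminal low-confidence residues: {c_term}")
print(f"core length: {len(core)}")
```

Residues inside the two terminal runs would be the first candidates to examine for substitutions, while the core between them would be treated as constrained until a mutational scan says otherwise.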
Group Brainstorm: Bacteriophage L Protein Engineering
Complete group proposal including pipeline schematic, tool justification, and pitfall analysis: