Week 4 HW: Protein Design Part I

Part A. Conceptual Questions

How many molecules of amino acids do you take in with a 500-gram piece of meat? Meat is approximately 20% protein by mass.
• 500 g of meat $\times$ 0.20 = 100 g of protein.
• Average molecular weight of an amino acid $\approx$ 100 Daltons (g/mol).
• 100 g / 100 g/mol = 1 mole of amino acids.
• Using Avogadro’s number, you consume approximately $6.022 \times 10^{23}$ molecules of amino acids.
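As a worked check of the estimate above, under the same round-number assumptions (20% protein by mass, 100 g/mol per amino acid):

$$
N \approx \frac{500\ \text{g} \times 0.20}{100\ \text{g/mol}} \times N_A = 1\ \text{mol} \times 6.022 \times 10^{23}\ \text{mol}^{-1} \approx 6 \times 10^{23}
$$

If one instead assumes a mean residue mass of about 110 g/mol (an assumption not used in the answer above), the result drops only slightly, to roughly $5.5 \times 10^{23}$, so the order of magnitude is unchanged.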
Why do humans eat beef but do not become a cow, or eat fish but do not become fish? During digestion, enzymes break down foreign proteins into their individual “bricks” (amino acids). Our body then absorbs these amino acids and uses our own DNA “instruction manual” to reassemble them into human-specific proteins. We share the same bricks, but we build a different house.
Why are there only 20 natural amino acids? This is likely an “evolutionary frozen accident.” These 20 amino acids provide enough chemical diversity (charge, size, and polarity) to fold into almost any functional shape required for life. While more exist in nature, these 20 were sufficient for the ancestor of all life.
Can you make other non-natural amino acids? Design some. Yes. Scientists use “Expanded Genetic Code” techniques to create hundreds of non-natural amino acids (ncAAs). • Design Example: p-azidophenylalanine. It contains an azide group that allows for “Click Chemistry,” letting us attach fluorescent dyes or drugs directly to a specific spot on a protein.
Where did amino acids come from before enzymes and before life started? They were created through abiotic synthesis. The Miller-Urey experiment showed that electric sparks (simulating lightning) passed through a model primitive atmosphere of methane, ammonia, hydrogen, and water vapor spontaneously produce amino acids; heat and UV radiation are other plausible energy sources. Amino acids have also been found in meteorites, suggesting they can form in space.
If you make an $\alpha$-helix using D-amino acids, what handedness would you expect? Natural L-amino acids form right-handed $\alpha$-helices. A chain built entirely of D-amino acids is the mirror image of its L counterpart, so it would form a left-handed helix.
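Can you discover additional helices in proteins? Yes. Beyond the standard $\alpha$-helix (3.6 residues per turn, hydrogen bonds from residue $i$ to $i+4$), there are:
• $3_{10}$ helix: Tighter and more elongated (about 3 residues per turn, $i \rightarrow i+3$ hydrogen bonds).
• $\pi$-helix: Wider and shorter (roughly 4.1 to 4.4 residues per turn depending on the source, $i \rightarrow i+5$ hydrogen bonds).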
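To make “more elongated” and “shorter” concrete, the pitch of a helix (rise per full turn) is residues per turn times rise per residue. The sketch below uses approximate textbook parameters; the exact numbers, especially for the $\pi$-helix, vary by source and are assumptions here rather than values from this assignment.

```python
# Approximate textbook helix parameters (assumed values; sources differ slightly).
HELICES = {
    "3_10": {"residues_per_turn": 3.0, "rise_per_residue": 2.0},    # i -> i+3 H-bonds
    "alpha": {"residues_per_turn": 3.6, "rise_per_residue": 1.5},   # i -> i+4 H-bonds
    "pi": {"residues_per_turn": 4.4, "rise_per_residue": 1.15},     # i -> i+5 H-bonds
}


def pitch(name: str) -> float:
    """Rise along the helix axis per full turn, in angstroms."""
    h = HELICES[name]
    return h["residues_per_turn"] * h["rise_per_residue"]


def helix_length(name: str, n_residues: int) -> float:
    """Axial length of an ideal helix with n_residues, in angstroms."""
    return n_residues * HELICES[name]["rise_per_residue"]


if __name__ == "__main__":
    for name in HELICES:
        print(f"{name}: pitch {pitch(name):.2f} A, "
              f"10 residues span {helix_length(name, 10):.1f} A")
```

With these parameters the same ten residues span the most distance in a $3_{10}$ helix and the least in a $\pi$-helix, which is the sense in which the former is elongated and the latter compressed.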
Why are most molecular helices right-handed? Because life is “chiral” and almost exclusively uses L-amino acids. For L-amino acids, the right-handed twist is energetically more stable because it minimizes physical clashing (steric hindrance) between the side chains and the protein backbone.
Why do $\beta$-sheets tend to aggregate? What is the driving force? $\beta$-sheets have “sticky” edges where backbone hydrogen-bond donors and acceptors are left unpaired. The primary driving forces are the hydrophobic effect and inter-strand hydrogen bonding. Strands aggregate to hide their “greasy” hydrophobic parts from water and to satisfy those edge hydrogen bonds, snapping together like Lego bricks.

Part B. Protein Analysis (KRAS)

Protein Selection and Description
• Protein Selected: KRAS (Kirsten rat sarcoma viral oncogene homolog).
• Selection Rationale: I selected KRAS because it is a fundamental “molecular switch” in human cells. It controls signaling pathways for cell growth and survival. KRAS mutations, most often at the G12 position, are found in a large share of human cancers (a commonly cited figure is about 25%), making it a primary target for modern drug design and AI-driven structural analysis.
Amino Acid Sequence and Frequency
• Sequence (from PDB 4DS1): MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEK
• Length: This specific structure (catalytic domain) is 169 amino acids long.
• Most Frequent Amino Acid: Counting the sequence above, Valine (V) and Aspartate (D) are tied as the most frequent residues at 15 each, followed by Glutamate (E) at 13; Leucine (L) appears 11 times. Valine and Leucine matter structurally because they help pack the hydrophobic core of the Rossmann fold.
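The Colab frequency counter itself is not reproduced here; the following is a minimal stand-in that counts residues in the sequence quoted above using only the Python standard library. The function name `residue_frequencies` is illustrative, not taken from the course notebook.

```python
from collections import Counter

# KRAS catalytic-domain sequence exactly as quoted in this section.
KRAS_SEQ = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEK"


def residue_frequencies(seq: str) -> list[tuple[str, int, float]]:
    """Return (residue, count, percent) tuples sorted by descending count."""
    counts = Counter(seq)
    total = len(seq)
    return [(aa, n, 100.0 * n / total) for aa, n in counts.most_common()]


if __name__ == "__main__":
    print(f"Length: {len(KRAS_SEQ)}")
    # Show the five most common residues with their share of the sequence.
    for aa, n, pct in residue_frequencies(KRAS_SEQ)[:5]:
        print(f"{aa}: {n} ({pct:.1f}%)")
```

Sorting by count rather than reading a bar chart by eye makes ties between residues easy to spot.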
Homologs and Family
• Number of Homologs: A UniProt BLAST search reveals over 5,000 homologs. KRAS is highly conserved across eukaryotes, from yeast to humans.
• Protein Family: It belongs to the Ras family of small GTPases.
RCSB Structure Details
• RCSB Page: 4DS1
• Solved Date: The structure was solved in 2012.
• Quality/Resolution: The resolution is $1.60\,\text{Å}$. This is an excellent-quality structure (well below the $2.70\,\text{Å}$ limit), providing near-atomic detail of the binding pocket.
• Other Molecules: In addition to the protein, the structure contains GDP (guanosine-5’-diphosphate), a magnesium ion (MG), which is crucial for catalysis, and water (HOH) molecules.
Structural Classification • Structure Classification Family: KRAS is classified as having a Rossmann fold architecture (Alpha/Beta). It consists of a central 6-stranded $\beta$-sheet surrounded by 5 $\alpha$-helices.

3D Molecule Visualization (PyMOL)

A. Visualization Styles
• Cartoon: hide everything; show cartoon
• Ribbon: show ribbon
• Ball and Stick: show sticks; show spheres; set stick_radius, 0.2; set sphere_scale, 0.25

B. Secondary Structure
• Observation: KRAS features a balanced mix of helices and sheets. It has 5 main $\alpha$-helices and 6 $\beta$-strands that form the central “floor” of the protein.
• PyMOL Command: color red, ss h; color yellow, ss s; color green, ss l+''

C. Residue Type (Hydrophobic vs. Hydrophilic)
• Observation: When colored by residue type, you can see a clear hydrophobic core (red) where amino acids like Valine and Leucine are tucked away from water. The hydrophilic residues (blue) are spread across the surface to interact with the cellular environment.
• PyMOL Command: select hydrophobic, resn ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO; color red, hydrophobic; color blue, polymer and not hydrophobic (PyMOL has no built-in hydrophobic or hydrophilic selection keyword, so the residue set is defined explicitly first).

D. Surface and Binding Pockets
• Surface: show surface
• Binding Pockets: KRAS has a very prominent “hole” or binding pocket where the GDP molecule and the magnesium ion sit. This pocket is formed by the P-loop and the two “Switch” regions (Switch I and Switch II). These switches change conformation between the GDP-bound (inactive) and GTP-bound (active) states, which is what allows active KRAS to interact with other proteins.
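The residue-type coloring above can be collected into a reusable PyMOL script. The sketch below only generates the script text, so it runs without PyMOL installed; the hydrophobic residue set is an assumption (classifications differ, for example on Gly, Cys, and Tyr), and the helper name `pymol_hydrophobicity_script` is hypothetical.

```python
# Residues treated as hydrophobic here. This grouping is an assumption;
# other hydrophobicity scales classify some residues differently.
HYDROPHOBIC = ("ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO")


def pymol_hydrophobicity_script(pdb_id: str, hydrophobic=HYDROPHOBIC) -> str:
    """Build a .pml script that colors hydrophobic residues red, the rest blue."""
    resn_expr = "resn " + "+".join(hydrophobic)
    lines = [
        f"fetch {pdb_id}",                  # download the structure
        "hide everything",
        "show cartoon",
        f"select hydrophobic, polymer and {resn_expr}",
        "select hydrophilic, polymer and not hydrophobic",
        "color red, hydrophobic",
        "color blue, hydrophilic",
        "deselect",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the script next to the working directory for loading in PyMOL.
    with open("kras_colors.pml", "w") as handle:
        handle.write(pymol_hydrophobicity_script("4DS1"))
```

The resulting file can be loaded from the PyMOL command line with `@kras_colors.pml`. Restricting both selections to `polymer` keeps GDP, magnesium, and waters out of the coloring.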

Part C. Using ML-Based Protein Design Tools

Selected Protein: KRAS (Kirsten rat sarcoma viral oncogene homolog)
PDB ID: 4DS1

In this section, I explored how modern AI models can be used to analyze and redesign the KRAS protein. The computational experiments were run in a Google Colab environment on a T4 GPU.

C1. Protein Language Modeling (ESM-2)
Deep Mutational Scans (DMS)
• Methodology: I used the ESM-2 language model to generate an unsupervised deep mutational scan of the KRAS catalytic domain.
• Key Observation: The heatmap reveals a significant mutational “cold spot” at Glycine 12 (G12), where almost all substitutions are predicted to be highly unfavorable (dark blue/purple).
• Biological Significance: This aligns with clinical reality; G12 is a critical residue in the P-loop, and mutations here impair GTP hydrolysis, locking KRAS in its active, GTP-bound state and driving tumor growth. The model flagged this position as intolerant to substitution without being trained on cancer data.
Latent Space Analysis
• Visualizing the “Grammar”: Using t-SNE, I projected protein sequence embeddings into a 3D space to see how the model organizes biological information.
• Clustering: My KRAS sequence was grouped within a cluster of other small GTPases, suggesting that the embeddings capture functional similarity from sequence patterns alone.

C2. Protein Folding (ESMFold)
Folding the Oncogene
• Result: I used ESMFold to predict the 3D structure of the KRAS sequence.
• Accuracy: The predicted ribbon model shows a high-confidence Rossmann fold architecture, consisting of a central $\beta$-sheet and surrounding $\alpha$-helices, matching the experimental PDB structure.
• Resilience Testing: I tested the fold’s resilience by introducing mutations. While surface positions were tolerant, mutations in the hydrophobic core significantly reduced folding confidence, illustrating the delicate balance required to maintain the KRAS structural scaffold.
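For the deep mutational scan in C1, a common way to turn language-model output into a heatmap is to score each substitution as the log-probability of the mutant residue minus that of the wild-type residue at the same position. The sketch below shows only that scoring step on made-up numbers; the input format (one dict of log-probabilities per position) and all function names are assumptions, and obtaining real log-probabilities from ESM-2 is not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dms_scores(wt_seq, log_probs):
    """Score every single substitution as log p(mutant) - log p(wild type).

    log_probs[i] maps each amino-acid letter to the model's log-probability
    at position i (hypothetical input format).
    """
    if len(wt_seq) != len(log_probs):
        raise ValueError("need one log-probability dict per residue")
    scores = {}
    for i, wt in enumerate(wt_seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                # Labels use 1-based numbering, e.g. "G12D".
                scores[f"{wt}{i + 1}{aa}"] = log_probs[i][aa] - log_probs[i][wt]
    return scores


def position_tolerance(wt_seq, log_probs):
    """Mean substitution score per position; very negative means a cold spot."""
    scores = dms_scores(wt_seq, log_probs)
    means = []
    for i, wt in enumerate(wt_seq):
        vals = [scores[f"{wt}{i + 1}{aa}"] for aa in AMINO_ACIDS if aa != wt]
        means.append(sum(vals) / len(vals))
    return means


def toy_log_probs(preferred, p_preferred):
    """Made-up distribution: one favored residue, the rest share the remainder."""
    p_other = (1.0 - p_preferred) / (len(AMINO_ACIDS) - 1)
    return {aa: math.log(p_preferred if aa == preferred else p_other)
            for aa in AMINO_ACIDS}


if __name__ == "__main__":
    wt = "AGK"  # toy peptide; position 2 is made strongly conserved
    lp = [toy_log_probs("A", 0.30), toy_log_probs("G", 0.95), toy_log_probs("K", 0.30)]
    for pos, mean in enumerate(position_tolerance(wt, lp), start=1):
        print(pos, wt[pos - 1], round(mean, 2))
```

In the toy input, the glycine position gets the most negative mean score, which is the same pattern the heatmap shows at G12.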

C3. Protein Generation (ProteinMPNN)
Inverse Folding and Sequence Redesign
• Approach: Using the 3D backbone of KRAS as a fixed template, I used ProteinMPNN to design new sequences that could fold into the same shape.
• Probability Analysis: The resulting probability matrix shows bright yellow spots at positions where the model assigns high probability to a single amino acid, i.e. residues it treats as important for the fold’s stability. Many of these correspond to the internal $\beta$-strands that form the structural floor of the protein.
• Validation: Re-folding these AI-generated sequences with ESMFold predicted that they maintain the characteristic KRAS topology, which supports the inverse-folding pipeline as a way to design stable variants, pending experimental confirmation.
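The “bright yellow spots” in the probability matrix can be quantified per position. The sketch below assumes the matrix is available as a list of per-position probability lists (a hypothetical format, not ProteinMPNN’s actual output structure) and computes Shannon entropy, a list of confidently assigned positions, and sequence recovery against the native sequence. The 0.8 cutoff is an arbitrary illustrative threshold.

```python
import math


def position_entropy(probs):
    """Shannon entropy (bits) of one position's amino-acid distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def confident_positions(prob_matrix, min_top_prob=0.8):
    """1-based indices where the top amino acid has probability >= min_top_prob."""
    return [i + 1 for i, row in enumerate(prob_matrix) if max(row) >= min_top_prob]


def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have equal length")
    return sum(a == b for a, b in zip(designed, native)) / len(native)


if __name__ == "__main__":
    uniform = [1 / 20] * 20          # model has no preference at this position
    one_hot = [1.0] + [0.0] * 19     # model is certain about one residue
    print(position_entropy(uniform), position_entropy(one_hot))
    print(confident_positions([uniform, one_hot]))
    print(sequence_recovery("MTEYK", "MTEFK"))
```

Low entropy corresponds to a bright spot in the heatmap; sequence recovery gives a single number for how far a design has drifted from wild-type KRAS.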

Part D. Group Brainstorm on Bacteriophage Engineering

Project Title: Optimizing Phage Lysis Protein Stability using a KRAS-inspired AI Pipeline.

The Sub-problem: Thermal Instability of Lysis Proteins
Bacteriophages are promising alternatives to antibiotics, but many therapeutic phages are sensitive to environmental stress, such as heat or pH changes, which causes their proteins to denature. We chose to focus on the Lysis Protein, which is responsible for rupturing the bacterial cell wall during the phage life cycle.
Proposed Computational Approach
We propose applying the same AI-driven workflow used for KRAS in Part C to design a “super-stable” version of the lysis protein:
• ESM-2 for Mutation Scanning: Just as we identified the critical G12 residue in KRAS, we will use ESM-2 to generate a Deep Mutational Scan (DMS) of the lysis protein to find positions that tolerate substitution. Because language-model scores reflect sequence plausibility rather than measured stability, we treat them as a filter for candidate stabilizing mutations that are unlikely to destroy function.
• ProteinMPNN for Sequence Redesign: Following the KRAS “inverse-folding” logic, we will use ProteinMPNN to redesign the protein’s hydrophobic core. The goal is to maximize the probability of a stable fold while maintaining the specific 3D geometry needed to attack the bacterial wall.
• ESMFold for Structural Validation: Every new AI-generated sequence will be folded in silico. We will compare the predicted structures to the wild-type to ensure the active site remains intact.
Why These Tools?
• Efficiency: Traditional laboratory “trial and error” for protein stabilization can take years. AI tools like ESMFold can provide structural insights in seconds.
• Evolutionary Logic: Language models like ESM-2 capture the “grammar” of protein sequences, ensuring that our designed mutations are biologically plausible.
Potential Pitfalls
• Activity-Stability Trade-off: Increasing the stability (rigidity) of a protein can sometimes reduce its enzymatic activity. A lysis protein that is too stable might not be flexible enough to function properly.
• AI Hallucinations: AI models can sometimes propose sequences that look good on screen but fail to express or fold correctly in a real bacterial cell.
Schematic of the Pipeline
Input: WT Lysis Sequence $\rightarrow$ ESM-2 (Stability Map) $\rightarrow$ ProteinMPNN (Sequence Redesign) $\rightarrow$ ESMFold (3D Validation) $\rightarrow$ Output: Optimized Candidate for Synthesis.
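The final ESMFold validation step of the pipeline can be sketched as a simple confidence filter. This assumes each predicted structure is available as PDB-format text with per-residue pLDDT stored in the B-factor column on a 0 to 100 scale, which should be checked against the ESMFold version actually used; the cutoff of 70 and the function names are illustrative assumptions.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor field (PDB columns 61-66) over C-alpha atoms."""
    values = []
    for line in pdb_text.splitlines():
        # Atom name lives in columns 13-16 of a fixed-width ATOM record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no C-alpha atoms found")
    return sum(values) / len(values)


def filter_candidates(predictions: dict, min_plddt: float = 70.0) -> list:
    """Keep designs whose mean pLDDT passes the cutoff, best first.

    predictions maps a design name to its predicted structure as PDB text.
    """
    scored = [(name, mean_plddt(pdb)) for name, pdb in predictions.items()]
    kept = [(name, score) for name, score in scored if score >= min_plddt]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

A confidence filter alone does not confirm that the active site is preserved, so a structural comparison against the wild-type prediction would still be needed before selecting candidates for synthesis.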