Week 4 HW: Protein design part 1

Part A. Conceptual Questions

Why do humans eat beef but do not become a cow, or eat fish but do not become fish?

This is because the genetic code acts as an algorithm that dictates how proteins are assembled specifically for each organism. When humans consume animal proteins, these are broken down into amino acids; subsequently, the body uses its own transcription and translation machinery to reorganize those amino acids according to its own DNA instructions, creating human-specific proteins rather than those of the animal consumed.

Why are there only 20 natural amino acids?

The human body relies on 20 types of natural amino acids as fundamental building blocks. While the number of types is limited, their combinatorial potential is vast: a chain of 40 amino acids has 20^40 (about 10^52) possible sequences, and a chain of only about 62 amino acids already has more possible sequences than the roughly 10^80 atoms estimated in the observable universe.
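The combinatorial claim can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the commonly quoted order-of-magnitude estimate of 10^80 atoms in the observable universe (a value not given in the text) and works in log10 to avoid huge integers.

```python
import math

N_AMINO_ACIDS = 20
# Assumption: order-of-magnitude estimate for atoms in the observable universe.
LOG10_ATOMS_IN_UNIVERSE = 80


def log10_sequences(length: int) -> float:
    """Return log10 of the number of distinct sequences of a given length."""
    return length * math.log10(N_AMINO_ACIDS)


def min_length_exceeding(log10_target: float) -> int:
    """Smallest chain length whose sequence count exceeds 10**log10_target."""
    return math.floor(log10_target / math.log10(N_AMINO_ACIDS)) + 1


if __name__ == "__main__":
    print(f"20^40 is about 10^{log10_sequences(40):.1f}")
    n = min_length_exceeding(LOG10_ATOMS_IN_UNIVERSE)
    print(f"Shortest chain exceeding 10^80 sequences: {n} residues")
```

Under that assumption, a 40-residue chain falls well short of the atom count, and the crossover happens at a chain length in the low sixties.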

Can you make other non-natural amino acids? Design some new amino acids.

Yes, modern engineering allows for de novo design using non-natural materials to build biological systems from scratch. Using genome language models and AI tools, scientists can now explore “evolutionary spaces” not found in nature, designing building blocks with custom chemical properties for applications like drug delivery or new biomaterials.

Can you discover additional helices in proteins?

Yes, through what is called “helical arithmetic,” various complex structural organizations have been identified. In addition to simple helices, researchers have discovered two-stranded coiled coils, coiled-coil trimers (in laminin), four-helical bundles (forming hydrophobic cores), five-helical bundles (in talin), six-helical bundles (in the coronavirus spike protein), and seven-helical bundles (in GPCRs).

Why do β-sheets tend to aggregate?

β-sheets have a natural tendency to aggregate because the strands at the edge of a sheet expose backbone hydrogen-bond donors and acceptors, which can pair with strands from other molecules so that the sheet keeps extending. The resulting hydrogen-bond network makes these structures extremely rigid and resistant to tension, allowing for the formation of strong, extensive surfaces like silk fibers or the robust sheets found in various biological structures.

What is the driving force for β-sheet aggregation?

The primary driving forces are noncovalent interactions, chiefly the hydrogen bonds that form between the backbones of neighboring strands, with hydrophobic side-chain packing contributing as well. These interactions determine the folding pattern and the stability of the secondary structure by allowing amino acid chains to adhere to each other in an orderly fashion.

Can you use amyloid β-sheets as materials?

Yes, proteins are considered extremely versatile biomaterials that can be designed with specific mechanical properties. Since β-sheets provide exceptional strength (a hydrogen-bonded sheet architecture also found in synthetic fibers like Kevlar), they can be computationally designed and synthesized in a lab to create new functional biological materials, ranging from flexible to rigid.

Design a β-sheet motif that forms a well-ordered structure.

An effective design is the LS (Leucine-Serine) motif, identified as a critical and highly conserved component in the lysis proteins of various phages. A well-ordered design based on this motif would include: (1) a positively charged N-terminus, (2) a hydrophobic sequence rich in aromatic and beta-branched aliphatic residues, (3) the essential LS dipeptide for interaction, and (4) a phage-specific C-terminal domain.

Why are most molecular helices right-handed?

Proteins are built from L-amino acids, and for L-amino acids the right-handed α-helix avoids the steric clashes between side chains and the backbone that a left-handed helix would create, so nature predominantly favors right-handed helices. Studying chirality is nevertheless essential to understanding the biological exceptions. For example, the work of Shuguang Zhang is recognized for contributing to the deciphering of left-handed RNA structures, challenging biological norms and expanding our understanding of the structural possibilities of macromolecules.

Part B: Protein Analysis and Visualization

I chose myoglobin as the protein.

I used BindCraft for the 3D structure:

https://docs.google.com/document/d/1W5RZQsCgAJ3Mi5nxPt_BVPvfOiEKKsgkBNN4VEqUDqE/edit?usp=sharing

Myoglobin is a classic all-helix protein. It is a small, widely distributed heme protein of about 150 amino acids, present in cardiac and skeletal muscle. It contains one heme group per molecule and has a structure similar to that of a hemoglobin monomer. Under destabilizing conditions, where it adopts a partially folded intermediate conformation, it can form amyloid fibrils.

Myoglobin is also an iron-containing protein with a molecular mass of about 17 kDa. It resembles hemoglobin, but it binds one oxygen molecule instead of four.
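As a rough sanity check on the mass figure, the residue count can be converted to an approximate mass. This sketch assumes the common rule of thumb of about 110 Da per amino acid residue and a heme b mass of about 616.5 Da; neither value comes from the text above, and the result is only an estimate.

```python
# Assumption: average mass of one amino acid residue in a protein (rule of thumb).
AVG_RESIDUE_MASS_DA = 110.0
# Assumption: approximate mass of the heme b cofactor.
HEME_B_MASS_DA = 616.5


def estimate_mass_kda(n_residues: int, with_heme: bool = True) -> float:
    """Estimate protein mass in kDa from its residue count."""
    mass_da = n_residues * AVG_RESIDUE_MASS_DA
    if with_heme:
        mass_da += HEME_B_MASS_DA
    return mass_da / 1000.0


if __name__ == "__main__":
    print(f"Apoprotein (150 aa): {estimate_mass_kda(150, with_heme=False):.1f} kDa")
    print(f"With heme:           {estimate_mass_kda(150):.1f} kDa")
```

For a chain of about 150 residues this gives roughly 16.5 kDa for the polypeptide and about 17.1 kDa with the heme group included.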

The most abundant amino acid in myoglobin is leucine, with alanine, glutamate, and lysine present in smaller quantities.
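A composition statement like this can be reproduced from any sequence by counting residues. The sketch below uses a short made-up peptide as input (`TOY_SEQUENCE` is a placeholder, not the myoglobin sequence); the real sequence from the chosen PDB or UniProt entry would be pasted in its place.

```python
from collections import Counter

# Placeholder peptide for illustration only; NOT the myoglobin sequence.
TOY_SEQUENCE = "LLKALEELLK"


def composition(sequence: str) -> list[tuple[str, float]]:
    """Return (residue, fraction) pairs sorted from most to least frequent."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return [(aa, n / len(seq)) for aa, n in counts.most_common()]


if __name__ == "__main__":
    for aa, fraction in composition(TOY_SEQUENCE):
        print(f"{aa}: {fraction:.0%}")
```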

Globins are a family of small globular metalloproteins of about 150 amino acids in length, characterized by 8 α-helical segments (A-H) that adopt a characteristic 3-over-3 α-helical sandwich known as the globin fold.

The globin family includes eight different types of globins, but the best described and characterized in vertebrates are hemoglobin (Hb), myoglobin (Mb), cytoglobin (Cygb), and neuroglobin (Ngb).

In RCSB: https://doi.org/10.2210/pdb101M/pdb

Resolution: 2.07 Å

The solved structure contains three unique ligands: HEM, which is iron-containing Protoporphyrin IX; SO4, the sulfate ion; and NBN, or N-butyl isocyanide. Regarding its structural classification, according to RCSB, the protein belongs to the Oxygen Transport family.

Additionally, myoglobin is a member of the globin family, which is characterized by a specific α-helical fold that enables binding of the heme group.

Deep Mutational Scans

I used Gemini (AI) with the following command: Use ESM2 to generate an unsupervised deep mutational scan of myoglobin based on language model likelihoods. Can you explain any particular pattern? (choose a residue and a mutation that stands out)

AI result:

What is the identified pattern for the most significant mutation?

The most significant log-likelihood ratio (LLR) was found for the mutation of Methionine (M) to Tryptophan (W) at position 1 (0-indexed position 0) of the human myoglobin sequence, with an LLR of -16.6641. This highly negative LLR suggests that the ESM2 language model predicts this substitution to be highly unfavorable or unlikely, possibly due to Tryptophan’s bulky nature disrupting local packing or stability at the N-terminus.

What are the summary findings from the myoglobin mutation scan?

The unsupervised deep mutational scan provides a landscape of predicted mutational effects across the human myoglobin sequence. The heatmap visually represents LLRs for all possible single amino acid substitutions, where largely negative LLRs indicate positions strongly favoring the wild-type residue, and LLRs near zero or positive suggest higher mutational tolerance. Critical regions for function, such as the heme-binding pocket, are expected to show strong conservation (highly negative LLRs for substitutions).

How could the language model’s predictions be compared with experimental data?

The language model’s predictions (LLR values) can be compared with experimental deep mutational scanning (DMS) datasets using several strategies:

Correlation Analysis: Calculate Pearson or Spearman correlation between LLRs and experimental fitness scores.
ROC/Precision-Recall Curves: Evaluate the model’s ability to classify deleterious or beneficial mutations based on LLRs.
Specific Hotspot Validation: Directly compare predicted LLRs for critical mutations (e.g., M1W) with their known experimental effects.
Structural Context: Analyze if predicted deleterious mutations align with residues important for 3D structure, heme coordination, or core packing.

a) Myoglobin: Contains 8 α-helices (A–H)

Part C. Using ML-Based Protein Design Tools

Link: https://docs.google.com/document/d/1W5RZQsCgAJ3Mi5nxPt_BVPvfOiEKKsgkBNN4VEqUDqE/edit?usp=sharing

Part D. Group Brainstorm on Bacteriophage Engineering

Individual proposal:

Project goals

Increased Toxicity (Hard): Optimize the lytic kinetics to trigger faster host cell lysis by bypassing the DnaJ-dependent “damping” mechanism.
Increased Stability (Easy): Redesign the N-terminal and transmembrane domains to prevent proteolytic degradation and ensure robust protein accumulation.

1. Generative Sequence Design (Tool: Evo 2)
Approach: Utilize the Evo 2 genome language model to generate a library of novel MS2 L variants. We will specifically prompt the model to design Lodj-like variants (L overcomes DnaJ) by truncating or modifying the N-terminal Domain 1, which normally slows down lysis through its interaction with host DnaJ.
Reasoning: Evo 2 has demonstrated the ability to navigate novel evolutionary spaces and generate viable phages with faster lysis kinetics than natural templates. This allows us to access sequence diversity beyond the 67 unique mutations identified in natural screens.

2. Sequence Stability Optimization (Tool: ProteinMPNN)
Approach: Use ProteinMPNN to perform inverse folding on the core transmembrane domain (TMD) of the generated candidates.
Reasoning: Many missense mutations in Domain 1 and the TMD lead to accumulation defects due to instability. ProteinMPNN can redesign sequences to fit the specific 3D backbone required for membrane insertion while optimizing for thermodynamic stability.

3. Functional Motif Tuning (Tool: ESM-2 / ESM-3)
Approach: Use the ESM-2/3 protein language models to extract embeddings and perform in silico mutagenesis around the essential Leu48-Ser49 (LS) motif.
Reasoning: The LS motif is the core of the essential protein-protein interaction domain. ESM models can identify which amino acid substitutions in the surrounding Domain 2 and Domain 4 preserve the critical hydrophobic and polar character necessary for function while maximizing toxic effect.

4. Oligomerization Verification (Tool: AlphaFold-Multimer)
Approach: Use AlphaFold-Multimer to predict the ability of designed variants to assemble into high-order oligomeric complexes (decamers or higher).
Reasoning: MS2 L must form large membrane-disrupting clusters (at least 10 monomers) to cause cytoplasmic leakage. AlphaFold-Multimer can validate whether the designed mutations at the TMD interface promote or hinder this essential assembly.
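The in silico mutagenesis in step 3 can be sketched as an enumeration of single substitutions in a chosen window while the LS dipeptide is held fixed. This is a hypothetical illustration, not part of any tool named above: `single_mutants`, the window, and `TOY_SEQUENCE` are assumptions, and the toy string is not the MS2 L sequence (in the real protein the protected positions would be 48 and 49).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder sequence for illustration only; NOT the MS2 L protein.
# In this toy string the LS dipeptide sits at positions 4-5 (1-indexed).
TOY_SEQUENCE = "AAALSAAA"


def single_mutants(sequence, region, protected):
    """Yield (label, variant) for every single substitution in `region`.

    `region` is a 1-indexed inclusive (start, end) window; positions listed
    in `protected` (for example the LS motif) are never mutated.
    """
    start, end = region
    for pos in range(start, end + 1):
        if pos in protected:
            continue
        wt = sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa != wt:
                variant = sequence[:pos - 1] + aa + sequence[pos:]
                yield f"{wt}{pos}{aa}", variant


toy_variants = list(single_mutants(TOY_SEQUENCE, region=(2, 7), protected={4, 5}))

if __name__ == "__main__":
    print(len(toy_variants), "variants; first three:", toy_variants[:3])
```

Each variant could then be scored with a language model and ranked, with the protected set guaranteeing that the essential motif is never altered.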

II. Potential Pitfalls

The “Unknown Target” Problem: While we know MS2 L interacts with DnaJ, its definitive membrane-embedded host target is still unknown. Without a clear target structure, using tools like BindCraft or AlphaFold-Multimer to optimize heterotypic protein-protein interactions is speculative and relies on the assumption that L forms homomeric pores.

Toxicity vs. Titer Trade-off: Increasing toxicity (faster lysis) often leads to lower titers. If the L protein triggers lysis too early (e.g., 20 minutes earlier in Lodj mutants), the phage may not have enough time to assemble progeny virions, potentially making the therapy less effective overall.

III. Pipeline Schematic

[Input] Wild-type MS2 L Sequence (75 aa)
↓
[Step 1: Evo 2] Generate novel Lodj-style truncations/diversified C-terminal domains.
↓
[Step 2: ProteinMPNN] Redesign TMD for stability and robust membrane accumulation.
↓
[Step 3: ESM-2/3] Perform site-specific optimization of the LS motif and surrounding domains.
↓
[Step 4: AF-Multimer] Verify the variant’s ability to form high-order oligomeric clusters in the membrane.
↓
[Output] Top 5 Candidates for experimental synthesis and rebooting in E. coli C.

Group Proposal:

Link: https://docs.google.com/document/d/1xj6lRZtoHY3STFA88BtKhXvekjcOXN83K8WxOr10xyI/edit?usp=sharing