Week 4 HW: Protein Design I

Part A. Conceptual Questions

I skipped 4 and 11.

1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

Based on Claude Sonnet 4.6 using the prompt: “How many molecules of amino acids are in 500 grams of meat? Assume the average amino acid has a molecular weight of 100 Daltons”

The approach is dimensional analysis: go from grams of meat to grams of protein, then to moles of amino acids, then to a number of molecules via Avogadro’s number. The assumptions are:

  • Meat is roughly 20% protein by weight (the rest is mostly water and fat).
  • Average amino acid mass = 100 Daltons = 100 g/mol.
  • Avogadro’s number = 6.022 × 10²³ molecules/mol.

Therefore:

  1. Mass of protein in 500 g of meat: 500 g × 0.20 = 100 g of protein
  2. Moles of amino acids: 100 g / 100 g/mol = 1 mol of amino acids
  3. Number of molecules: 1 mol × 6.022 × 10²³ molecules/mol ≈ 6 × 10²³ molecules

So a 500 g piece of meat contains roughly 6 × 10²³ amino acid molecules, which is almost exactly one mole.
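
The same estimate can be written as a short Python sketch. The function name and parameter names are illustrative; the 20% protein fraction and the 100 g/mol average mass are the assumptions stated in this answer, not measured values.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_molecules(meat_g, protein_fraction=0.20, aa_mass_g_per_mol=100.0):
    """Estimate the number of amino acid molecules in a piece of meat.

    protein_fraction and aa_mass_g_per_mol are rough assumptions.
    """
    protein_g = meat_g * protein_fraction          # grams of protein
    moles = protein_g / aa_mass_g_per_mol          # moles of amino acids
    return moles * AVOGADRO                        # number of molecules


if __name__ == "__main__":
    print(f"{amino_acid_molecules(500):.3e} molecules")
```

Changing `protein_fraction` (for example to 0.25 for a leaner cut) scales the result linearly, so the answer stays on the order of 10²³ to 10²⁴ molecules.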

2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

When humans eat meat or fish, digestive enzymes break the proteins down into individual amino acids and short peptides. These are absorbed into the bloodstream as simple building blocks; essentially no cow or fish protein enters intact. The body then uses these building blocks to make human proteins, following instructions from human DNA. The amino acids themselves are identical across species, so a leucine or alanine from beef is chemically the same as one made in a human cell; only the blueprint differs. It’s like dismantling a LEGO cow and rebuilding the bricks using a human instruction manual.

3. Why are there only 20 natural amino acids?

The existence of exactly 20 natural amino acids is best understood through prebiotic availability, chemical sufficiency, genetic code structure, and evolutionary history.

  • Prebiotic availability: The 20 standard amino acids likely reflect, in part, what was chemically available before life began. The landmark Miller-Urey experiment in 1953 demonstrated that many of the standard amino acids form spontaneously under simulated early Earth conditions. The amino acids found in the Murchison meteorite further show that these molecules arise readily from abiotic chemistry.
  • Chemical sufficiency: The 20 amino acids collectively cover the chemical diversity needed to build virtually all protein functions, providing hydrophobic residues for protein cores (leucine, valine, isoleucine), charged residues for surfaces (lysine, arginine, aspartate, glutamate), catalytic residues for enzyme active sites (histidine, cysteine, serine), and structural residues (glycine for flexibility, proline for rigidity). Additional amino acids would arguably add complexity without meaningfully expanding functional capability.
  • Structure of the genetic code: DNA encodes amino acids using three-base codons, giving 4³ = 64 possible combinations, enough to encode 20 amino acids plus stop signals, with significant redundancy. This redundancy buffers against point mutations. Freeland & Hurst showed that the natural genetic code is extraordinarily efficient at minimizing the effects of errors, outperforming all but 1 in a million randomly generated alternative codes. This suggests there was little evolutionary pressure to expand the code further.
  • Evolutionary freezing: The genetic code is nearly universal across all life, suggesting that the set of 20 amino acids was locked in over 3.5 billion years ago. This is Francis Crick’s “Frozen Accident” hypothesis: the code became too deeply embedded in biological machinery to change. Any mutation that reassigned a codon to a new amino acid would alter many proteins at once in that organism, which would almost always be harmful.
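
The codon arithmetic in the genetic-code point above can be checked with a minimal Python sketch. It enumerates every three-base codon and, assuming the three stop codons of the standard code (UAA, UAG, UGA), computes the average redundancy per amino acid. The variable names are illustrative.

```python
from itertools import product

BASES = "ACGU"  # RNA alphabet
STOP_CODONS = {"UAA", "UAG", "UGA"}  # stop codons of the standard genetic code

# Every possible triplet: 4 choices at each of 3 positions.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

n_codons = len(codons)                      # 4 ** 3 = 64
n_sense = n_codons - len(STOP_CODONS)       # codons that encode an amino acid
codons_per_amino_acid = n_sense / 20        # average redundancy

if __name__ == "__main__":
    print(n_codons, n_sense, codons_per_amino_acid)
```

With 61 sense codons for 20 amino acids, each amino acid is encoded by about three codons on average, which is the redundancy that buffers against point mutations.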

References:

Doig, A. J. (2017). Frozen, but no accident - why the 20 standard amino acids were selected. The FEBS journal, 284(9), 1296–1305. https://doi.org/10.1111/febs.13982

Freeland, S. J., & Hurst, L. D. (1998). The genetic code is one in a million. Journal of molecular evolution, 47(3), 238–248. https://doi.org/10.1007/pl00006381

Kirschning, A. (2022). On the Evolutionary History of the Twenty Encoded Amino Acids. Chemistry (Weinheim an der Bergstrasse, Germany), 28(55), e202201419. https://doi.org/10.1002/chem.202201419

Lawless, J. G. (1973). Amino acids in the Murchison meteorite. Geochimica et Cosmochimica Acta, 37(9), 2207–2212. https://doi.org/10.1016/0016-7037(73)90017-3

4. Can you make other non-natural amino acids? Design some new amino acids.

H₂N — CH — COOH
         |
         R (side chain)

NH₂ → amino group (contains nitrogen)

COOH → carboxyl group (contains carbon and oxygen)

R → side chain is what makes each amino acid unique

I would use ChemDraw to draw the amino acid backbone, explore alternative side chains, and copy the SMILES string for each design.
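
As a rough illustration of how a SMILES string encodes the backbone-plus-side-chain picture above, here is a minimal Python sketch that slots a side-chain fragment into a fixed backbone template. The template omits stereochemistry, the function name is hypothetical, and the example side chains (norleucine and cyclopropylalanine, both known non-canonical amino acids) are illustrations only, not the designs asked for in the question.

```python
# Backbone H2N-CH(R)-COOH written as SMILES, without stereochemistry.
BACKBONE_TEMPLATE = "NC({side_chain})C(=O)O"


def amino_acid_smiles(side_chain):
    """Return a SMILES string for an amino acid with the given side chain.

    side_chain is a SMILES fragment whose first atom bonds to the alpha carbon.
    Glycine (R = H) is a special case that this template does not cover.
    """
    if not side_chain:
        raise ValueError("side_chain must be a non-empty SMILES fragment")
    return BACKBONE_TEMPLATE.format(side_chain=side_chain)


# Illustrative side chains only.
EXAMPLES = {
    "alanine (natural reference)": "C",
    "norleucine": "CCCC",
    "cyclopropylalanine": "CC1CC1",
}

if __name__ == "__main__":
    for name, fragment in EXAMPLES.items():
        print(f"{name}: {amino_acid_smiles(fragment)}")
```

The resulting strings can be pasted into ChemDraw or another structure editor to check the drawing against the intended design.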

5. Where did amino acids come from before enzymes that make them, and before life started?

Before life existed, amino acids arose spontaneously from simple chemistry through three main pathways.

  • Atmospheric Chemistry (Miller-Urey): The simplest explanation is that amino acids formed directly from gases present in early Earth’s atmosphere. The Miller-Urey experiment simulated early Earth conditions using methane, ammonia, hydrogen, and water, with electrical sparks mimicking lightning, and produced amino acids. This demonstrated that no biology is needed, just simple molecules and energy. The key reaction pathway is the Strecker synthesis: an aldehyde, ammonia, and HCN react to form an aminonitrile, which hydrolyzes to a simple amino acid. All three reactants are thought to have been available on early Earth.
  • Hydrothermal Vents: Deep-sea hydrothermal vents provide another plausible source. Together with meteorite delivery, they are one of the two other major proposed sources of prebiotic amino acids beyond atmospheric chemistry. The heat, pressure, and chemical gradients at vents can drive spontaneous synthesis of organic molecules, including amino acids.
  • Extraterrestrial Delivery: Amino acids did not only form on Earth; some also arrived from space. The Murchison meteorite, which fell in Australia in 1969, was found to contain extraterrestrial amino acids and hydrocarbons. This shows that amino acids can form abiotically beyond Earth and suggests that meteorites contributed to early Earth’s supply.

References

Lawless, J. G. (1973). Amino acids in the Murchison meteorite. Geochimica et Cosmochimica Acta, 37(9), 2207–2212. https://doi.org/10.1016/0016-7037(73)90017-3

Miller, S. L. (1953). A production of amino acids under possible primitive earth conditions. Science (New York, N.Y.), 117(3046), 528–529. https://doi.org/10.1126/science.117.3046.528

Zhang, X., Tian, G., Gao, J., et al. (2017). Prebiotic Synthesis of Glycine from Ethanolamine in Simulated Archean Alkaline Hydrothermal Vents. Origins of Life and Evolution of Biospheres, 47, 413–425. https://doi.org/10.1007/s11084-016-9520-3

6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

A D-amino acid α-helix would be left-handed.

Natural proteins use L-amino acids, which form right-handed α-helices. D-amino acids are exact mirror images of L-amino acids, so they twist in the opposite direction, producing a left-handed helix.

This follows from the Ramachandran plot. For D-amino acids the allowed regions are mirrored, (φ, ψ) → (−φ, −ψ), so the backbone angles of a right-handed helix incur steric clashes and the mirrored, left-handed conformation is favored.
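
The mirror relationship can be sketched in a few lines of Python. The angles used for the right-handed α-helix, roughly (−57°, −47°), are typical textbook values and are an assumption here; the function name is illustrative.

```python
# Approximate textbook backbone dihedrals for a right-handed alpha-helix (degrees).
ALPHA_RIGHT_HANDED = (-57.0, -47.0)  # (phi, psi), assumed typical values


def mirror_dihedrals(phi, psi):
    """Return the dihedral angles of the mirror-image conformation.

    Reflecting a molecule inverts the sign of every dihedral angle, so an
    L-residue at (phi, psi) corresponds to a D-residue at (-phi, -psi).
    """
    return (-phi, -psi)


if __name__ == "__main__":
    phi_d, psi_d = mirror_dihedrals(*ALPHA_RIGHT_HANDED)
    print(f"D-amino acid helix: phi = {phi_d}, psi = {psi_d} (left-handed region)")
```

The mirrored angles fall in the left-handed helical region of the Ramachandran plot, which is sparsely populated for L-amino acids but favorable for D-amino acids.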

7. Can you discover additional helices in proteins?

Yes — beyond the familiar α-helix, several other helices exist in proteins, and new ones can still be discovered.

Known helices: The type of helix is determined by its hydrogen bonding pattern — specifically how many residues apart the donor and acceptor atoms are:

Helix | H-bond | Residues/turn
3₁₀ | n → n+3 | 3.0
α | n → n+4 | 3.6
π | n → n+5 | 4.4
PPII | none | 3.0

The α-helix dominates because it sits in a stability “sweet spot” for L-amino acids. The 3₁₀ helix is the second most common, often appearing at the ends of α-helices. The π-helix was long considered rare, but π-helices are found in about 15% of known protein structures and are believed to be an evolutionary adaptation derived from the insertion of a single amino acid into an α-helix. The PPII helix is unusual: it is a well-defined, regular, left-handed structure with no main-chain hydrogen bonds, stabilized instead primarily by interactions with water, and it is functionally important despite being largely unclassified in standard databases.
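
To make the helix parameters listed in this answer easier to compare, the following Python sketch stores them in a dictionary and derives the rotation per residue (360° divided by residues per turn). The dictionary keys and function names are hypothetical labels; the numbers are the ones listed in this answer.

```python
# Helix parameters from this answer; hbond_offset k means an n -> n+k hydrogen bond.
HELICES = {
    "3_10": {"hbond_offset": 3, "residues_per_turn": 3.0},
    "alpha": {"hbond_offset": 4, "residues_per_turn": 3.6},
    "pi": {"hbond_offset": 5, "residues_per_turn": 4.4},
    "PPII": {"hbond_offset": None, "residues_per_turn": 3.0},  # no main-chain H-bonds
}


def rotation_per_residue(name):
    """Degrees of rotation about the helix axis contributed by each residue."""
    return 360.0 / HELICES[name]["residues_per_turn"]


def helix_for_offset(offset):
    """Return the helix type with an n -> n+offset hydrogen bond, or None."""
    if offset is None:
        return None
    for name, props in HELICES.items():
        if props["hbond_offset"] == offset:
            return name
    return None


if __name__ == "__main__":
    for name in HELICES:
        print(f"{name}: {rotation_per_residue(name):.1f} degrees per residue")
```

A candidate "new" helix found in a structure could be compared against these reference values: an unfamiliar hydrogen-bond offset or rotation per residue would be the first hint that it does not fit a known class.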

Can we discover new helices? Yes, through three approaches:

  • Mining existing structural databases
  • Non-natural amino acids
  • Extremophile proteins

8. Why are most molecular helices right-handed?

Most molecular helices are right-handed because of the chirality of life’s building blocks: L-amino acids in proteins and D-sugars in DNA. L-amino acids naturally adopt backbone angles (φ, ψ) that favor right-handed conformations on the Ramachandran plot. The right-handed α-helix is estimated to be more stable by about 1 kcal/mol per residue, largely because the side chains clash less with the backbone than they would in the left-handed form.

But why amino acids are L in the first place remains an open question. Leading hypotheses include:

  • Circularly polarized light from neutron stars preferentially destroying D-amino acids in space
  • Weak nuclear force creating a slight energy bias toward L-amino acids
  • Meteoritic amino acids showing an excess of left-handed chirality, possibly due to exposure to polarized light in space

Once life committed to L-amino acids, right-handed helices became inevitable. Strong cooperativity effects mean that L-amino acids favor right-handed helices, while D-sugars in DNA also favor right-handed conformations. Chirality is self-reinforcing across all biological molecules.

9. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

10. Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

11. Design a β-sheet motif that forms a well-ordered structure.


Part B: Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

  1. Briefly describe the protein you selected and why you selected it.
  2. Identify the amino acid sequence of your protein.
      • How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
      • How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
      • Does your protein belong to any protein family?
  3. Identify the structure page of your protein in RCSB.
      • When was the structure solved? Is it a good quality structure? A good quality structure is one with good resolution; the smaller the value, the better (e.g., Resolution: 2.70 Å).
      • Are there any other molecules in the solved structure apart from protein?
      • Does your protein belong to any structure classification family?
  4. Open the structure of your protein in any 3D molecule visualization software: PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands).
      • Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
      • Color the protein by secondary structure. Does it have more helices or sheets?
      • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
      • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?
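
For the amino acid frequency question, a minimal Python sketch using only the standard library might look like the following. It is not the course’s Colab notebook; the function name is illustrative and the short sequence is a toy placeholder, to be replaced with the real sequence from UniProt or RCSB.

```python
from collections import Counter


def amino_acid_frequencies(sequence):
    """Return (residue, count) pairs sorted from most to least frequent.

    Whitespace and line breaks (as in a pasted FASTA body) are ignored.
    """
    cleaned = "".join(sequence.split()).upper()
    return Counter(cleaned).most_common()


if __name__ == "__main__":
    toy_sequence = "MALLKLA"  # placeholder, not a real protein
    frequencies = amino_acid_frequencies(toy_sequence)
    print("Length:", sum(count for _, count in frequencies))
    print("Most frequent:", frequencies[0])
```

When pasting a FASTA record, drop the header line that starts with `>` before passing the sequence to the function.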


Part C. Using ML-Based Protein Design Tools

In this section, we will learn about the capabilities of modern protein AI models and test some of them on your chosen protein.

Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a Colab instance with GPU. Choose your favorite protein from the PDB.

We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

  1. Deep Mutational Scans: Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods. Can you explain any particular pattern? (Choose a residue and a mutation that stands out.) (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
  2. Latent Space Analysis: Use the provided sequence dataset to embed proteins in reduced dimensionality. Analyze the neighborhoods that form: do they group similar proteins? Place your protein in the resulting map and explain its position and similarity to its neighbors.
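
One common way to turn language model likelihoods into a deep mutational scan is to score each substitution as the log-probability of the mutant residue minus that of the wild-type residue at the same position. The Python sketch below shows only that scoring step. It assumes the per-position log-probabilities have already been obtained from a model such as ESM2 (that step is not shown and is not taken from the course notebook), and the function name and input layout are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutational_scan(sequence, log_probs):
    """Score every single-residue substitution of a sequence.

    log_probs[i][aa] is assumed to be the model's log-probability of residue
    aa at position i. The score of a mutation is log p(mutant) - log p(wild type);
    negative values mean the model considers the mutation less likely.
    """
    scores = {}
    for i, wild_type in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue
            label = f"{wild_type}{i + 1}{aa}"  # e.g. "M1A", positions are 1-based
            scores[label] = log_probs[i][aa] - log_probs[i][wild_type]
    return scores


if __name__ == "__main__":
    toy_sequence = "MK"  # placeholder, not a real protein
    # Toy log-probabilities: the wild-type residue is favored at each position.
    toy_log_probs = [
        {aa: (-1.0 if aa == wt else -4.0) for aa in AMINO_ACIDS}
        for wt in toy_sequence
    ]
    scan = mutational_scan(toy_sequence, toy_log_probs)
    print(len(scan), "mutations scored; M1A =", scan["M1A"])
```

Positions where almost every substitution scores strongly negative are candidates for conserved or structurally important residues, which is the kind of pattern the question asks about.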

C2. Protein Folding

  1. Folding a protein: Fold your protein with ESMFold. Do the predicted coordinates match your original structure? Try changing the sequence: first try some mutations, then large segments. Is your protein structure resilient to mutations?
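
One simple way to check whether predicted coordinates match the original structure, without first superimposing the two, is a distance-based RMSD (dRMSD): compare all pairwise Cα distances within each structure. The Python sketch below is a swapped-in illustration of that metric, not the method used in the course notebook. It assumes both inputs are equal-length lists of (x, y, z) Cα coordinates in Å for the same residues, and the function name is hypothetical.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Root-mean-square difference between the pairwise distances of two structures.

    Invariant to rotation and translation, so no superposition is needed.
    Note that it cannot distinguish a structure from its mirror image.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    pairs = list(combinations(range(len(coords_a)), 2))
    if not pairs:
        raise ValueError("need at least two atoms")
    squared_sum = 0.0
    for i, j in pairs:
        dist_a = math.dist(coords_a[i], coords_a[j])
        dist_b = math.dist(coords_b[i], coords_b[j])
        squared_sum += (dist_a - dist_b) ** 2
    return math.sqrt(squared_sum / len(pairs))


if __name__ == "__main__":
    # Toy coordinates: three atoms in a line, then the same atoms translated.
    original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    shifted = [(x + 1.0, y + 2.0, z) for x, y, z in original]
    print(f"dRMSD = {distance_rmsd(original, shifted):.3f}")
```

A value near zero means the two structures have the same internal geometry; values that grow after mutating the sequence suggest the predicted fold is changing.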

C3. Protein Generation

  1. Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one. Input this sequence into ESMFold and compare the predicted structure to your original.
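
A common summary when comparing a designed sequence with the original is sequence recovery: the fraction of positions where the two agree. The Python sketch below assumes the two sequences have the same length, as is the case for fixed-backbone design, and uses toy placeholder sequences. The function name is illustrative.

```python
def sequence_recovery(designed, native):
    """Fraction of positions at which the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def differing_positions(designed, native):
    """List (1-based position, native residue, designed residue) for each mismatch."""
    return [
        (i + 1, n, d)
        for i, (d, n) in enumerate(zip(designed, native))
        if d != n
    ]


if __name__ == "__main__":
    native_seq = "MKTAY"     # placeholder
    designed_seq = "MKSAY"   # placeholder
    print(f"Recovery: {sequence_recovery(designed_seq, native_seq):.0%}")
    print("Changes:", differing_positions(designed_seq, native_seq))
```

Listing the mismatched positions makes it easier to see whether the changes cluster on the surface or in the core once the designed sequence is folded again.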


Part D. Group Brainstorm on Bacteriophage Engineering