week 04 hw: protein design-part-I

Part A. Conceptual Questions

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

Answer any NINE of the following questions from Shuguang Zhang (i.e., you can select two to skip):

  1. How many amino acid molecules do you take in with a 500-gram piece of meat? (On average, an amino acid is ~100 Daltons.)
  2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?
  3. Why are there only 20 natural amino acids?
  4. Can you make other non-natural amino acids? Design some new amino acids.
  5. Where did amino acids come from before enzymes that make them, and before life started?
  6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
  7. Can you discover additional helices in proteins?
  8. Why are most molecular helices right-handed?
  9. Why do β-sheets tend to aggregate?
    • What is the driving force for β-sheet aggregation?
  10. Why do many amyloid diseases form β-sheets?
    • Can you use amyloid β-sheets as materials?
  11. Design a β-sheet motif that forms a well-ordered structure.

  1. Amino Acid Count in 500 g Meat: Meat is roughly 20% protein by mass. (Human Nutrition - Protein, Vitamins, Minerals | Britannica, n.d.)
    • 500 g meat × 0.20 = 100 g protein.
    • Using an average mass of 100 Da per amino acid (equivalent to 100 g/mol): 100 g ÷ 100 g/mol = 1 mol of amino acids.
    • 1 mol × 6.022 × 10^23 molecules/mol = 6.022 × 10^23 amino acid molecules.
  2. Why we don’t become cows: When we eat protein, our digestive system breaks it down into individual amino acids. Our cells then use the information in our own DNA to reassemble those amino acids into human proteins. The information encoded in the original amino acid sequence is destroyed, but the amino acids themselves are reused as building blocks.
  3. Why only 20 amino acids: In nature, the use of 20 amino acids is often explained as a “frozen accident” that originated in the early RNA World. This set worked well very early in Earth’s history and then became fixed. These 20 amino acids were good enough to build strong and functional proteins. Even though many other amino acids exist, this small group provides enough variety to perform many functions while remaining simple, stable, and efficient for cells to use. (Doig, 2017)
  4. Non-natural amino acids: Yes, scientists can make non-natural (unnatural) amino acids. They do this using chemical synthesis and genetic tools that allow new amino acids to be incorporated into proteins. These new amino acids can give proteins properties that natural amino acids do not provide. (Young & Schultz, 2010) For example, a new amino acid could be made by taking a standard amino acid, such as alanine, and adding a fluorine atom to its side chain. Such a fluorinated amino acid could make proteins more stable and more resistant to degradation, which is useful for drug design. (Adhikari et al., n.d.)
  5. Pre-life origins of amino acids: According to Gutiérrez-Preciado, Romero, and Peimbert (2010), before enzymes and living organisms existed, amino acids probably formed abiotically on early Earth. Energy from lightning, UV light, and volcanic heat helped simple gases react to make amino acids. Some amino acids were also delivered to Earth by meteorites and comets. Together, these processes created a “primordial soup” of basic organic molecules. (Amino Acids, Evolution | Learn Science at Scitable, n.d.)
  6. D-amino acid α-helix: In nature, L-amino acids form right-handed α-helices. If you used only D-amino acids, the stereochemistry would be mirrored, resulting in a left-handed α-helix. (Zotti et al., n.d.)
  7. Additional helices: Yes, helical structures other than the standard α-helix, such as 3₁₀- and π-helices, can be found in proteins. Studies show that these helices occur in many proteins, but they are often overlooked or mistaken for small distortions in α-helices. They are especially common in membrane proteins and are found in a significant number of known protein structures. (Vieira-Pires & Morais-Cabral, 2010)
  8. Why right-handed helices: Most molecular helices are right-handed because this shape is the most stable for the natural building blocks of life. Chains built from L-amino acids or D-sugars pack best into a right-handed twist, which allows strong hydrogen bonds and reduces steric crowding between atoms. Left-handed helices are usually less stable or harder to form. (Right-Handed Alpha-Helix - an Overview | ScienceDirect Topics, n.d.)
  9. β-sheet aggregation: β-sheets tend to aggregate because their edge strands have exposed hydrogen-bonding groups that readily pair with other β-strands. The main driving forces are hydrogen bonding between strands and the hydrophobic effect, which together make the stacked β-sheet structure very stable and allow fibrils to form. (Gsponer & Vendruscolo, 2006)
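
The arithmetic in answer 1 can be checked with a few lines of Python. This is only a back-of-envelope sketch: the 20% protein fraction and the 100 Da average residue mass are the rough figures quoted in the answer, and the function name `amino_acid_count` is made up for illustration.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_count(meat_mass_g, protein_fraction=0.20, residue_mass_g_per_mol=100.0):
    """Estimate moles and molecules of amino acids in a piece of meat.

    protein_fraction and residue_mass_g_per_mol are the rough
    assumptions used in the answer, not measured values.
    """
    protein_mass_g = meat_mass_g * protein_fraction
    moles = protein_mass_g / residue_mass_g_per_mol
    return moles, moles * AVOGADRO


moles, molecules = amino_acid_count(500.0)
print(f"{moles:.1f} mol of amino acids, about {molecules:.3e} molecules")
```

Changing `protein_fraction` (for example to 0.25 for a leaner cut) shows how sensitive the estimate is to that assumption; the order of magnitude stays at about 10^23 to 10^24.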
Part B. Protein Analysis and Visualization

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

In this part of the homework, you will use online resources and 3D visualization software to answer questions about proteins. Pick any protein of interest (from any organism) that has a 3D structure and answer the following questions:

  1. Briefly describe the protein you selected and why you selected it.
  2. Identify the amino acid sequence of your protein.
    • How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
    • How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
    • Does your protein belong to any protein family?
  3. Identify the structure page of your protein in RCSB
    • When was the structure solved? Is it a good-quality structure? A good-quality structure is one with good resolution; the smaller the value, the better (e.g., Resolution: 2.70 Å).
    • Are there any other molecules in the solved structure apart from protein?
    • Does your protein belong to any structure classification family?
  4. Open the structure of your protein in any 3D molecule visualization software:
    • PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
    • Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
    • Color the protein by secondary structure. Does it have more helices or sheets?
    • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
    • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?
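
For the sequence questions in item 2, the length, the most frequent amino acid, and a rough hydrophobic fraction can be computed with the Python standard library. The sketch below is not the course Colab notebook; the toy sequence is invented, and the set of residues treated as hydrophobic is one common grouping chosen here as an assumption (hydrophobicity scales differ).

```python
from collections import Counter

# Assumed grouping of hydrophobic residues; other scales classify differently.
HYDROPHOBIC = set("AVILMFW")


def amino_acid_frequencies(sequence):
    """Count each one-letter residue code in a protein sequence."""
    return Counter(sequence.upper())


def hydrophobic_fraction(sequence):
    """Fraction of residues that fall in the assumed HYDROPHOBIC set."""
    seq = sequence.upper()
    return sum(residue in HYDROPHOBIC for residue in seq) / len(seq)


# Hypothetical toy sequence, not a real protein; paste your own sequence here.
toy_sequence = "MKKLLLAG"

counts = amino_acid_frequencies(toy_sequence)
most_common_residue, most_common_count = counts.most_common(1)[0]

print(f"length: {len(toy_sequence)}")
print(f"most frequent: {most_common_residue} ({most_common_count} times)")
print(f"hydrophobic fraction: {hydrophobic_fraction(toy_sequence):.3f}")
```

The overall hydrophobic fraction says nothing about where those residues sit; the surface versus core distribution still has to be judged from the 3D view.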

Part C. Using ML-Based Protein Design Tools

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

In this section, we will learn about the capabilities of modern protein AI models and test some of them on your chosen protein.

  1. Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.
  2. Choose your favorite protein from the PDB.
  3. We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

  1. Deep Mutational Scans
    • a. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
    • b. Can you explain any particular pattern? (Choose a residue and a mutation that stands out.)
    • c. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
  2. Latent Space Analysis
    • a. Use the provided sequence dataset to embed proteins in reduced dimensionality.
    • b. Analyze the different neighborhoods that form: do they group similar proteins?
    • c. Place your protein in the resulting map and explain its position and similarity to its neighbors.
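
One common way to turn language model likelihoods into a mutational scan is to score each substitution as a log-likelihood ratio, log p(mutant) − log p(wild type), at the mutated position. The sketch below shows only that scoring step on made-up numbers. It does not call ESM2, and it is not necessarily how the course notebook computes its scan; `mutation_scores` and `toy_log_probs` are hypothetical helpers, and in practice the per-position log-probabilities would come from the model.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutation_scores(wild_type, log_probs):
    """Score every single substitution as log p(mut) - log p(wt).

    log_probs[i] maps each amino acid to a log-probability at position i.
    Negative scores mean the model prefers the wild-type residue.
    """
    scores = {}
    for i, wt in enumerate(wild_type):
        for mut in AMINO_ACIDS:
            if mut != wt:
                scores[(i + 1, wt, mut)] = log_probs[i][mut] - log_probs[i][wt]
    return scores


def toy_log_probs(favored, p_favored):
    """Made-up distribution: one favored residue, the rest share what is left."""
    p_other = (1.0 - p_favored) / (len(AMINO_ACIDS) - 1)
    return {aa: math.log(p_favored if aa == favored else p_other) for aa in AMINO_ACIDS}


# Toy two-residue "protein": position 1 is strongly constrained, position 2 is tolerant.
wild_type = "MK"
log_probs = [toy_log_probs("M", 0.90), toy_log_probs("K", 0.24)]

scores = mutation_scores(wild_type, log_probs)
print(f"M1A: {scores[(1, 'M', 'A')]:.2f}")
print(f"K2A: {scores[(2, 'K', 'A')]:.2f}")
```

Arranged as a positions × amino acids grid, such scores give the heatmap usually shown for a deep mutational scan; rows of uniformly negative values point to positions the model treats as conserved.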

C2. Protein Folding

  • Folding a protein
  1. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
  2. Try changing the sequence: first try a few point mutations, then replace large segments. Is your protein structure resilient to mutations?
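
To put a number on whether predicted coordinates match the original, one simple option is the distance RMSD (dRMSD), which compares all pairwise Cα distances and therefore needs no superposition. This is a deliberately simpler stand-in for superposed RMSD or TM-score, and it cannot distinguish a structure from its mirror image. The sketch assumes the Cα coordinates have already been extracted as equal-length lists of (x, y, z) tuples in the same residue order; the coordinates shown are invented.

```python
import math


def distance_rmsd(coords_a, coords_b):
    """Distance RMSD between two equal-length lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    n = len(coords_a)
    if n < 2:
        raise ValueError("need at least two residues")
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            total += (d_a - d_b) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


# Invented three-residue examples (units: Angstrom).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(1.0, 2.0, 3.0), (4.8, 2.0, 3.0), (8.6, 2.0, 3.0)]  # same shape, translated
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]     # chain bent by 90 degrees

print(f"reference vs shifted: {distance_rmsd(reference, shifted):.3f}")
print(f"reference vs bent:    {distance_rmsd(reference, bent):.3f}")
```

Running the same comparison on the wild-type prediction and on each mutated prediction gives a rough measure of how much a mutation moves the fold.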

C3. Protein Generation

  • Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN
  1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.
  2. Input this sequence into ESMFold and compare the predicted structure to your original.
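
When comparing a ProteinMPNN design with the original sequence, a useful first number is the sequence recovery, i.e. the fraction of positions where the designed residue equals the native one. The sketch below assumes both sequences have the same length, as they do when designing on a fixed backbone; the two sequences and the helper names are invented for illustration.

```python
def sequence_recovery(original, designed):
    """Fraction of positions where the designed residue matches the original."""
    if len(original) != len(designed):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(original, designed))
    return matches / len(original)


def list_substitutions(original, designed):
    """List differences in the usual notation, e.g. 'T3S' (1-based positions)."""
    return [
        f"{a}{i}{b}"
        for i, (a, b) in enumerate(zip(original, designed), start=1)
        if a != b
    ]


# Hypothetical toy sequences, not real ProteinMPNN output.
original = "MKTAYIAK"
designed = "MKSAYLAK"

print(f"recovery: {sequence_recovery(original, designed):.2f}")
print(f"substitutions: {list_substitutions(original, designed)}")
```

The substitution list makes it easy to check in the 3D view whether the changed positions lie on the surface or in the core.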

Part D. Group Brainstorm on Bacteriophage Engineering

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

  1. Find a group of ~3–4 students
  2. Read through the Phage Reading material listed under “Reading & Resources” below
  3. Review the Bacteriophage Final Project Goals for engineering the L Protein:
  • Increased stability (easiest)
  • Higher titers (medium)
  • Higher toxicity of lysis protein (hard)
  4. Brainstorm Session
  • Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).
  • Write a 1-page proposal (bullet points or short paragraphs) describing:
    • Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).
    • Why do you think those tools might help solve your chosen sub-problem?
    • Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).
    • Include a schematic of your pipeline.
  • This resource may be useful: HTGAA Protein Engineering Tools
  5. Each group member individually posts the plan on their HTGAA website
  • Include your group’s short plan for engineering a bacteriophage