Week 4 HW: Protein Design Part I

Homework: Protein Design I

Assignment


Objective:

Learn basic concepts: amino acid structure, 3D protein visualization, and the variety of ML-based design tools. Brainstorm as a group how to apply these tools to engineer a better bacteriophage (setting the stage for the final project).

HTGAA Protein Engineering Tools, HTGAA Protein Engineering Feedback


Part A. Conceptual Questions

Answer any of the following questions by Shuguang Zhang:
  1. How many amino acid molecules do you take in with a 500-gram piece of meat? (On average, an amino acid is ~100 Daltons.)
  2. Why do humans eat beef but not become cows, or eat fish but not become fish?
    • Although the saying goes that we are what we eat, our genomes disagree: they are selfish like that. Digestion breaks dietary proteins down into amino acids, which our cells reassemble into human proteins according to our own genes. However, if a human eats a cow with a prion disease, the line between them fades as each returns to the Earth.
  3. Why are there only 20 natural amino acids?
  4. Can you make other non-natural amino acids? Design some new amino acids.
  5. Where did amino acids come from before the enzymes that make them existed, and before life started?
  6. If you make an alpha-helix using D-amino acids, what handedness (right or left) would you expect?
    • The 20 standard amino acids are all L-amino acids (glycine excepted, since it is achiral), as are most protein building blocks of cells. Alpha-helices built from L-amino acids are right-handed, like B-DNA. A chain of D-amino acids is the mirror image, so by deduction its alpha-helix would be left-handed.
  7. Can you discover additional helices in proteins?
    • yes
  8. Why are most molecular helices right-handed?
    • Most life on Earth is evolutionarily rooted in right-handed B-DNA helices, possibly a consequence of its origin in saltwater oceans. That handedness was passed on to self-replicating cells built from macromolecules whose complementary shapes are dominated by weak non-covalent interactions (source).
  9. Why do beta-sheets tend to aggregate?
    • Possibly ionic bonding; need to confirm. Hydrogen bonding between the exposed backbones of edge strands is another candidate to check.
  10. What is the driving force for beta-sheet aggregation?
  11. Why do many amyloid diseases form beta-sheets?
  12. Can you use amyloid beta-sheets as materials?
  13. Design a beta-sheet motif that forms a well-ordered structure.
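
For question 1, the arithmetic is a mole calculation: mass divided by molar mass gives moles, and multiplying by Avogadro's number gives molecules. The sketch below is one way to run the numbers, not an official answer. The function name and the `protein_fraction` parameter are illustrative assumptions. With the default `protein_fraction=1.0` it treats the whole 500 g as amino acids, which is the simplest reading of the question. The 0.2 used in the second call is only a rough assumed protein content for meat.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def amino_acid_molecules(mass_g, protein_fraction=1.0, avg_mass_da=100.0):
    """Estimate the number of amino acid molecules in a sample.

    protein_fraction is an assumed share of the mass that is protein.
    avg_mass_da is the ~100 Da average stated in the question;
    a mass of 1 Da per molecule corresponds to about 1 g/mol.
    """
    moles = mass_g * protein_fraction / avg_mass_da
    return moles * AVOGADRO


if __name__ == "__main__":
    # Whole 500 g counted as amino acids (simplest reading).
    print(f"{amino_acid_molecules(500):.2e}")
    # Assumed ~20% protein by mass (rough illustrative figure).
    print(f"{amino_acid_molecules(500, protein_fraction=0.2):.2e}")
```

Under the simplest reading this comes to roughly 3 × 10^24 molecules (5 mol). Any more realistic protein fraction scales the result down proportionally.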

Part B: Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins.
Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions.
  1. Briefly describe the protein you selected and why you selected it.
  2. Identify the amino acid sequence of your protein.
  3. How long is it? What is the most frequent amino acid? You can use this notebook to count the most frequent amino acid: https://colab.research.google.com/drive/1vlAU_Y84lb04e4Nnaf1axU8nQA6_QBP1?usp=sharing
  4. How many protein sequence homologs are there for your protein? Hint: Use the BLASTp tool to search for homologs and ClustalOmega to align and visualize them. Tutorial Here
  5. Does your protein belong to any protein family?
  6. Identify the structure page of your protein in RCSB.
  7. When was the structure solved? Is it a good-quality structure? A good-quality structure is one with good resolution; the smaller the value, the better (e.g., Resolution: 2.70 Å).
  8. Are there any other molecules in the solved structure apart from protein?
  9. Does your protein belong to any structure classification family?
  10. Open the structure of your protein in any 3D molecule visualization software:
  • PyMOL Tutorial Here (hint: ChatGPT is good at PyMOL commands)
    • Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
    • Color the protein by secondary structure. Does it have more helices or sheets?
    • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
    • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?
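
For step 3 (sequence length and most frequent amino acid), a few lines of standard-library Python are enough if you would rather not open the linked notebook. This is a minimal sketch, not the notebook's code: the function name `sequence_summary` and the example fragment are made up for illustration. Paste in your own sequence in one-letter code, without the FASTA header line.

```python
from collections import Counter


def sequence_summary(sequence):
    """Return (length, most frequent residue, its count) for a protein sequence.

    Expects one-letter amino acid codes; whitespace and line breaks
    (as found in FASTA bodies) are removed before counting.
    """
    seq = "".join(sequence.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    residue, count = Counter(seq).most_common(1)[0]
    return len(seq), residue, count


if __name__ == "__main__":
    example = "MKKLLPTAAAGLLLLAAQ"  # made-up fragment, not a real protein
    length, residue, count = sequence_summary(example)
    print(f"length={length}, most frequent={residue} ({count}x)")
```

If two residues tie for the top count, `most_common` reports only one of them, so check the full `Counter` when the counts are close.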

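For the hydrophobic vs hydrophilic question, the viewer shows where residues sit in 3D; a quick composition count can complement that by telling you how much of the sequence is hydrophobic in the first place. The sketch below assumes one common grouping of hydrophobic residues (A, V, I, L, M, F, W, C). Classification schemes differ, especially for G, P, Y and C, so treat the set and the function name as illustrative assumptions.

```python
# Assumed grouping of hydrophobic residues; other schemes classify
# G, P, Y and C differently, so adjust this set to match your source.
HYDROPHOBIC = set("AVILMFWC")


def hydrophobic_fraction(sequence):
    """Return the fraction of residues that fall in the HYDROPHOBIC set."""
    seq = "".join(sequence.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    hydrophobic = sum(1 for residue in seq if residue in HYDROPHOBIC)
    return hydrophobic / len(seq)


if __name__ == "__main__":
    example = "MKKLLPTAAAGLLLLAAQ"  # made-up fragment, not a real protein
    print(f"hydrophobic fraction: {hydrophobic_fraction(example):.2f}")
```

This says nothing about spatial distribution: whether hydrophobic residues are buried in the core or exposed on the surface still has to be judged from the colored structure.
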
Part C. Using ML-Based Protein Design Tools

In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein. Copy the notebook below and set up a Colab instance with GPU for this section:
  • HTGAA_ProteinDesign2026.ipynb
Choose your favorite protein from the PDB. We will now try multiple things; report each of those results in your homework page:

Protein Language Models:
  1. Deep Mutational Scans
    • Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
    • Can you explain any particular pattern? (Choose a residue and a mutation that stands out.)
    • (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
  2. Latent Space Analysis
    • Use the provided sequence dataset to embed proteins in reduced dimensionality.
    • Analyze the different formed neighborhoods: do they approximate similar proteins?
    • Place your protein in the resulting map and explain its position and similarity to its neighbors.
  3. Attention Maps
    • Analyze the attention maps of ESM2. Investigate if its layers correlate to the 2D map of residue distances of your protein.

Protein Folding:
  1. Folding a protein
    • Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
    • Try changing the sequence: first try some mutations, then large segments. Is your protein structure resilient to mutations?

Protein Generation:
  1. Inverse-Folding a protein
    • Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN.
    • Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.
    • Input this sequence into ESMFold and compare the predicted structure to your original.

Part D. Group Brainstorm on Bacteriophage Engineering

  1. Find a group of ~3–4 students.
  2. Review the Bacteriophage Final Project Goals:
    • Increased stability (easiest)
    • Higher titers (medium)
    • Higher toxicity of lysis protein (hard)
  3. Brainstorm Session
    • Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).
    • Write a 1-page proposal (bullet points or short paragraphs) describing:
      • Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).
      • Why you think those tools might help solve your chosen sub-problem.
      • One or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).
    • Include a schematic of your pipeline.
    • This resource may be useful: HTGAA Protein Engineering Tools
  4. Individually put your plan on your website page.
    • Each group’s short plan for engineering a bacteriophage
  5. Schedule time (HTGAA Protein Engineering Feedback) to get feedback/discuss your ideas, and put the feedback on your website.

[Optional] Part E. Find a drug for an oncology target