Week 4 HW: Protein Design Part I
Part A. Conceptual Questions
Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)
- How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
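Beef is roughly 20-27% protein by mass; taking the lower bound:
500 g x 0.20 = 100 g of protein
1 Dalton = 1.66054e-24 g
1 amino acid ≈ 100 Daltons = 1.66054e-22 g
Number of amino acids = 100 g / 1.66054e-22 g ≈ 6.02e+23 molecules
So 500 g of meat contains about 6 x 10^23 amino acid molecules, roughly one mole. (Using 110 Daltons per residue instead gives about 5.5e+23.)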
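As a quick check of this estimate, here is a minimal sketch, assuming a 20% protein fraction and the 100 Dalton average residue mass given in the question. The function and constant names are illustrative, not part of the assignment.

```python
# Estimate how many amino acid molecules are in a piece of meat.
DALTON_IN_GRAMS = 1.66054e-24  # mass of 1 Dalton in grams


def count_amino_acids(meat_g, protein_fraction, aa_mass_da):
    """Return the approximate number of amino acid molecules."""
    protein_g = meat_g * protein_fraction          # grams of protein
    aa_mass_g = aa_mass_da * DALTON_IN_GRAMS       # grams per amino acid
    return protein_g / aa_mass_g


# Assumed inputs: 500 g of beef, 20% protein, ~100 Da per amino acid.
n_amino_acids = count_amino_acids(500.0, 0.20, 100.0)
print(f"{n_amino_acids:.3e} amino acid molecules")
```

Changing the protein fraction to 0.27 or the residue mass to 110 shows how sensitive the estimate is to those assumptions.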
- Why do humans eat beef but do not become a cow, eat fish but do not become fish?
Dietary proteins and nucleic acids are broken down during digestion into amino acids and nucleotides, so the cow's or fish's genetic instructions never reach our cells intact. Even when native mRNA is injected into the human body, it can trigger a series of immune responses against the foreign (heterologous) molecule and be degraded, much as the body resists viral invasion.
Foreign DNA would also need to be in a nucleus, wrapped in histones and other proteins, with access to polymerases and other enzymes, before it could be expressed.
The human immune system identifies material that does not match the body's own cells and attacks or rejects it.
Hou, X., Shi, J., & Xiao, Y. (2024). mRNA medicine: Recent progresses in chemical modification, design, and engineering. Nano research, 17(10), 9015–9030. https://doi.org/10.1007/s12274-024-6978-6
- Why are there only 20 natural amino acids?
The 20 standard amino acids are frozen evolutionary choices, solidified billions of years ago to balance protein functionality with metabolic efficiency.
Doig A. J. (2017). Frozen, but no accident - why the 20 standard amino acids were selected. The FEBS journal, 284(9), 1296–1305. https://doi.org/10.1111/febs.13982
Amino acids are specified by codons, which are triplets of bases. Nature has evolved to use only 20 partly because of the inherent redundancy of the genetic code: several codons encode the same amino acid.
Why are there only 20 amino acids? | MyTutor. (n.d.). MyTutor. https://www.mytutor.co.uk/answers/52609/Mentoring/Oxbridge-Preparation/Why-are-there-only-20-amino-acids/
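The redundancy of the genetic code can be made concrete by counting how many codons map to each amino acid. The sketch below builds the standard genetic code from its compact string form (bases ordered T, C, A, G); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4^3 = 64 codons to its amino acid.
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Count how many codons encode each amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in codon_table.values() if aa != "*")

print(len(codon_table), "codons")
print(len(degeneracy), "amino acids")
print(sum(degeneracy.values()), "sense codons")
for aa, n_codons in sorted(degeneracy.items(), key=lambda item: -item[1]):
    print(aa, n_codons)
```

Leucine, serine, and arginine each have six codons, while methionine and tryptophan have only one.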
- Can you make other non-natural amino acids? Design some new amino acids.
- Where did amino acids come from before enzymes that make them, and before life started?
At the origin of life on Earth, amino acids were synthesised chemically from a large mixture of organic compounds. After more complex forms of life arose, those able to synthesise their own amino acids survived, which is why organisms today synthesise amino acids through enzymes and metabolic pathways.
Gutiérrez-Preciado, A., Romero, H., & Peimbert, M. (2010). Amino Acids, Evolution | Learn Science at Scitable. Nature.com. https://www.nature.com/scitable/topicpage/an-evolutionary-perspective-on-amino-acids-14568445/
- If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
An α-helix built from D-amino acids will form a left-handed helix, whereas natural L-amino acids form right-handed helices. Reversing the chirality of every residue to the D-form produces the exact mirror image, causing the polypeptide chain to coil in the opposite, left-handed direction.
Alpha Helix - an overview | ScienceDirect Topics. (2018). Sciencedirect.com. https://www.sciencedirect.com/topics/medicine-and-dentistry/alpha-helix
- Can you discover additional helices in proteins?
- Why are most molecular helices right-handed?
Right-handed helices built from L-amino acids are energetically more stable: they sit at a lower energy because there are fewer steric clashes between the side chains and the main chain.
Alpha Helix - an overview | ScienceDirect Topics. (2018). Sciencedirect.com. https://www.sciencedirect.com/topics/medicine-and-dentistry/alpha-helix
- Why do β-sheets tend to aggregate?
Part B: Protein Analysis and Visualization
- Briefly describe the protein you selected and why you selected it.
Lysozyme (specifically C-type) is an enzyme that attacks the protective cell walls of bacteria by chewing through the peptidoglycan layer.
I chose this protein because it is an antibacterial defence enzyme found in tears and saliva.
- Identify the amino acid sequence of your protein.
Amino Acid Sequence: KVFGRCELAA AMKRHGLDNY RGYSLGNWVC AAKFESNFNT QATNRNTDGS TDYGILQINS RWWCNDGRTP GSRNLCNIPC SALLSSDITA VVNCAKKIVS DGNGMNAWVA WRNRCKGTDV QAWIRGCRL
- How long is it? What is the most frequent amino acid?
129 amino acids
- You can use this Colab notebook to count the frequency of amino acids.
Most Frequent Amino Acid: Asparagine (N), which appears 14 times in this sequence. Alanine (A) and Glycine (G) follow with 12 each, and Arginine (R) with 11.
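The length and residue frequencies can be checked without the Colab notebook. This is a minimal sketch using only the Python standard library on the sequence exactly as listed above; it is not the course notebook's code.

```python
from collections import Counter

# Sequence copied from the answer above; spaces between blocks are removed.
raw_sequence = (
    "KVFGRCELAA AMKRHGLDNY RGYSLGNWVC AAKFESNFNT QATNRNTDGS TDYGILQINS "
    "RWWCNDGRTP GSRNLCNIPC SALLSSDITA VVNCAKKIVS DGNGMNAWVA WRNRCKGTDV "
    "QAWIRGCRL"
)
sequence = raw_sequence.replace(" ", "")

# Count each one-letter amino acid code.
counts = Counter(sequence)

print("Length:", len(sequence))
for residue, count in counts.most_common(5):
    print(residue, count)
```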
- How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
- Does your protein belong to any protein family?
Lysozyme is ubiquitous across the animal kingdom (found in birds, mammals, and even some insects). It belongs to the Glycosyl hydrolase 22 family. These are enzymes that specifically hydrolyze the glycosidic bonds in complex sugars.
- Identify the structure page of your protein in RCSB
The classic high-resolution structure for this protein is found under PDB ID: 193L.
- When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)
The structure for 193L was published in 1995, though lysozyme was famously the first enzyme ever to have its structure solved via X-ray crystallography back in 1965.
It is an excellent quality structure. The resolution is 1.59 Å.
- Open the structure of your protein in any 3D molecule visualization software:
- PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
- Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
- Color the protein by secondary structure. Does it have more helices or sheets?
- Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
- Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?
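The visualization steps above can be collected into a PyMOL command script. The sketch below only generates the script text in Python and does not call PyMOL itself. The function name, the output file name, and the choice of which residues count as hydrophobic are assumptions made for illustration.

```python
def build_pymol_script(pdb_id="193L"):
    """Return a list of PyMOL command lines for the steps listed above."""
    # Assumed hydrophobic set; classifications vary between sources.
    hydrophobic = "ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO"
    return [
        f"fetch {pdb_id}",
        "hide everything",
        # Cartoon view coloured by secondary structure.
        "show cartoon",
        "color red, ss h",        # helices
        "color yellow, ss s",     # sheets
        "color green, ss l+''",   # loops
        # Ball-and-stick style view: sticks plus small spheres.
        "show sticks",
        "show spheres",
        "set sphere_scale, 0.25",
        # Residue-type colouring: hydrophobic vs everything else.
        "color grey80, all",
        f"color orange, resn {hydrophobic}",
        # Surface view for spotting pockets.
        "show surface",
    ]


script_lines = build_pymol_script()
with open("lysozyme_view.pml", "w") as handle:  # hypothetical file name
    handle.write("\n".join(script_lines) + "\n")
```

The resulting file can be loaded from the PyMOL command line with `@lysozyme_view.pml`. Running the commands one group at a time makes it easier to compare representations.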
Part C. Using ML-Based Protein Design Tools
In this section, we will learn about the capabilities of modern protein AI models and test some of them on your chosen protein.
Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.
Choose your favorite protein from the PDB.
We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:
C1. Protein Language Modeling
- Deep Mutational Scans
a) Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
b) Can you explain any particular pattern? (choose a residue and a mutation that stands out)
c) (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
- Latent Space Analysis
a) Use the provided sequence dataset to embed proteins in reduced dimensionality.
b) Analyze the different formed neighborhoods: do they approximate similar proteins?
c) Place your protein in the resulting map and explain its position and similarity to its neighbors.
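For the deep mutational scan in C1, a common way to turn language-model likelihoods into mutation scores is the log-likelihood ratio between the mutant and wild-type residue at each position. The sketch below illustrates only that scoring step. The probabilities are toy numbers standing in for model output, no ESM2 call is made, and `mutation_scores` is a hypothetical helper name.

```python
import math


def mutation_scores(wild_type, position_probs):
    """Score each substitution as log p(mutant) - log p(wild type).

    wild_type: sequence string.
    position_probs: one dict per position mapping residue -> probability.
    Returns {(position, wt_residue, mutant_residue): score}, 1-based positions.
    """
    scores = {}
    for pos, (wt, probs) in enumerate(zip(wild_type, position_probs), start=1):
        for mutant, p_mutant in probs.items():
            if mutant != wt:
                scores[(pos, wt, mutant)] = math.log(p_mutant) - math.log(probs[wt])
    return scores


# Toy probabilities (assumed values, not real model output).
toy_wild_type = "KV"
toy_probs = [
    {"K": 0.7, "R": 0.2, "D": 0.1},
    {"V": 0.5, "I": 0.4, "P": 0.1},
]

scores = mutation_scores(toy_wild_type, toy_probs)
for (pos, wt, mutant), score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{wt}{pos}{mutant}: {score:.2f}")
```

Negative scores mean the model finds the mutant less likely than the wild type, so the most negative entries are the substitutions predicted to be most disruptive.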
C2. Protein Folding
- Folding a protein
a) Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
b) Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?
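To prepare mutated sequences for refolding, a small helper that applies point mutations written in the usual notation (wild-type residue, 1-based position, new residue) avoids off-by-one mistakes. This is a sketch with a hypothetical function name; the example uses the first ten residues of the lysozyme sequence above.

```python
def apply_mutation(sequence, mutation):
    """Apply a point mutation such as 'F3A' to a protein sequence."""
    wild_type = mutation[0]
    position = int(mutation[1:-1])   # 1-based residue number
    new_residue = mutation[-1]
    if sequence[position - 1] != wild_type:
        raise ValueError(
            f"Expected {wild_type} at position {position}, "
            f"found {sequence[position - 1]}"
        )
    return sequence[:position - 1] + new_residue + sequence[position:]


fragment = "KVFGRCELAA"                  # residues 1-10 of the sequence above
mutant = apply_mutation(fragment, "F3A")
print(mutant)
```

The check against the wild-type residue catches mutations that were numbered against a different sequence variant.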
C3. Protein Generation
- Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN
a) Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.
b) Input this sequence into ESMFold and compare the predicted structure to your original.
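One simple way to compare a designed sequence with the original is sequence recovery, the fraction of positions where the two agree. The sketch below assumes both sequences have the same length, as they would for a fixed-backbone design. The designed fragment is a made-up example, not ProteinMPNN output.

```python
def sequence_recovery(native, designed):
    """Return the fraction of positions with identical residues."""
    if len(native) != len(designed):
        raise ValueError("Sequences must have the same length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


native_fragment = "KVFGRCELAA"     # residues 1-10 of the sequence above
designed_fragment = "KVYGRCELAK"   # hypothetical design for illustration
recovery = sequence_recovery(native_fragment, designed_fragment)
print(f"Sequence recovery: {recovery:.0%}")
```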
Part D. Group Brainstorm on Bacteriophage Engineering
Find a group of ~3–4 students
Read through the Phage Reading material listed under “Reading & Resources” below.
Review the Bacteriophage Final Project Goals for engineering the L Protein:
- Increased stability (easiest)
- Higher titers (medium)
- Higher toxicity of lysis protein (hard)
Brainstorm Session
Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).
Write a 1-page proposal (bullet points or short paragraphs) describing:
Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).
Why do you think those tools might help solve your chosen sub-problem?
Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).
Include a schematic of your pipeline.
This resource may be useful: HTGAA Protein Engineering Tools
Each group member should individually put the plan on their own HTGAA website
Include your group’s short plan for engineering a bacteriophage