Week 4 HW: Protein Design Part 1

Part A: Conceptual Questions (9)

  1. How many amino acid molecules do you ingest when you eat a 500 g piece of meat? (Assume an average amino acid mass of ~100 Da.)
  2. Why do humans not become cows when they eat beef, or fish when they eat fish?
  3. Why are there only 20 natural amino acids?
  4. Can you make other non-natural amino acids? Design some new amino acids.
  5. Where did amino acids come from before life began and before there were enzymes to synthesize them?
  6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
  7. Can you discover additional helices in proteins?
  8. Why are most molecular helices right-handed?
  9. Why do β-sheets tend to aggregate?
  10. What is the driving force for β-sheet aggregation?
  11. Why do many amyloid diseases form β-sheets?
  12. Can you use amyloid β-sheets as materials?
  13. Design a β-sheet motif that forms a well-ordered structure.
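
The arithmetic behind question 1 can be sketched with Avogadro's number. This is a minimal sketch, not a model answer: the function name and the protein_fraction parameter are assumptions introduced here. The default of 1.0 treats the whole 500 g as amino acids, as the question's wording suggests; a lower value (for example a rough 0.2) would account for water and fat.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def amino_acid_molecules(mass_g, avg_mass_da=100.0, protein_fraction=1.0):
    """Estimate how many amino acid molecules a piece of meat contains.

    A mass of 1 Da per molecule corresponds to 1 g per mole, so the
    number of moles is simply grams of amino acids / average mass in Da.
    protein_fraction is a hypothetical knob, not part of the question.
    """
    moles = mass_g * protein_fraction / avg_mass_da
    return moles * AVOGADRO

# 500 g / (100 g/mol) = 5 mol of amino acids
print(f"{amino_acid_molecules(500):.2e}")  # prints 3.01e+24
```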

Part B: Protein Analysis and Visualization

  1. Briefly describe the protein you selected and why you selected it.
  2. Identify the amino acid sequence of your protein.
  3. How long is it? What is the most frequent amino acid?
  4. How many protein sequence homologs are there? (Use UniProt BLAST)
  5. Does your protein belong to any protein family?
  6. Identify the structure page of your protein in RCSB.
  7. When was the structure solved? Is it of good quality? (Resolution: smaller = better, aim < 2.70 Å)
  8. Are there any other molecules in the solved structure apart from protein?
  9. Does your protein belong to any structure classification family?
  10. Open the structure in 3D visualization software (PyMOL):
    • Visualize as “cartoon”, “ribbon”, and “ball and stick”
    • Color by secondary structure — more helices or sheets?
    • Color by residue type — hydrophobic vs hydrophilic distribution?
    • Visualize the surface — any binding pockets?
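
Questions 2 and 3 (sequence, length, most frequent amino acid) can be answered with a few lines of standard-library Python once the FASTA record has been downloaded. The sketch below assumes a single-record FASTA; the function names and the toy sequence are hypothetical placeholders, not a real database entry.

```python
from collections import Counter

def parse_fasta(text):
    """Return the sequence from single-record FASTA text, skipping '>' header lines."""
    lines = [ln.strip() for ln in text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()

def summarize_sequence(seq):
    """Report sequence length and the most frequent residue."""
    if not seq:
        raise ValueError("empty sequence")
    residue, n = Counter(seq).most_common(1)[0]
    return {
        "length": len(seq),
        "most_frequent": residue,
        "count": n,
        "fraction": n / len(seq),
    }

# Toy record for illustration only (hypothetical, not a real protein).
fasta = """>toy_protein hypothetical example
MKTLLLTLVVVTIVCLDLGYT
"""
print(summarize_sequence(parse_fasta(fasta)))
```

Replace the toy record with the FASTA text of your own protein to get its length and composition.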

Part C: Using ML-Based Protein Design Tools

C1. Protein Language Modeling

Deep Mutational Scans

  • Use ESM2 to generate an unsupervised deep mutational scan based on language model likelihoods
  • Can you explain any particular pattern? (choose a residue and mutation that stands out)
  • (Bonus) Compare language model predictions to experimental scans
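
One common way to turn language model likelihoods into a mutational scan is to score each substitution as the log-probability of the mutant residue minus that of the wild-type residue at the same position. The sketch below shows only that scoring step and leaves out the ESM2 call itself: log_probs is assumed to hold per-position log-probabilities that you have already extracted from the model, and the function name and toy numbers are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutational_scan(wild_type, log_probs):
    """Score every single substitution as a log-likelihood ratio.

    log_probs[i][aa] is assumed to be the model's log-probability of
    amino acid aa at position i. Returns {(position, wt, mutant): score}
    with 1-based positions; negative scores mean the model prefers the
    wild-type residue.
    """
    if len(log_probs) != len(wild_type):
        raise ValueError("need one log-probability table per residue")
    scores = {}
    for i, wt in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa != wt:
                scores[(i + 1, wt, aa)] = log_probs[i][aa] - log_probs[i][wt]
    return scores

# Toy input (made-up numbers): position 1 is unconstrained,
# position 2 strongly prefers lysine.
uniform = {aa: math.log(1 / 20) for aa in AMINO_ACIDS}
conserved_k = {aa: math.log(0.01) for aa in AMINO_ACIDS}
conserved_k["K"] = math.log(0.81)

scores = mutational_scan("MK", [uniform, conserved_k])
print(scores[(1, "M", "A")], scores[(2, "K", "A")])
```

Rows or columns of strongly negative scores are the patterns worth explaining, since they mark positions or substitutions the model considers unlikely.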

Latent Space Analysis

  • Embed proteins in reduced dimensionality using the provided sequence dataset
  • Analyze neighborhoods — do they approximate similar proteins?
  • Place your protein in the map and explain its position and similarity to neighbors
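
To check whether neighborhoods in the map group similar proteins, it can help to rank neighbors by cosine similarity in the embedding space rather than relying only on the 2D projection. The following is a minimal sketch under stated assumptions: embeddings is a plain dictionary of name to vector, and the names and three-dimensional vectors are invented for illustration (real ESM2 embeddings have hundreds of dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, k=3):
    """Return the k (name, similarity) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings, for illustration only.
embeddings = {
    "kinase_A": [0.9, 0.1, 0.0],
    "kinase_B": [0.8, 0.2, 0.1],
    "protease_C": [0.0, 0.1, 0.9],
}
my_protein = [1.0, 0.0, 0.0]
print(nearest_neighbors(my_protein, embeddings, k=2))
```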

C2. Protein Folding

  • Fold your protein with ESMFold — do predicted coordinates match the original structure?
  • Try mutations, then larger sequence changes — is the structure resilient?
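
Predicted and experimental coordinates usually sit in different reference frames, so a direct coordinate comparison first requires superposition. One alternative that avoids superposition is the distance RMSD (dRMSD), which compares all pairwise distances within each structure. The sketch below is one possible check, not the assignment's prescribed metric; it assumes you have already extracted matching lists of Cα coordinates, and the toy coordinates are invented.

```python
import math
from itertools import combinations

def distance_rmsd(coords_a, coords_b):
    """Root-mean-square difference between the two internal distance matrices.

    coords_a and coords_b are equal-length lists of (x, y, z) tuples for
    corresponding atoms. The result does not change under rotation or
    translation of either structure.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms")
    sq_diffs = [
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(sq_diffs) / len(sq_diffs))

# Toy check: the same four points, rotated 90 degrees about z and shifted.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
moved = [(-y + 10.0, x - 5.0, z + 2.0) for x, y, z in reference]
print(distance_rmsd(reference, moved))  # close to 0: same shape, different frame
```

Because it uses only distances, dRMSD cannot tell a structure from its mirror image, so it complements rather than replaces a visual comparison of the two structures.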

C3. Protein Generation

  • Use ProteinMPNN to inverse-fold your protein backbone and propose sequence candidates
  • Analyze predicted sequence probabilities vs the original sequence
  • Input the new sequence into ESMFold and compare the predicted structure to the original
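
A simple way to compare a designed sequence with the original is sequence recovery: the fraction of positions where the two agree. The sketch below assumes both sequences have the same length, which holds when the backbone is kept fixed during inverse folding. The function names and toy sequences are hypothetical.

```python
def sequence_recovery(original, designed):
    """Fraction of positions at which the designed residue matches the original."""
    if len(original) != len(designed):
        raise ValueError("sequences must have the same length")
    if not original:
        raise ValueError("empty sequence")
    matches = sum(a == b for a, b in zip(original, designed))
    return matches / len(original)

def substitutions(original, designed):
    """List differences as (1-based position, original residue, designed residue)."""
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(original, designed))
        if a != b
    ]

# Toy sequences for illustration only.
original = "MKTAYIAK"
designed = "MKSAYLAK"
print(sequence_recovery(original, designed))  # prints 0.75
print(substitutions(original, designed))      # prints [(3, 'T', 'S'), (6, 'I', 'L')]
```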

Part D: Group Brainstorm on Bacteriophage Engineering