Week 4 HW: Protein Design Part 1
Part A: Conceptual Questions (9)
- How many amino acid molecules do you ingest with a 500 g piece of meat? (average amino acid ≈ 100 Daltons)
- Why can humans eat beef without becoming cows, or eat fish without becoming fish?
- Why are there only 20 natural amino acids?
- Can you make other non-natural amino acids? Design some new amino acids.
- Where did amino acids come from before enzymes that make them, and before life started?
- If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
- Can you discover additional helices in proteins?
- Why are most molecular helices right-handed?
- Why do β-sheets tend to aggregate?
- What is the driving force for β-sheet aggregation?
- Why do many amyloid diseases form β-sheets?
- Can you use amyloid β-sheets as materials?
- Design a β-sheet motif that forms a well-ordered structure.
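The first question above reduces to a mole calculation. The following is a minimal sketch, assuming the whole 500 g is counted as amino acids at about 100 Da each, which gives an upper bound. The `protein_fraction` parameter is a hypothetical knob for a more realistic protein content and is not a value given in the assignment.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def amino_acid_molecules(mass_g, avg_mass_da=100.0, protein_fraction=1.0):
    """Estimate the number of amino acid molecules in a sample.

    1 Da corresponds to 1 g/mol, so moles = grams / Daltons.
    protein_fraction is an illustrative assumption (1.0 = all amino acids).
    """
    moles = mass_g * protein_fraction / avg_mass_da
    return moles * AVOGADRO


upper_bound = amino_acid_molecules(500)
print(f"{upper_bound:.2e}")  # 3.01e+24
```

Passing a smaller `protein_fraction` scales the estimate down linearly, so the order of magnitude is not very sensitive to the exact protein content assumed.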
Part B: Protein Analysis and Visualization
- Briefly describe the protein you selected and why you selected it.
- Identify the amino acid sequence of your protein.
- How long is it? What is the most frequent amino acid?
- How many protein sequence homologs are there? (Use UniProt BLAST)
- Does your protein belong to any protein family?
- Identify the structure page of your protein in RCSB.
- When was the structure solved? Is it good quality? (Resolution: smaller = better, aim < 2.70 Å)
- Are there any other molecules in the solved structure apart from protein?
- Does your protein belong to any structure classification family?
- Open the structure in 3D visualization software (PyMOL):
  - Visualize as “cartoon”, “ribbon”, and “ball and stick”
  - Color by secondary structure: are there more helices or sheets?
  - Color by residue type: how are hydrophobic and hydrophilic residues distributed?
  - Visualize the surface: are there any binding pockets?
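The sequence questions in this part (length, most frequent amino acid, hydrophobic content) can be answered with a few lines of standard-library Python. This is a sketch only: the `example` sequence is a made-up placeholder, and the `HYDROPHOBIC` set is one common grouping among several in use.

```python
from collections import Counter

# One common grouping of hydrophobic residues; classifications vary by source.
HYDROPHOBIC = set("AVILMFWC")


def summarize(sequence):
    """Return length, most frequent residue, and hydrophobic fraction."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    counts = Counter(seq)
    residue, n = counts.most_common(1)[0]
    hydrophobic = sum(counts[aa] for aa in HYDROPHOBIC)
    return {
        "length": len(seq),
        "most_frequent": residue,
        "count": n,
        "hydrophobic_fraction": hydrophobic / len(seq),
    }


example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder, not a real assignment protein
print(summarize(example))
```

Paste the sequence from your protein's FASTA record (without the header line) in place of `example`.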
Part C: Using ML-Based Protein Design Tools
C1. Protein Language Modeling
Deep Mutational Scans
- Use ESM2 to generate an unsupervised deep mutational scan based on language model likelihoods
- Can you explain any particular pattern? (choose a residue and mutation that stands out)
- (Bonus) Compare language model predictions to experimental scans
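A likelihood-based scan is often scored as a log-likelihood ratio between each mutant residue and the wild type. The sketch below shows only that scoring step. It assumes you already have per-position log-probabilities over the 20 amino acids (for example, from a masked language model such as ESM2); `toy_log_probs` is a hand-written stand-in, not model output.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutation_scores(sequence, log_probs):
    """Score every single substitution as log p(mutant) - log p(wild type).

    log_probs[i] maps each amino acid to its log-probability at position i.
    Negative scores mean the mutant is less likely than the wild type.
    """
    scores = {}
    for i, wt in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != wt:
                scores[f"{wt}{i + 1}{aa}"] = log_probs[i][aa] - log_probs[i][wt]
    return scores


def toy_log_probs(sequence, p_wt=0.62):
    """Hypothetical stand-in for model output: wild type gets p_wt, the rest is uniform."""
    rest = (1 - p_wt) / 19
    return [
        {aa: math.log(p_wt if aa == wt else rest) for aa in AMINO_ACIDS}
        for wt in sequence
    ]


seq = "MKT"
scores = mutation_scores(seq, toy_log_probs(seq))
print(len(scores))  # 57 (3 positions x 19 substitutions)
```

With real model output, positions where almost every substitution scores strongly negative are candidates for conserved or structurally important residues.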
Latent Space Analysis
- Embed proteins in reduced dimensionality using the provided sequence dataset
- Analyze neighborhoods — do they approximate similar proteins?
- Place your protein in the map and explain its position and similarity to neighbors
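Neighborhood analysis comes down to ranking proteins by similarity to a query embedding. Here is a small sketch using cosine similarity, assuming embeddings are plain lists of floats; the protein names and three-dimensional vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(query, embeddings, k=3):
    """Return the k (name, similarity) pairs most similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical); real ones would come from the language model.
embeddings = {
    "protein_a": [0.9, 0.1, 0.0],
    "protein_b": [0.8, 0.2, 0.1],
    "protein_c": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
for name, sim in nearest_neighbors(query, embeddings, k=2):
    print(name, round(sim, 3))
```

Neighbors found in the full embedding space can differ from those that appear close in a 2D projection, so it may be worth checking both.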
C2. Protein Folding
- Fold your protein with ESMFold — do predicted coordinates match the original structure?
- Try mutations, then larger sequence changes — is the structure resilient?
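One way to check whether predicted coordinates match the original is a distance-based RMSD (dRMSD), which compares all pairwise distances and therefore needs no superposition. This is a simplified substitute for superposition-based RMSD or TM-score, not the method any particular tool uses, and it cannot distinguish a structure from its mirror image. The coordinates below are toy values standing in for Cα positions.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Root-mean-square difference between matching pairwise distances."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms")
    squared = [
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy C-alpha traces (hypothetical): a straight chain, a translated copy, a bent copy.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in reference]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

print(distance_rmsd(reference, shifted))  # translation leaves distances unchanged
print(distance_rmsd(reference, bent))     # bending changes the end-to-end distance
```

The same function can be reused to compare the wild-type prediction against each mutant prediction when testing how resilient the fold is.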
C3. Protein Generation
- Use ProteinMPNN to inverse-fold your protein backbone and propose sequence candidates
- Analyze predicted sequence probabilities vs the original sequence
- Input the new sequence into ESMFold and compare the predicted structure to original
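A simple first comparison between a designed sequence and the original is sequence recovery: the fraction of positions where the two agree. The sketch below assumes both sequences have the same length, as they do when a fixed backbone is redesigned; the two sequences are made-up placeholders.

```python
def sequence_identity(original, designed):
    """Fraction of positions with the same residue in both sequences."""
    if len(original) != len(designed):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(original, designed))
    return matches / len(original)


def differences(original, designed):
    """List substitutions in the usual notation, e.g. 'T3S' (1-based position)."""
    return [
        f"{a}{i}{b}"
        for i, (a, b) in enumerate(zip(original, designed), start=1)
        if a != b
    ]


original = "MKTAY"  # placeholder sequences for illustration
designed = "MKSAY"
print(sequence_identity(original, designed))  # 0.8
print(differences(original, designed))        # ['T3S']
```

Listing the substitutions alongside the identity makes it easier to see whether changes cluster on the surface or in the core when you inspect the refolded structure.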
Part D: Group Brainstorm on Bacteriophage Engineering