Week 4 HW: Protein Design Part I (continued)
Open the structure of your protein in any 3D molecule visualization software:
PyMOL Tutorial Here (hint: ChatGPT is good at PyMOL commands)
Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
Color the protein by secondary structure. Does it have more helices or sheets?
Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?
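As a rough complement to the visual inspection, the hydrophobic/hydrophilic split can also be counted directly from the sequence. The sketch below is only illustrative: the grouping in `HYDROPHOBIC` is one simplified convention (several residues, such as G, P, Y and C, are classified differently by different schemes), and the toy sequence is hypothetical, not your protein.

```python
# Assumption: one simplified grouping of hydrophobic residues.
# Other schemes treat G, P, Y and C differently.
HYDROPHOBIC = set("AVILMFWC")


def classify_residues(sequence):
    """Return one label per residue: 'H' (hydrophobic) or 'P' (polar/charged)."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())


def hydrophobic_fraction(sequence):
    """Fraction of residues that fall in the HYDROPHOBIC set."""
    labels = classify_residues(sequence)
    return labels.count("H") / len(labels) if labels else 0.0


# Hypothetical toy sequence, used only to show the call pattern.
toy_sequence = "MKVLAAGDE"
print(classify_residues(toy_sequence))
print(round(hydrophobic_fraction(toy_sequence), 2))
```

A sequence-level count says nothing about where the residues sit in 3D; the surface and cartoon views are still needed to see whether the hydrophobic residues are buried.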
Part C. Using ML-Based Protein Design Tools
In this section, we will learn about the capabilities of modern protein AI models and test some of them on your chosen protein. Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a Colab instance with a GPU. Choose your favorite protein from the PDB. We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:
Google Colab link: https://colab.research.google.com/drive/18KZPAuXPo5UIwWk3dfZHJ62drv69M2i8?usp=sharing
C1. Protein Language Modeling
Deep Mutational Scans
- Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
- Can you explain any particular pattern? (choose a residue and a mutation that stands out)
- (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
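One common way to turn language model likelihoods into a mutational scan is to score each substitution as log p(mutant) minus log p(wild type) at that position. The sketch below shows only that scoring step and may differ from what the course notebook does. It assumes the per-position log-probabilities have already been obtained from the model; `toy_log_probs` is a hypothetical stand-in that fabricates them so the example runs without a GPU.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutational_scan(sequence, log_probs):
    """Score every single substitution as log p(mutant) - log p(wild type).

    log_probs[i] maps amino acid -> log-probability at position i
    (assumed to come from the language model; not computed here).
    """
    scores = {}
    for i, wt in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue
            label = f"{wt}{i + 1}{aa}"  # 1-based numbering, e.g. "L2P"
            scores[label] = log_probs[i][aa] - log_probs[i][wt]
    return scores


def toy_log_probs(sequence, wt_prob=0.62):
    """Hypothetical stand-in for model output: the wild type is favored."""
    other = (1.0 - wt_prob) / (len(AMINO_ACIDS) - 1)
    return [
        {aa: math.log(wt_prob if aa == wt else other) for aa in AMINO_ACIDS}
        for wt in sequence
    ]


# Toy example: make proline at position 2 especially unlikely.
# (The toy distribution is no longer normalized; this is illustration only.)
sequence = "AL"
log_probs = toy_log_probs(sequence)
log_probs[1]["P"] = math.log(0.001)
scores = mutational_scan(sequence, log_probs)
worst = min(scores, key=scores.get)
print(worst, round(scores[worst], 2))
```

More negative scores mean the model finds the mutant less likely than the wild type, which is the signal usually read as "deleterious" in a heatmap of the scan.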
ESM2 predicts that mutations introducing proline or charged residues into the hydrophobic core are highly deleterious, while many surface positions tolerate a broad range of substitutions, which matches classical biophysical expectations. For the bonus comparison, the unsupervised ESM2 scores correlate moderately with experimental fitness (Spearman ρ ≈ 0.4), indicating that the language model captures some aspects of mutational impact but does not fully reproduce the experimental landscape.
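The Spearman correlation quoted above compares the rank order of predicted scores with the rank order of measured fitness, so it does not require the two to be on the same scale. A minimal pure-Python sketch is given below, assuming two equal-length lists of paired values; the numbers in the example are made up and are not the data behind ρ ≈ 0.4.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up paired values (model score, experimental fitness) for illustration.
predicted = [1, 2, 3, 4, 5]
measured = [2, 1, 4, 3, 5]
print(round(spearman(predicted, measured), 3))
```

In the notebook environment, `scipy.stats.spearmanr` computes the same quantity and also returns a p-value.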
Latent Space Analysis
- Use the provided sequence dataset to embed proteins in reduced dimensionality.
- Analyze the neighborhoods that form: do they group similar proteins?
- Place your protein in the resulting map and explain its position and similarity to its neighbors.
In the ESM2 latent space, our protein (ID: X) lies in a cluster on the right side of the map (high PC1, intermediate PC2).
Its nearest neighbors in this space are sequences that are all annotated as [e.g. “class A β‑lactamases” / “kinase domains” / “transmembrane receptors”] and have similar sequence lengths (≈ N residues).
This suggests that ESM2 groups our protein with others of similar sequence and function.
Therefore, the position of our protein in the map reflects its membership in this family, and its local neighborhood corresponds to proteins with similar structural and functional properties.
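The nearest neighbors mentioned above can be checked numerically rather than read off the plot. The sketch below ranks proteins by cosine similarity to a query embedding; the embedding vectors and IDs are hypothetical 2D toys, whereas real ESM2 embeddings have hundreds of dimensions or more and would come from the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query_id, embeddings, k=3):
    """Return the k (id, similarity) pairs most similar to the query."""
    query = embeddings[query_id]
    sims = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:k]


# Hypothetical toy embeddings; IDs and values are placeholders.
toy_embeddings = {
    "query": [1.0, 0.0],
    "close": [0.9, 0.1],
    "far": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
for protein_id, sim in nearest_neighbors("query", toy_embeddings, k=2):
    print(protein_id, round(sim, 3))
```

Neighbors computed in the full embedding space can differ from those seen in the 2D map, because the projection discards most of the variance; comparing the two is a useful sanity check on the plot.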
C2. Protein Folding
Folding a protein
Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
Try changing the sequence: first introduce a few point mutations, then replace large segments. Is your protein structure resilient to mutations?
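To keep the sequence edits reproducible before refolding, it can help to generate the variants programmatically instead of editing the string by hand. The helpers below are a hypothetical sketch (the function names and the toy sequence are not from the notebook): one applies a point mutation written in the usual "V3P" notation and checks the wild-type residue, the other swaps out a whole segment.

```python
import re


def apply_point_mutation(sequence, mutation):
    """Apply a mutation such as 'V3P' (wild type, 1-based position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Cannot parse mutation: {mutation!r}")
    wild_type, position, new_residue = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    if sequence[position - 1] != wild_type:
        raise ValueError(
            f"Expected {wild_type} at position {position}, found {sequence[position - 1]}"
        )
    return sequence[: position - 1] + new_residue + sequence[position:]


def replace_segment(sequence, start, end, new_segment):
    """Replace residues start..end (1-based, inclusive) with new_segment."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("Segment bounds are outside the sequence")
    return sequence[: start - 1] + new_segment + sequence[end:]


# Hypothetical toy sequence, used only to show the call pattern.
wild_type_sequence = "MKVLA"
print(apply_point_mutation(wild_type_sequence, "V3P"))
print(replace_segment(wild_type_sequence, 2, 4, "GGG"))
```

Each variant string can then be passed to ESMFold in the notebook, and the predicted structure compared against the wild-type prediction.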
For the latent space analysis in C1, the ESM-2 embeddings, reduced to 2D via PCA, form distinct neighborhoods in which proteins cluster by sequence and functional similarity. For instance, human cytosolic carbonic anhydrases (e.g., CA1 and CA2) form a tight cluster on the left, sharing >70% identity and roles in pH homeostasis. In contrast, more divergent homologs (e.g., membrane-bound variants) form a separate neighborhood on the right, grouping proteins with similar catalytic mechanisms but different structural adaptations. This indicates that the latent space effectively groups similar proteins.
Our protein (2COJ, human carbonic anhydrase II) is positioned within the cytosolic carbonic anhydrase cluster (PC1 ≈ -5, PC2 ≈ 2), next to close homologs such as human CA1. This placement reflects its high sequence similarity to these neighbors, including shared zinc-binding motifs, and indicates functional relatedness in CO2 hydration and pH regulation. Because ESM-2 embeds the amino acid sequence alone, the ligand-bound state of the crystal structure cannot affect the embedding; any slight offset from the cluster core must come from differences in the input sequence itself.
Part D. Group Brainstorm on Bacteriophage Engineering
1. Find a group of ~3–4 students
2. Read through the Phage Reading material listed under “Reading & Resources” below.
3. Review the Bacteriophage Final Project Goals for engineering the L Protein:
- Increased stability (easiest)
- Higher titers (medium)
- Higher toxicity of lysis protein (hard)
4. Brainstorm Session
- Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).
- Write a 1-page proposal (bullet points or short paragraphs) describing:
  - Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).
  - Why do you think those tools might help solve your chosen sub-problem?
  - Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).
- Include a schematic of your pipeline.
- This resource may be useful: HTGAA Protein Engineering Tools
5. Each individually put your plan on your HTGAA website
- Include your group’s short plan for engineering a bacteriophage