Week 4 HW: Protein Design Part I

Due Date

Due by start of Mar 3 Lecture

A. Conceptual Questions

Answer any NINE of the following questions from Shuguang Zhang:

How many amino acid molecules do you consume when you eat a 500-gram piece of meat? (On average, an amino acid weighs ~100 Daltons.)
Why do humans eat beef but not become cows, or eat fish but not become fish?
Why are there only 20 natural amino acids?
Can you make other non-natural amino acids? Design some new amino acids.
Where did amino acids come from before the enzymes that make them existed, and before life started?
If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
Can you discover additional helices in proteins?
Why are most molecular helices right-handed?
Why do β-sheets tend to aggregate?
- What is the driving force for β-sheet aggregation?
Why do many amyloid diseases form β-sheets?
- Can you use amyloid β-sheets as materials?
Design a β-sheet motif that forms a well-ordered structure.

B. Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

Briefly describe the protein you selected and why you selected it.
Identify the amino acid sequence of your protein.
- How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
- How many protein sequence homologs are there for your protein? Hint: Use UniProt’s BLAST tool to search for homologs.
- Does your protein belong to any protein family?
Identify the structure page of your protein in the RCSB PDB.
- When was the structure solved? Is it a good-quality structure? A good-quality structure is one with good resolution; the smaller the value, the better (e.g., Resolution: 2.70 Å).
- Are there any other molecules in the solved structure apart from protein?
- Does your protein belong to any structure classification family?
Open the structure of your protein in any 3D molecule visualization software:
- PyMOL tutorial here (hint: ChatGPT is good at PyMOL commands)
- Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
- Color the protein by secondary structure. Does it have more helices or sheets?
- Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
- Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?
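For the sequence questions above (length and most frequent amino acid), a minimal sketch in plain Python is shown here as an alternative to the Colab notebook. The function name `amino_acid_frequencies` and the toy sequence are assumptions made for illustration; paste in the sequence of your own protein.

```python
from collections import Counter


def amino_acid_frequencies(sequence):
    """Return (residue, count, fraction) tuples, most frequent first."""
    # Drop whitespace/newlines (common after copying FASTA text) and normalise case.
    cleaned = "".join(sequence.split()).upper()
    counts = Counter(cleaned)
    total = len(cleaned)
    return [(aa, n, n / total) for aa, n in counts.most_common()]


# Hypothetical toy sequence, not a real protein; replace with your own.
example = "MKTAYIAKQRQISFVKSHFSRQ"

print("Length:", len(example))
for aa, n, frac in amino_acid_frequencies(example)[:3]:
    print(f"{aa}: {n} ({frac:.1%})")
```

Ties between equally frequent residues are reported in order of first appearance, so check the counts before naming a single "most frequent" residue.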

C. Using ML-Based Protein Design Tools

In this section, we will learn about the capabilities of modern protein AI models and test some of them on your chosen protein.

Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a Colab instance with a GPU.
Choose your favorite protein from the PDB.
We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

Protein language modeling

Deep Mutational Scans
1. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
2. Can you explain any particular pattern? (choose a residue and a mutation that stands out)
3. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
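To make the deep mutational scan step more concrete, here is a sketch of one common way to turn language model likelihoods into mutation scores: the log-probability of the mutant residue minus the log-probability of the wild-type residue at the same position. The course notebook may compute this differently. The input format `log_probs` and the helper `toy_log_probs` are hypothetical stand-ins for real ESM2 output, used only so the example runs without a model.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutation_scores(wild_type, log_probs):
    """Score each substitution as log p(mutant) - log p(wild type).

    log_probs[i] is a dict mapping each amino acid to the model's
    log-probability at position i (hypothetical input format).
    """
    scores = {}
    for i, wt in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue
            # Label as <wild type><1-based position><mutant>, e.g. "L2A".
            scores[f"{wt}{i + 1}{aa}"] = log_probs[i][aa] - log_probs[i][wt]
    return scores


def toy_log_probs(sequence, wt_prob=0.62):
    """Fake model output: the wild-type residue gets wt_prob, the rest share the remainder."""
    other = (1.0 - wt_prob) / (len(AMINO_ACIDS) - 1)
    return [
        {aa: math.log(wt_prob if aa == wt else other) for aa in AMINO_ACIDS}
        for wt in sequence
    ]


# Hypothetical three-residue sequence, for illustration only.
seq = "MLS"
scores = mutation_scores(seq, toy_log_probs(seq))
print(len(scores), "substitutions scored")
```

Under this convention a negative score means the model finds the mutant less likely than the wild type at that position; residues where almost every substitution scores very low are candidates for the "pattern that stands out" question.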
Latent Space Analysis
1. Use the provided sequence dataset to embed proteins in a reduced-dimensionality space.
2. Analyze the neighborhoods that form: do they group similar proteins?
3. Place your protein in the resulting map and explain its position and similarity to its neighbors.

Folding a protein

Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
Try changing the sequence: first try a few point mutations, then replace large segments. Is your protein structure resilient to mutations?
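When editing sequences by hand it is easy to shift a position by one, so a small helper can be useful for the mutation experiments above. This is a sketch under assumptions: the function names and the `K2A`-style mutation notation (wild-type residue, 1-based position, mutant residue) are choices made here, not part of the course notebook.

```python
import re


def apply_point_mutation(sequence, mutation):
    """Apply a substitution written like 'K2A' (wild type, 1-based position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognised mutation format: {mutation!r}")
    wt, pos, mt = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"Position {pos} is outside the sequence")
    # Guard against off-by-one errors: the stated wild-type residue must match.
    if sequence[pos - 1] != wt:
        raise ValueError(f"Expected {wt} at position {pos}, found {sequence[pos - 1]}")
    return sequence[:pos - 1] + mt + sequence[pos:]


def replace_segment(sequence, start, end, new_segment):
    """Replace residues start..end (1-based, inclusive) with new_segment."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("Segment bounds are outside the sequence")
    return sequence[:start - 1] + new_segment + sequence[end:]


# Hypothetical toy sequence, for illustration only.
toy = "MKTAYIAK"
print(apply_point_mutation(toy, "K2A"))
print(replace_segment(toy, 3, 6, "GGGG"))
```

Each returned string can be pasted into ESMFold as a new input; the wild-type check raises an error instead of silently mutating the wrong residue.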

Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB structure to propose sequence candidates via ProteinMPNN.

Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.
Input this sequence into ESMFold and compare the predicted structure to your original.
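One simple way to compare a ProteinMPNN design against the original sequence is the fraction of positions that are identical (often called sequence recovery), together with a list of the positions that changed. The sketch below assumes both sequences have the same length, which holds when the design is generated on the fixed backbone of the original; the function name and toy sequences are hypothetical.

```python
def sequence_recovery(original, designed):
    """Return (fraction of identical positions, list of differences like 'T3S')."""
    if len(original) != len(designed):
        raise ValueError("Sequences must have the same length")
    if not original:
        raise ValueError("Sequences must not be empty")
    pairs = list(zip(original, designed))
    identical = sum(a == b for a, b in pairs)
    # Report differences as <original><1-based position><designed>.
    differences = [f"{a}{i}{b}" for i, (a, b) in enumerate(pairs, start=1) if a != b]
    return identical / len(pairs), differences


# Hypothetical toy sequences, for illustration only.
identity, changed = sequence_recovery("MKTAYIAK", "MKSAYLAK")
print(f"Recovery: {identity:.0%}; changed positions: {changed}")
```

A low recovery does not by itself mean the design is wrong; the point of folding the designed sequence with ESMFold is to check whether a different sequence still yields the same structure.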

D. Group Brainstorm on Bacteriophage Engineering

Main Goals

Goal 1: Increase the stability of the MS2 lysis protein by predicting mutations of residues near the C-terminal region and surrounding the LS motif.
Goal 2: Improve the N-terminal region by modifying residues that contribute to its toxic activity or by adding new functional regions that may increase its toxicity.

We aim to improve the toxicity of the MS2 lysis protein by modifying regions not related to Leu48 and Ser49 (the LS motif) and its surroundings (Chamakura et al., 2017), and to predict mutations that may improve its stability. We hypothesize that, with increased stability, the protein would no longer require DnaJ for its action.

Another goal is to design new accessory sequences for the N-terminal region to improve lysis toxicity. Berkhout et al. (1985) suggest that the C-terminal region is key for protein activity; taking this into consideration, we can try to modify the N-terminal region to improve protein stability or to add new features that may improve the toxicity of the protein.

Strategy

(Which tools/approaches from recitation do you propose using, and why do you think those tools might help solve your chosen sub-problem?)

Given our two main goals, we propose a different strategy for each objective.

For the first goal, we propose using a protein language model such as ESM-2 to perform an in silico deep mutational scan that evaluates the plausibility of all possible single-point mutations in the MS2 L protein. Subsequently, we will use ESMFold or AlphaFold2 to predict the resulting changes in 3D structure.

For the second goal:

Step 1: Identify and annotate key functional regions near the C-terminal motif and the LS motif.

Software: BLAST (for conserved domains), PeSTo (for functional motifs)

Step 2: Predict mutations near the N-terminal and C-terminal regions that may improve protein stability.

Software: Clustal Omega (to identify hotspots for mutations)

Step 3: Generate different protein candidates with mutations and evaluate their stability.

Software: AlphaFold-Multimer, Boltz-1

We propose using AlphaFold with a training set specific to bacteriophages.

Step 4: Predict accessory peptide sequences to insert in the N-terminal region to improve its toxicity.

Software: Foldseek (to find distantly related sequences with similar folds), EvolvePro (to suggest optimized N-terminal sequences)

Step 5: Test the suitability of these protein candidates by performing docking assays with a bacterial membrane model, etc.

Pitfalls

Strategy/Software	Core Limitation	Risks
Structural prediction & design (AlphaFold, Foldseek, EvolvePro, Boltz-1)	These models can predict structures that look stable and coherent, but they do not measure real folding energy, membrane insertion, or toxicity. “Looks good” in silico ≠ “works better” in vivo.	Selection of variants that appear structurally improved but do not increase stability or toxicity, or that even reduce lytic activity.
Phage-specific training / limited viral datasets