Protein Design Part I
Exploring amino acids, structures, and ML-based design tools.

Conceptual Questions

How many amino acid molecules do you take in with a 500-gram piece of meat?
On average, 100 g of beef provides 26 g of protein. So: 26 g protein / 100 g meat * 500 g meat = 130 g total protein per 500 g of beef.

One amino acid weighs approximately 100 Daltons (100 g/mol). For 130 g of protein, we have 130 g / (100 g/mol) = 1.3 moles. Using Avogadro's number: 1.3 * 6.022*10^23 ≈ 7.83*10^23 molecules.
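The arithmetic above can be reproduced with a minimal sketch. The function name and default parameters are illustrative assumptions; the 26% protein fraction and the 100 g/mol residue mass are the rough values used in this answer, not measured constants.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_count(meat_g, protein_fraction=0.26, residue_mass_g_per_mol=100.0):
    """Estimate how many amino acid molecules a piece of meat contains.

    protein_fraction and residue_mass_g_per_mol are the approximations
    used in the text (26 g protein per 100 g beef, ~100 Da per amino acid).
    """
    protein_g = meat_g * protein_fraction
    moles = protein_g / residue_mass_g_per_mol
    return moles * AVOGADRO


count = amino_acid_count(500)
print(f"{count:.2e} amino acid molecules")
```

Changing `residue_mass_g_per_mol` to a different average (for example 110) shows how sensitive the estimate is to that assumption.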
Why are there only 20 natural amino acids?
The set of 20 amino acids is the result of billions of years of evolution. They provide a chemically diverse set of tools to create all protein structures and functions, while staying optimized for the genetic code's 64-codon framework. Adding more would demand greater cellular complexity and raise error rates; using fewer would limit chemical functionality.
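The 64-codon framework mentioned above follows from counting triplets over four bases. The sketch below enumerates them, assuming the standard genetic code with its three stop codons; the variable names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Stop codons of the standard genetic code (an assumption: some organisms
# and organelles use variant codes).
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every ordered triplet of the four RNA bases.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
sense_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons))        # 64
print(len(sense_codons))  # 61
# Average number of codons available per amino acid (redundancy of the code).
print(len(sense_codons) / 20)
```

With 61 sense codons for 20 amino acids, most amino acids are encoded by more than one codon, which is part of what buffers the code against point mutations.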
Why do humans eat beef but do not become a cow?
During digestion, food is broken apart into small peptides and constituent amino acids. The food we eat doesn't dictate how these amino acids are re-linked. Our DNA encodes the instructions, transmitted via mRNA, that our ribosomes follow to build human proteins.
Can you make other non-natural amino acids?
Yes. Non-natural amino acids are widely used in drug development to improve the therapeutic properties of peptides. A common example is the N-methyl amino acid, where a hydrogen on the amino group (-NH2) is replaced with a methyl group (giving -NHCH3), improving bioavailability and enzymatic resistance.
Where did amino acids come from before life started?
There are two primary hypotheses:
  • Extraterrestrial: Amino acids have been found in meteorites, suggesting they formed in space and were delivered to Earth.
  • Prebiotic soup: Amino acids can form from simpler molecules like methane, ammonia, and water under early Earth atmospheric conditions.
If you make an α-helix using D-amino acids, what handedness would you expect?
Due to chirality, we expect a left-handed α-helix. Nature uses L-amino acids, which form right-handed helices; D-amino acids are their mirror-image ("mirror life") counterparts, so they form the mirror-image helix.
Why are most molecular helices right-handed?
All life on Earth uses L-amino acids and D-sugars ("homochirality"). For a chain of L-amino acids, the right-handed twist avoids steric clashes between the side chains and the backbone, so it is the lower-energy state and that sense is favored.
Why do β-sheets tend to aggregate?
They have a strong inherent tendency to form intermolecular hydrogen bonds. The edges present unsatisfied hydrogen bond donors and acceptors that easily pair with complementary strands.

Driving force: The hydrogen-bonding groups are polar and bind water. When two sheets pair, these groups dehydrate, and the release of bound water molecules provides an entropic driving force for aggregation.

Structural Analysis

Selected Protein: DMTF1

DMTF1 (Cyclin D-binding Myb-like transcription factor 1) is a transcription factor regulating cell growth and survival. It acts as a tumor suppressor by activating the p53 pathway. Restoring DMTF1 levels can reverse neural stem cell dormancy, potentially "reversing" brain aging.

Length: 760 AA
Most frequent residue: Serine (S)
Homologs: 250 found
Family: DMTF1
Solved: 2011 (NMR)
Class: Homeodomain-like
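The most frequent residue can be checked directly from a sequence. This is a minimal sketch; `toy_sequence` is a short hypothetical string, not the DMTF1 sequence, and the function name is an assumption made for illustration.

```python
from collections import Counter


def most_frequent_residue(sequence):
    """Return the most common one-letter residue and its fraction of the sequence."""
    counts = Counter(sequence.upper())
    residue, n = counts.most_common(1)[0]
    return residue, n / len(sequence)


# Hypothetical toy sequence for illustration only.
toy_sequence = "MSSGSLKSAESS"
residue, fraction = most_frequent_residue(toy_sequence)
print(residue, f"{fraction:.0%}")
```

Running the same function on the full 760-residue sequence from a FASTA record would give the value reported in the summary above.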
[Figures: DMTF1 rendered in surface, cartoon, ribbon, and ball-and-stick views]

Dynamic Overlays

Sheets & Helices

Secondary Structure

Does it have more helices or sheets?

It has more helices than sheets.
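One way to back this visual impression with numbers is to count residues per secondary structure class. The sketch below assumes a DSSP-style per-residue string (H, G, I for helices; E, B for strands and bridges); `toy_ss` is a hypothetical example, not the real DMTF1 assignment.

```python
# DSSP-style one-letter codes (assumed input format).
HELIX_CODES = set("HGI")  # alpha, 3-10, and pi helices
SHEET_CODES = set("EB")   # extended strands and isolated bridges


def helix_sheet_fractions(ss_string):
    """Return (helix fraction, sheet fraction) for a per-residue assignment."""
    n = len(ss_string)
    helix = sum(1 for code in ss_string if code in HELIX_CODES)
    sheet = sum(1 for code in ss_string if code in SHEET_CODES)
    return helix / n, sheet / n


# Hypothetical assignment for illustration only.
toy_ss = "--HHHHHHHH--EEEE--HHHHHH--"
helix_frac, sheet_frac = helix_sheet_fractions(toy_ss)
print(f"helix: {helix_frac:.2f}, sheet: {sheet_frac:.2f}")
```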

Hydrophobic & Hydrophilic

Residue Type

Distribution of hydrophobic vs hydrophilic residues?

The surface is dominated by hydrophilic residues (blue), as expected for a soluble nuclear transcription factor that needs polar and charged side chains to bind DNA.
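The hydrophobic versus hydrophilic split can also be estimated from sequence. This sketch assumes a simple classification in which A, V, I, L, M, F, and W count as hydrophobic; other schemes place residues such as C, Y, G, and P differently. `toy_surface` is a hypothetical stretch of residues.

```python
# Assumed hydrophobic set; classifications differ between hydrophobicity scales.
HYDROPHOBIC = set("AVILMFW")


def hydrophobic_fraction(sequence):
    """Fraction of residues in the sequence that fall in the hydrophobic set."""
    seq = sequence.upper()
    return sum(1 for residue in seq if residue in HYDROPHOBIC) / len(seq)


# Hypothetical surface-exposed residues for illustration only.
toy_surface = "KRDESTNQAL"
print(f"hydrophobic: {hydrophobic_fraction(toy_surface):.0%}")
```

A sequence-only count cannot tell surface from core, so for a real analysis the residue list would need to come from solvent-accessibility calculations on the structure.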

Holes surface

Surface Topology

Does it have any “holes”?

No deep holes. Instead, it contains shallow depressions and elongated surfaces, typical for protein-DNA interactions.

ML-Based Generation

Question // C1.1 Deep Mutational Scans
Heatmap
Explain particular patterns from the Heat Map.
  • Positions 230-385 feature frequent dark blue columns. Almost no mutation is well tolerated here, so the native amino acids at these positions appear to be highly specific.
  • The outer edges are greener, representing the greatest tolerance.
  • Substitutions to Tryptophan (W), Arginine (R), Methionine (M), Histidine (H), Phenylalanine (F), and Cysteine (C) are almost always poorly tolerated.
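The column patterns described above can be summarized as a mean score per position. The sketch below assumes the scan is available as a nested dictionary of position to substitution to score, with lower scores meaning less tolerated; the data structure, function names, and `toy_scores` values are all hypothetical.

```python
def position_tolerance(scores):
    """Mean substitution score per position (lower means less tolerant)."""
    return {pos: sum(subs.values()) / len(subs) for pos, subs in scores.items()}


def least_tolerant(scores, n=1):
    """Return the n positions with the lowest mean substitution score."""
    tolerance = position_tolerance(scores)
    return sorted(tolerance, key=tolerance.get)[:n]


# Hypothetical scores for three positions and three substitutions each.
toy_scores = {
    10: {"A": -0.5, "W": -2.0, "S": -0.1},
    250: {"A": -6.0, "W": -8.0, "S": -5.5},
    700: {"A": 0.1, "W": -1.0, "S": 0.0},
}
print(least_tolerant(toy_scores))
```

Averaging across rows instead of columns would give the per-amino-acid view, which is how a pattern such as "W is rarely tolerated" shows up.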
Question // C1.2 Latent Space Analysis
Latent Space Analysis Plot
Neighborhood approximations:
Proteins close in space often share organism, classification, and function.
DMTF1 Neighbors:
  1. d2g77a2 a.69.2.1 Ypt/Rab-GAP domain
  2. d4c3hd_ g.98.1.1 (D:) RNA polymerase I
  3. d2v8qa1 d.129.6.2 AMPK1
All are involved in cellular regulation.
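Finding neighbors in a latent space amounts to ranking other proteins by distance to the query embedding. This is a minimal sketch assuming Euclidean distance on low-dimensional coordinates; the protein names and coordinates are hypothetical, and the actual tool may use a different metric or projection.

```python
import math


def nearest_neighbors(query, embeddings, k=3):
    """Return the names of the k embeddings closest to the query point."""
    distances = {name: math.dist(query, vec) for name, vec in embeddings.items()}
    return sorted(distances, key=distances.get)[:k]


# Hypothetical 2D coordinates for illustration only.
toy_embeddings = {
    "protein_a": (0.10, 0.20),
    "protein_b": (0.90, 0.80),
    "protein_c": (0.15, 0.25),
    "protein_d": (5.00, 5.00),
}
query = (0.12, 0.22)
print(nearest_neighbors(query, toy_embeddings, k=2))
```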
Question // C2 & C3 ESMFold & ProteinMPNN
ESMFold Alignment
C2. ESMFold Alignment: The predicted structure superimposes fully on the original structure, supporting the local prediction.
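A visual superposition can be complemented with a root-mean-square deviation (RMSD) over matched atoms. The sketch below assumes the two coordinate sets are already superimposed and listed in the same atom order; it does not perform the alignment itself, and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized, already superimposed coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical C-alpha coordinates (in angstroms) for illustration only.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, predicted))
```

Values near zero indicate near-identical backbones; structure viewers typically report this number after running their own alignment step.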
Inverse Fold
C3. ProteinMPNN (Inverse Folding): Only 22% of the residues in the designed sequence match the amino acids at the same positions in the original sequence.
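The percentage of matching residues is a position-by-position sequence identity. This sketch assumes the designed and original sequences have the same length with no gaps, as is the case for fixed-backbone inverse folding; both example sequences are hypothetical.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Hypothetical sequences for illustration only.
native = "MKTAYIAKQR"
designed = "MRSAELAKEK"
print(f"{sequence_identity(native, designed):.0%}")
```

Low identity alongside a similar predicted fold is a common outcome of inverse folding, since many different sequences can be compatible with one backbone.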
Inverse Align
Feeding the Inverse Fold sequence back into ESMFold yields a good predicted structure, though slightly worse than the original unaltered prediction.

Part D. Group Brainstorm on Bacteriophage Engineering

Final Task // Brainstorm Danna Betancourt, Rodrigo Arredondo, Valeria Q. Ortega, Jessica Wu
Group Brainstorm Diagram

As discussed in “Phage Therapy: Past, Present and Future”, phage therapy represents an interesting alternative to antibiotic treatments, especially as recent developments allow researchers to engineer bacteriophages and their proteins. Our final group project for HTGAA Spring 2026 focuses on improving the ability of the bacteriophage MS2 to kill its host bacterium, E. coli, by engineering its lysis protein MS2-L.

As an interdisciplinary team with different levels of experience in biotechnology, we propose increasing the stability of MS2-L. The lysis protein relies on the chaperone DnaJ for proper protein folding, a process E. coli can disrupt. However, it has been previously demonstrated that mutations deleting the N-terminal half of MS2-L remove its dependence on DnaJ while also accelerating bacterial lysis. We believe this direction is promising for discovering variants that remain structurally stable within the host.

Our proposed approach begins with ProteinMPNN to look for alternative amino acid sequences that improve the stability of MS2-L. The sequences can then be evaluated with AlphaFold and AlphaFold-Multimer to verify compatibility with their biological function and their interaction with DnaJ. AlphaFold predicts the structure of each MS2-L variant, while AlphaFold-Multimer is tailored to predict protein-protein interactions like the one between MS2-L and DnaJ.

Lastly, we must identify promising sequences for experimentation. We can do this by comparing variants quantitatively, e.g. using a deep mutational scan to see how well each variant tolerates point mutations. This will narrow our list to the most promising candidates for synthesis and experimental validation, reducing costs and supporting data-informed decisions.
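The shortlisting step could be sketched as a simple ranking. This example assumes each candidate comes with a list of predicted point-mutation scores (higher meaning better tolerated) and uses their mean as a robustness measure; the scoring scheme, function names, and values are hypothetical and not an agreed part of the project.

```python
def robustness(point_mutation_scores):
    """Mean predicted effect of point mutations (higher means more robust)."""
    return sum(point_mutation_scores) / len(point_mutation_scores)


def shortlist(candidates, k=2):
    """Return the k candidate names with the highest mean robustness."""
    ranked = sorted(
        candidates, key=lambda name: robustness(candidates[name]), reverse=True
    )
    return ranked[:k]


# Hypothetical scores for three designed variants.
toy_candidates = {
    "variant_1": [-0.2, -0.1, -0.4],
    "variant_2": [-1.5, -2.0, -0.9],
    "variant_3": [-0.3, -0.6, -0.2],
}
print(shortlist(toy_candidates, k=2))
```

In practice the ranking would likely combine several signals, such as structure prediction confidence and predicted interaction with DnaJ, rather than a single mean.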

The main pitfalls are tied to the reliability of our tools; computational predictions of stability may not fully reflect protein behavior. For example, AlphaFold-Multimer has a systematic bias toward interactions between ordered protein regions, with reduced accuracy for disordered regions and transient interactions like that between a chaperone and its client protein.

We are also limited by a narrow scope. Phage therapy depends on several biological variables beyond a single protein, and there is currently a lack of pharmacokinetic and pharmacodynamic studies on phage therapy. This means that even if we make MS2-L more stable, other factors could still limit the effectiveness of the bacteriophage.