Protein Design part 1 Week 4

Part A. Conceptual Questions

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

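Approximately 6.6 × 10²³ molecules of amino acids. This corresponds to meat being roughly 22% protein by mass: 500 g of meat contains about 110 g of protein, which at ~100 g/mol is ~1.1 mol of amino acids, and 1.1 × 6.022 × 10²³ ≈ 6.6 × 10²³.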
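The estimate can be checked with a few lines of arithmetic. This is a minimal sketch; the protein fraction of 0.22 is an assumption (the question does not give one), and the count scales linearly with whatever fraction is used.

```python
AVOGADRO = 6.022e23          # molecules per mole
AA_MASS_G_PER_MOL = 100.0    # ~100 Da per amino acid, as given in the question


def amino_acid_count(meat_g, protein_fraction):
    """Estimate the number of amino acid molecules in a piece of meat."""
    protein_g = meat_g * protein_fraction
    moles = protein_g / AA_MASS_G_PER_MOL
    return moles * AVOGADRO


# Assumed protein content of ~22% by mass (not stated in the question).
n = amino_acid_count(500, 0.22)
print(f"{n:.2e}")

# Upper bound: treat the whole 500 g as amino acids.
upper = amino_acid_count(500, 1.0)
print(f"{upper:.2e}")
```

Treating the entire 500 g as amino acids gives an upper bound of about 3 × 10²⁴ molecules, so the answer is within an order of magnitude either way.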

Why do humans eat beef but do not become a cow, eat fish but do not become fish?

The digestive system breaks down the components of food (proteins, fats, carbohydrates) into their constituent parts. Proteins are broken down into amino acids, which our body then uses to build its own proteins and cells. We don’t assimilate cow proteins or cells intact, only the nutrients.

Why are there only 20 natural amino acids?

In principle there could be more, but evolution settled on a set of ~20 very early in life’s history, and these became the standard. The chemical diversity of those 20 is enough to support life.

Where did amino acids come from before enzymes that make them, and before life started?

Before life as we know it started on Earth, amino acids could form through primordial geochemical or extraterrestrial reactions. Amino acids and their chemical precursors occur naturally on meteorites, at hydrothermal vents, and at other geochemically active sites. Without living organisms to degrade them, these relatively stable molecules would accumulate from any geochemical route of synthesis.

If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

Left-handed. Biology uses L-amino acids, which form right-handed α-helices; the mirror-image D-amino acids form the mirror-image, left-handed helix.

Can you discover additional helices in proteins?

Yes. Besides the α-helix, proteins contain 3₁₀-helices and rare π-helices, the latter often misidentified as α-bulges.

Why are most molecular helices right-handed?

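For helices built from L-amino acids, the right-handed form is thermodynamically more stable: it avoids the steric clashes between side chains and the backbone that a left-handed helix would create.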

Why do β-sheets tend to aggregate?

Hydrogen bonding occurs at the edges of the sheets, where the outer strands have unpaired backbone N–H and C=O groups, allowing the sheets to attach to each other and stack.

What is the driving force for β-sheet aggregation?

Hydrophobic effects

Why do many amyloid diseases form β-sheets?

β-sheet-rich aggregates are a thermodynamically favorable structure, so misfolded proteins tend to adopt them; when these aggregates accumulate, they can cause disease.

Can you use amyloid β-sheets as materials?

Yes, they can be used for tissue scaffolds, hydrogels, and other biological materials.

Part B: Protein Analysis and Visualization

Human Cu/Zn superoxide dismutase (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals to oxygen and hydrogen peroxide, protecting cells from oxidative stress. I selected it because it is a classic, well‑characterized enzyme that we’ve already worked with in my recombinant CuZn SOD production project (bioactive sunscreen).

UniProt P00441 (SODC_HUMAN) sequence (154 aa): MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

https://www.uniprot.org/uniprotkb/P00441/entry

Human Cu/Zn SOD1 is 154 aa long, and its most frequent residue is glycine (G).
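The length and residue composition can be verified directly from the sequence quoted above. This is a small sketch using only the Python standard library; the sequence is split into blocks of 10 for readability.

```python
from collections import Counter

# SOD1 sequence as quoted above (UniProt P00441), in blocks of 10 residues.
SOD1 = (
    "MATKAVCVLK" "GDGPVQGIIN" "FEQKESNGPV" "KVWGSIKGLT" "EGLHGFHVHE"
    "FGDNTAGCTS" "AGPHFNPLSR" "KHGGPKDEER" "HVGDLGNVTA" "DKDGVADVSI"
    "EDSVISLSGD" "HCIIGRTLVV" "HEKADDLGKG" "GNEESTKTGN" "AGSRLACGVI"
    "GIAQ"
)

# Count each residue type and pick the most frequent one.
counts = Counter(SOD1)
residue, n = counts.most_common(1)[0]

print("length:", len(SOD1))
print("most frequent residue:", residue, "count:", n)
```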

Using UniProt’s BLAST on P00441 returns thousands of homologous SOD1/CuZnSOD sequences from many species (order of 10³–10⁴ hits, depending on database and thresholds).

SOD1 belongs to the Cu-Zn superoxide dismutase family and contains the Sod_Cu domain (Pfam PF00080).

One high‑quality human SOD1 structure is PDB ID 2C9V: https://www.rcsb.org/structure/2C9V

Resolution: 1.07 Å. This atomic-resolution structure was solved in 2005.

Sample visualizations:

Part C. Using ML-Based Protein Design Tools

In this section, we will learn about the capabilities of modern protein AI models and test some of them on your chosen protein.

Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a Colab instance with GPU. Choose your favorite protein from the PDB. We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website.

C1. Protein Language Modeling

Picture Source: Bordin, Nicola et al (2023). Novel machine learning approaches revolutionize protein knowledge. Trends in Biochemical Sciences, Volume 48, Issue 4, 345 - 359

Deep Mutational Scans: Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods. Can you explain any particular pattern? (choose a residue and a mutation that stands out) (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.

Latent Space Analysis: Use the provided sequence dataset to embed proteins in reduced dimensionality. Analyze the different formed neighborhoods: do they approximate similar proteins? Place your protein in the resulting map and explain its position and similarity to its neighbors.

C2. Protein Folding

Picture Source: Lin et al (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model.

Folding a protein: Fold your protein with ESMFold. Do the predicted coordinates match your original structure? Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

C3. Protein Generation

Picture Source: 1. Post from Sergey Ovchinnikov 2. Roney, Ovchinnikov et al (2022). State-of-the-art estimation of protein model accuracy using AlphaFold. Phys. Rev. Lett. 129, 238101

Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one. Input this sequence into ESMFold and compare the predicted structure to your original.
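One simple way to compare a designed sequence with the original is per-position sequence identity (often called sequence recovery). The sketch below assumes the two sequences have the same length, which holds when the design is made on a fixed backbone; the function name and the toy sequences are hypothetical and are not ProteinMPNN output.

```python
def sequence_identity(a, b):
    """Fraction of positions with identical residues in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


# Toy sequences for illustration only (hypothetical).
native = "MATKAVCVLK"
designed = "MSTKAVAVLR"

identity = sequence_identity(native, designed)
print(f"sequence identity: {identity:.2f}")

# List the positions that differ (1-based), in the usual mutation notation.
diffs = [f"{x}{i}{y}" for i, (x, y) in enumerate(zip(native, designed), start=1) if x != y]
print(diffs)
```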

Part D. Group Brainstorm on Bacteriophage Engineering

Goal 1: Stabilize MS2 L so it remains correctly folded and membrane-competent under a wider range of conditions (temperature, expression levels), which should support more robust lysis and higher effective titers.

Goal 2: Increase intrinsic toxicity by promoting oligomerization and pore formation of L in the bacterial envelope.

We’ll focus on increasing stability and increasing toxicity of the MS2 L lysis protein, using a purely computational design-and-screen pipeline before any wet-lab work.

Our first goal is increased stability: we want MS2 L to remain correctly folded and membrane-competent across a broad range of expression and environmental conditions, so that lysis is robust and supports higher effective titers.

Our second goal is increased toxicity: we aim to boost the intrinsic lytic potency of L by promoting its oligomerization and pore formation in the bacterial envelope.

To do this, we will use protein language models (such as ESM or ProtT5) as in silico mutagenesis engines, scoring single and selected double mutants of L, then ranking substitutions predicted to increase stability or fitness.

We will combine this with structure prediction and modeling, using AlphaFold on full-length and truncated L variants to obtain consistent 3D models of the soluble and transmembrane domains, and AlphaFold-Multimer or docking tools to model L oligomers within a membrane-like context.

Membrane-aware stability analysis will help constrain our designs: TMHMM-like predictors will define transmembrane boundaries and disorder regions, while Rosetta- or FoldX-style ΔΔG calculations, accessed through web servers, will refine a shortlist of promising positions around the LS motif and transmembrane helix.

We will also integrate coevolution and functional hotspot information by overlaying published near-saturating mutagenesis data for L, so that residues known to be essential for function are explicitly locked and excluded from substitution.
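The locking step can be sketched as a generator that enumerates every single substitution while skipping protected positions. This is an illustration under stated assumptions: the sequence and the locked positions below are toy values, not the real MS2 L sequence or its published essential residues, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_mutants(seq, locked):
    """Yield (label, mutant_sequence) for every single substitution.

    `locked` holds 1-based positions that must not be mutated.
    """
    for i, wt in enumerate(seq):
        pos = i + 1
        if pos in locked:
            continue  # essential residue: excluded from substitution
        for aa in AMINO_ACIDS:
            if aa != wt:
                yield f"{wt}{pos}{aa}", seq[:i] + aa + seq[i + 1:]


# Toy inputs (hypothetical, for illustration only).
toy_seq = "MKLSAV"
toy_locked = {3, 4}

mutants = list(single_mutants(toy_seq, toy_locked))
print(len(mutants), "candidate single mutants")
print(mutants[0])
```

Each unlocked position contributes 19 substitutions, so the candidate count grows linearly with the number of unlocked positions; each mutant sequence can then be passed to whichever scoring model is used.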

These tools are well suited to our problem because mutational studies show that even conservative changes near the LS motif can abolish function, meaning that we need a guided way to search sequence space that respects both evolutionary constraints and biophysical stability.

Language models provide a rapid, global scan of mutation space, while ΔΔG calculations and structure predictions help us reject variants that appear incompatible with correct helix packing or overall folding, yielding a concise list of testable stabilizing mutants.
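The filter-then-rank logic described here can be sketched as follows. All scores are made-up toy numbers, the `Candidate` class and `shortlist` function are hypothetical, and the sign conventions (higher language-model score is better, negative ΔΔG is stabilizing) are assumptions that would need to be checked against the tools actually used.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str       # mutation label, e.g. "A5V"
    lm_score: float  # language-model score; higher = more favorable (assumed)
    ddg: float       # predicted ΔΔG in kcal/mol; negative = stabilizing (assumed)


def shortlist(candidates, max_ddg=0.0, top_n=3):
    """Reject variants predicted to destabilize, then rank the rest by LM score."""
    kept = [c for c in candidates if c.ddg <= max_ddg]
    kept.sort(key=lambda c: c.lm_score, reverse=True)
    return kept[:top_n]


# Toy data (hypothetical values, for illustration only).
toy = [
    Candidate("A5V", 1.2, -0.8),
    Candidate("K2R", 0.9, 0.5),    # good LM score but predicted destabilizing
    Candidate("V6I", 0.4, -0.3),
    Candidate("M1L", -2.0, -1.1),  # stabilizing but disfavored by the LM
]

labels = [c.label for c in shortlist(toy, top_n=2)]
print(labels)
```

Filtering before ranking keeps the two signals separate, so a variant cannot compensate for a poor ΔΔG with a high language-model score.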

For increased toxicity, MS2 L’s activity depends on efficient membrane insertion and formation of higher-order oligomers by its C-terminal transmembrane segment, so modeling and optimizing transmembrane helix–helix interfaces should directly affect pore stability and lytic potency.

By favoring small or helix-packing residues at predicted contact positions and avoiding unfavorable charges within the bilayer, we can design variants that, in silico, form more stable oligomers without disrupting the LS motif or essential hydrophobic patterning.
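A basic check for the charge constraint is to flag charged residues that fall inside the predicted transmembrane segment. This is a minimal sketch: the sequence and segment boundaries are toy values rather than the real MS2 L annotation, the function name is hypothetical, and treating only D, E, K and R as charged is a simplification (histidine can also carry a charge depending on pH).

```python
CHARGED = set("DEKR")  # simplification: histidine is ignored here


def charged_in_segment(seq, tm_start, tm_end):
    """Return 1-based positions of charged residues within [tm_start, tm_end]."""
    if not (1 <= tm_start <= tm_end <= len(seq)):
        raise ValueError("segment boundaries must lie within the sequence")
    return [pos for pos in range(tm_start, tm_end + 1) if seq[pos - 1] in CHARGED]


# Toy variant and toy transmembrane boundaries (hypothetical).
toy_variant = "MKTLLAVDLLIGR"
flagged = charged_in_segment(toy_variant, 4, 11)
print(flagged)
```

A variant with a non-empty result would be rejected or reviewed by hand before any oligomer modeling.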

In single-gene lysis systems, lysis timing and efficiency emerge from a balance of expression level, folding, and toxicity, so improving stability should also reduce misfolding and escape mutants, enabling higher functional L levels and improved titers under comparable induction conditions.

Our overall objective is to computationally design MS2 L variants that combine improved structural stability in their membrane-inserted forms with increased