Week 4: Protein Design Part I

Part A: Conceptual Questions

Q1. How many molecules of amino acids do you take with a piece of 500 grams of meat?

500 g of meat contains approximately 125 g of protein (about 25% by mass). Taking the average amino acid residue mass as roughly 100 Da (a round-number approximation), one mole of residues weighs about 100 g, and Avogadro’s number (6.022 × 10²³) is the number of molecules in one mole. So 125 g ÷ 100 g/mol = 1.25 mol of amino acid residues, and 1.25 × 6.022 × 10²³ ≈ 7.5 × 10²³ amino acid molecules.
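
The estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 25% protein fraction and the 100 g/mol average residue mass are the rough assumptions used in the answer, and the function name is made up for illustration.

```python
AVOGADRO = 6.022e23  # molecules per mole

def amino_acid_count(meat_g, protein_fraction=0.25, residue_mass_g_per_mol=100.0):
    """Rough number of amino acid residues in a piece of meat.

    protein_fraction and residue_mass_g_per_mol are back-of-envelope
    assumptions, not measured values.
    """
    protein_g = meat_g * protein_fraction          # 500 g -> 125 g protein
    moles = protein_g / residue_mass_g_per_mol     # 125 g -> 1.25 mol
    return moles * AVOGADRO

n = amino_acid_count(500)
print(f"{n:.2e}")  # 7.53e+23
```

Using a more realistic average residue mass of about 110 g/mol would lower the result by roughly 10%, which does not change the order of magnitude.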

Q2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Digestive enzymes (proteases such as pepsin and trypsin) break down the cow’s proteins into individual amino acids and short peptides via hydrolysis. These are absorbed into our bloodstream, and our cells reassemble the amino acids into human proteins following the instructions in our own DNA.

Q3. Why are there only 20 natural amino acids?

The 20 amino acids cover a broad range of chemical properties: positive and negative charges, hydrophobic and hydrophilic side chains, large and small sizes, and rigid and flexible structures. This set is chemically diverse enough to build proteins capable of catalyzing thousands of reactions and forming diverse structures. The triplet codon system (64 codons from 4 bases) has room for around 20 amino acids once stop codons and redundancy are accounted for. Early life settled on this set and evolved from it.

Q5. Where did amino acids come from before enzymes that make them, and before life started?

Amino acids form spontaneously through non-biological chemistry. The Miller-Urey experiment (1953) demonstrated that sparking a mixture of water, methane, ammonia, and hydrogen (simulating early Earth conditions) produced multiple amino acids without any enzymes or cells. Amino acids have also been found in meteorites such as Murchison and in cometary material, and their detection in interstellar molecular clouds has been reported but remains debated. Energy sources like lightning, UV radiation, and hydrothermal vents can drive the formation of amino acids from simple precursors such as hydrogen cyanide and ammonia. Amino acids are not uniquely biological but are basic organic chemistry that life later organized into a precise system.

Q6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

It would form a left-handed α-helix. Natural L-amino acids produce right-handed α-helices because the L-configuration creates specific steric preferences in backbone angles that favor a right-handed spiral. D-amino acids are the mirror of L-amino acids, so preferences flip and the resulting helix spirals in the opposite direction.

Q7. Can you discover additional helices in proteins?

Yes. Beyond the standard α-helix, other helical forms exist. The 3₁₀ helix is tighter, with hydrogen bonds between amino acid i and i+3 (compared to i and i+4 in the α-helix), making it narrower. The π-helix is wider and looser, with i to i+5 hydrogen bonds. Polyproline helices (type I and II) have no internal hydrogen bonds.

Q8. Why are most molecular helices right-handed?

Ribosomal protein synthesis uses L-amino acids almost exclusively, and the L-configuration creates steric preferences that favor right-handed helices. Specifically, in an L-amino acid, the side chain clashes with the backbone carbonyl in a left-handed spiral, which is energetically unfavorable.

Q9. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

β-sheets are extended, flat structures with hydrogen bond donors and acceptors exposed along their edges. These unsatisfied hydrogen bonding sites at the edges can bond to those of other β-sheets. The process is inherently self-propagating: each strand that adds to an edge presents a new exposed edge, which drives further intermolecular aggregation. The flat faces of stacked β-sheets also interact through hydrophobic packing forces.

Q10. Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

Amyloid diseases involve proteins that misfold into β-sheet-rich structures. The β-sheet conformation is a generic, low-energy state that doesn’t require the precise side-chain packing of a native fold. When a protein misfolds and exposes β-sheet edges, it templates neighboring proteins to misfold the same way, creating a self-propagating cascade. Amyloid β-sheets can be used as materials because they have tensile strength comparable to steel, resist heat and proteases, and self-assemble spontaneously. Researchers have used them as scaffolds for tissue engineering, conductive nanowires, adhesive coatings, and hydrogel matrices for cell culture.

Part B: Protein Analysis and Visualization

Protein Selection

I selected human heavy-chain ferritin (HuHF), a member of the ferritin family, the primary iron-storage proteins found in virtually all living organisms. I chose it because its self-assembling architecture makes it a powerful platform for nanotechnology applications including drug delivery, biosensor scaffolding, and templated nanomaterial synthesis. Ferritin is a natural example of a key design principle: simple modular subunits that autonomously organize into functional nanostructures.

Sequence Analysis

Human H-chain Ferritin — Amino Acid Sequence
UniProt ID: P02794 | PDB: 1FHA
MTTASTSQVRQNYHQDSEAAINRQINLELYASYVYLSMSYYFDRDDVALKNFAKYFLHQSHEEREHAEKLMKLQNQRGGRIFLQDIKKPDCDDWESGLNAMECALHLEKNVNQSLLELHKLATDKNDPHLCDFIETHYLNEQVKAIKELGDHVTNLRKMGAPESGLAEYLFDKHTLGDSDNES

The protein is 183 amino acids long. The most frequent amino acid is leucine (L).
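
A length and frequency count like this can be sketched with the standard library. The snippet below is illustrative only: `composition` is a hypothetical helper name, and the demo string is just the first 20 residues of the sequence above, so the full sequence would need to be pasted in to reproduce the reported numbers.

```python
from collections import Counter

def composition(seq):
    """Return (length, rows), where rows are (residue, count, percent),
    sorted from most to least frequent."""
    seq = "".join(seq.split()).upper()   # drop whitespace and line breaks
    counts = Counter(seq)
    total = len(seq)
    rows = [(aa, n, 100.0 * n / total) for aa, n in counts.most_common()]
    return total, rows

# First 20 residues only (truncated fragment for illustration).
fragment = "MTTASTSQVRQNYHQDSEAA"
length, rows = composition(fragment)
print("length:", length)
for aa, n, pct in rows[:3]:
    print(aa, n, f"{pct:.1f}%")
```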

Figure: Amino acid frequency analysis

Homologs: A BLAST search against UniProt returned at least 250 homologs at ≥90% sequence identity, with hits from primates, other mammals, and even bacteria.

Figure: BLAST search results

Protein Family: Ferritin belongs to the ferritin-like superfamily (Pfam: PF00210). This superfamily also includes bacterioferritins and DNA-binding proteins from starved cells (Dps proteins). InterPro classifies it under the ferritin/ribonucleotide reductase-like domain family.

RCSB Structure

Protein Structure on RCSB: https://doi.org/10.2210/pdb1FHA/pdb

When was it solved: The structure was deposited in 1990 and released in 1992, with the primary citation published in 1991 (Lawson et al., Nature 349: 541-544). It was solved by X-ray crystallography at a resolution of 2.40 Å, which is good enough to place side chains reliably. The R-value of 0.205 indicates that the model fits the experimental data well.

Other molecules in the structure: The structure contains two types of non-protein molecules: iron (Fe³⁺) ions located at the ferroxidase center within each four-helix bundle, which is where ferritin catalyzes the oxidation of Fe²⁺ to Fe³⁺ for storage; and calcium (Ca²⁺) ions, likely present from crystallization conditions.

Structure classification: The protein is classified as a METAL BINDING PROTEIN with the enzyme classification EC 1.16.3.1 (ferroxidase). The biological assembly has octahedral symmetry (O) as a homo 24-mer (A24), with each subunit folded as a four-helix bundle.

PyMOL Visualization

Figure: Cartoon visualization of ferritin subunit

Figure: Ribbon visualization of ferritin subunit

Figure: Ball and stick visualization of ferritin subunit

Figure: Secondary structure coloring (red: α-helices, green: loops)

Secondary structure: The protein is almost entirely α-helical (red) with no β-sheets. The only non-helical regions (green) are short loops connecting the four main helices.

Figure: Residue type coloring (orange: hydrophobic, cyan: hydrophilic, blue: positive, red: negative)

Residue type: Hydrophobic residues (orange) and hydrophilic/charged residues (cyan, blue, red) are interspersed along the helices, but the hydrophobic residues face inward toward the bundle core while charged and polar residues face outward toward solvent. This arrangement, with a hydrophobic core and a polar surface, is typical of soluble proteins.
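
The four-color scheme can be expressed as a simple lookup. The grouping below is one common convention and is an assumption on my part, not necessarily the exact selection used for the figure (for example, histidine, tyrosine, glycine, and proline are placed differently in different schemes).

```python
from collections import Counter

# Assumed residue classes (one common convention; schemes differ).
POSITIVE = set("KR")
NEGATIVE = set("DE")
HYDROPHOBIC = set("AVILMFWC")

def residue_color(aa):
    """Map a one-letter residue code to the color used in the figure legend."""
    aa = aa.upper()
    if aa in POSITIVE:
        return "blue"
    if aa in NEGATIVE:
        return "red"
    if aa in HYDROPHOBIC:
        return "orange"
    return "cyan"  # everything else is treated as polar/hydrophilic

def color_counts(seq):
    """Count how many residues of a sequence fall into each color class."""
    return Counter(residue_color(aa) for aa in seq)

print(color_counts("MTTASTSQVRQNYHQDSEAA"))  # first 20 residues of HuHF
```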

Figure: 24-mer assembly with iron (magenta) and calcium (green) ions

Surface and binding pockets: The 24-mer assembly reveals a hollow spherical cage with visible pores, which are the channels through which iron ions enter and exit the interior cavity. Iron atoms (magenta spheres) are visible at the ferroxidase centers within each subunit, and calcium ions (green spheres) sit at the symmetry axes.

Part C: Using ML-Based Protein Design Tools

For Part C, I chose Reflectin A1 from Doryteuthis pealeii (longfin inshore squid), the protein responsible for dynamic structural coloration in cephalopod skin.

Reflectin A1 — Amino Acid Sequence
UniProt ID: D3UA43 | GenBank: ACZ57764.1 | AlphaFold: D3UA43
MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDYYGRMHDHDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRFNNPFGQMWHGRQGHYPGYMSSHSMYGRNMYNPYHSHYASRHFDSPERWMDMSGYQMDMQGRWMDNYGRYVNPFNHHMYGRNMCYPYGNHYYNRHMEHPERYMDMSGYQMDMQGRWMDTHGRHCNPFGQMWHNRHGYYPGHPHGRNMFQPERWMDMSGYQMDMQGRWMDNYGRYVNPFSHNYGRHMNYPGGHYNYHHGRYMNHPERHMDMSSYQMDMHGRWMDNQGRYIDNFDRNYYDYHMY

C1. Protein Language Modeling

Deep Mutational Scan

I used ESM-2 to perform the deep mutational scan of Reflectin A1 (350 aa) in RELATIVE mode, scoring the predicted effect of every possible single amino acid substitution at every position.

Figure: ESM-2 deep mutational scan heatmap of Reflectin A1. X-axis: position in sequence (1–350). Y-axis: amino acid substitutions. Color: model score (yellow = tolerated, dark blue/purple = deleterious).

The heatmap reveals regularly spaced clusters of mutation-intolerant positions approximately every 50–70 residues. This periodicity suggests the model has learned the repeating internal motif structure of the reflectin family. Cysteine (C) and tryptophan (W) substitutions are the most strongly penalized across these positions, likely because introducing disulfide-forming or large aromatic residues at these sites would disrupt the native fold.

Figure: Standout residue at position 268

Standout residue: Position 268 mutated to tryptophan (Q268W) scored −9.70, the most deleterious prediction in the scan. The position is also intolerant to methionine, leucine, lysine, and isoleucine, which suggests that, according to the model, the original residue is structurally or functionally important and is not easily replaced.
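
The scoring logic behind a scan like this can be sketched without the model itself. The example below assumes that "relative" scoring means the log-probability of the mutant residue minus that of the wild-type residue at the same position; that interpretation, the function names, and the toy probabilities are all assumptions for illustration. In a real run, `log_probs` would come from ESM-2's per-position predictions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def relative_scores(wild_type, log_probs):
    """Score every single substitution relative to the wild-type residue.

    log_probs[i][aa] is the model's log-probability of residue aa at
    position i (0-based). Returns {(position_1based, wt, mut): score},
    where score = log p(mut) - log p(wt); more negative = more deleterious.
    """
    scores = {}
    for i, wt in enumerate(wild_type):
        for mut in AMINO_ACIDS:
            if mut != wt:
                scores[(i + 1, wt, mut)] = log_probs[i][mut] - log_probs[i][wt]
    return scores

def most_deleterious(scores):
    """Return ((position, wt, mut), score) for the lowest-scoring substitution."""
    return min(scores.items(), key=lambda item: item[1])

def toy_log_probs(favored, p_favored):
    """Toy distribution: p_favored on one residue, the rest spread evenly."""
    rest = (1.0 - p_favored) / (len(AMINO_ACIDS) - 1)
    return {aa: math.log(p_favored if aa == favored else rest) for aa in AMINO_ACIDS}

# Two-residue toy example: position 1 is strongly constrained, position 2 is not.
wild_type = "QG"
log_probs = [toy_log_probs("Q", 0.99), toy_log_probs("G", 0.5)]
scores = relative_scores(wild_type, log_probs)
(pos, wt, mut), score = most_deleterious(scores)
print(f"{wt}{pos}{mut}: {score:.2f}")
```

Positions where the model is confident about the wild-type residue give large negative scores for every substitution, which is what shows up as dark columns in the heatmap.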

Latent Space Analysis

I embedded 15,177 proteins from the ASTRAL SCOP structural classification database using ESM-2 mean token embeddings (320 dimensions), then reduced them to 3D with t-SNE for visualization. The resulting plot shows clear clustering: proteins with similar sequences and structures group into distinct neighborhoods, which suggests that ESM-2’s learned representations capture meaningful structural relationships.
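
"Mean token embedding" means averaging the per-residue vectors into a single fixed-length vector per protein, so that proteins of different lengths can be compared. A minimal pure-Python sketch of that pooling step follows; `mean_pool` is an illustrative name, and real embeddings would be 320-dimensional arrays from the model rather than short lists.

```python
def mean_pool(token_embeddings):
    """Average a list of per-token vectors into one vector of the same width."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

# Toy protein with 3 residues and 2-dimensional token embeddings.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))  # [3.0, 4.0]
```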

Figure: 3D t-SNE visualization of 15,177 ASTRAL SCOP protein embeddings, colored by the TSNE3 component. Distinct clusters correspond to different protein fold families.

I then embedded Reflectin A1 into the same space and projected it alongside the ASTRAL dataset.

Figure: Reflectin A1 (red dot) plotted in the ESM-2 latent space alongside the ASTRAL dataset (blue). It sits near the periphery of the main cluster.

Figure: Zoomed view showing Reflectin A1’s position relative to nearby ASTRAL proteins.

Figure: Top 5 nearest neighbors

Using Euclidean distance in the original 320-dimensional embedding space, the 5 nearest neighbors to Reflectin A1 are:

  1. Tachylectin-2, Japanese horseshoe crab (dist: 3.81)
  2. Rabbit protein, automated match (dist: 3.87)
  3. Hemolytic lectin CEL-III, sea cucumber (dist: 4.09)
  4. Methionine-rich 2S albumin, sunflower (dist: 4.25)
  5. Clostridium thermocellum protein, automated match (dist: 4.27)
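
A ranking like the one above can be sketched as a brute-force Euclidean search. The vectors and names below are toy placeholders, not the actual ASTRAL embeddings, and `nearest_neighbors` is a hypothetical helper name.

```python
import math

def nearest_neighbors(query, database, k=5):
    """Return the k closest entries as (name, distance), closest first.

    database maps a protein name to its embedding vector; distances are
    Euclidean, computed in the original embedding space (not after t-SNE).
    """
    distances = [(name, math.dist(query, vec)) for name, vec in database.items()]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]

# Toy 2-D example.
database = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [1.0, 0.0]}
for name, dist in nearest_neighbors([0.0, 0.0], database, k=2):
    print(name, f"{dist:.2f}")
```

Searching in the original 320-dimensional space rather than in the t-SNE projection matters because t-SNE does not preserve distances between far-apart points.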

Two of the five closest neighbors are lectins with repetitive structural motifs used for carbohydrate binding. Reflectin also contains repetitive internal motifs, so ESM-2 may be picking up on this shared repetitive sequence character despite the proteins being functionally unrelated. The methionine-rich neighbor is also notable since reflectin is unusually rich in methionine. The relatively large distances across all five neighbors (3.8–4.3) suggest that reflectin has no close counterpart in the database, consistent with it being a structurally novel protein unique to cephalopods.

C2. Protein Folding

Folding with ESMFold

I folded Reflectin A1 using ESMFold and compared the result to the AlphaFold prediction.

Sequence    pTM      pLDDT
Original    0.157    39.15

ESMFold produced very low confidence scores: a pLDDT of 39.15 (out of 100) and a pTM of 0.157 (out of 1). pLDDT values below 50 are generally treated as very low confidence and often coincide with intrinsically disordered regions. The structure is almost entirely red when colored by confidence, meaning ESMFold cannot confidently predict any region of the fold.

Figure: ESMFold prediction of Reflectin A1 colored by pLDDT confidence. Red = very low confidence, yellow = slightly higher confidence. The protein is almost entirely red.

Figure: AlphaFold prediction of Reflectin A1, for comparison. Both models show the protein is largely disordered, but AlphaFold identifies a few small β-sheet domains (blue) within the repetitive regions that ESMFold misses entirely.

Mutation Resilience

I tested whether the predicted structure is sensitive to mutations, using both single substitutions and large segment replacements:

Variant        Mutation                            pTM      pLDDT
Original       none                                0.157    39.15
Q268W          Single residue (position 268)       0.157    39.59
W81A           Single residue (position 81)        0.157    39.52
Ala segment    Positions 100–130 → all alanine     0.152    38.60
Gly segment    Positions 200–230 → all glycine     0.153    36.48

Figure: Q268W single mutation: virtually identical to original.

Figure: W81A single mutation: virtually identical to original.

Figure: Alanine segment mutation. Positions 100–130 replaced with all alanines. A helix appears in the alanine region, but confidence remains low.

Figure: Glycine segment mutation. Positions 200–230 replaced with all glycines. A large disordered loop appears where the glycines were inserted.

Single mutations had essentially no effect on the predicted structure: pTM remained identical at 0.157 and pLDDT changed by less than 0.5 points. Even replacing 31 consecutive residues with alanines or glycines only reduced pLDDT by about 0.5–2.7 points. The glycine segment had the largest effect (pLDDT dropped from 39.15 to 36.48), likely because glycine is the most flexible residue and further destabilized an already disordered region. The structure appears highly resilient to mutation, but only because there is so little predicted structure to begin with. ESMFold already has very low confidence in the original sequence, so mutations cannot change the prediction much.
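
The variants in the table can be built with two small string helpers. This is a sketch with hypothetical function names, demonstrated on a toy sequence; positions are 1-based and inclusive, matching the notation in the table. The pLDDT values at the end are the ones reported above.

```python
import re

def point_mutant(seq, mutation):
    """Apply a substitution written like 'Q268W' (wild type, 1-based position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wt, pos, mut = match.group(1), int(match.group(2)), match.group(3)
    if seq[pos - 1] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + mut + seq[pos:]

def segment_mutant(seq, start, end, residue):
    """Replace 1-based inclusive positions start..end with a single residue type."""
    return seq[:start - 1] + residue * (end - start + 1) + seq[end:]

toy = "MKQWLA"
print(point_mutant(toy, "Q3W"))         # MKWWLA
print(segment_mutant(toy, 2, 4, "A"))   # MAAALA

# Change in pLDDT relative to the original, using the values from the table.
plddt = {"Original": 39.15, "Q268W": 39.59, "W81A": 39.52,
         "Ala segment": 38.60, "Gly segment": 36.48}
for name, value in plddt.items():
    print(name, f"{value - plddt['Original']:+.2f}")
```

Checking the wild-type letter before substituting guards against off-by-one errors when moving between 0-based string indexing and 1-based residue numbering.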

C3. Protein Generation (Inverse Folding)

I used ProteinMPNN (v_48_020, sampling temperature 0.1) to perform inverse folding on the AlphaFold-predicted backbone of Reflectin A1.

Figure: Amino acid probability heatmap from ProteinMPNN, showing amino acid probabilities at each position. Bright spots indicate positions where the model is highly confident about which residue should occupy that position.

The designed sequence had a sequence recovery of only 16.9%, meaning that only about one in six positions matches the native residue: ProteinMPNN’s preferred sequence for this backbone is very different from the actual reflectin sequence. One likely reason is that the native reflectin sequence is optimized for self-assembly and dynamic nanostructure formation, that is, for sticking to other reflectin molecules. Many reflectin proteins come together and stack into organized layers, and it is those layers that create the structural color (iridescence) in squid skin. The spacing between layers determines which wavelength of light is reflected. ProteinMPNN, by contrast, designs for the stability of the given backbone rather than for these intermolecular interactions.
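
Sequence recovery is simply the fraction of positions where the designed residue matches the native one. A minimal sketch, assuming the two sequences have equal length and are already aligned position by position (which holds for fixed-backbone inverse folding); the function name and toy sequences are illustrative.

```python
def sequence_recovery(native, designed):
    """Fraction of positions at which the designed residue equals the native residue."""
    if len(native) != len(designed) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

# Toy example: 3 of 6 positions match.
print(f"{100 * sequence_recovery('ACDEFG', 'ACDKLM'):.1f}%")  # 50.0%
```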

Folding the Designed Sequence with ESMFold

I folded the ProteinMPNN-designed sequence with ESMFold to compare it to the original:

Sequence                         pTM      pLDDT
Original Reflectin A1            0.157    39.15
ProteinMPNN-designed sequence    0.173    72.86

Figure: ESMFold prediction of the ProteinMPNN-designed sequence. The structure shows substantially more secondary structure elements (helices and sheets) compared to the original.

The ProteinMPNN-designed sequence folds with much higher per-residue confidence (pLDDT 72.86 vs 39.15), although pTM barely changes (0.173 vs 0.157), so the predicted global topology remains uncertain. The designed sequence contains many more proline, glycine, and lysine residues and far fewer of reflectin’s characteristic methionine-rich and aromatic motifs. This highlights a fundamental tension in protein engineering: a sequence that folds well as a monomer is not necessarily a sequence that functions correctly.