Week 4 Review: Protein Design Part I

Foundations of protein chemistry: alphabet, structure, and design

A topic guide on the amino-acid alphabet, secondary structure, chirality, β-sheet aggregation, and the principles of designing structured peptides — written as a stand-alone primer rather than a homework Q&A. This page is Part A of the Week 4 deliverable for HTGAA Spring 2026; Parts B, C, and D (sequence/structure analysis, ML protein design, MS2 L-protein engineering proposal) will be added as those sessions complete.

Course: HTGAA Spring 2026
Lecture (Tues, Feb 24, 2026): Protein Design Part I (Thras Karydis, Jon Kaufman)
Recitation (Wed, Feb 25): Protein folding (Allan Costa)
Author: Fiona (Committed Listener BioPunk)


Why this matters

Proteins do nearly everything in biology: catalysis, structure, transport, signaling, immunity. For most of the field’s history, “designing” a protein meant rational point-mutation campaigns on a known scaffold — slow, limited in scope, and dependent on hard-won structural intuition. The last five years collapsed that timeline. Modern protein language models read the fitness landscape from hundreds of millions of natural sequences; structure predictors fold sequences in seconds; inverse-folding networks rewrite sequences for any backbone. Together they make computational protein engineering tractable for problems that used to require multi-year wet-lab campaigns.

Before we get to those tools (Parts B, C, D), we need the substrate they operate on: the amino-acid alphabet, the geometric constraints on protein structure, and the thermodynamics of folding and aggregation. That’s Part A. The conceptual questions in this section come from Shuguang Zhang’s HTGAA prompt set, but the answers are organized as a connected story rather than an isolated Q&A — so a reader landing on this page cold can follow it as a topic guide.

The thread running through everything: proteins are constrained polymers. Their building blocks are stereochemically restricted (only L-amino acids in nature, only ~20 of them, only one ribosomally-installable scaffold per residue). Those constraints set what proteins can do — which is the same as what synthetic biologists can engineer.


1. The amino-acid alphabet

1.1 The shared scaffold

Every amino acid except proline shares the same backbone scaffold: an α-carbon bonded to an amino group, a carboxyl group, a hydrogen, and a variable side chain (R). Residues are linked by dehydration condensation, and the resulting peptide bond is planar: partial double-bond character across C–N holds the ω torsion near 180° (trans). That leaves the protein backbone two free torsion angles per residue, φ (around N–Cα) and ψ (around Cα–C), and the entire shape of secondary structure follows from the geometry of those allowed angles.

The 20 side chains span the chemistries biology needs: hydrophobic (A, V, L, I, M, F, W), polar uncharged (S, T, N, Q, Y, C), positively charged (K, R, H), negatively charged (D, E), and special-case residues — Gly (achiral, maximum flexibility), Pro (rigid imino acid that disrupts helices), Cys (forms disulfides). Proline is the structural exception: its side chain loops back to the backbone nitrogen forming a pyrrolidine ring, locking φ at roughly −60° ± 15°. That single deviation makes proline a helix-breaker, a Type II’ turn nucleator (when D-Pro is used), and the dominant residue of the polyproline II helix that builds collagen.

All 19 chiral canonical amino acids used in ribosomal protein synthesis are L-stereoisomers; glycine, the twentieth, is achiral. This single geometric constraint is the most consequential one in protein structure: it is why the α-helix in evolved proteins is right-handed, why D-amino-acid peptides invert handedness, and why every helical biological structure ultimately inherits its handedness from its building blocks. We come back to this in Section 2.

1.2 The quantitative scale of protein chemistry

A useful intuition for protein chemistry comes from working out the throughput. If you eat 500 g of meat and treat it as pure protein with average residue mass 100 Da:

500 g ÷ 100 g/mol = 5 mol → 5 × 6.022 × 10²³ ≈ 3 × 10²⁴ amino acid residues.

Two caveats. Real beef is only ~20% protein by mass (~70% water, ~10% fat), so adjusted for actual protein content the number drops to ~6 × 10²³ residues — still close to Avogadro’s number. And these are residues incorporated in protein, not free amino acids; they’re released by digestive proteolysis before they enter the bloodstream. A single 500 g serving carries roughly Avogadro’s-number-worth of amino-acid monomers, which calibrates intuition for the throughput of protein chemistry on the human scale.
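The arithmetic above can be reproduced in a few lines. This is a minimal sketch that reuses the text's rounded 100 Da residue mass and ~20% protein fraction; the function and constant names are illustrative, not from the course material.

```python
AVOGADRO = 6.022e23          # particles per mole
MEAN_RESIDUE_MASS = 100.0    # g/mol, the rounded value used in the text

def residues_in(mass_g: float, protein_fraction: float = 1.0) -> float:
    """Approximate number of amino-acid residues in a food sample."""
    moles = mass_g * protein_fraction / MEAN_RESIDUE_MASS
    return moles * AVOGADRO

pure = residues_in(500)          # treat the whole serving as protein
beef = residues_in(500, 0.20)    # real beef is ~20% protein by mass

print(f"{pure:.2e}")   # 3.01e+24
print(f"{beef:.2e}")   # 6.02e+23
```

Changing `MEAN_RESIDUE_MASS` to a less rounded value shifts both numbers by the same factor, so the order-of-magnitude conclusion does not depend on the rounding.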

1.3 Why species identity is genomic, not dietary

A frequent question — why don’t humans become cows from eating beef? — has a clean answer: species identity is genomic, not dietary.

When you eat beef, digestive proteases such as trypsin hydrolyze cow proteins into free amino acids and short peptides before they cross into the bloodstream. Those amino acids enter a species-agnostic pool — a Phe is a Phe whether it came from a cow, a tuna, or a yeast cell. What happens next is the key: your ribosomes don’t read cow mRNA. They read your mRNA, encoded by your genome, and assemble your proteins from the recovered amino-acid pool. The cow contributed the building blocks; the assembly blueprint is yours.

Two qualifiers. Essential amino acids (Phe, Trp, Lys, Met, Thr, Val, Leu, Ile, and His, the last conditionally essential) are residues humans cannot synthesize in adequate amounts and must obtain from diet, so diet matters for what you can build, not for what you become. And post-translational modifications (glycosylation, hydroxylation, etc.) are installed by genome-encoded enzymes, so cow collagen and your collagen differ at the modification level too, not just at the sequence level. The principle is template-directed assembly: diet supplies monomers, identity comes from the blueprint.

1.4 Why ~20 canonical amino acids

All known life uses the same canonical 20 amino acids in ribosomal protein synthesis, with two genetically-encoded exceptions: selenocysteine (Sec) is encoded by recoded UGA in selenoproteins across all three domains; pyrrolysine (Pyl) is encoded by recoded UAG in some methanogenic archaea. No organism has ever been found using a fundamentally different ribosomal alphabet — strong evidence that 20 + Sec + Pyl is a deep evolutionary attractor, not just an unsampled corner.

Why this particular set? Three mutually-reinforcing constraints filter the space:

  1. Prebiotic availability. Miller–Urey (1953) spark-discharge experiments and analyses of the Murchison meteorite (fell 1969) both yielded ~10 of the canonical 20 (Gly, Ala, Asp, Glu, Val, Leu, Ile, Pro, Ser, Thr) under abiotic conditions. The simpler half of the alphabet was available before life had to encode anything.
  2. Ribosomal compatibility. The peptidyl-transferase center of the ribosome has stereochemical and steric tolerances limiting which side chains it can incorporate accurately. β-amino acids and very bulky residues are translated poorly even today. Whatever life encoded had to fit through this geometric filter.
  3. Biosynthetic cost vs. chemical novelty. Once translation existed, selection favored adding amino acids that gave new chemistry not yet in the set, even at high biosynthetic expense. Trp and Tyr, which carry the indole and phenol aromatic chemistries, are biosynthetically expensive and were almost certainly late additions. Trifonov (2000) reconstructs an inferred temporal order: small-and-cheap first, complex-and-functional later.

A fourth, complementary constraint is codon space: a 4-base, 3-position genetic code yields 64 codons; subtract three stops and you have 61 sense codons distributed across 20 amino acids with redundancy. Many more than ~20 would over-strain the ribosome’s codon-recognition machinery; many fewer would waste coding capacity.
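The codon bookkeeping can be checked by enumerating the code directly. The sketch below uses the standard genetic code (NCBI translation table 1) written as a 64-character string in TCAG order; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
degeneracy = Counter(aa for aa in codon_table.values() if aa != "*")

print(len(codon_table))            # 64 codons in total
print(stop_codons)                 # ['TAA', 'TAG', 'TGA']
print(sum(degeneracy.values()))    # 61 sense codons
print(len(degeneracy))             # 20 amino acids
```

The `degeneracy` counter makes the redundancy explicit: Leu, Ser, and Arg each have six codons, while Met and Trp have one.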

These constraints together explain why these 20, but the relative weighting among them remains debated — Higgs & Pudritz (2009) emphasize prebiotic availability; Wong (2005) emphasizes codon-coevolution; Crick’s “frozen accident” (1968) holds that the specific 20 was partly arbitrary at the moment of code lock-in.

Engineered exceptions are real. The Schultz and Chin labs have built orthogonal aminoacyl-tRNA synthetase / tRNA pairs, and the Church group’s Genomically Recoded Organism (Lajoie 2013) freed a codon for them; together these let engineered organisms incorporate over 200 non-canonical amino acids ribosomally. The ribosome is permissive enough; evolution simply didn’t go there.

So: 20 is not a physical limit. It is the biological-historical attractor produced by prebiotic supply, ribosomal compatibility, and selection-cost trade-offs, layered onto a finite codon space.

1.5 Designing non-canonical amino acids

Engineered ribosomal incorporation has produced over 200 non-canonical amino acids (NCAAs). The field is largely Peter Schultz’s (Scripps) and Jason Chin’s (MRC-LMB); the Church group’s Genomically Recoded Organism (Lajoie 2013) removed all UAG stop codons from E. coli to free that codon for NCAA assignment without translational crosstalk.

A novel amino acid must clear three filters to be installable:

  1. Stereochemical compatibility with the ribosome — must be α-amino (canonical backbone scaffold), with side-chain bulk roughly within the natural envelope. β-amino acids and very large side chains translate poorly.
  2. Orthogonal aminoacyl-tRNA synthetase / tRNA pair — a dedicated aaRS that doesn’t cross-react with native AAs, paired with a tRNA decoding a non-natural codon. Methanocaldococcus jannaschii TyrRS (Schultz) and archaeal pyrrolysyl-tRNA synthetase (Chin) are the workhorses.
  3. A free codon to assign — typically a reassigned stop codon (UAG amber, UGA opal).

Established NCAAs include p-azidophenylalanine (click chemistry), p-iodophenylalanine (X-ray phasing), photo-caged Lys/Tyr/Cys (light-activated control), and selenomethionine (heavy-atom for crystallography).

Two design proposals:

  • Photoswitch-Phe (azoPhe). Phenylalanine analog with the para-H replaced by an azobenzene group. UV light isomerizes the azobenzene from trans (extended) to cis (~3 Å shorter); visible light or thermal relaxation returns it to trans. Use case: optically toggleable local geometry, e.g. caged active sites, photoswitchable PPI inhibitors, kinetic studies of folding intermediates. Real precedent: Beharry & Woolley (2011).
  • Bipyridyl-Ala (BiPyAla). Alanine analog with a 2,2′-bipyridyl side chain replacing the methyl. Creates a bidentate metal-chelation site at a single residue. Drop Cu²⁺ or Fe²⁺ in and you have a redox-active or catalytic center installed on an arbitrary protein scaffold. A first-generation version exists in the Schultz lab; higher-affinity and pH-tuned variants remain open design space.

NCAAs are not the first tool for protein engineering — point mutations to canonical residues are simpler and the modern ML toolkit handles them well. NCAAs become useful when you want to covalently lock a fold (engineered disulfide, photo-crosslink) or install a sensor (FRET pair, EPR spin label) without disrupting the native sequence. Relevant for the L-protein engineering work later in the course.


2. Secondary structure and the chirality of life

Plot every residue’s (φ, ψ) torsions on a 2D map, the Ramachandran plot, and only a small fraction of the space (~10%) is sterically allowed for L-amino acids. Most of the rest is forbidden by atomic clashes between the side chain and the backbone. Three allowed regions dominate: the lower-left (right-handed α-helix), the upper-left (β-sheet), and a small upper-right region (left-handed α-helix) that is populated mainly by glycine. Secondary structure is what you get when consecutive residues all sit inside one of these allowed islands.

This single fact — the Ramachandran plot for L-amino acids — explains both the dominance of right-handed helices and the handedness of mirror-image polypeptides.
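To make the three islands concrete, here is a minimal sketch of a coarse region lookup. The rectangular boundaries are illustrative assumptions chosen to bracket the commonly quoted ideal torsions; real Ramachandran regions are irregular contours, so treat this as a teaching aid rather than a validation tool.

```python
def ramachandran_region(phi: float, psi: float) -> str:
    """Coarse region label for backbone torsions given in degrees.

    Box limits are illustrative assumptions, not a validated contour map.
    """
    if -160 <= phi <= -45 and -70 <= psi <= -15:
        return "right-handed alpha"      # lower-left island
    if -180 <= phi <= -45 and 90 <= psi <= 180:
        return "beta"                    # upper-left island (extended)
    if 45 <= phi <= 90 and 0 <= psi <= 90:
        return "left-handed alpha"       # upper-right island, mostly Gly
    return "other"

print(ramachandran_region(-57, -47))    # right-handed alpha
print(ramachandran_region(-120, 120))   # beta
print(ramachandran_region(57, 47))      # left-handed alpha
```

The three test points are the ideal α-helix torsions, the β-strand values quoted in Section 3, and the sign-flipped helix.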

2.1 Why most molecular helices are right-handed

Life is homochiral, and the homochirality of biological building blocks fixes helical handedness across all macromolecular structures.

For protein α-helices specifically: living organisms use L-amino acids exclusively in ribosomal protein synthesis. In a left-handed α-helix the L-side chain clashes sterically with the backbone; its (φ, ψ) coordinates sit in a region of the L-amino-acid Ramachandran plot that is strongly disfavored for every residue except glycine. Result: essentially every α-helix in natural proteins is right-handed, with rare, short local exceptions, mostly at glycine residues (Gly is achiral and tolerates either handedness).

The same homochirality argument generalizes beyond proteins. DNA uses D-deoxyribose; the canonical B-DNA double helix is right-handed (Z-DNA, a left-handed form, is a rare local conformation in special sequences). RNA uses D-ribose; the A-form helix is right-handed. Collagen is a striking edge case: each polyproline II chain is itself left-handed, but three such chains coil together into a right-handed triple super-helix. The chain-level handedness is still dictated by L-amino acid stereochemistry.

The deeper question — why life is homochiral in the first place — remains unsettled. Candidate explanations include parity violation in weak-force interactions producing tiny enantiomeric excess, asymmetric photolysis by circularly polarized light from neutron stars, and stochastic symmetry-breaking amplified by autocatalytic networks. Whatever the original cause, once L-amino acids and D-sugars were locked in by early biology, all derived helical structures inherited the corresponding handedness.

So molecular helices are right-handed in life because biological building blocks are homochiral, and that homochirality determines which Ramachandran (or analogous geometric) region is accessible.

2.2 D-amino acids and mirror-image proteins

If you build a polypeptide entirely from D-amino acids instead of L, the Ramachandran plot mirrors. The “L-allowed” lower-left region maps to the upper-right; helices wind in the opposite sense — left-handed.

The defining structural features of the α-helix are otherwise preserved: the i → i+4 hydrogen-bond pattern, the 3.6-residue periodicity, and the 5.4 Å pitch all hold. Only the handedness flips. Stephen Kent’s group (U. Chicago) chemically synthesized full-length D-VEGF — the D-amino-acid mirror of vascular endothelial growth factor — and crystallized it in the expected mirror-image structure with left-handed helices. Some natural antimicrobial peptides containing D-residues also exhibit local left-handed helical character.
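The helix parameters quoted above fix the per-residue geometry, and mirroring is just a sign flip on both torsions. A small sketch follows; the (−57°, −47°) pair is the commonly quoted ideal α-helix value rather than a number from this page, and the names are illustrative.

```python
RESIDUES_PER_TURN = 3.6
PITCH_ANGSTROM = 5.4

rise_per_residue = PITCH_ANGSTROM / RESIDUES_PER_TURN   # translation along the axis
twist_per_residue = 360.0 / RESIDUES_PER_TURN           # rotation about the axis
hbond_span = 4 * rise_per_residue                       # axial distance covered by i -> i+4

def mirror_torsions(phi: float, psi: float) -> tuple:
    """Torsions of the mirror-image (all-D) conformation: both signs flip."""
    return (-phi, -psi)

print(f"{rise_per_residue:.1f} Å rise, {twist_per_residue:.0f}° twist per residue")
print(mirror_torsions(-57.0, -47.0))   # (57.0, 47.0)
```

Rise and twist magnitudes are identical for the L- and D-helix; only the sign of the rotation, and therefore the handedness, changes.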

A heterochiral polypeptide (L and D residues mixed irregularly within one chain) typically forms neither helix, because consistent stereochemistry across consecutive residues is required to satisfy the Ramachandran constraints. This is distinct from a racemic mixture, which contains equal amounts of the all-L and all-D chains. For an accessible primer on racemic and mirror-image polypeptides, see Mirror Image Proteins (Mandal & Kent, 2017) and the Wikipedia article on Racemic crystallography.

A design-relevant insight from this: D-amino-acid peptides are proteolysis-resistant because natural proteases evolved to cleave L-peptides. Mirror-image therapeutic peptides are an active research area, particularly for oral or systemic delivery where stability is rate-limiting.


3. β-sheets, aggregation, and amyloid

A β-strand is an extended near-zigzag backbone with (φ, ψ) ≈ (−120°, +120°). Strands H-bond laterally to each other to form a β-sheet — parallel (strands run N→C in the same direction; H-bonds slant) or antiparallel (strands run opposite; H-bonds are roughly perpendicular and shorter, making antiparallel sheets slightly more stable). Side chains alternate above and below the sheet plane. Amphipathic β-strands (alternating hydrophobic-hydrophilic patterning at i, i+2) drive sheet formation in aqueous environments.

Two structural facts about β-sheets matter for everything that follows: edges are unsatisfied and faces can be hydrophobic. Both feed the aggregation thermodynamics in §3.1.

3.1 Why β-sheets aggregate

β-sheets aggregate because they are structurally incomplete on their edges. Each edge strand carries unpaired backbone hydrogen-bond donors (N–H) and acceptors (C=O), so a single β-sheet exposes a row of “lonely” hydrogen-bonding groups along its edge. Docking another strand or sheet edge alongside lets those groups pair up, so the system’s energy drops and the configuration becomes more stable.

The driving force has two main components. The first is enthalpic: satisfying the unpaired hydrogen bonds. Each new backbone H-bond between strands is nominally worth ~3–5 kcal/mol, although the net gain in water is smaller because the exposed groups were previously hydrogen-bonded to solvent. The second is entropic: the hydrophobic effect. β-sheet faces often present hydrophobic side chains at alternating positions; burying these against another sheet’s hydrophobic face releases ordered water molecules from around them, raising solvent entropy. This is the classical hydrophobic effect (Kauzmann 1959, Tanford 1962); counter-intuitively, it is entropy, not enthalpy, that drives most hydrophobic interactions in water.

There is one unfavorable contribution: the protein chain loses conformational entropy when it locks into a rigid stacked structure. This cost is outweighed by the gains above, but it’s why aggregation is concentration- and time-dependent rather than instantaneous.
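The three contributions described in this section can be collected into one free-energy expression. This is a schematic bookkeeping sketch, not a quantitative model; the subscript labels are introduced here for illustration and lump all other enthalpic terms into the hydrogen-bond term.

```latex
\Delta G_{\text{agg}}
  = \underbrace{\Delta H_{\text{HB}}}_{<\,0}
  \;-\; T\,\underbrace{\Delta S_{\text{solvent}}}_{>\,0}
  \;-\; T\,\underbrace{\Delta S_{\text{conf}}}_{<\,0}
```

Aggregation is favorable when $\Delta G_{\text{agg}} < 0$, that is, when $|\Delta H_{\text{HB}}| + T\,\Delta S_{\text{solvent}} > T\,|\Delta S_{\text{conf}}|$: the hydrogen-bond and hydrophobic gains together must exceed the conformational-entropy penalty.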

Together, these forces drive sequences toward the cross-β amyloid fold — stacked β-sheets with strands oriented perpendicular to the fiber axis, side chains interdigitated between sheets, and all backbone hydrogen bonds satisfied along the fiber length. For many proteins, cross-β is the deepest free-energy minimum on the folding landscape, deeper even than the native fold. Once a sequence enters this minimum, escape requires aggressive denaturation — this is why amyloid is essentially irreversible under physiological conditions.

This also explains why amyloid-forming proteins span enormous sequence space: aggregation is driven by backbone properties (always present) plus generic hydrophobic contributions, not specific side-chain chemistry. Any sufficiently long, marginally soluble polypeptide can in principle form cross-β, given time and concentration. Knowles, Vendruscolo & Dobson (2014) is the canonical review.

3.2 Amyloid disease, propagation, and materials

Why amyloid diseases form β-sheets

Disease-associated amyloidogenic proteins reflect the genericness of the cross-β attractor: Aβ in Alzheimer’s, α-synuclein in Parkinson’s, prion protein PrP in CJD/BSE, islet amyloid polypeptide in type 2 diabetes, transthyretin in cardiac amyloidosis, tau in tauopathies, huntingtin in Huntington’s. Different sequences, different organs, same cross-β architecture.

Pathology has two drivers. Toxic oligomeric intermediates — small soluble oligomers on the aggregation pathway, not the mature fibers — are the primary cytotoxic species. They permeabilize membranes, disrupt mitochondria, and trigger inflammation. Tissue mechanical damage comes from accumulated mature fibers, producing the histological plaques.

Seeding and templated nucleation

The progressive nature of amyloid disease is explained by seeding. Aggregation is bottlenecked by formation of the first cross-β nucleus; once seeded, monomers dock onto existing fibril edges and get templated into the same geometry. Aggregation becomes autocatalytic:

  • Prion infectivity (Prusiner 1997 Nobel) — PrPSc converts native PrPC by template-directed refolding. A protein alone is the infectious agent.
  • Cell-to-cell spread of α-synuclein, tau, and Aβ — pre-formed fibrils released from one neuron seed misfolding in the next, explaining anatomical-pathway progression over years.
  • Secondary nucleation (Knowles 2009, Cohen 2013) — fibrils catalyze new nucleation on their surfaces, driving explosive late-stage growth.

Aging accelerates the cycle by weakening protein quality-control machinery (chaperones, proteasome, autophagy), letting rare nucleation events escape.
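The lag-then-burst behavior of seeded and unseeded aggregation can be illustrated with a two-moment kinetic model of the kind used in the secondary-nucleation literature: primary nucleation, elongation, and fibril-surface-catalyzed secondary nucleation, with fragmentation neglected. The sketch below is a plain forward-Euler integration; the rate constants are order-of-magnitude placeholders chosen for illustration, not fitted values for any particular protein.

```python
def simulate_aggregation(m_tot=5e-6, k_plus=3e6, k_n=3e-4, k_2=1e4,
                         n_c=2, n_2=2, P0=0.0, M0=0.0,
                         t_end=10_000.0, dt=1.0):
    """Forward-Euler integration of a nucleation/elongation model.

    P = fibril number concentration, M = fibril mass concentration (molar).
    P0 and M0 are optional pre-formed seeds. All rate constants are
    illustrative placeholders. Returns (time_s, fraction_aggregated) pairs.
    """
    P, M, t = P0, M0, 0.0
    trace = [(t, M / m_tot)]
    while t < t_end:
        m = max(m_tot - M, 0.0)                   # free monomer remaining
        dP = k_n * m**n_c + k_2 * m**n_2 * M      # primary + secondary nucleation
        dM = 2.0 * k_plus * m * P                 # elongation at both fibril ends
        P += dP * dt
        M = min(M + dM * dt, m_tot)               # cannot exceed total protein
        t += dt
        trace.append((t, M / m_tot))
    return trace

def half_time(trace):
    """First time point at which half of the protein is aggregated."""
    for t, frac in trace:
        if frac >= 0.5:
            return t
    return None

unseeded = simulate_aggregation()
seeded = simulate_aggregation(M0=5e-8, P0=5e-11)   # 1% of protein added as seeds
print(half_time(unseeded), half_time(seeded))
```

With these placeholder parameters the unseeded curve shows a long lag followed by rapid growth, and adding 1% pre-formed fibril shortens the half-time, which is the kinetic signature of seeding described above.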

Therapeutic strategies — capping the chain

The same edge-pairing logic that drives aggregation suggests how to stop it. Five general approaches:

  1. N-methylated capper peptides. A short β-strand peptide docks onto the fibril edge but has methylated backbone N–Hs on its outward face — no donors for the next incoming monomer. Chain terminates. Demonstrated for Aβ, IAPP, and tau (Soto, Gazit, Eisenberg groups).
  2. Native-state stabilizers. Keep monomers out of the aggregation pool. Tafamidis (FDA-approved) stabilizes the transthyretin tetramer so it never dissociates into amyloidogenic monomers — the most successful clinical anti-amyloid strategy to date.
  3. Anti-oligomer antibodies target toxic intermediates. Aducanumab (binds aggregated Aβ) and lecanemab (binds soluble protofibrils) are the Alzheimer’s examples; clinical efficacy has been modest and contested, but the mechanistic rationale is sound.
  4. Anti-fibril-surface chaperones block secondary nucleation. The Brichos domain coats Aβ fibril surfaces; endogenous Hsp104 (yeast) and Hsp70/40 (mammals) disaggregate via ATP.
  5. Small-molecule disaggregators like EGCG, curcumin, and resveratrol bind hydrophobic stacking surfaces and remodel amyloid in vitro; clinical translation has been limited by bioavailability.

Amyloid as engineering material

The properties that make amyloid medically dangerous also make it materially valuable: a Young’s modulus of ~10 GPa (silk-comparable), tensile strength rivaling steel by mass, resistance to proteolysis / heat / solvents, and spontaneous self-assembly from solution. Amyloid is an active engineering platform.

Natural functional amyloids show biology already exploits the architecture: curli fibers in E. coli biofilms, Pmel17 organizing melanin in melanosomes, fungal HET-s prion-like signaling, hormone storage in pituitary granules. Engineered applications include drug-delivery scaffolds, biosensors, conductive nanowires (cytochrome-c-amyloid hybrids, MIT Lu lab), and 3D-printable hydrogels. Recombinant silk proteins (e.g., AMSilk, Spiber) use the closely related β-sheet–crystallite architecture of natural silks, which is structurally distinct from canonical cross-β amyloid but exploits the same edge-pairing thermodynamics.

The elegant corollary: the same capping logic that prevents disease in vivo lets materials engineers tune fiber length and end-group chemistry in vitro. Cappers become a design parameter. The disease-relevant stability of cross-β is exactly the property that makes it useful — you just need to control where the chain stops.


4. Designing β-sheet motifs

A “well-ordered” β-sheet motif folds into a single intended sheet without unfolding or recruiting another sheet to aggregate. Four design rules govern the geometry:

  1. Alternating hydrophobic-hydrophilic patterning at i, i+2 along each strand. The hydrophobic face must be buried (against another strand or core) or capped to prevent recruiting another sheet.
  2. Tight turns. β-hairpins (two antiparallel strands) with two-residue loops strongly favor Type I’ or II’ turns, the turn types that match the twist of antiparallel strands with the least backbone strain. Asn-Gly is the most common natural Type I’ turn nucleator; D-Pro-Gly is the strongest synthetic Type II’ turn (Stanger & Gellman 1998).
  3. Edge capping. Place charged or aromatic residues on strand edges to disrupt lateral hydrogen bonding to other sheets — the single most important rule for preventing aggregation.
  4. Length and topology. β-hairpins (~12–16 residues) are the smallest stable motif. Larger antiparallel β-meanders (Greek-key, β-barrel) give thermostability.
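Rule 1 is easy to screen for computationally. The sketch below maps a strand to a hydrophobic/polar (H/P) string and reports how often neighboring positions switch class; a perfectly alternating strand scores 1.0. The residue grouping is an assumption (Tyr is counted as hydrophobic here, as strand-design discussions often do), and the example sequences are made up for illustration.

```python
HYDROPHOBIC = set("AVLIMFWY")   # assumption: Tyr grouped with the hydrophobics

def hp_pattern(seq: str) -> str:
    """Map a one-letter sequence to H (hydrophobic) / P (polar) classes."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in seq.upper())

def alternation_fraction(seq: str) -> float:
    """Fraction of adjacent residue pairs that switch between H and P."""
    pattern = hp_pattern(seq)
    if len(pattern) < 2:
        raise ValueError("need at least two residues")
    switches = sum(a != b for a, b in zip(pattern, pattern[1:]))
    return switches / (len(pattern) - 1)

print(hp_pattern("VKVTVE"), alternation_fraction("VKVTVE"))   # HPHPHP 1.0
print(hp_pattern("VVKKVV"), alternation_fraction("VVKKVV"))   # HHPPHH 0.4
```

A high score only says the strand is amphipathic; whether the hydrophobic face ends up buried or exposed still depends on rules 2 to 4.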

A β-turn is a 4-residue motif that reverses the direction of the polypeptide chain — every β-hairpin is built on one. Turns are classified by the (φ, ψ) torsions of the two central residues (positions i+1 and i+2 of the four-residue stretch).

Type | φ(i+1) | ψ(i+1) | φ(i+2) | ψ(i+2) | Required residues
I | −60° | −30° | −90° | 0° | flexible
II | −60° | +120° | +80° | 0° | Gly at i+2
I’ | +60° | +30° | +90° | 0° | Gly, Asn, or Asp at i+1
II’ | +60° | −120° | −80° | 0° | Gly or D-Pro at i+1

The “prime” types are mirror images of the unprimed: same angles with signs flipped. Antiparallel β-hairpins force the chain to U-turn with specific handedness, so Type I’ and II’ turns are the natural fit; Types I and II produce slightly strained hairpin geometry. Both prime types require a positive φ at i+1, which is strongly disfavored for L-amino acids (the same Ramachandran exclusion that disfavors the left-handed α-helix in §2.1). Glycine (achiral) tolerates positive φ most readily, with Asn and Asp the next most tolerant, which is why natural Type I’ turns are enriched in Gly, Asn, or Asp at i+1, as in the Asn-Gly nucleator. Type II’ turns have Gly or D-Pro at i+1; D-Pro forces a precise positive-φ geometry that even Gly can’t always nail. That’s why the strongest synthetic β-hairpin nucleator is D-Pro-Gly (Stanger & Gellman 1998).
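The turn classification can be expressed as a nearest-ideal lookup on the four central torsions. This is a minimal sketch: the ideal values are the commonly quoted idealized β-turn angles (with ψ(i+2) = 0° for all four types), and the 30° tolerance is an assumed cutoff, not a standard taken from this page.

```python
IDEAL_TURNS = {
    # type: (phi(i+1), psi(i+1), phi(i+2), psi(i+2)) in degrees
    "I":   (-60, -30, -90, 0),
    "II":  (-60, 120, 80, 0),
    "I'":  (60, 30, 90, 0),
    "II'": (60, -120, -80, 0),
}

def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def classify_turn(phi1, psi1, phi2, psi2, tolerance=30.0) -> str:
    """Return the turn type whose worst torsion deviation is smallest."""
    observed = (phi1, psi1, phi2, psi2)
    best_name, best_dev = None, None
    for name, ideal in IDEAL_TURNS.items():
        dev = max(angle_diff(o, i) for o, i in zip(observed, ideal))
        if best_dev is None or dev < best_dev:
            best_name, best_dev = name, dev
    return best_name if best_dev <= tolerance else "unclassified"

print(classify_turn(60, -120, -80, 0))    # II'
print(classify_turn(-65, -25, -95, 5))    # I
```

The mirror relationship is visible in the table itself: negating every angle of Type I gives Type I’, and likewise for II and II’.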

Worked design — a stable β-hairpin

Starting scaffold: the GB1 hairpin (residues 41–56 of Streptococcus protein G B1 domain), GEWTYDDATKTFTVTE. This 16-residue peptide folds reversibly in water with two antiparallel β-strands joined by a Type I’ turn (D-D-A-T motif).

Designed variant: KEWTYDDATKVFTVTE

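  • Replaces the N-terminal Gly with Lys (G41K in GB1 numbering) to give the strand edge a positive charge, suppressing lateral aggregation. The sequence as written also carries a second substitution, Thr→Val at position 11 (T51V), which places a β-branched hydrophobic residue in the second strand.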
  • Retains the GB1 Type I’ turn, a known good nucleator.
  • Strands carry alternating hydrophobic (W, Y, F, V) / polar (T, K, E) residues, with aromatics (W, Y, F) stacking diagonally across the hairpin to provide hydrophobic-core packing and a built-in fluorescence reporter.
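A quick way to audit a designed variant against its parent is to list the substitutions position by position. The sketch below does this for the two sequences given above, using GB1 numbering (first residue = 41); the function name is illustrative.

```python
PARENT = "GEWTYDDATKTFTVTE"   # GB1 hairpin, residues 41-56
DESIGN = "KEWTYDDATKVFTVTE"   # designed variant from the text
FIRST_RESIDUE = 41            # GB1 numbering of the first position

def substitutions(parent: str, design: str, offset: int = 1) -> list:
    """List point substitutions as strings like 'G41K'."""
    if len(parent) != len(design):
        raise ValueError("sequences must have equal length")
    return [
        f"{p}{i}{d}"
        for i, (p, d) in enumerate(zip(parent, design), start=offset)
        if p != d
    ]

print(substitutions(PARENT, DESIGN, FIRST_RESIDUE))   # ['G41K', 'T51V']
```

The output shows that the variant differs from the parent at two positions, the N-terminal Gly→Lys and a Thr→Val change in the second strand.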

Maximum-stability variant (requires chemical synthesis): replace the Type I’ turn with a D-Pro-Gly Type II’ turn, e.g., KEWTY-(D-Pro)-G-AT-KVFTVTE. The D-Pro at i+1 of the turn adopts the positive-φ geometry that is energetically optimal for antiparallel hairpin nucleation. In practice, D-amino-acid incorporation means solid-phase peptide synthesis; the ribosome incorporates D-residues very inefficiently, so the orthogonal aaRS-tRNA toolkit from §1.5 is not a routine route to this variant.

Verification. Round-trip designability check: fold the sequence with ESMFold; align the prediction to the GB1 parent backbone (PDB 1PGB residues 41–56); aim for Cα RMSD < 1.5 Å and mean pLDDT > 80. Then run ProteinMPNN inverse folding on the validated backbone to confirm that the designed sequence lies in the model’s high-likelihood region. Caveat: both ESMFold and AlphaFold are less reliable on short stand-alone peptides than on full domains, so for a 16-residue hairpin the computational round-trip is a sanity check, not a confirmation. Wet-lab validation is what closes the loop: CD spectroscopy (β-sheet minimum near 218 nm), NMR ¹H-¹⁵N HSQC chemical-shift dispersion, and a thermal melt to test for two-state cooperativity.
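The pass/fail part of the round-trip check can be scripted without any structure library. The sketch below assumes the predicted structure is a PDB-format text file with per-residue pLDDT stored in the B-factor column on a 0–100 scale (the usual convention for AlphaFold- and ESMFold-style output, but worth confirming for your tool version), and that the Cα RMSD has already been computed by a separate superposition step. Function names and defaults are illustrative.

```python
def mean_ca_plddt(pdb_text: str) -> float:
    """Mean B-factor over Cα ATOM records (assumed to hold pLDDT, 0-100)."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name = 13-16, B-factor = 61-66 (1-based)
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no Cα ATOM records found")
    return sum(scores) / len(scores)

def passes_round_trip(ca_rmsd: float, mean_plddt: float,
                      rmsd_cutoff: float = 1.5,
                      plddt_cutoff: float = 80.0) -> bool:
    """Apply the thresholds from the text: Cα RMSD < 1.5 Å and mean pLDDT > 80."""
    return ca_rmsd < rmsd_cutoff and mean_plddt > plddt_cutoff
```

Typical use would be `passes_round_trip(rmsd, mean_ca_plddt(open(path).read()))`, where `rmsd` and `path` come from your own alignment step and prediction output. Given the caveat above about short peptides, a pass here is a sanity check rather than evidence of folding.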

For a slightly larger and more thermostable motif, a four-strand antiparallel β-meander based on a Greek-key topology (~40 residues) is a good starting point. The standard test scaffold is ubiquitin’s β-grasp fold (PDB 1UBQ).


References

  1. Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science 181: 223–230. The Nobel-lecture statement of the sequence-determines-structure thesis. Every modern protein-ML model is a computational instantiation of this idea.
  2. Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596: 583–589. The structure-prediction breakthrough. Read for the architecture (Evoformer + structure module) and the pLDDT/PAE confidence outputs you’ll use in Part C.
  3. Lin, Z. et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379: 1123–1130. ESM-2 + ESMFold: how a language model alone (no MSA) can fold proteins. The engine inside the HTGAA Colab notebook for Part C.
  4. Knowles, T. P. J., Vendruscolo, M., Dobson, C. M. (2014). The amyloid state and its association with protein misfolding diseases. Nat Rev Mol Cell Biol 15: 384–396. The canonical review of cross-β as a deep free-energy minimum and its biological consequences.


Parts B, C, D (loading…)

  • Part B — pick a protein, sequence and structural analysis: UniProt BLAST, RCSB structure quality, SCOP classification, PyMOL visualization (cartoon, ribbon, secondary structure / residue type coloring, surface pockets).
  • Part C — modern protein-ML stack applied to the chosen target: ESM-2 unsupervised deep mutational scan; latent-space embedding; ESMFold structure prediction and mutational resilience; ProteinMPNN inverse folding with round-trip designability check.
  • Part D — Bacteriophage MS2 L-protein engineering proposal: stability and auto-folding optimization using PLM-based in silico mutagenesis, AlphaFold-Multimer for L–DnaJ complex validation, and Rosetta/FoldX ddG scoring on top candidates.