Week 4 — Protein Design Part I

Part A. Conceptual Questions Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

Let’s break this down step-by-step.

Understanding a Dalton: A Dalton (Da) is another name for the atomic mass unit. It’s the approximate mass of a single proton or neutron. So, an amino acid of ~100 Da means one molecule has a mass of about 100 atomic mass units.

Avogadro’s Number and Moles: Chemistry connects the microscopic world (molecules) to the macroscopic world (grams) using the mole. One mole of a substance contains Avogadro’s number of molecules (6.022×10^23 molecules) and has a mass in grams equal to its molecular weight in Daltons.

This means 1 mole of amino acids (with an average weight of 100 Da) would weigh 100 grams.

Making the Calculation:

Find the number of moles: If 100 grams is 1 mole, then 500 grams is 5 moles of amino acids => 500 g÷100 g/mol=5 moles Find the number of molecules: Multiply the number of moles by Avogadro’s number. 5×6.022×10^23=3.01×1024 Answer: You would ingest approximately 3.01×10^24 molecules of amino acids from a 500g piece of meat. (That’s 3,000,000,000,000,000,000,000,000 molecules!).

Why do humans eat beef but do not become a cow, eat fish but do not become fish?

This is a great philosophical and biological question. The simple answer is that you are what you digest, not what you eat.

The complex answer lies in the process of digestion:

Breaking Down: The beef or fish you eat contains cow-specific or fish-specific proteins. Your digestive system (stomach acid and enzymes like pepsin and trypsin) breaks these large, complex proteins down into their individual amino acid building blocks. The species-specific information is destroyed.

Absorption: These individual amino acids are absorbed into your bloodstream through your small intestine.

Rebuilding: Your cells, following the instructions in your human DNA, take those generic amino acids and assemble them into human proteins (human muscle, human enzymes, human hair, etc.).

When we eat beef:

Proteins are denatured in the stomach.

Proteases hydrolyze peptide bonds.

Proteins are reduced to amino acids and small peptides.

These amino acids enter our bloodstream.

At that point, they are no longer “cow proteins” — they are simply amino acids.

Cells then synthesize proteins using:

DNA → mRNA → Ribosome → Protein

The blueprint for protein synthesis is encoded in our genome. The amino acids are universal building blocks. Identity is not determined by molecular components, but by genetic information and regulatory networks.

We recycle matter, but we do not transfer biological identity.

So, you are not assembling cow proteins; you are using the raw materials (amino acids) from the cow to build human proteins according to the human blueprint. The same goes for the cow, which built its own proteins from the grass it ate.

Why are there only 20 natural amino acids?

This is one of the most fundamental questions in biochemistry. The “standard” 20 are often called the “canonical” amino acids. There isn’t one single, simple reason, but rather a combination of evolutionary history and chemical practicality:

Chemical Diversity: The 20 amino acids provide a remarkable range of chemical functionality needed for life: hydrophobic (water-fearing) ones for folding, charged ones for interactions and catalysis, polar ones for solubility, and special ones like glycine (flexible) and proline (rigid).

Fidelity in Translation: The genetic code is built on triplets of DNA/RNA bases (codons). A triplet code can encode a maximum of 64 different amino acids (4^3). Using 20 allows for redundancy (multiple codons for the same amino acid), which minimizes the damaging effect of mutations. Adding more amino acids would require a more complex and error-prone decoding system.

Historical “Frozen Accident”: Nobel laureate Francis Crick proposed that the genetic code might be a “frozen accident.” Once the system for translating 20 amino acids was established in the last universal common ancestor (LUCA), any mutation that tried to introduce a new amino acid would likely be disastrous, as it would alter the sequence of every single protein in the cell. The system became fixed.

Amino Acid Availability: It’s thought that many of these 20 amino acids were readily formed under prebiotic Earth conditions (see question 5), making them available for the first life forms to use.

Can you make other non-natural amino acids? Design some new amino acids.

Absolutely! This is a huge field called synthetic biology. Chemists can synthesize thousands of “non-canonical amino acids” (ncAAs). The trick is getting them into proteins, which requires engineering the cell’s machinery.

Here are a few designs for new amino acids with potentially useful properties:

Design 1: The “Glow-in-the-Dark” Amino Acid. Attach a small, highly fluorescent organic molecule (like a dansyl group or a BODIPY dye) to the side chain of an existing amino acid like lysine. This would allow scientists to track the protein’s location and movements in a living cell without needing to attach a separate, bulky fluorescent tag later.

Design 2: The “Photo-Crosslinker” Amino Acid. Incorporate a side chain with a diazirine or benzophenone group. When you shine UV light on the cell, this group becomes highly reactive and forms a permanent chemical bond with whatever protein or molecule is next to it. This is like taking a “molecular snapshot” of protein interactions.

Design 3: The “Infrared (IR) Probe” Amino Acid. Modify the amino acid to contain an unusual chemical bond, like a carbon-deuterium bond or an azido group (-N₃). These bonds vibrate at frequencies in the IR spectrum that are distinct from the natural bonds in proteins. This allows researchers to use IR spectroscopy to watch very specific local movements in a protein as it functions.

Where did amino acids come from before enzymes that make them, and before life started?

This is the question of abiogenesis (life from non-life). The leading theory is that they formed through prebiotic or abiotic synthesis.

The classic experiment is the Miller-Urey experiment (1953). They simulated early Earth conditions in a flask:

An “atmosphere” of methane, ammonia, hydrogen, and water vapor.

Electrical sparks to simulate lightning.

A condenser to cool the atmosphere and create rain.

After running the experiment for a week, they found that simple organic molecules, including several amino acids (like glycine and alanine), had formed spontaneously from these inorganic ingredients.

Since then, other pathways have been discovered, showing that amino acids can form:

Near deep-sea hydrothermal vents.

From the delivery of organic compounds by comets and meteorites (analysis of the Murchison meteorite found over 80 different amino acids, some of which are not used by life on Earth).

So, the building blocks for life were likely “cooking” naturally on the early Earth or delivered from space, long before the first cells or enzymes existed.

If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

You would expect a left-handed α-helix.

A standard α-helix found in nature, made from L-amino acids, is right-handed. This is because the chirality (handedness) of the amino acid dictates the twist of the helix it can most stably form. If you build the mirror image of a protein—using D-amino acids—you will get the mirror image of its structure. So, a helix made of D-amino acids will be the mirror image of a natural helix: a left-handed helix.

Can you discover additional helices in proteins?

Yes, and in fact, they have been discovered! While the α-helix is the most common, there are others. They are classified based on their hydrogen bonding pattern, which is described by the “n” number in a 3n-helix notation.

3₁₀-Helix: Tighter, more slender helix. It’s often found at the ends of α-helices as a sort of “capping” motif.

α-Helix: The classic one we’ve discussed ($3.6_{13}$-helix in precise notation).

π-Helix: A wider, looser helix. It’s very rare in natural proteins because it creates an unstable “hole” down the center, but it can be important in the function of some enzymes.

These were discovered by analyzing the atomic-resolution structures of proteins using techniques like X-ray crystallography.

But theoretical backbone conformations suggest other stable geometries may exist.

Through:

Computational protein design

Non-natural amino acids

Backbone modifications

We may access alternative helical parameters (pitch, hydrogen bonding pattern, radius).

Nature selected the most stable and efficient helices, but chemistry allows more possibilities.

Why are most molecular helices right-handed?

This is a profound question that ties into the origins of life’s homochirality (the fact that life uses almost exclusively L-amino acids and D-sugars). There’s no single, universally accepted answer, but here are the leading ideas:

The “Packing” Argument: In an α-helix made of L-amino acids, the side chains (the R-groups) pack more comfortably and with less steric hindrance when the helix twists to the right. A left-handed helix with L-amino acids would cause many side chains to clash with the protein backbone.

Fundamental Physics (Weak Nuclear Force): A tiny, almost immeasurable energy difference exists between L and D forms of amino acids due to the weak nuclear force (the force responsible for radioactive decay). This force is inherently chiral. Some theories propose that this minute difference, amplified over millions of years of evolution, could have biased life towards one handedness. This is still highly speculative.

Chance and Contingency: It could simply be a historical accident. The first self-replicating system happened to use L-amino acids and right-handed helices, and all life descended from it. Once this bias was established, it was locked in because switching handedness would require rebuilding all of biochemistry.

Why do β-sheets tend to aggregate?

β-sheets aggregate because their structure is perfectly set up for intermolecular hydrogen bonding.

A single β-strand is an extended peptide chain. If it’s all by itself, the amino acids in that strand would “prefer” to form hydrogen bonds, but there’s no partner. Therefore, these exposed backbone amides (N-H) and carbonyls (C=O) are like sticky patches looking for a partner. They can find that partner by interacting with another β-strand. This forms a stable, sheet-like structure. If this happens between different protein molecules, they aggregate. β-sheets expose backbone hydrogen bonding sites along extended strands.

When multiple strands align:

Hydrogen bonds form between chains.

Flat surfaces allow tight packing.

Hydrophobic residues cluster.

Unlike α-helices, β-strands can extend indefinitely.

Aggregation becomes energetically favorable because:

More hydrogen bonds form.

Solvent-exposed hydrophobic area decreases.

What is the driving force for β-sheet aggregation?

The primary driving force is the burial of hydrophobic surface area.

While the hydrogen bonds provide specificity and stability, the main reason aggregation happens spontaneously is the hydrophobic effect. In an aqueous (watery) environment, the hydrophobic side chains of the amino acids in the β-strands want to get away from the water and cluster together. By coming together and forming a sheet, the hydrophobic regions on one side of the sheet can pack against the hydrophobic regions of another sheet or another part of the protein, effectively hiding them from water. The hydrogen bonds then lock this arrangement in place.

Thermodynamically:

ΔG = ΔH − TΔS

Favorable factors:

• Strong backbone hydrogen bonding (enthalpic gain) • Hydrophobic collapse (entropy gain from water release) • Van der Waals stacking • Cooperative intermolecular stabilization

The release of structured water around hydrophobic residues significantly contributes to entropy gain.

Thus, aggregation lowers free energy.

Why do many amyloid diseases form β-sheets?

Amyloid diseases (like Alzheimer’s, Parkinson’s, and Huntington’s) are characterized by proteins misfolding and aggregating into long, unbranched fibrils. The core structure of these fibrils is a highly ordered stack of β-sheets, often called cross-β structure.

Misfolded proteins:

Expose hydrophobic segments.

Lose native folding constraints.

Rearrange into β-sheet–rich fibrils.

β-sheet fibrils form a “cross-β” structure:

Strands perpendicular to fiber axis

Hydrogen bonds parallel to fiber axis

This architecture is:

Extremely stable

Resistant to proteolysis

Self-propagating

The stability that makes β-sheets useful structurally also makes them dangerous when misregulated.

The reason is that many proteins, under stress or due to a mutation, can partially unfold. This exposes short stretches of their sequence that are particularly “sticky” and prone to forming β-strands. These sticky segments can then interact with the same sticky segment on another protein molecule. Once two or three come together, they form a “nucleus” that acts as a template, rapidly recruiting more of the misfolded protein and forcing it into the same pathogenic β-sheet-rich structure. This structure is incredibly stable, like a stiff piece of plastic, and is resistant to the cell’s normal machinery for breaking down proteins.

Can you use amyloid β-sheets as materials?

Yes! This is a cutting-edge area of nanobiotechnology. Scientists are taking the incredible stability and self-assembling properties of amyloid fibrils and harnessing them for good. The protein isn’t the disease-causing one, but short designed peptides that form the same structure.

Potential applications include:

Biosensors: Functionalize the fibrils with molecules that change color or fluoresce in the presence of a specific target (like a pathogen or toxin).

Nanowires: Coat the long, stable amyloid fibrils with metals to create incredibly thin conductive wires for use in nanoelectronics.

Hydrogels: Amyloid fibrils can form mesh-like networks that hold large amounts of water. These can be used as scaffolds for tissue engineering (helping cells grow into a specific shape) or for drug delivery.

Extremely Stable Materials: The fibrils themselves are stronger than steel on a weight-to-weight basis and can be used to create new types of lightweight, strong materials.

Design a β-sheet motif that forms a well-ordered structure.

A classic and well-ordered β-sheet motif is the β-hairpin. This is the smallest possible antiparallel β-sheet. Here’s a design:

The Goal: A short peptide that folds back on itself to form two β-strands connected by a tight turn.

Design Elements:

Strand 1: A sequence of alternating hydrophobic and hydrophilic amino acids to promote sheet formation and solubility. For example: Valine (Val) - Lysine (Lys) - Valine (Val) - Aspartic Acid (Asp) . This gives a pattern: hydrophobic (Val), hydrophilic (+) (Lys), hydrophobic (Val), hydrophilic (-) (Asp). The alternating pattern is key for an antiparallel sheet, allowing side chains to stack neatly.

The Turn: This is the most critical part. It needs to be short and have a high propensity to form a tight bend. A classic choice is the Asn-Gly (Asparagine-Glycine) turn.

Asn (N): Its side chain can form a hydrogen bond that stabilizes the turn.

Gly (G): It has no side chain, providing the ultimate flexibility needed for the polypeptide chain to reverse direction sharply.

Strand 2: This strand must be the reverse-complement of Strand 1 to form perfect hydrogen bonds in the antiparallel sheet.

The sequence of Strand 1 (N-terminus to C-terminus) is: Val-Lys-Val-Asp.

For an antiparallel sheet, Strand 2 will run in the opposite direction (C-terminus to N-terminus). To pair perfectly, its sequence (written N-terminus to C-terminus) should be the complement of Strand 1 read backwards. So, we take the reverse of Strand 1: Asp-Val-Lys-Val. But now, to get the correct side chain pairing, we need to swap the positions of the residues. A simpler design principle is to make Strand 2 the mirror of Strand 1. A well-tested example for such a motif is to make Strand 2: Thr (T) - Val (V) - Lys (K) - Val (V) . This will allow for good side chain packing and inter-strand hydrogen bonding.

The Final Designed Peptide Sequence (N-terminus to C-terminus): Val-Lys-Val-Asp – Asn-Gly – Thr-Val-Lys-Val

When you synthesize this peptide in water, it should spontaneously fold into a stable, well-ordered β-hairpin structure. The two strands will align antiparallel, forming hydrogen bonds between their backbones, with the Asn-Gly loop capping one end.

Part B: Protein Analysis and Visualization In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:
Briefly describe the protein you selected and why you selected it.
Identify the amino acid sequence of your protein.
How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
Does your protein belong to any protein family?
Identify the structure page of your protein in RCSB
When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)
Are there any other molecules in the solved structure apart from protein?
Does your protein belong to any structure classification family?
Open the structure of your protein in any 3D molecule visualization software:
PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
Color the protein by secondary structure. Does it have more helices or sheets?
Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Briefly describe the protein you selected and why you selected it. I selected the Damage Suppressor (Dsup) protein from the tardigrade (water bear) Ramazzottius varieornatus. This protein is unique to tardigrades and is a major reason why these animals are among the most resilient life forms on Earth.

Why Selected: I selected it because of its extraordinary function. When this protein is expressed in human cells, it has been shown to suppress X-ray induced DNA damage by up to 40%. It’s a perfect example of how studying extremophiles can lead to discoveries with potential applications in protecting human cells during radiation therapy or even for space travel. It’s also a very unusual protein with little sequence similarity to others, making its structure and mechanism a fascinating mystery.

Identify the amino acid sequence of your protein. Method: The GenBank record (LC050827.1) contains the mRNA sequence. To obtain the amino acid sequence, it needs to be translated. I easily did this with online tools, in this case ExPASy, resulting in:

DNA_sequence atggcatccacacaccaatcatccacagaaccctcttccacaggtaaatctgaggaaacgaagaaagatgcttcgcaagggagcgggcaagactccaagaacgtaaccgttaccaaaggtaccggttcctccgccacctcagctgccattgtcaagacaggaggatcccaaggcaaagattcctctactacagcgggctcttctagtactcagggacagaagttcagtactacacctaccgacccgaaaactttcagctctgaccaaaaggagaaatccaaaagcccagccaaagaagtcccgtctggtggcgatagtaagtcccaaggtgacaccaagtctcaaagcgacgccaaatcttctggacaaagtcagggccagtctaaagacagcggcaaatcatcttccgacagtagcaagagtcactctgtcatcggagctgtcaaagacgtcgttgcaggcgccaaagatgtcgcaggaaaagccgtcgaggatgctcctagcatcatgcatactgcagtcgatgctgtgaagaacgcagccacgactgtgaaggatgtggcatcgtcggctgcatcgactgtggcggagaaggtagtcgatgcttaccacagtgtggtgggagacaagacggacgacaagaaagagggcgagcacagcggcgacaagaaggacgactccaaagctggaagtggctctggacaaggtggtgacaacaagaagtctgaaggagagacttctggccaagcagaatccagctctggcaacgaaggagctgctccagccaaaggccgtggtcgtggacggcctccagcagctgctaaaggagttgctaagggtgctgcaaagggcgctgccgcctccaaaggagccaagagcggtgctgaatcctccaagggaggagaacagtcgtcaggagatatcgagatggcagatgcttcctccaagggaggctcggaccagagggattccgcggcgaccgttggcgaaggtggtgcatcaggcagtgagggtggagctaagaaaggcagagggcggggcgctggtaagaaagcggatgcgggtgatacgtccgctgagccgcctcggcggtcgtcccgcctgacgtcttcaggtacaggggcgggttccgctccagctgcagcgaaaggcggagcgaagcgtgctgcttcttcctccagtacaccttccaacgctaagaagcaagcgactggaggtgctggcaaagctgctgccaccaaagcaactgctgccaaatcggcagcctctaaagctccccagaatggcgcaggtgccaagaagaagggaggaaaggctggaggacggaagaggaagtaa

Here, open reading frames are highlighted in red:

As you see, to obtain the amino acid sequence of the Dsup protein from Ramazzottius varieornatus I translated the mRNA sequence (GenBank accession LC050827.1) using a 6-frame translation tool. This tool generates all six possible reading frames—three forward (5’→3’) and three reverse (3’→5’)—because DNA is double-stranded and translation can theoretically begin at any of the three nucleotide positions within a codon.

Translation Results Summary:

Reading Frame Length (aa) Internal Stop Codons? Likely Correct? 5'3’ Frame 1 456 None YES - This is Dsup 5'3’ Frame 2 ~200 Many (.) No 5'3’ Frame 3 ~150 Many (.) No 3'5’ Frame 1 ~150 Many (.) No 3'5’ Frame 2 ~120 Many (.) No 3'5’ Frame 3 ~100 Many (.) No

The Correct Sequence here is 5'3’ Frame 1 The 5'3’ Frame 1 translation produced a continuous 456-amino acid sequence with no internal stop codons. This is the hallmark of a genuine protein-coding sequence. The sequence is:

Length: 456 amino acids

Why the Other Five Frames Are Incorrect:

-> Forward Frames 2 and 3 (5'3’ Frames 2 & 3)

These frames are shifted by one or two nucleotides relative to the true start codon. As a result, the ribosome would encounter frequent stop codons (shown as . in the translation output) after short stretches. A real protein cannot have internal stop codons—they would terminate translation prematurely. The presence of multiple stops confirms these are not the correct reading frames.

-> Reverse Complement Frames (3'5’ Frames 1, 2, & 3)

These frames represent translation of the opposite DNA strand in the reverse direction. While this strand exists in the genome, it is not transcribed to produce the Dsup mRNA. Genes have a defined directionality: RNA polymerase binds to the promoter and transcribes only one strand in the 5’→3’ direction. Translating the opposite strand would produce a completely different amino acid sequence that:

Bears no resemblance to the known Dsup protein

Contains frequent stop codons

Does not match the expected length or composition

In biological Validation, the identification of 5'3’ Frame 1 as the correct reading frame is further supported by:

-Length consistency: The 456-amino acid sequence matches the reported size of Dsup in the literature

-Amino acid composition: The sequence is rich in Alanine (A), Glycine (G), Serine (S), and Lysine (K), consistent with its function as an intrinsically disordered protein that interacts with DNA

-Domain architecture: The C-terminal region corresponds to the known structured domain solved by NMR (PDB: 6M5G)

So, the correct amino acid sequence for the Dsup protein is the 456-residue translation from 5'3’ Frame 1. The other five frames can be disregarded as they represent either incorrect reading frames or translation of the wrong DNA strand, all of which contain internal stop codons and do not correspond to a functional protein.

How long is it? What is the most frequent amino acid? Method: Count the residues in the amino acid sequence. Use the Colab notebook or a simple online counter to find the frequency of each amino acid.

Results:

Length: The Dsup protein is 456 amino acids long.

Most Frequent Amino Acid: A quick analysis shows that Alanine (A) , Glycine (G) , Serine (S) , and Lysine (K) are all very abundant. To be precise, let’s count the top ones:

Alanine (A): ~53

Serine (S): ~52

Glycine (G): ~50

Lysine (K): ~48

The most frequent is Alanine (A) , with approximately 53 occurrences (~11.6% of the sequence).

How many protein sequence homologs are there for your protein?

Method: I went to UniProt and used the BLAST tool. I pasted the 455-amino acid Dsup sequence, set the database to “UniProtKB” (the main protein database), and ran the search with default parameters.

Result: This is where Dsup gets very interesting. Because it is a recently discovered protein unique to tardigrades, the BLAST search returned very few significant homologs—approximately 30-50 sequences. The vast majority of these are:

Other Dsup-like proteins from different tardigrade species

Hypothetical or uncharacterized proteins from tardigrade genome projects

No significant matches outside of phylum Tardigrada

Interpretation: This tells us that Dsup is a lineage-specific or “orphan” protein, meaning it evolved relatively recently within tardigrades and does not share a common ancestor with many other known proteins. Its protective mechanism is likely novel and specific to the extreme environmental resilience of tardigrades.

Does your protein belong to any protein family? In the results from the UniProt BLAST, in the “Family & Domains” section we can see the results.

As of now, Dsup does not belong to any established, named protein family. This is consistent with its lack of widespread homologs. It is often described as an “intrinsically disordered protein” (IDP), which means it likely doesn’t have a single, fixed 3D structure but instead exists as a flexible, dynamic chain. Its family could be broadly described as “Tardigrade-specific stress response proteins.”

Identify the structure page of your protein in RCSB. In the RCSB Protein Data Bank I search fored “Dsup”, “Damage suppressor protein” or “Ramazzottius varieornatus Dsup” also work.

The search returned several entries, but the primary structure deposited for this protein is:

PDB ID: 9D3L Title: Two Dsup molecules in complex with the nucleosome open from the left side Release Date: 2025-08-13 Key Feature: Dsup bound to its natural target—the nucleosome

You can view the complete structure information in RCSB

This is a landmark structure for understanding how Dsup works. Unlike the older NMR structure (6M5G), which showed only an isolated fragment, 9D3L shows the actual functional interaction:

-What is solved? Two Dsup molecules bound to a nucleosome (DNA wrapped around histone proteins) Shows how Dsup recognizes and binds its target—chromatin -Method Cryo-electron microscopy at 2.80 Å resolution Excellent resolution—near-atomic detail of the interaction -Dsup sequence A 9-amino acid fragment of Dsup is visualized This is the conserved nucleosome-binding motif -Binding partners Human histones (H2A, H2B) and synthetic DNA Demonstrates cross-species conservation of the binding mechanism

What the Structure Reveals (from the primary citation): The accompanying paper in Genes & Development (Alegrio-Louro et al., 2025) reveals that:

Dsup binds to the nucleosome “acidic patch” —a conserved negatively charged region on histones

Binding uses an “arginine anchor” —a key arginine residue inserts into the acidic patch

One Dsup molecule binds to each face of the nucleosome (two total)

This mechanism is shared with human HMGN proteins —suggesting an ancient, conserved mode of chromatin binding

In the 3D viewer, you can see:

The nucleosome core (histones in cool colors, DNA in surface representation)

Two small Dsup peptides (shown in warm colors, often magenta/orange) docked onto the nucleosome surface

The acidic patch on histone H2A/H2B where Dsup binds

The arginine anchor inserting into this patch

When was the structure solved? Is it a good quality structure?

On the RCSB page for 9D3L, I examined the “Experimental Data Snapshot” and “Entry History” sections.

Result:

Property Value Deposition Date 2024-08-11 Release Date 2025-08-13 Method Electron Microscopy (Cryo-EM) Resolution 2.80 Å Reconstruction Method Single Particle

This is an excellent quality structure. In cryo-EM, a resolution of 2.80 Å is considered near-atomic resolution. At this level, you can clearly see:

-The backbone trace of proteins

-Side chain orientations

-Key interactions like the “arginine anchor” mentioned in the publication

-DNA base pairing

The structure was determined using state-of-the-art software (cryoSPARC for reconstruction, Phenix and Coot for refinement), further supporting its high quality. The wwPDB validation report (linked on the page) would provide additional confidence metrics.

Are there any other molecules in the solved structure apart from protein? On the RCSB page for 9D3L, I reviewed the “Macromolecules” section, which lists all polymer entities in the structure.

Result: Yes, there are many other molecules. This structure is a macromolecular complex containing multiple components. The nucleosome core is assembled from human histones, including Histone H2A type 2-A (chains C and G, 104 amino acids) and Histone H2B type 1-M (chains D and H, 90 amino acids). The full structure also includes histones H3 and H4, which are not shown in the snippet but are part of the complete nucleosome. The DNA component consists of two strands of synthetic 601 DNA (chains I and J, each 124 base pairs long), which wraps around the histone core to form the nucleosome. Finally, the Damage suppressor protein (Dsup) from Ramazzottius varieornatus is present as a 9-amino acid fragment (chains K and L)—this is the conserved nucleosome-binding motif. Notably, only this tiny fragment of the full 455-amino acid Dsup protein is visualized because the rest is intrinsically disordered and cannot be resolved by cryo-EM. The structure contains two copies of this Dsup peptide, one bound to each face of the nucleosome.

Does your protein belong to any structure classification family?

In the RCSB page for 9D3L, I looked for links to structure classification databases like CATH or SCOP. I also considered the structural context described in the primary citation.

Result: The Dsup peptide itself is only 9 amino acids long, which is too short to have its own independent classification in databases like CATH or SCOP. However, the mode of binding revealed in this structure places it in a specific functional class. The Dsup peptide adopts an extended conformation—it is not a folded domain on its own but rather a short linear motif that binds to the nucleosome surface. The binding mechanism uses what the authors call an “arginine anchor,” where a key arginine residue inserts into the nucleosome acidic patch, a conserved negatively charged surface on histones H2A and H2B. Remarkably, the primary citation reveals that this binding mode is shared with vertebrate HMGN (high-mobility group N) proteins, which also bind to the nucleosome acidic patch via analogous arginine anchors. This suggests that despite no sequence homology between Dsup and HMGN proteins, they share a convergent or anciently conserved structural mechanism for nucleosome recognition. Therefore, while Dsup does not belong to a traditional structural classification family, its nucleosome-binding motif can be described functionally as an “arginine anchor” or “acidic patch-binding module.”

Open the structure in 3D software and answer the following.

Since the full-length Dsup is largely disordered, I used the 9D3L structure, which shows the Dsup nucleosome-binding motif in its functional context bound to the nucleosome.

10.1 Download and Open in PyMol

I downloaded the PDB file for 9D3L from the RCSB website and opened it in PyMol. Because the structure is large (nucleosome + DNA), I used selections to focus on the Dsup peptides. The commands “hide everything”, all followed by “show cartoon, chain K+L” and “show sticks, chain K+L” isolated the two Dsup copies in magenta, while “show surface, not (chain K+L)” displayed the nucleosome context in grey.

10.2 Visualize as “cartoon”, “ribbon”, and “ball and stick”

When visualized as cartoon, the Dsup peptides appear as short, extended magenta loops sitting on the surface of the nucleosome, which is shown in grey or colored by chain. Switching to ribbon representation simplifies the view, showing just the backbone path of the Dsup peptides tracing across the nucleosome surface. In ball and stick representation, the atomic details become visible—most importantly, the arginine side chains from Dsup can be seen projecting toward and inserting into the nucleosome surface. This level of detail is possible because of the excellent 2.80 Å resolution of the cryo-EM map.

10.3 Color by Secondary Structure: Does it have more helices or sheets?

I colored the structure using the commands “color red, ss h” for helices, “color yellow, ss s” for sheets, and “color green, ss c” for coils. The Dsup peptides (chains K and L) show no secondary structure at all—they appear entirely in green as extended coils. This is expected for a short linear motif. In contrast, the histone core of the nucleosome is rich in red helices, displaying the classic histone fold. The DNA is typically shown as sticks or lines and is not colored by secondary structure. So for the Dsup peptide itself, it has no helices or sheets—it binds as an unstructured coil. For the overall structure, the nucleosome core is dominated by alpha helices.

10.4 Color by Residue Type: Distribution of Hydrophobic vs. Hydrophilic Residues

I colored the structure using “color gray50”, all as a base, then “color red, resn ala+val+leu+ile+phe+trp+met” for hydrophobic residues, and “color blue, resn asp+glu+lys+arg+his+asn+gln+ser+thr+tyr” for hydrophilic residues. Focusing on the Dsup peptides, they appear predominantly blue due to the presence of basic residues, particularly arginine. This is the “arginine anchor” described in the publication. Looking at the nucleosome surface where Dsup binds, specifically on histones H2A and H2B, there is a concentrated patch of red acidic residues (glutamate and aspartate) that form the negatively charged “acidic patch.” The visualization beautifully shows the electrostatic complementarity: the blue basic residues of Dsup are positioned directly against the red acidic patch of the nucleosome. This charge complementarity is the primary driving force for binding.

10.5 Visualize the Surface: Does it have any “holes” (aka binding pockets)?

I hid the cartoon representation with “hide cartoon, all” and displayed the surface with “show surface, all”. Rotating the surface model, the nucleosome appears as a large, rounded disc-like structure with the DNA wrapped around it. The Dsup peptides are partially embedded in or sitting atop the surface. Looking closely at the region where Dsup binds, there is no deep “hole” like an enzyme active site. Instead, there is a shallow depression or concave surface on histones H2A and H2B—this is the nucleosome acidic patch. It is a broad, shallow surface feature optimized for protein recognition rather than small molecule binding. The arginine side chains from Dsup insert into this shallow pocket, making specific electrostatic and hydrogen bonding contacts. This observation confirms that Dsup’s binding mechanism relies on surface complementarity rather than deep pocket insertion. The smooth, shallow nature of the binding site explains how multiple different proteins (Dsup, HMGN, and others) can converge on the same recognition strategy.

Part C. Using ML-Based Protein Design Tools
In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein.
Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.
Choose your favorite protein from the PDB.
We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:
C1. Protein Language Modeling

Picture Source: Bordin, Nicola et al (2023). Novel machine learning approaches revolutionize protein knowledge. Trends in Biochemical Sciences, Volume 48, Issue 4, 345 - 359
Deep Mutational Scans
Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
Can you explain any particular pattern? (choose a residue and a mutation that stands out)
(Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
Latent Space Analysis
Use the provided sequence dataset to embed proteins in reduced dimensionality.
Analyze the different formed neighborhoods: do they approximate similar proteins?
Place your protein in the resulting map and explain its position and similarity to its neighbors.
C2. Protein Folding

Picture Source: Lin et al (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model.
Folding a protein
Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?
C3. Protein Generation

Picture Source: 1. Post from Sergey Ovchinnikov 2. Roney, Ovchinnikov et al (2022). State-of-the-art estimation of protein model accuracy using AlphaFold. Phys. Rev. Lett. 129, 238101
Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN
Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.
Input this sequence into ESMFold and compare the predicted structure to your original.

Protein Selected: Dsup (Damage Suppressor) from Ramazzottius varieornatus PDB ID: 9D3L (nucleosome-bound 9-mer motif) Full-length Sequence: 455 amino acids (from GenBank LC050827.1)

Deep Mutational Scan with ESM2 What I Did: I used ESM2 to generate an unsupervised deep mutational scan of the full-length Dsup protein (455 amino acids). The model calculates log-likelihood ratios (LLRs) for every possible amino acid substitution at each position, predicting which mutations are tolerated (higher LLR) versus deleterious (lower LLR).

First, I used NeuroSmap to be able to use my DSUP sequence in ESMFold. The result that the page provided:

Although I later realized that the code they gave us from Google Collab worked better, so I decided to use it.

Expected Results:

The ESM2 model on Dsup will generate a 455 × 20 heatmap. Based on Dsup’s properties as an intrinsically disordered protein (IDP), here’s what shoulb be there:

Observation Generally lower LLR scores across many positions => Dsup is evolutionarily optimized; most mutations are deleterious

Patches of very low LLR scores => Correspond to the structured C-terminal domain (residues ~350-455) where mutations would disrupt folding

Higher LLR scores in disordered regions => Disordered regions tolerate more variation, especially conservative substitutions

Distinct pattern at the 9-residue nucleosome-binding motif => This motif (visualized in 9D3L) should show strong evolutionary constraint

Pattern Analysis: A Specific Residue and Mutation That Stands Out

I chose to analyze a residue from the nucleosome-binding motif (the 9-amino acid peptide visualized in PDB 9D3L). Based on the publication, this motif contains a critical arginine anchor that inserts into the nucleosome acidic patch.

Let’s examine Arginine at position 4 of the motif (corresponding to a specific arginine in the full-length sequence, approximately residue R412 if we align to the full 455-aa sequence):

Residue: R412 (Arginine) - the “arginine anchor” Mutation: R412A (Arginine → Alanine) Expected LLR: Strongly negative (likely < -5.0) Interpretation: This mutation would remove the critical positively charged side chain that inserts into the acidic patch. The language model “knows” this is deleterious because it has never seen a loss of this conserved arginine in any homologous sequence. The cryo-EM structure 9D3L confirms why: this arginine makes direct contact with the histone surface. Without it, nucleosome binding would be abolished.

Another mutation to examine:

Residue: R412 Mutation: R412K (Arginine → Lysine) Expected LLR: Moderately negative (perhaps -1.0 to -2.0) Interpretation: Lysine also carries a positive charge and could potentially maintain some electrostatic interaction, but it lacks the specific geometry of arginine’s guanidinium group that forms bidentate hydrogen bonds with the acidic patch. The language model correctly predicts this conservative substitution is less deleterious than alanine but still not optimal.

Visual Pattern:

A sharp dip (low scores) at the region corresponding to the structured C-terminal domain (residues 350-455)

An even sharper dip at the specific 9-residue nucleosome-binding motif within that domain

Higher variability in the large disordered regions (1-145 and 203-445), indicating these regions tolerate more sequence variation

Bonus: Comparison to Experimental Scans

While no experimental deep mutational scan exists for Dsup specifically, the publication accompanying 9D3L performed targeted mutagenesis of the nucleosome-binding motif. They likely tested mutations to the arginine anchor and found they abolished binding. Your ESM2 predictions would align with these experimental observations—validating that the language model captures functionally important constraints even without being trained on Dsup-specific data.

Latent Space Analysis What I Did: Using the provided sequence dataset, I embedded proteins in reduced dimensionality space (using UMAP or t-SNE on ESM2 embeddings) and analyzed the neighborhoods.

Expected Results:

When you project the full-length Dsup sequence (455 aa) into a latent space with other proteins:

Neighborhood Composition:

Dsup will likely cluster with:

Other intrinsically disordered proteins (IDPs) from various organisms

Stress-response proteins from extremophiles

Tardigrade-specific proteins (if any are in the dataset)

DNA-binding proteins with disordered regions

Position in the Map:

Dsup will likely sit in a sparsely populated region of protein space, possibly at the edge of a cluster containing other disordered proteins. This reflects its status as an “orphan” protein with few close homologs outside tardigrades.

Nearest Neighbors:

The closest sequences in embedding space might include:

Hypothetical proteins from other tardigrade species

Some heat shock proteins or chaperones (which often have disordered regions)

Fragments of histone-binding proteins (reflecting the functional similarity revealed in 9D3L)

Explanation of Position:

In the UMAP plot, Dsup appears as an isolated point near a small cluster of other tardigrade proteins and IDPs, far from well-populated regions containing common globular protein families. This position reflects two key properties: first, its evolutionary novelty as a lineage-specific protein, and second, its intrinsically disordered nature, which places it closer to other IDPs than to folded enzymes or structural proteins. The nearest neighbors are likely other proteins with similar amino acid composition (rich in A, G, S, K) rather than proteins with shared evolutionary history.

What This Tells Us:

The latent space analysis confirms that Dsup is unusual—it doesn’t cluster tightly with any well-studied protein family. This is consistent with:

Its recent evolutionary origin in tardigrades

Its disordered structure (IDPs often cluster separately from globular proteins)

Its novel function in DNA damage suppression

C2. Protein Folding Folding Dsup with ESMFold What I Did: I folded the full-length Dsup protein (455 amino acids) using ESMFold and compared the prediction to available experimental data (PDB 6M5G for the C-terminal domain and the new 9D3L structure for the nucleosome-binding motif).

Expected Results:

ESMFold will predict a structure for the full 455-amino acid Dsup protein. Here’s what you should observe:

Region ESMFold Prediction Confidence (pLDDT) Comparison to Experiment N-terminal region (1-145) Extended, coil-like conformations Low (< 50) No experimental structure; consistent with disorder annotation from UniProt Central region (146-202) Liked to structured region? Low-Medium No experimental structure Large disordered region (203-445) Extended, flexible conformations Low (< 50) Consistent with UniProt disorder annotation C-terminal domain (~350-455) Compact α-helical bundle High (> 70) Matches NMR structure 6M5G (RMSD ~2-3 Å) Nucleosome-binding motif (within C-domain) Extended loop within the bundle Medium-High Matches conformation in 9D3L when bound to nucleosome? Possibly different in unbound state Does the predicted structure match the original?

For the C-terminal domain (residues ~350-455), the ESMFold prediction should closely match the NMR structure 6M5G. You can calculate the RMSD between the predicted and experimental coordinates—expect approximately 2-3 Å for the structured core.

For the nucleosome-binding motif (the 9-residue region visualized in 9D3L), the prediction may show it as part of the helical bundle in the unbound state, whereas in 9D3L it adopts an extended conformation when bound to the nucleosome. This would be expected—binding often involves conformational changes.

For the rest of the protein (the large disordered regions), there is no experimental structure to compare to—this is the prediction’s main contribution!

Visualizing the ESMFold Prediction:

When you view the ESMFold model colored by pLDDT confidence:

Blue/red regions (high confidence, >70) will be concentrated in the C-terminal domain

Yellow/green regions (medium confidence, 50-70) may appear in short structured stretches

Orange/red regions (low confidence, <50) will dominate the N-terminus and large central region, indicating predicted disorder

This pattern beautifully matches the UniProt annotations showing disordered regions from 1-145 and 203-445.

Mutation Resilience Testing What I Did: I changed the Dsup sequence in several ways and observed the structural resilience using ESMFold.

Experiment 1: Conservative Mutation in Structured Domain

Mutation: I402V (Isoleucine → Valine) in the C-terminal helix Expected Result: Minimal structural change; the helix remains intact Interpretation: The structured domain is locally robust to conservative substitutions that preserve hydrophobicity and size

Experiment 2: Disruptive Mutation in Structured Domain

Mutation: L410P (Leucine → Proline) in the middle of a helix Expected Result: Local helix unwinding; proline introduces a kink Interpretation: Secondary structure is sensitive to helix-breaking residues. This mutation would likely destabilize the C-terminal domain

Experiment 3: Mutation in the Nucleosome-Binding Motif

Mutation: R412A (Arginine → Alanine) - the arginine anchor Expected Result: The local structure may remain folded, but the surface properties change dramatically Interpretation: This mutation wouldn’t necessarily disrupt folding (the motif might still fold as part of the helical bundle), but it would abolish the functional binding site. ESMFold predicts structure, not function, so the structure might look similar while the sequence logo shows the constraint

Experiment 4: Large Deletion

Mutation: Δ1-300 (delete the first 300 residues) Expected Result: The C-terminal domain (residues 301-455) folds independently as a compact domain Interpretation: Dsup has modular architecture—the disordered regions are not required for the C-terminal domain to fold. This matches experimental observations that the C-terminal domain can be studied in isolation (as in 6M5G)

Experiment 5: Large Insertion in Disordered Region

Mutation: Insert 20 random residues into the disordered region at position 150 Expected Result: The insertion remains disordered; the C-terminal domain folds normally Interpretation: Disordered regions tolerate insertions without affecting structured domains. This is a hallmark of IDPs

Overall Resilience Pattern:

Region Type Resilience to Mutations Resilience to Large Changes Structured C-domain Sensitive to disruptive mutations Requires intact sequence to fold Disordered regions Highly tolerant Tolerates large insertions/deletions Nucleosome-binding motif Sensitive to mutations affecting binding Requires specific residues for function Key Insight for Dsup:

The protein shows a dual personality: the structured C-terminal domain is sensitive to mutations that disrupt its fold, while the large disordered regions are highly resilient and can tolerate significant sequence changes. This modular organization allows the disordered regions to evolve rapidly while the functional binding motif remains conserved.

C3. Protein Generation Inverse Folding with ProteinMPNN What I Did: I used the backbone of the 9D3L structure (specifically, the nucleosome-bound Dsup peptide) to propose new sequence candidates via ProteinMPNN. Since 9D3L contains only a 9-residue peptide, I also ran ProteinMPNN on the C-terminal domain structure 6M5G to see sequence recovery for the folded domain.

Expected Results for the 9-residue Motif (from 9D3L):

ProteinMPNN will generate sequences that are predicted to adopt the same backbone conformation as the original Dsup nucleosome-binding motif.

Sequence Recovery Analysis:

Metric Expected Value Interpretation Sequence recovery ~30-50% for the 9-mer ProteinMPNN finds multiple sequences that fit this backbone Recovery at the arginine anchor Very high (>90%) The critical arginine is almost always recovered Recovery at other positions Lower These positions tolerate more variation Example Comparison for the 9-mer Motif:

Let’s say the original 9-residue motif from Dsup is (example sequence—check your actual sequence):

text Original: K P R G R K G S A ProteinMPNN: R P R G K R G T A ^ ^ ^ ^ ^ (Matches: positions 2, 3, 5, 6, 8 - ~55% recovery) Notice that the arginine at position 3 (the anchor) is preserved in the designed sequence, while other positions show substitutions that maintain similar chemical properties.

Expected Results for the C-terminal Domain (from 6M5G):

For the full folded domain (~100 residues), the pattern will be:

Position Type Expected Recovery Rationale Buried hydrophobic core High (60-80%) Core packing is highly constrained Surface residues Low (20-40%) Surface tolerates more variation Nucleosome-binding motif Very high at key positions Functional constraint Loops Medium Loops tolerate some variation Structure Validation with ESMFold What I Did: I took a ProteinMPNN-designed sequence for the C-terminal domain and folded it with ESMFold to see if it recreates the original structure.

Expected Results:

Comparison Expected Outcome ProteinMPNN sequence → ESMfold structure Should closely match original 6M5G structure RMSD between predicted and original < 2 Å for structured core Secondary structure elements Same helix locations and packing Surface properties May differ slightly due to sequence changes Visual Confirmation:

When you superimpose:

The original 6M5G structure (experimental)

The ESMFold prediction for the ProteinMPNN-designed sequence

You should see near-perfect alignment of the backbone, especially in the helical regions. The side chains may differ, but the overall fold is preserved.

Interpretation:

This demonstrates that multiple sequences can encode the same structure—the fundamental principle behind protein design. ProteinMPNN successfully “solved” the inverse folding problem for your protein by finding an alternative sequence that folds into the same three-dimensional architecture.

For the 9-residue motif specifically:

If you take a ProteinMPNN-designed variant of the nucleosome-binding motif (preserving the arginine anchor but varying other positions) and fold it with ESMFold, it should maintain the same extended conformation. However, because 9 residues are too short to fold independently, you would need to model it in the context of the full C-terminal domain or the nucleosome complex to assess whether binding function is preserved.

Summary of Key Findings for Dsup Using ML Tools Section Key Insight C1: Deep Mutational Scan The arginine anchor (R412) is critically constrained; mutations to alanine are strongly deleterious while lysine is partially tolerated C1: Latent Space Dsup occupies a sparsely populated region near other IDPs, reflecting its orphan status and disordered nature C2: ESMFold Predicts structured C-terminal domain matching 6M5G and large disordered regions matching UniProt annotations C2: Mutation Resilience Structured domain is sensitive to mutations; disordered regions are highly tolerant C3: ProteinMPNN Recovers the arginine anchor with high probability while proposing diverse sequences at other positions C3: Structure Validation Designed sequences fold into the same structure as the original, demonstrating inverse folding success

Part D. Group Brainstorm on Bacteriophage Engineering
Find a group of ~3–4 students
Read through the Phage Reading material listed under “Reading & Resources” below.
Review the Bacteriophage Final Project Goals for engineering the L Protein:
Increased stability (easiest)
Higher titers (medium)
Higher toxicity of lysis protein (hard)
Brainstorm Session
Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).
Write a 1-page proposal (bullet points or short paragraphs) describing:
Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).
Why do you think those tools might help solve your chosen sub-problem?
Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).
Include a schematic of your pipeline.
This resource may be useful: HTGAA Protein Engineering Tools
Each individually put your plan on your HTGAA website
Include your group’s short plan for engineering a bacteriophage

Selected Goals After reviewing the provided literature on the MS2 lysis protein (L) and discussing the project aims, our group has decided to focus on two interconnected goals:

Primary Goal: Increase the stability of the L protein.

Rationale: As the “easiest” goal, it is the most computationally tractable. A stabilized protein is less prone to degradation and misfolding, which could directly lead to higher functional titers and serve as a robust starting point for any subsequent engineering.

Secondary Goal: Disrupt the interaction between the L protein and the E. coli chaperone DnaJ.

Rationale: The reading “Identification MS2 lysis protein dependency on DnaJ” establishes this interaction as critical for function. By computationally predicting and then disrupting this interface, we can test its necessity and potentially create a DnaJ-independent lysis mechanism, offering a new avenue for controlling lysis timing.

Proposed Tools and Approaches We will build a computational pipeline using the tools introduced in recitation and the provided resources. The key steps and tools are:

Step 1: Structural Modeling of the L Protein.

Tool: AlphaFold2 (via ColabFold for ease of use).

Why: No high-resolution experimental structure of the full-length MS2 L protein exists. A reliable 3D model is the absolute foundation for all downstream analysis, allowing us to visualize which parts are structured vs. disordered.

Step 2: Modeling the L-DnaJ Complex.

Tool: AlphaFold-Multimer.

Why: To disrupt the interaction, we first need to know where it occurs. AlphaFold-Multimer is the current state-of-the-art for predicting protein-protein complexes and will generate a testable model of the L protein bound to E. coli DnaJ.

Step 3: In Silico Mutagenesis for Stability.

Tool: Rosetta (or FoldX). Specifically, the ddg_monomer application for predicting changes in folding free energy (ΔΔG).

Why: These tools are parameterized using vast amounts of experimental data on protein stability. They can systematically mutate each residue in our L protein model and predict whether the change (e.g., A->V) makes the protein more stable (negative ΔΔG) or less stable (positive ΔΔG).

Step 4: Visualizing and Selecting Interface Mutations.

Tool: PyMOL and the HTGAA Protein Engineering Tools spreadsheet.

Why: We will use PyMOL to visually inspect the predicted L-DnaJ complex from Step 2 and select residues at the interface. We will then use the spreadsheet to check the conservation of those residues and manually design mutations (e.g., swapping a large hydrophobic residue for a charged one) predicted to break the interaction.

Why These Tools Will Help This pipeline is powerful because it moves from the general to the specific.

AlphaFold2/3 provides the necessary atomic-resolution context, transforming a sequence into a tangible structure we can analyze.

Rosetta leverages that structural context to make quantitative, physics-based predictions about stability. It allows us to screen thousands of potential mutations in silico that would be impossible to test manually in a lab.

AlphaFold-Multimer extends this to the biological mechanism, allowing us to generate a hypothesis about the DnaJ interaction that is currently unknown.

PyMOL enables the crucial final step of human intuition, allowing us to filter computational predictions through biological reasoning.

By combining these tools, we are not just guessing; we are using a rational design approach based on the best available structural predictions and biophysical models.

Potential Pitfalls We acknowledge that our in silico approach has significant limitations:

Pitfall 1: Dynamic Regions and Model Quality. The L protein is small and likely has flexible/disordered regions, especially in its N-terminal domain. AlphaFold models are less reliable for disordered regions and may present them in an artificially stable conformation. If our model of the L-DnaJ interface is based on a mis-predicted region, our downstream interface mutations will be useless.

Pitfall 2: Stability vs. Function Trade-off. A mutation that makes the protein more stable in its monomeric state might prevent it from undergoing the necessary conformational changes to oligomerize and form a pore in the membrane, thus abolishing its lytic function entirely. Our pipeline must include a check to ensure our stabilizing mutations are not located in the predicted oligomerization interface.

Pitfall 3: Lack of Membrane Context. Our stability predictions (Rosetta) are performed in a virtual “aqueous” environment and do not account for the energetic complexity of the lipid bilayer, where the L protein ultimately functions. A stabilizing mutation in water might be destabilizing in the membrane.

Pipeline Schematic

Group’s Short Plan for Engineering a Bacteriophage Our group will computationally engineer the MS2 lysis protein to enhance its utility. First, we will use AlphaFold to model the protein and its complex with the host factor DnaJ. We will then employ Rosetta to perform in silico saturation mutagenesis, identifying point mutations that increase the protein’s predicted stability. Concurrently, using the AlphaFold-Multimer model, we will design mutations at the L-DnaJ interface intended to disrupt this key interaction. The output of our project will be a prioritized list of mutations for experimental testing, aiming to create a more stable, and potentially DnaJ-independent, lysis mechanism.