Week 4 HW: Protein Design Part I

Part A. Conceptual Questions

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

To estimate how many amino acid molecules are in 500 grams of meat, we start with the fact that the average mass of one amino acid is about 100 Daltons.

One Dalton equals 1.66 x 10^(-24) grams. So, 100 Daltons equals:

100 x 1.66 x 10^(-24) = 1.66 x 10^(-22) grams per amino acid.

Now we divide the total mass of meat (500 grams) by the mass of one amino acid:

500 / (1.66 x 10^(-22)) ≈ 3 x 10^(24)

Therefore, 500 grams of meat contain approximately 3 x 10^(24) amino acid molecules.

This shows that even a small amount of food contains an enormous number of molecular building blocks.
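
The arithmetic above can be checked with a short Python sketch. The function name and the `protein_fraction` parameter are illustrative additions, not part of the original question; the default of 1.0 mirrors the simplification used in the answer, which treats the whole 500 g as amino acids (real meat is only partly protein).

```python
AVOGADRO = 6.022e23          # molecules per mole
DALTON_IN_GRAMS = 1.66e-24   # mass of 1 Da in grams, as used in the answer


def amino_acid_count(mass_g, avg_residue_da=100.0, protein_fraction=1.0):
    """Estimate the number of amino acid molecules in a sample.

    protein_fraction is an illustrative knob (an assumption of this sketch);
    1.0 treats the whole sample as amino acids, as the answer does.
    """
    mass_per_amino_acid_g = avg_residue_da * DALTON_IN_GRAMS
    return mass_g * protein_fraction / mass_per_amino_acid_g


n = amino_acid_count(500)
print(f"{n:.2e} amino acid molecules")  # on the order of 3e24

# Cross-check via moles: 500 g / (100 g/mol) = 5 mol
n_via_moles = 500 / 100.0 * AVOGADRO
```

Both routes agree to within rounding of the constants, since 1 Da in grams is simply the reciprocal of Avogadro's number.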

2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Humans eat beef or fish, but we do not become a cow or a fish because our bodies break down food into basic molecular components before using it.

When we eat meat, our digestive system breaks proteins into amino acids, fats into fatty acids, and carbohydrates into simple sugars. These small molecules are absorbed into the bloodstream and then reused by our cells to build human proteins, human tissues, and human cells according to our own DNA instructions.

The key reason we do not become what we eat is that our genetic information controls how these molecules are assembled. A cow’s DNA builds cow proteins and tissues, while human DNA builds human proteins and tissues. Even though the raw materials are similar, the instructions are different.

In short, we do not become a cow or a fish because our body does not copy their structure — it only reuses their molecular building blocks to maintain and build our own human body.

3. Why are there only 20 natural amino acids?

The reason there are 20 natural amino acids is that those 20 are enough to build all the proteins that life needs.

Think of amino acids like Lego pieces. You do not need thousands of different pieces to build something complex; with a well-designed set, you can create almost any structure. The 20 amino acids have different sizes, electrical charges, and properties (some are hydrophobic, others are hydrophilic, some are positive, others negative). This variety is enough for proteins to fold into many different shapes and perform many different functions.

Another important reason is evolution. Early in the origin of life, more types of amino acids may have existed. However, these 20 worked well together within the genetic system. Once the genetic code was established using these 20 amino acids, changing it would have been very risky for organisms. For that reason, the system remained stable.

In summary, there are 20 amino acids because this number provides enough chemical diversity to create the complexity of life, and evolution fixed this set as the standard.

4. Can you make other non-natural amino acids? Design some new amino acids

Yes, it is possible to create non-natural amino acids. Scientists can design new amino acids by modifying the part of the molecule known as the side chain, or R group. All standard amino acids share the same basic structure: a central (alpha) carbon bonded to an amino group, a carboxyl group, a hydrogen atom, and a variable side chain. The side chain is what determines the chemical properties of each amino acid. By changing this side chain, new amino acids with new properties can be created.

For example, one could design a modified version of phenylalanine by adding fluorine atoms to its aromatic ring. This change could make the amino acid more chemically stable and resistant to degradation, which would be useful in biomaterials. Another possibility would be designing a photo-responsive amino acid whose side chain changes shape when exposed to light. This could allow scientists to control protein activity using specific wavelengths of light. A third example could be a metal-binding amino acid with a side chain designed to strongly interact with metals such as copper or iron, which could be useful in environmental or material science applications.

Although natural organisms use only 20 standard amino acids, synthetic biology has made it possible to expand the genetic code. By engineering specialized transfer RNAs, together with the aminoacyl-tRNA synthetases that charge them, and modifying translation systems, researchers can incorporate non-natural amino acids into proteins. This allows the creation of proteins with entirely new properties that do not exist in nature.

In summary, non-natural amino acids can be designed by modifying the chemical structure of existing ones, particularly their side chains, enabling the development of new biological functions and materials.

5. Where did amino acids come from before enzymes that make them, and before life started?

Before life began, amino acids likely formed through natural chemical reactions on the early Earth. At that time, there were no enzymes or living cells. Instead, simple molecules such as water (H2O), methane (CH4), ammonia (NH3), hydrogen (H2), and carbon dioxide (CO2) were present in the atmosphere and oceans. Energy from lightning, volcanic activity, ultraviolet radiation from the sun, and geothermal heat provided the energy needed to drive chemical reactions between these small molecules.

In 1953, the Miller-Urey experiment showed that when simple gases thought to exist on early Earth were exposed to electrical sparks (simulating lightning), amino acids formed spontaneously. This demonstrated that the building blocks of proteins can arise from non-living chemical processes. In addition, amino acids have been found in meteorites, suggesting that some may have formed in space and arrived on Earth through asteroid impacts.

After amino acids were present, some of them began to link together, forming short chains called peptides. Over time, certain peptides may have developed the ability to speed up chemical reactions slightly. These primitive catalytic molecules would have provided an advantage, because they made useful reactions happen more efficiently.

Before true protein enzymes existed, many scientists believe that RNA molecules played an important role. RNA can both store information and act as a catalyst (these catalytic RNAs are called ribozymes). This idea is known as the “RNA world” hypothesis. Eventually, as biological systems became more complex, proteins replaced most RNA catalysts because proteins are more versatile and efficient. These protein catalysts are what we now call enzymes.

In summary, amino acids likely formed through natural chemical reactions powered by environmental energy sources. Over time, they combined into peptides, some of which gained catalytic abilities. Through gradual evolution—possibly beginning with catalytic RNA—modern enzymes eventually emerged, allowing life to develop increasingly complex biochemical systems.

6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

If you make an alpha-helix using D-amino acids, you would expect it to form a left-handed helix.

In nature, proteins are made almost entirely from L-amino acids. These naturally form right-handed alpha-helices because of their specific three-dimensional geometry. The spatial arrangement of atoms in L-amino acids favors a right-handed twist when they fold into an alpha-helix structure.

D-amino acids are mirror images of L-amino acids. Because their geometry is reversed, the preferred helix direction is also reversed. As a result, a chain made entirely of D-amino acids would form a left-handed alpha-helix.

In summary, L-amino acids form right-handed alpha-helices, while D-amino acids form left-handed alpha-helices due to their mirror-image stereochemistry.
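
The mirror-image argument can be illustrated geometrically. The sketch below is not a simulation of peptide stereochemistry; it only builds an idealized helical trace using approximate alpha-helix parameters (3.6 residues per turn, 1.5 Å rise per residue, and an assumed radius of 2.3 Å), reflects it through a plane, and shows that the handedness flips. All function names are illustrative.

```python
import math


def helix_points(n_residues, residues_per_turn=3.6, rise=1.5, radius=2.3):
    """Idealized right-handed helical trace (one point per residue)."""
    step = 2.0 * math.pi / residues_per_turn
    return [(radius * math.cos(i * step), radius * math.sin(i * step), rise * i)
            for i in range(n_residues)]


def mirror(points):
    """Reflect through the yz-plane, as when swapping L- for D-residues."""
    return [(-x, y, z) for x, y, z in points]


def handedness(points):
    """Classify a trace by the sign of summed scalar triple products
    of consecutive step vectors (positive means right-handed)."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    total = 0.0
    for i in range(len(points) - 3):
        b1 = sub(points[i + 1], points[i])
        b2 = sub(points[i + 2], points[i + 1])
        b3 = sub(points[i + 3], points[i + 2])
        total += dot(cross(b1, b2), b3)
    return "right" if total > 0 else "left"


l_helix = helix_points(12)
d_helix = mirror(l_helix)
print(handedness(l_helix), handedness(d_helix))
```

A reflection changes the sign of every scalar triple product, which is why the mirrored trace is classified with the opposite handedness.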

7. Can you discover additional helices in proteins?

Yes, additional helices can be discovered in proteins. Proteins are very flexible molecules that can fold into many different shapes depending on their amino acid sequence. While the alpha-helix is one of the most common helical structures found in nature, it is not the only possible one.

Scientists have identified other types of helices, such as the 3₁₀ helix and the pi-helix. These structures differ slightly in how tightly they are wound and in the pattern of hydrogen bonds that stabilize them. They are less common than the alpha-helix but still naturally occur in some proteins.

In addition, researchers in protein engineering and synthetic biology can design entirely new helical structures by changing amino acid sequences or by incorporating non-natural amino acids. Advances in computational tools and artificial intelligence now allow scientists to predict and design novel protein folds that may not exist in nature.

In summary, additional helices can be discovered or designed because protein structure depends on the chemical properties and arrangement of amino acids, and these combinations allow for many possible folding patterns.

8. Why are most molecular helices right-handed?

Most molecular helices are right-handed because the building blocks of life are not symmetrical. In living organisms, amino acids are almost always in the L-form, which has a specific three-dimensional orientation. This asymmetry, called chirality, influences how molecules fold and assemble.

When many L-amino acids link together to form a protein, their geometry naturally favors a right-handed twist when forming structures like the alpha-helix. The specific angles between chemical bonds and the way hydrogen bonds stabilize the structure make the right-handed version more stable for L-amino acids.

In general, once life selected L-amino acids as the standard building blocks, the structures that formed from them (such as protein helices) also followed a consistent handedness. This biological preference became universal because it was energetically favorable and evolutionarily fixed.

In summary, most molecular helices are right-handed because life uses L-amino acids, and their three-dimensional structure naturally leads to right-handed helical folding.

9. Why do β-sheets tend to aggregate?

Beta-sheets tend to aggregate because their structure allows strong and repeated interactions between neighboring protein strands. In a beta-sheet, the backbone of the protein forms many hydrogen bonds in a very regular and extended pattern. This creates flat surfaces that can easily align with other beta-strands from nearby molecules.

When these flat regions come close together, they can form additional hydrogen bonds between different protein molecules. This stacking effect is energetically favorable, meaning it lowers the system’s energy and makes aggregation more stable. Because the pattern of hydrogen bonding is repetitive and strong, beta-sheets can “zip up” with each other, leading to large aggregates.

This behavior is especially important in diseases like Alzheimer’s, where proteins misfold and form beta-sheet–rich aggregates known as amyloid fibrils. The beta-sheet structure makes it easy for many copies of the same protein to stick together in an ordered way.

In summary, beta-sheets tend to aggregate because their flat, hydrogen-bonded structure allows them to align and form stable intermolecular interactions with other beta-sheets, promoting stacking and aggregation.

What is the driving force for β-sheet aggregation?

The main driving force for beta-sheet aggregation is the formation of hydrogen bonds between protein backbones, combined with hydrophobic interactions.

In a beta-sheet structure, the protein backbone is extended and forms hydrogen bonds in a very regular pattern. When multiple beta-strands from different protein molecules come close together, they can form additional hydrogen bonds between each other. This creates a very stable, repetitive “zipper-like” structure.

At the same time, many beta-sheet–forming regions contain hydrophobic (water-repelling) amino acids. When these hydrophobic surfaces are exposed to water, it is energetically unfavorable. Aggregation helps bury these hydrophobic regions away from water, which lowers the overall energy of the system. This hydrophobic effect strongly promotes aggregation.

So, the driving forces are:

Backbone hydrogen bonding between strands.
The hydrophobic effect, which pushes nonpolar regions to cluster together.
Overall energy minimization, making the aggregated state more stable.

In summary, beta-sheet aggregation is driven by strong hydrogen bonding and hydrophobic interactions that stabilize stacked protein structures.

Part B: Protein Analysis and Visualization

1. Briefly describe the protein you selected and why you selected it

I selected collagen as the protein for this assignment. Collagen is a structural protein that is the main component of connective tissues such as skin, tendons, cartilage, and bone. It is especially important in archaeology because collagen is one of the primary organic materials preserved in ancient bones, textiles, and artifacts. Archaeologists often analyze collagen to reconstruct diet, obtain radiocarbon dates, and assess preservation conditions.

Collagen has a unique three-dimensional structure known as a triple helix, formed by three polypeptide chains tightly wound around each other. This structure gives collagen its strength and stability. I selected collagen because it is directly relevant to archaeological research and material preservation, and its distinctive structure makes it an excellent example for understanding how protein structure relates to function and long-term stability.

2. Identify the amino acid sequence of your protein.

How long is it? What is the most frequent amino acid?

The protein is 2993 amino acids long. The most frequent amino acid is glycine (G), which appears 627 times.
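
Length and residue frequency can be computed directly from a sequence string. The sketch below uses a short made-up collagen-like sequence (`toy`), not the actual protein analyzed here, and the function name is illustrative.

```python
from collections import Counter


def summarize_sequence(seq):
    """Return (length, most frequent residue, its count) for a one-letter sequence."""
    seq = "".join(seq.split()).upper()   # drop whitespace and line breaks
    counts = Counter(seq)
    residue, count = counts.most_common(1)[0]
    return len(seq), residue, count


# Hypothetical toy sequence for illustration only
toy = "GPPGAKGEQGFQGPPGERGEPGAS"
length, top_residue, top_count = summarize_sequence(toy)
print(length, top_residue, top_count)
```

For a real entry, the string would be the sequence copied from the FASTA record with its header line removed.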

How many protein sequence homologs are there for your protein?

To determine the number of homologs, I used UniProt’s BLAST tool to search for sequences similar to human collagen type I (COL1A1). The BLAST results showed thousands of homologous protein sequences across many different organisms, particularly vertebrates such as mammals, birds, reptiles, and fish.

Collagen is a highly conserved structural protein, meaning its sequence has remained relatively similar throughout evolution. Because it plays a critical role in connective tissues such as bone and skin, it is present in nearly all multicellular animals. As a result, BLAST identifies a very large number of homologs with significant sequence similarity.

This high number of homologous sequences reflects the essential structural role of collagen and its evolutionary conservation across species.

Does your protein belong to any protein family?

Yes, collagen Type I alpha 1 chain (COL1A1) belongs to the collagen protein family. More specifically, it is part of the fibrillar collagen family.

Collagens are a large family of structural proteins that form the extracellular matrix in connective tissues. They share a characteristic triple-helix structure composed of three polypeptide chains and a repeating Gly-X-Y amino acid sequence, where glycine appears every third residue. This repeating pattern is essential for forming the stable triple helix.
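
One simple way to see the Gly-X-Y repeat in a sequence is to measure how often glycine occupies each of the three possible positions of the repeat. This is a minimal sketch with an illustrative function name and a made-up toy sequence; a real collagen chain also contains non-helical regions, so its fractions would be lower than in this idealized example.

```python
def glycine_periodicity(seq):
    """For each of the three repeat offsets, return the fraction of
    positions at that offset that are glycine."""
    seq = seq.upper()
    fractions = []
    for offset in range(3):
        positions = seq[offset::3]       # every third residue from this offset
        if positions:
            fractions.append(positions.count("G") / len(positions))
        else:
            fractions.append(0.0)
    return fractions


# Hypothetical collagen-like toy sequence for illustration only
toy_repeat = "GPPGAKGEQGFQGPPGERGEPGAS"
print(glycine_periodicity(toy_repeat))
```

A sequence following the repeat shows one offset close to 1.0 and the other two close to 0.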

Within the collagen superfamily, Type I collagen belongs to the fibrillar collagens, which also include types II, III, V, and XI. These collagens form rope-like fibers that provide tensile strength to tissues such as bone, skin, and tendons.

In summary, COL1A1 is a member of the collagen superfamily and specifically part of the fibrillar collagen family, which is responsible for structural support in connective tissues.

3. Identify the structure page of your protein in RCSB

When was the structure solved? Is it a good quality structure?

One representative collagen triple-helix structure (PDB ID: 1CAG, a collagen-like peptide) was solved using X-ray crystallography and published in 1994, with a resolution of approximately 1.9 Å. This is considered good quality, since smaller values indicate higher structural detail. At this resolution, atomic positions can be determined with good accuracy. LINK: https://www.rcsb.org/structure/1CAG

Are there any other molecules in the solved structure apart from protein?

Yes. In addition to the collagen protein chains, X-ray crystallography structures often include water molecules and sometimes small ions or stabilizing molecules. These are commonly found in crystal structures because they help stabilize packing interactions in the crystal lattice.

Does your protein belong to any structure classification family?

Yes. The collagen triple helix belongs to the fibrous protein structural family. Collagens are classified as structural proteins with a unique triple-helix motif, consisting of three polypeptide chains wound around each other. This distinct arrangement classifies collagen as a structural superfamily separate from globular proteins.

4. Open the structure of your protein in any 3D molecule visualization software:

Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.

When visualized as cartoon and ribbon representations, the structure clearly shows the characteristic triple-helix arrangement of collagen. The protein consists of three polypeptide chains tightly wound around each other. When displayed in ball-and-stick representation, individual atoms and the repeating Gly-X-Y pattern can be clearly observed.

Color the protein by secondary structure. Does it have more helices or sheets?

When colored by secondary structure, the protein shows helical structure and no beta-sheets. These are not alpha-helices: each chain forms a left-handed, polyproline II-type helix, and the three chains wrap around one another into a right-handed superhelix. Depending on how the software assigns secondary structure, the chains may therefore be colored as coil rather than as helix. Overall, the structure contains helical content and essentially no beta-sheets.

Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

When colored by residue type, a clear pattern appears. Glycine residues are distributed regularly throughout the structure because glycine occurs at every third position in collagen (Gly-X-Y repeat). Unlike a globular protein, the triple helix has no hydrophobic core: only the small glycine residues pack along the central axis, while the side chains at the X and Y positions (often proline and hydroxyproline) point outward and are exposed to the solvent. This arrangement supports tight packing of the three chains and leaves the X and Y residues available for interactions with the extracellular environment.

Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

When visualizing the surface of the protein, collagen does not display deep binding pockets like many enzymes. Since it is a structural fibrous protein rather than a globular enzyme, it lacks large internal cavities or active-site pockets. Instead, the surface is elongated and repetitive, consistent with its mechanical structural role.

Part C. Using ML-Based Protein Design Tools

For this project, I selected the collagen triple helix model (PDB ID: 1CAG). Collagen is a structural protein that forms the extracellular matrix in connective tissues such as bone and skin. I chose collagen because of its biological relevance and its characteristic Gly-X-Y repeating motif, which is essential for triple-helix formation. As an archaeologist interested in biomaterials and preservation, I find collagen particularly meaningful because of its importance in bone structure and archaeological remains.

1. Deep Mutational Scans

Using ESM2, I generated an unsupervised deep mutational scan of the collagen triple helix model (PDB ID: 1CAG) based on language model likelihood scores. The heatmap reveals clear positional constraints across the sequence.

A particularly striking pattern appears at glycine positions. For example, at position 13, glycine shows a strongly positive score (yellow), while most alternative amino acids show strongly negative scores (blue/purple). This indicates that substitutions at this position are highly unfavorable.

This pattern reflects the structural constraint of the Gly-X-Y repeating motif characteristic of collagen. Glycine is required every third residue to allow tight packing of the triple helix. Replacing glycine with a bulkier amino acid would introduce steric clashes and destabilize the structure. The language model successfully captures this evolutionary constraint without being explicitly trained on structural data.

Overall, the deep mutational scan demonstrates that structurally critical residues, particularly glycine, are strongly conserved and intolerant to mutation.
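
The write-up does not state exactly how the heatmap scores were derived, so the following is only a sketch of one common scheme: scoring each substitution as the log-ratio of the model's probability for the mutant residue versus the wild-type residue at that position. The probabilities here are made-up placeholders built by a hypothetical helper (`peaked`); in practice they would come from the language model's per-position predictions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peaked(favored, p):
    """Placeholder distribution: `favored` gets probability p, the rest share 1 - p."""
    rest = (1.0 - p) / (len(AMINO_ACIDS) - 1)
    return {aa: (p if aa == favored else rest) for aa in AMINO_ACIDS}


def mutation_scores(wild_type, position_probs):
    """Return one dict per position mapping amino acid -> log(p_mut / p_wt)."""
    scores = []
    for wt, probs in zip(wild_type, position_probs):
        log_wt = math.log(probs[wt])
        scores.append({aa: math.log(probs[aa]) - log_wt for aa in AMINO_ACIDS})
    return scores


# Hypothetical Gly-X-Y triplet: the model is assumed to be very sure about Gly
wild_type = "GPP"
probs = [peaked("G", 0.9), peaked("P", 0.4), peaked("P", 0.4)]
scores = mutation_scores(wild_type, probs)
print(round(scores[0]["A"], 2), round(scores[1]["A"], 2))
```

Under this convention the wild-type residue always scores 0 and unfavorable substitutions are negative, so a tightly constrained position such as glycine shows up as a column of strongly negative values.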

Part D. Group Brainstorm on Bacteriophage Engineering

Project Objective

Engineer the L protein of the MS2 phage to increase structural stability.
Disrupt or reduce its interaction with the bacterial chaperone DnaJ.
Preserve the C-terminal lysis domain to maintain lytic function.
Avoid mutations that interfere with structurally or evolutionarily coupled residues.

Phase 1: Mapping the DnaJ Interaction Interface

Since the exact binding interface between the L protein and DnaJ is unknown, the first step is to identify it computationally rather than introducing arbitrary mutations.

Use AlphaFold-Multimer to model the complex between L protein and DnaJ.
Generate multiple structural predictions and select the top-ranked models.
Identify consensus interface residues that consistently appear in the predicted binding interface.
Perform in silico alanine scanning of the N-terminal residues in the complex to determine which residues significantly contribute to binding energy (ΔΔG).
Analyze whether the N-terminal region resembles known DnaJ-binding motifs, typically hydrophobic residues flanked by basic amino acids.

This phase defines which residues are critical for the interaction, so that they can be targeted deliberately rather than mutated at random.
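
The consensus step can be sketched as a simple vote across predicted models. The residue numbers below are invented for illustration, and the 80% cutoff and function name are assumptions of this sketch, not values taken from the plan; extracting interface residues from each predicted complex (for example by a distance cutoff) is assumed to have been done beforehand.

```python
from collections import Counter


def consensus_interface(models, min_fraction=0.8):
    """Return residues that appear in the interface of at least
    min_fraction of the predicted models."""
    counts = Counter(res for model in models for res in set(model))
    return sorted(res for res, c in counts.items()
                  if c / len(models) >= min_fraction)


# Hypothetical interface residue numbers from five predicted complexes
models = [
    {3, 5, 8, 12},
    {3, 5, 9, 12},
    {3, 5, 8, 12, 14},
    {3, 6, 8, 12},
    {3, 5, 8, 12},
]
print(consensus_interface(models))
```

Residues that appear in only one or two models are dropped, which reduces the influence of any single unreliable prediction.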

Phase 2: Targeted N-Terminal Redesign

Instead of deleting regions or performing extensive random substitutions, introduce controlled chemical modifications to disrupt interaction while preserving structural stability.

Focus on charge inversion strategies:
- Basic residues (K, R) → Acidic residues (E, D)
- Acidic residues (E, D) → Basic residues (K, R)
Disrupt hydrophobic interaction patches:
- Hydrophobic residues (L, I, V, F) → Polar residues (S, T, N, Q)
- Aromatic residues (F, Y, W) → Aliphatic or small residues
Generate a graded library of variants:
- Minor charge modifications
- Moderate interface perturbations
- Strong hydrophobic disruption

This graded library makes it possible to search for a Pareto front of variants, that is, the variants offering the best trade-offs between reduced DnaJ interaction and preserved protein stability.
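
The substitution rules above can be turned into a small variant generator. This is a sketch under stated assumptions: the specific replacement chosen for each residue (for example K to E rather than K to D) is one arbitrary option among those listed, the toy sequence is made up and is not the MS2 L protein, and the function and table names are illustrative.

```python
# One arbitrary replacement per residue, following the rule classes in the plan
RULES = [
    ("charge_inversion", {"K": "E", "R": "D", "E": "K", "D": "R"}),
    ("hydrophobic_to_polar", {"L": "S", "I": "T", "V": "T", "F": "N"}),
    ("aromatic_to_small", {"F": "A", "Y": "A", "W": "A"}),
]


def single_mutants(seq, positions):
    """Generate single-point variants at the given 1-based positions.

    Returns a list of (mutation name, rule name, mutated sequence).
    """
    variants = []
    for pos in positions:
        wt = seq[pos - 1]
        for rule, table in RULES:
            if wt in table:
                mut = table[wt]
                mutated = seq[:pos - 1] + mut + seq[pos:]
                variants.append((f"{wt}{pos}{mut}", rule, mutated))
    return variants


# Hypothetical toy sequence and interface positions for illustration only
toy_seq = "MKLFEG"
library = single_mutants(toy_seq, [2, 3, 4, 5, 6])
for name, rule, mutated in library:
    print(name, rule, mutated)
```

Tagging each variant with the rule that produced it makes it straightforward to group the library into the graded tiers afterwards.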

Phase 3: Stability and Functional Filtering

To ensure that redesigned variants remain structurally viable and functionally relevant:

Use Rosetta or FoldX to calculate ΔΔG and verify that mutations do not destabilize the overall protein fold.
Confirm that mutations in the N-terminal region do not propagate structural stress toward the C-terminal lysis domain.
Perform co-evolutionary analysis (e.g., EVcouplings):
- Identify residue pairs that co-evolved between the N-terminal and C-terminal regions.
- Avoid mutating co-evolved residues independently to prevent functional disruption.
Evaluate aggregation propensity using tools such as Aggrescan3D to ensure that mutations do not create exposed hydrophobic patches leading to cytoplasmic aggregation.
Assess sequence plausibility using protein language models such as ESM to filter out unlikely or non-natural variants.
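
Once each variant has a predicted effect on DnaJ binding and on fold stability, the variants worth testing are those not outperformed on both criteria at once. The sketch below assumes a sign convention that is not fixed by the plan: a higher binding ΔΔG means weaker DnaJ binding (desired) and a lower stability ΔΔG means less destabilization (desired). Conventions differ between tools, and the scores shown are invented placeholders.

```python
def pareto_front(variants):
    """Return names of variants not dominated by any other variant.

    variants: list of (name, binding_ddg, stability_ddg).
    Assumed convention: maximize binding_ddg, minimize stability_ddg.
    """
    front = []
    for name, bind, stab in variants:
        dominated = any(
            (b >= bind and s <= stab) and (b > bind or s < stab)
            for _, b, s in variants
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical scores for illustration only
candidates = [
    ("v1", 0.5, 0.1),
    ("v2", 1.5, 0.4),
    ("v3", 1.0, 0.9),
    ("v4", 2.5, 1.8),
    ("v5", 0.4, 0.3),
]
print(pareto_front(candidates))
```

Aggregation and language-model filters could be applied before this step as hard cutoffs, leaving the Pareto selection to handle only the binding and stability trade-off.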

Key Limitations

The DnaJ binding mode may be transient or dynamic, reducing AlphaFold-Multimer accuracy.
Protein language model scores do not guarantee in vivo functionality.
Intrinsically disordered regions may not be accurately modeled.
Computational predictions must ultimately be validated experimentally.