Week 4 HW: Protein Design Part 1

Part A. Conceptual Questions

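How many molecules of amino acids do you take with a piece of 500 grams of meat? (On average an amino acid is ~100 Daltons) Use moles = mass / molar mass, then multiply by Avogadro’s number: 500 g ÷ 100 g/mol = 5 mol, and 5 mol × 6.022 × 10²³ mol⁻¹ ≈ 3 × 10²⁴ molecules. This treats the whole 500 g as amino acids; since meat is only about 20% protein by mass, a more realistic figure is closer to 6 × 10²³.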
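Why do humans eat beef but do not become a cow, eat fish but do not become fish? Digestion breaks dietary proteins down into individual amino acids, which carry no memory of the animal they came from. Our cells then reassemble them into human proteins following the instructions in our own genome.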
Why are there only 20 natural amino acids? This is due to evolution and chemical stability; these 20 cover the needed protein functions efficiently.
Can you make other non-natural amino acids? Design some new amino acids. Yes. Chemists can synthesize non-natural amino acids with new side chains for special functions, for example side chains bearing azide, fluorinated, or photo-crosslinking groups.
Where did amino acids come from before enzymes that make them, and before life started? They most likely came from prebiotic chemistry: reactions in early Earth’s atmosphere (as in the Miller-Urey experiment), delivery by meteorites, or synthesis at hydrothermal vents.
If you make an α-helix using D-amino acids, what handedness (right or left) would you expect? By the rules of stereochemistry, a D-amino acid helix forms a left-handed α-helix.
Can you discover additional helices in proteins? Yes, protein structures keep revealing new helical motifs beyond the standard α-helix and 3₁₀-helix.
Why are most molecular helices right-handed? Right-handed helices are more stable with L-amino acids due to steric and energetic favorability.
Why do β-sheets tend to aggregate? Because of hydrogen bonding between sheets and flat surfaces that stick together.
What is the driving force for β-sheet aggregation? Hydrogen bonds and hydrophobic interactions drive the sheets to stack.
Why do many amyloid diseases form β-sheets? Misfolded proteins form stable β-sheet aggregates that resist degradation.
Can you use amyloid β-sheets as materials? Yes, they can be used in nanomaterials, fibers, and hydrogels due to their stability.
Design a β-sheet motif that forms a well-ordered structure. Use alternating hydrophobic and polar residues to promote sheet stacking with hydrogen bonds and prevent disorder.
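The estimate in the first question (moles = mass / molar mass, then multiply by Avogadro’s number) can be checked with a short calculation. This is only a sketch: the function name and the `protein_fraction` parameter are illustrative, and the 20% protein content used in the second call is a rough assumption, not a measured value.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def amino_acid_count(mass_g, residue_mass_da=100.0, protein_fraction=1.0):
    """Estimate how many amino acid molecules a sample contains.

    mass_g           -- sample mass in grams
    residue_mass_da  -- average amino acid mass (~100 Da, as given in the question)
    protein_fraction -- assumed fraction of the mass that is protein
    """
    moles = mass_g * protein_fraction / residue_mass_da
    return moles * AVOGADRO


# Upper bound: treat all 500 g as amino acids.
all_protein = amino_acid_count(500)

# Rougher but more realistic: assume ~20% of meat is protein (an assumption).
realistic = amino_acid_count(500, protein_fraction=0.2)

print(f"all protein: {all_protein:.2e} molecules")
print(f"20% protein: {realistic:.2e} molecules")
```

Either way the answer is on the order of 10²⁴ molecules, so the order-of-magnitude conclusion does not depend much on the protein-content assumption.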

Part B: Protein Analysis and Visualization

I selected PDB structure 1ZHI, the human blood group A enzyme (glycosyltransferase A), because of its very high resolution.

Protein Analysis: ABO Glycosyltransferase

Protein Selection

Protein Name: ABO glycosyltransferase (Human).

Why I selected it: I chose this protein because of its critical role in human immunohematology. It is the enzyme responsible for determining an individual’s ABO blood type (A, B, AB, or O) by modifying the H antigen on red blood cells; in type O individuals the enzyme is inactive and the H antigen stays unmodified. Understanding its structure is essential for my interests in synthetic cells and blood-related biotechnologies.

Amino Acid Sequence & Homology

Sequence Length: The canonical sequence (Isoform A) is 354 amino acids long.

Most Frequent Amino Acid: Leucine (L) is the most frequent residue, appearing approximately 35-40 times (approx. 10-11% of the total sequence).
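The residue count above can be reproduced from the sequence with a few lines of code. The sketch below is illustrative only: `residue_frequencies` is a made-up helper name, and `toy_sequence` is a short hypothetical peptide, not the ABO sequence. To get the real numbers, paste the sequence from the UniProt entry in its place.

```python
from collections import Counter


def residue_frequencies(sequence):
    """Return (residue, count, percent) tuples, most frequent first."""
    # Drop whitespace/newlines so a pasted multi-line sequence body works.
    seq = "".join(sequence.split()).upper()
    counts = Counter(seq)
    total = len(seq)
    return [(aa, n, 100.0 * n / total) for aa, n in counts.most_common()]


# Hypothetical toy peptide, NOT the ABO sequence.
toy_sequence = "MLLALKLLGAEL"

top_residue, top_count, top_percent = residue_frequencies(toy_sequence)[0]
print(top_residue, top_count, f"{top_percent:.1f}%")
```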

Homologs: Using UniProt’s BLAST tool, I identified over 500 homologs. Significant sequence identity is found in primates (chimpanzees, macaques) and other mammals like pigs and cows, reflecting the evolutionary conservation of the ABO system.

Protein Family: It belongs to the Glycosyltransferase 6 family (GT6).

RCSB Structure Page (PDB ID: 1ZHI)

The structure was published in 2005.

Quality: It is an excellent quality structure with a resolution of 1.50 Å. This is significantly better than the 2.70 Å threshold mentioned in the homework.

Other Molecules: Yes, the solved structure contains UDP (uridine-5’-diphosphate), GalNAc (N-acetylgalactosamine), and manganese ions (Mn²⁺). Mn²⁺ is the essential cofactor for the enzymatic reaction, while UDP and GalNAc correspond to the two halves of the UDP-GalNAc sugar donor.

Structure Classification: It adopts the GT-A fold, a single Rossmann-like α/β domain typical of metal-dependent glycosyltransferases, rather than the two-domain GT-B fold.

Here we can see it:

3D Visualization (PyMOL Analysis)

Visualization Styles:

Cartoon: Shows the overall folding and path of the polypeptide chain.

Ribbon: Highlights the main backbone flow.

Ball and Stick: Useful for seeing the specific orientation of side chains in the active site.

Secondary Structure: After coloring by secondary structure (color red, ss h for helices; color yellow, ss s for sheets), the protein displays a classic Rossmann-like fold: a prominent central core of β-strands surrounded by several α-helices.

Residue Type (Hydrophobicity): When colored by residue type, there is a clear hydrophobic core (non-polar residues tucked inside) and a hydrophilic surface (polar residues facing the solvent). This distribution is typical for a soluble globular domain; in the cell, ABO is a Golgi-resident enzyme, anchored in the membrane with its catalytic domain in the Golgi lumen.

Surface & Binding Pockets: Visualizing the surface (show surface) reveals a deep binding pocket (the active site). This “hole” is where the Manganese ion and the UDP-sugar donor bind to perform the catalytic transfer to the H-antigen.

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

Deep Mutational Scans

Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods. Can you explain any particular pattern? (choose a residue and a mutation that stands out)

A clear pattern is that mutations at conserved residues strongly decrease the model likelihood. For example, when a hydrophobic residue located in the protein core is mutated to a charged amino acid (e.g., Leu → Asp), the likelihood drops significantly, suggesting the mutation would destabilize the structure.
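One common way to turn language model likelihoods into a per-mutation score is a log-likelihood ratio: log p(mutant) minus log p(wild type) at the mutated position. The sketch below shows only that arithmetic. The probabilities in `toy_probs` are invented for illustration and were not produced by ESM2, and the course notebook may compute its scores differently.

```python
import math


def mutation_score(position_probs, wild_type, mutant):
    """Log-likelihood ratio log p(mutant) - log p(wild type) at one position.

    position_probs maps amino acid letters to the probability the model
    assigns at that position. Negative scores mean the model considers
    the mutant less likely than the wild-type residue.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Hypothetical probabilities for one buried position (NOT real ESM2 output).
toy_probs = {"L": 0.60, "I": 0.20, "V": 0.15, "D": 0.05}

leu_to_asp = mutation_score(toy_probs, "L", "D")  # hydrophobic -> charged
leu_to_ile = mutation_score(toy_probs, "L", "I")  # conservative swap

print(f"L->D: {leu_to_asp:.2f}")
print(f"L->I: {leu_to_ile:.2f}")
```

With these toy numbers the Leu → Asp score is more negative than the Leu → Ile score, which is the kind of pattern described above for a core position.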

(Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.

Language model predictions generally correlate with experimental results: mutations predicted with low likelihood often correspond to experimentally observed decreases in stability or function, although some discrepancies appear for mutations affecting specific functional interactions.

Latent Space Analysis

Use the provided sequence dataset to embed proteins in reduced dimensionality. Analyze the different formed neighborhoods: do they approximate similar proteins?

Yes. The neighborhoods tend to cluster proteins belonging to similar families or with similar functions, suggesting that the embeddings capture meaningful evolutionary and structural information.

Place your protein in the resulting map and explain its position and similarity to its neighbors.

The ABO protein appears close to other glycosyltransferases in the embedding space. This is consistent with its biological role in transferring sugar residues during blood group antigen synthesis, indicating that the model groups it with proteins performing similar enzymatic functions.
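To make "close in the embedding space" concrete, neighbors can be ranked by cosine similarity between embedding vectors. The following is a minimal sketch under stated assumptions: the two-dimensional vectors and the protein names in `toy_embeddings` are invented placeholders, not coordinates from the actual map, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=2):
    """Return the names of the k embeddings most similar to the query."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Invented 2-D coordinates standing in for reduced embeddings.
toy_embeddings = {
    "glycosyltransferase_1": (0.9, 0.1),
    "glycosyltransferase_2": (0.8, 0.3),
    "kinase_1": (0.1, 0.9),
}
toy_query = (1.0, 0.2)  # placeholder for the protein of interest

neighbors = nearest_neighbors(toy_query, toy_embeddings)
print(neighbors)
```

Distances measured after a nonlinear projection such as t-SNE or UMAP are only approximate, so a check like this is more reliable on the original high-dimensional embeddings.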

C2. Protein Folding

Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

The predicted structure from ESMFold largely matches the original experimental structure. The core structural elements align well, while minor differences appear mainly in flexible loop regions.
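A statement like "largely matches" is usually backed by a number such as the RMSD (root-mean-square deviation) between matched Cα atoms. The sketch below computes RMSD for two coordinate lists that are assumed to be already superimposed and in the same residue order; it does not perform the superposition itself. The coordinates are toy values, not taken from either structure.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the two structures are already superimposed and that point i
    in coords_a corresponds to point i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy C-alpha traces in angstroms: the second is shifted by 1 A along z.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]

toy_rmsd = rmsd(experimental, predicted)
print(f"RMSD = {toy_rmsd:.2f} A")
```

Reporting RMSD separately for the core and for loop regions would make the observation about flexible loops quantitative.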

Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

Single point mutations usually do not significantly alter the global fold. However, larger sequence changes or mutations in core structural regions can destabilize the fold and lead to noticeable structural deviations.

C3. Protein Generation

Inverse-Folding a protein

Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

Many positions show high probability for amino acids similar to the original sequence, especially in structurally important regions. Surface positions tend to show higher variability, which is expected because they are less constrained structurally.
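The comparison between designed and original sequences can be summarized as percent identity plus a list of changed positions. This is a minimal sketch that assumes both sequences have the same length and need no alignment, as is the case for fixed-backbone design. The two sequences are short hypothetical examples, not ProteinMPNN output, and the function names are illustrative.

```python
def sequence_identity(original, designed):
    """Percent of positions with the same residue in both sequences."""
    if len(original) != len(designed) or not original:
        raise ValueError("sequences must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(original, designed))
    return 100.0 * matches / len(original)


def differing_positions(original, designed):
    """List (1-based position, original residue, designed residue) for mismatches."""
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(original, designed))
        if a != b
    ]


# Hypothetical 10-residue sequences for illustration only.
toy_original = "MKTLLVAGDE"
toy_designed = "MRTLLVAGEE"

identity = sequence_identity(toy_original, toy_designed)
changes = differing_positions(toy_original, toy_designed)
print(f"identity = {identity:.1f}%")
print(changes)
```

Mapping the changed positions back onto the structure shows whether the substitutions fall on the surface, as expected, or in the core.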

Input this sequence into ESMFold and compare the predicted structure to your original.

When the generated sequence is folded with ESMFold, the predicted structure remains highly similar to the original backbone, indicating that the designed sequence is compatible with the same structural fold.