Week 04 HW: Protein Design Part 1

Part A. Conceptual Questions (the responses relied heavily on Google)

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
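It depends on the meat. Assuming red meat is roughly 22.5% protein by mass, 500 g contains about 112.5 g of protein, or about 1.1 mol of amino acids at ~100 Da each. That is about 6.8e23 amino acid molecules.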
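The estimate above can be checked with a short calculation. This is a minimal sketch: the 22.5% protein fraction is an assumed value for red meat, and the function name is made up for illustration.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_molecules(meat_grams, protein_fraction, avg_residue_da=100.0):
    """Estimate how many amino acid molecules a piece of meat contains.

    protein_fraction is the assumed protein content by mass (0 to 1).
    avg_residue_da is the average amino acid mass; 1 Da corresponds to 1 g/mol.
    """
    protein_grams = meat_grams * protein_fraction
    moles = protein_grams / avg_residue_da
    return moles * AVOGADRO


# Assumption: red meat is about 22.5% protein by mass.
estimate = amino_acid_molecules(500, 0.225)
print(f"{estimate:.2e} amino acid molecules")
```

Changing `protein_fraction` shows how strongly the answer depends on the type of meat.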
Why do humans eat beef but do not become a cow, eat fish but do not become fish?
The human digestive system and immune system break down foreign proteins and DNA, so the food's DNA is not incorporated into the human genome. The body uses the resulting basic nutrients (amino acids, nucleotides) to build human cells according to its own genetic instructions rather than adopting the food’s genetic structure.
Why are there only 20 natural amino acids?
The 20 natural amino acids are the result of evolution, which has optimized them to create stable, functional, and soluble proteins. In this way, 20 suffices to create all necessary proteins, offering resistance against mutations.
Can you make other non-natural amino acids? Design some new amino acids.
Yes. For example, photo-crosslinkers such as 4-benzoyl-L-phenylalanine (BPA) are used to map protein-protein interactions by forming covalent bonds upon exposure to UV light.
Where did amino acids come from before enzymes that make them, and before life started?
Likely through non-biological, prebiotic chemical reactions driven by energy sources like UV light, lightning, or hydrothermal heat acting on simple atmospheric gases.
If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
A left-handed helix. Natural proteins made of L-amino acids form right-handed α-helices. D-amino acids are mirror images of L-amino acids, so their stable secondary structures form the corresponding mirror-image (left-handed) helix.
Can you discover additional helices in proteins?
Yes, at least in design contexts. Researchers have found that modifying the ends of a peptide can create or stabilize specific helices for functional studies. One example is HBS (Hydrogen Bond Surrogate) helices, in which an N-terminal hydrogen bond is replaced with a covalent link to increase stability. These can act as a “new” type of stable helical structure.
Why are most molecular helices right-handed?
Because this structure is energetically more stable and allows for less steric hindrance between amino acid side chains or bases.
Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?
Because of their inherent structural propensity to form extensive intermolecular hydrogen bonds. The driving force for this aggregation is a combination of hydrophobic effects, hydrogen bonding, and, most importantly, the dehydration of pre-formed intramolecular hydrogen bonds, which triggers the burial of non-polar groups and the formation of highly stable “cross-β” structures.

Part B: Protein Analysis and Visualization

Briefly describe the protein you selected and why you selected it.
The protein I chose was Nitrogenase iron protein (NifH). Nitrogenase is the enzyme complex responsible for converting atmospheric nitrogen (N₂) into ammonia (NH₃), which plants can use. Because nitrogen availability limits crop productivity in cold climates, improving the efficiency of nitrogen fixation in engineered soil microbes is directly relevant to my proposed final project.
https://www.uniprot.org/uniprotkb/C1DGZ6/entry#sequences
Identify the amino acid sequence of your protein.

-How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

290 amino acids. Most frequent: E: 29 (10.00%); A: 28 (9.66%); G: 28 (9.66%)
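
The counts above came from the Colab notebook. A minimal sketch of the same kind of frequency count is shown below; the function name and the toy sequence are hypothetical and are not the NifH sequence.

```python
from collections import Counter


def residue_frequencies(sequence):
    """Return (residue, count, percent) tuples, most frequent first."""
    # Remove whitespace and line breaks that come with pasted FASTA bodies.
    seq = "".join(sequence.split()).upper()
    if not seq:
        return []
    counts = Counter(seq)
    return [(aa, n, 100.0 * n / len(seq)) for aa, n in counts.most_common()]


# Hypothetical toy sequence, used only to show the output format.
toy_sequence = "MAEEGAAGKE"
for aa, n, pct in residue_frequencies(toy_sequence)[:3]:
    print(f"{aa}: {n} ({pct:.2f}%)")
```

Pasting the UniProt sequence in place of `toy_sequence` should give counts in the same format as the answer above.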

-How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

-Does your protein belong to any protein family?

NifH/BchL/ChlL family.

Identify the structure page of your protein in RCSB

-When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)

Moderate resolution (2.9 Å). PDB entry 1NIP: https://www.rcsb.org/structure/1NIP#entity-1

-Are there any other molecules in the solved structure apart from protein?

ADP or ATP analogs

-Does your protein belong to any structure classification family?

NifH/FrxC-like; SCOP ID: 4003981; PF00142

Open the structure of your protein in any 3D molecule visualization software:

-PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands). Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.

-Color the protein by secondary structure. Does it have more helices or sheets?

A mix of alpha helices and beta sheets.

-Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

Hydrophobic residues cluster in the protein core.

-Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Yes, it does.

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

Deep Mutational Scans: Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.

-Can you explain any particular pattern? (choose a residue and a mutation that stands out)

The deep mutational scan reveals strong intolerance to mutation at conserved cysteine residues.
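
One common way to score a language-model mutational scan is the log-likelihood ratio between the mutant and the wild-type residue at each position. The sketch below assumes that scoring (the notebook's exact method is not stated here) and uses made-up probabilities, not real ESM2 output.

```python
import math


def mutation_scores(position_probs, wild_type):
    """Score each substitution as log p(mutant) - log p(wild type).

    position_probs maps residue letters to model probabilities at one position.
    Strongly negative scores mean the model disfavors the substitution.
    """
    wt_logp = math.log(position_probs[wild_type])
    return {
        aa: math.log(p) - wt_logp
        for aa, p in position_probs.items()
        if aa != wild_type
    }


# Hypothetical probabilities at a conserved cysteine position.
toy_probs = {"C": 0.90, "S": 0.05, "A": 0.03, "G": 0.02}
for aa, score in sorted(mutation_scores(toy_probs, "C").items()):
    print(f"C -> {aa}: {score:.2f}")
```

A position where every substitution scores well below zero, as in this toy cysteine example, is what an intolerant row looks like in the heatmap.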

Latent Space Analysis: Use the provided sequence dataset to embed proteins in reduced dimensionality. Analyze the different formed neighborhoods: do they approximate similar proteins?

NifH is positioned centrally within the nitrogenase cluster.

C2. Protein Folding

-Folding a protein: Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

Largely yes. The predicted coordinates match the original structure, with small deviations.
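
How closely two structures match is usually reported as an RMSD over matched atoms. The sketch below assumes the two coordinate sets are already superposed and list the same Cα atoms in the same order; it does not perform the alignment itself, and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates in angstroms, for illustration only.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (4.3, 0.0, 0.0), (8.1, 0.0, 0.0)]
print(f"RMSD: {rmsd(experimental, predicted):.2f} A")
```

In practice a structure viewer or alignment tool would superpose the ESMFold model onto the PDB structure first and report this value.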

-Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

The predicted structure may look different, especially after large-segment changes or mutations to critical residues. However, the function may not be fundamentally shifted by small changes in structure.

C3. Protein Generation

-Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one. Input this sequence into ESMFold and compare the predicted structure to your original.

The structure resembles the original.
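
The sequence comparison can be summarized as sequence recovery: the fraction of positions where the designed residue equals the original one. The sketch below assumes both sequences have the same length, as they do when designing on a fixed backbone; the example sequences are hypothetical.

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue matches the native one."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and equal in length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Hypothetical five-residue sequences, for illustration only.
native = "MKILA"
designed = "MKVLA"
print(f"Recovery: {sequence_recovery(designed, native):.0%}")
```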

Part D. Group Brainstorm on Bacteriophage Engineering

I would like to choose the goal of increased stability. This is because lysis proteins are often small, membrane-associated, and partially disordered, and can be unstable when expressed at high levels.

One tool to use would be ESMFold, which predicts the 3D structure of the L protein; stability engineering requires structural context. Another tool would be the ESM2 language model, used to perform single-residue mutational scanning. A potential pitfall is the stability vs. function tradeoff, as mutations that stabilize the fold may reduce lytic activity.

Input: L protein sequence → ESMFold: predict structure to identify unstable regions (low pLDDT) → ESM2 deep mutational scan to select tolerated stabilizing mutations → ESMFold re-prediction of mutants to rank variants for experimental validation
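
The first step of this pipeline, flagging low-pLDDT regions, could be sketched as follows. The cutoff of 70 and the minimum segment length of 3 are assumptions chosen for illustration, and the per-residue scores are made up rather than real ESMFold output.

```python
def low_confidence_segments(plddt, threshold=70.0, min_length=3):
    """Return (start, end) residue ranges, 1-based and inclusive, where
    pLDDT stays below the threshold for at least min_length residues."""
    segments = []
    start = None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i  # a low-confidence run begins here
        elif start is not None:
            if i - start >= min_length:
                segments.append((start + 1, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(plddt) - start >= min_length:
        segments.append((start + 1, len(plddt)))
    return segments


# Hypothetical per-residue pLDDT values, for illustration only.
toy_plddt = [90, 85, 60, 55, 50, 88, 92, 65, 60]
print(low_confidence_segments(toy_plddt))
```

The flagged ranges would then define which positions to prioritize in the ESM2 mutational scan.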