Week-04-hw-protein-design-part-1

Homework: Protein Design I

Part A. Conceptual Questions

1. Why are there only 20 natural amino acids?

The set of 20 natural amino acids is thought to have been selected very early, around the time of the RNA world (roughly 4 billion years ago), as a set that covers the chemistry proteins need. Once the genetic code was established it became effectively frozen: changing a codon assignment would alter every protein that uses that codon, and the limits of tRNA recognition prevented further expansion.

2. Where did amino acids come from before enzymes that make them, and before life started?

Amino acids were formed by abiotic processes on the early Earth (about 4.5 billion years ago) from the gases, minerals, and energy sources present at that time.

The Miller-Urey experiment simulated such an environment and produced glycine, alanine, and a range of other amino acids (later reanalyses of the archived samples reported more than 20 in total) through condensation reactions in a reducing atmosphere.

3. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

L-amino acids form right-handed α-helices because their chirality favours this twist, which avoids steric clashes involving the side chains. D-amino acids are the mirror images of L-amino acids, so a helix built from them should be left-handed for the same steric reason.

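4. Can you discover additional helices in proteins?

Yes. Besides the α-helix, proteins contain other helix types such as 3₁₀-helices, π-helices, and polyproline II (PPII) helices. The 3₁₀- and π-helices are defined by their backbone hydrogen-bonding patterns (i→i+3 and i→i+5 respectively, compared with i→i+4 in the α-helix), while the PPII helix has no hydrogen bonds within the helix and is favoured by proline-rich sequences.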

5. Why are most molecular helices right-handed?

Most molecular helices in biology are right-handed because their building blocks are homochiral: proteins are built from L-amino acids and nucleic acids from D-sugars. With these building blocks, the right-handed twist is sterically favoured, which improves stability and folding efficiency.

6. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

β-sheets aggregate because the strands at their edges carry unsatisfied hydrogen-bond donors and acceptors, which promote edge-to-edge interactions with other sheets or with unfolded chains. Hydrophobic side chains at the edges also prefer to be buried, which favours intermolecular contacts that extend the sheets into fibrils or amyloids.

The primary driving force for β-sheet aggregation is thermodynamic. Hydrogen bonds and van der Waals forces lower the free energy, and cooperativity, beginning with dimerization, lowers it further. Aggregation also occurs when hydrophobic residues bury themselves in a compact core. This “collapse” reduces the solvent-exposed area and yields an entropy gain from the released water molecules.

7. Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

Many amyloid diseases arise when proteins misfold, adopt a β-sheet conformation, and then self-assemble into insoluble fibrils. The native structure is destabilized first; partial unfolding then exposes β-strand regions that stack through hydrogen bonding into a cross-β architecture. The resulting fibrils are highly ordered parallel or antiparallel β-sheets that aggregate in a prion-like manner, forming plaques that disrupt tissue function in conditions such as Alzheimer’s disease and type II diabetes.

Yes, amyloid β-sheets can be used as biomaterials because of their exceptional mechanical strength, biocompatibility, and nanoscale self-assembly. Non-pathogenic or engineered amyloid fibrils form robust scaffolds for tissue engineering, drug delivery, and biosensors, and they mimic the extracellular matrix to support cell adhesion and growth. They can also be fabricated into bioplastics, hydrogels, and functional coatings, with properties tuned by genetic modification or by hybridization with nanoparticles.

8. Can you make other non-natural amino acids? Design some new amino acids.

Yes, non-natural amino acids can be made.

A new amino acid can be designed by modifying the side chain, for example by methylation, by adding another functional group, or by replacing it with a bulkier side chain. Advantages: green, selective. Challenges: low yield, stability issues.

Part B: Protein Analysis and Visualization

Briefly describe the protein you selected and why you selected it.

mCardinal is the far-red fluorescent protein I have chosen. It is a bright, monomeric protein derived from Entacmaea quadricolor, with an emission peak at 659 nm.

I chose it because its excitation at 604 nm and emission at 659 nm fall in the far-red range, which is optimal for deep-tissue penetration. It is reported to be brighter than mKate2 and other early-generation far-red variants. Because it is monomeric, it minimizes toxicity and can be fused as a tag to target proteins without causing aggregation. It is also highly photostable, so it can be used for long-term imaging.

How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

The protein is 268 amino acids long. The most common amino acid is glycine (G), which occurs 25 times.
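The length and residue counts can be reproduced with a few lines of Python. This is a minimal sketch, not the course notebook's code; the function name is made up for illustration, and the toy sequence is a placeholder rather than the mCardinal sequence.

```python
from collections import Counter


def amino_acid_frequencies(sequence: str) -> list[tuple[str, int]]:
    """Return (residue, count) pairs sorted from most to least frequent."""
    # Drop whitespace and normalise case so a pasted FASTA body works as-is.
    cleaned = "".join(sequence.split()).upper()
    return Counter(cleaned).most_common()


# Toy sequence for illustration only; this is NOT the mCardinal sequence.
toy = "MGGSGAGKLG"
counts = amino_acid_frequencies(toy)
print("length:", len(toy))
print("most frequent residue and count:", counts[0])
```

Pasting the real sequence (without the FASTA header line) in place of `toy` gives the length and the most frequent residue directly.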

How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

It has many homologs, some of which are uncharacterised proteins. Most of the homologs belong to the red fluorescent protein family.

Does your protein belong to any protein family?

mCardinal belongs to the GFP-like protein family (specifically, the green fluorescent protein superfamily).

Identify the structure page of your protein in RCSB. When was the structure solved? Is it a good quality structure? A good quality structure is one with good resolution; the smaller the better (e.g. Resolution: 2.70 Å).

The structure was solved in 2014. It is a good quality structure, with a resolution of 2.21 Å.

Are there any other molecules in the solved structure apart from protein?

No.

Does your protein belong to any structure classification family?

It belongs to the family of fluorescent proteins.

Open the structure of your protein in any 3D molecule visualization software, such as PyMOL (hint: ChatGPT is good at PyMOL commands).

Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.

Color the protein by secondary structure. Does it have more helices or sheets?

It has more sheets. Helices are coloured red, sheets yellow, and loops green.
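The visual impression can be cross-checked by counting residues in the HELIX and SHEET records of the structure file. The sketch below assumes the legacy fixed-column PDB format (not mmCIF) and that those records are present in the downloaded file; the function name is hypothetical. It sums over all chains, and a strand that is listed in more than one sheet is counted more than once.

```python
def count_secondary_structure(pdb_text: str) -> dict[str, int]:
    """Count residues covered by HELIX and SHEET records in legacy PDB text."""
    totals = {"helix": 0, "sheet": 0}
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "HELIX":
            # HELIX: start residue number in columns 22-25, end in 34-37.
            start, end = int(line[21:25]), int(line[33:37])
            totals["helix"] += end - start + 1
        elif record == "SHEET":
            # SHEET: start residue number in columns 23-26, end in 34-37.
            start, end = int(line[22:26]), int(line[33:37])
            totals["sheet"] += end - start + 1
    return totals
```

Calling `count_secondary_structure(open("structure.pdb").read())` on a downloaded file (the file name here is a placeholder) returns the two totals, which can then be compared.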

Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

Hydrophobic residues are coloured yellow and hydrophilic residues gray. The colouring shows that the hydrophilic residues lie mostly on the outer surface of the protein, while the hydrophobic residues are buried in the interior.

Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Part C. Using ML-Based Protein Design Tools

Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.

Choose your favorite protein from the PDB.

I am choosing the mCardinal far red fluorescent protein.

We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

Deep Mutational Scans

Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.

Can you explain any particular pattern? (choose a residue and a mutation that stands out)
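One common way to turn language model likelihoods into a deep mutational scan is to score each substitution by the log-likelihood ratio between the mutant and the wild-type residue at that position. The sketch below assumes the per-position log-probabilities have already been obtained from the model; the `position_log_probs` input and the function name are hypothetical, and the course notebook may compute its scores differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutation_scores(sequence, position_log_probs):
    """Score every substitution as log p(mutant) - log p(wild type).

    position_log_probs[i] is a dict mapping each amino acid letter to the
    model's log-probability at position i (hypothetical input; in practice
    it would come from a masked language model such as ESM2).
    """
    scores = {}
    for i, wild_type in enumerate(sequence):
        log_probs = position_log_probs[i]
        for mutant in AMINO_ACIDS:
            if mutant == wild_type:
                continue
            # Mutation labels use 1-based numbering, e.g. "G2A".
            label = f"{wild_type}{i + 1}{mutant}"
            scores[label] = log_probs[mutant] - log_probs[wild_type]
    return scores
```

A strongly negative score means the model considers the mutant much less likely than the wild type at that position, which is the kind of residue worth singling out when explaining a pattern in the scan.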

Latent Space Analysis

Use the provided sequence dataset to embed proteins in reduced dimensionality.

Analyze the different formed neighborhoods: do they approximate similar proteins?

Place your protein in the resulting map and explain its position and similarity to its neighbors.
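To describe a protein's position relative to its neighbors, it can help to rank the other proteins by similarity to it in embedding space. The following is a minimal sketch using cosine similarity on plain Python lists. The `embeddings` dictionary, its keys, and the function names are hypothetical, and the ranking is computed on whatever vectors are passed in, which may differ from distances seen in a 2D reduced map.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, k=3):
    """Return the k (name, similarity) pairs most similar to the query."""
    ranked = sorted(
        ((name, cosine_similarity(query, vec)) for name, vec in embeddings.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]
```

If the top-ranked neighbors are other fluorescent proteins, that supports the idea that the neighborhoods group similar proteins.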

Protein Folding

Folding a protein

Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
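One way to quantify how well predicted coordinates match the original structure without superimposing them first is the distance-matrix RMSD (dRMSD), which compares all pairwise distances within each structure. The sketch below is an illustration under assumptions, not the notebook's method: it expects two lists of matching Cα coordinates in the same residue order, so residues missing from the crystal structure would have to be removed from the prediction beforehand. Because it uses only internal distances, dRMSD cannot distinguish a structure from its mirror image.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b):
    """Distance-matrix RMSD between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must have the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("need at least two atoms")
    squared_diffs = []
    for i, j in combinations(range(len(coords_a)), 2):
        # Compare the same internal distance in both structures.
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        squared_diffs.append((d_a - d_b) ** 2)
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))
```

A value of 0 means all internal distances agree; larger values (in Å, if the coordinates are in Å) mean the predicted fold departs further from the original.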

Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?
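The sequence edits for this experiment can be scripted so that each variant is reproducible. This is a small sketch with hypothetical helper names; it only builds the mutated sequence strings, which would then be passed to the folding step of the notebook.

```python
import re


def apply_point_mutation(sequence: str, mutation: str) -> str:
    """Apply a mutation written like 'G25A' (wild type, 1-based position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild_type, mutant = match.group(1), match.group(3)
    index = int(match.group(2)) - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position out of range in {mutation!r}")
    # Guard against numbering mistakes: the stated wild type must match.
    if sequence[index] != wild_type:
        raise ValueError(f"expected {wild_type} at that position, found {sequence[index]}")
    return sequence[:index] + mutant + sequence[index + 1:]


def replace_segment(sequence: str, start: int, end: int, new_segment: str) -> str:
    """Replace residues start..end (1-based, inclusive) with new_segment."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("segment out of range")
    return sequence[:start - 1] + new_segment + sequence[end:]
```

Starting with single substitutions and then replacing progressively longer segments gives a simple series for testing how far the predicted fold tolerates changes.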