Week 4 HW: Protein Design Part I

This page tackles all homeworks of week 4.

Part A. Conceptual Questions

QuestionsAnswers
1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)Considering the average meat contains 25% of proteins by weight[1], and proteins are approximately 100% composed of amino acids, we need to find the number of amino acids present in 125 grams. As it is already provided that an average amino acid is about 100 Daltons, the estimated number of amino acids is equal to:

125 g 100 Dalton = 125 g 100 g/mol = 1.25 mol × 6.022 × 1023 mol-1 ≈ 7.53e23 molecules
2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?The human digestive system breaks down all raw materials into its basic forms (for example proteins are broken down into the amino acids) and then these are used for the body's own processes. If the proteins are somehow magically ingested as is, in their same form in the original organism and those proteins somehow get inside a living human cell, then there could be some other issues related to creation of new pathways which are similar to what the protein was used for in the original organism and this could have unintended consequences ultimately killing the carnivore. Nucleic Acids (constituents of DNA/RNA) are also broken down in our digestive system preventing any possibility of incorporating the external DNA fragments within our cells; in the case of a leaky-gut when some protein or DNA fragments might enter the body, the immune system responds appropriately. Thus, the digestive system acts as an information shredder, passing on raw memory-less ingredients to the host.
3. Why are there only 20 natural amino acids?There are >500 amino acids that occur naturally, but only 22 of them are expressed through living genes.[2] Why only 22 seems a tough question for evolutionists, my hunch is that even if life started with more than 500 types of animo acids being expressed, there could be eating preferences which could have led to the specific pathways (which we cuurently have) being reinforced while the other pathways disappeared slowly; or, the origin of life that posibly evolved in a pond-cluster started with these 22 essentials while the other pond-clusters didn't make it to large scale organisms. Solid proof can only be found upon trying to replicate a fast-tracked evolutionary process using all pathways consisting of 500 AAs and then observing if the evolution converges to the same 22 AAs.
4. Can you make other non-natural amino acids? Design some new amino acids.As per the definition of Amino Acids, they must contain Amino groups and Carboxylic Acids (and an alpha carbon connecting both groups; research about beta and gamma AAs are also worthwhile). So it should be pretty simple to design a new AA as per this definition. An easy way to generate new AA's (so that they are unique) is to find the heaviest AA designed till date and add another carbon atom somewhere (or at the other end); this is a general trick to keep on extending artificial AAs, but instability issues should be kept in mind. Another important reason for the limit of 22 AAs could be their size, which allows them to pass through the cellular membranes via AA-transporter proteins (larger/heavier AAs may have higher resistance to do that). Further just having a new AA wont matter much unless a new Aminoacyl-tRNA synthetase (aaRS, essential enzymes that attach specific amino acids to their corresponding tRNAs) is designed; humans have 20 different types of aaRS for attaching the 20 standard-essential AAs to their respective tRNAs. Worth Mentioning: Simply adding single carbons can be often ignored by ribosomes, and the translation machinery, better to add larger groups; in this regard adding Phosphoserine analogs with non-hydrolyzable bonds are used to study signaling.
5. Where did amino acids come from before enzymes that make them, and before life started?Before enzymes and life existed, amino acids formed spontaneously through natural geochemical and atmospheric processes. They were synthesised on primitive Earth and supposedly also delivered from space. The water reservoirs that they accumulated into form the raw chemical building blocks that sparked life are also known as The Primordial Soup.
6. If you make an Ξ±-helix using D-amino acids, what handedness (right or left) would you expect?D-amino acids are exact mirror images of natural L-amino acids and thus they form exact mirror-image structures. Natural proteins made of L-amino acids form right-handed helices; therefore, reversing the chirality of the building blocks reverses the handedness of the helix. The answer is therefore LEFT.
7. Can you discover additional helices in proteins?Proteins possess several other structures apart from the well-known alpha helices and beta sheets because the amino acid backbone allows for various bending angles. While alpha helices and beta sheets are the most common repeating patterns of secondary structure, others exist and serve critical functions:
  • – Beta Turns and Loops: Short, non-repetitive segments connecting alpha helices and beta sheets. They allow the protein chain to fold back on itself, giving the protein its compact, 3D shape.
  • – Random Coils: Irregular, unordered stretches of the polypeptide chain. Unlike helices or sheets, they don't have a stable, repeating geometry and are highly flexible.
  • – Other Helices: Less common structures like the 310-helix or π-helix, which are tighter and sometimes found at the ends of alpha helices
8. Why are most molecular helices right-handed?In nature, all amino acids found are of the chirality type L, and therefore, they form right-handed curls. A possibility regarding this is that the first life from the primordial soup was of the L-type and that subsequent superstructures (i.e., proteins) took away much of the intermediate raw materials that could have been used to form D-type amino acids. Thus, D-type structures died in the primordial evolutiona nd we don't see them anymore, except when artificially created in labs. However, this does not directly imply that the L-amino acids physically cannot form left-handed helices. Left-handed L-helices are physically possible and L-amino acids can theoretically fold into left-handed alpha helices; but the issue is steric hindrance: In a left-handed helix made of L-amino acids, the bulky side chains (Rgroups) are forced too close to the backbone atoms. This creates severe physical crowding (steric clash). Right-handed is energetically cheaper: In a right-handed helix made of L-amino acids, the side chains point outward and away from the backbone, minimizing crowding and maximizing stability.
9. Why do Ξ²-sheets tend to aggregate? What is the driving force for Ξ²-sheet aggregation?Ξ²-sheets tend to aggregate because they possess exposed, unsatisfied hydrogen bonds along their outer edges and highly hydrophobic flat surfaces. This architectural vulnerability drives them to stack or extend infinitely to achieve thermodynamic stability (frequently leading to the formation of pathological amyloid fibrils). The Primary Driving Forces are:
  • – Enthalphy gain via backbone-to-backbone Unsatisfied Edge Hydrogen Bonding
  • – The Hydrophobic Effect and Solvation Entropy: When flat, hydrophobic side chains (e.g., Leu, Ile, Val, Phe) stack face-to-face, trapped, highly ordered water molecules are released back into the bulk solvent. This creates large, concentrated patches of hydrophobic residues (such as Leucine, Isoleucine, and Valine); to escape the surrounding water, these greasy, flat surfaces crash together, driving face-to-face stacking of multiple Ξ²-sheets. This massive increase in solvent entropy is the primary thermodynamic engine driving aggregation.
10. Why do many amyloid diseases form Ξ²-sheets? Can you use amyloid Ξ²-sheets as materials?Amyloid diseases (like Alzheimer's, Parkinson's, and Type 2 Diabetes) do not happen because Ξ²-sheets are inherently toxic, but because the cross-Ξ² sheet architecture is the lowest global thermodynamic energy minimum for almost any polypeptide chain. Almost any protein, if unfolded or destabilized long enough, can convert into a Ξ²-sheet-rich amyloid. Amyloid Ξ²-sheets make extraordinary, high-performance nanomaterials. While biology views them as pathological, material scientists exploit their steel-like mechanical properties, chemical resilience, and self-assembling nature, to create Protective Biofilm Matrices, Conductive Nanowire, Injectable Hydrogels, High-Strength Composites, etc.
11. Design a Ξ²-sheet motif that forms a well-ordered structure.General steps to design a Ξ²-sheet motif:
  • – [Step 1: Length] ──> Choose 8 to 16 amino acids (peptide needs to be long enough to provide sufficient surface area for self-assembly, but short enough to remain highly soluble)
  • – [Step 2: Pattern] ──> Alternate Hydrophobic (X) and Hydrophilic (Y)
  • – [Step 3: Charge] ──> Engineer Electrostatic Complementarity: Alternate + and - charges on the Hydrophilic face to prevent the strands from shifting randomly. The hydrophilic residues/sheets then line up with these opposing (alternating positive and negative) charges locking into a precise, pristine grid of salt bridges.
  • – [Step 4: Termini] ──> Cap the ends (Acetylation / Amidation) to remove raw charges: Natural peptide ends carry a positive charge at the N-terminus and a negative charge at the C-terminus. These raw, terminal charges can disrupt the sheet geometry; they can be neutralised by adding an Acetyl group (Ac-) to the front and an Amide group (-NHβ‚‚) to the back.
The following template of RAD16 is supposedly the Gold-Standard Design Template widely used in biomedical engineering to create stable biomaterials and scaffolds: Ac-R A D A R A D A R A D A R A D A-NHβ‚‚ (Arginine - Alanine - Aspartic Acid - Alanine... repeated)

Part B: Protein Analysis and Visualization

QuestionsAnswers
1. Briefly describe the protein you selected and why you selected it.I selected the protein Ubiquitin for this assignment. Ubiquitin is a small regulatory protein found in eukaryotic cells that plays an important role in protein degradation, signalling, and cellular regulation. It has a well-characterised three-dimensional structure, and many experimentally determined structures are available in the Protein Data Bank (PDB). I initially considered using GB1, but found that its initiator methionine is retained in the mature protein. Since my broader project involves studying retro (reverse) protein sequences, I wanted a protein in which the initial methionine is naturally removed during post-translational processing. In many proteins, the initiator methionine is cleaved by methionine aminopeptidase when the second amino acid has a small side chain. Ubiquitin is therefore a suitable choice because its mature form does not retain the starting methionine, allowing cleaner comparison between the natural sequence and a retro/reverse sequence without introducing an artificial terminal methionine residue. Additionally, ubiquitin is small, structurally stable, and extensively studied, making it convenient for computational and structural analysis.
2. Identify the amino acid sequence of your protein. How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids. How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs. Does your protein belong to any protein family?The protein I selected is Ubiquitin from humans, a highly conserved regulatory protein involved in protein degradation and cellular signalling. Ubiquitin. The mature amino acid sequence of UBIQUITIN is QIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG; it contains 75 amino acids with the most frequent AA being Leucine (L appears 9 times, frequency 11.84%). Lysine (K) is also highly abundant and biologically important because ubiquitin forms polyubiquitin chains through lysine residues.
Ubiquitin has thousands of homologs across eukaryotes because it is one of the most evolutionarily conserved proteins known. Human and yeast ubiquitin, for example, share about 96% sequence identity, differing in only 3 out of 76 amino acids; (these were validated using UniProt BLAST).
Ubiquitin belongs to the Ubiquitin protein family (Pfam accession: PF00240), a family of small regulatory proteins involved in ubiquitination and intracellular protein turnover. Members of this family are small proteins or protein domains that adopt the characteristic ubiquitin fold. Proteins in this family are involved in post-translational modification pathways and intracellular protein turnover.
3. Identify the structure page of your protein in RCSB. When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Γ…). Are there any other molecules in the solved structure apart from protein? Does your protein belong to any structure classification family?The structure page of my protein is available in the Protein Data Bank (RCSB PDB) under the entry 1UBQ titled β€œStructure of ubiquitin refined at 1.8 Γ… resolution.”
The structure was solved in 1987 using X-ray diffraction, with the reported resolution indicating a very high-quality protein structure. In structural biology, lower resolution values correspond to more accurate atomic positions, and structures below about 2.0 Γ… are generally considered excellent quality. Apart from the protein itself, the solved structure also contains water molecules that were resolved in the crystal structure. No additional ligands or cofactors are present in the basic ubiquitin structure.
Ubiquitin belongs to the ubiquitin fold structural family, a highly conserved protein fold characterized by a compact globular structure containing Ξ²-sheets and an Ξ±-helix. It is also classified within the Ubiquitin-like superfamily in structural classification databases such as SCOP and CATH. Members of this structural family are involved in protein regulation, signalling, and intracellular protein turnover.
4. Open the structure of your protein in any 3D molecule visualization software: PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands). Visualize the protein as β€œcartoon”, β€œribbon” and β€œball and stick”. Colour the protein by secondary structure. Does it have more helices or sheets? Colour the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues? Visualize the surface of the protein. Does it have any β€œholes” (aka binding pockets)?I used RCSB's 3D Viewer to visualize ubiquitin in cartoon, ribbon, molecular surface, ball-and-stick, and spacefill representations.
Cartoon with Ball-and-stick representation
Molecules
Molecular surface representation
Secondary structure view
Colored by residue charge
Colored by residue type
Colored by Hydrophobicity
Coloured by Accessible Surface Area
Coloured by Geometry Quality

Ubiquitin contains more Ξ²-sheets than Ξ±-helices (consistent with the characteristic ubiquitin Ξ²-grasp fold). Hydrophobic residues are mainly buried in the interior core, while charged and hydrophilic residues are exposed on the surface. It has one hole (binding pocket). Surface visualization shows a compact globular structure with shallow interaction grooves rather than deep catalytic binding pockets.

Part C. Using ML-Based Protein Design Tools

QuestionsAnswers
C1.1. Deep Mutational Scans. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods. Can you explain any particular pattern? (choose a residue and a mutation that stands out)
Mutation Scan Heatmap
Single Mutation Scan (purple indicates strongly unfavorable mutations relative to the wild-type residue) (download to view interactive HTML)
The first Amino acid mutation is highly unfavourable to anything. However, peculiarly, the second AA mutation to Methionine (M) is extremely favourable. Does that mean that after the second place mutation to M, the first place AA is removed through post-translational modifications; otherwise in this case there will be an AA before the methionine (M) which generally does not happen as M is the start codon.
For the retro protein sequence, the first place's mutation is similarly highly unfavourable but a specific position 40 seems to have many favourable mutations...
Mutation Scan Heatmap
Mutation Scan of the Retro-Protein Sequence (download to view interactive HTML)
C1.2. Latent Space Analysis: Use the provided sequence dataset to embed proteins in reduced dimensionality. Analyze the different formed neighborhoods: do they approximate similar proteins? Place your protein in the resulting map and explain its position and similarity to its neighbours.
Mutation Scan Heatmap
Protein sequuence similarity in latent space representation (download to view interactive HTML)
Protein sequences were first converted into 320-dimensional latent embeddings using the ESM2 protein language model. These embeddings are numerical representations learned from large-scale evolutionary protein sequence data and encode biochemical, structural, and evolutionary features of proteins. Similar proteins tend to occupy nearby regions in this high-dimensional embedding space.
To visualize these embeddings, t-distributed Stochastic Neighbor Embedding (t-SNE) was used to reduce the 320-dimensional vectors into three dimensions. Since only five protein sequences were analyzed, the perplexity parameter was reduced from the default value of 30 to 3 (because t-SNE requires the perplexity value to be smaller than the number of samples). Perplexity approximately controls the effective number of neighboring points considered during dimensionality reduction. Lower perplexity values emphasize local neighborhood relationships between nearby proteins.
The three plotted axes (TSNE1, TSNE2, and TSNE3) do not correspond to specific physical or biochemical quantities such as molecular weight, hydrophobicity, or sequence length. Instead, they are abstract latent coordinates generated by t-SNE to preserve local similarity relationships between the original high-dimensional embeddings. Proteins that appear closer together in the plot are interpreted as having more similar learned sequence representations, while proteins farther apart are considered more dissimilar in the latent embedding space.
In the resulting embedding map, Ubiquitin and the retro-Ubiquitin sequence occupied different positions despite containing the same amino acid composition in reversed order, suggesting that sequence order strongly influences the learned representation of the protein language model and that the retro sequence is interpreted as biologically distinct from native ubiquitin. Other proteins, such as haemoglobin, insulin, and albumin fragments, also occupied separate regions, reflecting their differing sequence patterns and biological functions.
However, t-SNE visualizations should be interpreted qualitatively rather than quantitatively. The distances and cluster sizes in t-SNE plots can vary depending on initialization and parameter choices such as perplexity, and therefore do not represent exact evolutionary or structural distances between proteins.
C2. Protein Folding: Fold your protein with ESMFold. Do the predicted coordinates match your original structure? Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?ESMFold was used to predict the structure of both native ubiquitin and a retro-ubiquitin sequence generated by reversing the amino acid order. The model directly predicted atomic coordinates from sequence without requiring multiple sequence alignments.
The native ubiquitin prediction showed high confidence, with most residues having pLDDT values above 85–90. The PAE heatmap also showed low predicted aligned error across most residue pairs, indicating that the model was confident about the global fold and residue positioning. The predicted structure retained the compact ubiquitin-like fold with well-defined Ξ²-sheets and Ξ±-helices. pLDDT is a per-residue confidence score ranging from 0–100, where higher values indicate more reliable structural predictions.
Ubiquitin: colour=plDDT Confidence
Ubiquitin: plDDT Plot
Ubiquitin: PAE-HeatMap
Retro-Ubiquitin: colour=plDDT Confidence
Retro-Ubiquitin: plDDT Plot
Retro-Ubiquitin: PAE-HeatMap
In contrast, the retro-ubiquitin model showed much lower pLDDT values (~30–55 across most residues), suggesting low confidence and reduced structural stability. The PAE heatmap also displayed much larger predicted errors across residue pairs, indicating uncertainty in the relative arrangement of structural regions. Even though retro-ubiquitin contains the same residues as native ubiquitin, reversing the sequence disrupts the learned structural and evolutionary patterns required for forming the exact original folds.
C3. Protein Generation: Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one. Input this sequence into ESMFold and compare the predicted structure to your original.ProteinMPNN was used to perform inverse folding on the Ubiquitin backbone structure. Instead of predicting structure from sequence, ProteinMPNN predicts amino acid sequences that are compatible with a given protein backbone geometry. The generated heatmap represents the probability of each amino acid occurring at every residue position based on the backbone structure. Bright regions indicate amino acids strongly preferred by the model at specific positions, while diffuse regions represent positions that tolerate greater sequence variability.
ProteinMPNN HeatMap indicating model predictions for a given Backbone Structure
ProteinMPNN HeatMap indicating model predictions for a given Backbone Structure
The designed sequence differed substantially from native Ubiquitin while still remaining compatible with the same backbone fold. The sequence recovery value was ~54%, indicating that roughly half of the residues were conserved relative to the original sequence. Structurally important positions showed stronger conservation, whereas surface or flexible positions tolerated more substitutions. Interestingly, the generated sequence achieved a lower ProteinMPNN score than the native sequence, suggesting that the designed sequence may fit the backbone geometry more favorably (according to the model!). This demonstrates that many different amino acid sequences can potentially encode similar protein folds, highlighting the robustness and degeneracy of protein structural organization.
  Original:       -Q-F-K-LT-K------E-S---ENV-A--QD-E----DQ----FA-KQ-E-GR--S----QKES--H-V-RLR--
  Generated:      -T-Y-E-ED-T------S-D---AEL-K--EE-T----EE----YK-EE-K-DK--A----KEGD--K-E-VPK--
ESMFold structure of protein predicted by MPNN based on Ubiquitin-BackBone
ESMFold structure of protein predicted by MPNN based on Ubiquitin-BackBone

Part D. Group Brainstorm on Bacteriophage Engineering

Since I am solo, I am proposing Recursive Protein Optimization using beam-search over Mutation Space: considering ESM2 likelihoods, conditional mutational exploration, ESMFold structural validation, recursive branching search, and local + global optimization.

  • – The pipeline starts with athe MS2 L protein sequence as input. First, an ESM2 mutational scan is performed to estimate which single-residue mutations are evolutionarily plausible based on protein language model likelihoods.
  • – Instead of selecting only one mutation, the algorithm performs a beam-search style exploration of the top candidate mutations. A parameter termed beam_width controls how many of the highest-scoring mutations are retained at each stage.
  • – For every candidate mutation, ESMFold is used to predict the resulting structure and estimate metrics such as pLDDT confidence, pTM score, and structural similarity relative to both the original protein and the locally mutated parent sequence.
  • – Mutations that improve or preserve structural confidence are retained, while destabilizing mutations are discarded. The retained sequences are then recursively reintroduced into the same pipeline, allowing conditional mutations to accumulate over multiple generations.
  • – This creates a branching mutation graph where each node corresponds to a protein variant and edges correspond to accepted mutations. The search terminates when no further stabilizing or plausible mutations can be identified.
  • – Finally, all terminal branches are compared to identify candidate proteins with improved stability while preserving the original fold topology.

Process Flowchart:

           Protein Sequence
                    β”‚
                    β–Ό
         ESM2 Mutational Scan
                    β”‚
                    β–Ό
       Rank Mutations by Score
                    β”‚
                    β–Ό
        Select Top-k Mutations
          (beam_width = 1–3)
                    β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β–Ό                   β–Ό
   Mutation 1           Mutation 2
          β”‚                   β”‚
          β–Ό                   β–Ό
   ESMFold Prediction   ESMFold Prediction
          β”‚                   β”‚
          β–Ό                   β–Ό
 Evaluate pLDDT / pTM / RMSD / PAE
          β”‚                   β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β–Ό
       Keep Improved Variants
                    β”‚
                    β–Ό
      Recursive Mutational Search
                    β”‚
                    β–Ό
        Build Mutation Graph
                    β”‚
                    β–Ό
      Select Best Final Variant

An Example mutation tree:

Original Protein Sequence
β”‚
β”œβ”€β”€ V12A
β”‚    β”œβ”€β”€ V12A + K18R
β”‚    β”‚      └── V12A + K18R + L27I
β”‚    β”‚
β”‚    └── V12A + S22T
β”‚
β”œβ”€β”€ K18R
β”‚    β”œβ”€β”€ K18R + L27I
β”‚    └── K18R + E31D
β”‚
└── L27I
     └── L27I + S22T

This pipeline takes into consideration condition-dependent mutations and evaluates then to choose the next pathways, and ultimately reach the best stable form. This recursive strategy attempts to approximate epistatic interactions between mutations, where the effect of one mutation depends on the presence of earlier mutations.