Week 4 HW: Protein Design Part I

Part A. Conceptual Questions

  1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

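If the whole 500 g is treated as amino acids, 500 g ÷ 100 g/mol = 5 mol, which is about 3.0 x10^24 molecules of amino acids. Meat is only roughly 20% protein by mass, so a more realistic figure is 100 g ÷ 100 g/mol = 1 mol, or 6.022 x10^23 molecules of amino acids.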
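
As a quick check of this arithmetic, the sketch below converts a mass of meat into an estimated number of amino acid molecules. The 20% protein fraction is an assumed round figure for meat, not a value given in the question, and the function name is only illustrative.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_molecules(mass_g, protein_fraction=1.0, avg_mass_da=100.0):
    """Estimate the number of amino acid molecules in a sample.

    protein_fraction: share of the mass that is protein (an assumption).
    avg_mass_da: average amino acid mass; 1 Da corresponds to 1 g/mol.
    """
    moles = mass_g * protein_fraction / avg_mass_da
    return moles * AVOGADRO


# Whole 500 g treated as amino acids (upper bound).
all_protein = amino_acid_molecules(500)
# Assumed ~20% protein by mass, i.e. 100 g of amino acids.
lean_meat = amino_acid_molecules(500, protein_fraction=0.20)

print(f"{all_protein:.3e}")
print(f"{lean_meat:.3e}")
```

The two numbers bracket the answer: about 3 x10^24 if everything is protein, and about 6 x10^23 under the 20% assumption.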

  1. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Humans digest the proteins in food into individual amino acids, which are the same building blocks whatever animal they came from. Our cells then reassemble these amino acids into human proteins following the instructions in human DNA. Some amino acids are used as an energy source, and the leftover parts are excreted as normal.

  2. Why are there only 20 natural amino acids?

There are only 20 natural amino acids because, approximately 4 billion years ago on early Earth, the available chemistry was limited to C, H, N, O and S atoms, which can form only a limited set of functional groups. The 20 were also selected for their favourable solubility, folding and stability, and for the amount of energy and resources needed to synthesise them (Doig, 2017).

  1. Can you make other non-natural amino acids? Design some new amino acids.

Design considerations I would take:

  • I would like to produce a therapeutic non-natural amino acid (NAA) that extends a drug’s half-life.
  • I would like my amino acid to be the D-isomer, as peptides made of L-amino acids are easily broken down by the body’s proteases.
  • I’ll add a CH3 group to the alpha-carbon to protect it from enzymatic degradation, so that it remains in the bloodstream for longer.
  • I’ll improve membrane permeability to increase the bioavailability and absorption of the amino acid drug, first by creating a prodrug form of it, and also by adding hydrophobic groups.
  • Research on non-natural amino acids shows that replacing hydrogen with fluorine can increase stability in the body without making the amino acid too bulky.

My idea is to target GLP-1, the hormone mimicked by Ozempic, the weight-loss and diabetes drug. Usually, GLP-1 is degraded quickly in the bloodstream by the enzyme DPP-4. I’d like to place an alpha-methylated NAA or a D-amino acid at the DPP-4 cleavage site. By building the peptide with an NAA like alpha-methyl-L-phenylalanine, I would aim to extend the half-life of the drug from minutes towards a week or more, which would make it suitable as a long-term treatment for metabolic disorders (Kohnke & Zhang, 2026).

  1. Where did amino acids come from before enzymes that make them, and before life started?

Amino acids were produced abiotically, before any enzymes existed: gases such as methane and nitrogen could react under lightning (electrical discharge) to produce amino acids. They may also have arrived on meteorites, or formed in reactions taking place in extreme environments like deep-sea hydrothermal vents (Cowing, 2023).

  2. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

It will form a left-handed helix, the mirror image of the right-handed α-helix formed by L-amino acids.

  3. Can you discover additional helices in proteins?

Zhang and Egli (2022) used chemical classification to identify additional helix types in proteins. The classification is based on the composition of the helices: whether they contain hydrophobic or hydrophilic amino acids, and hence how water-soluble they are. Structurally identical α-helices are organised into three chemically distinct types: Type I, hydrophilic α-helix; Type II, hydrophobic α-helix; Type III, amphiphilic α-helix. The QTY code exploits the structural similarities between specific polar and non-polar amino acids to redesign proteins. By replacing hydrophobic residues with structurally similar hydrophilic residues, water-insoluble membrane proteins can be converted into water-soluble variants without losing their initial structure or function. This framework provides a simplified “molecular code” for protein design, comparable to the base-pairing rules of DNA.

  1. Why are most molecular helices right-handed?

With L-amino acids, left-handed helices are less energetically favourable and less stable than right-handed ones. In the left-handed conformation the side chains clash sterically with the peptide backbone, whereas the right-handed conformation avoids these clashes and allows regular hydrogen bonding along the backbone. Life settled on L-amino acids and D-sugars during evolution, and this chirality favours the right-handed conformation. Once folded, a structure can also be trapped in place by an unusually high unfolding barrier, which makes unfolding extremely slow (Manning & Colón, 2004).

  1. Why do β-sheets tend to aggregate?

β-sheets aggregate because hydrophobic side chains stick out from the sheet and the peptide backbone forms intermolecular hydrogen bonds with neighbouring strands. This promotes stacking and increases the stability of the β-sheets, which leads to aggregation.

  • What is the driving force for β-sheet aggregation?

The driving force is intermolecular hydrogen bonding between the peptide backbone of one strand and that of its neighbouring strand, together with the association of protruding hydrophobic side chains. Together these build a highly stable and rigid structure that aggregates.

  1. Why do many amyloid diseases form β-sheets?

In neurodegenerative diseases like Alzheimer’s, Parkinson’s and prion diseases, and in endocrine disorders like Type II diabetes, proteins lose their normal, functional 3D native structure and misfold into β-sheet-rich forms. This leads to aggregation, toxicity and disease, for example the plaques that cause cognitive and memory deficits. The resulting flattened fibrils are extremely insoluble, stable and strong, because β-sheets expose hydrophobic side chains and their peptide backbones hydrogen-bond with neighbouring strands. This insoluble, stable structure resists proteases, so the cell cannot degrade and recycle these proteins.

  • Can you use amyloid β-sheets as materials?

Yes. Because amyloid β-sheets are highly rigid and stable and have a high adsorption capacity, they can be used for bioremediation (capturing water contaminants such as toxins and heavy metals), for producing biodegradable and sustainable bioplastics, and for making protein-based biogels and matrices for 3D tissue bioprinting.

  1. Design a β-sheet motif that forms a well-ordered structure. XX (I would like to attempt this later!) XX

Part B: Protein Analysis and Visualization

1. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions. Briefly describe the protein you selected and why you selected it.

I chose the p53 protein, which triggers programmed cell death when a cell suffers extensive DNA damage from stresses such as UV light, oxygen radicals or chemicals, damage that can otherwise lead to cancer. In a damaged cell, p53 moves to the nucleus and signals the mitochondria to release reactive oxygen species or increase calcium levels. Other death factors released include cytochrome c, which activates caspases, and SMAC, which blocks survival proteins (Fogg et al., 2011). I selected this protein because mutations in it can cause cancer and it is vital for protecting the human genome from damage.

2. Identify the amino acid sequence of your protein. How long is it? What is the most frequent amino acid? You can use this notebook to count most frequent amino acid - https://colab.research.google.com/drive/1vlAU_Y84lb04e4Nnaf1axU8nQA6_QBP1?usp=sharing

p53 is 393 amino acids long.

The most common amino acid is: P (Proline), which appears 45 times.
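
The residue count can also be reproduced outside the notebook. The sketch below is one possible way to do it, not the notebook's actual code; the demo string is a made-up placeholder rather than the p53 sequence, so paste the real sequence (for example from the UniProt entry) to check the numbers above.

```python
from collections import Counter


def residue_counts(sequence):
    """Count residues, ignoring alignment gaps, digits and whitespace."""
    cleaned = [c for c in sequence.upper() if c.isalpha()]
    return Counter(cleaned)


def most_frequent(sequence):
    """Return (residues, count) for the most common residue(s).

    Ties are all reported, in alphabetical order.
    """
    counts = residue_counts(sequence)
    if not counts:
        raise ValueError("sequence contains no residues")
    top = max(counts.values())
    return sorted(r for r, n in counts.items() if n == top), top


demo = "MPEPLSQP--TF"  # placeholder sequence for illustration only
length = sum(residue_counts(demo).values())
residues, count = most_frequent(demo)
print(f"length={length}, most frequent={residues}, count={count}")
```

Reporting ties explicitly matters for short stretches, where two residues often share the top count.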

How many protein sequence homologs are there for your protein? Hint: Use the pBLAST tool to search for homologs and ClustalOmega to align and visualize them.

I found 175 homologs in total, using human p53 as the query (https://www.uniprot.org/uniprotkb/P04637/entry).

My Clustal Omega alignment is below:

CLUSTAL O(1.2.4) multiple sequence alignment


Zebrafish_P53      -------------MAQNDSQEFAELWEKN----LISIQPPGGGSCWDII-----NDEEYL	38
Frog_P53           ----MEPSSETGMDPPLSQETFEDLWSLLPDPLQTVTCR-------------LDNLSEFP	43
Human_P53          ---MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLP---SQAMDDLMLSPDDIEQWF	54
Mouse_P53          MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILPS-----PHCMDDLLL-PQDVEEFF	54
Rat_P53            ---MEDSQSDMSIELPLSQETFSCLWKLLPPDDILPTTATGSPNPMEDLFL-PQDVAELL	56
                                    ..: *  **.                           :  :  

Zebrafish_P53      ---PGSFDPNFFG-NV-----LEEQP------QPSTLPPTSTVPETSDYPGDHGFRLRFP	83
Frog_P53           D-YPLAADMTV------LQ--------EGLMGNAVPTVTSCAVPSTDDYAGKYGLQLDFQ	88
Human_P53          TEDPGPDEAPRMPEAAPPVAPAPATPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFL	114
Mouse_P53          E---GPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFL	111
Rat_P53            E---GPEEALQVS-APAAQEPGTEAPAPVAPASATPWPLSSSVPSQKTYQGNYGFHLGFL	112
                          :                               :. **. . * *.:*::* * 

Zebrafish_P53      QSGTAKSVTCTYSPDLNKLFCQLAKTCPVQMVVDVAPPQGSVVRATAIYKKSEHVAEVVR	143
Frog_P53           QNGTAKSVTCTYSPELNKLFCQLAKTCPLLVRVESPPPRGSILRATAVYKKSEHVAEVVK	148
Human_P53          HSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVR	174
Mouse_P53          QSGTAKSVMCTYSPPLNKLFCQLVKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVR	171
Rat_P53            QSGTAKSVMCTYSISLNKLFCQLAKTCPVQLWVTSTPPPGTRVRAMAIYKKSQHMTEVVR	172
                   :.****** ****  ***:****.****: : *   ** *: :** *:**:*:*::***:

Zebrafish_P53      RCPHHERTP-DGDNLAPAGHLIRVEGNQRANYREDNITLRHSVFVPYEAPQLGAEWTTVL	202
Frog_P53           RCPHHERSVEPGEDAAPPSHLMRVEGNLQAYYMEDVNSGRHSVCVPYEGPQVGTECTTVL	208
Human_P53          RCPHHERCS-DSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIH	233
Mouse_P53          RCPHHERCS-DGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIH	230
Rat_P53            RCPHHERCS-DGDGLAPPQHLIRVEGNPYAEYLDDKQTFRHSVVVPYEPPEVGSDYTTIH	231
                   *******    .:. **  **:*****    * :*  : **** **** *: *:: **: 

Zebrafish_P53      LNYMCNSSCMGGMNRRPILTIITLETQEGQLLGRRSFEVRVCACPGRDRKTEESNFKKDQ	262
Frog_P53           YNYMCNSSCMGGMNRRPILTIITLETPQGLLLGRRCFEVRVCACPGRDRRTEEDNYTKKR	268
Human_P53          YNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG	293
Mouse_P53          YKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKE	290
Rat_P53            YKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKE	291
                    :***********************  .* **** .*************:***.*  *. 

Zebrafish_P53      ETKTMAKTTTGTKRSLVKESSSATLRPEGSKKAKGSSSDEEIFTLQVRGRERYEILKKLN	322
Frog_P53           GLKPS------GKRELAHPPS---SEPPLPKKRLVVDDDEEIFTLRIKGRSRYEMIKKLN	319
Human_P53          EPHHELPPGS-TKRALPNNTS---SSPQPKKK----PLDGEYFTLQIRGRERFEMFRELN	345
Mouse_P53          VLCPELPPGS-AKRALPTCTS---ASPPQKKK----PLDGEYFTLKIRGRKRFEMFRELN	342
Rat_P53            EHCPELPPGS-AKRALPTSTS---SSPQQKKK----PLDGEYFTLKIRGRERFEMFRELN	343
                               ** *    *     *   **      * * ***:::**.*:*::::**

Zebrafish_P53      DSLELSDVVPASDAEKYRQKFMTKNKKENRGSSEPKQGKKLMVKDEGRSDSD	374
Frog_P53           DALELQESLDQQKVTI--------KCRKCRDEIKPKKGKKLLVKDEQPDSE-	362
Human_P53          EALELKDAQAGKEPGGSRAHS---SHLKSKKGQSTSRHKKLMFKTEGPDSD-	393
Mouse_P53          EALELKDAHATEESGDSRAHS---SYLKTKKGQSTSRHKKTMVKKVGPDSD-	390
Rat_P53            EALELKDAHAAEESGDSRAHS---SYPKTKKGQSTSRHK-PMIKKVGPDSD-	390
                   ::***.:    ..           .  : :   . .: *  :.*    ... 
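
A rough way to quantify the conservation visible in this alignment is to count the columns in which all five species share the same residue (the columns Clustal marks with an asterisk). The sketch below does this for a 29-column excerpt copied from the fourth alignment block above; the function name is illustrative and the excerpt boundaries are my own choice.

```python
def conserved_columns(aligned):
    """Return indices of columns where every sequence has the same residue.

    Columns containing a gap ('-') are never counted as conserved.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    return [
        i for i in range(length)
        if aligned[0][i] != "-" and all(seq[i] == aligned[0][i] for seq in aligned)
    ]


# First 29 columns of the fourth alignment block (DNA-binding core region).
excerpt = [
    "LNYMCNSSCMGGMNRRPILTIITLETQEG",  # Zebrafish_P53
    "YNYMCNSSCMGGMNRRPILTIITLETPQG",  # Frog_P53
    "YNYMCNSSCMGGMNRRPILTIITLEDSSG",  # Human_P53
    "YKYMCNSSCMGGMNRRPILTIITLEDSSG",  # Mouse_P53
    "YKYMCNSSCMGGMNRRPILTIITLEDSSG",  # Rat_P53
]

conserved = conserved_columns(excerpt)
print(f"{len(conserved)} of {len(excerpt[0])} columns fully conserved")
```

Running the same function over the full alignment (after joining the blocks per species) would give a single conservation figure for the whole protein.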

Does your protein belong to any protein family?

The protein belongs to the p53 family.

Identify the structure page of your protein in RCSB. When was the structure solved? Is it a good quality structure? A good quality structure is one with good resolution; the smaller the value, the better (Resolution: 2.70 Å).

The structure 1TUP (https://www.rcsb.org/structure/1TUP) was solved and made public in 1995. Its resolution is 2.20 Å, so it is a good quality structure.

Are there any other molecules in the solved structure apart from protein?

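Yes: double-stranded DNA, which is bound to and complexed with p53. The structure also contains the zinc ions held in the zinc-binding pocket of the core domain.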

Does your protein belong to any structure classification family?

The 1TUP structure represents the core domain of the p53 tumour suppressor family in complex with its target DNA binding site.

3D molecule visualizations

CARTOON: Cartoon_1TUP.png

RIBBON: Ribbon_1TUP.png

BALL & STICK: Ball&Stick_1TUP.png

Color the protein by secondary structure. Does it have more helices or sheets?

SECONDARY STRUCTURE: SecondaryStructure_1TUP.png

Colour key:

  • Beta-sheets (arrows) = YELLOW
  • Alpha-helices (spirals) = RED
  • Loops = GREEN
  • Nucleic acid = CYAN

It has more β-sheets than helices: the core domain consists of two opposing antiparallel β-sheets.

Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

RESIDUE TYPE: Residuetype_1TUP.png

1TUP has more hydrophilic residues (coloured in white) than hydrophobic residues (coloured in orange).

Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Yes, 1TUP has several binding pockets: 1) a DNA-binding cleft, 2) a zinc cofactor binding pocket that is important for its shape, and 3) hydrophobic pockets, which drugs can target.

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

1a. 2LZM (Bacteriophage T4 Lysozyme)

newplot.png

1b. Key residues/stretches of residues that stood out to me: in the protein language model mutation scan, position 51 is coloured entirely yellow/green, and its mutation to tyrosine [Y] (score: 3.037) has the highest score in the entire column and row. A high score means the model rates this substitution as highly plausible, which suggests that it may be tolerated or even stabilising: it could increase rigidity and keep the enzyme well compacted. The score alone does not show that enzymatic activity increases.

In the Proline (P) row, the scores are strongly negative compared with the Alanine (A) or Serine (S) rows, indicating deleterious mutations. This suggests that the protein may not be able to fold in the right way with proline at these positions, or that it loses so much stability that it is likely to be broken down by the cell’s protein-degrading machinery.

1c. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.

Rennell et al. (1991) carried out a systematic mutational scan of T4 lysozyme, which found that the mutation to tyrosine [Y] at position 51 was a stabilising mutation. Similarly, mutation from any amino acid to proline tends to give a misfolded or degraded protein, which agrees with the highly negative scores throughout the proline (P) row of the protein language model.

Latent Space Analysis

Use the provided sequence dataset to embed proteins in reduced dimensionality.

newplot (4).png

Found target at index 13819: d1k9oi_ e.1.1.1 (I:) Alaserpin (serpin 1) {Tobacco hornworm (Manduca sexta) [TaxId: 7130]}
--- Neighborhood of d1k9oi_ e.1.1.1 (I:) Alaserpin (serpin 1) {Tobacco hornworm (Manduca sexta) [TaxId: 7130]} ---
Dist: 0.00 | ID: d1k9oi_ e.1.1.1 (I:) Alaserpin (serpin 1) {Tobacco hornworm (Manduca sexta) [TaxId: 7130]}
Dist: 0.14 | ID: d2wqfa_ d.90.1.0 (A:) automated matches {Lactococcus lactis [TaxId: 1358]}
Dist: 0.22 | ID: d1q2la1 d.185.1.1 (A:504-732) Protease III {Escherichia coli [TaxId: 562]}
Dist: 0.23 | ID: d2b0ta_ c.77.1.2 (A:) automated matches {Corynebacterium glutamicum [TaxId: 1718]}
Dist: 0.26 | ID: d5jhxa2 a.93.1.0 (A:476-784) automated matches {Magnaporthe oryzae [TaxId: 242507]}
Dist: 0.26 | ID: d1q2la2 d.185.1.1 (A:733-960) Protease III {Escherichia coli [TaxId: 562]}
Dist: 0.27 | ID: d1nhpa3 d.87.1.1 (A:322-447) NADH peroxidase {Enterococcus faecalis [TaxId: 1351]}
Dist: 0.30 | ID: d7c6ca_ d.2.1.0 (A:) automated matches {Bacillus subtilis [TaxId: 224308]}
Dist: 0.32 | ID: d1q2la3 d.185.1.1 (A:264-503) Protease III {Escherichia coli [TaxId: 562]}
Dist: 0.34 | ID: d1dpga2 d.81.1.5 (A:182-412,A:427-485) Glucose 6-phosphate dehydrogenase {Leuconostoc mesenteroides [TaxId: 1245]}
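
A neighbourhood listing like the one above can be produced with a simple k-nearest-neighbour search over the embedding vectors. The sketch below is a minimal pure-Python version under stated assumptions: it uses cosine distance (the metric behind the "Dist" values above is not recorded here), and the four 3-dimensional vectors are toy data, not real protein embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbours(embeddings, target_id, k=3):
    """Return the k closest entries to target_id as (distance, id) pairs.

    The target itself is included and comes first with distance 0.
    """
    target = embeddings[target_id]
    scored = [(cosine_distance(target, vec), pid) for pid, vec in embeddings.items()]
    return sorted(scored)[:k]


# Toy embeddings (hypothetical IDs and values, for illustration only).
toy = {
    "protein_A": [1.0, 0.0, 0.0],
    "protein_B": [0.9, 0.1, 0.0],
    "protein_C": [0.0, 1.0, 0.0],
    "protein_D": [0.0, 0.0, 1.0],
}

neighbours = nearest_neighbours(toy, "protein_A", k=3)
for dist, pid in neighbours:
    print(f"Dist: {dist:.2f} | ID: {pid}")
```

With real embeddings, the distances should be computed in the original high-dimensional space rather than on the t-SNE coordinates, since t-SNE does not preserve distances faithfully.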

Analyze the different formed neighborhoods: do they approximate similar proteins?

Yes, they do, but with some variation. My 3D t-SNE analysis shows that the neighbourhoods around T4 lysozyme (2LZM) approximate functional and structural archetypes rather than simply identical sequences. Neighbours include Protease III, Alaserpin and NADH peroxidase, which are involved in regulatory and degradation functions, similar to bacteriophage T4 lysozyme, which degrades the peptidoglycan of bacterial cell walls. Structurally, they appear to share high alpha-helical content, a similar spread of surface charge, or the specific pocket geometries required for substrate binding.

Place your protein in the resulting map and explain its position and similarity to its neighbors.

The T4 lysozyme protein sits in a cluster of highly ordered, multi-domain globular proteins. In the 3D plot, it lies away from the “Globin” cluster (the a.1.1.1 group) and instead occupies a space populated by metabolic and regulatory machinery. Many of the neighbours (like Protease III and Alaserpin) are involved in the cleavage of bonds or its inhibition, just as T4 lysozyme’s job is to break down bacterial cell walls by hydrolysing peptidoglycan bonds. The source organisms are also shared: E. coli, Lactococcus lactis and Enterococcus faecalis are all bacteria. Since T4 lysozyme is a bacteriophage enzyme (T4 attacks E. coli), it shares a biochemical language with the proteins of its prey.

C2. Protein Folding

Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

Output:

Total sequence length: 164
Running ESMFold inference for sequence with length 164…
Prediction complete.
ptm: 0.889
plddt: 93.913

The ESMFold predicted coordinates for phage T4 lysozyme show a high structural resemblance to the X-ray crystallography structure (PDB 2LZM), with a TM-score > 0.9.

Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

It is resilient to mutations. I tried a well-known type of mutation, leucine to alanine, which changes a bulky amino acid to a small one, and this had no visible effect on the predicted globular 3D shape of the protein (ESMFold predicts structure, so stability itself is not measured).

C3. Protein Generation

Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one. Input this sequence into ESMFold and compare the predicted structure to your original.

Chain A of T4 lysozyme is 164 residues long.

2LZM, score=1.1319, fixed_chains=[], designed_chains=['A'], model_name=v_48_020
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
T=0.1, sample=0, score=0.7766, seq_recovery=0.6098
MNIYEMLKILEGLRLKIYKDRYGNYTIGIGHLLTKDPSLEKAKAKLDEAIGRKTNGVITEEEANELFEKDVKAAIEAIKNNPVLKPVYDSLDEVRKMALIALVFRMGEKGVAGLKETLALLKEGKWDEAAETLKKSRWYKNEPENAEKIITMFKTGTFEAFEED

Amino acid probabilities: newplot (6).png

Sampling temperature adjusted amino acid probabilities: newplot (7).png

Feature: Result observed

  • Sequence recovery: 60.98% (high similarity to native T4 lysozyme).
  • Probability analysis: Sharp peaks at the catalytic core; high entropy on surface loops.
  • Mutational strategy: Mostly conservative (hydrophobic-to-hydrophobic) swaps.
  • Structural match: Predicted to be nearly identical to 2LZM (RMSD < 1.0 Å).
  • Conclusion: The T4 lysozyme backbone is a highly designable scaffold.

There is high certainty for residues in the hydrophobic core and active site cleft. The ProteinMPNN sequence candidate shares 60.98% sequence identity with the native T4 sequence, and hydrophobic residues and surface charge remained largely conserved across both.
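
The sequence recovery figure can be checked directly from the two sequences in the ProteinMPNN output above. The sketch below compares them position by position; the function name is my own, and it assumes the two sequences are the same length and need no alignment, which holds for inverse folding on a fixed backbone.

```python
def sequence_recovery(native, designed):
    """Return (matches, fraction) of positions where designed equals native."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches, matches / len(native)


# Native 2LZM chain A, copied from the ProteinMPNN output above.
native = (
    "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLN"
    "AAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILR"
    "NAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRM"
    "LQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL"
)
# Designed sequence (T=0.1, sample=0), copied from the same output.
designed = (
    "MNIYEMLKILEGLRLKIYKDRYGNYTIGIGHLLTKDPSLE"
    "KAKAKLDEAIGRKTNGVITEEEANELFEKDVKAAIEAIKN"
    "NPVLKPVYDSLDEVRKMALIALVFRMGEKGVAGLKETLAL"
    "LKEGKWDEAAETLKKSRWYKNEPENAEKIITMFKTGTFEAFEED"
)

matches, recovery = sequence_recovery(native, designed)
print(f"{matches}/{len(native)} identical positions, recovery = {recovery:.4f}")
```

A useful extension is to list the positions that differ, which makes it easy to see whether the changes sit on the surface or in the core.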

Part D. Group Brainstorm on Bacteriophage Engineering

Question 3) I am addressing project goal 1) Higher toxicity of lysis protein (hard), while mixing in elements of Increased stability (easiest), as a computational game plan.

Goal 1: I want to disrupt the interaction between MS2-L and E. coli DnaJ. The MS2-L protein depends on DnaJ for lysis; the interaction acts as a regulatory step that slows down lysis and allows time for virus particle construction. By disrupting the interaction, I’d like to increase bacterial lysis and also maximise the burst size or enhance the phage lytic cycle.

Goal 2: Increase the toxicity of the lysis protein: optimise the essential C-terminal lytic domain (Domains 2–4) and the conserved LS motif (Leu48-Ser49) to increase its membrane rupture and thus its cell lytic abilities.


Tools and approaches I would like to use

Protein Language Models (ESM2) for in silico deep mutational scanning: ESM2 can be used to run a mutational scan of the MS2-L sequence. This can suggest which amino acids in the N-terminal domain are important for DnaJ binding, and which amino acids in the C-terminal domain are more tolerant of change.

ESMFold will be used for structure prediction of non-mutated MS2-L and its variants, to ensure that mutations intended to disrupt DnaJ binding do not compromise the overall fold or the predicted alpha-helical transmembrane domain.

ProteinMPNN will be used to remodel the DnaJ-binding N-terminal domain to reduce inhibitory interactions, and to optimise the C-terminal domain for enhanced membrane binding.


Justification for tool selection

ESM2 can scan for the conserved LS motif without needing an established structure, because it relies on evolutionary patterns learned from sequences.

ProteinMPNN can design sequences that fit a given backbone and are often more stable than natural sequences.

ESMFold predicts 3D atomic-level protein structures directly from amino acid sequences, which lets me check that variants maintain the alpha-helical structure (residues 37 to the C-terminus) required for membrane insertion.


Potential Pitfalls

  • Higher-order oligomerization of MS2-L: computational folding tools mostly model monomers or small multimers, so it is hard to predict how mutations will affect the oligomerization of MS2-L.

Pipeline Schematic

Input: Non-mutated MS2-L sequence and a predicted alpha-helical backbone.

Analysis: ESM2 deep mutational scan to check DnaJ-binding amino acid residues in the N-terminus and important lytic amino acid residues in the C-terminus.

Design: ProteinMPNN remodelling of the N-terminal domain to stop inhibitory interactions and optimise the C-terminal domain for enhanced membrane binding.

Filter: ESMFold structural validation to discard candidates that fail to maintain the predicted transmembrane helix.

Output: Top candidates for experimental testing in a cell-free expression system or in vivo.
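
The control flow of this pipeline can be sketched as a scan, score, filter and rank loop. Everything model-specific below is a placeholder: toy_score stands in for an ESM2-derived mutation score, toy_keeps_fold stands in for an ESMFold check of the transmembrane helix, and the sequence is a made-up string rather than MS2-L. None of these are real library calls.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_mutants(sequence, start, end):
    """Yield (label, mutant) for every substitution at 1-based positions start..end."""
    for pos in range(start, end + 1):
        wild_type = sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                yield f"{wild_type}{pos}{aa}", sequence[:pos - 1] + aa + sequence[pos:]


def run_pipeline(sequence, start, end, score_fn, keeps_fold_fn, top_n=5):
    """Scan a region, score each mutant, drop those failing the fold check, rank."""
    scored = [(score_fn(seq), label, seq) for label, seq in single_mutants(sequence, start, end)]
    kept = [item for item in scored if keeps_fold_fn(item[2])]
    return sorted(kept, reverse=True)[:top_n]


def toy_score(seq):
    # Placeholder for a language-model score: counts hydrophobic residues.
    return sum(seq.count(aa) for aa in "AILMFVW")


def toy_keeps_fold(seq):
    # Placeholder for a structure check: rejects any sequence containing proline.
    return "P" not in seq


toy_sequence = "MKTLSA"  # hypothetical sequence, not MS2-L
candidates = run_pipeline(toy_sequence, 2, 4, toy_score, toy_keeps_fold)
for score, label, seq in candidates:
    print(label, seq, score)
```

Keeping the scoring and filtering steps as plain functions passed in as arguments means the real ESM2 and ESMFold calls can be swapped in later without changing the loop.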

References

Fogg, V. C., Lanning, N. J., & MacKeigan, J. P. (2011). Mitochondria in cancer: At the crossroads of life and death. Chinese Journal of Cancer, 30(8), 526–539. https://doi.org/10.5732/cjc.011.10018

Rennell, D., Bouvier, S. E., Hardy, L. W., & Poteete, A. R. (1991). Systematic mutation of bacteriophage T4 lysozyme. Journal of Molecular Biology, 222(1), 67–88. https://doi.org/10.1016/0022-2836(91)90738-r

Manning, M., & Colón, W. (2004). Structural basis of protein kinetic stability: Resistance to sodium dodecyl sulfate suggests a central role for rigidity and a bias toward β-sheet structure. Biochemistry, 43(35), 11248–11254. https://doi.org/10.1021/bi0491898

Doig, A. J. (2017). Frozen, but no accident – why the 20 standard amino acids were selected. The FEBS Journal, 284(9), 1296–1305. https://doi.org/10.1111/febs.13982

Zhang, S., & Egli, M. (2022). Hiding in plain sight: Three chemically distinct α-helix types. Quarterly Reviews of Biophysics, 55, e7. https://doi.org/10.1017/S0033583522000063

Kohnke, P., & Zhang, L. (2026). Expedient synthesis of N-protected/C-activated unnatural amino acids for direct peptide synthesis. Journal of the American Chemical Society, 148(5), 5615–5622. https://doi.org/10.1021/jacs.5c20374

Cowing, K. (2023, April 5). How were amino acids formed before the origin of life on earth? Astrobiology. https://astrobiology.com/2023/04/how-were-amino-acids-formed-before-the-origin-of-life-on-earth.html

Chamakura, K. R., Edwards, G. B., & Young, R. (2017). Mutational analysis of the MS2 lysis protein L. Microbiology, 163(7), 961–969. https://doi.org/10.1099/mic.0.000485

Chamakura, K. R., Tran, J. S., & Young, R. (2017). MS2 lysis of Escherichia coli depends on host chaperone DnaJ. Journal of Bacteriology, 199(12), Article e00058-17. https://doi.org/10.1128/JB.00058-17

Mezhyrova, J., Martin, J., Börnsen, C., Dötsch, V., Frangakis, A. S., Morgner, N., & Bernhard, F. (2023). In vitro characterization of the phage lysis protein MS2-L. Microbiology Research Resources, 2(4), Article 28. https://doi.org/10.20517/mrr.2023.28

Strathdee, S. A., Hatfull, G. F., Mutalik, V. K., & Schooley, R. T. (2023). Phage therapy: From biological mechanisms to future directions. Cell, 186(1), 17–31. https://doi.org/10.1016/j.cell.2022.11.017

Gemini AI, AI prompt, How to use KNN neighbours to find neighbourhoods, and debugging code