Homework 4 – Protein Design I

Part A. Conceptual Questions

1. How many molecules of amino acids do you take with a piece of 500 grams of meat?

(on average an amino acid is ~100 Daltons)

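Assuming meat is roughly 20% protein by mass, 500 g of meat contains about 100 g of protein. At ~100 Da (100 g/mol) per amino acid, that is approximately 1 mole of amino acids, which equals about 6 × 10²³ molecules according to Avogadro's constant. If the whole 500 g were counted as protein, the figure would be 5 moles, or about 3 × 10²⁴ molecules.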
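The arithmetic behind this estimate can be sketched in a few lines. The 20% protein fraction is an assumption (it is not given in the question), and the function name is only illustrative.

```python
AVOGADRO = 6.022e23          # molecules per mole
MEAT_MASS_G = 500.0          # from the question
AA_MASS_G_PER_MOL = 100.0    # ~100 Da per amino acid, from the question
PROTEIN_FRACTION = 0.20      # assumption: meat is ~20% protein by mass


def amino_acid_count(meat_g, protein_fraction, aa_g_per_mol=AA_MASS_G_PER_MOL):
    """Return (moles, molecules) of amino acids in a given mass of meat."""
    moles = meat_g * protein_fraction / aa_g_per_mol
    return moles, moles * AVOGADRO


moles, molecules = amino_acid_count(MEAT_MASS_G, PROTEIN_FRACTION)
print(f"{moles:.1f} mol of amino acids, about {molecules:.1e} molecules")
```

Setting `PROTEIN_FRACTION` to 1.0 gives the upper bound in which the full 500 g is treated as protein.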

2. Why do humans eat beef but do not become cows, eat fish but do not become fish?

Humans do not become cows or fish because food is broken down during digestion into small building blocks such as amino acids, fatty acids, and simple sugars. These building blocks carry no trace of the organism they came from, and they are reassembled into human proteins and tissues according to our own genetic code.

3. Why are there only 20 natural amino acids?

Twenty amino acids fit comfortably within the genetic code while providing enough chemical diversity to build functional proteins. Proteins are encoded by triplet codons composed of four nucleotides:

4³ = 64 codons

61 codons encode amino acids and 3 codons function as stop signals, so the code is redundant: most amino acids are specified by more than one codon.
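As a quick check of the codon arithmetic, the sketch below enumerates every triplet over the RNA alphabet and separates out the three stop codons of the standard genetic code. The variable names are illustrative only.

```python
from itertools import product

NUCLEOTIDES = "ACGU"                     # RNA alphabet
STOP_CODONS = {"UAA", "UAG", "UGA"}      # stop codons of the standard genetic code

# Every ordered triplet of nucleotides: 4 * 4 * 4 combinations.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]
sense_codons = [codon for codon in codons if codon not in STOP_CODONS]

print(len(codons), len(sense_codons), len(STOP_CODONS))
```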

4. Can you make other non-natural amino acids? Design some new amino acids.

Yes, synthetic amino acids can be designed by maintaining the standard backbone:

NH₂–CH(R)–COOH

while modifying the R side chain to provide new chemical properties.

Examples:

Fluorinated amino acids
- CF₃ side chain
- Highly hydrophobic and chemically stable
- Could increase protein resistance to degradation
Long hydrophobic amino acids
- Side chain: (CH₂)₁₂CH₃
- Promote membrane interactions
- Useful for membrane-active peptides
Redox-active amino acids
- Quinone-containing side chain
- Capable of electron transfer
- Useful in artificial enzymes or bioelectronic systems

5. Where did amino acids come from before enzymes that make them and before life started?

Before enzymes and life existed, amino acids likely formed through prebiotic chemical reactions on early Earth and in space.

Simple molecules such as:

methane
ammonia
carbon dioxide
water
hydrogen

could react under energy sources such as:

lightning
volcanic heat
UV radiation

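This concept was demonstrated in the Miller–Urey experiment (1953), in which electric sparks passed through a mixture of methane, ammonia, hydrogen, and water vapour produced amino acids. Additionally, amino acids have been detected in meteorites, suggesting that space chemistry may have contributed to the origin of biomolecules used by early life.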

6. If you make an α-helix using D-amino acids, what handedness would you expect?

An α-helix composed entirely of D-amino acids would form a left-handed helix, which is the mirror image of the right-handed helices formed by L-amino acids in natural proteins.
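The mirror-image argument can be illustrated numerically. The sketch below builds an idealised helix trace with α-helix-like geometry (about 100° of rotation and 1.5 Å of rise per residue, radius about 2.3 Å), reflects it, and reports the sign of a triple product of consecutive step vectors. It is a geometric toy rather than a structural model: the points stand in for Cα positions, and the function names are illustrative.

```python
import math


def helix(n_points, turn_deg=100.0, rise=1.5, radius=2.3):
    """Points on an idealised right-handed helix (alpha-helix-like geometry)."""
    step = math.radians(turn_deg)
    return [(radius * math.cos(k * step), radius * math.sin(k * step), rise * k)
            for k in range(n_points)]


def mirror(points):
    """Reflect through the yz-plane (x -> -x), giving the mirror-image trace."""
    return [(-x, y, z) for x, y, z in points]


def handedness(points):
    """Sign of (b1 x b2) . b3 for the first three steps: +1 right, -1 left."""
    b1, b2, b3 = [tuple(q[i] - p[i] for i in range(3))
                  for p, q in zip(points, points[1:4])]
    cross = (b1[1] * b2[2] - b1[2] * b2[1],
             b1[2] * b2[0] - b1[0] * b2[2],
             b1[0] * b2[1] - b1[1] * b2[0])
    triple = sum(c * b for c, b in zip(cross, b3))
    return 1 if triple > 0 else -1


l_helix = helix(8)          # stands in for a helix of L-amino acids
d_helix = mirror(l_helix)   # mirror image, as expected for D-amino acids
print(handedness(l_helix), handedness(d_helix))
```

Reflection leaves every distance and angle unchanged but flips the sign of the triple product, which is why the mirror-image helix is equally stable but left-handed.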

7. Can you discover additional helices in proteins?

Proteins contain several types of helices besides the α-helix, including:

3₁₀ helices
polyproline helices
collagen triple helices
other engineered or synthetic helical structures

8. Why are most molecular helices right-handed?

Most natural helices are right-handed because proteins are composed primarily of L-amino acids. The stereochemistry of L-amino acids makes the right-handed helix sterically and energetically favourable.

9. Why do β-sheets tend to aggregate?

β-sheets aggregate because their backbone hydrogen bonds and side-chain interactions favour tight stacking, which can lead to structures such as amyloid fibrils.

Driving forces of β-sheet aggregation:

intermolecular hydrogen bonding
van der Waals interactions
hydrophobic interactions

These forces stabilise the aggregated state.

Part B. Protein Analysis and Visualisation

1. Protein description

Germin and germin-like proteins are ubiquitous plant proteins expressed during various biotic and abiotic stresses.

The OsRGLP1 protein has confirmed superoxide dismutase activity, although additional activities such as oxalate oxidase remain under investigation.

2. Amino acid sequence

MASSSFLLLATLLAMASWQGMASDPSPLQDFCVADMHSPVLVNGFACLNPKDVNADHFFKAAMLDTPRKTNKVGSNVTLINVMQIPGLNTLGISIARIDYAPLGQNPPHTHPRATEILTVLEGTLYVGFVTSNPDNKFFSKVLNKGDVFVFPVGLIHFQFNPNPYKPAVAIAALSSQNPGAITIANAVFGSKPPISDDVLAKAFQVEKGTIDWLQAQFWENNHY

3. Sequence analysis

Sequence length: 224 amino acids

Most frequent amino acid: Alanine (A) — 23 occurrences

Amino acid frequencies

A: 23, L: 21, P: 18, V: 18, N: 17, S: 15, F: 15, G: 13, T: 12, D: 12, I: 12, K: 11, Q: 9, M: 6, H: 6, Y: 4, E: 4, W: 3, R: 3, C: 2
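A minimal sketch of how the length and composition above could be recomputed from the sequence, using only the Python standard library. The sequence is copied from section 2, and the variable names are illustrative.

```python
from collections import Counter

# OsRGLP1 sequence from section 2, split across lines for readability.
SEQUENCE = (
    "MASSSFLLLATLLAMASWQGMASDPSPLQDFCVADMHSPVLVNGFACLNPKDVNADHFFK"
    "AAMLDTPRKTNKVGSNVTLINVMQIPGLNTLGISIARIDYAPLGQNPPHTHPRATEILTV"
    "LEGTLYVGFVTSNPDNKFFSKVLNKGDVFVFPVGLIHFQFNPNPYKPAVAIAALSSQNPG"
    "AITIANAVFGSKPPISDDVLAKAFQVEKGTIDWLQAQFWENNHY"
)

counts = Counter(SEQUENCE)

print("Sequence length:", len(SEQUENCE))
for residue, n in counts.most_common():
    # Report each residue with its count and percentage of the sequence.
    print(f"{residue}: {n} ({100 * n / len(SEQUENCE):.1f}%)")
```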

4. Protein sequence homologues

BLAST search parameters:

Program: blastp
Matrix: BLOSUM62
Alignments: 250
Scores: 250

OsRGLP1 belongs to the germin-like protein (GLP) family. GLPs share 30–70% sequence similarity with wheat germin.

Reference:

Zhang, Y. et al. (2018). Overexpression of germin-like protein GmGLP10 enhances resistance to Sclerotinia sclerotiorum in transgenic tobacco. Biochemical and Biophysical Research Communications.

5. Structure information (RCSB)

Structure determination

Identify the structure page of your protein in RCSB.

• When was the structure solved? Is it a good-quality structure?

Structure page: https://www.rcsb.org/3d-view/2ETE

Crystal structure of germin (oxalate oxidase). Woo, E.J., Dunwell, J.M., Goodenough, P.W., Marvier, A.C., Pickersgill, R.W. (2000) Nat Struct Biol 7: 10

A good-quality structure is one with good resolution, and smaller values are better. Resolution: 2.70 Å

• Does your protein belong to any structure classification family?

Germin and GLPs are described as archetypal members of the cupin superfamily (Dunwell 1998). The cupin superfamily is named after its conserved β-barrel fold (cupa is the Latin word for barrel) and was discovered using a conserved motif found in germins.

Open the structure of your protein in any 3D molecule visualisation software.

AlphaFold was used to generate a predicted structure showing the secondary-structure elements (helices and sheets). The protein has more β-sheet than α-helix content: a central β-sheet core is surrounded by helices and loops. Hydrophobic residues are buried in the core, and hydrophilic residues lie on the surface.

Part C. Using ML-Based Protein Design Tools

In this section, we will learn about the capabilities of modern protein AI models and test some of them on your chosen protein.

Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a Colab instance with GPU.

i) Deep mutational scan

ii) Latent Space Analysis Scan

iii) Inverse fold modelling

iv) The heat map was generated in section C1 (protein language modelling), specifically from a deep mutational scan using the ESM-2 protein language model, which predicts the probability of every amino acid substitution at each position in the protein sequence.
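To make the layout of such a heat map concrete, here is a minimal sketch of how a 20 × L grid of mutation scores can be assembled from per-position probabilities. It does not call ESM-2: `toy_position_probs` is a hypothetical stand-in with made-up probabilities, and scoring each substitution as log p(mutant) − log p(wild type) is one common convention that the notebook may or may not follow.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_position_probs(sequence, position):
    """Hypothetical stand-in for a language model's prediction at one position.

    Assigns 0.62 to the wild-type residue and 0.02 to each alternative; a real
    scan would take these probabilities from ESM-2 instead.
    """
    wild_type = sequence[position]
    return {aa: (0.62 if aa == wild_type else 0.02) for aa in AMINO_ACIDS}


def mutational_scan(sequence, position_probs=toy_position_probs):
    """Return a 20 x L grid of log-odds scores, one row per amino acid."""
    grid = []
    for aa in AMINO_ACIDS:
        row = []
        for pos, wild_type in enumerate(sequence):
            probs = position_probs(sequence, pos)
            # Score relative to the wild type, so wild-type cells are 0.
            row.append(math.log(probs[aa]) - math.log(probs[wild_type]))
        grid.append(row)
    return grid


heatmap = mutational_scan("MASSSF")   # first six residues of the sequence
print(len(heatmap), "rows x", len(heatmap[0]), "columns")
```

In this convention, negative cells mark substitutions the model considers less likely than the wild-type residue, and cells near zero mark tolerated ones.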