Week 4 HW: protein-design-part-i

Part A. Conceptual Questions

  1. How many molecules of amino acids in 500g of meat? A quick search shows that, for example, beef contains ~20–22g of protein per 100g of meat.

Also, it would be good to mention that Raw meat contains more water, so the total weight of amino acids is lower per 100g compared to cooked meat, where water loss concentrates the nutrients (often increasing protein to 28–36g per 100g).

If: 100g = 20g of protein 500g = ? (500*20)/100= 100g of protein An average amino acid molecular weight of ~100 Da (100 g/mol): Moles of amino acids = 100g ÷ 100 g/mol = 1 mol Number of molecules = 1 mol × 6.022 × 10²³ ≈ 6 × 10²³ amino acid molecules

  1. Why do humans eat beef but do not become a cow, eat fish but do not become fish? The short answer is because of digestion. Proteases in the gut break all dietary proteins down to their constituent free amino acids, meaning the informational content (sequence) of the cow’s proteins is destroyed. Ribosomes then reassemble amino acids in the order dictated by ones own mRNA, encoding human proteins.

Interesting fact I discovered Any intact foreign protein that did slip through is attacked by ones immune system as a non-self antigen which is the basis of food allergies (fragments sometimes escape digestion).

  1. Why are there only 20 natural amino acids? One of the main theories as to why there are only 20 amino acids in the world is attributed to evolution. The “frozen accident” / historical contingency The genetic code was fixed early in evolution (~3.5–4 billion years ago) and became locked in. Changing it would be catastrophic (every codon reassignment would corrupt thousands of proteins simultaneously).

  2. Can you make other non-natural amino acids? Design some new amino acids. Yes, numerous non-natural (or non-canonical) amino acids (ncAAs) can be designed and synthesized using advanced chemical methods, such as palladium-catalyzed C-H bond functionalization. By incorporating unnatural amino acids, scientists can extend the genetic code, creating proteins with unique structural and functional properties.

  3. Where did amino acids come from before enzymes that make them, and before life started? Some of the theories of the origin of amino acids include: a) The Miller-Urey experiment (1953) b) Meteorite delivery c) Hydrothermal vents d) HCN and formaldehyde chemistry e) The “RNA world” scenario

  4. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect? It would be LEFT-HANDED. Proteins built from D-amino acids form left-handed α-helices, which are mirror images of the right-handed helices formed by L-amino acids.

  5. Can you discover additional helices in proteins? Proteins can adopt several types of helical structures, each with distinct geometric and hydrogen-bonding characteristics. The α-helix is the most common helical structure found in proteins. Beyond the classic α-helix, several other helical structures exist. There is:

  • 3₁₀-helix is a tighter helix with 3.0 residues per turn. Its hydrogen bonds form between residue i and i + 3, resulting in a rise of about 2.0 Å per residue. This helix often appears at the termini of α-helices rather than as long, independent structures.

  • π-helix is a rare and wider helix, containing 4.4 residues per turn. Its hydrogen bonding occurs between residue i and i + 5, with a rise of approximately 1.1 Å per residue. π-helices are uncommon but are frequently found at functional sites in proteins.

  • Polyproline II (PPII) helix has 3.0 residues per turn and lacks intramolecular hydrogen bonds. Instead, it adopts an extended left-handed conformation with a rise of about 3.1 Å per residue. This structure plays an important role in signaling, particularly in binding interactions such as those involving SH3 domains.

  • Polyproline I (PPI) helix contains 3.3 residues per turn and also lacks intramolecular hydrogen bonds. It is right-handed and more compact, with a rise of about 1.9 Å per residue. This form is typically observed in organic solvents.

  • Collagen helix consists of approximately 3.3 residues per turn and is stabilized by interchain hydrogen bonds rather than intramolecular ones. Each chain has a rise of about 2.9 Å per residue, and three chains wind together to form a characteristic triple helix. This structure is distinguished by its repeating Gly–X–Y sequence pattern.

  1. Why are most molecular helices right-handed? Natural proteins are composed mainly of L-amino acids. The stereochemistry of the peptide backbone energetically favors right-handed α-helices.

  2. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation? β-sheets hydrogen bond through their edge strands, which have free NH and C=O groups pointing outward, ready to hydrogen bond with another β-sheet edge. Unlike α-helices (where all H-bond donors/acceptors are internally satisfied), β-sheet edges are “sticky.”

Part B: Protein Analysis and Visualization

One of the final individual projects I want to do is the development of a biosensor for the diagnosis of Vibrio Cholerae. To help me learn more about V. Cholerae I will use this assignment to study one of the main toxins that is associated with toxicity.

Cholera toxin (CT) is defined as the major virulence factor of the bacterium Vibrio cholerae, comprising a single A subunit and five identical B subunits in an A-B5 architecture, which causes severe diarrheal symptoms in infected individuals. Cholera toxin is a member of the AB toxin family and is composed of a catalytically active heterodimeric A-subunit linked with a homopentameric B-subunit.

  1. Briefly describe the protein you selected and why you selected it. The protein I have selected is the Cholera enterotoxin subunit A. Cholera toxin subunit A (CTA) is the catalytically active component of the AB₅ cholera enterotoxin secreted by Vibrio cholerae, and it is the central molecular component responsible for the devastating secretory diarrhea that defines cholera disease.

  2. Identify the amino acid sequence of your protein. (For this part I search the Cholera toxin on UniProt, after which i selected the Cholera enterotoxin subunit A)

The amino acid sequence MVKIIFVFFIFLSSFSYANDDKLYRADSRPPDEIKQSGGLMPRGQSEYFDRGTQMNINLYDHARGTQTGFVRHDDGYVSTSISLRSAHLVGQTILSGHSTYYIYVIATAPNMFNVNDVLGAYSPHPDEQEVSALGGIPYSQIYGWYRVHFGVLDEQLHRNRGYRDRYYSNLDIAPAADGYGLAGFPPEHRAWREEPWIHHAPPGCGNAPRSSMSNTCDEKTQSLGVKFLDEYQSKVKRQIFSGYQSDIDTHNRIKDEL

Sequence Lenght 258 Amino Acids

Amino Acid Frequency Using the provided colab notebook I was able to get the amino acid count.

alt text alt text The output is as follows: S: 23, G: 22, D: 19, Y: 18, R: 17, I: 16, L: 16, A: 15, P: 14, V: 13, F: 12, Q: 12, N: 11, E: 11, H: 11, T: 10, K: 8, M: 5, W: 3, C: 2 The letters represent each of the amino acids: S = Serine, G = Glycine, D = Aspartate, Y = Tyrosine, R = Arginine, I = Isoleucine, L = Leucine, A = Alanine, P = Proline, V = Valine, F = Phenylalanine, Q = Glutamine, N = Asparagine, E = Glutamate, H = Histidine, T = Threonine, K = Lysine, M = Methionine, W = Tryptophan, C = Cysteine.

The most abundant amino acid is Serine (S, n=23), closely followed by Glycine (G, n=22) and Aspartate (D, n=19) The overall composition hints at a hydrophilic, surface-exposed protein given the prevalence of polar and charged residues (S, D, R, Y, Q, N, E, H).

Protein Sequence Homologs Protein homologs are proteins that share a common evolutionary origin. They come from the same ancestral gene, even if their sequences or functions have changed over time. There are two main types:

  • Orthologs – Proteins in different species that evolved from a common ancestral gene after a speciation event. They often retain similar functions.
  • Paralogs – Proteins within the same species that arose by gene duplication. They may evolve new or specialized functions over time.

Through BlastKB there were 250 hits

A comprehensive search for cholera toxin homologs across biological databases yielded 250 results distributed across three domains. The vast majority of results (207 sequences, 83%) came from fungi, particularly Ascomycota species, with notable representations from entomopathogenic fungi in the families Ophiocordycipitaceae, Clavicipitaceae, and Cordycipitaceae, as well as the plant pathogen Colletotrichum (84 results). Bacterial homologs comprised 42 sequences (17%), dominated by Pseudomonadota, including pathogenic species such as Escherichia coli, Vibrio cholerae, and Paraburkholderia, alongside environmental bacteria like Bartonella and Leptospira. A single viral sequence (1 result) was identified as Vibrio phage CTXphi, which is notably the well-characterized prophage known to carry the cholera toxin genes themselves, validating the search methodology. The predominance of fungal and bacterial sequences suggests that toxin-like proteins with structural or functional homology to cholera toxin are widespread across microbial organisms, though the biological significance of these fungal matches warrants further investigation given that toxin production is not a typical trait of this kingdom.

alt text alt text

Protein Family Based on the results, the protein belongs to the Cholera enterotoxin subunit A family.

  1. Identify the structure page of your protein in RCSB When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)

It was solved on 2004-04-06. It has a resolution of 1.9 Å which was obtained through X-RAY DIFFRACTION. I would say it has good quality since the resolution is smaller than 2.70 Å.

alt text alt text

Are there any other molecules in the solved structure apart from protein?

There are two unique ligands present in the structure, specifically: beta-D-galactopyranose and Sodium Ion.

alt text alt text

Does your protein belong to any structure classification family?

Yes. It belongs to the ADP-ribosylating toxins. (NOT SURE ABOUT THIS ANSWER) alt text alt text

  1. Open the structure of your protein in any 3D molecule visualization software: Using PyMol I wanted to visualize my protein of interest. Download the PDB file for the protein, then on PyMol, go to file then open, select the downloaded PDB file. (Chatgpt helped with the python code for the different visualizations)
alt text alt text
  • Cartoon

I used the following commands

PyMOL>hide everything

PyMOL>show cartoon

PyMOL>color red

PyMOL>util.cbc (color by secondary structure)

alt text alt text
  • Ribbon

I used the following commands

PyMOL>hide everything

PyMOL>show ribbon

alt text alt text

Ball and Stick

I used the following command

yMOL>hide everything

PyMOL>show sticks

PyMOL>show spheres

To Improve structure

PyMOL>set sphere_scale, 0.25

PyMOL>set stick_radius, 0.15

alt text alt text

To answer the following questions I used chapgpt to explain how to identify the different structures and also to understand the residues

  • Color the protein by secondary structure. Does it have more helices or sheets? By default, PyMOL colors:

Red → α-helices

Yellow → β-sheets

Green → loops/coils

The protein has α-helices that appeared as a spiral / corkscrew shape, Long cylindrical coils and was red It also has β-Sheets that appeared as Flat arrow-shaped strands, Multiple arrows next to each other and were yellowSpiral/corkscrew It also had Loops that appeared as Thin connecting regions that were green

To tell whether I have more helices or sheets, I ran the following code PyMOL>select helices, ss h Selector: selection “helices” defined with 3854 atoms. PyMOL>select sheets, ss s Selector: selection “sheets” defined with 2744 atoms.

My protein has more Helices

  • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues? “hydrophobic” defined with 4211 atoms. “hydrophilic” defined with 3626 atoms.

There are more hydrophobic atoms than hydrophilic atoms.

  • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

  1. Deep Mutational Scans

Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.

alt text alt text

Can you explain any particular pattern? (choose a residue and a mutation that stands out) The protein has several highly conserved positions (deep blue vertical stripes).

C2. Protein Folding

C3. Protein Generation

Part D. Group Brainstorm on Bacteriophage Engineering