Week 4: Protein Design Part I

Part A. Conceptual Questions

Q1: How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

1 Da equals 1.66053906892(52)×10−27 kg, so 1 aa is 1.66053906892(52)×10−25 kg. The average protein fraction in meat is ~20%. Therefore, the total amount of protein is 100g. The number of aa in 100g = 0.1kg of protein is ~6×10²³ or Avogadro number, 1 mole.

Q2: Why do humans eat beef but do not become a cow, eat fish but do not become fish?

To become another organism, we need to use its DNA to produce proteins within us. However, DNA, RNA, and proteins, we consume are broken down and absorbed, and so only their building blocks and not the information their sequences carry is used in our body.

Q3: Why are there only 20 natural amino acids?

The set of 20 amino acids has been established through evolution, and it would be impossible to change already fixed codons because a change would corrupt thousands of proteins.

Q4: Can you make other non-natural amino acids? Design some new amino acids.

Yes. Some design strategies can include adding an azide to a sidechain for the aa to react in click chemistry reactions or an isotope.

Q5: Where did amino acids come from before enzymes that make them, and before life started?

The production of amino acids does not require enzymes. Enzyme-independent production of amino acids was possible in the early Earth atmosphere, in hydrothermal vents, and through the Strecher Synthesis. Additionally, amino acids reach Earth with meteorite material.

Q6: If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

The resulting helix will be left-handed.

Q7: Can you discover additional helices in proteins?

Yes. Several other types (4+) of helices have been identified beyond α-helices.

Q8: Why are most molecular helices right-handed?

Most polymer molecules are right-handed because they are comprised from left-handed monomers. In helices, left-handed monomers would occupy more space near the backbone, which is energetically inefficient.

Q9: Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

They aggregate due to exposed NH group being hydrogen donors and C=O groups being acceptors on the edge of a sheet, so edges form hydrogen bonds (the driving force); due to hydrophobic sides associating with each other; due to steric compatibility of identical sheets; due to matching electrostatic periodicity in sheets; due to low configurational entropy cost of aggregations.

Q10: Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

Yes, β-sheets are used as materials (Kevlar, peptide hydrogels as tissue engineering scaffolds).

Due to stress, mutations, and aging, proteins partially unfold. β-sheets with an alternation of hydrophobic and hydrophilic sites ease aggregation and then the growth of stable aggregates. Sequences with alternating hydrophobic and hydrophilic residues get exposed and form β-sheet conformation due to thermodynamic stability. These sheets bind and stack due to hydrogen bonds and hydrophobic interactions, and the oligomers then grow to exceptionally stable fibers.

Q11: Design a β-sheet motif that forms a well-ordered structure.

TBA

Part B: Protein Analysis and Visualization

1. Protein choice

I chose alpha-synuclein (SNCA) Uniprot P37840 because its toxic form accumulates in Parkinson’s disease, and the mechanisms of alpha-synuclein aggregation are still investigated.

The protein is comprised of 3 domains: an amphipathic N-terminal domain (1-60), a hydrophobic NAC domain (61-95), and an acidic C-terminus (96-140). Three isoforms produced by alternative splicing are: isoform 1 of 140 aa canonical, isoform 2 of 112 aa, and isoform 3 of 126 aa. The protein is normally a monomer, with the highest concentration in the brain and concentrated in the presynaptic terminals of nerve cells. The ’non A-beta component of Alzheimer disease amyloid plaque’ domain (NAC domain) is hydrophobic and is involved in fibril formation. The C-terminus may regulate aggregation and determine the diameter of the filaments.

2. Protein Sequence and Its Analysis

Alpha-synuclein — Sequence
UniProt ID: P37840 | PDB: 1XQ8
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA

• SNCA is 140 aa long. The most common aa are A (Alanine) and V (Valine).

Amino Acid Frequencies:

V: 19 (13.57%)

A: 19 (13.57%)

G: 18 (12.86%)

E: 18 (12.86%)

K: 15 (10.71%)

T: 10 (7.14%)

D: 6 (4.29%)

Q: 6 (4.29%)

P: 5 (3.57%)

M: 4 (2.86%)

L: 4 (2.86%)

S: 4 (2.86%)

Y: 4 (2.86%)

N: 3 (2.14%)

F: 2 (1.43%)

I: 2 (1.43%)

H: 1 (0.71%)

• 250 homologs were identified with the Uniprot’s BLAST tool

• Alpha-synuclein is a member of the synuclein family, which also includes beta- and gamma-synuclein.

3. Protein Structure

Protein Data Bank PDB Structure Page: micelle-bound human alpha-synuclein, 1XQ8 | pdb_00001xq8 and disease-relevant fiber structure, 6A6B | pdb_00006a6b

The structure of the monomer was solved with NMR and published in 2005. The structure of the fiber was solved with EM, with a resolution of 3.07 Å and published in 2018. No other molecules other than SNCA are present in the structures.

The protein belongs to the Synuclein structure classification family (Structural Classification of Proteins).

4. Visualization

Cartoon visualization of an aggregate (6A6B) and an SNCA monomer (1XQ8) structures Cartoon visualization of an aggregate (6A6B) and an SNCA monomer (1XQ8) structures PyMol Cartoon visualization of an SCNA aggregate (6A6B) EM structure and of an SNCA monomer (1XQ8) NMR structure. Colours by secondary structure.

Cartoon visualization of an aggregate (6A6B) and an SNCA monomer (1XQ8) structures, domains Cartoon visualization of an aggregate (6A6B) and an SNCA monomer (1XQ8) structures, domains PyMol Cartoon visualization of the same structures. Blue (37–60) - N-terminal region, Orange (61–95) - NAC region, Red (96–99)- C-terminal region.

Part C. Using ML-Based Protein Design Tools

I chose human mitochondrial protease ClpP, the protease component of the ClpXP complex that cleaves peptides and various proteins in an ATP-dependent process. ClpP binds the NAC domain of SCNA, and this binding inhibits the protease.

ClpP Protease — Sequence
UniProt ID: Q16740 | PDB: 1TG6
MWPGILVGGARVASCRYPALGPRLAAHFPAQRPPQRTLQNGLALQRCLHATATRALPLIPIVVEQTGRGERAYDIYSRLLRERIVCVMGPIDDSVASLVIAQLLFLQSESNKKPIHMYINSPGGVVTAGLAIYDTMQYILNPICTWCVGQAASMGSLLLAAGTPGMRHSLPNSRIMIHQPSGGARGQATDIAIQAEEIMKLKKQLYNIYAKHTKQSLQVIESAMERDRYMSPMEAQEFGILDKVLVHPPQDGEDEPTLVQKEPVEAAPAAEPVPAST

C1. Protein Language Modeling

1. Deep Mutational Scan

Esm2_t6_8M_UR50D (the shallowest one and using fewer parameters) model was used to generate an unsupervised deep mutational scan of ClpP protein based on language model likelihoods.

clpp_Mutation_Scan clpp_Mutation_Scan

Most mutations are neutral (green/as likely as wildtype). There are several blue regions (121-122, 151,153, and 176) in which all mutations score highly unlikely, so these are structurally and functionally important, not likely to mutate, and these are probably part of active centres of the protease.

The mutation scan mapped the active site neighbourhood. The active sites mentioned in UniPort are TYR 153 and GLU 178. The sites with almost all aa unlikely are: HIS 122, TYR 153, PRO 176, ILE 121, ASN 151.

2. Latent Space Analysis

The provided sequence dataset was used to embed proteins in a reduced dimensionality space. To place ClpP into the resulting latent space map, its ESM2 embedding was first generated. Then, this new embedding was combined with the existing dataset’s embeddings, and the t-SNE algorithm on this combined set was re-run.

latent_space_with_clpp latent_space_with_clpp Esm2_t6_8M_UR50D latent space with embedded human ClpP Q16740 (black).

latent_space_with_clpp_nearest_neighbours latent_space_with_clpp_nearest_neighbours Zoom-in into the Esm2_t6_8M_UR50D latent space with embedded human ClpP Q16740 (black).

The neighbours are all ClpP homologs; therefore, the 6-layer model is correctly identifying sequence homology. Larger models like esm2_t33_650M or esm2_t36_3B are needed to capture more nuanced functional information.

C2. Protein Folding

Folding the original sequence

ClpP_ESM_fold ClpP_ESM_fold ClpP folded with ESMFold.

EM structure of the subunit visualized with Pymol EM structure of the subunit visualized with Pymol EM structure of the ClpP subunit visualized with Pymol (1TG6 | pdb_00001tg6).

EM structure of the ClpP visualized with Pymol EM structure of the ClpP visualized with Pymol EM structure of the ClpP with the subunit highlighted, visualized with Pymol (1TG6 | pdb_00001tg6).

Resilience of the structure to changes in the original sequence

To test the resilience of the predicted fold, I changed 5 positions (121, 122, 151, 153, 176) that are highly unlikely to mutate to Tryptophane.

121 Ser (S)→ Trp (W)

122 Pro (P)→ Trp (W)

151 Ala (A)→ Trp (W)

153 Ser (S)→ Trp (W)

176 Met (M)→ Trp (W)
ClpP Protease — Original Sequence
UniProt ID: Q16740 | PDB: 1TG6
MWPGILVGGARVASCRYPALGPRLAAHFPAQRPPQRTLQNGLALQRCLHATATRALPLIPIVVEQTGRGERAYDIYSRLLRERIVCVMGPIDDSVASLVIAQLLFLQSESNKKPIHMYINSPGGVVTAGLAIYDTMQYILNPICTWCVGQAASMGSLLLAAGTPGMRHSLPNSRIMIHQPSGGARGQATDIAIQAEEIMKLKKQLYNIYAKHTKQSLQVIESAMERDRYMSPMEAQEFGILDKVLVHPPQDGEDEPTLVQKEPVEAAPAAEPVPAST
ClpP Protease — Modified Sequence
MWPGILVGGARVASCRYPALGPRLAAHFPAQRPPQRTLQNGLALQRCLHATATRALPLIPIVVEQTGRGERAYDIYSRLLRERIVCVMGPIDDSVASLVIAQLLFLQSESNKKPIHMYINWWGGVVTAGLAIYDTMQYILNPICTWCVGQWAWMGSLLLAAGTPGMRHSLPNSRIWIHQPSGGARGQATDIAIQAEEIMKLKKQLYNIYAKHTKQSLQVIESAMERDRYMSPMEAQEFGILDKVLVHPPQDGEDEPTLVQKEPVEAAPAAEPVPAST

pymol_preview of the mutated locations pymol_preview of the mutated locations The changed residues (positions 121, 122, 151, 153, 176) highlighted in red within the original subunit structure visualized with PyMol.

The modified sequence was then folded with EMSFold.

Image refold after mutations Image refold after mutations A fold of the modified ClpP sequence generated with ESMFold.

Image Original fold Image Original fold A fold of the original ClpP sequence generated with ESMFold.

C3. Protein Generation

As an input, I used the ClpP subunit isolated previously from (1TG6 | pdb_00001tg6):

EM structure of the subunit visualized with Pymol EM structure of the subunit visualized with Pymol EM structure of the ClpP subunit visualized with Pymol (1TG6 | pdb_00001tg6).

A sequence generated with inverse folding with ProteinMPNN

ClpP Protease — ProteinMPNN-generated Sequence
GAAPVLAXXXXXXXXXATLEEALLAQRVVLVRGPIDAALAAKVVAQLDALEAESPTAPITLLIDSPGGDYDAGLAILDRIRAIPNPVRTWAVGQAASMGALLLASGTPGLRFSTPDARIAIHKVSGTASGSPEELAEQKAALEAKNEELADLLSEYTGQSLETIKEAMKEVNYLTPEEAKEFGLLDHVLAEPP
T=0.1, sample=0, score=0.8587, seq_recovery=0.4402

The generated sequence (with the Xs changed to As) was then folded with EMS fold to compare with the fold predicted from the original sequence and with the published ClpP structure.

subunit_iverse_fold subunit_iverse_fold A fold of the ProteinMPNN-generated ClpP sequence predicted by ESMFold. ESMFold inference for sequence with length 193. ptm: 0.861 plddt: 90.378.

PyMol clp_ESM_fold_compare PyMol clp_ESM_fold_compare The original ClpP subunit structure, Q16740.

clp_ESM_fold clp_ESM_fold A fold of the original ClpP sequence predicted with ESMFold.