Week 4 HW: Protein design part 1

Answer conceptual questions
Learn basic concepts of protein design
Brainstorm how to apply these together in the group project

Part A - Conceptual questions

Amino Acid & Protein Structure Q&A

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

To find the number of molecules, we first determine the mass of protein and then convert that to moles and molecules.

Protein Mass: Lean meat is roughly 20% protein by weight.
$500\text{ g} \times 0.20 = 100\text{ g of protein}$.
Moles of Amino Acids: Using the average molecular weight (100 Daltons = 100 g/mol):
$100\text{ g} / 100\text{ g/mol} = 1\text{ mole of amino acids}$.
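Molecules: Using Avogadro’s number ($6.022 \times 10^{23}$ molecules per mole):
$1\text{ mol} \times 6.022 \times 10^{23}\text{ mol}^{-1} = 6.022 \times 10^{23}$, so you consume approximately $6.022 \times 10^{23}$ amino acid molecules.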
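
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name is made up for illustration, and the default values (20% protein by weight, 100 g/mol per amino acid) are the same rough assumptions used in the answer, not measured data.

```python
AVOGADRO = 6.022e23  # molecules per mole, rounded as in the answer above

def amino_acid_molecules(meat_g, protein_fraction=0.20, avg_aa_mass=100.0):
    """Estimate the number of amino acid molecules in a piece of meat."""
    protein_g = meat_g * protein_fraction  # grams of protein (assumed fraction)
    moles = protein_g / avg_aa_mass        # 1 Da corresponds to 1 g/mol
    return moles * AVOGADRO

estimate = amino_acid_molecules(500)
print(f"{estimate:.3e} amino acid molecules")
```

Changing `protein_fraction` shows how sensitive the estimate is to the assumed protein content of the meat.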

Why are there only 20 natural amino acids?

The “Standard 20” represents a biological “frozen accident” that reached a functional optimum early in evolution.

Chemical Versatility: These 20 provide a sufficient range of acidity, basicity, hydrophobicity, and polarity to fold into almost any required 3D shape.
Genetic Code Constraints: Our triplet codon system ($4^3 = 64$ combinations) must balance diversity with error tolerance.
Metabolic Cost: Adding more amino acids requires more complex metabolic pathways and specialized tRNA synthetases.
Note: Some organisms do use “extra” ones like Selenocysteine or Pyrrolysine.

Can you make other non-natural amino acids? Design some new amino acids.

Yes, “Non-Canonical Amino Acids” (ncAAs) are frequently synthesized.

The “Photo-Switch”: An amino acid with an azobenzene side chain that changes shape (cis/trans) when hit by light.
The “Click-Linker”: Incorporating an azide or alkyne group into the side chain to allow “Click Chemistry” reactions.
The “Boron-Carrier”: Adding a boronic acid group for creating glucose-sensing proteins.

Where did amino acids come from before enzymes that make them, and before life started?

Prebiotic chemistry provided several pathways:

Strecker Synthesis: Spontaneous formation in a “primordial soup” containing ammonia, hydrogen cyanide, and aldehydes.
The Miller-Urey Experiment: Demonstrated that sparking a mixture of $CH_4$, $NH_3$, $H_2$, and $H_2O$ creates various amino acids.
Hydrothermal Vents: High pressure and steep temperature gradients at the ocean floor may have driven the formation of organic molecules.
Exogenesis: Amino acids found in meteorites suggest that some formed in space, possibly through UV-driven chemistry, and were delivered to the early Earth.

If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

Natural L-amino acids form right-handed α-helices. Because D-amino acids are the mirror image (enantiomer) of L-amino acids, a polymer made entirely of D-amino acids would form a left-handed α-helix.

Can you discover additional helices in proteins?

Yes, they are classified by their hydrogen-bonding patterns:

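$3_{10}$ helix: A tighter, more elongated helix with $i \to i+3$ hydrogen bonds (versus $i \to i+4$ in the $\alpha$-helix), often found at the ends of $\alpha$-helices.
$\pi$-helix: A wider, shorter helix with $i \to i+5$ hydrogen bonds that is relatively rare and often associated with functional sites.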
Polyproline II helix: A left-handed helix that doesn’t rely on internal hydrogen bonds; common in collagen.

Why are most molecular helices right-handed?

This is due to homochirality. In biology, almost all amino acids are L-isomers. When L-amino acids link together, the steric hindrance (physical bumping) between the side chains and the backbone is minimized in a right-handed geometry. A left-handed helix made of L-amino acids would cause the side chains to clash significantly.

Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

$\beta$-sheets are “sticky” at their edges.

The Driving Force: The primary drivers are hydrogen bonding and the hydrophobic effect.
Unlike in $\alpha$-helices, where backbone hydrogen bonds are satisfied within the helix itself, the peptide backbone at the edge of a $\beta$-strand has “exposed” donors and acceptors. If a strand doesn’t find a partner within its own protein, it can bind to a strand in a neighboring protein.

Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

Amyloid Formation: When proteins misfold, they expose hydrophobic residues and unpaired backbone hydrogen-bonding groups that snap into stable, insoluble cross-$\beta$ structures. These are highly resistant to degradation.

Use as Materials: They are being researched for:

Nanowires: Conductive amyloids for bio-electronics.
Drug Delivery: Hydrogels for slow-release medication.
Adhesives: Mimicking super-strong underwater glues used by bacteria.

Part B - Protein analysis and visualization

Briefly describe the protein you selected and why you selected it.
- I am analyzing and visualizing a thermoacidophilic extremozyme. Glucoamylases from Sulfolobus solfataricus are adapted to high temperature and high acidity. Specifically, I chose $\beta$-glycosidase (SSO1353). Source paper: Advances in Extremophile Research: Biotechnological Applications through Isolation and Identification Techniques | MDPI
Identify the amino acid sequence of your protein
- MGRFAIYEAPQNCPYLGTIGACYEFGSLPVILMFPELEKSFLKLLIRHIREDGYVPHDLG YHSLDSPIDGTTSPPRWKDMNPSLILLVYRYFKFTNDIEFLKEVYPILVKVMDWELRQCK GNLPFMEGEMDNAFDATIIKGHDSYTSSLFIGSLIAMREIAKLVGDSNYVDFISEKLSSA REAFRRMFNGRYFKAWDSVDNASFLAQLYGEWFTTLVGLEDIVEEDIIKKALESIIRLNG NASPHCVPNLVDDNGKIVGLSVQTYSSWPRMVFAICWLAYKKGVGDLSFCKKEWDNLVKN GMVWDQPSRINGYNGKAEMNYLDHYIGSPSPWSFLF. Source: DNASU Plasmid | SSO1353 (S. solfataricus )
  In pET21_NESG (His-tagged bacterial expression vector)
  - Uniprot link - Glycosyl-hydrolase family 116 catalytic region domain-containing protein - Saccharolobus solfataricus (strain ATCC 35092 / DSM 1617 / JCM 11322 / P2) (Sulfolobus solfataricus) | UniProtKB | UniProt
1. How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
  - Total sequence length is 336, with L being the most frequent
2. How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
  - The BLAST search returned around 250 homologs
3. Does the protein belong to a family?
  - Yes - GH116 Glycosyl-hydrolase family 116 catalytic region domain-containing protein
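
The length and residue-frequency count from question 1 can also be reproduced outside the Colab notebook. The sketch below is not the course notebook's code; `residue_counts` is a hypothetical helper, and the demo uses only the first 20 residues of the sequence above to keep the example short.

```python
from collections import Counter

def residue_counts(sequence):
    """Return (length, Counter of residues) for a one-letter protein sequence."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
    return len(cleaned), Counter(cleaned)

# Demo on the first 20 residues only; paste the full sequence to reproduce the answer.
fragment = "MGRFAIYEAP QNCPYLGTIG"
length, counts = residue_counts(fragment)
print(length, counts.most_common(3))
```

Stripping whitespace first matters here because the sequence as copied from DNASU contains a space after every 60 residues.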
Identify the structures page of your protein in RCSB - RCSB PDB - 1GOW: BETA-GLYCOSIDASE FROM SULFOLOBUS SOLFATARICUS
1. When the structure was resolved
  1. Released: 1997-08-20 at resolution of 2.60 Å
2. Are there any other molecules in the solved structure apart from protein
  1. No
3. Does your protein belong to any structure classification family?
  1. Glycoside Hydrolase Family
PYMOL section
1. Download PDBX/mmCIF format of protein from RCSB
2. in import section, select the above file
3. The protein is visualized in the main viewer window. The right-side panel has the “ASHLC” buttons (Action, Show, Hide, Label, Color), which are used to change the visualization
- show cartoon, ribbon and sticks/spheres view
- Color by secondary structure
- Color by residue type
- show holes
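
The same views can be set from PyMOL's command line instead of the ASHLC menus. The sketch below only builds the command text in plain Python so it can be saved as a `.pml` script; the helper name and the file name `view_1gow.pml` are hypothetical, and the colors are chosen to match the screenshots described below rather than any PyMOL default.

```python
def build_pymol_script(pdb_id="1gow"):
    """Return PyMOL command-line statements mirroring the GUI steps as one string."""
    commands = [
        f"fetch {pdb_id}",        # download the structure from the PDB
        "hide everything",
        "show cartoon",           # swap for 'show ribbon' or 'show sticks'
        "color cyan, ss h",       # helices
        "color pink, ss s",       # beta strands
        "color orange, ss l+''",  # loops and residues without an assignment
    ]
    return "\n".join(commands)

script = build_pymol_script()
print(script)  # save as e.g. view_1gow.pml and run it in PyMOL with '@view_1gow.pml'
```

Keeping the commands in a script makes the figures reproducible without repeating the menu clicks.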

Pymol screenshots

Pymol protein visualizations

Cartoon view of the molecule: click S (Show), then select cartoon.

Ribbon view: click S, then “as”, then ribbon, on the 1GOW object rather than on “all”.

Spheres and sticks view: show as spheres first, then in the same drop-down menu click sticks as well.

Color the protein by their secondary structure

Colored by secondary structure. The protein seems to have more helices than sheets, but more loops than either. Cyan is helix, pink is sheet, orange is loop.

Color by Residue, hydrophobicity and hydrophilicity

Import the Python script and run it from File > Run Script.
In the command line below the viewer, run the command named after the file, in this case color_h.

Source: Mapping properties onto a structure: Electrostatic potential, conservation, hydrophobicity/polarity

The redder a region is, the more hydrophobic it is; white is not necessarily hydrophilic. In the second image below, green marks polar residues and white marks non-polar residues, so green could indicate hydrophilic regions.

Check if any “holes” there

There don’t seem to be any outright holes, but the surface has many grooves and gaps that could serve as binding pockets.

Part C - Using ML based protein design tools

Copy of google colab - https://colab.research.google.com/drive/1Hn82J2OK4n2e_SrKc0UW3Pw9U4y6dzv6?usp=sharing

C1 Protein Language Modelling

Deep Mutational scan

Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
Can you explain any particular pattern? (choose a residue and a mutation that stands out)

Used ESM2 to run a mutational scan on the protein above: entered the amino acid sequence into the notebook and executed it in relative mode. The model used was esm2_t6_8M_UR50D.

An interesting pattern: there are 5-6 major stretches where mutations are predicted to be detrimental, and right before those stretches there are positions where mutations are predicted to be beneficial. This holds regardless of which amino acid replaces the original residue.

According to the heatmap, the most damaging replacement seems to be P (proline): there is a horizontal line across the plot indicating that a P substitution is detrimental at almost every position. This is consistent with proline's rigid ring, which tends to disrupt helices and sheets.
Replacing the 162nd amino acid with T stands out as a strongly beneficial mutation at a single site.
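
For readers wondering what the heatmap values mean, the sketch below shows one common way to score substitutions from language-model output: the log-probability of the mutant residue minus that of the wild-type residue at the same position. This assumes that “relative mode” in the notebook means scores relative to the wild type, which was not verified. The logits here are made-up toy numbers, not ESM2 output, and all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def mutation_scores(wild_type, logits_per_position, alphabet):
    """Score every substitution as log p(mutant) - log p(wild type) at each position."""
    scores = []
    for residue, logits in zip(wild_type, logits_per_position):
        log_probs = dict(zip(alphabet, log_softmax(logits)))
        scores.append({aa: log_probs[aa] - log_probs[residue] for aa in alphabet})
    return scores

# Toy example: 2-letter alphabet, made-up logits standing in for model output.
alphabet = "AC"
toy_logits = [[0.0, math.log(3.0)], [2.0, 0.0]]
scores = mutation_scores("AA", toy_logits, alphabet)
print(scores[0]["C"], scores[1]["C"])
```

Under this scoring, a positive value means the model finds the mutant residue more likely than the wild type at that position, and a negative value means less likely; the wild-type residue always scores zero.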

There are a few patterns in the latent-space plot, but most proteins seem to be similar to each other. Clusters of about 10-15 proteins tend to sit at the outskirts. The t-SNE color gradient also runs along the vertical axis, so proteins lower in the plot are more likely to be close along one other dimension as well.

Added our protein, the beta-glycosidase, to the existing list of proteins and redid the plot. Our target is part of a small cluster at the edge of the graph, shown in red.

It is grouped with 2 other glucoamylases (from yeast and bacteria), as well as:
Nisin biosynthesis protein
arylamine N-acetyltransferase
epimerase
mannosidase
putative NAG-isomerase
endo-β-1,4-glucanase

The code below was used to add it to the rest of the sequences, and then to create the graph.

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create a new sequence (can be any protein or nucleic acid sequence)
new_seq_data = "MGRFAIYEAPQNCPYLGTIGACYEFGSLPVILMFPELEKSFLKLLIRHIREDGYVPHDLGYHSLDSPIDGTTSPPRWKDMNPSLILLVYRYFKFTNDIEFLKEVYPILVKVMDWELRQCKGNLPFMEGEMDNAFDATIIKGHDSYTSSLFIGSLIAMREIAKLVGDSNYVDFISEKLSSAREAFRRMFNGRYFKAWDSVDNASFLAQLYGEWFTTLVGLEDIVEEDIIKKALESIIRLNGNASPHCVPNLVDDNGKIVGLSVQTYSSWPRMVFAICWLAYKKGVGDLSFCKKEWDNLVKNGMVWDQPSRINGYNGKAEMNYLDHYIGSPSPWSFLF"

new_seq = Seq(new_seq_data)
# Create a new SeqRecord object
new_record = SeqRecord(
    new_seq,
    id="Somename_",
    name="Somename_",
    description="d1dlwa_ a.1.1.1 (A:) Sulfolobus solfataricus {archaea (extremophile) [TaxId: 2287]}"
)
 
# Add the new SeqRecord object to the sequences list
sequences.append(new_record)
print("New SeqRecord added to the sequences list.")
print("First three entries of the updated sequences list:")
print(sequences[0:3])

from sklearn.manifold import TSNE
import plotly.express as px
import numpy as np
import pandas as pd

# Convert the list of embeddings to a numpy array if not already done
embeddings_array = np.array(embeddings)
protein_sequence_annotations = [str(record.description) for record in sequences]
print(f"Shape of embeddings array before 3D t-SNE: {embeddings_array.shape}")

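# Apply t-SNE for 3D dimensionality reduction
# Note: scikit-learn 1.5+ renames n_iter to max_iter; switch the keyword if n_iter raises a warning or error
tsne_3d = TSNE(n_components=3, perplexity=30, n_iter=300, random_state=42)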
embeddings_3d = tsne_3d.fit_transform(embeddings_array)
print(f"Shape of embeddings array after 3D t-SNE: {embeddings_3d.shape}")

# Create a DataFrame for Plotly
tsne_df_3d = pd.DataFrame(embeddings_3d, columns=['TSNE1', 'TSNE2', 'TSNE3'])
# Create a category column to highlight the last added sequence
tsne_df_3d['category'] = 'Other sequences'

# Assuming the last element in `embeddings` (and thus `embeddings_3d`) corresponds to the last added sequence
tsne_df_3d.loc[len(tsne_df_3d) - 1, 'category'] = 'Last added sequence'  

# Define custom colors
color_map = {'Other sequences': 'blue', 'Last added sequence': 'red'}
# Visualize with Plotly 3D scatter plot, coloring by the new category
fig_3d = px.scatter_3d(
    tsne_df_3d,
    x='TSNE1',
    y='TSNE2',
    z='TSNE3',
    color='category', # Color points based on the new category
    color_discrete_map=color_map, # Apply custom colors
    title='3D t-SNE Visualization of Protein Sequence Embeddings (Last Added Highlighted)',
    hover_name=protein_sequence_annotations[:len(embeddings_array)]
)
fig_3d.update_layout(
    height=800 # Increase the height of the plot
)
fig_3d.show()

C2 Protein Folding

Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

The ESMFold-predicted structure looks very different from the experimental structure viewed in PyMOL.

Changed the 162nd amino acid to T since, according to the previous analysis, this mutation was predicted to be very beneficial.
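
Making a point mutation such as the position-162 change by hand is error-prone because residue numbering is 1-based while Python strings are 0-based. The helper below is a minimal sketch with a hypothetical name, demonstrated on a short toy sequence rather than the real protein.

```python
def point_mutation(sequence, position, new_residue):
    """Return a copy of sequence with the residue at a 1-based position replaced."""
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside 1..{len(sequence)}")
    index = position - 1  # convert to Python's 0-based indexing
    return sequence[:index] + new_residue + sequence[index + 1:]

toy = "ACDEFGHIK"
mutant = point_mutation(toy, 3, "T")
print(toy, "->", mutant)
```

For the protein above, the call would take the full sequence with `position=162` and `new_residue="T"`, and the result can be pasted into ESMFold.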

C3 Protein Generation

Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.
Input this sequence into ESMFold and compare the predicted structure to your original.

Generated sequence:

T=0.1, sample=0, score=0.7474, seq_recovery=0.5031 MIEFPKDFKFGFSTHPYSSYMGTPGSENPNTDWYAWHRDPENIAAGLVSGDNIDNGPGWWGNYQKYIDAAKELGATIIRTSVDWSAIFPKPLPPDPDFDPASKRVDSVEINPERIAELKKYANQAAIAHYKKILTAIKDAGFHIILNIWAGPLPLWLFDALAVRKGDFSTPIGWLDPRTVVEFALYAGFIVSEFKDLIDEIEVMNEPNDYSYFGYIETEKGYPPGYKSEELSEIALRNLIQAIALARDAIKKVSDLPVGISINTRGFKPATPDDKEAVEKARYDNVDKFLDAIINGVYSEGGEEKTDPSLKGSLDFIGVNYYTYDVVKKEGGGYEILDGYGWRCKKNSVDANGNPTDEAGYEFYPEGLYDVLKYLYDKYKLPMYVTEFGIADKDGTMQPYYIVSNIAQVAKAIKDGVDIKGALYHHLADEYHWSKGWALQYGLLSVDFATKELSRKPAADVFKKIATNNAIPDEIAHLIAKPDVSPLPK

The generated sequence is completely different from the original: it is longer and shares no obvious stretches with it. The amino acid probabilities seem completely random.
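
The header of the generated sequence reports `seq_recovery=0.5031`. Assuming this is the fraction of positions where the designed residue matches the sequence of the input structure (an assumption, not checked against the tool's documentation), the metric can be sketched as below. The function name is hypothetical and the demo sequences are toy strings. Because the sequence listed in Part B and the generated sequence differ in length, comparing them this way would first require an alignment, which this sketch does not do.

```python
def sequence_recovery(native, designed):
    """Fraction of positions where two equal-length sequences share the same residue."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length; align them first")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

print(sequence_recovery("ACDEFGHIKL", "ACDEYGHIKV"))
```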

Regenerated inverse folded protein:

The structure predicted for the inverse-folded sequence looks drastically different, with one extra helix and different placement of loops and sheets.

Part D - Group Brainstorm on bacteriophage engineering

Form a team of 3-4 people
Read through the phage reading resources
Review the bacteriophage final project goals for engineering the L-protein
Hold a brainstorm session
Include the plan to engineer the phage on the website

Group brainstorm result:

Use ESMFold to see in what ways the protein can mutate.
Use analytical methods to find target protein regions.
Use genomic language models to create more lytic proteins and identify their target regions.
Check folding again using AlphaFold.