Week 4: Protein Design Part I

This week focuses on how sequence, structure, and energetics can be modeled and manipulated to create or optimize proteins with specified functions.

Objective:

  1. Learn basic concepts:
    • amino acid structure
    • 3D protein visualization
    • the variety of ML-based design tools
  2. Brainstorm as a group how to apply these tools to engineer a better bacteriophage (setting the stage for the final project).

Part A. Conceptual Questions

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

  1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

A 500-gram piece of meat is roughly 20-25% protein by mass, so about 100 grams of it is protein; water, fat, and other components make up the rest. An average amino acid mass of ~100 Da corresponds to a molar mass of ~100 g/mol, so: $$\text{Moles} = \frac{100 \text{ g}}{100 \text{ g/mol}} = 1 \text{ mol}$$ $$\text{Molecules} = 1 \text{ mol} \times 6.022 \times 10^{23} \text{ molecules/mol} \approx 6.022 \times 10^{23}$$ That is roughly 602 sextillion amino acids.
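The arithmetic above can be checked with a short script. This is a minimal sketch: the function name `amino_acids_in_meat` and the default 20% protein fraction are illustrative assumptions, not part of the assignment.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acids_in_meat(meat_g, protein_fraction=0.20, residue_g_per_mol=100.0):
    """Estimate the number of amino acid molecules in a piece of meat.

    protein_fraction is an assumed round number (low end of 20-25%);
    residue_g_per_mol is the ~100 Da average stated in the question.
    """
    protein_g = meat_g * protein_fraction
    moles = protein_g / residue_g_per_mol
    return moles * AVOGADRO


print(f"{amino_acids_in_meat(500):.3e} amino acids")
```

Using the upper end of the range (`protein_fraction=0.25`) raises the estimate by a quarter, so the answer is best read as an order of magnitude.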

  1. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

When humans eat beef, mastication and digestion break it down into smaller units. First, enzymes (proteases) in the stomach cut the proteins into shorter chains of amino acids. These chains are then broken down further into individual amino acids in the small intestine. Once the amino acids enter the bloodstream, they are only raw building blocks: cells need DNA to instruct how to assemble them. Human DNA encodes different proteins than cow or fish DNA, so the amino acids are rebuilt into human proteins and will not build a cow or a fish.

  1. Why are there only 20 natural amino acids?

It may be an evolutionary mystery that almost all living things are built from the same 20 natural amino acids. These 20 amino acids serve as the building blocks of most proteins. Each one is specified by codons, 3-letter words of nucleotides, which the ribosome reads from the mRNA copy of the DNA sequence. Reading 3 bases at a time gives 4^3 = 64 possible codons, more than enough to encode 20 amino acids plus stop signals, and 20 chemically varied side chains are expansive enough to make diverse lifeforms.
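As a quick check on the codon arithmetic, the sketch below enumerates every 3-letter word over the four RNA bases. The stop codons listed are those of the standard genetic code; the variable names are only illustrative.

```python
from itertools import product

BASES = "ACGU"  # RNA alphabet read by the ribosome

# Every possible 3-letter word over 4 bases: 4**3 combinations.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Stop codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}
sense_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons), "codons in total")
print(len(sense_codons), "codons available to encode 20 amino acids")
```

Because there are more sense codons than amino acids, most amino acids are encoded by several codons.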

  1. Can you make other non-natural amino acids? Design some new amino acids.

Yes, there are many non-natural amino acids. Designing a new amino acid means keeping the same chassis (the backbone) and redesigning the ‘R-group’, the side chain, to change its chemistry. For example, one may attach an azide to the side chain so that it can form a strong covalent bond with a partner group, which is useful for stickiness or bio-glue. For experiments, some researchers also use non-natural fluorescent amino acids such as acridonylalanine, which glow under microscopy or in photographs.

  1. Where did amino acids come from before enzymes that make them, and before life started?

This might be related to assembly theory. Lee Cronin proposed that life is composed of different assemblies: energy, raw materials, and minerals interact in complex ways to scaffold simple molecules into amino acids and then into longer chains. Gases and energy together can create amino acids: the Miller-Urey experiment used water, methane, ammonia, and hydrogen, with electric sparks as the energy source, to create amino acids.

  1. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

Left-handed. D-amino acids are the mirror images of L-amino acids, so an α-helix built from them is the mirror image of the usual right-handed α-helix: both the building blocks and the structure are completely mirrored.
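The mirror-image argument can be illustrated numerically. The sketch below is a purely geometric toy, not an atomic model: it builds an idealized helical trace with α-helix-like parameters (about 100° of rotation and 1.5 Å of rise per residue, with an assumed radius of 2.3 Å), reflects it, and reads the handedness from the sign of a triple product of consecutive steps. All function names are illustrative.

```python
import math


def helix_points(n=8, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized helical trace; counterclockwise rotation while advancing along +z."""
    step = math.radians(turn_deg)
    return [(radius * math.cos(i * step), radius * math.sin(i * step), i * rise)
            for i in range(n)]


def mirror(points):
    """Reflect through the yz-plane (x -> -x), which inverts chirality."""
    return [(-x, y, z) for x, y, z in points]


def handedness(points):
    """Classify a trace by the sign of (b1 x b2) . b3 for its first three steps."""
    p0, p1, p2, p3 = points[:4]
    b1 = [q - p for p, q in zip(p0, p1)]
    b2 = [q - p for p, q in zip(p1, p2)]
    b3 = [q - p for p, q in zip(p2, p3)]
    cross = (b1[1] * b2[2] - b1[2] * b2[1],
             b1[2] * b2[0] - b1[0] * b2[2],
             b1[0] * b2[1] - b1[1] * b2[0])
    triple = sum(c * b for c, b in zip(cross, b3))
    return "right" if triple > 0 else "left"


trace = helix_points()
print(handedness(trace))
print(handedness(mirror(trace)))
```

Reflection flips the sign of the triple product, so the mirrored trace always reports the opposite handedness of the original.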

  1. Can you discover additional helices in proteins?

Yes. Since 2020, AlphaFold has made it possible to quickly predict how proteins fold, and it has produced millions of predicted protein structures in which new helices can be discovered.

  1. Why are most molecular helices right-handed?

Because of chirality, most helices are not identical to their mirror image. Since most natural amino acids are the L-form (left-handed), the way they stack most efficiently is by twisting to the right, where they can form stable hydrogen bonds with enough room between neighboring side chains.

  1. Why do β-sheets tend to aggregate?

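β-sheets are held together by hydrogen bonds between neighboring strands. The geometry is a pleated, zigzag, sheet-like structure with side chains protruding above and below the sheet. The strands at the edges of a sheet still have free hydrogen-bonding groups, so they can pair with strands from other molecules.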

  • What is the driving force for β-sheet aggregation?

They tend to aggregate because of their geometry: the hydrophobic faces of two sheets can sandwich together to hide from water. This hydrophobic effect is the driving force for clumping.

  1. Why do many amyloid diseases form β-sheets?
    • Can you use amyloid β-sheets as materials?
  2. Design a β-sheet motif that forms a well-ordered structure.

Part B: Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

  1. Briefly describe the protein you selected and why you selected it.

I chose the GPR3 orphan G protein-coupled receptor in complex with dominant-negative Gs (PDB 8U8F). GPR3 is a class A orphan G protein-coupled receptor (GPCR) with broad expression across brain regions including the hypothalamus, hippocampus, and cortex, as well as in peripheral tissues such as the liver and ovary. It has a potential role in modulating a number of brain functions, including behavioral responses to stress, amyloid-beta peptide generation in neurons, and neurite outgrowth. I selected it because, for brains-on-chips research, I’m interested in the different proteins expressed in the central nervous system and the brain.

  1. Identify the amino acid sequence of your protein.
    • How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

    There are four protein chains: chain A (372 residues), chain B (339), chain C (58), and chain D (384). The most frequent amino acid seems to be leucine, a sturdy, hydrophobic (water-hating) amino acid.

    >8U8F_4|Chain D[auth R]|G-protein coupled receptor 3|Homo sapiens (9606)
    NSTMKTIIALSYIFCLVFADYKDDDDLEVLFQGPAMWGAGSPLAWLSAGSGNVNVSSVGPAEGPTGPAAPLPSPKAWDVVLCISGTLVSCENALVVAIIVGTPAFRAPMFLLVGSLAVADLLAGLGLVLHFAAVFCIGSAEMSLVLVGVLAMAFTASIGSLLAITVDRYLSLYNALTYYSETTVTRTYVMLALVWGGALGLGLLPVLAWNCLDGLTTCGVVYPLSKNHLVVLAIAFFMVFGIMLQLYAQICRIVCRHAQQIALQRHLLPASHYVATRKGIATLAVVLGAFAACWLPFTVYCLLGDAHSPPLYTYLTLLPATYNSMINPIIYAFRNQDVQKVLWAVCCCCSSSKIPFRSRSPSDVPAGLEVLFQGPHHHHHHHHAAAFESR
    >8U8F_3|Chain C[auth G]|Guanine nucleotide-binding protein G(I)/G(S)/G(O) subunit gamma-2|Homo sapiens (9606)
    NTASIAQARKLVEQLKMEANIDRIKVSKAAADLMAYCEAHAKEDPLLTPVPASENPFR
    >8U8F_2|Chain B|Guanine nucleotide-binding protein G(I)/G(S)/G(T) subunit beta-1|Homo sapiens (9606)
    QSELDQLRQEAEQLKNQIRDARKACADATLSQITNNIDPVGRIQMRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGKLIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNYVACGGLDNICSIYNLKTREGNVRVSRELAGHTGYLSCCRFLDDNQIVTSSGDTTCALWDIETGQQTTTFTGHTGDVMSLSLAPDTRLFVSGACDASAKLWDVREGMCRQTFTGHESDINAICFFPNGNAFATGSDDATCRLFDLRADQELMTYSHDNIICGITSVSFSKSGRLLLAGYDDFNCNVWDALKADRAGVLAGHDNRVSCLGVTDDGMAVATGSWDSFLKIWN
    >8U8F_1|Chain A|Guanine nucleotide-binding protein G(s) subunit alpha isoforms short|Homo sapiens (9606)
    MGCLGNSKTEDQRNEEKAQREANKKIEKQLQKDKQVYRATHRLLLLGAGESGKNTIVKQMRILHVNGFNGEGGEEDPQAARSNSDGEKATKVQDIKNNLKEAIETIVAAMSNLVPPVELANPENQFRVDYILSVMNVPDFDFPPEFYEHAKALWEDEGVRACYERSNEYQLIDCAQYFLDKIDVIKQADYVPSDQDLLRCRVLTSGIFETKFQVDKVNFHMFDVGAQRDERRKWIQCFNDVTAIIFVVASSSYNMVIREDNQTNRLQAALKLFDSIWNNKWLRDTSVILFLNKQDLLAEKVLAGKSKIEDYFPEFARYTTPEDATPEPGEDPRVTRAKYFIRDEFLRISTASGDGRHYCYPHFTCSVDTENIRRVFNDCRDIIQRMHLRQYELL
    • How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

    There are thousands of homologs, including in human, pygmy chimpanzee, olive baboon, cotton-top tamarin, etc. The protein seems highly conserved across species.

    • Does your protein belong to any protein family?

    G Protein-Coupled Receptor (GPCR) Family

  2. Identify the structure page of your protein in RCSB
    • When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)

    The structure was solved around September 2023 and released in March 2024. The method is electron microscopy, and the resolution is 3.49 Å, which is only moderate quality.

    • Are there any other molecules in the solved structure apart from protein?

    Yes, I see palmitic acid in the structure apart from the protein.

    The protein is a membrane protein and falls under the 7-transmembrane receptor (GPCR) class.

  3. Open the structure of your protein in any 3D molecule visualization software:
    • PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
    • Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.

    [Figures: cartoon, ribbon, and ball-and-stick views of 8U8F]

    • Color the protein by secondary structure. Does it have more helices or sheets?

    It has a lot more helices than sheets.
    • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

    I used an additional script (color_h from the PyMOLWiki) to color residues by a hydrophobicity scale. Hydrophobic residues are red and hydrophilic (polar/charged) residues are white. Overall, the protein is slightly more hydrophobic than hydrophilic.

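# Source: https://pymolwiki.org/index.php/Color_h
# color_h: red = most hydrophobic, fading to near-white for charged residues.
# color_h2: pale green = most hydrophobic, saturated green = most hydrophilic.
from pymol import cmd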

def color_h(selection='all'):
        s = str(selection)
        print(s)
        cmd.set_color('color_ile',[0.996,0.062,0.062])
        cmd.set_color('color_phe',[0.996,0.109,0.109])
        cmd.set_color('color_val',[0.992,0.156,0.156])
        cmd.set_color('color_leu',[0.992,0.207,0.207])
        cmd.set_color('color_trp',[0.992,0.254,0.254])
        cmd.set_color('color_met',[0.988,0.301,0.301])
        cmd.set_color('color_ala',[0.988,0.348,0.348])
        cmd.set_color('color_gly',[0.984,0.394,0.394])
        cmd.set_color('color_cys',[0.984,0.445,0.445])
        cmd.set_color('color_tyr',[0.984,0.492,0.492])
        cmd.set_color('color_pro',[0.980,0.539,0.539])
        cmd.set_color('color_thr',[0.980,0.586,0.586])
        cmd.set_color('color_ser',[0.980,0.637,0.637])
        cmd.set_color('color_his',[0.977,0.684,0.684])
        cmd.set_color('color_glu',[0.977,0.730,0.730])
        cmd.set_color('color_asn',[0.973,0.777,0.777])
        cmd.set_color('color_gln',[0.973,0.824,0.824])
        cmd.set_color('color_asp',[0.973,0.875,0.875])
        cmd.set_color('color_lys',[0.899,0.922,0.922])
        cmd.set_color('color_arg',[0.899,0.969,0.969])
        cmd.color("color_ile","("+s+" and resn ile)")
        cmd.color("color_phe","("+s+" and resn phe)")
        cmd.color("color_val","("+s+" and resn val)")
        cmd.color("color_leu","("+s+" and resn leu)")
        cmd.color("color_trp","("+s+" and resn trp)")
        cmd.color("color_met","("+s+" and resn met)")
        cmd.color("color_ala","("+s+" and resn ala)")
        cmd.color("color_gly","("+s+" and resn gly)")
        cmd.color("color_cys","("+s+" and resn cys)")
        cmd.color("color_tyr","("+s+" and resn tyr)")
        cmd.color("color_pro","("+s+" and resn pro)")
        cmd.color("color_thr","("+s+" and resn thr)")
        cmd.color("color_ser","("+s+" and resn ser)")
        cmd.color("color_his","("+s+" and resn his)")
        cmd.color("color_glu","("+s+" and resn glu)")
        cmd.color("color_asn","("+s+" and resn asn)")
        cmd.color("color_gln","("+s+" and resn gln)")
        cmd.color("color_asp","("+s+" and resn asp)")
        cmd.color("color_lys","("+s+" and resn lys)")
        cmd.color("color_arg","("+s+" and resn arg)")
cmd.extend('color_h',color_h)

def color_h2(selection='all'):
        s = str(selection)
        print(s)
        cmd.set_color("color_ile2",[0.938,1,0.938])
        cmd.set_color("color_phe2",[0.891,1,0.891])
        cmd.set_color("color_val2",[0.844,1,0.844])
        cmd.set_color("color_leu2",[0.793,1,0.793])
        cmd.set_color("color_trp2",[0.746,1,0.746])
        cmd.set_color("color_met2",[0.699,1,0.699])
        cmd.set_color("color_ala2",[0.652,1,0.652])
        cmd.set_color("color_gly2",[0.606,1,0.606])
        cmd.set_color("color_cys2",[0.555,1,0.555])
        cmd.set_color("color_tyr2",[0.508,1,0.508])
        cmd.set_color("color_pro2",[0.461,1,0.461])
        cmd.set_color("color_thr2",[0.414,1,0.414])
        cmd.set_color("color_ser2",[0.363,1,0.363])
        cmd.set_color("color_his2",[0.316,1,0.316])
        cmd.set_color("color_glu2",[0.27,1,0.27])
        cmd.set_color("color_asn2",[0.223,1,0.223])
        cmd.set_color("color_gln2",[0.176,1,0.176])
        cmd.set_color("color_asp2",[0.125,1,0.125])
        cmd.set_color("color_lys2",[0.078,1,0.078])
        cmd.set_color("color_arg2",[0.031,1,0.031])
        cmd.color("color_ile2","("+s+" and resn ile)")
        cmd.color("color_phe2","("+s+" and resn phe)")
        cmd.color("color_val2","("+s+" and resn val)")
        cmd.color("color_leu2","("+s+" and resn leu)")
        cmd.color("color_trp2","("+s+" and resn trp)")
        cmd.color("color_met2","("+s+" and resn met)")
        cmd.color("color_ala2","("+s+" and resn ala)")
        cmd.color("color_gly2","("+s+" and resn gly)")
        cmd.color("color_cys2","("+s+" and resn cys)")
        cmd.color("color_tyr2","("+s+" and resn tyr)")
        cmd.color("color_pro2","("+s+" and resn pro)")
        cmd.color("color_thr2","("+s+" and resn thr)")
        cmd.color("color_ser2","("+s+" and resn ser)")
        cmd.color("color_his2","("+s+" and resn his)")
        cmd.color("color_glu2","("+s+" and resn glu)")
        cmd.color("color_asn2","("+s+" and resn asn)")
        cmd.color("color_gln2","("+s+" and resn gln)")
        cmd.color("color_asp2","("+s+" and resn asp)")
        cmd.color("color_lys2","("+s+" and resn lys)")
        cmd.color("color_arg2","("+s+" and resn arg)")
cmd.extend('color_h2',color_h2)
  • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Yes, it appears to have a hole in the middle.
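The sequence questions in Part B (length, most frequent amino acid, hydrophobic versus hydrophilic balance) can also be answered directly from the FASTA text without the Colab notebook. The sketch below uses chain C of 8U8F, copied from the FASTA listing above because it is short. The `HYDROPHOBIC` grouping is an assumption made for illustration; other hydrophobicity scales, including the one in the color_h script, classify some residues differently.

```python
from collections import Counter

# Assumed grouping of residues treated as hydrophobic; scales differ on this.
HYDROPHOBIC = set("AVILMFWC")


def composition(seq):
    """Return (residue, count) pairs sorted from most to least frequent."""
    return Counter(seq.upper()).most_common()


def hydrophobic_fraction(seq):
    """Fraction of residues that fall in the assumed hydrophobic group."""
    seq = seq.upper()
    return sum(res in HYDROPHOBIC for res in seq) / len(seq)


# Chain C (G gamma-2 subunit) of 8U8F, from the FASTA listing above.
chain_c = "NTASIAQARKLVEQLKMEANIDRIKVSKAAADLMAYCEAHAKEDPLLTPVPASENPFR"

print("length:", len(chain_c))
print("most frequent:", composition(chain_c)[:3])
print("hydrophobic fraction:", round(hydrophobic_fraction(chain_c), 2))
```

Running the same functions on the receptor chain (chain D) would be the relevant test of the leucine and hydrophobicity observations, since that chain spans the membrane.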

Part C. Using ML-Based Protein Design Tools

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein.

  1. Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.
  2. Choose your favorite protein from the PDB.
  3. We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

Picture Source: Bordin, Nicola et al (2023). Novel machine learning approaches revolutionize protein knowledge. Trends in Biochemical Sciences, Volume 48, Issue 4, 345 - 359

  1. Deep Mutational Scans

    1. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
    2. >Using ESM2 mutational scans, 8U8F looks like: [Figure: ESM2 deep mutational scan heatmap]
    3. Can you explain any particular pattern? (choose a residue and a mutation that stands out)
    4. There appear to be vertical bands in the scan: positions where substitutions to almost any amino acid are predicted to score low. This might be because those positions are highly conserved for functional or structural reasons. Substitutions to lysine show lots of dark spots and low scores, possibly because a charged residue creates a hydrophobic mismatch. There is a yellow band at position 243. Interestingly, lysine, which is charged, has lots of blue bands, while leucine, which is neutral, mostly scores high.
    5. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
  2. Latent Space Analysis

    1. Use the provided sequence dataset to embed proteins in reduced dimensionality.
    2. >[Figure: reduced-dimensionality embedding map]
    3. Analyze the different formed neighborhoods: do they approximate similar proteins?
    4. >They are positioned far away from each other, which suggests they are very different proteins.
    5. Place your protein in the resulting map and explain its position and similarity to its neighbors.
    6. >G-protein subunits ($\alpha$, $\beta$, and $\gamma$) are much closer to each other on the map. Chain G is much shorter, only 58 amino acids, and is structurally very different from the other proteins: it is essentially just two small alpha-helices connected by a loop.
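For the deep mutational scan in item 1, it may help to see how a heatmap score can be derived from language model likelihoods. One common approach, assumed here and not confirmed to be what the course notebook does, is to score each substitution as the log-probability of the mutant residue minus that of the wild-type residue at the same position. The probabilities below are made up for illustration and do not come from ESM2.

```python
import math


def mutation_scores(position_probs, wild_type):
    """Score each substitution as log p(mutant) - log p(wild type) at one position."""
    wt_logp = math.log(position_probs[wild_type])
    return {aa: math.log(p) - wt_logp
            for aa, p in position_probs.items() if aa != wild_type}


# Hypothetical model output for one buried position whose wild type is leucine.
toy_probs = {"L": 0.60, "I": 0.20, "V": 0.15, "K": 0.05}

scores = mutation_scores(toy_probs, "L")
for aa, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"L->{aa}: {score:+.2f}")
```

Under this scoring rule, a vertical dark band corresponds to a position where the model puts almost all probability on the wild-type residue, so every substitution gets a strongly negative score.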

C2. Protein Folding

Picture Source: Lin et al (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model.

Folding a protein

  1. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
[Figure: ESMFold prediction for 8U8F]
  1. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

I tried changing small snippets of the sequence, and the effect on the predicted structure was barely visible, so the fold seems resilient to a few point mutations. Inserting longer stretches of the same amino acid made the structural changes (twists) more visible.

C3. Protein Generation

Picture Source: 1. Post from Sergey Ovchinnikov 2. Roney, Ovchinnikov et al (2022). State-of-the-art estimation of protein model accuracy using AlphaFold. Phys. Rev. Lett. 129, 238101

Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN

  1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

Using fixed-backbone design, we kept the 3D shape of 8U8F chain A and reskinned it with a new sequence. ProteinMPNN ended up rewriting about 75% of the protein (sequence recovery of roughly 25%), with a high frequency of leucine and lysine. My results look like:

Model weights found in ProteinMPNN/vanilla_model_weights
Using device: cuda:0
Number of edges: 48
Training noise level: 0.2A
Model loaded
{'8u8f': (['A'], [])}
Length of chain A is 381
Generating sequences...
>8u8f, score=2.1622, fixed_chains=[], designed_chains=['A'], model_name=v_48_020
NEEKAQREANKKIEKQLQKDKQVYRATHRLLLLGAGESGKNTIVKQMRIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSGIFETKFQVDKVNFHMFDVGAQRDERRKWIQCFNDVTAIIFVVASSSYXXXXXXXXQTNRLQAALKLFDSIWNNKWLRDTSVILFLNKQDLLAEKVLAGKSKIEDYFPEFARYTTPEDATPEPGEDPRVTRAKYFIRDEFLRISTASGDGRHYCYPHFTCSVDTENIRRVFNDCRDIIQRMHLRQYELL
>T=0.1, sample=0, score=1.0949, seq_recovery=0.2511
ELLKLLEELLKKLAEKLKKEEEEEKKIKKILLLGSPSSGKTTLLKNIKKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEVVEFTIDGKKYKIYDLKNQPPDLREVLAKYKDAKVIIYVFPLGSFXXXXXXXXPEDLEKVALEELEWIWNHPDLKNVPILVIFNRPELLRERVLSGKNPIEERFPEYKGYELPKEVKPPEGVPEEWVKVLAFIIDKILKFANKNRGGIREVYPVISSPESKDIKQIIYDAIKKAEERKKLIAEGKL
>T=0.1, sample=0, score=1.1122, seq_recovery=0.2338
LLLLLLLLLLLLLLVLLLLKLLEESKIKKLLLLGSPSSGKTSLLENIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPERVLEFEIDGVKYRIIDLSNLPPDLSDVLSEYSDCEIIIYVFSTGSYXXXXXXXXPEDLESVDLERLKWIWNHPALKNTPILVIFNRPELLAKRVLSGEKPIEERFPEYKGYKLPENVKPPPGVPEETVKVLSFLIDKVLEFANQNRGGIREVYPVISSVKSKEIKEIIYEAVKKAEERKKLIAQGLL
>T=0.1, sample=0, score=1.0975, seq_recovery=0.2554
KEEEKKKELEEKLKKEEEKKKEEEEKVIKLLLLGLPNSGKTTILENIKKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEVIEFEIEGKKYRIVDLKNLPPDLSEILEKYSDCKILVYIFPTGSFXXXXXXXXPENLEKEALELLKRIWNHPSLKNVPLLVIFNRAEKLKEIVLSGEKPIEEYFPEYKGYKLPESAKPPPNTDPEVVKVLSFLIDKILEYANQNRGGIRKVFPVISSPESKDIREIIYKAVKEAEERKKLIALGLL
>T=0.1, sample=0, score=1.1196, seq_recovery=0.2857
AALAEELAKKKALAALKKKEEEEESKVKKLLLLGGPSSGKTTLLENISKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSSIRELEFEIDGVKYKILDLENRPEDLSEILSEFKDCEIIIYVFPLGSFXXXXXXXXPENLLKKALEEFERIWNHPDLKDVPILVLFNRPELLKEKVLSGKKPLEEIFPEYKGWELPEDAKPPPNTPLEWVKALYFLKEKVLEIANKNRGGRREVFPFIVSPKSKDIKEIIYNAVKEAEKRKALIAAGLL
>T=0.1, sample=0, score=1.1445, seq_recovery=0.2381
LLLLLLLALLLALAALLAALAEEEKKVRKLLLLGLPNSGKTTLLKNISKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEILKFEIDGVKYEIKDLKNRPPDLSDILKEYSDCDIIIYVFPSGLFXXXXXXXXPENLEEVALEQLKNLLNNPDLKNVPILVLFNRPELLKKIVESGKRPLEEIFPEYKGYELPESAVCPPNTPLEWCKAIYFLIDKILEFANQNRGGISEVYPHITSPDSKDIKQIIYDAVKKAEERKKLIAAGKL

New Sequence: DKKIKKDDKKIIKDIKIIDDDDDIIHIIHKKKFRNRRFISSKKIMHIIYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYDNDDTTDESHCFIIWIHWCKIMPNNCKQDTKIWICITHHWTENKFREYYYYYYYYNDCKDITKDDKDVHVMGNCKIMTNHKTHEMQNDKKQDQTKRFIMNHDDQENDWIFWDKNIDTINNDFTNDDVTITKEHHCIHKIEMIMQFFHQDTWNTHRRNDRICHIPHHWCHIIDDQIIKHDFIK

============================================================
Summary
============================================================
Sequence 1: score=1.0949, recovery=25.11%
Sequence 2: score=1.1122, recovery=23.38%
Sequence 3: score=1.0975, recovery=25.54%
Sequence 4: score=1.1196, recovery=28.57%
Sequence 5: score=1.1445, recovery=23.81%

GPU acceleration did not work for me in Google Colab, so I cloned the repository to run it locally.
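The `seq_recovery` values in the log above can be read as the fraction of designed positions where the new residue matches the native one. The sketch below shows that calculation on a toy pair of sequences. The function name and the choice to skip `X` positions are assumptions made for illustration and are not taken from the ProteinMPNN source.

```python
def sequence_recovery(native, designed, skip="X"):
    """Fraction of comparable positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    # Ignore positions that are unresolved (marked with the skip character).
    pairs = [(n, d) for n, d in zip(native, designed) if n != skip and d != skip]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(n == d for n, d in pairs) / len(pairs)


# Toy 10-residue example (hypothetical, not taken from the 8U8F run).
native = "LLLLGAGXXS"
designed = "LLLLGSPXXG"

print(f"recovery = {sequence_recovery(native, designed):.2%}")
```

A recovery near 25%, as in the summary above, means about three quarters of the comparable positions were changed while the backbone was held fixed.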

  1. Input this sequence into ESMFold and compare the predicted structure to your original.

[Figure: predicted structure for the new sequence]

DKKIKKDDKKIIKDIKIIDDDDDIIHIIHKKKFRNRRFISSKKIMHIIYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYDNDDTTDESHCFIIWIHWCKIMPNNCKQDTKIWICITHHWTENKFREYYYYYYYYNDCKDITKDDKDVHVMGNCKIMTNHKTHEMQNDKKQDQTKRFIMNHDDQENDWIFWDKNIDTINNDFTNDDVTITKEHHCIHKIEMIMQFFHQDTWNTHRRNDRICHIPHHWCHIIDDQIIKHDFIK

The prediction retains the overall fold, but when the two are compared in PyMOL, the new structure (white) looks displaced relative to the original.
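A structure can look displaced simply because the two models sit in different coordinate frames, not because their shapes differ. The sketch below illustrates this with hypothetical coordinates: it compares the RMSD of raw coordinates with the RMSD after both models are centered on their centroids. It removes translation only; a full superposition also needs an optimal rotation, which PyMOL's `align` and `super` commands handle.

```python
import math


def centroid(coords):
    """Mean x, y, z of a list of 3D points."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def center(coords):
    """Translate the points so that their centroid sits at the origin."""
    cx, cy, cz = centroid(coords)
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    if len(a) != len(b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))


# Hypothetical C-alpha trace and the same trace shifted by 10 A along x.
original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
shifted = [(x + 10.0, y, z) for x, y, z in original]

print("raw RMSD:", round(rmsd(original, shifted), 2))
print("centered RMSD:", round(rmsd(center(original), center(shifted)), 2))
```

If the centered (or fully superposed) RMSD is small while the raw RMSD is large, the displacement seen in the viewer is an alignment issue and not a change in fold.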

Part D. Group Brainstorm on Bacteriophage Engineering

Assignees for the following sections
MIT/Harvard students: Optional
Committed Listeners: Required
  1. Find a group of ~3–4 students

  2. Read through the Phage Reading material listed under “Reading & Resources” below.

  3. Review the Bacteriophage Final Project Goals for engineering the L Protein:

    • Increased stability (easiest)
    • Higher titers (medium)
    • Higher toxicity of lysis protein (hard)
  4. Brainstorm Session

    • Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).

    Our two goals are (1) optimizing the L protein’s binding affinity to E. coli to accelerate the lysis trigger, and (2) increasing the stability of the L protein, ensuring it folds and integrates into the membrane to perform its function.

    • Write a 1-page proposal (bullet points or short paragraphs) describing:

      • Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).

      We would like to use protein language models such as ESM2, as in the Colab notebook, to perform in silico mutagenesis. We will score single point mutations across the L protein sequence and try to identify mutations that the model considers evolutionarily favorable. As in the assignment, I am also interested in using ProteinMPNN to redesign the sequence: given the backbone structure of the L protein, it can generate alternative sequences that aim to keep the same fold with higher thermal stability, which matches our goals. AlphaFold-Multimer is particularly interesting too, as it predicts 3D structures of protein complexes (co-folding multiple chains), which would let us check how candidate designs interact with other proteins.

      • Why do you think those tools might help solve your chosen sub-problem?

      ProteinMPNN was very robust at proposing sequences that fit a specific shape, which makes it a promising (though not guaranteed) route to increasing protein stability. ESM2 lets us scan a very large number of mutations at once, so we can quickly narrow down a direction in a way that would not be practical in a wet lab setting.

      • Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).

      The L protein is a membrane protein. Most standard protein models, such as AlphaFold-Multimer, seem to be trained primarily on soluble proteins. The specific lipid-protein interactions required for lysis may not be fully captured, leading to “stable” designs that fail to insert into the membrane. In my own assignment, I also still don’t understand how the redesigned structure fits the original, since it appears displaced.

      • Include a schematic of your pipeline.
    • This resource may be useful: HTGAA Protein Engineering Tools

  5. Each individually put your plan on your HTGAA website

    • Include your group’s short plan for engineering a bacteriophage

Input an L protein sequence > use ESM2 to score mutations (the heat map should show green-light vs. no-go positions in the sequence) > use ProteinMPNN to generate a skeleton template for core stability > add complexity with AlphaFold by predicting an interaction > use PyMOL to check shape and geometry > calculate a binding affinity score in Colab > select the best candidates.
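The first step of this plan, in silico mutagenesis, starts from the full list of single point mutants to be scored. The sketch below enumerates them. The sequence is a hypothetical placeholder, not the real L protein, and the function name and the `M1A`-style labels are illustrative choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_point_mutants(seq):
    """Yield (label, mutant_sequence) for every single substitution, e.g. 'M1A'."""
    for i, wild_type in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                yield f"{wild_type}{i + 1}{aa}", seq[:i] + aa + seq[i + 1:]


# Hypothetical placeholder; substitute the real L protein sequence here.
placeholder = "MKTLLV"

mutants = list(single_point_mutants(placeholder))
print(len(mutants), "single point mutants")
print(mutants[:3])
```

A sequence of length N yields 19 × N mutants, which is why a language model scan is attractive: each mutant can be scored computationally before any are built in the lab.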


Reading & Resources

Tools

Phage Reading