Week 4: Protein Design Part I

This week focuses on how sequence, structure, and energetics can be modeled and manipulated to create or optimize proteins with specified functions.

Objective:

Learn basic concepts:
- amino acid structure
- 3D protein visualization
- the variety of ML-based design tools
Brainstorm as a group how to apply these tools to engineer a better bacteriophage (setting the stage for the final project).

Part A. Conceptual Questions

Assignees for the following sections

MIT/Harvard students	Required
Committed Listeners	Required

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

A 500 g piece of meat is roughly 20-25% protein by mass, so about 100 g of it is protein (taking the lower bound of 20%); fat, water, and other components make up the rest. An average amino acid has a mass of ~100 Da, which corresponds to a molar mass of ~100 g/mol, so: $$\text{Moles} = \frac{100 \text{ g}}{100 \text{ g/mol}} = 1 \text{ mol}$$ $$\text{Molecules} = 1 \text{ mol} \times 6.022 \times 10^{23} \text{ molecules/mol} \approx 6.022 \times 10^{23}$$ That is roughly 602 sextillion amino acid molecules.
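The estimate above can be reproduced in a few lines of Python. This is a minimal sketch: the function name and the default values (20% protein, 100 g/mol per amino acid) are assumptions taken from the rough figures in the answer, not measured data.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_molecules(meat_g, protein_fraction=0.20, residue_g_per_mol=100.0):
    """Estimate the number of amino acid molecules in a piece of meat."""
    protein_g = meat_g * protein_fraction   # grams of protein in the meat
    moles = protein_g / residue_g_per_mol   # moles of amino acids
    return moles * AVOGADRO                 # molecules


n = amino_acid_molecules(500)
print(f"{n:.3e} amino acid molecules")
```

Raising `protein_fraction` to 0.25 gives about 7.5 × 10^23, so the answer stays in the same order of magnitude across the 20-25% range.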

Why do humans eat beef but do not become a cow, eat fish but do not become fish?

When humans eat beef, we break it down into smaller units through mastication and digestion. First, the protein is broken down by enzymes (proteases) into shorter chains of amino acids in the stomach. Then the chains are broken down further into individual amino acids in the small intestine. Once these amino acids enter the bloodstream and are taken up by cells, they are assembled into new proteins according to the instructions in our DNA. Human DNA encodes different instructions than cow or fish DNA, so the amino acids are used to build human proteins, not a cow or a fish.

Why are there only 20 natural amino acids?

It may be an evolutionary mystery why almost all living things are built from the same 20 natural amino acids. These 20 amino acids serve as the building blocks of most proteins. Each one is specified by codons, 3-letter assemblies of nucleotide bases that the ribosome reads to build a protein following the DNA sequence. Reading 3 bases at once gives 4^3 = 64 possible codons, more than enough to encode 20 amino acids plus stop signals, and expansive enough for the making of diverse lifeforms.
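The 4^3 count can be checked by enumerating every triplet. The sketch below uses the RNA alphabet and the three stop codons of the standard genetic code; the variable names are just for illustration.

```python
from itertools import product

BASES = "ACGU"  # RNA bases read by the ribosome

# Every possible 3-base codon.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Stop codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}
sense_codons = [c for c in codons if c not in STOP_CODONS]

print(len(codons))        # total codons
print(len(sense_codons))  # codons that specify an amino acid
```

With 61 sense codons for only 20 amino acids, most amino acids are encoded by more than one codon.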

Can you make other non-natural amino acids? Design some new amino acids.

Yes, there are many non-natural amino acids. Designing a new amino acid requires us to keep the same chassis (the amino group, carboxyl group, and α-carbon) but redesign the ‘R-group’, the side chain of the amino acid, to alter its chemistry. One may attach an azide to the side chain so that it can form a strong bond, for stickiness or bio-glue. For experiments, some researchers also use non-natural fluorescent amino acids like acridonylalanine, which glow under microscopy or in photographs.

Where did amino acids come from before enzymes that make them, and before life started?

This might be related to assembly theory. Lee Cronin proposed that life is composed of different assemblies: energy, raw materials, and minerals interact in complex ways to scaffold amino acids and then longer chains. Gases and energy together can create amino acids. The Miller-Urey experiment used water, methane, ammonia, and hydrogen, with electric sparks as the energy source, to create amino acids.

If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

Left-handed. D-amino acids create a mirror image of the usual α-helix, because the building blocks, and therefore the structure, are completely mirrored.

Can you discover additional helices in proteins?

Yes. Since 2020, AlphaFold has allowed us to quickly predict how proteins fold, and it has revealed millions of protein structures in which new helices can be found.

Why are most molecular helices right-handed?

Because of chirality, most helices are not identical to their mirror image. As most amino acids are in the L-form (left-handed), the way they stack together most efficiently is by twisting to the right, where they can form stable bonds with enough room between each other.

Why do β-sheets tend to aggregate?

β-strands bond together via hydrogen bonds. The geometry is a pleated, zigzag sheet with side chains protruding from both faces.

What is the driving force for β-sheet aggregation?

They tend to aggregate because of their geometry: the hydrophobic faces can sandwich and stick together to hide from water. This exclusion of water (the hydrophobic effect) becomes the driving force for clumping.

Why do many amyloid diseases form β-sheets?
- Can you use amyloid β-sheets as materials?
Design a β-sheet motif that forms a well-ordered structure.

Part B: Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

Briefly describe the protein you selected and why you selected it.

I chose the GPR3 orphan G protein-coupled receptor in complex with dominant negative Gs (PDB: 8U8F) because I’m interested in its role in the brain. GPR3 is a class A orphan G protein-coupled receptor (GPCR) exhibiting broad expression across various brain regions including the hypothalamus, hippocampus, and cortex, as well as in peripheral tissues such as liver and ovary. It has a potential role in modulating a number of brain functions, including behavioral responses to stress, amyloid-beta peptide generation in neurons, and neurite outgrowth. For brains-on-chips research, I’m interested in different types of expression in the central nervous system and the brain.

Identify the amino acid sequence of your protein.

How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

There are four protein chains. Chain A: 372, Chain B: 339, Chain C: 58, Chain D: 384. The most frequent amino acid seems to be leucine. It is a sturdy, hydrophobic (water-hating) amino acid.

>8U8F_4|Chain D[auth R]|G-protein coupled receptor 3|Homo sapiens (9606)
NSTMKTIIALSYIFCLVFADYKDDDDLEVLFQGPAMWGAGSPLAWLSAGSGNVNVSSVGPAEGPTGPAAPLPSPKAWDVVLCISGTLVSCENALVVAIIVGTPAFRAPMFLLVGSLAVADLLAGLGLVLHFAAVFCIGSAEMSLVLVGVLAMAFTASIGSLLAITVDRYLSLYNALTYYSETTVTRTYVMLALVWGGALGLGLLPVLAWNCLDGLTTCGVVYPLSKNHLVVLAIAFFMVFGIMLQLYAQICRIVCRHAQQIALQRHLLPASHYVATRKGIATLAVVLGAFAACWLPFTVYCLLGDAHSPPLYTYLTLLPATYNSMINPIIYAFRNQDVQKVLWAVCCCCSSSKIPFRSRSPSDVPAGLEVLFQGPHHHHHHHHAAAFESR
>8U8F_3|Chain C[auth G]|Guanine nucleotide-binding protein G(I)/G(S)/G(O) subunit gamma-2|Homo sapiens (9606)
NTASIAQARKLVEQLKMEANIDRIKVSKAAADLMAYCEAHAKEDPLLTPVPASENPFR
>8U8F_2|Chain B|Guanine nucleotide-binding protein G(I)/G(S)/G(T) subunit beta-1|Homo sapiens (9606)
QSELDQLRQEAEQLKNQIRDARKACADATLSQITNNIDPVGRIQMRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGKLIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNYVACGGLDNICSIYNLKTREGNVRVSRELAGHTGYLSCCRFLDDNQIVTSSGDTTCALWDIETGQQTTTFTGHTGDVMSLSLAPDTRLFVSGACDASAKLWDVREGMCRQTFTGHESDINAICFFPNGNAFATGSDDATCRLFDLRADQELMTYSHDNIICGITSVSFSKSGRLLLAGYDDFNCNVWDALKADRAGVLAGHDNRVSCLGVTDDGMAVATGSWDSFLKIWN
>8U8F_1|Chain A|Guanine nucleotide-binding protein G(s) subunit alpha isoforms short|Homo sapiens (9606)
MGCLGNSKTEDQRNEEKAQREANKKIEKQLQKDKQVYRATHRLLLLGAGESGKNTIVKQMRILHVNGFNGEGGEEDPQAARSNSDGEKATKVQDIKNNLKEAIETIVAAMSNLVPPVELANPENQFRVDYILSVMNVPDFDFPPEFYEHAKALWEDEGVRACYERSNEYQLIDCAQYFLDKIDVIKQADYVPSDQDLLRCRVLTSGIFETKFQVDKVNFHMFDVGAQRDERRKWIQCFNDVTAIIFVVASSSYNMVIREDNQTNRLQAALKLFDSIWNNKWLRDTSVILFLNKQDLLAEKVLAGKSKIEDYFPEFARYTTPEDATPEPGEDPRVTRAKYFIRDEFLRISTASGDGRHYCYPHFTCSVDTENIRRVFNDCRDIIQRMHLRQYELL
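Residue frequencies like the one reported above can be counted with the standard library. The sketch below is not the course Colab notebook; `residue_frequencies` is a hypothetical helper, shown here on the short chain C (G gamma-2) sequence from the FASTA listing. The same call works on any of the other chains.

```python
from collections import Counter


def residue_frequencies(sequence):
    """Return (residue, count, fraction) tuples, most frequent first."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    counts = Counter(seq)
    total = len(seq)
    return [(aa, n, n / total) for aa, n in counts.most_common()]


# Chain C of 8U8F, copied from the FASTA record above.
chain_c = "NTASIAQARKLVEQLKMEANIDRIKVSKAAADLMAYCEAHAKEDPLLTPVPASENPFR"

print("length:", len(chain_c))
for aa, n, frac in residue_frequencies(chain_c)[:3]:
    print(f"{aa}: {n} ({frac:.1%})")
```

Each chain has its own composition, so the most frequent residue in this small soluble subunit need not match the most frequent residue in the receptor chain.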

How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

There are thousands of homologs, including those in human, pygmy chimpanzee, olive baboon, cotton-top tamarin, etc. The protein seems highly conserved across species.

Does your protein belong to any protein family?

G Protein-Coupled Receptor (GPCR) Family

Identify the structure page of your protein in RCSB
- When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)
The structure was solved around September 2023 and released in March 2024. The method is electron microscopy, but the resolution is 3.49 Å.
- Are there any other molecules in the solved structure apart from protein?
Yes, I see palmitic acid in the structure apart from protein.
- Does your protein belong to any structure classification family?
It is a membrane protein and falls under the 7-transmembrane receptor (GPCR) class.
Open the structure of your protein in any 3D molecule visualization software:
- PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
- Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
Cartoon Ribbon Ball and stick
- Color the protein by secondary structure. Does it have more helices or sheets? It has a lot more helices than sheets.
- Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
I used an additional script to color the residues by a hydrophobicity scale. Hydrophobic residues are red and hydrophilic (polar/charged) residues are white. The protein appears slightly more hydrophobic than hydrophilic.

# Source: https://pymolwiki.org/index.php/Color_h
from pymol import cmd

def color_h(selection='all'):
        s = str(selection)
        print(s)
        cmd.set_color('color_ile',[0.996,0.062,0.062])
        cmd.set_color('color_phe',[0.996,0.109,0.109])
        cmd.set_color('color_val',[0.992,0.156,0.156])
        cmd.set_color('color_leu',[0.992,0.207,0.207])
        cmd.set_color('color_trp',[0.992,0.254,0.254])
        cmd.set_color('color_met',[0.988,0.301,0.301])
        cmd.set_color('color_ala',[0.988,0.348,0.348])
        cmd.set_color('color_gly',[0.984,0.394,0.394])
        cmd.set_color('color_cys',[0.984,0.445,0.445])
        cmd.set_color('color_tyr',[0.984,0.492,0.492])
        cmd.set_color('color_pro',[0.980,0.539,0.539])
        cmd.set_color('color_thr',[0.980,0.586,0.586])
        cmd.set_color('color_ser',[0.980,0.637,0.637])
        cmd.set_color('color_his',[0.977,0.684,0.684])
        cmd.set_color('color_glu',[0.977,0.730,0.730])
        cmd.set_color('color_asn',[0.973,0.777,0.777])
        cmd.set_color('color_gln',[0.973,0.824,0.824])
        cmd.set_color('color_asp',[0.973,0.875,0.875])
        cmd.set_color('color_lys',[0.899,0.922,0.922])
        cmd.set_color('color_arg',[0.899,0.969,0.969])
        cmd.color("color_ile","("+s+" and resn ile)")
        cmd.color("color_phe","("+s+" and resn phe)")
        cmd.color("color_val","("+s+" and resn val)")
        cmd.color("color_leu","("+s+" and resn leu)")
        cmd.color("color_trp","("+s+" and resn trp)")
        cmd.color("color_met","("+s+" and resn met)")
        cmd.color("color_ala","("+s+" and resn ala)")
        cmd.color("color_gly","("+s+" and resn gly)")
        cmd.color("color_cys","("+s+" and resn cys)")
        cmd.color("color_tyr","("+s+" and resn tyr)")
        cmd.color("color_pro","("+s+" and resn pro)")
        cmd.color("color_thr","("+s+" and resn thr)")
        cmd.color("color_ser","("+s+" and resn ser)")
        cmd.color("color_his","("+s+" and resn his)")
        cmd.color("color_glu","("+s+" and resn glu)")
        cmd.color("color_asn","("+s+" and resn asn)")
        cmd.color("color_gln","("+s+" and resn gln)")
        cmd.color("color_asp","("+s+" and resn asp)")
        cmd.color("color_lys","("+s+" and resn lys)")
        cmd.color("color_arg","("+s+" and resn arg)")
cmd.extend('color_h',color_h)

def color_h2(selection='all'):
        s = str(selection)
        print(s)
        cmd.set_color("color_ile2",[0.938,1,0.938])
        cmd.set_color("color_phe2",[0.891,1,0.891])
        cmd.set_color("color_val2",[0.844,1,0.844])
        cmd.set_color("color_leu2",[0.793,1,0.793])
        cmd.set_color("color_trp2",[0.746,1,0.746])
        cmd.set_color("color_met2",[0.699,1,0.699])
        cmd.set_color("color_ala2",[0.652,1,0.652])
        cmd.set_color("color_gly2",[0.606,1,0.606])
        cmd.set_color("color_cys2",[0.555,1,0.555])
        cmd.set_color("color_tyr2",[0.508,1,0.508])
        cmd.set_color("color_pro2",[0.461,1,0.461])
        cmd.set_color("color_thr2",[0.414,1,0.414])
        cmd.set_color("color_ser2",[0.363,1,0.363])
        cmd.set_color("color_his2",[0.316,1,0.316])
        cmd.set_color("color_glu2",[0.27,1,0.27])
        cmd.set_color("color_asn2",[0.223,1,0.223])
        cmd.set_color("color_gln2",[0.176,1,0.176])
        cmd.set_color("color_asp2",[0.125,1,0.125])
        cmd.set_color("color_lys2",[0.078,1,0.078])
        cmd.set_color("color_arg2",[0.031,1,0.031])
        cmd.color("color_ile2","("+s+" and resn ile)")
        cmd.color("color_phe2","("+s+" and resn phe)")
        cmd.color("color_val2","("+s+" and resn val)")
        cmd.color("color_leu2","("+s+" and resn leu)")
        cmd.color("color_trp2","("+s+" and resn trp)")
        cmd.color("color_met2","("+s+" and resn met)")
        cmd.color("color_ala2","("+s+" and resn ala)")
        cmd.color("color_gly2","("+s+" and resn gly)")
        cmd.color("color_cys2","("+s+" and resn cys)")
        cmd.color("color_tyr2","("+s+" and resn tyr)")
        cmd.color("color_pro2","("+s+" and resn pro)")
        cmd.color("color_thr2","("+s+" and resn thr)")
        cmd.color("color_ser2","("+s+" and resn ser)")
        cmd.color("color_his2","("+s+" and resn his)")
        cmd.color("color_glu2","("+s+" and resn glu)")
        cmd.color("color_asn2","("+s+" and resn asn)")
        cmd.color("color_gln2","("+s+" and resn gln)")
        cmd.color("color_asp2","("+s+" and resn asp)")
        cmd.color("color_lys2","("+s+" and resn lys)")
        cmd.color("color_arg2","("+s+" and resn arg)")
cmd.extend('color_h2',color_h2)
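The visual impression from the coloring can be cross-checked with a number. The sketch below computes the average hydropathy (GRAVY) of a sequence using the Kyte-Doolittle scale. Note that this is a different scale from the one behind the Color_h script, swapped in here as an assumption because its values are widely tabulated; `gravy` and `hydrophobic_fraction` are hypothetical helper names.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def _values(sequence):
    """Hydropathy values for the standard residues in a sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return values


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    values = _values(sequence)
    return sum(values) / len(values)


def hydrophobic_fraction(sequence):
    """Fraction of residues with a positive Kyte-Doolittle value."""
    values = _values(sequence)
    return sum(v > 0 for v in values) / len(values)


# Two short fragments taken from the chain D sequence listed earlier.
helix_like = "LVLVGVLAMAF"  # stretch rich in Leu/Val/Ala
tag_like = "DYKDDDD"        # charged stretch near the N-terminus

print(round(gravy(helix_like), 2), round(hydrophobic_fraction(helix_like), 2))
print(round(gravy(tag_like), 2), round(hydrophobic_fraction(tag_like), 2))
```

A positive GRAVY for the whole receptor chain would be consistent with the "slightly more hydrophobic" reading, which is what one would expect for a protein that spans the membrane.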

Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Yes, it appears to have a hole in the middle.
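The representations asked for in this part (cartoon, ribbon, ball and stick, surface, secondary-structure coloring) map onto a handful of PyMOL commands. The sketch below only builds the command text, so it runs without PyMOL installed; the lines it prints can be pasted into the PyMOL command line or saved as a `.pml` script. The helper name `pymol_commands`, the color choices, and the radius and transparency values are assumptions for illustration.

```python
# PyMOL command templates for each representation; {sel} is a selection.
REPRESENTATIONS = {
    "cartoon": ["show cartoon, {sel}"],
    "ribbon": ["show ribbon, {sel}"],
    "ball_and_stick": [
        "show sticks, {sel}",
        "show spheres, {sel}",
        "set stick_radius, 0.15",
        "set sphere_scale, 0.25",
    ],
    "surface": ["show surface, {sel}", "set transparency, 0.4"],
}

# Color helices, sheets, and loops differently.
SECONDARY_STRUCTURE_COLORS = [
    "color red, ({sel}) and ss h",
    "color yellow, ({sel}) and ss s",
    "color green, ({sel}) and ss l+''",
]


def pymol_commands(pdb_id, representation, selection="all", color_by_ss=False):
    """Build a list of PyMOL commands for one view of a PDB entry."""
    if representation not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {representation}")
    templates = list(REPRESENTATIONS[representation])
    if color_by_ss:
        templates += SECONDARY_STRUCTURE_COLORS
    commands = [f"fetch {pdb_id}", "hide everything"]
    commands += [t.format(sel=selection) for t in templates]
    return commands


# Receptor chain only, as a cartoon colored by secondary structure.
print("\n".join(pymol_commands("8u8f", "cartoon", "chain D", color_by_ss=True)))
```

Switching the second argument to `"surface"` gives a semi-transparent surface, which makes it easier to see whether a pocket opens into the middle of the helical bundle.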

Part C. Using ML-Based Protein Design Tools

Assignees for the following sections

MIT/Harvard students	Required
Committed Listeners	Required

In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein.

Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.
Choose your favorite protein from the PDB.
We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

Picture Source: Bordin, Nicola et al (2023). Novel machine learning approaches revolutionize protein knowledge. Trends in Biochemical Sciences, Volume 48, Issue 4, 345 - 359

Deep Mutational Scans
1. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
Latent Space Analysis
1. Use the provided sequence dataset to embed proteins in reduced dimensionality.
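For the deep mutational scan, one common way to turn language-model likelihoods into mutation scores is a log-likelihood ratio: the log-probability of the mutant residue minus that of the wild-type residue at the same position. The sketch below shows only that bookkeeping. It does not call ESM2; `toy_log_probs` is a made-up stand-in for the per-position probabilities a model would return, and the notebook's actual scoring may differ.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutation_scores(sequence, log_probs):
    """Score every single-point mutant as log p(mutant) - log p(wild type).

    log_probs[i][aa] is the model's log-probability of residue aa at
    position i. Keys of the result look like 'M1A' (wild type, position, mutant).
    """
    scores = {}
    for i, wt in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue
            scores[f"{wt}{i + 1}{aa}"] = log_probs[i][aa] - log_probs[i][wt]
    return scores


def toy_log_probs(sequence, wt_prob=0.62):
    """Hypothetical model output: wild type gets wt_prob, the rest share the remainder."""
    other = (1.0 - wt_prob) / (len(AMINO_ACIDS) - 1)
    return [
        {aa: math.log(wt_prob if aa == wt else other) for aa in AMINO_ACIDS}
        for wt in sequence
    ]


seq = "MKT"
scores = mutation_scores(seq, toy_log_probs(seq))
print(len(scores), "single-point mutants scored")
print("M1A:", round(scores["M1A"], 3))
```

Arranged as a positions-by-residues grid, these scores form the heat map of a scan: values near or above zero mark substitutions the model considers tolerable, and strongly negative values mark ones it considers unlikely.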

C2. Protein Folding

Picture Source: Lin et al (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model.

Folding a protein

Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

I tried changing small snippets of the sequence and the effect on the predicted structure was barely visible, but inserting longer runs of the same amino acid made the changes in the twists more visible.

C3. Protein Generation

Picture Source: 1. Post from Sergey Ovchinnikov 2. Roney, Ovchinnikov et al (2022). State-of-the-art estimation of protein model accuracy using AlphaFold. Phys. Rev. Lett. 129, 238101

Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN

Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

Using fixed-backbone design, we kept the 3D shape of 8U8F Chain A and reskinned it with a new sequence. ProteinMPNN ended up rewriting about 75% of the protein (sequence recovery of roughly 23-29%), with a high frequency of leucine and lysine. My results look like:

Model weights found in ProteinMPNN/vanilla_model_weights
Using device: cuda:0
Number of edges: 48
Training noise level: 0.2A
Model loaded
{'8u8f': (['A'], [])}
Length of chain A is 381
Generating sequences...
>8u8f, score=2.1622, fixed_chains=[], designed_chains=['A'], model_name=v_48_020
NEEKAQREANKKIEKQLQKDKQVYRATHRLLLLGAGESGKNTIVKQMRIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSGIFETKFQVDKVNFHMFDVGAQRDERRKWIQCFNDVTAIIFVVASSSYXXXXXXXXQTNRLQAALKLFDSIWNNKWLRDTSVILFLNKQDLLAEKVLAGKSKIEDYFPEFARYTTPEDATPEPGEDPRVTRAKYFIRDEFLRISTASGDGRHYCYPHFTCSVDTENIRRVFNDCRDIIQRMHLRQYELL
>T=0.1, sample=0, score=1.0949, seq_recovery=0.2511
ELLKLLEELLKKLAEKLKKEEEEEKKIKKILLLGSPSSGKTTLLKNIKKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEVVEFTIDGKKYKIYDLKNQPPDLREVLAKYKDAKVIIYVFPLGSFXXXXXXXXPEDLEKVALEELEWIWNHPDLKNVPILVIFNRPELLRERVLSGKNPIEERFPEYKGYELPKEVKPPEGVPEEWVKVLAFIIDKILKFANKNRGGIREVYPVISSPESKDIKQIIYDAIKKAEERKKLIAEGKL
>T=0.1, sample=0, score=1.1122, seq_recovery=0.2338
LLLLLLLLLLLLLLVLLLLKLLEESKIKKLLLLGSPSSGKTSLLENIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPERVLEFEIDGVKYRIIDLSNLPPDLSDVLSEYSDCEIIIYVFSTGSYXXXXXXXXPEDLESVDLERLKWIWNHPALKNTPILVIFNRPELLAKRVLSGEKPIEERFPEYKGYKLPENVKPPPGVPEETVKVLSFLIDKVLEFANQNRGGIREVYPVISSVKSKEIKEIIYEAVKKAEERKKLIAQGLL
>T=0.1, sample=0, score=1.0975, seq_recovery=0.2554
KEEEKKKELEEKLKKEEEKKKEEEEKVIKLLLLGLPNSGKTTILENIKKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEVIEFEIEGKKYRIVDLKNLPPDLSEILEKYSDCKILVYIFPTGSFXXXXXXXXPENLEKEALELLKRIWNHPSLKNVPLLVIFNRAEKLKEIVLSGEKPIEEYFPEYKGYKLPESAKPPPNTDPEVVKVLSFLIDKILEYANQNRGGIRKVFPVISSPESKDIREIIYKAVKEAEERKKLIALGLL
>T=0.1, sample=0, score=1.1196, seq_recovery=0.2857
AALAEELAKKKALAALKKKEEEEESKVKKLLLLGGPSSGKTTLLENISKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSSIRELEFEIDGVKYKILDLENRPEDLSEILSEFKDCEIIIYVFPLGSFXXXXXXXXPENLLKKALEEFERIWNHPDLKDVPILVLFNRPELLKEKVLSGKKPLEEIFPEYKGWELPEDAKPPPNTPLEWVKALYFLKEKVLEIANKNRGGRREVFPFIVSPKSKDIKEIIYNAVKEAEKRKALIAAGLL
>T=0.1, sample=0, score=1.1445, seq_recovery=0.2381
LLLLLLLALLLALAALLAALAEEEKKVRKLLLLGLPNSGKTTLLKNISKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEILKFEIDGVKYEIKDLKNRPPDLSDILKEYSDCDIIIYVFPSGLFXXXXXXXXPENLEEVALEQLKNLLNNPDLKNVPILVLFNRPELLKKIVESGKRPLEEIFPEYKGYELPESAVCPPNTPLEWCKAIYFLIDKILEFANQNRGGISEVYPHITSPDSKDIKQIIYDAVKKAEERKKLIAAGKL

New Sequence: DKKIKKDDKKIIKDIKIIDDDDDIIHIIHKKKFRNRRFISSKKIMHIIYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYDNDDTTDESHCFIIWIHWCKIMPNNCKQDTKIWICITHHWTENKFREYYYYYYYYNDCKDITKDDKDVHVMGNCKIMTNHKTHEMQNDKKQDQTKRFIMNHDDQENDWIFWDKNIDTINNDFTNDDVTITKEHHCIHKIEMIMQFFHQDTWNTHRRNDRICHIPHHWCHIIDDQIIKHDFIK

============================================================
Summary
============================================================
Sequence 1: score=1.0949, recovery=25.11%
Sequence 2: score=1.1122, recovery=23.38%
Sequence 3: score=1.0975, recovery=25.54%
Sequence 4: score=1.1196, recovery=28.57%
Sequence 5: score=1.1445, recovery=23.81%
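The recovery values in the summary can be read as the fraction of designed positions that keep the native residue. The sketch below is one way to compute such a number by hand. It assumes that positions marked `X` (residues missing from the structure) are skipped, which may not match ProteinMPNN's own bookkeeping exactly; `sequence_recovery` is a hypothetical helper.

```python
def sequence_recovery(native, designed, skip="X"):
    """Fraction of compared positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must be the same length")
    pairs = [(n, d) for n, d in zip(native, designed) if n != skip and d != skip]
    if not pairs:
        raise ValueError("no comparable positions")
    matches = sum(n == d for n, d in pairs)
    return matches / len(pairs)


# First 20 residues of the native chain A and of the first designed sample above.
native_prefix = "NEEKAQREANKKIEKQLQKD"
designed_prefix = "ELLKLLEELLKKLAEKLKKE"
print(f"{sequence_recovery(native_prefix, designed_prefix):.0%}")

# Made-up toy pair showing that X positions are ignored.
print(sequence_recovery("MKTAXXLLG", "MRTLXXLVG"))
```

A recovery near 25% means about three out of four positions were changed, which matches the "rewriting 75% of the protein" reading.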

GPU acceleration did not work for me in Google Colab, so I cloned the notebook to run it locally.

Input this sequence into ESMFold and compare the predicted structure to your original.

DKKIKKDDKKIIKDIKIIDDDDDIIHIIHKKKFRNRRFISSKKIMHIIYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYDNDDTTDESHCFIIWIHWCKIMPNNCKQDTKIWICITHHWTENKFREYYYYYYYYNDCKDITKDDKDVHVMGNCKIMTNHKTHEMQNDKKQDQTKRFIMNHDDQENDWIFWDKNIDTINNDFTNDDVTITKEHHCIHKIEMIMQFFHQDTWNTHRRNDRICHIPHHWCHIIDDQIIKHDFIK

The predicted structure has retained the overall fold, but when compared in PyMOL, the new structure (white) looks displaced from the original.
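A structure that looks displaced in a viewer is not necessarily a different fold: a predicted model and a PDB entry usually sit in different coordinate frames until they are superimposed. The sketch below illustrates this with made-up coordinates. It removes only the translation by centering both point sets, which is an assumption made to keep the example short; a full superposition also needs an optimal rotation, which is what PyMOL's `align` and `super` commands compute.

```python
import math


def centroid(coords):
    """Mean x, y, z of a list of 3D points."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))


def center(coords):
    """Translate points so that their centroid sits at the origin."""
    c = centroid(coords)
    return [tuple(x - cx for x, cx in zip(p, c)) for p in coords]


def rmsd(a, b):
    """Root-mean-square deviation between two equally sized point lists."""
    if len(a) != len(b):
        raise ValueError("coordinate lists must be the same length")
    squared = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(squared / len(a))


# Made-up C-alpha-like coordinates and a copy shifted by a fixed offset.
original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
shift = (10.0, -5.0, 2.0)
predicted = [tuple(x + s for x, s in zip(p, shift)) for p in original]

raw_rmsd = rmsd(original, predicted)
centered_rmsd = rmsd(center(original), center(predicted))
print(f"before centering: {raw_rmsd:.2f} A")
print(f"after centering:  {centered_rmsd:.2f} A")
```

If the RMSD stays large after a proper alignment in PyMOL, the difference is in the fold itself and not just in where the model sits.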

Part D. Group Brainstorm on Bacteriophage Engineering

Assignees for the following sections

MIT/Harvard students	Optional
Committed Listeners	Required

Find a group of ~3–4 students
Read through the Phage Reading material listed under “Reading & Resources” below.
Review the Bacteriophage Final Project Goals for engineering the L Protein:
- Increased stability (easiest)
- Higher titers (medium)
- Higher toxicity of lysis protein (hard)
Brainstorm Session
- Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).
Our two goals are (1) optimizing the L protein’s binding affinity to E. coli to accelerate the lysis trigger, and (2) increasing the stability of the L protein, ensuring the protein is folded and integrated into the membrane to perform its function.
- Write a 1-page proposal (bullet points or short paragraphs) describing:
  - Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).
  We would like to use protein language models such as ESM2 in the Colab notebook to perform in silico mutagenesis. We will score single point mutations in the L protein sequence and try to identify mutations that the model considers more evolutionarily favorable. As in the assignment, I am interested in using ProteinMPNN to redesign and generate new sequences. Given the backbone structure of the L protein, this tool will help us generate alternative sequences that maintain the same fold but may have higher thermal stability, thereby supporting our goals. AlphaFold-Multimer was particularly interesting too, as it predicts 3D structures of protein complexes (co-folding multiple chains), which would let us examine a range of candidate complexes.
  - Why do you think those tools might help solve your chosen sub-problem?
  ProteinMPNN was very robust in developing sequences that fit a specific shape, so we expect it can help us increase protein stability. ESM2 allows us to scan many mutations at once, which lets us narrow down a direction far more quickly than we could in a wet lab setting.
  - Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).
  L protein is a membrane protein. Most standard protein models like AlphaFold-Multimer seem to be trained primarily on soluble proteins. The specific lipid-protein interactions required for lysis may not be fully captured, leading to “stable” designs that fail to insert into the membrane. In my assignment, I still don’t understand how the shape will fit, as the predicted structure seemed displaced.
  - Include a schematic of your pipeline.
- This resource may be useful: HTGAA Protein Engineering Tools
Each individually put your plan on your HTGAA website
- Include your group’s short plan for engineering a bacteriophage

Input an L protein sequence > use ESM2 to score mutations (the heat map should show us green-light vs. no-go directions in the sequence) > use ProteinMPNN to generate a skeleton template for core stability > add complexity via AlphaFold, predicting an interaction > use PyMOL to check shape and geometry > calculate a binding affinity score via Colab > select the best candidates!

Reading & Resources

Tools

HTGAA Protein Engineering Tools spreadsheet
NGLViewer: NGL Viewer is a collection of tools for web-based molecular graphics. WebGL is employed to display molecules like proteins and DNA/RNA with a variety of representations.
- Web application (really cool demos)
- Jupyter Widget Tutorial
PyMOL (https://pymol.org/edu/?q=educational): PyMOL is a user-sponsored molecular visualization system on an open-source foundation, maintained and distributed by Schrödinger.
- Practical PyMOL for Beginners
- Video Tutorials: Video 1 Video 2 (and tons more… just search “PyMOL tutorial” on YouTube).
- Cheat Sheet
- Advanced Cheat Sheet
Chimera: A highly extensible program for interactive visualization and analysis of molecular structures and related data, including density maps, supramolecular assemblies, sequence alignments, docking results, trajectories, and conformational ensembles.
- Chimera Tutorials
- Video Tutorials: Video 1 Video 2 (and tons more… just search “Chimera tutorial” in youtube).
VMD: A molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting
- VMD Tutorials
- Video Tutorials: Video 1 Video 2 (and tons more… you know the drill)
Foldseek: https://search.foldseek.com/search
