Week 4 HW: hw-protein-design-part-i
🐉 Project Objective: Bacteriophage Engineering
This document outlines the core learning experience and the collaborative framework designed to drive an optimized bacteriophage project.
1. Mastery of Basic Concepts
- Phage Biology: Understanding the lytic and lysogenic life cycles, and the structural modularity of viral components (Capsid, Tail, Baseplate).
- Synthetic Biology Framework: Introduction to the “Design-Build-Test-Learn” (DBTL) cycle in viral engineering.
- Therapeutic Potential: Exploring the role of phages in addressing antimicrobial resistance (AMR) and precision microbiome editing.
2. Amino Acid Structure & Biochemistry
- Chemical Taxonomy: Categorization of the 20 standard amino acids based on hydrophobicity, charge, and polarity.
- Side-Chain Interactions: Analyzing how hydrogen bonds, salt bridges, and disulfide bridges dictate protein stability.
- Conformational Constraints: Understanding the Ramachandran plot and the energetic landscape of protein folding.
3. 3D Protein Visualization & Analysis
- Software Proficiency: Hands-on training with professional-grade tools such as PyMOL, ChimeraX, or NGL Viewer.
- Structural Mapping: Visualizing surface electrostatic potentials, hydrophobicity, and potential binding pockets.
- Superimposition: Learning to align wild-type and mutant structures to assess structural deviations (RMSD).
4. Diversity of ML-based Design Tools
- Structure Prediction: Leveraging AlphaFold 3 or RoseTTAFold for high-accuracy 3D modeling of viral proteins.
- Fixed-Backbone Design: Using ProteinMPNN to redesign amino acid sequences for a specific structural scaffold.
- Generative Scaffolding: Implementing RFdiffusion for de novo design of receptor-binding motifs and functional binders.
- Sequence Modeling: Utilizing Protein Language Models (e.g., ESM-3) to predict the impact of specific mutations on protein function.
👩🦰 Part A: Fundamental Principles & Frontiers in Protein Engineering
This section covers fundamental inquiries into biochemistry, evolutionary biology, and structural protein design.
1. Quantitative Biochemistry: Amino Acids in Nutrition
Question: How many molecules of amino acids do you consume with a 500g piece of meat? (Assume an average amino acid mass of $\approx 100$ Daltons).
Answer:
- Step 1: Calculate the total mass of the protein. Meat is roughly 20% protein. $500\text{g} \times 0.20 = 100\text{g}$ of protein.
- Step 2: Determine the moles of amino acids. $100\text{g} / 100\text{g/mol} = 1\text{ mole}$.
- Step 3: Convert to molecules using Avogadro’s number. Result: $\approx 6.022 \times 10^{23}$ molecules.
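The three steps above can be reproduced as a short calculation. This is a minimal sketch: the 20% protein fraction and the 100 Da average residue mass are the rough assumptions stated in the answer, and the function name is only illustrative.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_molecules(meat_g, protein_fraction=0.20, avg_residue_mass_da=100.0):
    """Estimate how many amino acid molecules a piece of meat contains."""
    protein_g = meat_g * protein_fraction      # Step 1: grams of protein
    moles = protein_g / avg_residue_mass_da    # Step 2: 1 Da corresponds to 1 g/mol
    return moles * AVOGADRO                    # Step 3: moles -> molecules


n_molecules = amino_acid_molecules(500)
print(f"{n_molecules:.3e} amino acid molecules")
```

Changing `protein_fraction` or `avg_residue_mass_da` shows how sensitive the estimate is to the two assumptions.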
2. Biological Identity & Genetics
Question: Why do humans eat beef or fish without transforming into a cow or a fish? Would the pioneers of DNA (Sanger, Darwin, Mendel, Watson, Crick, and Franklin) be furious if they knew you asked this?
Answer: Digestion breaks down foreign proteins into their constituent monomers (individual amino acids). These building blocks are then reassembled according to your unique genetic blueprint encoded in your DNA. While the pioneers of genetics would likely be amused rather than furious, the question highlights the elegance of the Central Dogma: the information flows from your DNA, not from the food you ingest.
3. The Evolution of the 20 Natural Amino Acids
Question: Why are there only 20 natural amino acids?
Answer: The current set is the result of three major evolutionary stages:
- Primordial Foundation: The first ten amino acids provided the basic requirements for folding and catalysis at the origin of life.
- The Great Oxidation Event (~2.4 Gya): The rise of atmospheric oxygen allowed for the evolution of redox-active amino acids like Cysteine and Methionine.
- Translational Fidelity: The tRNA/aminoacyl-tRNA synthetase recognition system reached an evolutionary “frozen accident” state, ensuring the stable and universal use of these 20 building blocks.
4. Non-Natural Amino Acids (ncAAs) & Synthetic Design
Question: Can you design non-natural amino acids? What are some examples?
Answer: Using technologies like multiplex rare-codon recoding and engineered synthetases, we can now incorporate ncAAs for specific functions:
- Photoregulation: Azophenylalanine (AzoPhe)
- Bioorthogonal Chemistry: Azidohomoalanine (Aha), Tetrazine-Lysine (Tetrazine-Lys)
- Metal Coordination: Ferrocene-alanine (Fc-Ala)
- Smart Responsiveness: Spiropyran-alanine (Spiropyran-Ala), Phenylboronic acid leucine (PheB-Leu)
- Others: Diselenocysteine (SeCys), Fluorosulfate-tyrosine (Fluorosulfate-Tyr), Ethynyl-tryptophan (Ethynyl-Trp).
5. Prebiotic Origins
Question: Where did amino acids come from before life and enzymes existed?
Answer:
- Miller–Urey Reactions: Spark discharges in reducing atmospheres (CH₄, NH₃, H₂) produce Glycine and Alanine.
- Strecker Synthesis: Reaction of aldehydes, ammonia, and hydrogen cyanide (common in early Earth).
- Hydrothermal Vents: Alkaline vents provide mineral catalysts and temperature gradients to concentrate precursors.
- Extraterrestrial Delivery: Meteorites (e.g., Murchison) contain over 80 different amino acids, seeding early Earth with organic material.
6. Chirality & Helix Handedness
Question: If you make an α-helix using D-amino acids, what handedness would you expect?
Answer: Natural L-amino acids favor a right-handed α-helix. Due to the mirror-image relationship, a polymer made entirely of D-amino acids will form a left-handed α-helix. The hydrogen-bonding pattern remains the same, but the spatial orientation is inverted.
7. Diversity of Protein Helices
Question: What other types of helices exist in proteins beyond the α-helix?
Answer:
- 3₁₀-helix: A tighter coil defined by $i \to i+3$ hydrogen bonds ($10$-atom ring); typically found as short segments at the boundaries of α-helices.
- π-helix: A wider coil defined by $i \to i+5$ hydrogen bonds ($16$-atom ring); often appears as a functional bulge or “kink” within an α-helix to accommodate active sites.
- Polyproline helices (PPI & PPII): Stabilized by steric effects and ring puckering rather than intrachain H-bonds. PPII is left-handed and common in disordered regions; PPI is right-handed and much rarer in globular proteins.
- Left-handed α-helix: Thermodynamically unfavorable for L-amino acids; primarily found in short, specialized motifs or as isolated residues (often Glycine) in strained loops.
8. Stereochemical Dominance
Question: Why are most molecular helices right-handed?
Answer: This is driven by the principle of minimum energy. For L-amino acids, a right-handed twist allows side chains to project outward with minimal steric crowding. A left-handed twist with L-amino acids would force side chains into energetically unfavorable positions, leading to instability.
9. Mechanisms of β-sheet Aggregation
Question: Why do β-sheets tend to aggregate, and what is the driving force?
Answer: β-sheets feature “open” edge strands with unfulfilled hydrogen-bonding potential. The primary driving force is the hydrophobic effect (reducing water exposure of non-polar side chains), which is further amplified by a repetitive, extended geometry that facilitates cooperative, “runaway” inter-strand H-bonding.
10. Amyloids: Disease & Materials
Question: Why do amyloid diseases form β-sheets, and can they be used as materials?
Answer: Amyloids, associated with diseases such as Alzheimer’s and Parkinson’s, result from proteins misfolding into hyper-stable, fibrillar β-sheets.
As materials: Yes. Due to their extreme mechanical and chemical stability, engineered amyloids are used for:
- Nanotech: Nanowires and templates.
- Biomedicine: Drug delivery scaffolds and antimicrobial coatings.
- Industry: High-strength adhesives and hydrogels.
11. Motif Design
Question: Design a β-sheet motif that forms a well-ordered structure.
Answer: The hexapeptide VQIVYK (from the Tau protein) is a classic model. Its sequence (Val-Gln-Ile-Val-Tyr-Lys) promotes highly ordered cross-β structures through tightly packed steric zippers and balanced hydrophobic/polar interactions.
👨🦰 Part B: Protein Structural Analysis & Visualization
Overview
In this section, you will leverage online bioinformatics databases (e.g., PDB, UniProt) and 3D visualization software (e.g., PyMOL, ChimeraX) to explore the molecular architecture of a protein.
Task: Select a protein with a resolved 3D structure and provide the following details:
1. Protein Selection & Rationale
Selected Protein: The Light-Harvesting Complex II - Photosystem II (LHCII-PSII) Supercomplex
Rationale: The LHCII-PSII supercomplex serves as the primary machinery for solar energy conversion in plants and green algae. As the “engine” of photosynthesis, it orchestrates the intricate processes of light absorption, excitation energy transfer, and charge separation.
Selecting this complex is driven by its multi-faceted importance:
- Fundamental Biology: It represents the pinnacle of biological energy transduction and quantum efficiency.
- Agricultural Innovation: Understanding its structural bottlenecks is key to optimizing photosynthetic efficiency and crop yields.
- Sustainable Energy: It provides a natural blueprint for the development of bio-inspired solar cells and artificial photosynthetic systems.
2. Primary Structure: Subunit Selection and Sequence
Note: As the LHCII-PSII is a massive multi-subunit supercomplex, this analysis focuses on the D1 Reaction Center Protein, the functional heart of the complex.
Selected Subunit: PsbA (Photosystem II Reaction Center Protein D1)
Source Organism: Arabidopsis thaliana
UniProt ID: P83755
Biological Rationale: Photosystem II (PSII) is a light-driven water:plastoquinone oxidoreductase that uses light energy to abstract electrons from $H_2O$, generating $O_2$ and a proton gradient subsequently used for ATP formation. It consists of a core antenna complex that captures photons and an electron transfer chain that converts photonic excitation into charge separation. The D1/D2 (PsbA/PsbD) reaction center heterodimer is critical, as it binds P680, the primary electron donor of PSII, as well as several subsequent electron acceptors.
[ FASTA SEQUENCE ]
>sp|P83755|PSBA_ARATH Photosystem II protein D1 OS=Arabidopsis thaliana OX=3702 GN=psbA PE=1 SV=2
MTAIL ERRES ESLWG RFCNW ITSTE NRLYI GWFGV LMIPT LLTAT SVFII AFIAA PPVDI
DGIRE PVSGS LLYGN NIISG AIIPT SAAIG LHFYP IWEAA SVDEW LYNGG PYELI VLHFL
LGVAC YMGRE WELSF RLGMR PWIAV AYSAP VAAAT AVFLI YPIGQ GSFSD GMPLG ISGTF
NFMIV FQAEH NILMH PFHML GVAGV FGGSL FSAMH GSLVT SSLIR ETTEN ESANE GYRFG
QEEET YNIVA AHGYF GRLIF QYASF NNSRS LHFFL AAWPV VGIWF TALGI STMAF NLNGF
NFNQS VVDSQ GRVIN TWADI INRAN LGMEV MHERN AHNFP LDLAA VEAPS TNG
How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids (for example, with Python’s collections.Counter).
sp|P83755|PSBA_ARATH Photosystem II protein D1
Length: 353 amino acids
Most frequent: A (35 times, 9.9%)
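The length and residue frequencies reported above can be checked with a few lines of Python. This is a minimal sketch using `collections.Counter`; the sequence is copied from the FASTA record above, and the spaces are treated as display grouping only.

```python
from collections import Counter

# D1 sequence as displayed above; whitespace is formatting, not sequence.
GROUPED_SEQUENCE = """
MTAIL ERRES ESLWG RFCNW ITSTE NRLYI GWFGV LMIPT LLTAT SVFII AFIAA PPVDI
DGIRE PVSGS LLYGN NIISG AIIPT SAAIG LHFYP IWEAA SVDEW LYNGG PYELI VLHFL
LGVAC YMGRE WELSF RLGMR PWIAV AYSAP VAAAT AVFLI YPIGQ GSFSD GMPLG ISGTF
NFMIV FQAEH NILMH PFHML GVAGV FGGSL FSAMH GSLVT SSLIR ETTEN ESANE GYRFG
QEEET YNIVA AHGYF GRLIF QYASF NNSRS LHFFL AAWPV VGIWF TALGI STMAF NLNGF
NFNQS VVDSQ GRVIN TWADI INRAN LGMEV MHERN AHNFP LD LAA VEAPS TNG
"""


def residue_frequencies(grouped):
    """Strip whitespace and count each one-letter residue code."""
    sequence = "".join(grouped.split())
    return sequence, Counter(sequence)


sequence, counts = residue_frequencies(GROUPED_SEQUENCE)
residue, n = counts.most_common(1)[0]
print(f"Length: {len(sequence)} amino acids")
print(f"Most frequent: {residue} ({n} times, {100 * n / len(sequence):.1f}%)")
```

`counts.most_common(5)` lists the next most frequent residues if a fuller composition table is wanted.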
How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
Does your protein belong to any protein family?
Sequence similarities: Belongs to the reaction center PufL/M/PsbA/D family (UniRule annotation).
Keywords (Domain): Transmembrane, Transmembrane helix

3. Identify the structure page of your protein in RCSB
When was the structure solved? Is it a good quality structure? A good-quality structure is one with good resolution; the smaller the value, the better.
Resolution: 2.70 Å
The structure comes from a landmark study published in Nature Plants (van Bezouwen et al., 2017). It used what was then state-of-the-art cryo-electron microscopy (cryo-EM) to reveal, for the first time at a near-atomic level, the detailed structure of the C2S2M2-type photosystem II (PSII) supercomplex from a higher plant (Arabidopsis thaliana).
van Bezouwen, L. S., Caffarri, S., Kale, R. S., Kouřil, R., Thunnissen, A. M. W., Oostergetel, G. T., & Boekema, E. J. (2017). Subunit and chlorophyll organization of the plant photosystem II supercomplex. Nature Plants, 3(7), 1-11.

Are there any other molecules in the solved structure apart from protein?
Yes. Apart from proteins, the structure contains:
- Numerous pigments: Including chlorophylls (Chls) and pheophytins.
- Metal clusters: Most notably the $Mn_4CaO_5$ water-splitting center.
- Lipids: Which provide structural integrity within the thylakoid membrane.
- Quinones: For electron transport.
Does your protein belong to any structure classification family?
The protein belongs to the PsbA/PsbD family, characterized by a 5-transmembrane-helix fold. It forms a heterodimer with the D2 protein (PsbD), providing the structural scaffold for essential cofactors like the special pair chlorophylls and the $Mn_4CaO_5$ cluster.
Open the structure of your protein in any 3D molecule visualization software:
PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
Documentation: Importing D1 Protein (P83755) into PyMOL

Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
hide all
show sticks
show spheres
set sphere_scale, 0.25
PyMOL Color Customization
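Standard colors can be assigned to alpha-helices, beta-sheets, and loops by coloring PyMOL’s secondary-structure selections (ss h for helices, ss s for sheets, and ss l+'' for loops and unassigned residues).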

Color the protein by secondary structure. Does it have more helices or sheets?
PyMOL Analysis: Secondary Structure Distribution
To determine the composition of the protein structure, we performed secondary structure coloring and residue counting using PyMOL’s internal Python API.
1. Visual Inspection (Coloring)
Run this command to visually distinguish the secondary structures:
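One way to combine the coloring with the residue counting mentioned above is sketched below. It is not the original script: the function name and color choices are assumptions, and only `cmd.color` and `cmd.count_atoms` from PyMOL's Python API are used. A small stand-in object (hypothetical, for illustration) lets the sketch run outside PyMOL.

```python
def color_and_count_secondary_structure(cmd, selection="all"):
    """Color helices, sheets, and loops, then count residues in each class.

    `cmd` is expected to behave like `pymol.cmd`; only `color` and
    `count_atoms` are called.
    """
    classes = {
        "helix": ("red", "ss h"),
        "sheet": ("yellow", "ss s"),
        "loop": ("green", "ss l+''"),  # loops plus residues with no assignment
    }
    counts = {}
    for name, (color, ss_selection) in classes.items():
        target = f"({selection}) and {ss_selection}"
        cmd.color(color, target)
        # One C-alpha atom per residue, so this counts residues, not atoms.
        counts[name] = cmd.count_atoms(f"{target} and name CA")
    return counts


class FakeCmd:
    """Hypothetical stand-in for pymol.cmd so the sketch runs anywhere."""

    def __init__(self):
        self.calls = []

    def color(self, color, selection):
        self.calls.append((color, selection))

    def count_atoms(self, selection):
        return 0  # a real PyMOL session returns the number of matching atoms


fake = FakeCmd()
ss_counts = color_and_count_secondary_structure(fake)
print(ss_counts)
```

Inside a PyMOL session, the same function would be called with the real API object (`from pymol import cmd`), and comparing the helix and sheet counts answers the "more helices or sheets?" question.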

Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
1. Automated Coloring Script (Custom Scheme)
PyMOL does not have a single “hydrophobic” command. We use a Python script to categorize and color residues based on their chemical properties.
| Property Type | Residue Count | Visual Representation |
|---|---|---|
| Hydrophobic (Non-polar) | 4,166 | ■ Red |
| Hydrophilic (Polar/Charged) | 3,170 | ■ Blue |
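A minimal sketch of such a categorization script follows. The split of the 20 residues into two classes is an assumption (several conventions exist, and Gly, Pro, Cys, and Tyr are often assigned differently), and the helper names are illustrative. It builds PyMOL `color` commands for the red/blue scheme in the table and tallies residues by class.

```python
# Assumed two-class grouping; adjust to the hydrophobicity scale you prefer.
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "GLY"}
HYDROPHILIC = {"SER", "THR", "CYS", "TYR", "ASN", "GLN",
               "ASP", "GLU", "LYS", "ARG", "HIS"}


def pymol_color_commands():
    """Return PyMOL command strings that color the two classes red and blue."""
    return [
        "color red, resn " + "+".join(sorted(HYDROPHOBIC)),
        "color blue, resn " + "+".join(sorted(HYDROPHILIC)),
    ]


def count_by_class(residue_names):
    """Tally three-letter residue names into hydrophobic/hydrophilic/other."""
    counts = {"hydrophobic": 0, "hydrophilic": 0, "other": 0}
    for name in residue_names:
        key = name.upper()
        if key in HYDROPHOBIC:
            counts["hydrophobic"] += 1
        elif key in HYDROPHILIC:
            counts["hydrophilic"] += 1
        else:
            counts["other"] += 1  # waters, ligands, cofactors, etc.
    return counts


commands = pymol_color_commands()
example_counts = count_by_class(["ALA", "LYS", "LEU", "HOH"])
print("\n".join(commands))
print(example_counts)
```

The generated command strings can be pasted into the PyMOL command line; the residue names for `count_by_class` would come from the loaded structure.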

PyMOL Analysis: Hydrophobicity Distribution
This document describes the process of coloring the protein by residue type to analyze the distribution of Hydrophobic vs. Hydrophilic residues.
Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?



💕Part C. Using ML-Based Protein Design Tools

Picture Source: Bordin, Nicola et al (2023). Novel machine learning approaches revolutionize protein knowledge. Trends in Biochemical Sciences, Volume 48, Issue 4, 345 - 359.
Structure of PSII-I prime
(PSII with Psb28 and Psb34)
📋 Metadata
| Attribute | Details |
|---|---|
| PDB DOI | 10.2210/pdb7NHQ/pdb |
| EM Map | EMD-12337 (EMDB EMDataResource) |
| Classification | PHOTOSYNTHESIS |
| Organism(s) | Thermosynechococcus vestitus BP-1 |
| Mutation(s) | None |
| Membrane Protein | Yes |
🧬 Databases & Cross-References
OPM
PDBTM
MemProtMD
mpstruc
📅 Deposition Details
- Deposited: 2021-02-11
- Released: 2021-05-05
- Funding: German Research Foundation (DFG)
Deposition Authors:
Zabret, J., Bohn, S., Schuller, S.K., Arnolds, O., Chan, A., Tajkhorshid, E., Stoll, R., Engel, B.D., Rudack, T., Schuller, J.M., Nowaczyk, M.M.
🧬 Target Sequence: Photosystem II Protein D1
Protein: Photosystem II protein D1 1
Source: Thermosynechococcus elongatus BP-1
PDB Reference: 7NHQ | Chain: A
>7NHQ_1|Chain A|Photosystem II protein D1 1|Thermosynechococcus elongatus BP-1 (197221)
MTTTLQRRESANLWERFCNWVTSTDNRLYVGWFGVIMIPTLLAATICFVIAFIAAPPVDIDGIREPVSGSLLYGNN
IITGAVVPSSNAIGLHFYPIWEAASLDEWLYNGGPYQLIIFHFLLGASCYMGRQWELSYRLGMRPWICVAYSAPLA
SAFAVFLIYPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHQLGVAGVFGGALFCAMHGSLVTSSLIRETT
ETESANYGYKFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLAAWPVVGVWFTALGISTMAFNLNGFNFNH
SVIDAKGNVINTWADIINRANLGMEVMHERNAHNFPLDLASAESAPVAMIAPSING
Deep Mutational Scanning: Mapping the Fitness Landscapes of Proteins
RESULTS
(21, 360) MTTTLQRRESANLWERFCNWVTSTDNRLYVGWFGVIMIPTLLAATICFVIAFIAAPPVDIDGIREPVSGSLLYGNN IITGAVVPSSNAIGLHFYPIWEAASLDEWLYNGGPYQLIIFHFLLGASCYMGRQWELSYRLGMRPWICVAYSAPLA SAFAVFLIYPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHQLGVAGVFGGALFCAMHGSLVTSSLIRETT ETESANYGYKFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLAAWPVVGVWFTALGISTMAFNLNGFNFNH SVIDAKGNVINTWADIINRANLGMEVMHERNAHNFPLDLASAESAPVAMIAPSING [‘T’, ‘T’, ‘T’, ‘L’, ‘Q’, ‘R’, ‘R’, ‘E’, ‘S’, ‘A’, ‘N’, ‘L’, ‘W’, ‘E’, ‘R’, ‘F’, ‘C’, ‘N’, ‘W’, ‘V’, ‘T’, ‘S’, ‘T’, ‘D’, ‘N’, ‘R’, ‘L’, ‘Y’, ‘V’, ‘G’, ‘W’, ‘F’, ‘G’, ‘V’, ‘I’, ‘M’, ‘I’, ‘P’, ‘T’, ‘L’, ‘L’, ‘A’, ‘A’, ‘T’, ‘I’, ‘C’, ‘F’, ‘V’, ‘I’, ‘A’, ‘F’, ‘I’, ‘A’, ‘A’, ‘P’, ‘P’, ‘V’, ‘D’, ‘I’, ‘D’, ‘G’, ‘I’, ‘R’, ‘E’, ‘P’, ‘V’, ‘S’, ‘G’, ‘S’, ‘L’, ‘L’, ‘Y’, ‘G’, ‘N’, ‘N’, ’ ‘, ‘I’, ‘I’, ‘T’, ‘G’, ‘A’, ‘V’, ‘V’, ‘P’, ‘S’, ‘S’, ‘N’, ‘A’, ‘I’, ‘G’, ‘L’, ‘H’, ‘F’, ‘Y’, ‘P’, ‘I’, ‘W’, ‘E’, ‘A’, ‘A’, ‘S’, ‘L’, ‘D’, ‘E’, ‘W’, ‘L’, ‘Y’, ‘N’, ‘G’, ‘G’, ‘P’, ‘Y’, ‘Q’, ‘L’, ‘I’, ‘I’, ‘F’, ‘H’, ‘F’, ‘L’, ‘L’, ‘G’, ‘A’, ‘S’, ‘C’, ‘Y’, ‘M’, ‘G’, ‘R’, ‘Q’, ‘W’, ‘E’, ‘L’, ‘S’, ‘Y’, ‘R’, ‘L’, ‘G’, ‘M’, ‘R’, ‘P’, ‘W’, ‘I’, ‘C’, ‘V’, ‘A’, ‘Y’, ‘S’, ‘A’, ‘P’, ‘L’, ‘A’, ’ ‘, ‘S’, ‘A’, ‘F’, ‘A’, ‘V’, ‘F’, ‘L’, ‘I’, ‘Y’, ‘P’, ‘I’, ‘G’, ‘Q’, ‘G’, ‘S’, ‘F’, ‘S’, ‘D’, ‘G’, ‘M’, ‘P’, ‘L’, ‘G’, ‘I’, ‘S’, ‘G’, ‘T’, ‘F’, ‘N’, ‘F’, ‘M’, ‘I’, ‘V’, ‘F’, ‘Q’, ‘A’, ‘E’, ‘H’, ‘N’, ‘I’, ‘L’, ‘M’, ‘H’, ‘P’, ‘F’, ‘H’, ‘Q’, ‘L’, ‘G’, ‘V’, ‘A’, ‘G’, ‘V’, ‘F’, ‘G’, ‘G’, ‘A’, ‘L’, ‘F’, ‘C’, ‘A’, ‘M’, ‘H’, ‘G’, ‘S’, ‘L’, ‘V’, ‘T’, ‘S’, ‘S’, ‘L’, ‘I’, ‘R’, ‘E’, ‘T’, ‘T’, ’ ‘, ‘E’, ‘T’, ‘E’, ‘S’, ‘A’, ‘N’, ‘Y’, ‘G’, ‘Y’, ‘K’, ‘F’, ‘G’, ‘Q’, ‘E’, ‘E’, ‘E’, ‘T’, ‘Y’, ‘N’, ‘I’, ‘V’, ‘A’, ‘A’, ‘H’, ‘G’, ‘Y’, ‘F’, ‘G’, ‘R’, ‘L’, ‘I’, ‘F’, ‘Q’, ‘Y’, ‘A’, ‘S’, ‘F’, ‘N’, ‘N’, ‘S’, ‘R’, ‘S’, ‘L’, ‘H’, ‘F’, ‘F’, ‘L’, ‘A’, ‘A’, ‘W’, ‘P’, ‘V’, ‘V’, ‘G’, ‘V’, ‘W’, ‘F’, ‘T’, ‘A’, ‘L’, ‘G’, ‘I’, ‘S’, ‘T’, ‘M’, ‘A’, ‘F’, ‘N’, ‘L’, ‘N’, ‘G’, ‘F’, ‘N’, ‘F’, ‘N’, ‘H’, ’ ‘, ‘S’, ‘V’, ‘I’, ‘D’, ‘A’, ‘K’, ‘G’, ‘N’, ‘V’, ‘I’, ‘N’, ‘T’, ‘W’, ‘A’, ‘D’, ‘I’, ‘I’, 
‘N’, ‘R’, ‘A’, ‘N’, ‘L’, ‘G’, ‘M’, ‘E’, ‘V’, ‘M’, ‘H’, ‘E’, ‘R’, ‘N’, ‘A’, ‘H’, ‘N’, ‘F’, ‘P’, ‘L’, ‘D’, ‘L’, ‘A’, ‘S’, ‘A’, ‘E’, ‘S’, ‘A’, ‘P’, ‘V’, ‘A’, ‘M’, ‘I’, ‘A’, ‘P’, ‘S’, ‘I’, ‘N’, ‘G’]

PsbA (D1 Protein) Sequence Analysis Report
1. Sequence Metadata
- Protein Identity: Photosystem II Reaction Center Protein A (PsbA / D1).
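- Input Length: 360 amino acid residues, matching the 21 × 360 output matrix. The pasted string additionally contains 4 placeholder spaces, which are formatting only and should be stripped before analysis.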
- Core Function: The heart of Photosystem II (PSII), responsible for harboring the electron transport chain and the Oxygen-Evolving Complex (OEC).
2. Physical Property Analysis
A. Hydrophobicity and Transmembrane Structure
The sequence exhibits hallmark characteristics of a multi-pass transmembrane protein:
- Hydrophobic Core: There are 5 highly hydrophobic regions (rich in L, V, I, F, W), corresponding to the five transmembrane $\alpha$-helices (TMH I-V).
- Aromatic Residue Distribution: High density of W (Tryptophan) and F (Phenylalanine). These residues are crucial for anchoring chlorophyll and pheophytin pigments within the membrane.
B. Charge and Electrostatic Environment
- Luminal Side (Loops): Specific D (Aspartic acid) and E (Glutamic acid) residues cluster spatially to create the coordination environment for the Manganese cluster ($Mn_4CaO_5$).
- Structural Flexibility: The distribution of P (Proline) and G (Glycine) defines the tilt angles of the helices and the flexibility of the loops between transmembrane segments.
3. Deep Mutational Scanning (DMS) Prediction Logic
Based on the unsupervised learning principles of ESM-2, the mutational pressure is primarily concentrated in the following hotspots:
| Key Residue | Amino Acid | Functional Significance | ESM-2 Prediction Trend |
|---|---|---|---|
| H198 | His | Ligand for P680 chlorophyll special pair | Extreme Penalty: Any mutation likely leads to total loss of RC function. |
| Y161 | Tyr | $Y_Z$ radical donor | High Penalty: $Y \to F$ is penalized as the loss of hydrogen bonding disrupts electron transfer. |
| D170 | Asp | Ligand for the Mn-cluster | Extreme Penalty: Loss of acidic side chain directly destroys the OEC. |
| TM Domains | L/V/I | Structural stability | Mid-High Penalty: Mutations to charged residues (D/E/R/K) cause misfolding. |
4. Technical Script for Sequence Analysis
The following code demonstrates how to handle your 360-unit list and simulate the logic for extracting data from a 21×360 ESM-2 matrix.
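A minimal sketch of that logic is given below, using only the standard library. It is a simulation, not real model output: the matrix is filled with random numbers, and the row order of the 21 tokens (`ALPHABET`) is an assumption, since the actual ordering depends on the tool that produced the matrix. The sequence is the 7NHQ chain A sequence shown earlier, with the placeholder spaces removed.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # assumed row order: 20 residues + 1 extra token

PASTED = (
    "MTTTLQRRESANLWERFCNWVTSTDNRLYVGWFGVIMIPTLLAATICFVIAFIAAPPVDIDGIREPVSGSLLYGNN "
    "IITGAVVPSSNAIGLHFYPIWEAASLDEWLYNGGPYQLIIFHFLLGASCYMGRQWELSYRLGMRPWICVAYSAPLA "
    "SAFAVFLIYPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHQLGVAGVFGGALFCAMHGSLVTSSLIRETT "
    "ETESANYGYKFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLAAWPVVGVWFTALGISTMAFNLNGFNFNH "
    "SVIDAKGNVINTWADIINRANLGMEVMHERNAHNFPLDLASAESAPVAMIAPSING"
)


def clean_sequence(pasted):
    """Drop spaces and any other non-letter characters."""
    return "".join(ch for ch in pasted if ch.isalpha())


def simulate_score_matrix(length, seed=0):
    """Random stand-in for a 21 x L matrix of per-position token scores."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in ALPHABET]


def mutation_score(matrix, sequence, mutation):
    """Score e.g. 'H198A' as score(mutant) - score(wild type) in that column."""
    wt, position, mut = mutation[0], int(mutation[1:-1]), mutation[-1]
    column = position - 1  # mutation labels are 1-based
    if sequence[column] != wt:
        raise ValueError(f"{mutation}: sequence has {sequence[column]} at {position}")
    return matrix[ALPHABET.index(mut)][column] - matrix[ALPHABET.index(wt)][column]


sequence = clean_sequence(PASTED)
matrix = simulate_score_matrix(len(sequence))
print(len(matrix), "x", len(matrix[0]))
for label in ("H198A", "Y161F", "D170A"):
    print(label, round(mutation_score(matrix, sequence, label), 3))
```

The wild-type check in `mutation_score` also confirms that the hotspot numbering in the table (H198, Y161, D170) lines up with this sequence. With real model output, only `simulate_score_matrix` would be replaced.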
5.c. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
Deep Mutational Scanning (DMS) Process
1. Library Construction
Scientists first use genetic engineering to create a diverse population of protein variants.
- Method: Using PCR (Polymerase Chain Reaction) or DNA synthesis, mutations are introduced at every position along the protein sequence.
- Result: A “library” containing millions of distinct DNA plasmids is generated, where each plasmid corresponds to a specific mutation.
2. Screening and Selection
This is the most critical stage, acting much like a “survival of the fittest” competition. Scientists assign a specific task to these proteins:
- For Antibiotic Resistance: Bacteria containing the mutant proteins are placed in petri dishes filled with antibiotics. The bacteria that survive carry beneficial mutations; those that die carry harmful ones.
- For Fluorescence (e.g., GFP): Scientists use FACS (Fluorescence-Activated Cell Sorting). The instrument scans every cell with rapid-fire precision, sorting those with strong fluorescence to one side and non-fluorescent ones to the other.
- For Binding Affinity: Much like a magnet, a target molecule is used to “pull” the proteins. Strong binders are retained, while weak ones are washed away.
3. High-Throughput Sequencing (NGS)
Once the “competition” ends, scientists must determine the winners.
- They use Next-Generation Sequencing (NGS) to count the abundance of each DNA variant both before and after the selection process.
- The Logic:
- Enrichment: If a specific mutation becomes more frequent after the competition, it indicates enhanced function.
- Depletion: If a mutation disappears after the competition, it indicates that the mutation was lethal or deleterious.
4. Data Transformation (The Score)
Finally, using computational algorithms, scientists convert the changes in sequencing frequency into a numerical value: the DMS Fitness Score.
Simplified Formula: $$F = \log\left(\frac{\text{Count}_{\text{post-selection}}}{\text{Count}_{\text{pre-selection}}}\right)$$
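The simplified formula can be sketched in Python as follows. The pseudocount is an added assumption (it avoids taking the log of zero when a variant disappears after selection), and the variant labels and read counts are made up for illustration. Real analyses usually also normalize by sequencing depth and by the wild-type variant, which this sketch omits.

```python
import math


def fitness_score(pre_count, post_count, pseudocount=0.5):
    """Return log(post / pre), with a small pseudocount added to both counts."""
    return math.log((post_count + pseudocount) / (pre_count + pseudocount))


# Hypothetical read counts: (pre-selection, post-selection).
read_counts = {
    "A12V": (100, 400),  # enriched
    "G45D": (100, 5),    # depleted
    "L7M": (100, 100),   # unchanged
}

scores = {variant: fitness_score(pre, post)
          for variant, (pre, post) in read_counts.items()}
for variant, score in scores.items():
    print(f"{variant}: {score:+.2f}")
```

Positive scores correspond to enrichment and negative scores to depletion, matching the logic described in stage 3.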
Guide: Comparing Protein Language Model Predictions with Experimental Data
To complete the “Prediction vs. Experiment” comparison task, the workflow is generally divided into three core stages: Data Preparation, AI Inference, and Statistical Analysis.
Step 1: Data Collection (Obtaining the “Ground Truth”)
You need a dataset containing mutations and their corresponding experimental scores.
- Access Databases: Visit ProteinGym or MaveDB.
- Download Data: Search for Deep Mutational Scanning (DMS) data in CSV format.
- Identify Key Columns: Ensure the table includes at least these two columns:
  - mutant: Mutation information (e.g., A12V indicates Alanine at position 12 mutated to Valine).
  - DMS_score: The functional score measured experimentally.
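A minimal sketch of reading such a table with the standard library is shown below. It assumes the two column names given above and handles single substitutions only; the CSV content is a made-up example, not real ProteinGym or MaveDB data.

```python
import csv
import io
import re

# Single substitution such as "A12V": wild type, 1-based position, mutant.
MUTANT_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutant(label):
    """Split 'A12V' into ('A', 12, 'V'); reject anything else."""
    match = MUTANT_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognized mutant label: {label!r}")
    wt, position, mut = match.groups()
    return wt, int(position), mut


def load_dms_rows(handle):
    """Read rows with 'mutant' and 'DMS_score' columns from an open file."""
    rows = []
    for row in csv.DictReader(handle):
        wt, position, mut = parse_mutant(row["mutant"])
        rows.append({"wt": wt, "position": position, "mut": mut,
                     "DMS_score": float(row["DMS_score"])})
    return rows


EXAMPLE_CSV = "mutant,DMS_score\nA12V,0.85\nG45D,-1.20\n"  # made-up values
rows = load_dms_rows(io.StringIO(EXAMPLE_CSV))
print(rows)
```

For a downloaded file, `load_dms_rows(open(path, newline=""))` would replace the in-memory example.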
Step 2: Model Setup (Configuring the Protein Language Model)
You need an AI model capable of scoring sequences.
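- Select Model: Meta’s open-source ESM-2 (e.g., esm2_t33_650M_UR50D) is recommended.
- Environment Setup: A Python environment with PyTorch and an ESM-2 implementation (for example, Meta’s fair-esm package).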
Step 3: Inference/Scoring (Generating AI Predictions)
This is the critical technical step to calculate the AI’s “preference” for specific mutations.
- Calculate Log-Likelihood Ratio:
- Input the Wild-type (WT) sequence into the model to obtain the probability distribution of amino acids at each position.
- Extract the probability of the mutated amino acid at that position, $P_{mut}$.
- Extract the probability of the original (wild-type) amino acid at that position, $P_{wt}$.
- Compute Score: $S = \log(P_{mut}) - \log(P_{wt})$.
- Save Results: Append the AI-calculated scores to your dataset as a new column named prediction_score.
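The scoring rule $S = \log(P_{mut}) - \log(P_{wt})$ can be sketched independently of any particular model. In the example below, `position_probs` is a hand-written stand-in for the per-position probability distributions that a protein language model would return for the wild-type sequence; the three-residue sequence and all probabilities are invented for illustration.

```python
import math


def log_likelihood_ratio(position_probs, wt_sequence, position, mutant_aa):
    """Return log P(mutant) - log P(wild type) at a 1-based position."""
    wt_aa = wt_sequence[position - 1]
    probs = position_probs[position - 1]
    return math.log(probs[mutant_aa]) - math.log(probs[wt_aa])


wt_sequence = "MAV"  # toy wild-type sequence
position_probs = [   # toy model output: one distribution per position
    {"M": 0.90, "A": 0.05, "V": 0.05},
    {"M": 0.10, "A": 0.60, "V": 0.30},
    {"M": 0.05, "A": 0.15, "V": 0.80},
]

# Score the A2V substitution: log(0.30) - log(0.60) = log(0.5).
prediction_score = log_likelihood_ratio(position_probs, wt_sequence, 2, "V")
print(f"A2V: {prediction_score:.3f}")
```

A negative score means the model considers the mutant residue less likely than the wild-type residue at that position.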
Step 4: Comparison & Analysis (Evaluation)
Use mathematical methods to assess the accuracy of AI predictions.
- Calculate Correlation (Spearman Correlation):
  - Use Python’s scipy library to calculate the Spearman Rank Correlation Coefficient between DMS_score and prediction_score.
  - Interpretation: A coefficient closer to 1 indicates high accuracy; a value near 0 suggests the AI is guessing randomly.
- Visualization:
  - Create a Scatter Plot: X-axis = Experimental Score, Y-axis = AI Prediction Score.
  - A diagonal distribution of points indicates a successful correlation.
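To make the evaluation step concrete, here is a dependency-free sketch of the Spearman rank correlation (Pearson correlation of average ranks). In practice `scipy.stats.spearmanr` does the same job, as suggested above; this version only illustrates what is being computed. The two score lists are made-up values.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


dms_score = [0.1, 0.4, 0.35, 0.8, 1.2]             # hypothetical experiment
prediction_score = [-3.0, -1.5, -2.0, -0.5, 0.2]   # hypothetical model scores
rho = spearman(dms_score, prediction_score)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks matter, the two score columns do not need to be on the same scale, which is why Spearman is preferred over Pearson for this comparison.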
Step 5: Bonus Report (Analysis & Interpretation)
Finally, interpret the comparison results:
- High Accuracy Cases: Highly conserved enzymes usually yield better predictions.
- Low Accuracy Cases: For instance, mutations on the protein surface might be deemed “low impact” by the AI due to evolutionary frequency, but experiments might show they significantly affect binding.
💡 Pro-Tip: Quick Start
If you want to start immediately without running the models yourself:
- Go directly to the ProteinGym website and download their pre-compiled Reference_Scores.csv.
- This file already provides Experimental Scores alongside Prediction Scores from various models (ESM, RoseTTAFold, etc.).
- You only need to use Python for data pivoting and plotting to complete this Bonus task.



Data Acquisition: ProteinGym DMS Substitutions
To obtain the experimental mutation data from ProteinGym, follow the steps below to download the dataset:
🧬 Part D: Phage MS2 L Protein Optimization
1. Selected Goals
- Primary: Increased Stability (Enhancing protein persistence).
- Secondary: Disrupt Interaction with E. coli DnaJ (Modulating lysis toxicity).
2. Proposed Computational Pipeline
- Step 1: ESM-2 for in silico Mutagenesis (Identifying stabilizing mutations).
- Step 2: AlphaFold 3 for Structural Folding (Verifying fold integrity).
- Step 3: AlphaFold-Multimer for Complex Modeling (Mapping the DnaJ interface).
- Step 4: Rosetta for Binding Affinity Estimation (Designing disruptive mutants).
3. Why These Tools?
- ESM-2 (PLM): Enables fast, zero-shot prediction of mutational effects based on evolutionary patterns.
- AlphaFold-Multimer: Provides high-accuracy prediction of protein-protein interaction (PPI) interfaces.
4. Potential Pitfalls
- Data Bias: Lack of phage-specific data in training sets may lead to lower prediction accuracy.
- Functional Trade-off: Increased stability might reduce protein flexibility required for lysis activity.
📊 Pipeline Schematic
graph TD
A[Wild-type L Protein] --> B(ESM-2 Mutation Scanning)
B --> C{Top Stable Candidates}
C --> D[AlphaFold 3: Folding Check]
C --> E[AF-Multimer: DnaJ Complex]
D --> F[Final Optimized Design]
E --> G[Identify Interaction Hotspots]
G --> H[Design Disruptive Mutants]
H --> F

