Subsections of 2026a-maliek-boutinkhar

Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

    1. Biological Engineering Application Proposed application: Engineered microbial biofactories for small-molecule drug production. Having a background in Biomedical Science and current training in translational physiology and pharmacology, I am particularly interested in small-molecule drug development. This project proposes engineering commercially viable cells capable of producing small-molecule drugs that are difficult or costly to synthesize using traditional chemical methods.
  • Week 2 HW: DNA Read, Write, & Edit

    Part 1: Benchling & In-silico Gel Art This is the Lambda Sequence This is the Lambda sequence with the cuts First, I tried to use Ronan’s website to get a template, and then I made it in Benchling Template: Benchling: 3.1. Choose your protein. I decided to use the Kappa Opioid GPCR, as this is the target for my biofactory, which is for my final project. It comes from the OPKR1 gene, and UniProt entry P41145 (OPRK_HUMAN) lists the canonical human kappa opioid receptor as 380 amino acids with this sequence:

  • Week 3 HW: Lab Automation

    Homework Submission Design Explanation For the art, I first sketched ideas on my tablet and then tried to upload them as an image on OpenTrons. The image with birds flying through the sky was pleasing to me aesthetically, but there were limited colours available to us. I chose the following palm tree and adapted some of the colours.

  • Week 4 HW: Protein Design Part I

    Protein Design Part I (Thras Karydis, Jon Kaufman)
    Lab: Protein Design I

  • Week 5 HW: Protein Design Part II

    Protein Design Part II (Pranam Chatterjee, Gabriele Corso)
    Lab: Protein Design II

  • Week 6 HW: Genetic Circuits Part I

    Genetic Circuits Part I: Assembly Technologies (Doug Densmore, Traci Haddock)
    Lab: Gibson Assembly

  • Week 7 HW: Genetic Circuits Part II

    Genetic Circuits Part II: Neuromorphic Circuits (Ron Weiss, Evan Holbrook)
    Lab: Neuromorphic Circuits

  • Week 10 HW: Advanced Imaging & Measurement Technology

    Advanced Imaging & Measurement Tech (Evan Daugharthy, Waters Corp.)
    Mass Spectrometry

Subsections of Homework

Week 1 HW: Principles and Practices

cover image cover image

1. Biological Engineering Application

Proposed application: Engineered microbial biofactories for small-molecule drug production.

Having a background in Biomedical Science and current training in translational physiology and pharmacology, I am particularly interested in small-molecule drug development. This project proposes engineering commercially viable cells capable of producing small-molecule drugs that are difficult or costly to synthesize using traditional chemical methods.

Inspired by microbial production of compounds such as penicillin, this approach would use bacterial or alternative host cells to generate either full drug molecules or high-value intermediates, depending on chemical feasibility. Using CRISPR or prime editing, metabolic pathways would be modified to enhance yield and specificity, similar in concept to the work by Paddon et al. (Nature, 2013).

As a proof of concept, the kappa opioid receptor (KOR) is selected as the biological target, with Salvinorin A as the compound of interest. Chemical synthesis of Salvinorin A suffers from extremely low yields (~0.15–5%), making it expensive and impractical for large-scale research. Improving yield through microbial biosynthesis would reduce costs, accelerate KOR research, and support the development of novel analgesics.


2. Governance & Policy Goals

Goal 1: Safety

  • 1a. Prevent misuse of engineered microbes to produce psychoactive or harmful substances
  • 1b. Prevent harmful exposure to laboratory personnel

Goal 2: Equal Opportunity

  • 2a. Maintain low production costs to ensure global accessibility
  • 2b. Avoid monopolization of the technology and promote open access

Goal 3: Ethical Innovation

  • 3a. Encourage transparent reporting of methods, yields, and failures
  • 3b. Align research incentives with public health goals, particularly analgesic development

3. Governance Actions

Action 1: Biosafety Review (DURC)

  • Purpose: Identify misuse and safety risks in engineered microbes producing bioactive compounds
  • Design: Mandatory dual-use and toxicity assessments by IBCs; compliance tied to funding and regulatory approval
  • Assumptions: Honest reporting; predictable risks
  • Risks & Success:
    • Failure: Bureaucratic burden slows research
    • Success: Improved biosecurity and public trust

Action 2: Genetic Kill Switches

  • Purpose: Prevent environmental escape or uncontrolled proliferation
  • Design: Engineered auxotrophy and kill-switch mechanisms; incentives via funding and approvals
  • Assumptions: Stability and affordability of safeguards
  • Risks & Success:
    • Failure: Mutation or safeguard failure
    • Success: Reduced environmental risk

Action 3: Pharmacovigilance

  • Purpose: Monitor production and use of KOR-targeted molecules
  • Design: Controlled distribution; adverse-event reporting by clinicians
  • Assumptions: Reliable detection and reporting
  • Risks & Success:
    • Failure: Under-reporting or diversion
    • Success: Safe translation without blocking research

4. Governance Scoring Matrix

Policy GoalOption 1Option 2Option 3
Prevent biosecurity incidents112
Respond to incidents221
Prevent lab safety incidents11n/a
Environmental protection21n/a
Minimize burden223
Feasibility122
Avoid impeding research223
Promote constructive use112

5. Prioritization & Ethical Reflection

Based on the scoring, Options 1 (DURC biosafety review) and 2 (genetic kill switches) are the highest priorities. These address the most immediate ethical risks associated with misuse, environmental contamination, and accidental exposure.

While pharmacovigilance is important, it becomes more relevant at later translational stages. Trade-offs include increased upfront costs and longer development timelines; however, these are justified by improved safety, transparency, and public trust.

The primary ethical concern identified during this week’s coursework is dual-use misuse of engineered microbes, including unauthorized production or environmental release. Strong oversight, transparency, and adherence to biosafety protocols should sufficiently mitigate these risks.


Homework – Lecture 2 Questions

George Church Question

Question: What are the 10 essential amino acids in all animals, and how does this affect the “Lysine Contingency”?

Answer:
The 10 essential amino acids in animals are:

  • Histidine
  • Isoleucine
  • Leucine
  • Lysine
  • Methionine
  • Phenylalanine
  • Threonine
  • Tryptophan
  • Valine
  • Arginine

Animals cannot synthesize lysine endogenously and must obtain it from their environment. This weakens the concept of the lysine contingency as a biosafety mechanism for engineered organisms. Many natural environments already contain lysine, meaning deprivation is unreliable. Additionally, organisms can evolve around this dependency, making lysine-based containment a fragile and insufficient safety strategy on its own.


Week 2 HW: DNA Read, Write, & Edit

Part 1: Benchling & In-silico Gel Art

This is the Lambda Sequence

sequence sequence

This is the Lambda sequence with the cuts

Lambda seq Lambda seq

First, I tried to use Ronan’s website to get a template, and then I made it in Benchling

Template:

template template

Benchling:

Benchling Benchling

3.1. Choose your protein.

I decided to use the Kappa Opioid GPCR, as this is the target for my biofactory, which is for my final project. It comes from the OPKR1 gene, and UniProt entry P41145 (OPRK_HUMAN) lists the canonical human kappa opioid receptor as 380 amino acids with this sequence:

MDSPIQIFRGEPGPTCAPSACLPPNSSAWFPGWAEPDSNGSAGSEDAQLEPAHISPAIPVIITAVYSVVFVVGLVGNSLVMFVIIRYTKMKTATNIYIFNLALADALVTTTMPFQSTVYLMNSWPFGDVLCKIVISIDYYNMFTSIFTLTMMSVDRYIAVCHPVKALDFRTPLKAKIINICIWLLSSSVGISAIVLGGTKVREDVDVIECSLQFPDDDYSWWDLFMKICVFIFAFVIPVLIIIVCYTLMILRLKSVRLLSGSREKDRNLRRITRLVLVVVAVFVVCWTPIHIFILVEALGSTSHSTAALSSYYFCIALGYTNSSLNPILYAFLDENFKRCFRDFCFPLKMRMERQSTSRVRNTVQDPAYLRDIDGMNKPV

3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.

A convenient “answer key” for the corresponding human CDS is provided in KEGG for hsa:4986 (OPRK1), showing a 1143 nt coding region (380 aa + stop) https://www.genome.jp/dbget-bin/www_bget?hsa:4986

3.3. Codon optimisation.

Codon optimisation is done to match the host codon usage to improve the translation efficiency and protein yield. For the sake of this protein, I chose E. coli for simplicity so I can practice this according to the homework

https://en.vectorbuilder.com/tool/codon-optimization/c1dd076b-99f7-469f-84a3-c205c444346c.html

ATGGATAGCCCGATTCAGATTTTTCGCGGCGAACCGGGCCCGACGTGCGCCCCGAGCGCGTGTCTGCCGCCGAACAGCAGCGCGTGGTTTCCGGGCTGGGCGGAACCGGATAGCAATGGCAGCGCGGGTAGCGAAGATGCGCAGCTGGAACCGGCGCATATTAGTCCGGCGATCCCGGTGATTATTACCGCCGTGTATAGCGTGGTCTTTGTGGTGGGTCTGGTGGGCAACAGCCTGGTGATGTTTGTTATTATTCGCTATACCAAAATGAAAACCGCAACCAACATCTACATCTTCAACCTGGCACTGGCGGATGCGCTGGTGACCACCACCATGCCGTTTCAGAGCACCGTGTATCTGATGAATAGCTGGCCGTTCGGCGACGTGCTGTGTAAAATTGTGATTAGCATCGATTACTATAATATGTTTACCAGCATTTTTACCCTCACCATGATGAGCGTGGATCGTTACATTGCCGTGTGCCATCCGGTGAAAGCGCTGGATTTTCGTACGCCGCTGAAAGCGAAAATTATTAATATTTGCATTTGGCTGCTGAGCAGCAGCGTGGGCATTAGCGCGATTGTGCTGGGCGGCACCAAAGTGCGTGAAGATGTGGATGTGATCGAATGCAGCCTGCAGTTTCCGGATGACGATTATTCATGGTGGGATCTGTTTATGAAAATCTGCGTATTTATTTTTGCCTTTGTGATCCCTGTGCTGATTATTATTGTGTGCTACACCCTGATGATTCTGCGTCTGAAATCTGTGCGCCTGCTGAGCGGCAGCCGCGAAAAAGATCGTAATCTGCGCCGCATTACCCGCCTGGTGCTGGTGGTGGTGGCCGTGTTTGTGGTGTGCTGGACCCCGATCCACATTTTTATCCTGGTGGAAGCGCTGGGCTCGACGTCACATAGCACCGCGGCGCTGAGCAGCTATTACTTTTGCATTGCCCTGGGCTATACCAACAGCAGCCTGAATCCGATTCTGTATGCCTTTCTGGACGAAAATTTTAAACGCTGCTTTCGCGATTTTTGTTTTCCGCTGAAAATGCGCATGGAACGCCAGAGTACCAGCCGCGTGCGCAACACCGTGCAGGATCCGGCGTACCTGCGCGACATTGATGGTATGAACAAACCGGTGTAA

3.4. You have a sequence! Now what?

Technology 1: Clone codon-optimized CDS into a bacterial plasmid (T7/lac promoter), transform into E. coli. Codon optimization is used to match host codon bias to improve expression.

Technology 2: Induce expression; proteinproduction follows the same central dogma (DNA to RNA to protein), but membrane insertion/folding for 7TM proteins is a key challenge in bacteria.

Part 4: Prepare a Twist DNA Synthesis Order

This is my Benchling sequence link https://benchling.com/s/seq-pBSv19pNJ2bIdI5DcYzM?m=slm-17Elwj2xG3ryxk0DBJrw

This is the full OPRK1 Sequence:

OPRK1 OPRK1

Finally, this is my proposed vector:

Vector Vector

5.1 DNA Read (i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank). I think it could be interesting to sequence small genetic regions related to caffeine metabolism and sensitivity; CYP1A2, AHR, ADORA2A. This is relevant as many individuals experience coffee and caffene differently and do not enjoy it as much as others, or experience more abhorrent side effects than others. By analysing the metabolism of caffeine from CYP1A2 and investigating the adenosine receptors, a specialised suggestion of caffeine intake, bean type, and coffee type can be permutated to give people the best experience with minimal side effects.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions:

  1. Is your method first-, second- or third-generation or other? How so? Sanger sequencing is 1st-generation sequencing. It reads DNA by creating DNA fragments terminated by special nucleotides (ddNTPs) and separating them by capillary electrophoresis to infer the base order
  2. What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps. The input would be genomic DNA from a cheek swab or saliva sample. The steps to prepare them would be to Extract DNA from the sample, PCR amplify the short region(s) containing the SNP(s), to purify the PCR product and then set up Sanger sequencing reaction
  3. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)? In Sanger sequencing, you first make many DNA copies, but the copying sometimes stops when a special “terminator” base (a fluorescent ddNTP) is added. This creates lots of DNA fragments of different lengths, each ending in a colored base. The fragments are then separated by capillary electrophoresis, and a detector reads the color signal as fragments pass by. The sequencing software converts the color peaks into A, C, G, T letters
  4. What is the output of your chosen sequencing technology? The output of Sanger sequencing is usually a chromatogram/trace file (often .ab1) showing colored peaks, plus a text DNA sequence that the software called from those peaks

5.2 DNA Write (i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :) I want to synthesize plasmid DNA vectors that turn bacteria into biofactories for making human-useful therapeutics, specifically: a therapeutic hormone for metabolic disease (example: human insulin) a small-molecule product relevant to cardiovascular/metabolic health (example: Coenzyme Q10 (CoQ10) as a medically used antioxidant supplement with cardiovascular relevance)

(ii) What technology or technologies would you use to perform this DNA synthesis and why? Also answer the following questions:

  1. What are the essential steps of your chosen sequencing methods? The essential steps are: (1) computational design of the DNA construct (promoters/RBS/genes/terminators plus plasmid features), (2) chemical DNA writing using cyclic solid-phase phosphoramidite synthesis to generate oligos, (3) assembly of oligos into longer gene-length fragments when needed, (4) cloning/packaging into a plasmid backbone for bacterial expression, and (5) sequence verification/quality control so the final plasmid matches the intended design. These steps reflect the standard phosphoramidite cycle used for oligo construction and the gene/plasmid workflow offered by commercial synthesis providers
  2. What are the limitations of your writing method (if any) in terms of speed, accuracy, scalability? A key limitation is that errors accumulate as DNA length increases, because each chemical base-addition step is not perfectly efficient; this means long constructs are more likely to contain substitutions or deletions and often require assembly from shorter pieces plus verification. In addition, although high-throughput synthesis platforms scale very well for many sequences in parallel, overall cost and turnaround time can still be bottlenecks for very large libraries or long, complex constructs, and certain sequence patterns (like repeats or extreme GC content) can reduce synthesis success and increase the need for troubleshooting or redesign

5.3 DNA Edit (i) What DNA would you want to edit and why? In class, George shared a variety of ways to edit the genes and genomes of humans and other organisms. Such DNA editing technologies have profound implications for human health, development, and even human longevity and human augmentation. DNA editing is also already commonly leveraged for flora and fauna, for example in nature conservation efforts, (animal/plant restoration, de-extinction), or in agriculture (e.g. plant breeding, nitrogen fixation). What kinds of edits might you want to make to DNA (e.g., human genomes and beyond) and why? I would want to edit human DNA SNPs that cause metabolic/cardiovascular disease, especially those that lead to very high LDL cholesterol and early heart disease risk, such as variants involved in familial hypercholesterolemia (FH). FH is commonly linked to harmful variants in LDLR (and sometimes related genes like APOB or PCSK9), and editing these could lower lifelong LDL exposure and reduce cardiovascular

(ii) What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:

  1. How does your technology of choice edit DNA? What are the essential steps? In prime editing, a Cas9 nickase fused to a reverse transcriptase is guided to a specific DNA site by a pegRNA that also contains the template for the desired change; the system nicks DNA and then “writes” the corrected sequence, which cellular repair processes finalize into a stable edit. This avoids relying on the same double-strand-break repair competition that often makes precise HDR edits difficult
  2. What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing? The main preparation is selecting the exact SNP to fix (e.g., an LDLR pathogenic variant) and designing the appropriate guide/pegRNA to target it. The key inputs are the editor components (prime editor or base editor), the guide RNA(s), and the target human cells/tissue context (for cholesterol disorders this is often discussed in relation to the liver because it controls LDL metabolism).
  3. What are the limitations of your editing methods (if any) in terms of efficiency or precision? Major limitations include variable editing efficiency, possible off-target changes or unintended byproducts, and the practical challenge of safe, effective delivery of the editing system to the correct tissue in humans.

Week 3 HW: Lab Automation

cover image cover image

Homework Submission

Design Explanation

For the art, I first sketched ideas on my tablet and then tried to upload them as an image on OpenTrons. The image with birds flying through the sky was pleasing to me aesthetically, but there were limited colours available to us. I chose the following palm tree and adapted some of the colours.

Thus, I settled on this design: https://opentrons-art.rcdonovan.com/?id=4p046971r45mo4c

After using Microsoft Copilot to help me code in OpenColab, I settled on the following code and image: https://colab.research.google.com/drive/1K-qcZjtwZKHQ8mIlfLnY42qBpRsIZFts#scrollTo=pczDLwsq64mk&uniqifier=1

Furthermore, after receiving feedback on my design, I minimised colours and removed an outer edge to ensure the image is clearer and to prevent spillage

Homework Questions

1. Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.

Norton-Baker B, Denton MCR, Murphy NP, Fram B, Lim S, Erickson E, et al. Enabling high-throughput enzyme discovery and engineering with a low-cost, robot-assisted pipeline. Sci Rep. 2024;14(1):14449. http://dx.doi.org/10.1038/s41598-024-64938-0

This article by Baker et al. describes a robot-assisted pipeline for enzyme engineering. The article explains how, with the advances of AI and genome data, it is extremely taxing to manually test every permutation. There is simply too much to test using the standard laboratory methods. The main bottleneck they have found is the production, purification and characterisation of the proteins. The author’s attempt to make a generalisable protocol that is cost-effective and high-throughput to streamline enzyme discovery. Their idea is to use robot-assisted pipelines using the opentrons OT-2 liquid handling robot on a 96-well plate to scale to hundreds of proteins per week.

The workflow specifically uses the OT-2 to automate time-consuming steps like transforming E. coli with plasmids, inoculation, lysing, and protein purification using Ni-charged magnetic beads. The workflow also cleaves proteins with a protease instead of eluting to make it easier to use in downstream assays. The authors did this for PET-degrading proteins to screen for enzymes to degrade plastics. They did this using 23 published hydrolases, which were expressed and purified and then measured their thermostability and activity.

Using this pipeline, they were able to have repeatable results after doing 3 separate trials. Across replicate wells and runs, they obtained reproducible enzyme yields reaching up to 400 μg for some proteins, and verified that the samples were sufficiently pure and correctly sized using SDS-PAGE. Finally, the authors use the purified enzymes to generate a benchmark dataset by testing stability (via DSF melting temperatures) and activity across a large matrix of conditions (including different pH values, temperatures, substrates, and timepoints), which lets them rank enzyme performance in a standardised way. In their side-by-side benchmark, LCC-A2 consistently generated the largest amounts of PET breakdown products confirmed by UV-Vis and HPLC ratios, making it the strongest overall performer under their assay setup.

2. Write a description about what you intend to do with automation tools for your final project.

For my final project involving decaffeination using synthetic bacteria, automation could help in testing many enzymes and/or bacterial strains simultaneously. Similar to the article I described in the previous question, it can be helpful to use automation, such as OpenTrons, to identify the best enzyme rather than manually testing each prospective enzyme. Similar to the article by Baker et al., a 96-well screen could be done to test a different enzyme/strain condition (dose, pH, temperature, time). The OT-2 would automate all pipetting by adding tea/coffee, buffers/cofactors, enzyme/strain inputs, and pulling timed samples into a quench plate making screening faster.To ensure this doesn’t create unwanted flavour-related byproducts, I’d measure not only caffeine reduction but also methylxanthine byproducts, which can vary depending on the enzyme/strain used.

Furthermore, as for my idea regarding the biological synthesis of Salvinorin A or Paclitaxel, a common class of molecules among them is terpenes. This is a type of molecule usually derived via plant extraction. Thus, this could be a good target to screen using automation. The principle is similar to decaffeination, where an array of enzymes and pathways could be tested at the same time with an OT-2 workflow with P450 enzymes, followed by analytical tools like HPLC readouts to examine successful synthesis.

Week 04: HW protein design part I

cover image cover image

This week will focus on how sequence, structure, and energetics can be modeled and manipulated to create or optimize proteins with specified functions…

Objective:

  1. Learn basic concepts:
    • amino acid structure
    • 3D protein visualization
    • the variety of ML-based design tools
  2. Brainstorm as a group how to apply these tools to engineer a better bacteriophage (setting the stage for the final project).

Part A. Conceptual Questions

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

I answered nine of the conceptual questions below.

1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

It is about 3 *10^{24} amino acid molecules.

2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

We digest proteins into amino acids and then rebuild them into human proteins.

3. Why are there only 20 natural amino acids?

Life evolved to use a small set that is enough for many protein functions.

5. Where did amino acids come from before enzymes that make them, and before life started?

They may have formed through prebiotic chemistry on early Earth or arrived from space.

6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

I would expect a left-handed helix.

8. Why are most molecular helices right-handed?

Most natural proteins use L-amino acids, which usually favor right-handed helices.

9. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

β-sheets aggregate because their backbones can make many hydrogen bonds. Hydrophobic interactions also help drive aggregation.

10. Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

Amyloid proteins often stack into stable β-sheet-rich fibers. Yes, amyloid β-sheets can also be used as strong biomaterials.

11. Design a β-sheet motif that forms a well-ordered structure.

A simple design is an alternating pattern of hydrophobic and polar residues, such as Val-Lys-Val-Glu-Val-Lys-Val-Glu. This can help form a stable β-sheet.


Part B: Protein Analysis and Visualization

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

In this part of the homework, I used online resources and 3D visualization software to analyze a protein of interest that has a known 3D structure.

1. Briefly describe the protein you selected and why you selected it.

From now on I have decided to focus on paclitaxel for my final project and to optimize a crucial bottleneck in its current biosynthesis. Thus, the protein I wanted to visualize is beta-tubulin, which is the target of paclitaxel in cancer treatment.

2. Identify the amino acid sequence of your protein.

I used the UniProt entry for Beta-tubulin 1 subunit:
https://www.uniprot.org/uniprotkb/Q9H4B7/entry

The Beta-tubulin 1 subunit is 451 amino acids long.

Sequence:

MREIVHIQIGQCGNQIGAKFWEMIGEEHGIDLAGSDRGASALQLERISVYYNEAYGRKYVPRAVLVDLEPGTMDSIRSSKLGALFQPDSFVHGNSGAGNNWAKGHYTEGAELIENVLEVVRHESESCDCLQGFQIVHSLGGGTGSGMGTLLMNKIREEYPDRIMNSFSVMPSPKVSDTVVEPYNAVLSIHQLIENADACFCIDNEALYDICFRTLKLTTPTYGDLNHLVSLTMSGITTSLRFPGQLNADLRKLAVNMVPFPRLHFFMPGFAPLTAQGSQQYRALSVAELTQQMFDARNTMAACDLRRGRYLTVACIFRGKMSTKEVDQQLLSVQTRNSSCFVEWIPNNVKVAVCDIPPRGLSMAATFIGNNTAIQEIFNRVSEHFSAMFKRKAFVHWYTSEGMDINEFGEAENNIHDLVSEYQQFQDAKAVLEEDEEVTEEAEMEPEDKGH

The most frequently occurring amino acid is alanine (A), with a total of 34 occurrences.

Based on the UniProt BLAST results, I found approximately 250 homologs for this protein:
https://www.uniprot.org/blast/uniprotkb/ncbiblast-R20260305-180445-0890-48978074-p2m/overview

This protein belongs to the tubulin family.

3. Identify the structure page of your protein in RCSB.

I used the following RCSB structure page:
https://www.rcsb.org/structure/7QUC

The structure was solved in 2022 using electron microscopy, with a resolution of 3.20 Å. This is a reasonable quality structure, although it is not as high resolution as some crystal structures.

There are other molecules in the solved structure apart from the protein. Specifically, the structure is a dimer of the beta and alpha tubulin subunits, so this structure contains one beta subunit and one alpha subunit.

Based on what I found, it does not appear to belong to a structure classification family in the way asked by the prompt.

4. Open the structure of your protein in any 3D molecule visualization software.

Because the resolved structure contains two proteins from the microtubule, I colored the alpha-tubulin yellow and the beta-tubulin green, since paclitaxel binds to beta-tubulin in cancer treatment.

Cartoon representation of the alpha- and beta-tubulin structure.

Cartoon representation of the alpha- and beta-tubulin structure.

When coloring the protein by secondary structure, the structure seemed to have more helices than sheets, with a ratio of approximately 3:1 (3501:1074).

When coloring the protein by residue type, the visualization showed hydrophobic residues with 2704 atoms in orange, and hydrophilic residues with 3732 atoms in cyan/magenta. The hydrophilic residues seemed to be distributed more on the outer parts of the structure, while the hydrophobic parts were more concentrated on the inner regions.

When visualizing the surface of the protein, it looked like there is a binding pocket between the two proteins. I compared my image to the known paclitaxel binding pocket and indicated it with an arrow. However, this visualization alone is not completely clear, and other methods would analyze this more reliably.

Part C. Using ML-Based Protein Design Tools

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

In this section, I tested several modern protein AI models on my chosen protein.

I selected the following PDB sequence, which was slightly different from the UniProt sequence:

MREIVHIQAGQCGNQIGAKFWEIISDEHGIDATGAYHGDSDLQLERINVYYNEASGGKYVPRAVLVDLEPGTMDSVRSGPFGQIFRPDNFVFGQSGAGNNWAKGHYTEGAELVDSVLDVVRKEAESCDCLQGFQLTHSLGGGTGSGMGTLLISKIREEYPDRIMNTYSVVPSPKVSDTVVEPYNATLSVHQLVENTDETYCIDNEALYDICFRTLKLTTPTYGDLNHLVSLTMSGVTTCLRFPGQLNADLRKLAVNMVPFPRLHFFMPGFAPLTSRGSQQYRALTVPELTQQMFDAKNMMAACDPRHGRYLTVAAIFRGRMSMKEVDEQMLNIQNKNSSYFVEWIPNNVKTAVCDIPPRGLKMSATFIGNSTAIQELFKRISEQFTAMFRRKAFLHWYTGEGMDEMEFTEAESNMNDLVSEYQQYQEATADEDAEFEEEQEAEVDEN

C1. Protein Language Modeling

1. Deep Mutational Scans

b. Can you explain any particular pattern? (choose a residue and a mutation that stands out)

Positions 140–160 form a structurally constrained region where most substitutions are harmful. In particular, position 158 is intolerant to bulky hydrophobic residues such as tryptophan, tyrosine, and valine, while position 153 tolerates only flexible or polar residues such as serine, arginine, glycine, proline, and asparagine. This suggests loop-specific flexibility requirements.

c. Place your protein in the resulting map and explain its position and similarity to its neighbors.

2. Latent Space Analysis

b. Analyze the different formed neighborhoods: do they approximate similar proteins?

Yes. In the 3D latent space, nearby points form local neighborhoods of proteins with similar sequence features, indicating that the embedding groups related proteins together.

C2. Protein Folding

Folding a protein

1. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

The structure predicted by ESMFold aligns very well with the experimentally determined β-tubulin structure. The low RMSD of 0.782 Å after alignment indicates that the predicted atomic coordinates closely match the original structure.

Result: RMSD = 0.782

2. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

For small mutations, I tested the following sequence:

MREIVHIQAGQCGNQIGAKFWEIISDEHGIDATGAYHGDSDLQLERINVYYNEASGGKYVPRAVLVDLEPGTMDSVRSGPFGQIFRPDNFVFGQSGAGNNWAKGHYTEGAELVDSVLDVVRKEAESCDCLQGFQLTHSLMMMTGSGMGTLLISKIREEYPDRIMNTYSVVPSPKVSDTVVEPYNATLSVHQLVENTDETYCIDNEALYDICFRTLKLTTPTYGDLNHLVSLTMSGVTTCLRFPGQLNADLRKLAVNMVPFPRLHFFMPGFAPLTSRGSQQYRALTVPELTQQMFDAKNMMAACDPRHGRYLTVAAIFRGRMSMKEVDEQMLNIQNKNSSYFVEWIPNNVKTAVCDIPPRGLKMSATFIGNSTAIQELFKRISEQFTAMFRRKAFLHWYTGEGMDEMEFTEAESNMNDLVSEYQQYQEATADEDAEFEEEQEAEVDEN

RMSD: 0.750

For larger mutations, I tested the following sequence:

MREIVHIQAEEQGTLIGAKFWEIISDEHGIDATGAYHGDSDLQLINVYYNEASGGKYVPRAVLVDLEPGTMDSVRSGPFGQIFRPDNFVFGQSGAGNNWAKGHYTEGAELVDSVLDVVRKEAESCDCLQGFQLTHSLMMMTGSGMGTLLISKIREEYPDRIMNTYSVVPSPKVSDTVVEPYNATLSVHQLVENTDETYCIDNEALYDICFRTLKLTTPTYGDLNHLVSLTMSGVTTCLRFPGQLNADLRKLAYLTVACIFRGKMSTKGFAPLTSRGSQQYRALTVPELTQQMFDAKNMMAACDPRHGRVNMVPFPRLHFFMPEVDEQMLNIQNKNSSYFVEWIPNNVKTAVCDIPPRGLKMSATFIGNSTAIQELFKRISEQFTAMFRRKAFLHWYTGEGMDEMEFTEAESNMNDLVSEYQQYQEATADEDAEFEEEQEAEVDEN

RMSD: 0.873

This was quite surprising, because I expected a much larger variance to occur. Based on these results, the overall fold appears to be fairly resilient to both small and somewhat larger sequence changes.

C3. Protein Generation

Inverse-Folding a protein

1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

The predicted sequence has a better ProteinMPNN score (0.8587) than the original sequence (1.8495), with a sequence recovery of 0.3783. This means ProteinMPNN proposed a substantially different sequence that it still predicts will fit the same backbone.

2. Input this sequence into ESMFold and compare the predicted structure to your original.

The predicted structure had an RMSD of 0.756 Å relative to the original. The main structure seems to be well preserved, but there appears to be one extra part that is not present in the original structure, which is quite strange.

Part D. Group Brainstorm on Bacteriophage Engineering

Assignees for the following sections
MIT/Harvard studentsOptional
Committed ListenersRequired

Proposal: Computational engineering of the MS2 L protein for increased stability and higher titers

For my project, I will focus on two goals for the MS2 L protein: increased stability and higher phage titers. I chose these because the literature suggests that L contains a small but sensitive functional core, and that lysis timing likely influences how many phage particles are produced before the host cell breaks open. In simple terms, I want a version of L that is more reliable as a protein, but not so aggressive that it causes lysis too early.

My main approach would be to combine sequence conservation analysis, in silico mutagenesis, protein language model scoring, and structure/topology prediction. I would first identify residues that are likely too important to mutate, especially around the conserved LS motif, since mutational studies show that this region is highly sensitive and likely involved in an essential interaction. I would also treat the basic N-terminal region carefully, because it regulates activity through interaction with DnaJ rather than acting as the main lytic domain itself.

Next, I would computationally test mutations that might improve folding robustness or membrane association while preserving the protein’s core lytic features. This seems reasonable because recent work suggests that MS2-L forms oligomeric assemblies in membrane-like environments, and that the transmembrane/C-terminal region is central to this behavior. In plain language, I am not only asking whether the protein folds, but whether it can still adopt the right shape and assembly state to work properly.

For the higher titer goal, I would not try to predict titers directly. Instead, I would use a proxy strategy and prioritize variants that are predicted to be more stable while still preserving the regulatory features that may prevent premature lysis. This is important because N-terminal truncations can bypass DnaJ and trigger earlier lysis, which may actually reduce phage output if the host is killed before assembly is complete.

Planned pipeline

  • Collect sequence, mutational, and structural/topology information for L.
  • Mark function-sensitive residues and conserved regions.
  • Run in silico mutagenesis and rank variants with language-model or sequence-based scores.
  • Filter variants using membrane topology and structural plausibility.
  • Prioritize candidates that may improve stability while preserving productive lysis timing.

Potential pitfalls

One pitfall is that higher titers are not controlled by L alone, so even a better L variant may not improve total phage yield. Another is that MS2 has overlapping genes and RNA-level regulation, so a mutation that looks good for the protein might still be harmful in the native phage genome.


Week 5: HW protein design part II

This week we learn how cutting-edge AI and protein language models are used to design functional proteins and peptides “in silico”.

Objective:

  1. Design short peptides that bind mutant SOD1.
  2. Decide which peptides are worth advancing toward therapy.
  3. Evaluate generated peptides using structural and therapeutic-property prediction tools.
  4. Think about how computational protein design can be applied to the final project on L-protein mutants.

Part A: SOD1 Binder Peptide Design

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.

Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.

Your challenge:

  1. Design short peptides that bind mutant SOD1.
  2. Then decide which ones are worth advancing toward therapy.

You will use three models developed in our lab:

  • PepMLM: target sequence-conditioned peptide generation via masked language modeling
  • PeptiVerse: therapeutic property prediction
  • moPPIt: motif-specific multi-objective peptide design using Multi-Objective Guided Discrete Flow Matching (MOG-DFM)

Part 1: Generate Binders with PepMLM

1. Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.

I used the following A4V mutant SOD1 sequence:

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQFLYRWLPSRRGG

2. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:

3. Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.

4. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.

5. Record the perplexity scores that indicate PepMLM’s confidence in the binders.

BinderSequencePseudo Perplexity
Binder 1WLYGATGLRLKK17.92783425
Binder 2WRSGAVALELGX6.975748166
Binder 3WRYYAVAAEWKX11.07016835
Binder 4WRYGPAALAHKE10.98033022
Binder 5 (example)FLYRWLPSRRGG/

Part 2: Evaluate Binders with AlphaFold3

1. Navigate to the AlphaFold Server: http://alphafoldserver.com

2. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.

3. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?

4. In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

0.70–1.00 → strong, specific binding (good binder)

0.50–0.70 → moderate binding (possible binder, but uncertain)

0.30–0.50 → weak surface contact (likely not a real binder)

Below 0.30 → basically no meaningful binding

Across all five peptides, the ipTM values fell between 0.28 and 0.37, which places every binder in the “weak surface interaction” range (0.30–0.50) rather than showing strong or specific binding. Binder 4 slightly exceeded the ipTM of the known binder, with an ipTM of 0.37 compared to 0.35, but all showed similarly low-confidence, superficial binding. These results indicate that while the peptides contact SOD1’s surface, none form a stable or well defined interface, and no generated peptide displays stronger predicted binding than the known binder.

Binder 1: ipTM = 0.32, pTM = 0.79

Binder 1 interacts only weakly with SOD1, remaining positioned on the outer surface without engaging the N terminus, β-barrel, or dimer interface.

Binder 2: ipTM = 0.36, pTM = 0.86

In the AlphaFold3 model, the peptide binds loosely on the surface of SOD1 rather than near the N terminus, β-barrel, or dimer interface.

Binder 3: ipTM = 0.28, pTM = 0.72

Similarly, this one also binds to the surface and does not penetrate or bind near anything meaningful.

Binder 4: ipTM = 0.37, pTM = 0.84

Binder 4 shows another weak binding on the surface; it doesn’t engage the N-terminus, β-barrel, or dimer interface.

Binder 5 (the example): ipTM = 0.35, pTM = 0.83

The known binder (Binder 5) shows a slightly more defined and closer interaction with the SOD1 surface compared to the generated peptides, but still binds only shallowly and does not specifically target the N terminus, β-barrel core, or dimer interface.

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

Structural confidence alone is insufficient for therapeutic development. Using https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse, let’s evaluate the therapeutic properties of the peptide.

For each PepMLM-generated peptide:

  1. Paste the peptide sequence.
  2. Paste the A4V mutant SOD1 sequence in the target field.
  3. Check the boxes:
    • Predicted binding affinity
    • Solubility
    • Hemolysis probability
    • Net charge (pH 7)
    • Molecular weight

The target sequence I used was:

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Binder (Sequence)Binding Affinity (pKd/pKi)Solubility (Probability)Hemolysis Risk (Probability)Net Charge (pH 7)Molecular Weight (Da)Hydrophobicity (GRAVY)
Binder 1 (WLYGATGLRLKK)5.961 (Weak) pKd/pKi1.0000.061 (Non hemolytic)+2.761405.7 Da-0.23
Binder 2 (WRSGAVALELGX)5.993 (Weak) pKd/pKi1.0000.042 (Non hemolytic)-0.241140.5 Da+0.41
Binder 3 (WRYYAVAAEWKX)6.612 (Weak) pKd/pKi1.0000.041 (Non hemolytic)+0.761424.7 Da-0.56
Binder 4 (WRYGPAALAHKE)5.336 (Weak) pKd/pKi1.0000.014 (Non hemolytic)+0.851398.6 Da-0.84
Binder 5 (FLYRWLPSRRGG)5.968 (Weak) pKd/pKi1.0000.047 (Non hemolytic)+2.761507.7 Da-0.71

The PeptiVerse results show that all peptides had perfect predicted solubility and low hemolysis probability, meaning none are predicted to be especially toxic or poorly soluble. However, the predicted binding affinities were still weak across all peptides, which matches the AlphaFold3 observation that all binders only showed weak surface interactions. Peptides with higher ipTM did not necessarily show stronger predicted affinity. For example, Binder 4 had the highest ipTM, but one of the weakest affinity predictions, while Binder 3 had the strongest predicted affinity but the lowest ipTM. This means structural confidence and affinity prediction did not clearly align here. Overall, the peptides appear relatively safe, but none showed convincing strong binding behavior.

Choose one peptide you would advance and justify your decision briefly.

I would advance Binder 2 (WRSGAVALELGX) because it had one of the better ipTM scores among the generated peptides, while also showing perfect solubility, low hemolysis probability, and a relatively balanced net charge. Even though its affinity is still weak, it seems to offer the best overall trade-off between predicted binding and therapeutic properties.

Part 4: Generate Optimized Peptides with moPPIt

Now, move from sampling to controlled design. https://www.biorxiv.org/content/10.1101/2024.07.31.606098v2 uses Multi-Objective Guided Discrete Flow Matching (https://openreview.net/forum?id=8YIMLoHP9J) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

  1. Open the https://colab.research.google.com/drive/16n8PIwKwAiG-oDLm171BWvv-lQH0dHMg?usp=sharing linked from the https://huggingface.co/ChatterjeeLab/moPPIt.
  2. Make a copy and switch to a GPU runtime.
  3. In the notebook:
    • Paste your A4V mutant SOD1 sequence.
    • Choose specific residue indices on SOD1 that you want your peptide to bind.
    • Set peptide length to 12 amino acids.
    • Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.
  4. After generation, briefly describe how these moPPit peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

The moPPit peptides are quite different in their structure and thus, their properties. Many of the PepMLM peptides started with tryptophan and generally had very similar amino acids. Although the PepMLM peptides differed slightly in their binding, affinity, charge, and hydrophobicity, it is not particularly noteworthy. On the other hand the moPPit peptides varied in structure a lot relative to each other and the PepMLM, generally their affinity was a bit higher but still awfully low. What is interesting is that the hemolysis probability is roughly 20x higher for nearly all moPPit peptides compared to PepMLM. This would be very toxic as it likely will rupture blood cells, making the peptides very dangerous. Furthermore, before advancing moPPit generated peptides to clinical studies, further evaluation through structural docking validation to confirm stable protein–peptide interactions, toxicity and hemolysis screening to assess potential cellular damage, and in vitro stability assays to ensure peptide integrity under physiological conditions would be required to support therapeutic safety and effective binding to mutant SOD1.

Binder (Sequence)Binding Affinity (pKd/pKi)Solubility (Probability)Hemolysis Risk (Probability)Net Charge (pH 7)Molecular Weight (Da)Hydrophobicity (GRAVY)
Binder 1 (WLYGATGLRLKK)5.961 (Weak) pKd/pKi10.061 (Non hemolytic)2.761405.7 Da-0.23
Binder 2 (WRSGAVALELGX)5.993 (Weak) pKd/pKi10.042 (Non hemolytic)-0.241140.5 Da0.41
Binder 3 (WRYYAVAAEWKX)6.612 (Weak) pKd/pKi10.041 (Non hemolytic)0.761424.7 Da-0.56
Binder 4 (WRYGPAALAHKE)5.336 (Weak) pKd/pKi10.014 (Non hemolytic)0.851398.6 Da-0.84
Binder 5 (FLYRWLPSRRGG)5.968 (Weak) pKd/pKi10.047 (Non hemolytic)2.761507.7 Da-0.71
moPPit Binder 1 (YVCYSYNYCVCH)7.8780.8330.911
moPPit Binder 2 (TEKTTQAKKYCV)6.2740.8330.978
moPPit Binder 3 (GDMTRYSYYKKC)6.7900.9160.964

Part B: BRD4 Drug Discovery Platform Tutorial

Assignees for the following sections
MIT/Harvard studentsOptional
Committed ListenersOptional

I did not include a written response for Part B.

Part C: Final Project: L-Protein Mutants

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

High level summary: The objective is to improve the stability and autofolding of the lysis protein.

More specifically, we want to engineer the lysis protein to increase the ability of MS2 to overcome a common E. coli resistance mechanism: a single point mutation in DnaJ prevents the binding of the lysis protein. We can attempt this by mutating the lysis protein to change its properties. Together, we aim for finding mutations that change the lysis protein in one of the following ways:

  1. an independence of lysis protein processing from DnaJ or other bacterial chaperones
  2. a faster or more efficient killing of E. coli to reduce the window in which the host can acquire resistance
  3. higher lysis protein expression

L-Protein Engineering | Option 1: Mutagenesis

  1. Designing these mutants with good computational confidence is hard. It will show you limitations of some of the structure based models. Ultimately, you can pick various combinations of mutations and get lab results and then decide to pick the next round of mutations, but this assay will not be easy to run at scale in this class.

  2. Run this notebook to generate for each position in the amino acid sequence, a “score” for what would happen to the protein if you mutated into another amino acid. It can be positive or negative for the protein. We want to identify possible mutations that are “positive”. If you run this notebook, you will see a .csv file in the sidebar. You can download it and look at it in google sheets if that’s easier.

  1. Use the experimental data here. This dataset contains information about mutants of the L-Protein and their effect on lysis in the lab.

4. First check, does the experimental data correlate with the scores from the notebook in (b)? This should give you a clue on how well these language embeddings capture information about this protein sequence.

The experimental data seems to correlate exactly with the scores from the notebook.

5. Using information about the effect of protein mutations at these sites - both the scores and the experimental data in the drive, come up with 5 mutations for each student along with how you came up with them and why you believe they would work. 2 of the variants you submit must have mutations in the transmembrane region (refer to notes above on what amino acid positions these are) and 2 of them must be in the soluble region. Remember that you can also use the pBLAST to see which residues are conserved and not mutate them if you want to.

One easy way to generate sequence mutations could be to look for residue positions and mutations that have a positive mutational effect either in the experimental or have a positive score from step 1. And pick a combination of those mutations.

The MS2 L-protein consists of a short N-terminal soluble domain (residues 1–37) and a single transmembrane α-helix located at residues 38–60.
https://www.uniprot.org/uniprotkb/P03609/entry

To select the final mutation set, I first focused on mutations that were positive in both the experimental dataset and the computational scoring results, thereby prioritising candidates supported by both experimental evidence and predicted mutational benefit. From this overlap, I chose the two highest-scoring mutations in the soluble region (E25G and E25V) and the only positive candidate from the transmembrane region (A45P). For the final two mutations, I selected the highest-scoring computational candidates that were not experimentally tested: K50L and Y39L; both are located in the transmembrane region.

Final 5 mutations:

  • E25G
  • E25V
  • A45P
  • K50L
  • Y39L

6. You can utilize Af2_Multimer to generate a Multimeric Assembly; you can do this by making your query sequence as. We want to do this because - A running hypothesis for how this protein functions is that it assembles to make a perforation in the bacterial membrane.

Figure generated using the following multimeric assembly (where each chain is separated from the other with a :):

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT


Week 6: HW genetic circuits part I

This week we learn core molecular biology tools and techniques for processing and assembling DNA, including PCR and Gibson Assembly.

Objective:

  1. Learn core molecular biology tools and techniques for processing and assembling DNA.
  2. Understand PCR and Gibson Assembly.
  3. Compare different ways of creating DNA fragments for cloning.
  4. Explore genetic circuit design and simulation using Asimov Kernel.

Assignment: DNA Assembly

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

Answer these questions about the protocol in this week’s lab.

1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?

The Phusion High-Fidelity PCR Master Mix contains a high-fidelity DNA polymerase, which copies DNA with very few mistakes. It also contains dNTPs, which are the small molecules used to build new DNA strands. The mix includes a buffer to keep the chemical conditions stable and suitable for the reaction. It also has Mg2+ ions, which help the polymerase function properly.

2. What are some factors that determine primer annealing temperature during PCR?

Primer annealing temperature depends on the primer sequence, length, and GC content. The higher the GC content, the higher the temperature is for annealing. The temperature is also affected by how well the primer matches the template. Salt and buffer conditions in the reaction can change how strongly the primer binds.

3. There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.

PCR uses primers and DNA polymerase to copy a specific DNA region, producing many copies. It is ideal when you need to amplify a fragment or introduce small sequence changes or overlaps. In contrast, restriction digestion uses enzymes to cut DNA at specific existing recognition sites. It is best when the DNA already contains the right cut sites and you want a simple, precise cut into fragments.

4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?

To make DNA suitable for Gibson cloning, the fragments need to have matching overlapping ends so they can join together correctly. The overlaps must be in the right order and orientation so the final construct assembles as planned. A crucial step is to check that the fragments are the correct size and that the DNA is clean. If there are unwanted bands or extra products, the assembly is prone to fail.

5. How does the plasmid DNA enter the E. coli cells during transformation?

Plasmid DNA enters E. coli after the cells are made competent, which means the cell membrane is prepared to allow DNA to pass through more easily. During heat shock or electroporation, the membrane becomes temporarily more permeable. This allows the plasmid DNA to move into the cell. After that, the cells recover and can begin to replicate the plasmid.

6. Describe another assembly method in detail (such as Golden Gate Assembly)

6.1 Explain the other method in 5–7 sentences plus diagrams (either handmade or online).

Golden Gate Assembly is a cloning method that joins DNA fragments together in a specific order. It uses Type IIS restriction enzymes, which cut outside of their recognition sequence instead of directly inside it. This creates short overhangs that can be designed to match only the correct neighbouring fragment. DNA ligase then joins the matching fragments together. One major advantage of Golden Gate Assembly is that the recognition site is usually removed during the process, so the final DNA product can be seamless. This method is very useful when multiple DNA fragments need to be assembled in one reaction.

Made using M365 Copilot.

6.2 Model this assembly method with Benchling or Asimov Kernel.

I am still waiting to see whether we will receive Kernel access for this.

Assignment: Asimov Kernel

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

I am still waiting to see whether we will receive Kernel access for this.

1. Create a Repository for your work

I am still waiting to see whether we will receive Kernel access for this.

2. Create a blank Notebook entry to document the homework and save it to that Repository

I am still waiting to see whether we will receive Kernel access for this.

3. Explore the devices in the Bacterial Demos Repo to understand how the parts work together by running the Simulator on various examples, following the instructions for the simulator found in the “Info” panel (click the “i” icon on the right to open the Info panel)

I am still waiting to see whether we will receive Kernel access for this.

4. Create a blank Construct and save it to your Repository

I am still waiting to see whether we will receive Kernel access for this.

4.1 Recreate the Repressilator in that empty Construct by using parts from the Characterized Bacterial Parts repository

I am still waiting to see whether we will receive Kernel access for this.

4.2 Search the parts using the Search function in the right menu

I am still waiting to see whether we will receive Kernel access for this.

4.3 Drag and drop the parts into the Construct

I am still waiting to see whether we will receive Kernel access for this.

4.4 Confirm it works as expected by running the Simulator (“play” button) and compare your results with the Repressilator Construct found in the Bacterial Demos repository

I am still waiting to see whether we will receive Kernel access for this.

4.5 Document all of this work in your Notebook entry - you can copy the glyph image and the simulator graphs, and paste them into your Notebook

I am still waiting to see whether we will receive Kernel access for this.

5. Build three of your own Constructs using the parts in the Characterized Bacterials Parts Repo

I am still waiting to see whether we will receive Kernel access for this.

5.1 Explain in the Notebook Entry how you think each of the Constructs should function

I am still waiting to see whether we will receive Kernel access for this.

5.2 Run the simulator and share your results in the Notebook Entry

I am still waiting to see whether we will receive Kernel access for this.

5.3 If the results don’t match your expectations, speculate on why and see if you can adjust the simulator settings to get the expected outcome

I am still waiting to see whether we will receive Kernel access for this.


Reading & Resources (click to expand)

Resources

Week 7: HW genetic circuits part II

This week covers neuromorphic genetic circuits, showing how engineered gene networks can implement neural-network “perceptron”-like computation and learning.

Objective:

  1. Understand intracellular artificial neural networks (IANNs).
  2. Explore applications of neuromorphic genetic circuits.
  3. Learn about fungal materials and possible engineering applications in fungi.
  4. Make progress on the individual final project and first DNA design order.

Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviours are Boolean functions?

Intracellular artificial neural networks (IANNs) are advantageous to Boolean genetic circuits because they can exhibit graded biological signals rather than absolute on-or-off states. In reality, cells do not have simple on and off outputs, making IANNs approximate the functionality of the cells with better accuracy. Many inputs can affect a single output and vary like a spectrum; biomolecules like miRNA, TFs, and plasmids drastically alter cellular states. As Anthony Genot explains, “Enzymes bring high nonlinearity, which we exploit to improve the sharpness of computation.” This nonlinearity is difficult to model via Boolean functions using logic gates. IANNs can also integrate many inputs with varying weights to create a more accurate representation of the biological system. This way, inhibitory and stimulatory signals are weighted rather than treated as binary. Compared with Boolean circuits, which become disproportionately complex with growing inputs, multilayer networks can approximate graded, weighted, and non-linear signals of a cell.

2. Describe a useful application for an IANN; include a detailed description of input/output behaviour, as well as any limitations an IANN might face to achieve your goal.

One potential application for IANNs could be to optimise biosynthesis pathways by controlling metabolic outputs of the microbial strain. Since IANNs allow for specific control of the cells, they could be used to mimic natural enzymatic kinetics to specify and optimise metabolic products. The IANN could help by sensing several continuous intracellular signals at the same time, such as precursor availability, intermediate accumulation, stress level, and growth state, and then using those signals to regulate key pathway enzymes in a graded way instead of a simple ON/OFF manner. For example, if harmful intermediates start to build, the system could reduce the specific pathway tied to it to relieve cellular stress and prevent cell death. Although these systems are useful, they can come with many costs. Presently, IANNs suffer from noisy intracellular signals, cell-to-cell variability, and changing cellular context, which makes it difficult to get precise and reproducible input/output behaviour. They can also suffer from cross-talk and resource competition, where different regulators interfere with each other or compete for limited cellular machinery. In biosynthesis, this is even harder because pathways need very precise balancing, and multilayer circuits respond slowly since each layer depends on transcription and translation.

3. Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation.

Draw a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.

I created a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.

Made using BioRender.

Assignment Part 2: Fungal Materials

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

1. What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?

Existing fungal materials are mostly mycelium-based materials, including packaging foams, insulation boards/panels, construction composites, and leather-like textiles made from fungal biomass grown on agricultural waste. These materials are used as substitutes for polystyrene packaging, synthetic foams, animal leather, and some lightweight building materials, because mycelium can act as a natural binder and be shaped into different forms during growth. The main advantage is that they are biobased, biodegradable, and can be produced from waste streams, which can lower environmental impact compared with plastics or animal-derived materials. Their disadvantages are that they still often have lower and more variable mechanical performance, limited water resistance and durability, and struggle to adapt to diverse environments.

2. What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?

If I were to genetically engineer fungi, I would want to make them better microbial factories for complex pharmaceuticals such as paclitaxel. In that case, the goal would be to engineer fungi to express more of the paclitaxel pathway efficiently and improve precursor supply. Synthetic biology in fungi can be advantageous over bacteria for this type of problem because fungi are eukaryotes, so they are generally better at handling complex eukaryotic biosynthetic enzymes, performing post-translational modifications, secreting proteins, and supporting secondary metabolite pathways that are often harder to reconstruct in bacterial hosts. This makes fungi, especially yeasts and filamentous fungi, attractive hosts for producing plant- or fungus-derived pharmaceuticals, since many of these molecules depend on enzyme systems and intracellular organisation that are more similar to those of other eukaryotes than to bacteria. Fungi are also already widely used as industrial production organisms, which makes them a promising chassis for scaling the biosynthesis of valuable drugs. However, bacteria tend to grow faster and are usually easier to manipulate genetically, making it easier to adapt to a variety of situations and contexts.

Assignment Part 3: First DNA Twist Order

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

0. Review the Individual Final Project documentation guidelines.

I reviewed the Individual Final Project documentation guidelines.

1. Submit this Google Form with your draft Aim 1, final project summary, HTGAA industry council selections, and shared folder for DNA designs.

N/A, waiting on feedback for the final project, but I have made an explanation of my preject for Paxlitaxel on my project page

2. Review Part 3: DNA Design Challenge of the week 2 homework. Design at least 1 insert sequence and place it into the Benchling/Kernel/Other folder you shared in the Google Form above. Document the backbone vector it will be synthesized in on your website.

N/A, waiting on feedback for the final project.


Reading & Resources (click to expand)

Week 10 — Advanced Imaging & Measurement Technology

This lecture presents a range of advanced technologies to do precision measurement of proteins at atomic scales, characterizing chemical composition, and detecting protein sequence and structure.

Lecture (Tues, Apr 7)

Advanced Imaging & Measurement Tech
(▶️Recording)
Evan Daugharthy, Lindsay Morrison.

Recitation (Wed, Apr 8)

Mass spectrometry
(▶️Recording | 💻Slides)
Waters Corp. Team

Homework — DUE BY START OF Apr 14 LECTURE

Homework is partly based on data that will be generated in the Waters Immerse Lab in Cambridge, MA. Students will characterize green fluorescent protein (eGFP, a recombinant protein standard) structure (primary, secondary/tertiary) in the lab using liquid chromatography and mass spectrometry, as well as Keyhole Limpet Hemocyanin (KLH) oligomeric states using charge detection mass spectrometry (CDMS). Data generated in the lab needed to do the homework is included both within this document and in the Appendix of the laboratory protocol.

Homework: Final Project

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

For your final project:

  • Please identify at least one (ideally many) aspect(s) of your project that you will measure. It could be the mass or sequence of a protein, the presence, absence, or quantity of a biomarker, etc.
  • Please describe all of the elements you would like to measure, and furthermore describe how you will perform these measurements.
  • What are the technologies you will use (e.g., gel electrophoresis, DNA sequencing, mass spectrometry, etc.)? Describe in detail.

I will measure both enzyme expression and activity in the paclitaxel biosynthesis pathway, focusing on CYP725A4. The aim is to confirm that the enzyme is successfully expressed and that it is functionally active in producing the desired product.

Protein expression will first be analysed using SDS‑PAGE, a simple and low-cost technique that separates proteins based on size. This allows me to confirm that a protein is present at the expected molecular weight and gives a rough indication of expression level. If more specific confirmation is required, a Western blot can be used to selectively detect the target protein using an antibody.

To evaluate enzyme activity, I will use LC‑MS to detect and quantify small molecule products such as taxadien‑5α‑ol. This method provides high sensitivity and accuracy, allowing me to confirm that the enzyme is producing the correct product.

Overall, this approach combines low-cost methods for initial validation with more precise analytical techniques for functional characterization.

Homework: Waters Part I — Molecular Weight

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

We will analyze an eGFP standard on a Waters Xevo G3 QTof MS system to determine the molecular weight of intact eGFP and observe its charge state distribution in the native and denatured (unfolded) states. The conditions for LC-MS analysis of intact protein cause it to unfold and be detected in its denatured form (due to the solvents and pH used for analysis).

  1. Based on the predicted amino acid sequence of eGFP (see below) and any known modifications, what is the calculated molecular weight? You can use an online calculator like the one at https://web.expasy.org/compute_pi/

    eGFP Sequence:
    MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE HHHHHH
    Note: This contains a His-purification tag (HHHHHH) and a linker (the LE before it).

Sequence used:
MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYKLEHHHHHH

The calculated molecular weight of eGFP including the His-tag is approximately

Without his tag Theoretical pI/Mw: 5.58 / 26941.48
With his tag Theoretical pI/Mw: 5.90 / 28006.60

  1. Calculate the molecular weight of the eGFP using the adjacent charge state approach described in the recitation. Select two charge states from the intact LC-MS data and:

    1. Determine ( z ) for each adjacent pair of peaks ((n, n+1)) using:

    $$ z = \frac{(m/z){n+1}}{(m/z){n+1} - (m/z)_n} $$

    1. Determine the MW of the protein using the relationship between ((m/z)_n), MW, and (z)

    2. Calculate the accuracy of the measurement using the deconvoluted MW from 2.2 and the predicted weight of the protein from 2.1 using:

    $$ \text{Accuracy} = \frac{|MW_{\text{experiment}} - MW_{\text{theory}}|}{MW_{\text{theory}}} $$

Two adjacent peaks from the spectrum were selected:

[ (m/z)n = 903.7148 \quad \text{and} \quad (m/z){n+1} = 933.7349 ]

The charge state was calculated using:

$$ z = \frac{(m/z){n+1}}{(m/z){n+1} - (m/z)_n} $$

$$ z = \frac{933.7349}{933.7349 - 903.7148} = \frac{933.7349}{30.0201} \approx 31 $$

The molecular weight was then calculated using:

$$ MW = z \times \left((m/z)_n - 1\right) $$

$$ MW = 31 \times (903.7148 - 1) = 31 \times 902.7148 \approx 27{,}984 \text{ Da} $$

The molecular weight of eGFP is approximately 28.0 kDa.

  1. Calculate the accuracy of the measurement.

Using the theoretical molecular weight (~27,800 Da):

$$ \text{Accuracy} = \frac{|MW_{\text{experiment}} - MW_{\text{theory}}|}{MW_{\text{theory}}} $$

$$ = \frac{|27{,}984 - 27{,}800|}{27{,}800} = \frac{184}{27{,}800} \approx 0.0066 $$

The measurement accuracy is approximately 0.66%.

  1. Can you observe the charge state for the zoomed-in peak in the mass spectrum for the intact eGFP? If yes, what is it? If no, why not?

The charge state cannot be clearly determined from the zoomed-in peak because the isotope peaks are not fully resolved and overlap with each other. This makes it difficult to measure the spacing between peaks, which is required to calculate the charge state accurately.

Homework: Waters Part II — Secondary/Tertiary structure

Assignees for the following sections
MIT/Harvard studentsOptional but highly recommended
Committed ListenersOptional but highly recommended

We will analyze eGFP in its native, folded state and compare it to its denatured, unfolded state on a quadrupole time-of-flight MS. We will be doing MS-only analysis (no liquid chromatography, also known as “direct infusion” experiments) on the Waters Xevo G3-QToF MS.

  1. Based on learnings in the lab, please explain the difference between native and denatured protein conformations. For example, what happens when a protein unfolds? How is that determined with a mass spectrometer? What changes do you see in the mass spectrum between the native and denatured protein analyses?

A native protein is folded into its compact, functional structure, while a denatured protein is unfolded due to changes in solvent or pH. When the protein unfolds, more charged residues become exposed, increasing the number of charges it can carry.

In mass spectrometry, denatured proteins show a broader distribution of higher charge states, while native proteins show fewer peaks at lower charge states due to their compact structure.

  1. Zooming into the native mass spectrum of eGFP from the Waters Xevo G3 QTof MS, can you discern the charge state of the peak at ~2800 (\frac{m}{z})? What is the charge state? How can you tell?

From Figure 3, the zoomed inset shows isotope peaks spaced by approximately ~0.125 (m/z).

Using:

$$ z \approx \frac{1}{\text{spacing}} $$

$$ z \approx \frac{1}{0.125} \approx 8 $$

The charge state is approximately +8.

Homework: Waters Part III — Peptide Mapping - primary structure

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

We will digest the eGFP protein standard into peptides using trypsin (an enzyme that selectively cleaves the peptide bond after Lysine (K) and Arginine (R) residues. The resulting peptides will be analyzed on the Waters BioAccord LC-MS to measure their molecular weights and fragmented to confirm the amino acid sequence within each peptide – generating a “peptide map”. This process is used to confirm the primary structure of the protein.

There are a variety of tools available online to calculate protein molecular weight and predict a list of peptides generated from a tryptic digest. We will be using tools within the online resource Expasy (the bioinformatics resource portal of the Swiss Institute of Bioinformatics (SIB)) to predict a list of tryptic peptides from eGFP.

  1. How many Lysines (K) and Arginines (R) are in eGFP? Please circle or highlight them in the eGFP sequence given in Waters Part I question 1 above. (Note: adding the sequence to Benchling as an amino acid file and clicking biochemical properties tab will show you a count for each amino acid).

Sequence used:
MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE HHHHHH

From the sequence:
Lysine (K) = 20
Arginine (R) = 7

  1. How many peptides will be generated from tryptic digestion of eGFP?

    1. Navigate to https://web.expasy.org/peptide_mass/
    2. Copy/paste the sequence above into the input box in the PeptideMass tool to generate expected list of peptides.
    3. Use the relevant parameters to predict peptides from eGFP.
    4. Click “Perform the Cleavage” button in the PeptideMass tool and report the number of peptides generated when using trypsin to perform the digest.

Trypsin cleaves after lysine (K) and arginine (R) residues.

Estimated number of peptides:

$$ (K + R) + 1 = 28 $$

Approximately 28 peptides are expected, assuming no missed cleavages.

  1. Based on the LC-MS data for the peptide map data generated in lab, how many chromatographic peaks do you see in the eGFP peptide map between 0.5 and 6 minutes? You may count all peaks that are >10% relative abundance.

From Figure 5a, approximately 20 peaks above the threshold are observed.

  1. Assuming all the peaks are peptides, does the number of peaks match the number of peptides predicted from question 2 above? Are there more peaks in the chromatogram or fewer?

No, the observed number of peaks is lower than the predicted number of peptides. This may be due to incomplete digestion, co-elution of peptides, or detection limits.

  1. Identify the mass-to-charge (\left(\frac{m}{z}\right)) of the peptide shown in Figure 5b. What is the charge ((z)) of the most abundant charge state of the peptide (use the separation of the isotopes to determine the charge state). Calculate the mass of the singly charged form of the peptide (\left([M+H]^+\right)) based on its (\frac{m}{z}) and (z).

From the spectrum:
(m/z \approx 525.76)
Isotope spacing (\approx 0.5 \rightarrow z \approx 2)

$$ M = (m/z \times z) - z $$

$$ M \approx (525.76 \times 2) - 2 \approx 1049.5 \text{ Da} $$

The peptide mass is approximately 1049 Da.

Predicted tryptic peptides of eGFP (Expasy PeptideMass)

Mass (Da)PositionPeptide Sequence
4472.18170–210HNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSK
2566.29217–239DHMVLLEFVTAAGITLGMDELYK
2437.265–27GEELFTGVVPILVELDGDVNGHK
2378.2654–74LPVPWPTLVTTLTYGVQCFSR
1973.91142–157LEYNYNSHNVYIMADK
1503.6628–42FSVSGEGEGDATYGK
1266.5887–97SAMPEGYVQER
1083.50240–247LEHHHHHH
1050.52115–123FEGDTLVNR
982.50133–141EDGNILGHK
821.3981–86QHDFFK
790.3675–80YPDHMK
769.3947–53FICTTGK
711.29103–108DDGNYK
655.3898–102TIFFK
602.28211–215DPNEK
579.31128–132GIDFK
507.29164–167VNFK
502.32124–127IELK
  1. Identify the peptide based on comparison to expected masses in the PeptideMass tool. What is mass accuracy of measurement? Please calculate the error in ppm.

Recall that

$$ \text{Accuracy} = \frac{|MW_{\text{experiment}} - MW_{\text{theory}}|}{MW_{\text{theory}}} $$

From Figure 5b:
(m/z \approx 525.76)
(z = 2)

$$ M \approx (525.76 \times 2) - 2 \approx 1049.5 \text{ Da} $$

Closest match from Expasy:
Peptide: FEGDTLVNR
Theoretical mass = 1050.52 Da

$$ \text{Error (ppm)} = \frac{1049.5 - 1050.52}{1050.52} \times 10^6 \approx -970 \text{ ppm} $$

$$ \text{Accuracy} = \frac{|27{,}984 - 27{,}988.96|}{27{,}988.96} \approx 0.00018 $$

The identified peptide matches FEGDTLVNR, with a peptide mass error of approximately −970 ppm and overall protein accuracy of ~0.018%.

  1. What is the percentage of the sequence that is confirmed by peptide mapping?

From Expasy:
90.7% coverage

From Figure 6:
~88% coverage

The peptide mapping confirms approximately 88–91% of the sequence, indicating strong agreement with the expected eGFP structure.

Homework: Waters Part IV — Oligomers

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

We will determine Keyhole Limpet Hemocyanin (KLH)’s oligomeric states using charge detection mass spectrometry (CDMS).

CDMS single-particle measurements of KLH allow us to make direct mass measurements to determine what oligomeric states (that is, how many protein subunits combine) are present in solution. Using the known masses of the polypeptide subunits (Table 1) for KLH, identify where the following oligomeric species are on the spectrum shown below from the CDMS:

  • 7FU Decamer
  • 8FU Didecamer
  • 8FU 3-Decamer
  • 8FU 4-Decamer
Polypeptide Subunit NameSubunit Mass
7FU340 kDa
8FU400 kDa

Given subunits:
7FU = 340 kDa
8FU = 400 kDa

Calculated oligomers:
7FU decamer = 3.4 MDa
8FU didecamer = 8.0 MDa
8FU 3-decamer = 12.0 MDa
8FU 4-decamer = 16.0 MDa

From Figure 7:
Peak at ~3.4 MDa matches the 7FU decamer
Peak at ~8.33 MDa matches the 8FU didecamer
Peak at ~12.67 MDa matches the 8FU 3-decamer

These peaks align closely with the expected masses of the oligomeric assemblies, indicating that the protein forms higher-order structures in solution. The presence of multiple distinct peaks suggests a mixture of oligomeric states rather than a single uniform complex. Overall, the data is consistent with known behaviour of KLH, which is known to form large, multimeric assemblies.

Homework: Waters Part V — Did I make GFP?

Assignees for the following sections
MIT/Harvard studentsRequired
Committed ListenersRequired

Please fill out this table with the data you acquired from the lab work done at the Waters Immerse Lab in Cambridge, or else the data screenshots in this document if you were unable to have lab work done at Waters.

TheoreticalObserved/measured on the Intact LC-MSPPM Mass Error
Molecular weight (kDa)

Reading & Resources (click to expand)

Subsections of Labs

Week 1 Lab: Pipetting

cover image cover image

Projects

Final projects:

  • Project drafts The following three slides represent my drafts for my final project. Project one involves decaffeinating drinks using bacterial strains, project 2 and 3 are similiar in nature as both are small molecule drugs which I aim to synthesise using bacteria. Although this is ambitious I have also found that a mutual precursor such as diterpene could be made instead of the complete drug.

Subsections of Projects

Individual Final Project

Project drafts

The following three slides represent my drafts for my final project. Project one involves decaffeinating drinks using bacterial strains, project 2 and 3 are similiar in nature as both are small molecule drugs which I aim to synthesise using bacteria. Although this is ambitious I have also found that a mutual precursor such as diterpene could be made instead of the complete drug.

ppt1 ppt1ppt2 ppt2ppt3 ppt3

Optimising the oxidation reaction of Paclitaxel biosynthesis

Paclitaxel is a popular chemotherapy in several cancers such as ovarian, breast, and lung. However, the current production of it remains unsustainable from an environmental and economic perspective and can be optimised using biosynthesis. The most common way it is made on an industrial scale is via semi-synthesis by extraction of 10-deacetylbaccatin III from the European Yew (Taxus baccata) or other similar trees and eventually ending up with paclitaxel. This often is not extremely efficient (paclitaxel after extraction 10%), and contributes to environmental strain as it takes long for yew trees to mature driving up costs further Current biosynthesis is not particularly effective either, some work has been done in generating the taxol precursors using E. coli (10.1126/science.1191652). The complete synthesis is difficult due to the inefficiency of enzymatic reactions. One of the current bottlenecks in production is the first oxidation step catalysed by taxadiene 5α-hydroxylase (CYP725A4). This enzyme converts taxadiene into taxadien‑5α‑ol but exhibits low catalytic efficiency and poor selectivity, resulting in the formation of multiple undesired side products. Heterologous expression of taxadiene‑5α‑hydroxylase (CYP725A4) results in a high side‑product to main‑product ratio and low taxadien‑5α‑ol titres due to the formation of multiple oxygenated taxane derivatives, thereby limiting metabolic flux toward paclitaxel precursors and hindering efficient microbial production (doi.org/10.1186/s12934-022-01922-1).

“The challenge for biosynthesis of paclitaxel lies on the insufficient precursor, such as taxadien-5α-ol” (Wu, QY et al. doi.org/10.1186/s40643-022-00569-5)

I want to optimise the CYP725A4‑catalysed oxidation step in paclitaxel biosynthesis, which currently exhibits low selectivity due to competing reaction pathways in its active site. This may be achieved through enzyme engineering approaches such as active site analysis, molecular docking, rational mutation prediction, or by exploring alternative enzyme variants. Improving this early biosynthetic step could increase taxadien‑5α‑ol production and enhance the overall efficiency and sustainability of microbial paclitaxel synthesis.

Final Project Slide

After some discussion with my node TAs, I settled on paclitaxel as my final project. My three goals are as follows;

Final Project Slide

Aim 1: Identify and design CYP725A4 variants with improved efficiency using DNA construct design, rational mutation prediction, active-site analysis, molecular docking, and AI

Aim 2: Experimentally test the best CYP725A4 variants in a heterologous expression system and compare product distributions to determine if product formation improves relative to current variants

Aim 3: Enable more efficient and sustainable microbial paclitaxel production by reducing a major bottleneck in the biosynthetic pathway, decreasing dependence on plant derived intermediates

Group Final Project

cover image cover image