Homework

Weekly homework submissions:


Week 1 HW: Principles and Practices

1. Project Concept: In-Silico Design of a Lactase-Releasing Probiotic for Lactose Intolerance
  1. First, describe a biological engineering application or tool you want to develop and why. This could be inspired by an idea for your HTGAA class project and/or something for which you are already doing in your research, or something you are just curious about.

I am interested in developing an engineered probiotic system designed to release the lactase enzyme on demand in the human gut for individuals with lactose intolerance. This project is entirely in silico, combining concepts from synthetic biology, microbiome modeling, and systems biology without any wet-lab implementation.

The system would simulate a probiotic chassis such as Lactobacillus or Bifidobacterium, equipped with virtual genetic circuits inspired by lactose metabolism. These circuits would model regulatory control of lactase expression based on local lactose concentration, using logic-gate–like behavior and feedback mechanisms. Enzyme production would increase when lactose is present and decrease once lactose is depleted, allowing adaptive and resource-efficient regulation.
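The feedback behavior described above can be explored with a very small simulation. The sketch below is a minimal illustration, not the project's actual model: it assumes lactase production follows a Hill function of lactose concentration, first-order enzyme loss, and Michaelis-Menten lactose breakdown. The function name `simulate_circuit` and every parameter value are hypothetical placeholders in arbitrary units, chosen only to show the qualitative rise and fall of enzyme levels.

```python
def simulate_circuit(lactose0=10.0, hours=24.0, dt=0.01,
                     k_max=1.0, K=1.0, n=2,
                     k_deg=0.5, k_cat=0.8, K_m=2.0):
    """Euler integration of a toy lactose-induced lactase circuit.

    All parameters are illustrative assumptions (arbitrary units).
    Returns a list of (time, lactose, enzyme) tuples.
    """
    lactose, enzyme = lactose0, 0.0
    trajectory = [(0.0, lactose, enzyme)]
    for step in range(1, int(hours / dt) + 1):
        # Hill-type induction: close to 1 when lactose >> K, close to 0 when depleted
        induction = lactose**n / (K**n + lactose**n)
        d_enzyme = k_max * induction - k_deg * enzyme
        # Michaelis-Menten consumption of lactose by the enzyme
        d_lactose = -k_cat * enzyme * lactose / (K_m + lactose)
        enzyme += d_enzyme * dt
        lactose = max(0.0, lactose + d_lactose * dt)
        trajectory.append((step * dt, lactose, enzyme))
    return trajectory


trajectory = simulate_circuit()
peak_enzyme = max(enzyme for _, _, enzyme in trajectory)
_, final_lactose, final_enzyme = trajectory[-1]
print(f"peak enzyme level:  {peak_enzyme:.2f}")
print(f"final enzyme level: {final_enzyme:.2f}")
print(f"final lactose:      {final_lactose:.2f}")
```

With these assumed values, enzyme levels rise while lactose is abundant and fall again after it is consumed, which is the adaptive behavior the design aims for. Real promoter and enzyme kinetics would need to come from the literature or from experiments.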

Why Is This Idea Relevant?

In-silico modeling is a recognized and safe approach in synthetic biology that allows the exploration of engineered biological systems and gut microbiome interactions without experimental, ethical, or biosafety risks. Such computational frameworks enable hypothesis generation, system-level understanding, and educational visualization of complex biological behaviors before any real-world implementation.

Note

Lactose intolerance is one of the most common digestive disorders globally, caused by reduced or absent lactase activity in adulthood. It affects a large proportion of the world’s population, particularly in Africa, Asia, and South America, leading to gastrointestinal discomfort and dietary restrictions. Addressing this condition highlights a real, widespread health challenge that benefits from innovative and accessible solutions. (Lactose Intolerance - NIDDK, 2024)

2. Governance / Policy Goals

2. Next, describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future, like ensuring non-malfeasance (preventing harm). Break big goals down into two or more specific sub-goals. Below is one example framework (developed in the context of synthetic genomics) you can choose to use or adapt, or you can develop your own. The example was developed to consider policy goals of ensuring safety and security, alongside other goals, like promoting constructive uses, but you could propose other goals for example, those relating to equity or autonomy.


Because this project represents an early, in-silico design phase, its governance goals focus on the responsible framing, communication, and interpretation of computational results rather than regulation of a finalized biological product.

1. Ensuring Ethical Transparency

In silico models can appear highly convincing, even though they rely on simplifying assumptions. Without transparency, such simulations may be mistakenly interpreted as real biological proof, reused incorrectly by others, or generate unjustified confidence in safety or effectiveness. To prevent these risks, the project emphasizes:

  • Clear documentation of all modeling assumptions, including chosen parameters (e.g., lactose concentration thresholds, promoter sensitivity), simulation boundaries, and known limitations.
  • Explicit disclosure of the speculative nature of the work, clarifying potential real-world implications while emphasizing that the model does not represent a validated or deployable probiotic system.
2. Maintaining Scientific Integrity

Although the conceptual model may function optimally in simulation, real biological systems often behave unpredictably due to environmental variability and biological complexity. To maintain scientific integrity, it is essential to:

  • Avoid overstating the effectiveness or safety of real-world probiotics based solely on computational results, and clearly distinguish between theoretical design and experimentally validated outcomes.
3. Considering Public Health and Safety

Since biological behavior cannot be predicted with complete accuracy, the project addresses public health and safety by:

  • Highlighting potential risks of physical implementation, such as disruption of gut microbiome balance or unintended metabolic effects.
  • Including scenario-based analyses to explore possible unexpected consequences for gut microbiome health under different simulated conditions.
3. Potential Governance Actions

3. Next, describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”). Try to outline a mix of actions (e.g. a new requirement/rule, incentive, or technical strategy) pursued by different “actors” (e.g. academic researchers, companies, federal regulators, law enforcement, etc). Draw upon your existing knowledge and a little additional digging, and feel free to use analogies to other domains (e.g. 3D printing, drones, financial systems, etc.).

  1. Purpose: What is done now and what changes are you proposing?
  2. Design: What is needed to make it “work”? (including the actor(s) involved - who must opt-in, fund, approve, or implement, etc)
  3. Assumptions: What could you have wrong (incorrect assumptions, uncertainties)?
  4. Risks of Failure & “Success”: How might this fail, including any unintended consequences of the “success” of your proposed actions?

| Purpose | Design | Assumptions | Risks of Failure & “Success” |
|---|---|---|---|
| Providing mandatory transparency and documentation standards for in-silico biological models (by academic researchers, journals, funding bodies) | Require structured documentation sections describing modeling assumptions, parameter choices, simulation constraints, and known limitations of the model | Clear and standardized documentation reduces misuse, misinterpretation, and overconfidence in simulation results | Documentation may be superficial, misunderstood, or ignored by users |
| Providing ethical claim-limitation guidelines for computational synthetic biology projects (by bioethics committees, academic institutions) | Encourage explicit labeling of projects as Conceptual, Exploratory, or Pre-experimental, and require clear statements that simulation outcomes do not constitute clinical or biological proof | Clear framing of claims improves scientific integrity, responsible communication, and public trust in synthetic biology research | Guidelines may be ignored outside formal academic or publishing contexts; excessive caution may slow translation of promising concepts into experimental research |
| Recommending scenario-based risk modeling as a design requirement (by researchers, synthetic biology educators) | Integrate scenario analysis into in-silico projects, exploring possible unintended outcomes such as microbiome imbalance, excessive enzyme expression, or metabolic side effects if the system were physically implemented | Early anticipation of risks improves downstream design decisions and promotes responsible innovation | Scenario analysis may oversimplify complex biological interactions |

4. Scoring Governance Actions Against Policy Goals

4. Next, score (from 1-3 with, 1 as the best, or n/a) each of your governance actions against your rubric of policy goals. The following is one framework but feel free to make your own:


| Action / Policy Goal | Ensuring Ethical Transparency | Maintaining Scientific Integrity | Considering Public Health and Safety |
|---|---|---|---|
| Providing Mandatory Transparency & Documentation Standards for In-Silico Biological Models | 1 | 2 | 3 |
| Providing Ethical Claim-Limitation Guidelines for Computational Synthetic Biology Projects | 2 | 1 | 2 |
| Recommending Scenario-Based Risk Modeling as a Design Requirement | 3 | 2 | 1 |

5. Prioritization of Governance Options and Strategic Recommendations

5. Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties. For this, you can choose one or more relevant audiences for your recommendation, which could range from the very local (e.g. to MIT leadership or Cambridge Mayoral Office) to the national (e.g. to President Biden or the head of a Federal Agency) to the international (e.g. to the United Nations Office of the Secretary-General, or the leadership of a multinational firm or industry consortia). These could also be one of the “actor” groups in your matrix.


From my perspective, scenario-based risk modeling should be prioritized over the other governance options. All three approaches address public health and safety, but only scenario-based analysis does so directly: it explicitly explores what could go wrong if an in-silico model were physically implemented, making it the most direct mechanism for anticipating risks to gut microbiome balance or unintended metabolic effects. Maintaining scientific integrity plays a critical indirect role in protecting public health: by avoiding overclaiming the safety or effectiveness of a purely conceptual model, the transition from simulation to real-world application becomes more cautious, accurate, and oriented toward appropriate experimental validation, which reduces the likelihood of harmful misinterpretations. Similarly, ensuring ethical transparency through clear and accurate documentation of modeling assumptions, parameters, and limitations improves how the model is interpreted and reused by others, helping prevent incorrect applications that could ultimately pose health risks.

In this homework, ChatGPT (an AI assistant) helped me organize and clearly articulate my answers and descriptions so that the content is well structured and easy to understand.


Sources:


Assignment (Week 2 Lecture Prep):

Homework Questions from Professor Jacobson:

  1. Error rate and genome context
     • From slide 8, DNA polymerase has an error rate of ~1 in 10⁶ bases.
     • With a human genome of ~3 × 10⁹ bp, this would result in ~3,000 errors per replication without repair.
     • Biology closes this gap with the proofreading activity of DNA polymerase (3′→5′ exonuclease) and post-replication mismatch repair (e.g., MutS), supported by other repair pathways such as NER and BER, which collectively reduce the final error rate to ~1 in 10⁹–10¹⁰.
  2. Coding for an average human protein: the average human protein is encoded by ~1036 bp (~345 amino acids). With ~3 codons per amino acid on average, the number of possible DNA sequences for an average human protein is ~3³⁴⁵ (~10¹⁶⁴ possible sequences). Not all of these sequences work in practice: insertions, deletions, transitions, and transversions can introduce frameshifts or premature stop codons that make the protein non-functional. In addition, some sequences form unwanted secondary structures in the mRNA, affect splicing, or introduce cryptic signals that disrupt translation.
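
The arithmetic behind both answers can be checked in a few lines. This is only a back-of-the-envelope sketch using the rounded figures quoted above (error rates, genome size, and ~3 codons per amino acid); the variable names are my own.

```python
import math

GENOME_SIZE_BP = 3e9           # approximate human genome size
RAW_ERROR_RATE = 1e-6          # polymerase errors per base, before repair
REPAIRED_ERROR_RATES = (1e-9, 1e-10)  # range quoted after proofreading + repair

errors_without_repair = GENOME_SIZE_BP * RAW_ERROR_RATE
errors_with_repair = [GENOME_SIZE_BP * rate for rate in REPAIRED_ERROR_RATES]

RESIDUES = 345                 # average human protein length (amino acids)
CODONS_PER_RESIDUE = 3         # rough average degeneracy
# 3**345 is huge, so work with its base-10 logarithm instead
log10_sequences = RESIDUES * math.log10(CODONS_PER_RESIDUE)

print(f"errors per replication without repair: {errors_without_repair:.0f}")
print(f"errors per replication with repair:    {errors_with_repair}")
print(f"possible coding sequences: about 10^{log10_sequences:.1f}")
```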

Homework Questions from Dr. LeProust:

  1. Most commonly used method for oligo synthesis Today, almost all synthetic DNA is made using phosphoramidite solid-phase synthesis. This method adds one nucleotide at a time on a solid support and is reliable, efficient, and easy to automate, which is why it became the standard for modern DNA synthesizers. https://biolabmix.ru/en/info/detail/oligonucleotide-synthesis/#:~:text=The%20most%20common%20approach%20to,for%20example%2C%20by%20attaching%20fluorophores.

  2. Why it’s hard to make oligos longer than ~200 nt Each step in chemical DNA synthesis is very efficient but not perfect, so small errors happen every time a base is added. As the oligo gets longer, these errors pile up, and beyond about 200 nucleotides it becomes very difficult to get a clean, full-length sequence. https://pubs.rsc.org/en/content/articlepdf/2025/sc/d4sc06958g

  3. Why you can’t directly synthesize a 2000 bp gene Making a 2000-base gene in one piece would accumulate too many chemical errors and damaged bases to be useful. Instead, companies synthesize short oligos and then assemble them enzymatically, followed by cloning and sequence checking to make sure the gene is correct. https://www.pnas.org/doi/10.1073/pnas.2237126100#:~:text=The%20broader%20implications%20of%20the,without%20multiple%20repair/selection%20steps.
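
The way errors "pile up" with length can be made concrete with a simple model. The sketch below assumes every coupling step succeeds independently with the same probability, so the fraction of full-length product is that efficiency raised to the number of couplings. The 99% and 99.5% efficiencies are illustrative assumptions, not values taken from the cited sources, and the function name is hypothetical.

```python
def full_length_fraction(length_nt, coupling_efficiency):
    """Fraction of strands that are full length after length_nt - 1 couplings."""
    return coupling_efficiency ** (length_nt - 1)


for efficiency in (0.99, 0.995):          # assumed per-step efficiencies
    for length in (20, 100, 200, 2000):
        fraction = full_length_fraction(length, efficiency)
        print(f"efficiency {efficiency:.3f}, {length:>4} nt: "
              f"{fraction:.2e} full-length")
```

Under these assumptions a 200 nt oligo already has a low full-length yield, and a 2000 nt product is essentially absent, which is consistent with assembling genes from shorter oligos instead.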


Homework Question from George Church:

All animals require the same 10 essential amino acids because they cannot synthesize them and must obtain them from their diet. These are: histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, valine, and arginine (arginine is essential for many animals but only conditionally essential in adult humans). The “lysine contingency” refers to the idea that lysine is often the limiting essential amino acid in plant-based diets, especially those dominated by cereals like wheat, rice, or maize. Since animals cannot make lysine, their growth and health are directly constrained by how much lysine is available in their food. Knowing that all animals share the same essential amino acid requirements makes lysine’s importance stand out even more: it is not just nutritionally important but evolutionarily critical.

https://www.kemin.com/ap/en/blog/animal/amino-acids-for-animal-health#:~:text=Essential%20amino%20acids:%20These%20are,essential)%2C%20leucine%20and%20lysine

Week 2 HW: DNA Read, Write, and Edit

Part 0: Basics of Gel Electrophoresis

Attend or watch all lecture and recitation videos. Optionally watch bootcamp


Part 1: Benchling & In-silico Gel Art

See the Gel Art: Restriction Digests and Gel Electrophoresis protocol for details. Overview: make a free account at benchling.com, import the Lambda DNA, and simulate restriction enzyme digestion with the following enzymes:

  • EcoRI
  • HindIII
  • BamHI
  • KpnI
  • EcoRV
  • SacI
  • SalI

Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks. You might find Ronan’s website a helpful tool for quickly iterating on designs!

  • In this part, I imported the complete 48,502 bp linear genome of bacteriophage lambda from NCBI GenBank into Benchling. This sequence corresponds to the Lambda DNA sold by NEB (N3011) and is used for the in-silico restriction digestion.


  • I then simulated restriction enzyme digestion using EcoRI, HindIII, BamHI, KpnI, EcoRV, SacI, and SalI, and ran an in-silico gel electrophoresis. The resulting virtual gel shows discrete bands corresponding to the digestion fragments, which demonstrates how sequence information maps to physical separation in gel electrophoresis.

Virtual digest gel

  • To create a pattern in the style of Paul Vanouse’s work, I experimented with different combinations of restriction enzymes to control the gel band patterns. By adjusting the number and length of the resulting DNA fragments, I explored how these parameters influence the final visual outcome. Through this process, I ultimately obtained a gel pattern resembling a butterfly shape.

Gel pattern in style of Paul Vanouse’s work
  • This helped me understand how restriction digests and gels work before doing any real lab experiment. I treated this as both a technical exercise and a creative exploration, inspired by DNA gel art concepts.
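
The logic of a virtual digest can be sketched in a few lines of Python. This is not how Benchling implements it; it is a minimal illustration that searches a linear sequence for each enzyme's recognition site and reports fragment lengths. The recognition sites and top-strand cut offsets are the standard ones for these seven enzymes, while the toy sequence and the function name `digest` are made up for the example (the real lambda genome is not included here).

```python
# Recognition site and cut position on the top strand (bases left of the cut)
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "BamHI": ("GGATCC", 1),
    "KpnI": ("GGTACC", 5),
    "EcoRV": ("GATATC", 3),
    "SacI": ("GAGCTC", 5),
    "SalI": ("GTCGAC", 1),
}


def digest(sequence, enzyme_names):
    """Return fragment lengths (largest first) for a linear DNA digest."""
    sequence = sequence.upper()
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        start = sequence.find(site)
        while start != -1:
            cuts.add(start + offset)
            start = sequence.find(site, start + 1)
    boundaries = [0] + sorted(cuts) + [len(sequence)]
    fragments = [end - begin for begin, end in zip(boundaries, boundaries[1:])]
    return sorted(fragments, reverse=True)


# Toy 362 bp sequence with one EcoRI site and one BamHI site (hypothetical)
toy_dna = "A" * 100 + "GAATTC" + "C" * 200 + "GGATCC" + "T" * 50

print(digest(toy_dna, ["EcoRI"]))           # single cut, two fragments
print(digest(toy_dna, ["EcoRI", "BamHI"]))  # double digest, three fragments
```

Because larger fragments migrate more slowly, sorting the lengths from largest to smallest mirrors the top-to-bottom band order in a gel lane. Choosing enzyme combinations lane by lane is what shapes the final pattern.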
Part 2: Gel Art - Restriction Digests and Gel Electrophoresis

Assignees for the following sections: MIT/Harvard students: Required; Committed Listeners: Optional (for those with lab access).

Perform the lab experiment you designed in Part 1 and outlined in the Gel Art: Restriction Digests and Gel Electrophoresis protocol.


Part 3: DNA Design Challenge

Assignees for the following sections: MIT/Harvard students: Required; Committed Listeners: Required.

3.1. Choose your protein. In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose. [Example from our group homework, you may notice the particular format; the example below came from UniProt]

>sp|P03609|LYS_BPMS2 Lysis protein OS=Escherichia phage MS2 OX=12022 PE=2 SV=1
METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLL
EAVIRTVTTLQQLLT

3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence. The Central Dogma discussed in class and recitation describes the process in which DNA sequence becomes transcribed and translated into protein. The Central Dogma gives us the framework to work backwards from a given protein sequence and infer the DNA sequence that the protein is derived from. Using one of the tools discussed in class, NCBI or online tools (google “reverse translation tools”), determine the nucleotide sequence that corresponds to the protein sequence you chose above. [Example: Get to the original sequence of phage MS2 L-protein from its genome phage MS2 genome - Nucleotide - NCBI]

Lysis protein DNA sequence
atggaaacccgattccctcagcaatcgcagcaaactccggcatctactaatagacgccggccattcaaacatgaggattacccatgtcgaagacaacaaagaagttcaactctttatgtattgatcttcctcgcgatctttctctcgaaatttaccaatcaattgcttctgtcgctactggaagcggtgatccgcacagtgacgactttacagcaattgcttacttaa

3.3. Codon optimization. Once a nucleotide sequence of your protein is determined, you need to codon optimize your sequence. You may, once again, utilize google for a “codon optimization tool”. In your own words, describe why you need to optimize codon usage. Which organism have you chosen to optimize the codon sequence for and why? [Example from Codon Optimization Tool | Twist Bioscience while avoiding Type IIs enzyme recognition sites BsaI, BsmBI, and BbsI]

Lysis protein DNA sequence with Codon-Optimization
ATGGAAACCCGCTTTCCGCAGCAGAGCCAGCAGACCCCGGCGAGCACCAACCGCCGCCGCCCGTTCAAACATGAAGATTATCCGTGCCGTCGTCAGCAGCGCAGCAGCACCCTGTATGTGCTGATTTTTCTGGCGATTTTTCTGAGCAAATTCACCAACCAGCTGCTGCTGAGCCTGCTGGAAGCGGTGATTCGCACAGTGACGACCCTGCAGCAGCTGCTGACCTAA

3.4. You have a sequence! Now what? What technologies could be used to produce this protein from your DNA? Describe in your own words how the DNA sequence can be transcribed and translated into your protein. You may describe either cell-dependent or cell-free methods, or both.

3.5. [Optional] How does it work in nature/biological systems?

  1. Describe how a single gene codes for multiple proteins at the transcriptional level.
  2. Try aligning the DNA sequence, the transcribed RNA, and also the resulting translated Protein!!! See example below. [Example shows the biomolecular flow in central dogma from DNA to RNA to Protein] Special note that all “T” were transcribed into “U” and that the 3-nt codon represents 1-AA. Rearranged snapshot of MS2 L-protein information flow from DNA to RNA to Protein. Captured from Ice’s Benchling and stitched together in a ppt

  • For the DNA design challenge, I chose a protein related to my project interest in engineered probiotics and conditional enzyme release in the gut. The enzyme β-galactosidase is well characterized and commonly expressed in Escherichia coli, making it an ideal candidate for computational DNA design and expression modeling.
  • I first searched the UniProt database to obtain the amino acid sequence of the protein.


  • The amino acid sequence was as follows:
>sp|P00722|BGAL_ECOLI Beta-galactosidase OS=Escherichia coli (strain K12) OX=83333 GN=lacZ PE=1 SV=2
MTMITDSLAVVLQRRDWENPGVTQLNRLAAHPPFASWRNSEEARTDRPSQQLRSLNGEWR
FAWFPAPEAVPESWLECDLPEADTVVVPSNWQMHGYDAPIYTNVTYPITVNPPFVPTENP
TGCYSLTFNVDESWLQEGQTRIIFDGVNSAFHLWCNGRWVGYGQDSRLPSEFDLSAFLRA
GENRLAVMVLRWSDGSYLEDQDMWRMSGIFRDVSLLHKPTTQISDFHVATRFNDDFSRAV
LEAEVQMCGELRDYLRVTVSLWQGETQVASGTAPFGGEIIDERGGYADRVTLRLNVENPK
LWSAEIPNLYRAVVELHTADGTLIEAEACDVGFREVRIENGLLLLNGKPLLIRGVNRHEH
HPLHGQVMDEQTMVQDILLMKQNNFNAVRCSHYPNHPLWYTLCDRYGLYVVDEANIETHG
MVPMNRLTDDPRWLPAMSERVTRMVQRDRNHPSVIIWSLGNESGHGANHDALYRWIKSVD
PSRPVQYEGGGADTTATDIICPMYARVDEDQPFPAVPKWSIKKWLSLPGETRPLILCEYA
HAMGNSLGGFAKYWQAFRQYPRLQGGFVWDWVDQSLIKYDENGNPWSAYGGDFGDTPNDR
QFCMNGLVFADRTPHPALTEAKHQQQFFQFRLSGQTIEVTSEYLFRHSDNELLHWMVALD
GKPLASGEVPLDVAPQGKQLIELPELPQPESAGQLWLTVRVVQPNATAWSEAGHISAWQQ
WRLAENLSVTLPAASHAIPHLTTSEMDFCIELGNKRWQFNRQSGFLSQMWIGDKKQLLTP
LRDQFTRAPLDNDIGVSEATRIDPNAWVERWKAAGHYQAEAALLQCTADTLADAVLITTA
HAWQHQGKTLFISRKTYRIDGSGQMAITVDVEVASDTPHPARIGLNCQLAQVAERVNWLG
LGPQENYPDRLTAACFDRWDLPLSDMYTPYVFPSENGLRCGTRELNYGPHQWRGDFQFNI
SRYSQQQLMETSHRHLLHAEEGTWLNIDGFHMGIGGDDSWSPSVSAEFQLSAGRYHYQLV
WCQK
  • After selecting the protein, I converted the amino acid sequence of β-galactosidase (1024 residues) into the corresponding DNA sequence using the Sequence Manipulation Suite Reverse Translate tool. Because the genetic code is degenerate, multiple codons can encode the same amino acid. The resulting 3072 bp DNA sequence represents one valid nucleotide sequence capable of encoding the β-galactosidase protein.
  • The resulting DNA sequence was as follows:
>reverse translation of sp|P00722|BGAL_ECOLI Beta-galactosidase OS=Escherichia coli (strain K12) OX=83333 GN=lacZ PE=1 SV=2 to a 3072 base sequence of most likely codons.
atgaccatgattaccgatagcctggcggtggtgctgcagcgccgcgattgggaaaacccg
ggcgtgacccagctgaaccgcctggcggcgcatccgccgtttgcgagctggcgcaacagc
gaagaagcgcgcaccgatcgcccgagccagcagctgcgcagcctgaacggcgaatggcgc
tttgcgtggtttccggcgccggaagcggtgccggaaagctggctggaatgcgatctgccg
gaagcggataccgtggtggtgccgagcaactggcagatgcatggctatgatgcgccgatt
tataccaacgtgacctatccgattaccgtgaacccgccgtttgtgccgaccgaaaacccg
accggctgctatagcctgacctttaacgtggatgaaagctggctgcaggaaggccagacc
cgcattatttttgatggcgtgaacagcgcgtttcatctgtggtgcaacggccgctgggtg
ggctatggccaggatagccgcctgccgagcgaatttgatctgagcgcgtttctgcgcgcg
ggcgaaaaccgcctggcggtgatggtgctgcgctggagcgatggcagctatctggaagat
caggatatgtggcgcatgagcggcatttttcgcgatgtgagcctgctgcataaaccgacc
acccagattagcgattttcatgtggcgacccgctttaacgatgattttagccgcgcggtg
ctggaagcggaagtgcagatgtgcggcgaactgcgcgattatctgcgcgtgaccgtgagc
ctgtggcagggcgaaacccaggtggcgagcggcaccgcgccgtttggcggcgaaattatt
gatgaacgcggcggctatgcggatcgcgtgaccctgcgcctgaacgtggaaaacccgaaa
ctgtggagcgcggaaattccgaacctgtatcgcgcggtggtggaactgcataccgcggat
ggcaccctgattgaagcggaagcgtgcgatgtgggctttcgcgaagtgcgcattgaaaac
ggcctgctgctgctgaacggcaaaccgctgctgattcgcggcgtgaaccgccatgaacat
catccgctgcatggccaggtgatggatgaacagaccatggtgcaggatattctgctgatg
aaacagaacaactttaacgcggtgcgctgcagccattatccgaaccatccgctgtggtat
accctgtgcgatcgctatggcctgtatgtggtggatgaagcgaacattgaaacccatggc
atggtgccgatgaaccgcctgaccgatgatccgcgctggctgccggcgatgagcgaacgc
gtgacccgcatggtgcagcgcgatcgcaaccatccgagcgtgattatttggagcctgggc
aacgaaagcggccatggcgcgaaccatgatgcgctgtatcgctggattaaaagcgtggat
ccgagccgcccggtgcagtatgaaggcggcggcgcggataccaccgcgaccgatattatt
tgcccgatgtatgcgcgcgtggatgaagatcagccgtttccggcggtgccgaaatggagc
attaaaaaatggctgagcctgccgggcgaaacccgcccgctgattctgtgcgaatatgcg
catgcgatgggcaacagcctgggcggctttgcgaaatattggcaggcgtttcgccagtat
ccgcgcctgcagggcggctttgtgtgggattgggtggatcagagcctgattaaatatgat
gaaaacggcaacccgtggagcgcgtatggcggcgattttggcgataccccgaacgatcgc
cagttttgcatgaacggcctggtgtttgcggatcgcaccccgcatccggcgctgaccgaa
gcgaaacatcagcagcagttttttcagtttcgcctgagcggccagaccattgaagtgacc
agcgaatatctgtttcgccatagcgataacgaactgctgcattggatggtggcgctggat
ggcaaaccgctggcgagcggcgaagtgccgctggatgtggcgccgcagggcaaacagctg
attgaactgccggaactgccgcagccggaaagcgcgggccagctgtggctgaccgtgcgc
gtggtgcagccgaacgcgaccgcgtggagcgaagcgggccatattagcgcgtggcagcag
tggcgcctggcggaaaacctgagcgtgaccctgccggcggcgagccatgcgattccgcat
ctgaccaccagcgaaatggatttttgcattgaactgggcaacaaacgctggcagtttaac
cgccagagcggctttctgagccagatgtggattggcgataaaaaacagctgctgaccccg
ctgcgcgatcagtttacccgcgcgccgctggataacgatattggcgtgagcgaagcgacc
cgcattgatccgaacgcgtgggtggaacgctggaaagcggcgggccattatcaggcggaa
gcggcgctgctgcagtgcaccgcggataccctggcggatgcggtgctgattaccaccgcg
catgcgtggcagcatcagggcaaaaccctgtttattagccgcaaaacctatcgcattgat
ggcagcggccagatggcgattaccgtggatgtggaagtggcgagcgataccccgcatccg
gcgcgcattggcctgaactgccagctggcgcaggtggcggaacgcgtgaactggctgggc
ctgggcccgcaggaaaactatccggatcgcctgaccgcggcgtgctttgatcgctgggat
ctgccgctgagcgatatgtataccccgtatgtgtttccgagcgaaaacggcctgcgctgc
ggcacccgcgaactgaactatggcccgcatcagtggcgcggcgattttcagtttaacatt
agccgctatagccagcagcagctgatggaaaccagccatcgccatctgctgcatgcggaa
gaaggcacctggctgaacattgatggctttcatatgggcattggcggcgatgatagctgg
agcccgagcgtgagcgcggaatttcagctgagcgcgggccgctatcattatcagctggtg
tggtgccagaaa
  • After reverse translation, I verified the identity of the resulting nucleotide sequence by performing a BLASTn search against the reference lacZ gene from Escherichia coli K-12. The alignment showed 100% query coverage with an E-value of 0.0, confirming a highly significant match. The percent identity was ~84%, which is expected because reverse translation produces a synonymous DNA sequence that differs at the codon level while still encoding the same β-galactosidase protein. This result confirmed that the reverse-translated sequence correctly corresponds to the lacZ gene.
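
A quick local sanity check, complementary to BLAST, is to confirm that the reverse-translated DNA reads back to the same protein. The sketch below is a minimal illustration on the first 20 residues only. The codon choices in `PREFERRED_CODON` are copied from the first 20 codons of the reverse-translated sequence above; it is deliberately not a complete codon table, and the function names are my own rather than anything from the Sequence Manipulation Suite.

```python
# One codon per amino acid, as observed in the reverse-translated output above.
# Incomplete on purpose: it only covers residues in the first 20 positions.
PREFERRED_CODON = {
    "M": "atg", "T": "acc", "I": "att", "D": "gat", "S": "agc",
    "L": "ctg", "A": "gcg", "V": "gtg", "Q": "cag", "R": "cgc",
    "W": "tgg", "E": "gaa", "N": "aac", "P": "ccg",
}
CODON_TO_AA = {codon: aa for aa, codon in PREFERRED_CODON.items()}


def reverse_translate(protein):
    """Protein to DNA, picking the single listed codon for each residue."""
    return "".join(PREFERRED_CODON[aa] for aa in protein)


def translate(dna):
    """DNA back to protein, reading codon by codon from position 0."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


protein_start = "MTMITDSLAVVLQRRDWENP"   # first 20 residues of P00722
dna_start = reverse_translate(protein_start)

print(dna_start)
print("round trip OK:", translate(dna_start) == protein_start)
```

Because each amino acid maps to one fixed codon here, the output is only one of many synonymous sequences, which is why its nucleotide identity to the native lacZ gene is below 100% even though the protein is unchanged.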


Next, I codon-optimized the sequence, which originates from E. coli K-12, to improve expression efficiency in a Lactobacillus probiotic strain (L. delbrueckii subsp. bulgaricus), the intended chassis for conditional lactase expression in the human gut. Codon optimization was performed with the VectorBuilder codon optimization tool, which uses a host-specific algorithm to adjust synonymous codon usage to match the preferred codons of L. delbrueckii while preserving the original amino acid sequence.

Why is codon optimization necessary?

Codon optimization is required because different organisms preferentially use different synonymous codons. Matching the DNA sequence to the codon usage of the target host improves ribosome efficiency, increases protein yield, and reduces translational stalling.
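
To make "synonymous codons" concrete, the sketch below compares the first 20 codons of the two sequences in this write-up: the E. coli-style reverse translation and the L. delbrueckii-optimized version. It builds the standard genetic code table, checks that both stretches encode the same peptide, and counts how many codons were swapped. The helper names are my own, and the check covers only this 60 nt window, not the full gene or the CAI values reported by the tool.

```python
# Standard genetic code, built from the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def split_codons(dna):
    dna = dna.upper()
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


def translate(dna):
    return "".join(CODON_TABLE[codon] for codon in split_codons(dna))


def gc_percent(dna):
    dna = dna.upper()
    return 100.0 * sum(base in "GC" for base in dna) / len(dna)


# First 60 nt of each sequence, copied from this write-up
ecoli_style = "atgaccatgattaccgatagcctggcggtggtgctgcagcgccgcgattgggaaaacccg"
ldel_style = "ATGACTATGATCACCGACAGCCTGGCAGTTGTTTTGCAACGGCGGGACTGGGAAAACCCG"

same_protein = translate(ecoli_style) == translate(ldel_style)
changed = sum(a != b for a, b in zip(split_codons(ecoli_style),
                                     split_codons(ldel_style)))

print("same peptide:", same_protein)
print(f"codons changed: {changed} of {len(split_codons(ecoli_style))}")
print(f"GC% before: {gc_percent(ecoli_style):.1f}, after: {gc_percent(ldel_style):.1f}")
```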


  • The resulting optimized sequence is as follows:

Host organism: Lactobacillus delbrueckii subsp. bulgaricus ATCC 11842 = JCM 1002
Original Sequence: GC=59.80%, CAI=0.72
Optimized Sequence: GC=60.16%, CAI=0.89

Improved DNA[1]: GC=60.16%, CAI=0.89
ATGACTATGATCACCGACAGCCTGGCAGTTGTTTTGCAACGGCGGGACTGGGAAAACCCGGGCGTCACTCAGTTGAACCGGCTGGCCGCCCACCCACCATTTGCCAGCTGGCGCAACTCCGAAGAAGCCCGGACCGACCGGCCGAGCCAGCAACTGAGAAGCTTGAACGGCGAATGGCGTTTCGCCTGGTTTCCGGCCCCGGAAGCCGTCCCAGAAAGCTGGTTGGAATGCGACCTCCCGGAAGCCGATACCGTCGTGGTGCCGAGCAACTGGCAAATGCACGGCTATGACGCCCCCATCTACACCAATGTTACCTACCCAATTACCGTCAACCCGCCATTTGTCCCGACCGAAAACCCGACTGGTTGCTATAGCTTGACCTTCAACGTTGACGAAAGCTGGCTGCAAGAAGGCCAGACCCGCATTATTTTTGACGGCGTTAACAGCGCCTTCCACTTGTGGTGCAACGGCCGCTGGGTCGGCTACGGCCAGGACAGCCGCTTGCCATCCGAATTTGACCTGAGTGCTTTCTTGCGGGCCGGCGAAAACCGTCTGGCCGTCATGGTCCTGCGCTGGAGCGACGGCAGCTACCTGGAAGACCAAGACATGTGGCGGATGTCCGGCATTTTCCGGGACGTCAGCCTGCTGCACAAGCCGACCACCCAGATTTCCGACTTTCACGTTGCAACCCGGTTCAACGACGACTTCTCTCGGGCTGTGCTGGAAGCTGAAGTCCAGATGTGCGGCGAATTGCGGGACTACCTGCGGGTTACTGTTTCATTGTGGCAGGGCGAAACCCAGGTTGCCTCAGGCACCGCCCCGTTTGGCGGTGAAATTATCGACGAACGCGGCGGGTACGCCGACCGGGTTACCTTGAGACTGAACGTGGAAAACCCGAAGTTGTGGAGCGCCGAAATCCCAAATCTGTACCGCGCCGTCGTCGAATTGCACACCGCTGACGGCACCCTGATCGAAGCCGAAGCCTGCGACGTTGGCTTCCGGGAAGTCCGCATCGAAAACGGCTTGCTGCTCCTGAACGGCAAGCCACTGCTGATCCGGGGCGTTAACCGGCACGAACACCACCCATTGCACGGCCAAGTCATGGACGAACAGACTATGGTCCAGGACATCCTGCTGATGAAGCAGAACAACTTCAACGCTGTTCGTTGCTCACACTATCCAAACCATCCACTGTGGTACACTCTGTGCGACCGGTACGGCCTGTACGTTGTGGACGAAGCCAACATCGAAACTCACGGCATGGTTCCGATGAACCGGCTGACCGACGACCCGAGATGGCTGCCAGCCATGAGCGAACGGGTTACTCGCATGGTTCAACGCGACCGGAACCACCCATCCGTTATTATCTGGAGCCTGGGGAACGAAAGCGGCCACGGCGCCAATCACGACGCTCTGTACCGGTGGATCAAGTCCGTCGACCCATCCCGCCCTGTTCAGTACGAAGGCGGCGGCGCCGATACGACCGCCACCGACATCATCTGCCCAATGTACGCCCGGGTTGATGAAGACCAGCCGTTTCCGGCTGTCCCAAAGTGGAGCATCAAGAAGTGGCTGAGCCTGCCAGGCGAAACTCGGCCGCTGATCCTGTGCGAATACGCCCACGCCATGGGCAACTCCCTGGGCGGCTTTGCCAAGTACTGGCAGGCTTTTCGCCAGTATCCACGGTTGCAGGGCGGCTTTGTTTGGGACTGGGTCGACCAAAGCCTGATCAAGTACGACGAAAACGGCAACCCGTGGAGCGCCTACGGCGGCGACTTTGGCGACACCCCGAACGACCGCCAGTTTTGCATGAACGGTCTGGTTTTCGCTGACCGGACGCCACACCCGGCCCTGACCGAAGCCAAGCACCAGCAGCAGTTCTTCCAGTTCCGGCTGTCAGGCCAGACCATCGAAGTGACTAGCGAATACCTGTTTCGCCACTCCGACAACGAATTGTTGCACTGGATGGTCGCCCTGGACGGCAAGCCACTGGCCAGCGG
CGAAGTTCCGCTGGACGTTGCCCCACAGGGCAAGCAGCTGATCGAATTGCCGGAACTGCCGCAGCCGGAAAGCGCCGGCCAACTGTGGCTGACTGTTCGGGTCGTTCAGCCGAACGCCACTGCCTGGTCTGAAGCCGGGCACATCTCAGCCTGGCAGCAGTGGCGCCTGGCCGAAAACTTGAGCGTTACGCTGCCGGCCGCCAGCCACGCCATCCCACACCTGACTACTAGCGAAATGGACTTTTGCATCGAATTGGGCAACAAGCGGTGGCAATTCAACCGGCAGAGCGGCTTTCTGAGCCAGATGTGGATCGGCGACAAGAAGCAGTTGCTGACCCCACTGCGGGATCAGTTCACCCGGGCCCCGCTGGACAACGACATCGGCGTCAGCGAAGCCACTCGGATCGACCCAAACGCCTGGGTCGAACGCTGGAAGGCCGCCGGCCACTACCAGGCCGAAGCCGCTCTGCTGCAATGTACCGCTGATACGCTGGCTGACGCCGTCTTGATTACTACCGCTCACGCCTGGCAACACCAGGGCAAGACTTTGTTTATCAGCCGGAAGACCTACCGGATTGACGGCAGCGGTCAGATGGCCATCACAGTCGATGTCGAAGTTGCCAGCGACACCCCGCACCCGGCACGGATCGGCCTGAACTGCCAGCTGGCCCAGGTTGCCGAACGGGTTAACTGGCTGGGCCTGGGCCCTCAGGAAAACTACCCAGACCGTTTGACGGCTGCCTGCTTTGACCGGTGGGACTTACCGTTGAGCGATATGTACACTCCATACGTCTTTCCGTCCGAAAACGGCCTGCGGTGCGGCACCAGAGAACTGAACTATGGCCCGCACCAGTGGCGCGGTGACTTTCAATTCAACATCAGCCGGTACTCCCAGCAGCAGTTGATGGAAACCAGCCACCGCCACCTGCTGCACGCCGAAGAAGGGACGTGGTTGAACATCGACGGCTTTCACATGGGCATCGGCGGCGACGACTCATGGAGCCCGAGCGTTAGCGCTGAATTCCAGTTGAGCGCCGGCCGGTACCACTACCAGTTGGTTTGGTGCCAGAAG
  • To produce the protein from this DNA sequence, I would use a cell-dependent expression system based on bacterial cloning, transformation, and expression. The gene is first placed into an expression cassette with the necessary regulatory elements so that it can be used by a biological system.
  • The designed DNA sequence is then inserted into a plasmid and introduced into a bacterial host by transformation. Inside the cell, the gene is transcribed into mRNA under the control of the selected promoter. The mRNA is then translated by ribosomes, which read the codons starting at the start codon and assemble the corresponding amino acids into the lactase protein. This approach follows the natural flow of genetic information (DNA to RNA to protein) and allows controlled production of the enzyme in living cells.
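
The DNA to RNA to protein flow can also be written out explicitly, in the spirit of the alignment exercise in 3.5. The sketch below is a simplified illustration: it treats the DNA as the coding strand, so transcription is modeled as replacing T with U, and it translates with the standard genetic code. The input is the first 30 nt of the codon-optimized sequence above; the function names are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate_rna(rna):
    """Read the mRNA three bases at a time; '*' marks a stop codon."""
    triplets = [rna[i:i + 3] for i in range(0, len(rna) - len(rna) % 3, 3)]
    return "".join(CODON_TABLE[t.replace("U", "T")] for t in triplets)


dna = "ATGACTATGATCACCGACAGCCTGGCAGTT"   # first 30 nt of the optimized lacZ
rna = transcribe(dna)
protein = translate_rna(rna)

# Print the three layers aligned codon by codon
print("DNA    ", " ".join(dna[i:i + 3] for i in range(0, len(dna), 3)))
print("RNA    ", " ".join(rna[i:i + 3] for i in range(0, len(rna), 3)))
print("Protein", " ".join(f" {aa} " for aa in protein))
```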
Part 4: Prepare a Twist DNA Synthesis Order

Assignees for the following sections: MIT/Harvard students: Required; Committed Listeners: Required.

This is a practice exercise, not necessarily your real Twist order!

4.1. Create a Twist account, and Benchling account

4.2. Build Your DNA Insert Sequence

For example, let’s make a sequence that will make E. coli glow fluorescent green under UV light by constitutively (always) expressing sfGFP (a green fluorescent protein):

  • In Benchling, select New DNA/RNA sequence.
  • Give your insert sequence a name and select DNA with a Linear topology (this is a linear sequence that will be inserted into a circular backbone vector of our choosing).
  • Go through each piece of the given DNA sequences highlighted below (Promoter, RBS, Start Codon, Coding Sequence, His Tag, Stop Codon, Terminator) and paste the sequences into the Benchling file one after the other (replacing the coding sequence with your codon optimized DNA sequence of interest!).
  • Each time you add a new piece of the sequence, make sure to annotate by right clicking over the sequence and creating an annotation that describes what each piece (e.g., Promoter, RBS, etc.) is (see image below).

Promoter (e.g. BBa_J23106)
TTTACGGCTAGCTCAGTCCTAGGTATAGTGCTAGC

RBS (e.g. BBa_B0034 with spacers for optimal expression)
CATTAAAGAGGAGAAAGGTACC

Start Codon
ATG

Coding Sequence (your codon optimized DNA for a protein of interest, sfGFP for example)
AGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCACAAATTTTCTGTCCGTGGAGAGGGTGAAGGTGATGCTACAAACGGAAAACTCACCCTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCGTGGCCAACACTTGTCACTACTCTGACCTATGGTGTTCAATGCTTTTCCCGTTATCCGGATCACATGAAACGGCATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAACGCACTATATCTTTCAAAGATGACGGGACCTACAAGACGCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATCGTATCGAGTTAAAGGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAACTCGAGTACAACTTTAACTCACACAATGTATACATCACGGCAGACAAACAAAAGAATGGAATCAAAGCTAACTTCAAAATTCGCCACAACGTTGAAGATGGTTCCGTTCAACTAGCAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCGACACAATCTGTCCTTTCGAAAGATCCCAACGAAAAGCGTGACCACATGGTCCTTCTTGAGTTTGTAACTGCTGCTGGGATTACACATGGCATGGATGAGCTCTACAAA

7x His Tag (Let’s add a 7×His tag at the C-terminus of the protein to enable protein purification from E. coli)
CATCACCATCACCATCATCAC

Stop Codon
TAA

Terminator (e.g. BBa_B0015)
CCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCTACTAGAGTCACACTGGCTCACCTTCGGGTGGGCCTTTCTGCGTTTATA

Once you’ve completed this, click on Linear Map to preview the entire sequence. If you intend to have a TA review a sequence in the future, this is a good way to verify that all sections are annotated! (Optional) Share your final sequence link with a TA for review!

This insert sequence you built is commonly referred to as an expression cassette in molecular biology (a sequence you can drop into any vector and it’ll perform its function). Go ahead and download the FASTA file for the sequence you made. It’s helpful to visualize DNA designs using SBOL Canvas (Synthetic Biology Open Language) to convey your designs. Here’s an example of what you just annotated in Benchling:

4.2. On Twist, Select The “Genes” Option

4.3. Select “Clonal Genes” option For this demonstration, we’ll choose Clonal Genes. You’ll select clonal genes or gene fragments depending on your final project. Historically, HTGAA projects using clonal genes (circular DNA) have reached experimental results 1-2 weeks quicker because they can be transformed directly into E. coli without additional assembly. Gene fragments (linear DNA) offer greater design flexibility but typically require an assembly or cloning step prior to transformation. An advantage is that, if designed with the appropriate exonuclease protection, gene fragments can be used directly in cell-free expression.

4.4. Import your sequence You just took an amino acid sequence of interest and converted it into DNA, codon optimized it, and built an expression cassette around it! Choose the Nucleotide Sequence option and Upload Sequence File to upload your FASTA file.

4.5. Choose Your Vector Since we’re ordering a clonal gene, you will need to refer to Twist’s Vector Catalog to choose your circular backbone. You can think of this as taking your linear expression cassette for your protein of interest, and completing the rest of the circle! The backbone confers many special properties like antibiotic resistance, an origin of replication, and more. Discuss with your node to decide on appropriate antibiotic options. At MIT/Harvard, you can use Ampicillin, Chloramphenicol, or Kanamycin resistance. Twist vectors do not contain restriction sites near the insert fragment, so make sure to flank your design with cut sites if you intend to extract this DNA insert fragment later. For this demonstration, choose a Twist cloning vector like pTwist Amp High Copy. Click into your sequence and select download construct (GenBank) to get the full plasmid sequence.

Go back to your Benchling account. Inside of a folder, click the import DNA/RNA sequence button and upload the GenBank file you just downloaded. This is the plasmid you just built with your expression cassette included. Congratulations on building your first plasmid!

Important: For your final projects, remember to include:

  1. Fully annotated Benchling insert fragment
  2. Desired Twist cloning vector

  • A lactose-inducible promoter was selected to enable conditional expression of lactase in response to lactose availability in the gut. The PlacA promoter region was extracted from the Lactococcus lactis lac operon, upstream of the native ribosome binding site, preserving lactose-responsive regulation.
AATCGTCGTTTTTTGTTCATATGAAGACTTTCTTTCATAAAGTAATTTTTTTCCAAAGATAATTCTCTTT
TAATTGTATCATAAAAGATAATATTTTCAAGGTAAAACAAACAATTTCAAACAAAAACAAACGTTAGATG
ATGAAATAAGAACAGAGGATTGACGTATATTAGCTTAGGTCAGATTTTGTATAAGACGAAAATAAAGTAG
GACCTCTTAATCAGTAAGTTATAGAAAGTAAAAGACTTTTGTAATACCTGAATAGATATTTCACGTCCAT
TTTGTGATGGATTAAATGAACAAAAATGAACAATAATTTAACGGTGTTATCTATTTTTTAAAAAAACAAA
TAAAAAAAAACAAAAAATTAACAAAAATAGTTGCGTTTTGTTTGAATGTTTGATATCATATAAACAAAGA
AATGATGAAAACGTTATCTTGAACATTTTGCAAAATATTTTCTACTTCTACGTAGCATTTCTTTTTAAAA
TTTAGGAGGTAGTCCAA


  • For the RBS, I kept the native Lactococcus lactis ribosome binding site (RBS) from the lacA operon, which is the region immediately upstream of the coding sequence (CDS), and preserved its original spacer length to support efficient translation initiation in the probiotic host. Maintaining native RBS spacing is critical in Gram-positive bacteria because ribosome binding and translation efficiency are highly sensitive to the distance between the Shine–Dalgarno sequence and the start codon.

  • The RBS sequence is as follows:

AGGAGGTAGTCCAA
  • I selected the transcription terminator from the tpi gene of Lactococcus lactis, a highly expressed native housekeeping gene, to ensure efficient and reliable transcription termination in the probiotic host. While two related annotations are present in GenBank for this region, both correspond to the same rho-independent transcription terminator. Therefore, I chose the complete annotated terminator region (positions 958–988), which includes both the inverted repeat and the downstream poly-T tract, to ensure proper formation of the termination hairpin and robust termination of transcription.


  • A transcription terminator was included downstream of the lactase coding sequence to ensure proper termination of transcription. This prevents transcriptional read-through into adjacent sequences and improves the stability and predictability of gene expression, independent of promoter regulation.

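  • ATG is used as the start codon. The stop codon was noted as AAG, but AAG encodes lysine; the construct needs one of the standard stop codons (TAA, TAG, or TGA) at the end of the coding sequence.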

  • From the selected elements, I built a linear expression cassette in Benchling containing a lactose-regulated promoter, native LAB ribosome binding site, codon-optimized lacZ, and a native transcription terminator. I exported this sequence as a FASTA file. Cassette_link_to_Benchling
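The assembly step above can be sketched in code as a quick sanity check before export. This is a minimal sketch, not the Benchling workflow itself: the function names are hypothetical, the promoter, CDS, and terminator strings are short placeholders, and only the RBS is the sequence quoted in this write-up.

```python
# Minimal sketch: assemble a linear cassette from ordered parts and
# sanity-check the coding sequence. Function names are hypothetical.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def assemble(parts):
    """Concatenate (name, sequence) parts in order and record 1-based spans."""
    sequence, spans, position = "", [], 1
    for name, seq in parts:
        seq = seq.upper().replace(" ", "").replace("\n", "")
        spans.append((name, position, position + len(seq) - 1))
        sequence += seq
        position += len(seq)
    return sequence, spans


def check_cds(cds):
    """Return a list of problems found in a coding sequence (empty if none)."""
    cds = cds.upper()
    problems = []
    if not cds.startswith("ATG"):
        problems.append("CDS does not start with ATG")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    if cds[-3:] not in STOP_CODONS:
        problems.append(f"last codon {cds[-3:]} is not a stop codon")
    # Every codon except the last one should be a sense codon.
    internal = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("premature stop codon inside the CDS")
    return problems


if __name__ == "__main__":
    parts = [
        ("promoter", "TTGACATATAAT"),   # placeholder, not the PlacA sequence
        ("RBS", "AGGAGGTAGTCCAA"),      # native RBS quoted in this write-up
        ("CDS", "ATGAAATAA"),           # placeholder for codon-optimized lacZ
        ("terminator", "AAAAGCCCGCTT"), # placeholder, not the tpi terminator
    ]
    cassette, spans = assemble(parts)
    print(spans)
    print(check_cds("ATGAAATAA"))
```

The span list mirrors the annotations made in Benchling, so a mismatch between the two is an early sign that a part was pasted in the wrong order.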


  • When I first uploaded my expression cassette FASTA file to Twist Bioscience, I encountered an initial error related to the FASTA header name. The header exceeded the maximum allowed length (32 characters), which caused the sequence to be rejected. I fixed this issue by shortening the header name and re-uploading the file. After this correction, the sequence was accepted for further analysis.
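A small helper can avoid the header error before upload. This is a minimal sketch: the 32-character limit is the one reported by the error above, while the assumption that it applies to the name after the ">" character, the function name, and the 70-character line width are mine.

```python
MAX_HEADER_LENGTH = 32  # limit reported by the upload error described above


def to_fasta(name, sequence, line_width=70):
    """Format a FASTA record, truncating the header name to the limit.

    Assumes the limit applies to the name after the '>' character.
    """
    safe_name = name.strip().replace(" ", "_")[:MAX_HEADER_LENGTH]
    lines = [sequence[i:i + line_width]
             for i in range(0, len(sequence), line_width)]
    return ">" + safe_name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    record = to_fasta("lactose inducible lacZ expression cassette for Twist",
                      "ACGT" * 40)
    print(record)
```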

However, after re-uploading the corrected file, additional synthesis warnings appeared. These warnings were related to large GC content variation, repetitive regions, and overall sequence complexity. These issues are mainly due to the codon-optimized lacZ gene and the presence of multiple regulatory elements such as the ribosome binding site and transcription terminator. Twist flagged these features as potential manufacturability risks. Unfortunately, I was not able to resolve these additional issues at this stage. Fixing them would have required re-optimizing the enzyme sequence, possibly changing the host organism for codon optimization, and redesigning the regulatory architecture of the cassette. Due to time constraints and because this assignment focuses on learning the design and ordering workflow rather than producing a synthesis-ready construct, I chose not to redesign the sequence further.
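The two kinds of warnings mentioned above (GC content variation and repeats) can be screened locally before uploading. This is a minimal sketch: the window size, step, and k-mer length are illustrative assumptions, not Twist's actual thresholds, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_windows(seq, window=50, step=10):
    """Return (start, gc_fraction) for each sliding window (0-based start)."""
    last_start = max(len(seq) - window, 0)
    return [(i, gc_fraction(seq[i:i + window]))
            for i in range(0, last_start + 1, step)]


def gc_spread(seq, window=50, step=10):
    """Difference between the most and least GC-rich windows."""
    values = [gc for _, gc in gc_windows(seq, window, step)]
    return max(values) - min(values)


def repeated_kmers(seq, k=12):
    """Return k-mers that occur more than once, with their start positions."""
    seen = {}
    for i in range(len(seq) - k + 1):
        seen.setdefault(seq[i:i + k], []).append(i)
    return {kmer: pos for kmer, pos in seen.items() if len(pos) > 1}


if __name__ == "__main__":
    demo = "AT" * 25 + "GC" * 25  # artificial sequence with extreme GC swing
    print(gc_spread(demo, window=50, step=50))
    print(repeated_kmers("AAACCCGGGTTTAAACCC", k=6))
```

Running this on the AT-rich promoter plus the codon-optimized lacZ would show where the GC swing is largest, which is the region a redesign would need to target.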


For this exercise, I proceeded by selecting a Twist clonal vector (pTwist Amp High Copy) to complete the plasmid design, although the insert sequence still contained manufacturability warnings. In a real DNA synthesis order, additional sequence optimization would be required to reduce GC content extremes and repetitive regions to meet synthesis constraints.


Part 5: DNA Read/Write/Edit

Assignees for the following sections MIT/Harvard students Required Committed Listeners Required

5.1 DNA Read (i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank). DNA-based digital data storage technology. Source: Archives in DNA: Workshop Exploring Implications of an Emerging Bio-Digital Technology through Design Fiction - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/DNA-based-digital-data-storage-technology_fig1_353128454 [accessed 11 Feb 2025]. (ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions:

  1. Is your method first-, second- or third-generation or other? How so?
  2. What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.
  3. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)
  4. What is the output of your chosen sequencing technology?

5.2 DNA Write (i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :) (ii) What technology or technologies would you use to perform this DNA synthesis and why? Also answer the following questions:

  1. What are the essential steps of your chosen synthesis methods?
  2. What are the limitations of your synthesis method (if any) in terms of speed, accuracy, scalability?

5.3 DNA Edit (i) What DNA would you want to edit and why? In class, George shared a variety of ways to edit the genes and genomes of humans and other organisms. Such DNA editing technologies have profound implications for human health, development, and even human longevity and human augmentation. DNA editing is also already commonly leveraged for flora and fauna, for example in nature conservation efforts, (animal/plant restoration, de-extinction), or in agriculture (e.g. plant breeding, nitrogen fixation). What kinds of edits might you want to make to DNA (e.g., human genomes and beyond) and why? (ii) What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:

  1. How does your technology of choice edit DNA? What are the essential steps?
  2. What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?
  3. What are the limitations of your editing methods (if any) in terms of efficiency or precision?

DNA read:

  • I would want to sequence DNA used for digital data storage. To my knowledge, this technology stores digital information such as text, images, or files by encoding it into DNA sequences instead of on hard drives. DNA is extremely stable and can store a huge amount of information in a very small space, which makes it interesting for long-term data storage. Reading this DNA by sequencing is necessary to retrieve the stored information and check that the data has not been damaged or changed over time.
  • For this purpose, I would use Illumina sequencing because it is very accurate and well suited for reading short DNA fragments, which is how DNA data storage is usually organized. This strategy follows four crucial steps: Image_adress
  1. Generation This method is a second-generation sequencing technology. It sequences millions of short DNA fragments in parallel, which makes it fast and reliable, but it cannot read very long DNA molecules in one piece.

  2. Input and preparation The input is DNA that contains the encoded digital data. To prepare it, the DNA is fragmented into short pieces, adapters are added to both ends of the fragments, the fragments are amplified using PCR, and the prepared DNA is loaded onto a flow cell.

  3. How the technology reads DNA (base calling) Each DNA fragment is copied one base at a time using fluorescently labeled nucleotides. A camera records the color added at each step, and the machine translates these signals into DNA letters (A, T, C, G).

  4. Output The output is a large number of short DNA sequence reads saved as digital files. These reads are then assembled and decoded to recover the original stored data.
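The final decoding step can be illustrated with a toy encoding. This is a minimal sketch, assuming a simple mapping of two bits per base (A=00, C=01, G=10, T=11); that mapping and the function names are assumptions for illustration, and practical DNA storage schemes additionally add error correction and avoid long runs of the same base.

```python
# Toy two-bits-per-base code (an illustrative assumption, not a standard).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_text(text):
    """Encode UTF-8 text as a DNA string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_dna(dna):
    """Decode a DNA string produced by encode_text back into text."""
    bits = "".join(BASE_TO_BITS[base] for base in dna.upper())
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


if __name__ == "__main__":
    dna = encode_text("HTGAA")
    print(dna)
    print(decode_dna(dna))
```

In this scheme a single sequencing error changes one byte of the recovered file, which is why checking the reads against each other matters before decoding.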

DNA write:

  • I am particularly interested in human genes related to pharmacogenomics and pharmacogenetics. These fields study how genetic variation affects how people respond to drugs. I would therefore want to synthesize genes encoding drug-metabolizing enzymes, such as the human cytochrome P450 enzymes. These genes are central to pharmacogenetics because variations in them strongly influence how drugs are processed in the body. Synthesizing these genes allows them to be studied, expressed, and tested in controlled systems.

  • To synthesize them, I would use chemical DNA synthesis combined with gene assembly, which is the standard approach used by commercial DNA synthesis companies.

  1. Essential steps
  • DNA synthesis starts with the digital design of the DNA sequence. This is followed by the chemical synthesis of short oligonucleotides, which are then assembled into full-length genes (for example, using Gibson Assembly). The synthesized genes are cloned into plasmids and finally sequence-verified to confirm their accuracy before use.

  2. Limitations
  • This DNA synthesis method is easy to use and works well for many projects. However, errors can be introduced during synthesis. Regions with high GC content or repeated sequences are harder to make. Very long DNA pieces also need to be built from many shorter fragments, which is more difficult and can introduce assembly errors.
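The idea of building a long gene from shorter overlapping pieces can be sketched as follows. This is a minimal sketch under stated assumptions: the fragment length and overlap are illustrative values rather than a vendor's specification, and the function names are hypothetical. The reassembly function only checks that neighboring fragments share the expected overlap, which is the property overlap-based methods such as Gibson Assembly rely on.

```python
def split_into_fragments(sequence, fragment_length=60, overlap=20):
    """Tile a sequence with fragments that share `overlap` bases with the next."""
    if not 0 < overlap < fragment_length:
        raise ValueError("overlap must be between 1 and fragment_length - 1")
    step = fragment_length - overlap
    fragments = []
    start = 0
    while True:
        fragments.append(sequence[start:start + fragment_length])
        if start + fragment_length >= len(sequence):
            break
        start += step
    return fragments


def reassemble(fragments, overlap=20):
    """Join fragments in order, verifying each shared overlap (in-silico check)."""
    assembled = fragments[0]
    for fragment in fragments[1:]:
        if assembled[-overlap:] != fragment[:overlap]:
            raise ValueError("fragments do not overlap as expected")
        assembled += fragment[overlap:]
    return assembled


if __name__ == "__main__":
    # Arbitrary non-periodic test sequence, not a real gene.
    gene = "".join("ACGT"[(i * i + i // 3) % 4] for i in range(205))
    pieces = split_into_fragments(gene)
    print(len(pieces), reassemble(pieces) == gene)
```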

DNA Edit:

  • I would want to edit DNA in human cell lines used for drug testing, focusing on genes that affect how drugs work. Changing these genes helps researchers see how different genetic variants influence drug effects and side effects, which is useful in pharmacogenomics.

  • I would use CRISPR for these modifications because it allows precise and programmable changes to DNA. This strategy works by using a guide RNA to find a specific DNA sequence. The Cas enzyme then makes a cut or nick, and the cell repairs it, introducing the desired change.

  • To use CRISPR, you need to design guide RNAs, prepare the CRISPR components (DNA, RNA, or protein), deliver them into cells, and then check which cells were correctly edited.

  • However, there are some limitations, like different editing efficiencies depending on cell type, and ethical or regulatory concerns when working with human cells.
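The guide design step can be sketched as a simple search for candidate target sites. This is a minimal sketch, assuming the commonly used SpCas9 enzyme, which requires an NGG PAM immediately after a 20-nucleotide target; the choice of SpCas9, the function names, and the example sequence are assumptions, and real guide design also scores off-target risk, which this sketch does not do.

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_guides(sequence, guide_length=20):
    """List candidate protospacers followed by an NGG PAM on either strand.

    `start` is the 0-based position on the strand being read, so for the
    "-" strand it refers to the reverse-complemented sequence.
    """
    # Lookahead so that overlapping candidate sites are all reported.
    pattern = rf"(?=([ACGT]{{{guide_length}}})([ACGT]GG))"
    guides = []
    for strand, seq in (("+", sequence.upper()),
                        ("-", reverse_complement(sequence))):
        for match in re.finditer(pattern, seq):
            guides.append({"strand": strand, "start": match.start(),
                           "guide": match.group(1), "pam": match.group(2)})
    return guides


if __name__ == "__main__":
    target = "TTTT" + "ACGT" * 5 + "TGG" + "TTTT"  # artificial example
    for guide in find_guides(target):
        print(guide)
```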

In this homework, ChatGPT assisted me in organizing and clearly articulating my answers and descriptions so that the content is well structured and easy to understand.


Sources:

Week 3 HW: Lab Automation

Assignment: Python Script for Opentrons Artwork

Assignees for this section MIT/Harvard students Required Committed Listeners Required Your task this week is to Create a Python file to run on an Opentrons liquid handling robot.

  1. Review this week’s recitation and this week’s lab for details on the Opentrons and programming it.
  2. Generate an artistic design using the GUI at opentrons-art.rcdonovan.com.
  3. Using the coordinates from the GUI, follow the instructions in the HTGAA26 Opentrons Colab to write your own Python script which draws your design using the Opentrons. You may use AI assistance for this coding — Google Gemini is integrated into Colab (see the stylized star bottom center); it will do a good job writing functional Python, while you probably need to take charge of the art concept.
  • If you’re a proficient programmer and you’d rather code something mathematical or algorithmic instead of using your GUI coordinates, you may do that instead.
  • Ask for help early!
  • If you are having any trouble with scripting, contact your TAs as soon as possible for help. Do not wait until your scheduled robot time slot or you may not be able to complete this assignment!
  4. If the Python component is proving too problematic even with AI and human assistance, download the full Python script from the GUI website and submit that: Use the download icon pointed to by the red arrow in this diagram.
  5. If you use AI to help complete this homework or lab, document how you used AI and which models made contributions.
  6. Sign up for a robot time slot if you are at MIT/Harvard/Wellesley or at a Node offering Opentrons automation. The Python script you created will be run on the robot to produce your work of art!
  • At MIT/Harvard? Lab times are on Thursday Feb.19 between 10AM and 6PM.
  • At other Nodes? Please coordinate with your Node.
  7. Submit your Python file via this form.

I created two different agar art designs using two Arabic calligraphy styles. For the first design, I used a simple calligraphy style and created it directly using Python scripting in a Google Colab notebook. For the second design, I used the Opentrons Automation Art interface to design the calligraphy and obtain the coordinates.


I used the Google Gemini AI tool in Colab to understand the logic of the example Opentrons scripts provided in the lab. It helped me understand how coordinates, loops, and pipetting commands work. I also used Gemini AI to help identify and correct mistakes in my Python script, such as indentation errors. I reviewed the suggestions and edited the final code myself.

Post-Lab Questions

Assignees for this section MIT/Harvard students Required Committed Listeners Required One of the great parts about having an automated robot is being able to precisely mix, deposit, and run reactions without much intervention, and design and deploy experiments remotely. For this week, we’d like for you to do the following:

  1. Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.
  2. Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details. While your description/project idea doesn’t need to be set in stone, we would like to see core details of what you would automate. This is due at the start of lecture and does not need to be tested on the Opentrons yet.
  • Example 1: You are creating a custom fabric, and want to deposit art onto specific parts that need to be intertwined in odd ways. You can design a 3D printed holder to attach this fabric to it, and be able to deposit bio art on top. Check out the Opentrons 3D Printing Directory.
  • Example 2: You are using the cloud laboratory to screen an array of biosensor constructs that you design, synthesize, and express using cell free protein synthesis.
  1. Echo transfer biosensor constructs and any required cofactors into specified wells.
  2. Bravo stamp in CFPS reagent master mix into all wells of a 96-well / 384-well plate.
  3. Multiflo dispense the CFPS lysate to all wells to start protein expression.
  4. PlateLoc seal the plate.
  5. Inheco incubate the plate at 37°C while the biosensor proteins are synthesized.
  6. XPeel remove the seal.
  7. PHERAstar measure fluorescence to compare biosensor responses.

  1. Featured Article: Automated Assembly of Programmable RNA-Based Sensors

The research aimed to solve the challenge of rapidly designing and building large libraries of RNA sensors that can “sense” specific viral RNA signatures. These sensors are crucial for diagnostic applications and understanding RNA-protein interactions. The authors focused on the biological validation of these sensors in both in vivo (bacteria) and cell-free systems.

They used the following lab automation:

  • Hardware: Hamilton Microlab STAR liquid-handling workstation.
  • Software: Custom Python scripts integrated with the liquid handler’s control software to manage complex plate layouts and reaction conditions.

The researchers used the automated system as a tool to facilitate:

  • High-Throughput Plasmid Assembly: The authors needed to construct 144 unique plasmids encoding different riboregulator designs. Doing this manually would be prone to pipetting errors and extremely time-consuming.
  • Library Preparation: Automation was used to prepare DNA libraries and reaction mixes for cell-free protein synthesis assays, ensuring consistent reagent volumes across hundreds of samples.
  • Normalization and Dilution: The Hamilton system handled the precise normalization of DNA concentrations across plates, which is critical for accurate comparative screening of sensor performance.

The study successfully identified several high-performing RNA sensors capable of detecting viral targets. The use of automation allowed the team to scale their construction phase by nearly 10-fold compared to manual workflows, enabling them to test a much wider range of biological designs than previously possible. To understand the content of this article and the type of lab automation the authors used, I used the AI tool “SCISPACE”.

  2. Final project Lab Automation:

My final project focuses on developing an in silico model of a lactose-responsive probiotic that produces lactase only when lactose is present. If this model were implemented physically, laboratory automation could be used to test its predictions experimentally. A liquid-handling robot such as the Opentrons could prepare a multi-well plate containing a gradient of lactose concentrations, inoculate each well with the engineered probiotic strain, and perform timed sampling to measure lactase activity or reporter output. This automated workflow would enable systematic and repeatable tests of the genetic circuit's response to lactose, so that experimental results can be compared with the computational model. The project is currently purely computational and does not require automation; automation is a possible future extension.
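The plate preparation step could be planned in plain Python before writing any robot commands. This is a minimal sketch, assuming a single lactose stock diluted with medium using C1·V1 = C2·V2; the stock concentration, target gradient, final volume, well naming, and function name are all illustrative assumptions, and the inoculum volume is not accounted for.

```python
def lactose_gradient_plan(stock_mM, targets_mM, final_volume_ul=200.0):
    """Compute per-well volumes of lactose stock and medium (C1*V1 = C2*V2).

    Wells are named along row A (A1, A2, ...), so at most 12 targets fit
    on one row of a 96-well plate.
    """
    plan = []
    for well_index, target in enumerate(targets_mM):
        if target > stock_mM:
            raise ValueError("target concentration exceeds the stock concentration")
        stock_ul = round(final_volume_ul * target / stock_mM, 2)
        plan.append({
            "well": f"A{well_index + 1}",
            "lactose_mM": target,
            "stock_ul": stock_ul,
            "medium_ul": round(final_volume_ul - stock_ul, 2),
        })
    return plan


if __name__ == "__main__":
    for row in lactose_gradient_plan(100.0, [0, 5, 10, 25, 50, 100]):
        print(row)
```

Each row of the plan maps directly onto one transfer of stock and one transfer of medium, which is the information a liquid-handling script would loop over.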

Final Project Ideas

Assignees for this section MIT/Harvard students Required Committed Listeners Required As explained in this week’s recitation, add 1-3 slides in your Node’s section of this slide deck with 3 ideas you have for an Individual Final Project. Be sure to put your name, city, and country on your slide!


1st Idea: In-Silico Model of an Engineered Probiotic Producing Lactase in Response to Lactose

  • Problem:

Many people cannot digest lactose because they lack enough lactase in their intestine. A possible solution is to use engineered probiotics that produce lactase only when lactose is present. Before building such probiotics in the laboratory, it is important to understand how the genetic system would behave. Without computational modeling, designing these systems requires trial-and-error experiments that are slow and expensive. There is a need for a simple computational model that can predict how a lactose-responsive genetic circuit would control lactase production over time. Image_ref
Lactose_intolerance_distribution_map Lactose_intolerance_distribution_map

  • Objectives:

The project is based on a lactose-responsive genetic cassette, whose dynamic behavior is modeled as a genetic circuit in silico. The objectives of this project are:

  1. –> To build an in-silico model of a lactose-responsive genetic circuit.
  2. –> To simulate how lactose stimulates lactase production.
  3. –> To study how changing key parameters affects lactose degradation.
  4. –> To explore system behavior completely in silico.
  • Project Description:

The project develops a purely computational model of an engineered probiotic strain. The model is based on a lactose-responsive genetic cassette, whose dynamic behavior is represented in silico as a genetic circuit:

  • a lacA promoter, operator, and native RBS from Lactococcus lactis for lactose sensing and regulation,
  • the lacZ gene from Escherichia coli K-12, encoding β-galactosidase (lactase).

I used an AI tool (ChatGPT) to guide me about the repression mechanism I should use, and its response was as follows: To ensure realistic behavior in the model, the lacA promoter includes a native operator, normally repressed by a LacR-like protein in Lactococcus lactis. In the simulation, a repression term is included to prevent unnecessary accumulation of lacZ (lactase) when lactose is absent.

The model simulates how the presence of lactose activates the promoter, leading to lactase production, and how this enzyme then degrades lactose over time. No DNA construction or wet-lab experiments are performed. All behavior is represented mathematically and simulated using a computer.

  • Steps to Achieve the Project:
  1. Define simplified biological assumptions (single strain, constant environment).
  2. Represent lactose as the input signal.
  3. Model promoter activation based on lactose concentration.
  4. Model lactase production and degradation over time.
  5. Model lactose degradation by lactase.
  6. Run simulations to observe system behavior.
  7. Change parameters to study different scenarios.
  • Limitations:
  1. The model does not include other gut microbes.
  2. The gut environment is assumed constant.
  3. Results are predictive, not experimentally validated.
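The modeling steps listed above can be sketched as two coupled rate equations integrated with a simple Euler scheme. This is a minimal sketch under stated assumptions: promoter activity follows a Hill function of lactose (standing in for relief of LacR-like repression), lactase is produced in proportion to that activity and degraded at a constant rate, and lactose is consumed with Michaelis-Menten kinetics. All parameter values and names are illustrative placeholders, not measured constants.

```python
def simulate(lactose0=10.0, enzyme0=0.0, hours=24.0, dt=0.01,
             alpha=1.0, delta=0.2, K=1.0, n=2.0, kcat=2.0, Km=2.0):
    """Euler integration of a lactose-responsive lactase circuit.

    Returns a list of (time_h, lactose, enzyme) tuples in arbitrary units.
    """
    steps = int(round(hours / dt))
    lactose, enzyme = lactose0, enzyme0
    trajectory = [(0.0, lactose, enzyme)]
    for step in range(1, steps + 1):
        # Promoter activity rises with lactose and is zero without it.
        activity = lactose ** n / (K ** n + lactose ** n)
        d_enzyme = alpha * activity - delta * enzyme
        # Lactase degrades lactose with Michaelis-Menten kinetics.
        d_lactose = -kcat * enzyme * lactose / (Km + lactose)
        enzyme = max(enzyme + d_enzyme * dt, 0.0)
        lactose = max(lactose + d_lactose * dt, 0.0)
        trajectory.append((step * dt, lactose, enzyme))
    return trajectory


if __name__ == "__main__":
    result = simulate()
    for time_h, lactose, enzyme in result[::400]:
        print(f"t={time_h:5.1f} h  lactose={lactose:6.3f}  lactase={enzyme:6.3f}")
```

Changing `alpha`, `delta`, or `K` and rerunning corresponds to the parameter study in step 7; a smaller time step or a dedicated ODE solver would be a reasonable refinement.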

2nd Idea: Engineering an E. coli Reporter Strain to Monitor Protein Aging During Heterologous Expression Using a Fluorescent Timer Protein

  • Problem:

Escherichia coli BL21(DE3) is one of the most widely used hosts for heterologous protein expression in research and biotechnology. Although protein expression levels can be easily measured, there are very limited tools to determine how long the expressed protein molecules have persisted inside the cell. During prolonged induction, proteins may accumulate, age, misfold, or lose functionality, even when expression appears successful. Most current methods detect protein quality only after purification, making optimization of expression conditions slow and inefficient. So, there is a need for a genetically encoded reporter system that can estimate protein aging in living cells during expression. Image_ref

  • Objectives:

This project is based on a fluorescent timer protein–based reporter system integrated into a heterologous protein expression strain. The objectives are:

  1. –> To engineer a reporter strain capable of estimating protein age in vivo.
  2. –> To use a fluorescent timer protein to distinguish newly synthesized and older proteins.
  3. –> To monitor protein aging during prolonged heterologous expression.
  4. –> To provide a practical tool for optimizing protein expression conditions.
  • Project Description:

The project focuses on the genetic engineering of a protein expression strain of E. coli BL21(DE3). The reporter system is based on a genetic fusion between:

  • a protein of interest (POI) expressed under the T7 promoter, and
  • a fluorescent timer protein whose emission spectrum changes over time after synthesis.

The genetic construct consists of:

  • a T7 promoter and ribosome binding site,
  • the gene encoding the protein of interest,
  • a flexible linker sequence,
  • the fluorescent timer protein gene,
  • a transcriptional terminator.

Image_ref

After induction, newly synthesized POI–timer fusion proteins initially emit one fluorescent signal. As time progresses, the timer protein matures and shifts to a second fluorescent signal. The ratio of the two fluorescence signals provides an estimate of the age distribution of the expressed protein population.

I used an AI tool (ChatGPT) to refine questions about the genetic elements required for T7-based heterologous expression in Escherichia coli BL21(DE3) and to determine the appropriate placement of a fluorescent timer gene for monitoring the age of the expressed protein.

  • Steps to Achieve the Project:
  1. Select a heterologous protein suitable for expression in E. coli.
  2. Design a genetic fusion between the protein of interest and a fluorescent timer protein.
  3. Clone the fusion construct under a T7 promoter into an expression plasmid.
  4. Transform the plasmid into E. coli BL21(DE3).
  5. Induce protein expression using IPTG.
  6. Monitor fluorescence signals over time using appropriate excitation/emission settings.
  7. Calculate fluorescence signal ratios to estimate protein aging.
  8. Compare protein aging under different induction times and expression conditions.
  • Limitations:
  1. Fusion of the timer protein may affect protein folding or function.
  2. Protein damage mechanisms are not directly measured.
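The ratio calculation in step 7 can be illustrated with a deliberately simple model. This is a minimal sketch, assuming the timer converts from an early form to a late form in a single first-order step with rate k, and that both forms are equally bright; under that assumption a protein of age t gives a late/early ratio of exp(k·t) − 1, so t = ln(1 + ratio) / k. Real timer proteins mature in several steps and a culture contains proteins of mixed ages, so the result is only an apparent age. The function names and the rate value are hypothetical.

```python
import math


def expected_ratio(age_hours, conversion_rate_per_hour):
    """Late/early signal ratio for a protein cohort of a given age."""
    return math.expm1(conversion_rate_per_hour * age_hours)


def apparent_age_hours(early_signal, late_signal, conversion_rate_per_hour):
    """Invert the one-step model to estimate an apparent protein age."""
    if early_signal <= 0 or late_signal < 0 or conversion_rate_per_hour <= 0:
        raise ValueError("signals and rate must be positive (late may be zero)")
    ratio = late_signal / early_signal
    return math.log1p(ratio) / conversion_rate_per_hour


if __name__ == "__main__":
    k = 0.5  # assumed conversion rate per hour, for illustration only
    for ratio in (0.1, 1.0, 5.0):
        print(ratio, round(apparent_age_hours(1.0, ratio, k), 2))
```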

3rd Idea: Engineering Houseplants for Atmospheric Carbon Monoxide (CO) Capture

  • Problem:

Carbon monoxide (CO) is a toxic gas produced by cars, heaters, and incomplete combustion. It is dangerous for humans, especially in indoor environments. Current solutions such as CO detectors can detect the gas but cannot remove it. Some bacteria naturally use CO as an energy source and convert it into carbon dioxide (CO₂). However, common houseplants cannot metabolize CO. If plants could be engineered to convert CO into CO₂, they could act as natural biological air filters. Image_ref Image_ref Image_ref

  • Objectives:

The objectives of this project are:

  1. –> To engineer a houseplant capable of converting carbon monoxide into carbon dioxide.
  2. –> To use microbial genes that naturally perform CO oxidation.
  3. –> To ensure the system works safely in oxygen-rich (indoor) environments.
  4. –> To allow the produced CO₂ to be reused by the plant’s normal photosynthesis.
  5. –> To design a genetically stable and safe indoor plant system.
  • Project Description:

This project engineers a plant to express a bacterial enzyme called carbon monoxide dehydrogenase (CODH). This enzyme converts carbon monoxide (CO) into carbon dioxide (CO₂). The CO₂ produced by this reaction is not wasted. Instead, it enters the plant’s natural photosynthetic pathway (Calvin cycle), where it can be fixed into sugars. The plant therefore detoxifies CO while continuing its normal metabolism. The system is designed to work only when CO is present, to avoid unnecessary energy use.

  • Genetic Elements for construct design:
  1. CO Oxidation Enzymes

The core of the system is the carbon monoxide dehydrogenase (CODH) enzyme, which converts carbon monoxide (CO) into carbon dioxide (CO₂). This enzyme is composed of three subunits encoded by the genes coxL, coxM, and coxS. The coxL gene encodes the large catalytic subunit, coxM encodes the medium FAD-binding subunit, and coxS encodes the small iron-sulfur subunit; coxM and coxS both take part in electron transfer. These genes originate from Oligotropha carboxidovorans, a bacterium that can oxidize CO in the presence of oxygen, which makes its CODH a candidate for expression in oxygen-rich plant cells.

  2. Promoter (Gene Expression Control)

To drive the expression of the CODH genes in plant cells, the CaMV 35S promoter is used. This promoter originates from the Cauliflower mosaic virus and is one of the most widely used promoters in plant biotechnology. It enables strong and constitutive gene expression across many plant tissues and is well characterized, making it a reliable choice for this project.

  3. Subcellular Targeting Signal

A chloroplast transit peptide is included to ensure that the CODH proteins are transported into the chloroplast after synthesis. This targeting signal is derived from the small subunit of the plant enzyme Rubisco, which naturally localizes to the chloroplast. By directing the CODH enzymes to the chloroplast, the CO₂ produced from CO oxidation is generated close to the photosynthetic machinery, allowing it to be efficiently reused by the plant during photosynthesis.

  4. Transcription Terminator

The NOS terminator is used to ensure proper termination of transcription and stable gene expression. This terminator originates from Agrobacterium tumefaciens and is commonly used in plant genetic constructs. Its function is to signal the end of transcription, improving mRNA stability and ensuring reliable expression of the introduced genes.

  • Steps to Achieve the Project:
  1. Select CO-oxidation genes from aerobic bacteria.
  2. Adapt bacterial gene sequences for plant expression (codon optimization).
  3. Design a plant expression construct containing:
  • Plant promoter
  • CODH genes
  • Chloroplast targeting signal
  • Transcription terminator
  4. Introduce the construct into the plant genome.
  5. Confirm expression of CODH proteins in plant cells.
  6. Evaluate CO removal and plant health in controlled conditions.
  7. Assess whether produced CO₂ supports normal photosynthesis.
  • Limitations:
  1. Plant genetic engineering is slow and complex.
  2. CO uptake by plants may be limited.
  3. CO metabolism efficiency may be low.
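
The construct layout described in this section can be written down as a simple ordered parts list. The following is a minimal Python sketch, not a validated design: the names `Part` and `build_cassette` are hypothetical, and giving each CODH subunit gene its own promoter-to-terminator cassette is an assumption about how the three genes might be arranged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str    # label of the genetic element
    role: str    # what the element does in the cassette
    source: str  # organism or gene the element comes from


def build_cassette(gene: str) -> list[Part]:
    """Return the 5'-to-3' order of parts for one CODH subunit gene (assumed layout)."""
    return [
        Part("CaMV 35S", "promoter", "Cauliflower mosaic virus"),
        Part("RbcS transit peptide", "chloroplast targeting signal", "Rubisco small subunit"),
        Part(gene, "coding sequence (codon-optimized)", "Oligotropha carboxidovorans"),
        Part("NOS", "terminator", "Agrobacterium tumefaciens"),
    ]


# One cassette per subunit gene named in the text.
cassettes = {gene: build_cassette(gene) for gene in ("coxL", "coxM", "coxS")}

for gene, parts in cassettes.items():
    print(gene, "->", " | ".join(p.name for p in parts))
```

Listing the parts this way makes it easy to check that every cassette starts with a promoter and ends with a terminator before any sequence-level design is attempted.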

In this homework, ChatGPT also assisted me in organizing and clearly articulating my answers and descriptions so that the content is well structured and easy to understand.


Sources:

Week 4 HW: Protein Design Part I

Part A. Conceptual Questions

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

Answer any NINE of the following questions from Shuguang Zhang (i.e., you can select two to skip):

  1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
  2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?
  3. Why are there only 20 natural amino acids?
  4. Can you make other non-natural amino acids? Design some new amino acids.
  5. Where did amino acids come from before enzymes that make them, and before life started?
  6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
  7. Can you discover additional helices in proteins?
  8. Why are most molecular helices right-handed?
  9. Why do β-sheets tend to aggregate?
  • What is the driving force for β-sheet aggregation?
  10. Why do many amyloid diseases form β-sheets?
  • Can you use amyloid β-sheets as materials?
  11. Design a β-sheet motif that forms a well-ordered structure.

  1. Amino Acid Count in 500g Meat: Meat is roughly 20% protein by mass. (Human Nutrition - Protein, Vitamins, Minerals | Britannica, n.d.)
    • 500g meat x 0.20 = 100g protein.
    • Using an average mass of 100 Daltons (Da) per amino acid, i.e. 100 g/mol: 100 g ÷ 100 g/mol = 1 mol of amino acids.
    • 1 mol × 6.022 × 10²³ molecules/mol ≈ 6.022 × 10²³ amino acid molecules.
  2. Why we don’t become cows: When we eat protein, our digestive system breaks it down into individual amino acids. Our body then uses the information in its own DNA to reassemble those amino acids into human proteins. The information encoded in the amino acid sequence of the cow protein is destroyed, but the building blocks (the amino acids) are reused.
  3. Why only 20 amino acids: In nature, the use of 20 amino acids is often explained as a “frozen accident” that originated in the early RNA World. This set worked well very early in Earth’s history and then became fixed. These 20 amino acids were good enough to build strong and functional proteins. Even though many other amino acids exist, this small group provides enough variety to perform many functions while remaining simple, stable, and efficient for cells to use. (Doig, 2017)
  4. Non-natural amino acids: Yes, scientists can make non-natural (unnatural) amino acids using chemical synthesis and special genetic tools that allow new amino acids to be incorporated into proteins. These amino acids can give proteins properties that the natural ones do not provide (Young & Schultz, 2010). For example, a new amino acid could be made by taking a natural amino acid such as alanine and adding a fluorine atom to its side chain. Such a fluorinated amino acid could make proteins more stable and less prone to degradation, which is useful for drug design (Adhikari et al., n.d.).
  5. Pre-life origins of amino acids: According to Gutiérrez-Preciado, Romero, and Peimbert (2010), before enzymes and living organisms existed, amino acids probably formed abiotically on early Earth. Energy from lightning, UV light, and volcanic heat helped simple gases react to make amino acids. Some amino acids were also brought to Earth by meteorites and comets. Together, these processes created a “primordial soup” of basic organic molecules. (Amino Acids, Evolution | Learn Science at Scitable, n.d.)
  6. D-amino acid α-helix: In nature, L-amino acids form right-handed helices. If you used only D-amino acids, the stereochemistry would be mirrored, resulting in a left-handed α-helix. (Zotti et al., n.d.)
  7. Additional helices: Yes, helical structures other than the standard α-helix can be found in proteins, for example the 3₁₀ helix and the π-helix. Studies show that these helices occur in many proteins, but they are often overlooked or mistaken for small distortions in α-helices. They are especially common in membrane proteins and are found in a significant number of known protein structures. (Vieira-Pires & Morais-Cabral, 2010)
  8. Why right-handed helices: Most molecular helices are right-handed because this is the most stable arrangement for the natural building blocks of life. With L-amino acids and D-sugars, a right-handed twist allows strong hydrogen bonds and reduces crowding between atoms. Left-handed helices are usually less stable or hard to form. (Right-Handed Alpha-Helix - an Overview | ScienceDirect Topics, n.d.)
  9. β-sheet aggregation: β-sheets tend to aggregate because their edges have exposed hydrogen-bonding groups that easily stick to other β-strands. The main driving forces are hydrogen bonding between strands and the hydrophobic effect, which together make the stacked β-sheet structure very stable and allow fibrils to form. (Gsponer & Vendruscolo, 2006)
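
The arithmetic behind answer 1 can be checked with a few lines of Python. This is only a sketch of the estimate: the 20% protein fraction and the 100 Da average residue mass are the assumptions stated in the answer and the question, not measured values.

```python
AVOGADRO = 6.022e23              # molecules per mole

meat_mass_g = 500.0
protein_fraction = 0.20          # assumption: meat is roughly 20% protein by mass
residue_mass_g_per_mol = 100.0   # ~100 Da per amino acid, as given in the question

protein_mass_g = meat_mass_g * protein_fraction
moles = protein_mass_g / residue_mass_g_per_mol
molecules = moles * AVOGADRO

print(f"{protein_mass_g:.0f} g protein = {moles:.1f} mol = {molecules:.3e} amino acids")
```

Changing `protein_fraction` or `residue_mass_g_per_mol` shows how sensitive the estimate is to those two assumptions.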
Part B: Protein Analysis and Visualization

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

  1. Briefly describe the protein you selected and why you selected it.
  2. Identify the amino acid sequence of your protein.
    • How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
    • How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
    • Does your protein belong to any protein family?
  3. Identify the structure page of your protein in RCSB
    • When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)
    • Are there any other molecules in the solved structure apart from protein?
    • Does your protein belong to any structure classification family?
  4. Open the structure of your protein in any 3D molecule visualization software:
    • PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
    • Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
    • Color the protein by secondary structure. Does it have more helices or sheets?
    • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
    • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

  1. I chose the protein mCherry because it is a small red fluorescent protein that is easy to visualize and analyze using 3D protein visualization software. Its structure is well known and has a clear β-barrel shape, which makes it easy to study secondary structure, amino acid distribution, and surface features. This makes mCherry a good example protein for learning basic protein sequence and structure analysis.
  • The mCherry protein analyzed here is the standard red fluorescent protein and does not function as a fluorescent timer. However, according to the fluorescent protein database (FPbase), mCherry is the parent fluorescent protein for several timer-based reporters, including the medium fluorescent timer planned for my final project. Therefore, mCherry is used in this assignment as a reference protein to understand the structure and sequence properties of fluorescent proteins before working with fluorescent timer variants.
  2. I obtained the amino acid sequence of mCherry through FPbase, which links laboratory fluorescent protein names to biological databases. FPbase provided the UniProt identifier X5DSL3, which is now stored in UniParc (UPI000046F63B) because the UniProtKB entry was removed. FPbase also provided the GenBank identifier for this protein, AAV52164, from which I obtained the sequence in FASTA format.
  • This is the sequence obtained:
>AAV52164.1 monomeric red fluorescent protein [synthetic construct]
MVSKGEEDNMAIIKEFMRFKVHMEGSVNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGPLPFAWDILSPQF
MYGSKAYVKHPADIPDYLKLSFPEGFKWERVMNFEDGGVVTVTQDSSLQDGEFIYKVKLRGTNFPSDGPV
MQKKTMGWEASSERMYPEDGALKGEIKQRLKLKDGGHYDAEVKTTYKAKKPVQLPGAYNVNIKLDITSHN
EDYTIVEQYERAEGRHSTGGMDELYK
  • The protein sequence is 236 amino acids long and has a molecular mass of approximately 26.7 kDa. It has been confirmed at the protein level, although the UniProt entry is currently unreviewed (TrEMBL). Using the provided Colab notebook, I analyzed the amino acid composition of the sequence and found that glycine (G) is the most frequent amino acid, appearing 25 times.
Note

While analyzing the amino acid sequence of mCherry, I noticed a small difference between the sequence length reported by UniProt (236 amino acids) and the sequence obtained from the Colab notebook (241 amino acids). This discrepancy is likely due to the Colab sequence including extra residues from expression constructs, such as start codons, tags, or linkers, which are not part of the canonical protein. UniProt provides the biologically relevant, canonical sequence, which is what I used for further analysis and visualization in this homework.
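
The length and composition of the GenBank sequence shown above can also be checked directly, independently of the Colab notebook. The snippet below is a minimal sketch using only the Python standard library; the variable names are illustrative. Counting the AAV52164.1 sequence this way gives 236 residues with glycine as the most frequent residue (25 occurrences).

```python
from collections import Counter

# AAV52164.1 sequence exactly as listed in the FASTA record above.
MCHERRY = (
    "MVSKGEEDNMAIIKEFMRFKVHMEGSVNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGPLPFAWDILSPQF"
    "MYGSKAYVKHPADIPDYLKLSFPEGFKWERVMNFEDGGVVTVTQDSSLQDGEFIYKVKLRGTNFPSDGPV"
    "MQKKTMGWEASSERMYPEDGALKGEIKQRLKLKDGGHYDAEVKTTYKAKKPVQLPGAYNVNIKLDITSHN"
    "EDYTIVEQYERAEGRHSTGGMDELYK"
)

counts = Counter(MCHERRY)   # residue -> number of occurrences
length = len(MCHERRY)

# Print the five most frequent residues with their share of the sequence.
for residue, n in counts.most_common(5):
    print(f"{residue}: {n} ({100 * n / length:.1f}%)")
print("length:", length)
```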

  • To identify protein sequence homologs of mCherry, I used the BLAST tool available on UniProt.

  • Using the BLAST tool in UniProt, a total of 227 homologous protein sequences were identified for mCherry in the UniProtKB database. Among these results, 13 sequences are reviewed (Swiss-Prot) and 214 are unreviewed (TrEMBL). The homologs show a wide range of sequence identities, from about 23.6% up to 100%, with very low E-values (as low as 4.4 × 10⁻¹⁷⁵), indicating strong evolutionary relatedness.

  • Most homologous proteins have sequence lengths between 200 and 400 amino acids, which is similar to mCherry (236 amino acids). Many homologs originate from marine organisms, especially corals and sea anemones such as Porites lobata, Pocillopora meandrina, and Discosoma species, which are known natural sources of GFP-like fluorescent proteins. Some homologs also appear in bacteria and other organisms, reflecting that mCherry is an engineered protein that has been widely introduced into different hosts for research purposes. Overall, these results confirm that mCherry belongs to a well-conserved GFP-like fluorescent protein family with broad biological and biotechnological use.

  • The mCherry protein belongs to a known protein family. According to UniProt family and domain analysis, mCherry is part of the green fluorescent protein (GFP)-like family, even though it emits red light. This classification is supported by several databases, including InterPro, Pfam, Gene3D, and PRINTS, all of which identify mCherry as a GFP or GFP-related protein. Proteins in this family share a conserved structure and chromophore-forming mechanism.

  3. The structure of the selected protein mCherry is available in the RCSB Protein Data Bank under the PDB ID 2H5Q, titled “Crystal structure of mCherry.” This structure represents the red fluorescent protein mCherry derived from Discosoma species and expressed in Escherichia coli. The structure was solved using X-ray diffraction, deposited in May 2006, and released in August 2006.
  • The quality of this structure is very high. It was solved at a resolution of 1.36 Å, considerably better than the 2.70 Å example value quoted in the assignment. Smaller resolution values indicate more detailed and accurate atomic positions, so a resolution of 1.36 Å means the structure is very reliable. In addition, the reported R-values (R-work ≈ 0.15 and R-free ≈ 0.19) further support that this is a well-refined and high-quality crystal structure.

  • Besides the protein itself, the solved structure contains a modified residue that corresponds to the mature chromophore of mCherry. This chromophore is formed from amino acids within the protein chain and is responsible for fluorescence. No additional ligands, cofactors, or external small molecules are present. The biological assembly is a single monomer, which means that the protein functions as one chain and does not require binding to other protein subunits.

  • According to SCOP (Structural Classification of Proteins), mCherry is classified within the fluorescent protein family and the GFP-like superfamily. SCOP groups proteins based on their three-dimensional structure rather than their biological function or expression host. In this classification, mCherry contains a single domain (residues 6–224) that forms the characteristic β-barrel fold shared by GFP-like proteins. This confirms that mCherry belongs to the same structural superfamily as other green and red fluorescent proteins that use a similar fold to support fluorescence.

Note

The difference in the listed organism for mCherry between databases is not an error but is due to how engineered proteins are described. The Fluorescent Protein Database (FPbase) lists mCherry as originating from Discosoma species because mCherry was originally engineered from DsRed, a natural red fluorescent protein found in coral. FPbase focuses on the biological and evolutionary origin of fluorescent proteins. In contrast, UniProt lists mCherry under organisms such as Anaplasma marginale because the mCherry gene has been artificially inserted into this organism for experimental use. UniProt records the organism in which a protein sequence is present or expressed, even if the protein is not naturally produced by that organism. Therefore, both databases are correct and provide different but complementary information about the same engineered fluorescent protein.

  4. The protein was visualized using cartoon, ribbon, and ball-and-stick representations to examine overall fold and atomic details.
  • Coloring by secondary structure shows that mCherry consists mostly of β-strands (about 11) and contains very few α-helices (only 3). The protein is dominated by a β-barrel fold, which is typical for GFP-like fluorescent proteins.

  • Using the PyMOL command line, I colored the hydrophobic residues yellow and the hydrophilic residues red. The resulting structure shows a clear alternating pattern along the β-strands, where hydrophilic side chains face the exterior to interact with the aqueous environment (supported by the presence of surrounding water molecules), while hydrophobic side chains face the interior. This internal hydrophobic core effectively shields the chromophore from the solvent, which is essential for its fluorescence.

  • Based on the surface visualization of the mCherry protein (PDB: 2H5Q), the protein does not show any clear holes or binding pockets. The surface is compact and smooth, forming a closed β-barrel structure that surrounds the chromophore inside the protein. Although small bumps and grooves are visible on the surface due to amino acid side chains, there are no deep openings that lead into the protein core. This sealed structure is important for mCherry’s function, because it protects the internal chromophore from water or oxygen that could interfere with fluorescence. The closed surface therefore supports the role of mCherry as a stable fluorescent protein.
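
The hydrophobic/hydrophilic coloring described above can be reproduced with a short list of PyMOL commands. The sketch below only builds the command text in Python so it can be pasted into the PyMOL command line. The function name and the residue classification are assumptions (in particular, counting glycine and proline as hydrophobic is a choice, not a standard), and these are not necessarily the exact commands used for the figures in this homework.

```python
# Assumed set of hydrophobic residues (three-letter codes as used by PyMOL's `resn`).
HYDROPHOBIC = ["ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "GLY"]


def pymol_hydrophobicity_script(pdb_id: str) -> list[str]:
    """Return PyMOL commands that color hydrophobic residues yellow and the rest red."""
    hydrophobic_sel = "resn " + "+".join(HYDROPHOBIC)
    return [
        f"fetch {pdb_id}",
        "hide everything",
        "show cartoon",
        f"select hydrophobic, polymer and ({hydrophobic_sel})",
        "select hydrophilic, polymer and not hydrophobic",
        "color yellow, hydrophobic",
        "color red, hydrophilic",
    ]


for line in pymol_hydrophobicity_script("2H5Q"):
    print(line)
```

Keeping the residue list in one place makes it easy to test a different hydrophobicity classification and see whether the inside/outside pattern of the β-barrel still holds.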

Part C. Using ML-Based Protein Design Tools

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein.

  1. Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.
  2. Choose your favorite protein from the PDB.
  3. We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

  1. Deep Mutational Scans
  • a. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
  • b. Can you explain any particular pattern? (choose a residue and a mutation that stands out)
  • c. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
  2. Latent Space Analysis
  • a. Use the provided sequence dataset to embed proteins in reduced dimensionality.
  • b. Analyze the different formed neighborhoods: do they approximate similar proteins?
  • c. Place your protein in the resulting map and explain its position and similarity to its neighbors.

C2. Protein Folding

  1. Folding a protein
  2. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
  3. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

C3. Protein Generation

  • Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN
  1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.
  2. Input this sequence into ESMFold and compare the predicted structure to your original.

C1 Protein Language Modeling

  1. Deep Mutational Scans

To analyze how different mutations affect my protein, I used the ESM-2 protein language model to generate a deep mutational scan. The output is shown as a heatmap, where each color represents how favorable or unfavorable a specific mutation is. The score (z-value) reflects how plausible the model considers the mutation, which serves as a proxy for how well it is tolerated: positive values mean the mutation is predicted to be well tolerated, while negative values suggest the mutation may damage the protein.

To understand the heatmap colors, here are some examples:

  • The darkest color (black) represents the most harmful mutations. For example, the mutation at position 92 to Cysteine (C) has a very low score (z = −5.01). This position is buried deep inside the protein. Changing it to cysteine is predicted to strongly disrupt the protein, likely causing misfolding or aggregation.
  • The dark blue color represents very risky mutations. An example is position 180 mutated to Proline (P) with a score of z = −3.08. This residue lies in a β-strand. Proline is known to break regular protein structures, so inserting it here would likely distort or break the β-barrel.
  • The green color indicates neutral mutations. For example, position 183 mutated to Threonine (T) has a score of z = 0, meaning the model predicts little to no effect on protein stability.
  • The yellow color represents favorable mutations. At position 45 mutated to Valine (V), the score is z = 3.04, suggesting this mutation may slightly improve protein stability compared to the original amino acid.

When looking at the entire heatmap, many positions appear as vertical dark bands. These positions do not tolerate most mutations and are therefore highly conserved. These residues usually form the hydrophobic core of the protein and point inward to build the β-barrel structure. Because mCherry has a tightly sealed β-barrel, mutations in these regions can disrupt proper folding or destabilize the barrel. If the β-barrel is damaged or becomes leaky, the chromophore inside can no longer be protected, which would stop the protein from fluorescing. So, this explains why mutations in these regions are strongly disfavored by the model.
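
To make the meaning of these scores more concrete, the toy example below shows one common way such values can be obtained: a log-likelihood ratio between each substitution and the wild-type residue, followed by standardization to z-scores. This is a sketch under stated assumptions. The probabilities are made up, only four residues are listed, and the notebook's exact scoring and normalization are not reproduced here.

```python
import math
from statistics import mean, pstdev

# Hypothetical model output for ONE position whose wild-type residue is "G":
# the probability assigned to each residue when that position is masked.
position_probs = {"G": 0.70, "A": 0.15, "S": 0.10, "P": 0.05}
wild_type = "G"

# Log-likelihood ratio of each substitution versus the wild type
# (negative = the model finds the mutant less plausible than the wild type).
llr = {
    aa: math.log(p / position_probs[wild_type])
    for aa, p in position_probs.items()
    if aa != wild_type
}

# Standardize the scores; in a real scan the mean and spread would be
# computed over all positions and substitutions, not a single position.
mu, sigma = mean(llr.values()), pstdev(llr.values())
z = {aa: (score - mu) / sigma for aa, score in llr.items()}

for aa in sorted(z, key=z.get):
    print(f"{wild_type}->{aa}: LLR={llr[aa]:.2f}, z={z[aa]:.2f}")
```

In this toy case the substitution to proline receives the lowest score, which mirrors the pattern described above for proline in a β-strand.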

My protein of interest is the Medium-FT variant, which is related to my final project and works as a protein “aging timer.” This behavior is controlled by specific mutations that change the chromophore chemistry without breaking the overall protein structure. To explore the functional mutations in the parent protein mCherry (PDB: 2H5Q, the one I used to generate the heatmap), I focused on two important mutations: K69R and A224S (F. V. Subach et al., 2009; O. M. Subach et al., 2022). In the heatmap, they show positive scores (z = 0.75 and z = 1.08, respectively).

Both mutations appear as light green to yellow on the heatmap, meaning they are predicted to be well tolerated. This suggests that these changes do not disrupt the β-barrel or overall stability. Instead, they adjust the protein’s function by slowing down fluorescence maturation while keeping the main structure intact.

  2. Latent Space Analysis

To perform latent space analysis, I used the provided dataset of protein sequences from the SCOP database and generated numerical embeddings for each sequence using the ESM-2 protein language model. Reducing these embeddings to three dimensions gives a map in which each point represents one protein.

When analyzing the resulting map, proteins do not appear randomly distributed. Instead, they form local neighborhoods where nearby points correspond to proteins with similar structural properties. These neighborhoods approximate similarities in protein fold and secondary structure rather than biological function. This shows that the language model organizes proteins based on shared “structural rules,” such as how alpha helices and beta sheets are arranged, even when the proteins come from different organisms or have different functions.

For example, the protein d2cw3a1 a.2.11.0 (A:4–90) from Perkinsus marinus has three closest neighbors that come from very different organisms, including Escherichia coli and cow. These neighboring proteins also have very different biological functions.

My protein of interest, mCherry (PDB: 2H5Q), which is represented by the blue dot, is located in a neighborhood dominated by proteins rich in β-sheet structures. Its closest neighbors include proteins such as the β-propeller domain of the enzyme PepX, the β-barrel domain of the chaperone protein Sis1, and other β-sheet–containing domains like transferrin-binding protein and latexin. Although these proteins perform very different biological roles, they share similar β-sheet-based structural architectures. The close proximity of mCherry to these proteins suggests that the ESM-2 model groups proteins based on structural similarity, placing mCherry among other β-sheet and β-barrel-like proteins in the latent space.
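
The closest neighbors of a protein in such a map can be identified by computing distances between points. The sketch below illustrates the idea with Euclidean distance on made-up three-dimensional coordinates. The labels, the numbers, and the function name `nearest_neighbors` are hypothetical and are not taken from the notebook.

```python
import math

# Hypothetical 3-D coordinates after dimensionality reduction (illustration only).
embedding = {
    "mCherry (2H5Q)": (1.0, 2.0, 0.5),
    "beta-barrel domain A": (1.1, 2.1, 0.4),
    "beta-propeller domain B": (0.8, 1.7, 0.6),
    "all-alpha domain C": (-3.0, 0.2, 2.5),
}


def nearest_neighbors(query: str, points: dict, k: int = 3) -> list[tuple[str, float]]:
    """Return the k points closest to `query`, sorted by Euclidean distance."""
    q = points[query]
    distances = [
        (name, math.dist(q, xyz)) for name, xyz in points.items() if name != query
    ]
    return sorted(distances, key=lambda item: item[1])[:k]


for name, d in nearest_neighbors("mCherry (2H5Q)", embedding, k=2):
    print(f"{name}: {d:.3f}")
```

Distances measured in the reduced map depend on the dimensionality reduction method, so computing them in the original embedding space may give a somewhat different neighbor list.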

C2. Protein Folding

The predicted coordinates matched the original structure very well. The overall shape, especially the β-barrel structure, was preserved, and the folding pattern looked almost identical. This shows that ESMFold can accurately predict the structure of mCherry from its amino acid sequence.

Next, I changed the protein sequence by introducing several mutations, including small amino acid changes and changes spread across the sequence. After folding the mutated sequence with ESMFold, the structure showed noticeable changes compared to the original protein. While the general β-barrel shape was still present, some regions were slightly distorted. This indicates that mCherry is partly resilient to mutations, but too many or poorly placed mutations can affect proper folding and reduce structural stability.

C3. Protein Generation

I used ProteinMPNN to do inverse folding on the mCherry protein (PDB: 2H5Q). I used the default settings and turned off the homomer option because this protein has only one chain. ProteinMPNN uses the 3D shape of the protein and suggests new amino acid sequences that can keep the same shape.

The output includes a probability heatmap, which shows the model’s confidence for each amino acid at every position in the sequence. In the heatmap, bright colors (yellow/green) indicate amino acids that are highly preferred at a specific position, while dark colors (blue/purple) indicate unlikely choices. Some positions show a strong preference for one amino acid, meaning they are important for maintaining the protein structure. Other positions show more flexibility, suggesting they can tolerate different amino acids without disrupting the fold.

ProteinMPNN generated a new sequence candidate with a sequence recovery of about 47.93%, meaning nearly half of the amino acids are identical to the original mCherry sequence. The designed sequence received a lower score (0.8107) than the native sequence (1.3913). Because lower scores indicate a better statistical fit to the backbone, this suggests that the model considers the designed sequence highly compatible with the 11-stranded β-barrel structure of mCherry. The score by itself does not measure stability.

  • The native protein sequence and its score are shown below:
>2H5Q, score=1.3913, fixed_chains=[], designed_chains=['A'], model_name=v_48_020
NMAIIKEFMRFKVHMEGSVNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGPLPFAWDILSPQFXXXSKAYVKHPADIPDYLKLSFPEGFKWERVMNFEDGGVVTVTQDSSLQDGEFIYKVKLRGTNFPSDGPVMQKKTMGWEASSERMYPEDGALKGEIKQRLKLKDGGHYDAEVKTTYKAKKPVQLPGAYNVNIKLDITSHNEDYTIVEQYERAEGRHST
  • The newly generated protein sequence and its evaluation metrics are shown below:
>T=0.1, sample=0, score=0.8107, seq_recovery=0.4793
VDDIVKPVQKYTVNLDGSVNGHKFKIKGEGIGTPYEGKYEVDLEVTEGGPLPFSFDILAPLFXXXAQQFTKYPADIPDYVKQAFPEGYTEERTADYEDGGKLKSTKTVTLKDGVVVQEIEADGSNFPADGPVMTKKTAGWEPVVWHCYPKDGALYCEADAALKLKDGGTYKAKVTAKIKPNHPVPLPGPFEIDEKLTVTDHNADETKVKLSKEAVARRAS
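
The reported sequence recovery can be checked by comparing the two sequences position by position. The sketch below counts identical residues and ignores the three masked `X` positions (the chromophore). Excluding masked positions is an assumption about how the reported value is calculated, and the function name is illustrative.

```python
# Native and designed sequences copied from the ProteinMPNN output above.
NATIVE = (
    "NMAIIKEFMRFKVHMEGSVNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGP"
    "LPFAWDILSPQFXXXSKAYVKHPADIPDYLKLSFPEGFKWERVMNFEDGG"
    "VVTVTQDSSLQDGEFIYKVKLRGTNFPSDGPVMQKKTMGWEASSERMYPE"
    "DGALKGEIKQRLKLKDGGHYDAEVKTTYKAKKPVQLPGAYNVNIKLDITS"
    "HNEDYTIVEQYERAEGRHST"
)
DESIGNED = (
    "VDDIVKPVQKYTVNLDGSVNGHKFKIKGEGIGTPYEGKYEVDLEVTEGGP"
    "LPFSFDILAPLFXXXAQQFTKYPADIPDYVKQAFPEGYTEERTADYEDGG"
    "KLKSTKTVTLKDGVVVQEIEADGSNFPADGPVMTKKTAGWEPVVWHCYPK"
    "DGALYCEADAALKLKDGGTYKAKVTAKIKPNHPVPLPGPFEIDEKLTVTD"
    "HNADETKVKLSKEAVARRAS"
)


def sequence_recovery(native: str, designed: str, skip: str = "X") -> float:
    """Fraction of identical residues, ignoring positions masked with `skip`."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    pairs = [(a, b) for a, b in zip(native, designed) if a not in skip and b not in skip]
    return sum(a == b for a, b in pairs) / len(pairs)


print(f"recovery = {sequence_recovery(NATIVE, DESIGNED):.4f}")
```

With the masked positions excluded, this comparison agrees with the `seq_recovery` value in the output header.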

I attempted to refold the newly designed sequence using ESMFold in order to compare the predicted structure with the original mCherry structure. However, ESMFold requires GPU resources, and GPU access was not available at the time of execution. As a result, a direct structural comparison could not be performed. Despite this limitation, the favorable sequence score and the conserved structural regions suggest that the designed sequence could fold into a structure similar to the original β-barrel, although this remains to be verified by actually folding it.

Gemini AI tools integrated with Google Colab were used to help explain code errors, interpret the generated outputs such as heatmaps, and analyze the latent space by identifying the closest neighboring proteins through distance calculations between my protein and other sequences.

Part D. Group Brainstorm on Bacteriophage Engineering

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

  1. Find a group of ~3–4 students
  2. Read through the Phage Reading material listed under “Reading & Resources” below
  3. Review the Bacteriophage Final Project Goals for engineering the L Protein:
  • Increased stability (easiest)
  • Higher titers (medium)
  • Higher toxicity of lysis protein (hard)
  4. Brainstorm Session
  • Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).
  • Write a 1-page proposal (bullet points or short paragraphs) describing:
    • Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).
    • Why do you think those tools might help solve your chosen sub-problem?
    • Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).
    • Include a schematic of your pipeline.
  • This resource may be useful: HTGAA Protein Engineering Tools
  5. Each individually put your plan on your HTGAA website
  • Include your group’s short plan for engineering a bacteriophage

One-Page Proposal

In this homework, ChatGPT assisted me in organizing and clearly articulating my answers and descriptions so that the content is well structured and easy to understand.


Sources:

Week 5 HW: Protein Design Part II

Part A: SOD1 Binder Peptide Design (From Pranam)

Assignees for this section: MIT/Harvard students (Required), Committed Listeners (Required)

Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc. Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.

Your challenge:

  1. Design short peptides that bind mutant SOD1.
  2. Then decide which ones are worth advancing toward therapy.

You will use three models developed in our lab:

  • PepMLM: target sequence-conditioned peptide generation via masked language modeling
  • PeptiVerse: therapeutic property prediction
  • moPPIt: motif-specific multi-objective peptide design using Multi-Objective Guided Discrete Flow Matching (MOG-DFM)

Part 1: Generate Binders with PepMLM

  1. Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.
  2. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:
  3. Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.
  4. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.
  5. Record the perplexity scores that indicate PepMLM’s confidence in the binders.

Part 2: Evaluate Binders with AlphaFold3

  1. Navigate to the AlphaFold Server: alphafoldserver.com
  2. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.
  3. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?
  4. In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:

  1. Paste the peptide sequence.
  2. Paste the A4V mutant SOD1 sequence in the target field.
  3. Check the boxes
    1. Predicted binding affinity
    2. Solubility
    3. Hemolysis probability
    4. Net charge (pH 7)
    5. Molecular weight

Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties?

Choose one peptide you would advance and justify your decision briefly.

Part 4: Generate Optimized Peptides with moPPIt

Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

  1. Open the moPPIt Colab linked from the HuggingFace moPPIt model card.
  2. Make a copy and switch to a GPU runtime.
  3. In the notebook:
    1. Paste your A4V mutant SOD1 sequence.
    2. Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
    3. Set peptide length to 12 amino acids.
    4. Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.
  4. After generation, briefly describe how these moPPIt peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

Part 1: Generate Binders with PepMLM

For the PepMLM analysis, the amino acid sequence of wild-type superoxide dismutase 1 (SOD1) was obtained from UniProt using the accession number P00441. To simulate the disease-associated variant, the A4V mutation was then manually introduced into the sequence to generate the mutant form of the protein used for the peptide design experiments. This mutation substitutes valine for alanine at position 4 in the conventional SOD1 numbering, which does not count the initiator methionine, so the affected residue is the fifth residue of the UniProt sequence.

  • Original Superoxide dismutase 1 (SOD1) sequence from UniProt:

sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2

MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

  • Mutant A4V sequence (A4V means Alanine → Valine at position 4):

sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
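The substitution can also be reproduced programmatically, which guards against editing the wrong residue by hand. The following is a minimal sketch; `apply_mutation` and its `met_offset` parameter are hypothetical names introduced for illustration, and the offset encodes the assumption that SOD1 variant numbering skips the initiator methionine.

```python
# Wild-type SOD1 (UniProt P00441), initiator Met included, 154 residues.
WT_SOD1 = (
    "MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS"
    "AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV"
    "HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ"
)


def apply_mutation(seq, mutation, met_offset=1):
    """Apply a point mutation such as 'A4V'.

    met_offset=1 assumes the mutation is numbered without the initiator
    methionine, while `seq` still contains it.
    """
    wt, pos, new = mutation[0], int(mutation[1:-1]), mutation[-1]
    idx = pos - 1 + met_offset  # 0-based index into the Met-containing sequence
    if seq[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[idx]}")
    return seq[:idx] + new + seq[idx + 1:]


A4V_SOD1 = apply_mutation(WT_SOD1, "A4V")
print(A4V_SOD1[:10])
```

The wild-type check raises an error if the numbering convention and the sequence do not agree, which is the most common source of off-by-one mistakes with SOD1 variants.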

Note

The Problem: The A4V mutation you are studying is famous because it destabilizes the dimer interface. This causes the dimer to fall apart into monomers, which then misfold and aggregate into the toxic clumps seen in ALS patients.

After running the PepMLM model using mutant A4V sequence, I obtained four candidate peptide binder sequences. In the generated results, each sequence ended with the amino acid symbol ‘X’, which represents an undefined residue predicted by the model.

To proceed with the structural analysis and fold the predicted binders with the mutated SOD1 protein, I needed to assign a specific amino acid at this position. For this reason, I replaced the ‘X’ residue with alanine (A) in each sequence. I chose alanine because it is a small and neutral amino acid that generally has minimal effects on protein structure and interactions. This allowed me to obtain complete peptide sequences that could be used for the subsequent folding and interaction prediction with the A4V mutant SOD1 protein.

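| Binder | Original Sequence | Modified Sequence | Perplexity |
| --- | --- | --- | --- |
| Known binder | FLYRWLPSRRGG | N/A | N/A |
| 1 | WLSPAAGVEWKX | WLSPAAGVEWKA | 14.764 |
| 2 | WHYYVVVVRHWX | WHYYVVVVRHWA | 28.661 |
| 3 | WRSYVVVVELKX | WRSYVVVVELKA | 20.402 |
| 4 | WRYPAVVAAHGX | WRYPAVVAAHGA | 5.136 |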
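The placeholder handling described above is simple to script. This is a small sketch, with `fill_unknown` as a hypothetical helper name; it produces both the alanine-completed 12-mers and the 11-residue versions obtained by dropping the undefined residue.

```python
def fill_unknown(peptide, placeholder="X", replacement="A"):
    """Replace undefined residues with a chosen amino acid (alanine by default)."""
    return peptide.replace(placeholder, replacement)


# Raw PepMLM outputs, each ending in the undefined residue 'X'.
raw_peptides = ["WLSPAAGVEWKX", "WHYYVVVVRHWX", "WRSYVVVVELKX", "WRYPAVVAAHGX"]

# 12-mers with X replaced by alanine.
filled = [fill_unknown(p) for p in raw_peptides]

# 11-mers with the trailing X removed instead of replaced.
truncated = [p.rstrip("X") for p in raw_peptides]

for original, full, short in zip(raw_peptides, filled, truncated):
    print(original, full, short)
```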

Part 2: Evaluating Binders with AlphaFold3

The predicted peptide binders were evaluated using structural modeling. Each peptide was folded together with the A4V mutant of SOD1 to evaluate the potential protein–peptide interactions.

Note

pTM and ipTM scores: the predicted template modeling (pTM) score and the interface predicted template modeling (ipTM) score are both derived from the template modeling (TM) score, which measures the accuracy of the entire structure (Zhang and Skolnick, 2004; Xu and Zhang, 2010). A pTM score above 0.5 means the overall predicted fold for the complex might be similar to the true structure. ipTM measures the accuracy of the predicted relative positions of the subunits within the complex. Values higher than 0.8 represent confident, high-quality predictions, while values below 0.6 suggest a likely failed prediction. ipTM values between 0.6 and 0.8 are a gray zone where predictions could be correct or incorrect. The TM score is very strict for small structures or short chains, so pTM assigns values less than 0.05 when fewer than 20 tokens are involved; for these cases PAE or pLDDT may be more indicative of prediction quality.
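The ipTM thresholds in the note above can be written as a small classifier, which makes it easy to label every prediction the same way. This is an illustrative sketch; `classify_iptm` is a hypothetical helper, and the scores are the 12-residue ipTM values reported in this section.

```python
def classify_iptm(iptm):
    """Label an ipTM value using the thresholds from the note above."""
    if iptm > 0.8:
        return "confident"
    if iptm >= 0.6:
        return "gray zone"
    return "likely failed"


# ipTM values for the 12-residue peptides folded with A4V SOD1.
iptm_scores = {
    "Known binder": 0.28,
    "Binder 1": 0.39,
    "Binder 2": 0.33,
    "Binder 3": 0.20,
    "Binder 4": 0.33,
}

labels = {name: classify_iptm(score) for name, score in iptm_scores.items()}
for name, label in labels.items():
    print(f"{name}: {iptm_scores[name]:.2f} -> {label}")
```

Under these thresholds every PepMLM complex falls below 0.6, so the differences discussed below are comparisons among low-confidence predictions rather than evidence of confirmed binding.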

  • The known binder (FLYRWLPSRRGG) showed a relatively low binding confidence, with an ipTM score of 0.28. The peptide binds mainly on the surface of the SOD1 β-barrel, close to the electrostatic loop and the zinc-binding loop. It does not bind near the N-terminus, where the A4V mutation is located, and it also does not interact with the dimer interface. The peptide remains mostly surface-bound rather than buried inside the protein structure. Several residues help stabilize the interaction. For example, Trp5 and Tyr3 can form aromatic contacts with the protein surface, while Arg8 and Arg9 may form hydrogen bonds with nearby residues of SOD1. However, the peptide does not form a strong or compact binding interface, which suggests that the interaction may be weak or transient.

  • Binder 1 (WLSPAAGVEWKA) showed a clear improvement compared with the known binder, with an ipTM score of 0.39. The peptide binds on a hydrophobic groove on the surface of the SOD1 β-barrel. In this interaction, Trp1 acts as an important anchoring residue, helping the peptide attach to a hydrophobic pocket on the protein surface. Other residues such as Ser3 and Pro4 help position the peptide backbone against the protein surface. In addition, Glu9 forms stabilizing hydrogen bonds with nearby residues on SOD1. Because of these interactions, the peptide forms a more compact and organized binding conformation than the known binder. Although the peptide still binds away from the A4V mutation site, the higher ipTM score and the stronger interaction network suggest that Binder 1 may represent a more promising peptide candidate.

  • Binder 2 (WHYYVVVVRHWA) showed a moderate interaction with SOD1, with an ipTM score of 0.33, which is higher than the known binder but lower than Binder 1. The peptide binds on a surface patch of the SOD1 β-barrel region. Several residues appear to contribute to this interaction. Trp1 participates in both hydrogen bonding and aromatic interactions with the protein surface, helping to anchor the peptide. Tyr3 and Arg9 also participate in hydrogen bonding that stabilizes the peptide orientation. In addition, the terminal residue Ala12 contributes to stabilizing the peptide backbone through hydrogen bonding with the protein surface. Compared with the known binder, Binder 2 shows a more localized and organized binding mode, although the peptide still binds mainly on the surface of the protein rather than deeply inside the structure.

  • Binder 3 (WRSYVVVVELKA) showed the lowest binding confidence among the designed peptides, with an ipTM score of 0.20, which is even lower than the known binder. The peptide still localizes on the surface of the SOD1 β-barrel, but the interaction appears weak and poorly defined. The interaction is mainly supported by Arg2 and Lys11, which can form hydrogen bonds with residues on the SOD1 surface. In addition, Tyr4 may contribute through aromatic interactions with the protein surface. However, the peptide forms only a limited number of stabilizing contacts, and the interaction appears less stable compared with Binder 1 and Binder 2. These results suggest that Binder 3 may not be a strong candidate for stable binding to the SOD1 mutant.

  • Binder 4 (WRYPAVVAAHGA) showed a moderate structural confidence, with an ipTM score of 0.33, similar to Binder 2 and higher than the known binder. The peptide binds on the surface of the SOD1 β-barrel region. Several residues contribute to this interaction. Trp3, Val6, and Gly11 appear to form hydrogen bonds with residues on the SOD1 surface, helping stabilize the interaction. In addition, an internal hydrogen bond between Val6 and His10 helps stabilize the peptide backbone and maintain its conformation. Compared with Binder 3, this peptide forms more defined interactions with the protein surface, which explains its higher predicted binding confidence. Although the peptide still binds away from the A4V mutation site, the interaction appears more organized and stable than the known binder.

To further explore whether peptide length influences binding stability, the same structural analysis was also performed using 11-residue versions of the peptides obtained by removing the final alanine that replaced the unknown residue X. For Binder 1, the ipTM score decreased from 0.39 (12 aa) to 0.27 (11 aa), indicating that the twelfth residue likely helps stabilize the interaction with the SOD1 surface. In contrast, Binder 2 showed a small increase in structural confidence, where the score changed from 0.33 (12 aa) to 0.35 (11 aa), suggesting that the slightly shorter peptide may adopt a somewhat better orientation on the protein surface. Binder 3 showed the strongest negative effect of shortening the peptide, with the score decreasing from 0.20 (12 aa) to 0.13 (11 aa), confirming that this peptide already forms weak interactions and becomes even less stable when shortened. Interestingly, Binder 4 showed the opposite trend, where the 11-residue version reached the highest score of all tested peptides (0.44) compared with 0.33 for the 12-residue version, suggesting that removing the last residue may allow the peptide to adopt a more favorable binding conformation. Overall, these exploratory results suggest that peptide length can influence binding stability, but the effect is sequence-dependent, since shortening the peptide reduced stability for some binders (Binder 1 and Binder 3) while improving it for others (Binder 2 and Binder 4).
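The length comparison can be tabulated directly from the reported scores. The sketch below only restates the ipTM values given above and computes the change on truncation; the variable names are illustrative.

```python
# ipTM values for the alanine-completed 12-mers and the truncated 11-mers.
iptm_12 = {"Binder 1": 0.39, "Binder 2": 0.33, "Binder 3": 0.20, "Binder 4": 0.33}
iptm_11 = {"Binder 1": 0.27, "Binder 2": 0.35, "Binder 3": 0.13, "Binder 4": 0.44}

# Positive delta: the shorter peptide scored higher.
delta = {name: round(iptm_11[name] - iptm_12[name], 2) for name in iptm_12}
improved = sorted(name for name, d in delta.items() if d > 0)
worsened = sorted(name for name, d in delta.items() if d < 0)

for name, d in delta.items():
    print(f"{name}: {iptm_12[name]:.2f} -> {iptm_11[name]:.2f} ({d:+.2f})")
print("improved:", improved)
print("worsened:", worsened)
```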

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

I evaluated the therapeutic properties of all 12-residue peptide binders using PeptiVerse. The known binder FLYRWLPSRRGG showed weak predicted binding affinity (pKd 5.97), good solubility, a very low hemolysis probability (0.047), and a positive net charge of 2.76. Among the PepMLM-generated peptides, Binder 1 (WLSPAAGVEWKA) had weak predicted binding affinity (pKd 5.61), was predicted to be soluble, and had a very low hemolysis probability (0.037) and a near-neutral net charge (-0.24). Binder 2 (WHYYVVVVRHWA) had medium predicted binding affinity (pKd 7.12), was predicted to be soluble and non-hemolytic (0.115), and had a slightly positive net charge (0.93). Binder 3 (WRSYVVVVELKA) had weak predicted binding (pKd 6.28), was predicted to be soluble and non-hemolytic (0.115), and had a net charge of 0.76. Binder 4 (WRYPAVVAAHGA) had weak predicted binding (pKd 5.22), was predicted to be soluble and non-hemolytic (0.037), and had a net charge of 0.85.

| Binder | Sequence | ipTM (AlphaFold) | Predicted Binding Affinity (pKd/pKi) | Solubility | Hemolysis Probability | Net Charge (pH 7) | Molecular Weight (Da) | Highlights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Known | FLYRWLPSRRGG | 0.28 | Weak (5.968) | Soluble (1) | 0.047 | 2.76 | 1507.7 | Surface-bound, low confidence; non-hemolytic |
| Binder 1 | WLSPAAGVEWKA | 0.39 | Weak (5.608) | Soluble (1) | 0.037 | -0.24 | 1314.5 | Highest ipTM, stable hydrophobic groove binding; non-hemolytic |
| Binder 2 | WHYYVVVVRHWA | 0.33 | Medium (7.123) | Soluble (1) | 0.115 | 0.93 | 1614.8 | Good binding, slightly lower ipTM than Binder 1; non-hemolytic |
| Binder 3 | WRSYVVVVELKA | 0.20 | Weak (6.279) | Soluble (1) | 0.115 | 0.76 | 1448.7 | Lowest ipTM, weak structural stability; non-hemolytic |
| Binder 4 | WRYPAVVAAHGA | 0.33 | Weak (5.216) | Soluble (1) | 0.037 | 0.85 | 1297.5 | Medium ipTM, non-hemolytic, good solubility |

Comparing these properties with the structural confidence from AlphaFold, we see that higher ipTM scores do not always match stronger predicted binding. For example, Binder 1 had the highest ipTM (0.39) but only weak predicted binding, while Binder 2 had a slightly lower ipTM (0.33) but showed medium predicted binding. All generated peptides are predicted to be soluble and non-hemolytic, which is favorable for therapeutic use. Considering both structural confidence and predicted properties, I would advance Binder 1 (WLSPAAGVEWKA): it has the highest structural confidence on SOD1 among the 12-residue peptides, is predicted to be non-hemolytic and soluble, and has a near-neutral charge that may support safe and effective binding in a biological context.
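To put a number on the mismatch between ipTM and predicted affinity, a rank correlation can be computed over the five peptides. The sketch below implements Spearman's correlation with average ranks for ties in plain Python; the helper names are illustrative, and with only five data points the result is indicative at best.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Order: known binder, Binder 1, Binder 2, Binder 3, Binder 4.
iptm = [0.28, 0.39, 0.33, 0.20, 0.33]
pkd = [5.968, 5.608, 7.123, 6.279, 5.216]

rho = spearman(iptm, pkd)
print(f"Spearman rho between ipTM and predicted pKd: {rho:.2f}")
```

A negative coefficient here is consistent with the observation that the peptide with the highest ipTM is not the one with the highest predicted affinity.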

Part 4: Generating Optimized Peptides with moPPIt

Mutations such as the A4V variant can destabilize the structure of Superoxide Dismutase 1, increasing the probability of protein misfolding, dissociation of the dimer, and toxic aggregation, which are processes associated with Amyotrophic Lateral Sclerosis (ALS). For this reason, the design strategy of this step focused on generating short peptides that can bind simultaneously to both monomers at the dimer interface, effectively acting as a molecular bridge that reconnects and stabilizes the two subunits. By reinforcing the interaction between the chains, these peptides may help restore a conformation closer to the native functional state of the SOD1 complex, while reducing the structural instability caused by the mutation.

To do so, several design parameters were selected before generating peptides. The peptide length was fixed at 12 amino acids. The motif positions were set to residues 3–10; in moPPIt these indices refer to residues on the target protein, so generation was steered toward the N-terminal segment of SOD1 that contains the A4V site. In addition, affinity guidance and solubility optimization were enabled, and hemolysis prediction was considered to reduce potential toxicity. These settings allow the model to design peptides that not only bind the protein but also have better predicted therapeutic properties.

After generating the following peptides, their structures were evaluated using AlphaFold to predict how they interact with SOD1.

  • The generated sequences:

| Optimized Binder | Sequence | ipTM Score | Binding Localization |
| --- | --- | --- | --- |
| 1 | KRQCEIFNQFMA | 0.91 | Interface between the two monomers |
| 2 | EKDNKWVITSQF | 0.86 | Interface between the two monomers |
| 3 | VCQFDYKTLFKK | 0.87 | Interface between the two monomers |
| 4 | GQQSLFKTKTLD | 0.89 | The outer surface of a single SOD1 monomer |

  • Binder 1 – KRQCEIFNQFMA (ipTM: 0.91) This peptide localizes at the dimer interface of the SOD1 homodimer and acts as a molecular bridge between the two monomers. Several residues of the peptide participate in stabilizing the interaction. Gln3 forms a hydrogen bond with residues on the first monomer, while Cys4 interacts with a cysteine residue on the second monomer. In addition, Asn8 forms multiple hydrogen bonds with residues on Chain A. These multiple contacts allow the peptide to connect both monomers simultaneously, which could help stabilize the dimer structure of SOD1.
  • Binder 2 – EKDNKWVITSQF (ipTM: 0.86) This peptide also binds at the dimer interface and connects the two monomers. The interaction is mainly driven by the N-terminal region of the peptide. Glu1 forms several hydrogen bonds with residues on Chain B, creating a strong anchoring point. In addition, Ser10 interacts with residues on Chain A. Through these interactions with both monomers, the peptide may help maintain the stability of the SOD1 dimer.
  • Binder 3 – VCQFDYKTLFKK (ipTM: 0.87) This peptide spans the interface between the two monomers, forming stabilizing contacts with both chains. Val1 forms a hydrogen bond with residues on Chain B, while Phe4 interacts with residues on Chain A. These interactions allow the peptide to bridge the two monomers and stabilize the interface region.
  • Binder 4 – GQQSLFKTKTLD (ipTM: 0.89) Unlike the previous peptides, this binder attaches to the outer surface of a single SOD1 monomer, particularly near the β-barrel structure. The interaction is mainly driven by residues near the C-terminus of the peptide. Thr10 forms a hydrogen bond with the monomer, while Asp12 forms two hydrogen bonds with residues on Chain A. Lys9 also contributes to stabilization by forming an additional hydrogen bond. This peptide does not bridge the dimer but instead stabilizes the surface structure of the monomer.

The four peptides show two different binding strategies:

Three peptides (KRQCEIFNQFMA, EKDNKWVITSQF, and VCQFDYKTLFKK) bind at the dimer interface, where they interact with residues from both monomers. These peptides may help stabilize the SOD1 dimer by acting as a bridge between the two chains. In contrast, GQQSLFKTKTLD binds only to one monomer, specifically on the β-barrel surface. Instead of bridging the two chains, this peptide may stabilize the structure of the individual monomer.

Among the peptides, KRQCEIFNQFMA shows the highest ipTM score (0.91), suggesting the most confident predicted interface with the protein complex.

When comparing the peptides generated by PepMLM and moPPIt, the main difference lies in the design strategy. PepMLM mainly samples possible peptide sequences that could bind to the target protein based on patterns learned from protein sequence data. However, it does not allow the user to control exactly where the peptide should bind on the protein. As a result, the generated peptides are plausible binders, but their binding location and biochemical properties are not specifically optimized.

In contrast, moPPIt enables guided peptide design. In this approach, the user can select specific residues or regions on the protein where the peptide should bind, such as the dimer interface of Superoxide Dismutase 1 or regions near the A4V mutation. The model also optimizes several properties simultaneously, including binding affinity, solubility, hemolysis risk, and motif placement. Because of this multi-objective optimization, moPPIt peptides are designed to better satisfy several therapeutic requirements at the same time.

Part B: BRD4 Drug Discovery Platform Tutorial (Gabriele)

Assignees for this section

MIT/Harvard students Required

Committed Listeners Required


For this part, unfortunately, I was unable to access the BRD4 Drug Discovery Platform, as the access was not granted to me despite my request.

Part C: Final Project: L-Protein Mutants

Assignees for this section

MIT/Harvard students Required

Committed Listeners Required


Option 1: Improve autofolding and lysis efficiency

The goal of this part was to design mutations in the L-protein in order to improve its function. Two main objectives were considered. The first objective was to improve the autofolding ability of the L-protein so that it can fold correctly without strong dependence on host chaperones. The second objective was to improve the lysis efficiency of the protein by enhancing its ability to form pores in the E. coli membrane and promote faster or more efficient bacterial lysis.

To identify possible mutations, the provided mutation scoring notebook was used. This notebook evaluates all possible amino-acid substitutions in the L protein and assigns a score to each mutation.
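As a rough illustration of what such a score means, the sketch below computes a log-likelihood ratio (LLR) for one substitution. It assumes the notebook scores each mutation as log p(mutant) minus log p(wild type) at the mutated position under a protein language model; the probabilities are made-up toy values, not model output, and `llr` is a hypothetical helper.

```python
import math


def llr(position_probs, wt_aa, mut_aa):
    """Log-likelihood ratio of a substitution at one position.

    position_probs maps amino acids to model probabilities at that position.
    A positive value means the model prefers the mutant residue.
    """
    return math.log(position_probs[mut_aa]) - math.log(position_probs[wt_aa])


# Toy probabilities for a single position (hypothetical values).
toy_probs = {"K": 0.05, "L": 0.40, "A": 0.10}

score = llr(toy_probs, wt_aa="K", mut_aa="L")
print(f"LLR for K->L at this toy position: {score:.3f}")
```

A high LLR therefore indicates that the substitution looks more natural to the model than the wild-type residue in that sequence context, which is not the same as a measured gain in function.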

After running the notebook, the resulting mutation predictions, listed in descending order of LLR score, are presented in the following table:

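| Position (DNA) | Position (Protein) | Wild Type AA | Mutation AA | LLR Score |
| --- | --- | --- | --- | --- |
| 989 | 50 | K | L | 2.561468 |
| 574 | 29 | C | R | 2.395427 |
| 769 | 39 | Y | L | 2.241780 |
| 575 | 29 | C | S | 2.043150 |
| 173 | 9 | S | Q | 2.014325 |
| 573 | 29 | C | Q | 1.997049 |
| 572 | 29 | C | P | 1.971029 |
| 569 | 29 | C | L | 1.960646 |
| 987 | 50 | K | I | 1.928801 |
| 1049 | 53 | N | L | 1.864932 |
| 1209 | 61 | E | L | 1.818098 |
| 1029 | 52 | T | L | 1.813968 |
| 984 | 50 | K | F | 1.802069 |
| 576 | 29 | C | T | 1.797247 |
| 568 | 29 | C | K | 1.795878 |
| 93 | 5 | F | Q | 1.795244 |
| 94 | 5 | F | R | 1.659717 |
| 560 | 29 | C | A | 1.648656 |
| 534 | 27 | Y | R | 1.628061 |
| 434 | 22 | F | R | 1.602028 |
| 92 | 5 | F | P | 1.596891 |
| 997 | 50 | K | V | 1.594576 |
| 995 | 50 | K | S | 1.574557 |
| 96 | 5 | F | T | 1.559024 |
| 95 | 5 | F | S | 1.556417 |
| 889 | 45 | A | L | 1.539248 |
| 775 | 39 | Y | S | 1.517457 |
| 535 | 27 | Y | S | 1.497053 |
| 789 | 40 | V | L | 1.477630 |
| 529 | 27 | Y | L | 1.474637 |
| 435 | 22 | F | S | 1.423358 |
| 563 | 29 | C | E | 1.383281 |
| 760 | 39 | Y | A | 1.364999 |
| 571 | 29 | C | N | 1.362601 |
| 980 | 50 | K | A | 1.357795 |
| 567 | 29 | C | I | 1.344121 |
| 89 | 5 | F | L | 1.332615 |
| 334 | 17 | N | R | 1.323651 |
| 767 | 39 | Y | I | 1.320103 |
| 776 | 39 | Y | T | 1.302804 |
| 514 | 26 | D | R | 1.268762 |
| 566 | 29 | C | H | 1.246107 |
| 764 | 39 | Y | F | 1.245851 |
| 777 | 39 | Y | V | 1.244390 |
| 454 | 23 | K | R | 1.236555 |
| 494 | 25 | E | R | 1.229350 |
| 474 | 24 | H | R | 1.227779 |
| 996 | 50 | K | T | 1.222131 |
| 533 | 27 | Y | Q | 1.218851 |
| 536 | 27 | Y | T | 1.215567 |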

The predicted mutations were compared with the experimental dataset of L-protein mutants provided in the course material. This dataset contains mutations that were experimentally tested and their effect on lysis activity.

The goal of this comparison was to determine whether mutations with high prediction scores correspond to mutations that show improved lysis in experimental studies. This step helps evaluate the reliability of the prediction model.

The comparison revealed a limited overlap between predicted beneficial mutations and experimentally tested mutations. Two mutations, C29R and K50I, appeared in both datasets. However, the experimental data indicated that these substitutions did not improve lysis activity. This suggests that, while the protein language model captures sequence compatibility, it does not fully predict functional outcomes such as lysis efficiency. For this reason, experimental validation remains essential to confirm computational predictions.

To avoid mutations that could disrupt essential protein functions, a sequence conservation analysis was performed. Multiple sequences related to the MS2 L protein, obtained from the BLAST results provided in the course folder, were uploaded to Clustal Omega and aligned.

The conserved regions of the L protein were identified from the multiple sequence alignment. Fully conserved residues, which are identical across all sequences, are marked with an asterisk (*) in the alignment output, while a colon (:) indicates residues with strongly similar chemical properties. These positions were considered critical for protein function, so mutations at these residues were avoided. The remaining positions, which showed variability among sequences, were classified as non-conserved and were selected as potential sites for mutation. This approach was intended to minimize disruption of essential protein structure and function.

Mutations were selected by combining the mutation scoring predictions with the evolutionary conservation analysis. Only residues located at non-conserved positions were chosen, in order to reduce the risk of disrupting essential protein functions. The selected mutations (F5Q, S9Q, F22S, Y27L, and A45L), summarized in the following table, are distributed between the N-terminal region, the central region, and the transmembrane domain of the L protein. This distribution allows the exploration of potential effects on protein autofolding and membrane activity, while maintaining the overall structural integrity of the protein.

| Mutation | LLR Score* | Protein Region | AA Property Change | Mutation Type | Conserved Residue? | Structural Risk | Rationale for Selection |
| --- | --- | --- | --- | --- | --- | --- | --- |
| S9Q | ~2.01 | N-terminal region | Small polar → Polar amide | Conservative | Unconserved | Low | Similar polarity; minimal structural disruption while potentially altering hydrogen bonding |
| F5Q | ~1.80 | N-terminal region | Hydrophobic aromatic → Polar amide | Moderate | Unconserved | Moderate | Introduces polarity which may affect folding and interaction with cytoplasmic environment |
| A45L | ~1.54 | Transmembrane helix | Small hydrophobic → Larger hydrophobic | Conservative | Unconserved | Low | Maintains hydrophobic nature; may stabilize helix packing in membrane |
| Y27L | ~1.47 | Near transmembrane region | Aromatic → Hydrophobic aliphatic | Moderate | Unconserved | Moderate | Maintains hydrophobicity but removes aromatic ring; could affect membrane insertion |
| F22S | ~1.42 | Cytoplasmic / near TM region | Hydrophobic aromatic → Small polar | Moderate | Unconserved | Moderate | Reduces hydrophobicity; may influence membrane interaction and folding |
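The conservation filter used to choose these positions can be sketched as a small parser over a Clustal-style consensus line. The alignment row and consensus string below are toy values, not the actual alignment, and `variable_positions` is a hypothetical helper; treating the weak-similarity symbol (.) as non-conserved is an assumption of this sketch.

```python
def variable_positions(reference_row, consensus_row):
    """Return 1-based residue numbers of the reference that are not marked
    '*' (identical) or ':' (strongly similar) in the consensus line.

    Gap characters in the reference row are skipped so that numbering
    follows the ungapped reference sequence.
    """
    positions = []
    residue_number = 0
    for aa, mark in zip(reference_row, consensus_row):
        if aa == "-":
            continue
        residue_number += 1
        if mark not in "*:":
            positions.append(residue_number)
    return positions


# Toy aligned reference row and its consensus line (same length).
reference_row = "METRF-PQQS"
consensus_row = "**: .  *:."

candidates = variable_positions(reference_row, consensus_row)
print("Non-conserved reference positions:", candidates)
```

Candidate mutations from the scoring table can then be kept only if their protein position appears in this list.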

Because the L gene overlaps with other genes in the MS2 genome, the nucleotide changes corresponding to the selected mutations were checked to ensure that they do not introduce stop codons in the overlapping reading frames.

The mutations F5Q and S9Q are located in the region overlapping with the coat protein (CP) gene, near its C-terminal end, while the mutation A45L is located in the region overlapping with the replicase (Rep) gene, near its N-terminal region. For each mutation, the possible codon substitutions were examined and confirmed not to generate stop codons in the overlapping genes. Therefore, these mutations are considered compatible with the genome organization of MS2.
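The overlapping-frame check can be automated by scanning all three forward reading frames before and after a nucleotide change. The sketch below uses a short toy DNA fragment, not the MS2 genome, and the function names are hypothetical; it reports only stop codons that are new in the mutant.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_positions(dna, frame):
    """0-based start indices of stop codons when reading `dna` in `frame` (0, 1 or 2)."""
    return [i for i in range(frame, len(dna) - 2, 3) if dna[i:i + 3] in STOP_CODONS]


def new_stops(wild_type, mutant, frames=(0, 1, 2)):
    """For each frame, list stop codons present in the mutant but not in the wild type."""
    return {
        frame: sorted(set(stop_positions(mutant, frame)) - set(stop_positions(wild_type, frame)))
        for frame in frames
    }


# Toy fragment: a C->G change at index 6 turns CAG into GAG in frame 0
# (a harmless Q->E change) but creates TGA in frame 2.
wild_type = "ATGGCTCAGAAA"
mutant = "ATGGCTGAGAAA"

print(new_stops(wild_type, mutant))
```

The toy case shows why the check matters: a codon change that looks acceptable in the gene being engineered can still truncate a protein encoded in an overlapping frame.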

Option 2: Achieve DnaJ independence

Here the goal was to reduce or eliminate the dependence of the L-protein on the host chaperone DnaJ. By designing mutations in the soluble N-terminal domain of the L-protein, I aimed to weaken its interaction with DnaJ while maintaining proper folding. This approach could potentially allow the phage to function even if DnaJ is mutated or absent in the host.

To study the interaction, I used the AlphaFold2-Multimer notebook in ColabFold to co-fold the soluble domain of the L-protein with the full sequence of E. coli DnaJ. The sequences used were:

  • DnaJ sequence:

MAKQDYYEILGVSKTAEEREIRKAYKRLAMKYHPDRNQGDKEAEAKFKEIKEAYEVLTDSQKRAAYDQYGHAAFEQGGMGGGGFGGGADFSDIFGDVFGDIFGGGRGRQRAARGADLRYNMELTLEEAVRGVTKEIRIPTLEECDVCHGSGAKPGTQPQTCPTCHGSGQVQMRQGFFAVQQTCPHCQGRGTLIKDPCNKCHGHGRVERSKTLSVKIPAGVDTGDRIRLAGEGEAGEHGAPAGDLYVQVQVKQHPIFEREGNNLYCEVPINFAMAALGGEIEVPTLDGRVKLKVPGETQTGKLFRMRGKGVKSVRGGAQGDLLCRVVVETPVGLNERQKQLLQELQESFGGPTGEHNSPRSKSFFDGVKKFFDDLTR

  • The soluble domain of Lysis protein (N terminal Domain):

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSS


The first co-folding run generated five ranked models with the following parameters. Here, pLDDT reflects the confidence in the predicted structure of the L–DnaJ complex, pTM indicates the overall predicted quality of the complex, and ipTM estimates the confidence in the predicted interface between the L-protein and DnaJ, which is used here as a rough proxy for the strength of the interaction:

  • Rank_004_alphafold2_multimer_v3_model_1: pLDDT=79.3 pTM=0.567 ipTM=0.198
  • Rank_005_alphafold2_multimer_v3_model_2: pLDDT=78.9 pTM=0.573 ipTM=0.161
  • Rank_003_alphafold2_multimer_v3_model_3: pLDDT=78.4 pTM=0.572 ipTM=0.265
  • Rank_001_alphafold2_multimer_v3_model_4: pLDDT=78.9 pTM=0.583 ipTM=0.373
  • Rank_002_alphafold2_multimer_v3_model_5: pLDDT=77.9 pTM=0.572 ipTM=0.287
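The rank labels in these file names can be reproduced from the reported metrics. The sketch below assumes that ColabFold ranks multimer predictions by the combined score 0.8·ipTM + 0.2·pTM; that weighting is an assumption here rather than something stated in this write-up, and `multimer_score` is an illustrative helper name.

```python
# Reported metrics for the wild-type L-protein / DnaJ co-folding run.
models = {
    "model_1": {"pLDDT": 79.3, "pTM": 0.567, "ipTM": 0.198},
    "model_2": {"pLDDT": 78.9, "pTM": 0.573, "ipTM": 0.161},
    "model_3": {"pLDDT": 78.4, "pTM": 0.572, "ipTM": 0.265},
    "model_4": {"pLDDT": 78.9, "pTM": 0.583, "ipTM": 0.373},
    "model_5": {"pLDDT": 77.9, "pTM": 0.572, "ipTM": 0.287},
}


def multimer_score(metrics):
    """Combined ranking score (assumed weighting: 0.8 * ipTM + 0.2 * pTM)."""
    return 0.8 * metrics["ipTM"] + 0.2 * metrics["pTM"]


ranking = sorted(models, key=lambda name: multimer_score(models[name]), reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(f"rank_{rank:03d}: {name} (score {multimer_score(models[name]):.3f})")
```

Under this assumption the order matches the labels above, with model_4 as rank_001, which helps keep rank numbers and model numbers apart when discussing the results.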

  • Plots for L_DnaJ_complex (figure)

By comparing the different predicted models, model 4 (rank_001) was identified as the best model because it showed the highest ipTM score (0.373), indicating the most confident predicted interaction between the L-protein and DnaJ. Using PyMOL, I analysed the predicted L–DnaJ complex of this model, and the results revealed multiple interacting residues located in the N-terminal region of the L protein, as summarized in the following table:

| Residue | Type | Contacts with DnaJ | Typical Interaction Role |
| --- | --- | --- | --- |
| MET1 | Hydrophobic (non-polar) | ASP116, ARG113 | Hydrophobic contact |
| GLU2 | Negatively charged | ALA115, ASP116 | Electrostatic / salt bridge |
| THR3 | Polar (uncharged) | ASP116, LEU117, ARG118 | Hydrogen bonding |
| ARG4 | Positively charged | ALA115, ARG118, LEU117, GLU233, ASP116 | Electrostatic / salt bridge |
| PHE5 | Hydrophobic aromatic | ASN120, LEU117, ARG118 | Hydrophobic packing |
| PRO6 | Hydrophobic (rigid) | LEU117, ARG118, ASN120, TYR119, ASP116 | Structural / hydrophobic contact |
| GLN8 | Polar (uncharged) | ASN120, ARG118, TYR119 | Hydrogen bonding |
| SER15 | Polar (uncharged) | GLN252, GLU122, LYS251 | Hydrogen bonding |
| ASN17 | Polar (uncharged) | GLN252 | Hydrogen bonding |
| ARG18 | Positively charged | VAL250, GLN252, GLN249, GLU122, LYS251 | Electrostatic interaction |
| ARG19 | Positively charged | GLN252 | Electrostatic interaction |
| ARG20 | Positively charged | GLN252 | Electrostatic interaction |
| PRO21 | Hydrophobic | GLN252, GLU257 | Structural / hydrophobic contact |
| PHE22 | Hydrophobic aromatic | PRO254, GLU266, GLU257 | Hydrophobic contact |
| LYS23 | Positively charged | GLU266 | Electrostatic interaction |
| HIS24 | Positively charged / polar | VAL326, ARG324, GLU266 | Electrostatic / hydrogen bond |
| GLU25 | Negatively charged | GLU266 | Electrostatic interaction |
| ASP26 | Negatively charged | ARG324, GLU266, VAL326 | Electrostatic interaction |
| TYR27 | Aromatic polar | VAL327, THR329, GLU328 | Hydrophobic + H-bond |

Key residues such as Arg4, Thr3, Pro6, Phe5, Arg18, Lys23, His24, and Tyr27 were found to interact with several residues of DnaJ, including Asp116, Leu117, Arg118, Glu122, and Glu266. These interactions involve a combination of electrostatic, hydrophobic, and hydrogen-bond contacts. Residues forming multiple contacts were considered potential targets for mutagenesis aimed at reducing the dependence of the L protein on the DnaJ chaperone.

Two hydrophobic residues (Pro6 and Phe22), two positively charged residues (Arg4 and Arg18), and two negatively charged residues (Glu2 and Asp26) were selected for mutational analysis. These residues participate in multiple contacts with DnaJ and represent different physicochemical interaction types involved in stabilizing the L–DnaJ interface.

To evaluate the contribution of different interaction types at the L–DnaJ interface, selected residues were substituted with alanine using an alanine-scanning approach in order to remove their side-chain interactions while minimizing structural perturbation.

| Original Residue | Mutation | Reason |
| --- | --- | --- |
| PRO6 | P6A | removes rigid hydrophobic contact |
| PHE22 | F22A | removes aromatic hydrophobic interaction |
| ARG4 | R4A | removes positive charge |
| ARG18 | R18A | removes strong electrostatic interaction |
| GLU2 | E2A | removes negative charge |
| ASP26 | D26A | removes negative charge |

The resulting N-terminal sequence of the lysis protein was used to re-predict the interaction with the DnaJ protein in order to evaluate whether the introduced mutations could reduce the dependence of the lysis protein on the host chaperone:

MATAFAQQSQQTPASTNARRPAKHEAYPCRRQQRSS
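The mutated sequence can be regenerated from the wild-type N-terminal domain and the six alanine substitutions, which confirms that every change landed on the intended residue. This is a minimal sketch; `apply_mutations` is a hypothetical helper, and positions are 1-based with the initiator methionine counted as residue 1, as in the contact table.

```python
WT_L_NTD = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSS"
ALANINE_SCAN = ["E2A", "R4A", "P6A", "R18A", "F22A", "D26A"]


def apply_mutations(seq, mutations):
    """Apply point mutations written as <wild type><1-based position><new residue>."""
    residues = list(seq)
    for mutation in mutations:
        wt, pos, new = mutation[0], int(mutation[1:-1]), mutation[-1]
        if residues[pos - 1] != wt:
            raise ValueError(f"{mutation}: expected {wt}, found {residues[pos - 1]}")
        residues[pos - 1] = new
    return "".join(residues)


mutant_ntd = apply_mutations(WT_L_NTD, ALANINE_SCAN)
print(mutant_ntd)
```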

The mutated L-protein was co-folded again with DnaJ using AlphaFold2-Multimer. The five ranked models obtained were:

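| Rank | pLDDT | pTM | ipTM |
| --- | --- | --- | --- |
| 3 | 78.7 | 0.579 | 0.291 |
| 4 | 78.4 | 0.574 | 0.235 |
| 5 | 77.1 | 0.569 | 0.233 |
| 2 | 79.1 | 0.581 | 0.219 |
| 1 | 79.4 | 0.568 | 0.206 |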

Comparing the new models with the wild-type complex shows that the ipTM values were slightly lower. In the wild-type prediction, the best model showed an ipTM value of 0.373, while after mutation the highest ipTM value decreased to 0.291. Since ipTM reflects confidence in the predicted interface between the two proteins, this decrease suggests that the interaction between the L-protein and DnaJ may have become weaker after the mutations were introduced, although both values lie in the low-confidence range. This reduction is consistent with the mutation strategy, where several key residues involved in hydrophobic and electrostatic contacts were replaced with alanine in order to remove their side-chain interactions.

Despite these changes, the overall structural confidence of the models remained similar to the wild-type predictions (pLDDT of about 77 to 79 in both sets of models), suggesting that the predicted fold is not strongly affected by the substitutions. Therefore, these results suggest that the designed mutations may reduce the dependence of the L-protein on the DnaJ chaperone while maintaining a stable protein structure. This computational approach demonstrates how targeted mutagenesis combined with AlphaFold2-Multimer predictions can be used to design L-protein variants with potentially lower chaperone dependency.

In this homework, ChatGPT assisted me in organizing and clearly articulating my answers and descriptions, ensuring that the content is well-structured and easy to understand.

Week 06 HW: genetic circuits part-I

cover image cover image
Assignment: DNA Assembly

Assignees for this section

MIT/Harvard students Required

Committed Listeners Required

Answer these questions about the protocol in this week’s lab:

  1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?
  2. What are some factors that determine primer annealing temperature during PCR?
  3. There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.
  4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?
  5. How does the plasmid DNA enter the E. coli cells during transformation?
  6. Describe another assembly method in detail (such as Golden Gate Assembly)
    1. Explain the other method in 5 - 7 sentences plus diagrams (either handmade or online).
    2. Model this assembly method with Benchling or Asimov Kernel!

Answer 01:

The Phusion High-Fidelity PCR Master Mix is a ready-to-use solution used for PCR amplification with high accuracy. It already contains the main components needed for DNA amplification.

  • Phusion DNA Polymerase

Phusion DNA polymerase is the enzyme that copies the DNA during PCR. It synthesizes new DNA strands using the template DNA. This polymerase has proofreading activity, which helps detect and correct errors during DNA synthesis, making the amplification very accurate.

  • Reaction Buffer

The reaction buffer provides the optimal chemical environment for the polymerase to function properly. It maintains the correct pH and salt conditions needed for efficient DNA amplification. Different buffers can be used depending on the DNA template, such as HF buffer for standard templates or GC buffer for GC-rich DNA.

  • MgCl₂ (Magnesium Chloride)

Magnesium ions are essential cofactors for DNA polymerase activity. They help stabilize the interaction between the enzyme, the primers, and the DNA template during DNA synthesis.

  • dNTPs (Deoxynucleotide Triphosphates)

dNTPs are the building blocks used to synthesize new DNA strands. They include dATP, dTTP, dCTP, and dGTP. During PCR, the polymerase adds these nucleotides to the growing DNA strand according to complementary base-pairing rules.

  • Additional additives (e.g., DMSO)

Some reactions may include additives such as DMSO, which helps improve amplification of GC-rich DNA by reducing secondary structures and improving primer binding.

Answer 02:

The primer annealing temperature (Ta) is the temperature at which primers bind to the DNA template during PCR. It mainly depends on the primer melting temperature (Tm). In practice, the annealing temperature is usually set about 3–5 °C lower than the lowest primer Tm so that the primers can bind correctly to the DNA template.

Important factors (from the lab and lecture)

-Primer melting temperature (Tm)

The melting temperature is the temperature at which 50 % of the primer–DNA duplex separates into single strands. It is the main factor used to determine the annealing temperature.

-Primer length (18–22 nucleotides)

Primers are usually designed with a length of 18–22 bases. This length provides good specificity and stable binding to the template DNA.

-GC content (40–60 %)

Primers should contain about 40–60 % GC bases. GC pairs form stronger bonds than AT pairs, which increases the stability of primer binding.

-GC clamp (≤3 GC bases at the 3′ end)

A small number of GC bases at the 3′ end of the primer (called a GC clamp) helps the primer bind more strongly to the template DNA and improves PCR efficiency.

-Primer secondary structures

Primers should avoid forming hairpins, self-dimers, or cross-dimers. These structures prevent primers from binding properly to the template DNA.

-Recommended primer Tm range (52–58 °C)

Primers are usually designed to have a melting temperature between 52 °C and 58 °C, which allows efficient and specific amplification.

-GC sequence composition

Primers with higher GC content bind more strongly because GC base pairs form three hydrogen bonds, while AT pairs form only two.

Additional factors

-Ionic environment (Mg²⁺ and salt concentration)

Ions such as Mg²⁺ and other salts stabilize the DNA double strand and influence primer binding. Changes in these concentrations can affect the optimal annealing temperature.

-Primer concentration

Higher primer concentrations increase the probability that primers will bind to the DNA template, which can influence the optimal annealing temperature.

-Optimization using gradient PCR

In many experiments, scientists perform gradient PCR to test different annealing temperatures and find the best one for efficient and specific amplification.
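The design rules above can be turned into a small checker. This is a rough sketch under stated assumptions: the Tm is estimated with the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is only a coarse approximation for short primers, the GC clamp is counted over the last five bases, and the two primer sequences are hypothetical examples, not primers from the lab.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough Tm estimate in °C: 2 per A/T plus 4 per G/C (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def check_primer(primer):
    """Apply the length, GC-content and GC-clamp rules listed above."""
    primer = primer.upper()
    gc_in_last_five = sum(base in "GC" for base in primer[-5:])  # assumed window
    return {
        "length_ok": 18 <= len(primer) <= 22,
        "gc_ok": 0.40 <= gc_fraction(primer) <= 0.60,
        "clamp_ok": 1 <= gc_in_last_five <= 3,
        "tm_wallace": wallace_tm(primer),
    }


def annealing_temperature(primers, offset=5):
    """Ta set `offset` °C below the lowest primer Tm (3-5 °C in the text)."""
    return min(wallace_tm(p) for p in primers) - offset


FWD = "ATGACCATGATTACGGATTC"  # hypothetical forward primer
REV = "GCTTACTGGATCAGGTCAAC"  # hypothetical reverse primer

print(check_primer(FWD))
print(check_primer(REV))
print("Suggested Ta:", annealing_temperature([FWD, REV]))
```

In practice a vendor Tm calculator and a gradient PCR would still be used to confirm the value, since salt and Mg²⁺ concentrations are not part of this estimate.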

Answer 03:

PCR (Polymerase Chain Reaction) and restriction enzyme digestion are two common molecular biology methods used to produce linear DNA fragments, but they work in very different ways. PCR works by amplifying a specific DNA sequence, while restriction enzymes cut existing DNA at specific recognition sites. Both techniques are widely used in cloning and DNA assembly experiments, including methods such as Gibson Assembly.

PCR is a technique used to copy a specific region of DNA many times. It starts with a small amount of template DNA and uses specific primers that bind to the target sequence. A heat-stable DNA polymerase enzyme (such as Taq or Phusion polymerase) then synthesizes new DNA strands. The reaction takes place in a thermal cycler, which repeatedly changes the temperature through three main steps: denaturation, where the DNA strands separate; annealing, where primers bind to the template DNA; and extension, where the polymerase enzyme copies the DNA. After many cycles, PCR produces large amounts of a specific linear DNA fragment. One advantage of PCR is that researchers can design primers to add new sequences to the ends of the fragment, such as restriction sites or overlapping regions for Gibson Assembly.

Restriction enzyme digestion works differently. Instead of amplifying DNA, it cuts existing DNA molecules at specific short sequences called recognition sites. Restriction enzymes recognize these sequences and cut the DNA at or near those locations. In a typical protocol, the DNA (for example a plasmid) is mixed with the restriction enzyme and a specific buffer, and the reaction is incubated at a constant temperature, usually around 37 °C, for about one hour. The enzyme then cuts the DNA to produce linear fragments. Depending on the enzyme, the cut DNA can produce sticky ends (overhangs) or blunt ends, which can be used for cloning.

These two methods are used in different situations depending on the goal of the experiment. PCR is preferred when the DNA is present in low concentration, because it can amplify a very small amount of template into large quantities. PCR is also useful when researchers want to introduce new sequences, mutations, or overlaps into the DNA fragment. For example, primers can be designed to add restriction sites, promoter sequences, or homologous overlaps needed for Gibson Assembly. PCR is also commonly used when scientists want to isolate a specific gene or region from genomic DNA.

Restriction enzyme digestion is more suitable when the DNA is already available in large quantities, such as a purified plasmid. It is commonly used when researchers want to cut DNA at precise and known locations to isolate fragments or prepare a plasmid for cloning. Restriction enzymes are also often used for diagnostic analysis, such as verifying plasmid identity or checking the size of DNA fragments through restriction mapping.
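The contrast between the two methods can be sketched in code: PCR multiplies copies of one region, while a digest splits existing DNA at fixed sites. The example below is an idealized illustration. It assumes perfect doubling in every PCR cycle, models only the top strand of a linear molecule, and uses EcoRI (G^AATTC) as an example enzyme that is not taken from the lab protocol; the test sequence is made up.

```python
def pcr_copies(template_copies, cycles):
    """Idealized PCR yield: the target doubles in every cycle."""
    return template_copies * 2 ** cycles


def digest_linear(seq, site="GAATTC", cut_offset=1):
    """Cut a linear sequence at every occurrence of `site`.

    `cut_offset` is the top-strand cut position inside the site
    (EcoRI cuts G^AATTC, so the offset is 1). Only the top strand is modeled.
    """
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


TOY_DNA = "AAAA" + "GAATTC" + "CCCCCC" + "GAATTC" + "TT"  # hypothetical sequence

print(pcr_copies(10, 30))
print([len(fragment) for fragment in digest_linear(TOY_DNA)])
```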

Answer 04:

To ensure that DNA fragments produced by PCR or restriction digestion are suitable for Gibson Assembly, several preparation and verification steps must be followed. Gibson Assembly joins DNA fragments that contain overlapping homologous sequences, so the fragments must be designed carefully and purified before the assembly reaction.

The first and most important step is primer design. Primers used in PCR should include overlapping sequences of about 20–40 base pairs that match the ends of the neighboring DNA fragment. These overlaps allow the fragments to align and assemble correctly during the Gibson reaction. The overlapping regions should have similar melting temperatures (Tm) to allow stable annealing during the isothermal reaction. It is also important to design overlaps with a balanced GC content and to avoid strong secondary structures such as hairpins, because these structures can reduce assembly efficiency.
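The overlap design described above can be sketched as follows. This is a simplified illustration, assuming a 20-nt overlap and a 20-nt annealing region; the function name `gibson_primers` and all three sequences are hypothetical, and Tm and secondary-structure checks are left out.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gibson_primers(upstream, insert, downstream, overlap=20, anneal=20):
    """Primers that add Gibson overlaps to an insert.

    Forward primer: end of the upstream fragment + start of the insert.
    Reverse primer: reverse complement of (end of the insert + start of
    the downstream fragment).
    """
    forward = upstream[-overlap:] + insert[:anneal]
    reverse = revcomp(insert[-anneal:] + downstream[:overlap])
    return forward, reverse


# Hypothetical sequences, for illustration only.
UPSTREAM = "ACGT" * 10
INSERT = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCTAA"
DOWNSTREAM = "TTGACA" * 5

fwd, rev = gibson_primers(UPSTREAM, INSERT, DOWNSTREAM)
print(fwd)
print(rev)
```

The 5′ half of each primer does not anneal to the insert template; it only supplies the homology that the neighboring fragment needs during assembly.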

Another important step is using a high-fidelity DNA polymerase, such as Phusion or Q5 polymerase, during PCR amplification. These enzymes have proofreading activity and reduce the number of mutations introduced during amplification. This is important because Gibson Assembly is often used to construct precise DNA sequences or multi-fragment plasmids.

After PCR amplification, the DNA fragments should be verified using agarose gel electrophoresis to confirm that the fragments have the expected size. The correct DNA bands are then purified from the gel to remove primers, nucleotides, enzymes, and non-specific products that might interfere with the assembly reaction.

To reduce background contamination from the original template plasmid, PCR products can be treated with the restriction enzyme DpnI, which digests methylated template DNA but does not affect the newly synthesized PCR fragments.

If a plasmid backbone is used, the vector must be completely linearized before Gibson Assembly. This can be done by restriction enzyme digestion or PCR. When restriction enzymes are used, it is important to ensure that the digestion is complete so that no circular plasmid remains, because this could produce unwanted background colonies during transformation.

Another important step is DNA quantification. The concentration of each DNA fragment should be carefully measured using methods such as fluorometric quantification (for example Qubit) or gel analysis. The correct molar ratio of vector to insert fragments, often about 1:2 or 1:3, helps improve assembly efficiency.
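The molar ratio can be worked out from mass and length. The sketch below assumes an average mass of about 650 g/mol per base pair of double-stranded DNA; the example amounts (50 ng of a 5000 bp vector, a 1000 bp insert) are hypothetical.

```python
def pmol_dsdna(mass_ng, length_bp):
    """pmol of double-stranded DNA, assuming about 650 g/mol per base pair."""
    return mass_ng * 1000 / (length_bp * 650)


def insert_ng_for_ratio(vector_ng, vector_bp, insert_bp, insert_to_vector=3):
    """Mass of insert (ng) needed for a given insert:vector molar ratio."""
    return vector_ng * (insert_bp / vector_bp) * insert_to_vector


vector_ng, vector_bp, insert_bp = 50, 5000, 1000  # hypothetical amounts
insert_ng = insert_ng_for_ratio(vector_ng, vector_bp, insert_bp, insert_to_vector=3)

print("insert needed (ng):", insert_ng)
print("vector pmol:", pmol_dsdna(vector_ng, vector_bp))
print("insert pmol:", pmol_dsdna(insert_ng, insert_bp))
```

Because the insert is shorter than the vector, a 1:3 molar ratio requires less insert by mass than vector, which is easy to get wrong when pipetting by nanograms alone.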

Finally, after Gibson Assembly and bacterial transformation, the resulting plasmid constructs are usually verified by DNA sequencing to confirm that the fragments assembled correctly and that no mutations were introduced during PCR.

Answer 05:

Plasmid DNA enters Escherichia coli cells during a process called bacterial transformation. In this process, the bacterial cells must first be made competent, meaning their membranes become temporarily able to allow DNA molecules to enter.

In the most common method, called chemical transformation, the cells are treated with a solution containing calcium chloride (CaCl₂). The calcium ions (Ca²⁺) play an important role because they neutralize the negative charges on both the plasmid DNA and the phospholipids of the bacterial membrane. Normally, DNA and the membrane repel each other because they are both negatively charged. The calcium ions reduce this repulsion and allow the plasmid DNA to attach to the surface of the bacterial cell.

After mixing the plasmid DNA with the competent cells, the mixture is kept on ice (around 0 °C) for a short time. The cells are then exposed to a brief heat shock, usually at about 42 °C for 30–60 seconds. This sudden temperature change creates a strong thermal gradient between the cold cells and the warm environment. As a result, the bacterial membrane is thought to become temporarily destabilized so that small pores form, allowing the plasmid DNA to pass into the cell.

Immediately after the heat shock, the cells are placed back on ice. This rapid cooling helps close the pores and stabilize the membrane again. The cells are then transferred into a nutrient recovery medium and incubated for a short period. During this recovery step, the cells repair their membranes and begin expressing the antibiotic resistance gene carried by the plasmid.

Finally, the bacteria are plated on agar plates containing the appropriate antibiotic. Only the cells that successfully received the plasmid DNA will survive and form colonies.

An alternative method used to introduce plasmid DNA into E. coli is electroporation. In this method, electrocompetent bacterial cells are mixed with plasmid DNA and placed in a special electroporation cuvette. A short electrical pulse is then applied using an electroporator. The electrical pulse temporarily creates small pores in the bacterial cell membrane, allowing the plasmid DNA to pass directly into the cell.

After the pulse, the membrane quickly reseals and the cells recover in a nutrient medium. Electroporation is often more efficient than chemical transformation and is commonly used when transforming difficult DNA constructs or when very high transformation efficiency is required.

Answer 06:

Another DNA assembly method is Golden Gate Assembly, which allows several DNA fragments to be joined together in a single reaction. This technique uses special restriction enzymes called Type IIS restriction enzymes, such as BsaI or BsmBI, together with T4 DNA Ligase. Unlike traditional restriction enzymes, Type IIS enzymes cut outside their recognition sequence, which allows scientists to design custom 4-nucleotide overhangs at the ends of DNA fragments. These overhangs are designed so that fragments can only join with the correct neighboring fragment, ensuring the correct order and orientation of the assembled DNA. During the reaction, the restriction enzyme cuts the DNA fragments and creates the overhangs, and the DNA ligase joins the fragments together. The recognition sites of the restriction enzyme are removed during assembly, which means the final DNA construct cannot be cut again by the same enzyme. The digestion and ligation steps occur in the same tube using alternating temperatures, making Golden Gate Assembly a very efficient method for assembling multiple DNA fragments, especially in synthetic biology and modular cloning experiments.

[Diagram: Golden Gate Assembly]

This diagram is a clear example of Golden Gate Assembly, a cloning method that joins several DNA fragments in one reaction. In the example, three DNA parts — Promoter (Fragment A), ORF (Fragment B), and Terminator (Fragment C) — are assembled into a final plasmid called the destination vector. The process uses the Type IIS restriction enzyme BsaI together with T4 DNA Ligase.

In the first step, each fragment is present in an entry vector that contains the BsaI recognition site (GGTCTC). Unlike classical restriction enzymes, BsaI cuts outside of its recognition site, generating specific 4-base pair sticky ends (overhangs). Because the cut occurs outside the recognition sequence, the recognition site is removed during assembly and does not remain in the final DNA construct.

The fragments are designed with specific overhangs so they connect in the correct order. For example, Fragment A ends with the overhang CCAC, which matches the beginning of Fragment B. Fragment B ends with CGAT, which matches the start of Fragment C. These complementary overhangs act like puzzle pieces, ensuring that the fragments assemble correctly and in the proper orientation.

All fragments, the destination vector, BsaI, and T4 DNA Ligase are mixed in a single tube. During the reaction, BsaI cuts the DNA fragments to create sticky ends, and T4 DNA ligase joins fragments with matching overhangs. The reaction cycles between temperatures that allow DNA digestion and ligation, gradually assembling the correct construct.

Once fragments are ligated together, the BsaI recognition sites are no longer present, so the final product cannot be cut again by the enzyme. This makes the process efficient and irreversible, allowing the formation of a seamless DNA construct containing Fragment A + Fragment B + Fragment C in the destination plasmid.
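The digestion and ligation logic described above can be simulated on toy sequences. This is a minimal sketch, not a replacement for Benchling: it models only the top strand, produces a linear product without the destination vector, and uses short made-up part sequences. Only the BsaI site (GGTCTC, cutting 1 nt and 5 nt downstream) and the CCAC and CGAT overhangs come from the text; the outer overhangs GGAG and GCTT and the helper names are assumptions.

```python
BSAI_FWD = "GGTCTC"  # recognition site as read on the top strand
BSAI_REV = "GAGACC"  # the same site on the bottom strand (reverse complement)


def make_part(core):
    """Flank a core (overhang + payload + overhang) with inward-facing BsaI sites."""
    return BSAI_FWD + "A" + core + "T" + BSAI_REV


def bsai_release(part):
    """Return the top strand of the fragment released by BsaI.

    BsaI cuts 1 nt past its site on one strand and 5 nt past it on the other,
    leaving a 4-nt overhang. The returned string starts with the left
    overhang and ends with the right overhang.
    """
    left_site = part.find(BSAI_FWD)
    right_site = part.rfind(BSAI_REV)
    if left_site == -1 or right_site == -1 or right_site < left_site:
        raise ValueError("part needs a forward site upstream and a reverse site downstream")
    return part[left_site + 7:right_site - 1]


def golden_gate(parts):
    """Join released fragments in order, requiring matching 4-nt overhangs."""
    fragments = [bsai_release(part) for part in parts]
    product = fragments[0]
    for fragment in fragments[1:]:
        if product[-4:] != fragment[:4]:
            raise ValueError(f"incompatible overhangs: {product[-4:]} vs {fragment[:4]}")
        product += fragment[4:]  # the shared overhang appears only once
    return product


# Toy parts: overhang + short payload + overhang (payloads are made up).
PROMOTER = make_part("GGAG" + "TTGACA" + "CCAC")
ORF = make_part("CCAC" + "ATGAAA" + "CGAT")
TERMINATOR = make_part("CGAT" + "TAATAA" + "GCTT")

product = golden_gate([PROMOTER, ORF, TERMINATOR])
print(product)
print("BsaI sites left:", BSAI_FWD in product or BSAI_REV in product)
```

Passing the parts in a different order raises an error, which mirrors how mismatched overhangs prevent incorrect ligation in the real reaction.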

  • Modeling Golden Gate Assembly in Benchling

In this part, I modeled a Golden Gate Assembly to construct a genetic circuit for my second project, which is the engineering of an Escherichia coli reporter strain to monitor protein aging using a fluorescent timer protein.

First, I selected all the genetic elements needed for my construct. The backbone plasmid was obtained from Addgene, and it already contains a T7 promoter, a ribosome binding site (RBS), and a T7 terminator, which are very suitable for strong expression of the inserted gene. This vector also includes the GST (Glutathione S-Transferase from Schistosoma japonicum), which I used as the protein of interest because it has stable folding and is suitable for initial testing of my genetic system.

Then, I designed two additional fragments: a flexible linker (Gly₄Ser)₃ and a fluorescent timer (FT) protein (Medium FT). Their sequences were also obtained from Addgene. The linker separates the GST protein from the fluorescent timer so that both can fold properly, while the FT protein provides a signal that changes over time, allowing estimation of protein age inside the cell.

  • The full sequence of the pET28a-GST-P(11)4:
GGTTTGCGTATTGGGCGCCAGGGTGGTTTTTCTTTTCACCAGTGAGACGGGCAACAGCTGATTGCCCTTCACCGCCTGGCCCTGAGAGAGTTGCAGCAAGCGGTCCACGCTGGTTTGCCCCAGCAGGCGAAAATCCTGTTTGATGGTGGTTAACGGCGGGATATAACATGAGCTGTCTTCGGTATCGTCGTATCCCACTACCGAGATATCCGCACCAACGCGCAGCCCGGACTCGGTAATGGCGCGCATTGCGCCCAGCGCCATCTGATCGTTGGCAACCAGCATCGCAGTGGGAACGATGCCCTCATTCAGCATTTGCATGGTTTGTTGAAAACCGGACATGGCACTCCAGTCGCCTTCCCGTTCCGCTATCGGCTGAATTTGATTGCGAGTGAGATATTTATGCCAGCCAGCCAGACGCAGACGCGCCGAGACAGAACTTAATGGGCCCGCTAACAGCGCGATTTGCTGGTGACCCAATGCGACCAGATGCTCCACGCCCAGTCGCGTACCGTCTTCATGGGAGAAAATAATACTGTTGATGGGTGTCTGGTCAGAGACATCAAGAAATAACGCCGGAACATTAGTGCAGGCAGCTTCCACAGCAATGGCATCCTGGTCATCCAGCGGATAGTTAATGATCAGCCCACTGACGCGTTGCGCGAGAAGATTGTGCACCGCCGCTTTACAGGCTTCGACGCCGCTTCGTTCTACCATCGACACCACCACGCTGGCACCCAGTTGATCGGCGCGAGATTTAATCGCCGCGACAATTTGCGACGGCGCGTGCAGGGCCAGACTGGAGGTGGCAACGCCAATCAGCAACGACTGTTTGCCCGCCAGTTGTTGTGCCACGCGGTTGGGAATGTAATTCAGCTCCGCCATCGCCGCTTCCACTTTTTCCCGCGTTTTCGCAGAAACGTGGCTGGCCTGGTTCACCACGCGGGAAACGGTCTGATAAGAGACACCGGCATACTCTGCGACATCGTATAACGTTACTGGTTTCACATTCACCACCCTGAATTGACTCTCTTCCGGGCGCTATCATGCCATACCGCGAAAGGTTTTGCGCCATTCGATGGTGTCCGGGATCTCGACGCTCTCCCTTATGCGACTCCTGCATTAGGAAGCAGCCCAGTAGTAGGTTGAGGCCGTTGAGCACCGCCGCCGCAAGGAATGGTGCATGCAAGGAGATGGCGCCCAACAGTCCCCCGGCCACGGGGCCTGCCACCATACCCACGCCGAAACAAGCGCTCATGAGCCCGAAGTGGCGAGCCCGATCTTCCCCATCGGTGATGTCGGCGATATAGGCGCCAGCAACCGCACCTGTGGCGCCGGTGATGCCGGCCACGATGCGTCCGGCGTAGAGGATCGAGATCTCGATCCCGCGAAATTAATACGACTCACTATAGGGGAATTGTGAGCGGATAACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACCATGGGCCATCATCATCATCATCATAGCCCGATCCTGGGTTACTGGAAAATCAAGGGCCTGGTGCAACCGACCCGCCTGCTGCTGGAATACCTGGAGGAAAAATACGAGGAACACCTGTATGAGCGTGACGAAGGCGATAAGTGGCGTAACAAGAAATTCGAGCTGGGTCTGGAATTTCCGAACCTGCCGTACTATATTGACGGCGATGTGAAACTGACCCAGAGCATGGCGATCATTCGTTACATCGCGGACAAACACAACATGCTGGGTGGCTGCCCGAAGGAGCGTGCGGAAATTAGCATGCTGGAGGGCGCGGTGCTGGATATTCGTTACGGTGTTAGCCGTATCGCGTATAGCAAAGACTTCGAAACCCTGAAGGTGGATTTTCTGAGCAAACTGCCGGAGATGCTGAAGATGTTCGAGGACCGTCTGTGCCACAAAACCTATCTGAACGGTGACCACGTTACCCACCCGGATTTTATGCTGTACGACGCGCTGGATGTGGTTCTGTATATGGAC
CCGATGTGCCTGGATGCGTTCCCGAAGCTGGTTTGCTTTAAGAAACGTATCGAGGCGATTCCGCAAATCGACAAGTACCTGAAAAGCAGCAAGTATATTGCGTGGCCGCTGCAAGGTTGGCAAGCGACCTTTGGTGGCGGTGATCACCCGCCGAAGGGTGGCGGTGGTAGCGGCGGCGGCGGCAGCCAACAGCGTTTTGAATGGGAATTTGAACAGCAGTAATAACTCGAGCACCACCACCACCACCACTGAGATCCGGCTGCTAACAAAGCCCGAAAGGAAGCTGAGTTGGCTGCTGCCACCGCTGAGCAATAACTAGCATAACCCCTTGGGGCCTCTAAACGGGTCTTGAGGGGTTTTTTGCTGAAAGGAGGAACTATATCCGGATTGGCGAATGGGACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGCTTACAATTTAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAATTAATTCTTAGAAAAACTCATCGAGCATCAAATGAAACTGCAATTTATTCATATCAGGATTATCAATACCATATTTTTGAAAAAGCCGTTTCTGTAATGAAGGAGAAAACTCACCGAGGCAGTTCCATAGGATGGCAAGATCCTGGTATCGGTCTGCGATTCCGACTCGTCCAACATCAATACAACCTATTAATTTCCCCTCGTCAAAAATAAGGTTATCAAGTGAGAAATCACCATGAGTGACGACTGAATCCGGTGAGAATGGCAAAAGTTTATGCATTTCTTTCCAGACTTGTTCAACAGGCCAGCCATTACGCTCGTCATCAAAATCACTCGCATCAACCAAACCGTTATTCATTCGTGATTGCGCCTGAGCGAGACGAAATACGCGATCGCTGTTAAAAGGACAATTACAAACAGGAATCGAATGCAACCGGCGCAGGAACACTGCCAGCGCATCAACAATATTTTCACCTGAATCAGGATATTCTTCTAATACCTGGAATGCTGTTTTCCCGGGGATCGCAGTGGTGAGTAACCATGCATCATCAGGAGTACGGATAAAATGCTTGATGGTCGGAAGAGGCATAAATTCCGTCAGCCAGTTTAGTCTGACCATCTCATCTGTAACATCATTGGCAACGCTACCTTTGCCATGTTTCAGAAACAACTCTGGCGCATCGGGCTTCCCATACAATCGATAGATTGTCGCACCTGATTGCCCGACATTATCGCGAGCCCATTTATACCCATATAAATCAGCATCCATGTTGGAATTTAATCGCGGCCTAGAGCAAGACGTTTCCCGTTGAATATGGCTCATAACACCCCTTGTATTACTGTTTATGTAAGCAGACAGTTTTATTGTTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGG
CTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGTTCTTTCCTGCGTTATCCCCTGATTCTGTGGATAACCGTATTACCGCCTTTGAGTGAGCTGATACCGCTCGCCGCAGCCGAACGACCGAGCGCAGCGAGTCAGTGAGCGAGGAAGCGGAAGAGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCAATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGTATACACTCCGCTATCGCTACGTGACTGGGTCATGGCTGCGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGAGGCAGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGCGATTCACAGATGTCTGCCTGTTCATCCGCGTCCAGCTCGTTGAGTTTCTCCAGAAGCGTTAATGTCTGGCTTCTGATAAAGCGGGCCATGTTAAGGGCGGTTTTTTCCTGTTTGGTCACTGATGCCTCCGTGTAAGGGGGATTTCTGTTCATGGGGGTAATGATACCGATGAAACGAGAGAGGATGCTCACGATACGGGTTACTGATGATGAACATGCCCGGTTACTGGAACGTTGTGAGGGTAAACAACTGGCGGTATGGATGCGGCGGGACCAGAGAAAAATCACTCAGGGTCAATGCCAGCGCTTCGTTAATACAGATGTAGGTGTTCCACAGGGTAGCCAGCAGCATCCTGCGATGCAGATCCGGAACATAATGGTGCAGGGCGCTGACTTCCGCGTTTCCAGACTTTACGAAACACGGAAACCGAAGACCATTCATGTTGTTGCTCAGGTCGCAGACGTTTTGCAGCAGCAGTCGCTTCACGTTCGCTCGCGTATCGGTGATTCATTCTGCTAACCAGTAAGGCAACCCCGCCAGCCTAGCCGGGTCCTCAACGACAGGAGCACGATCATGCGCACCCGTGGGGCCGCCATGCCGGCGATAATGGCCTGCTTCTCGCCGAAACGTTTGGTGGCGGGACCAGTGACGAAGGCTTGAGCGAGGGCGTGCAAGATTCCGAATACCGCAAGCGACAGGCCGATCATCGTCGCGCTCCAGCGAAAGCGGTCCTCGCCGAAAATGACCCAGAGCGCTGCCGGCACCTGTCCTACGAGTTGCATGATAAAGAAGACAGTCATAAGTGCGGCGACGATAGTCATGCCCCGCGCCCACCGGAAGGAGCTGACTGGGTTGAAGGCTCTCAAGGGCATCGGTCGAGATCCCGGTGCCTAATGAGTGAGCTAACTTACATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAATGAATCGGCCAACGCGCGGGGAGAGGC
  • Linker sequence (Gly₄Ser)₃ :
GGTGGCGGTGGCTCGGGCGGTGGTGGGTCGGGTGGCGGCGGATCT
  • Medium-FT sequence:
atggtgagcaagggcgaggaggataacatggccatcatcaaggaattcatgcgtttcaaggtgcacctggagggctccgtggacggccacgagttcgagatcgagggcgagggcgagggccgcccctacgagggcacccagagcgccaagctgaaggtgaccaagggtggccccctgcccttcgcctgggacatcctgtcccctcagttcatgtacggctccagggcctacgtgaagcaccccgccgacatccccgactactggaagctgtccttccccgagggcttcaagtgggagcgcgtgatgaacttcgaggatggcggcgtggtgaccgtgacccaggactcctccctgcaggacggcgagttcatctacaaggtgaagctgcgcggcaccaacttcccttccgacggccccgtaatgcagaagaagaccatgggctgggaggcctcctccgagcggatataccccgaggacggcgccctgaagggcgagatcaagcagaggctgaagctgaaggacggcggccactacgacgctgaggtcaagaccacctacaaggccaagaagcccgtgctgctgcccggcgcctacaacgtcaacatcaagatggacatcacctcccacaacgaggactacaccatcgttgaacagtgcgaacgcgccgagggccaccattccaccggcggcatggacgagctgtacaagtaa
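Before using the linker in an assembly, it is useful to confirm that its DNA sequence encodes (Gly₄Ser)₃ and carries no internal BsaI site. The sketch below builds the standard genetic code table and applies both checks to the linker sequence given above; the function names are illustrative.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

LINKER = "GGTGGCGGTGGCTCGGGCGGTGGTGGGTCGGGTGGCGGCGGATCT"


def translate(dna):
    """Translate a coding sequence whose length is a multiple of three."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def has_bsai_site(dna):
    """True if a BsaI site is present on either strand."""
    dna = dna.upper()
    return "GGTCTC" in dna or "GAGACC" in dna


print(translate(LINKER))
print("internal BsaI site:", has_bsai_site(LINKER))
```

The same two functions can be applied to the Medium-FT sequence, since `translate` and `has_bsai_site` both convert their input to upper case.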

At the beginning, I manually designed the overhangs based on the coding sequence. I assumed that the last four nucleotides of the GST sequence (GAAG) would serve as the correct overhang to connect with the next fragment. Based on this assumption, I designed the linker fragment to have compatible overhangs (GAAG and GGTA). Similarly, I defined the overhang between the linker and the fluorescent timer protein (GGTA) in order to maintain a continuous reading frame. During this step, I also verified that no frameshift was introduced at the junctions and that the coding sequence remained in frame across all fragments, as indicated in the following table:

| Junction | DNA Sequence | Resulting Amino Acids | Status |
| --- | --- | --- | --- |
| GST to Linker | ...AAG GGT... | Lys - Gly | In Frame |
| Linker to FT | ...TCT CCG GTA ATG... | Ser - Pro - Val - Met | In Frame |
| FT to 6xHis | ...AAG AAG CAC... | Lys - Lys - His | In Frame |

In addition, I checked that all BsaI restriction enzyme recognition sites were positioned outside of the fragments that would be recovered after digestion, ensuring that the internal sequences of the inserts would not be disrupted during the assembly process. The designed overhangs are as follows:

[Figure: designed overhangs]

The designed overhangs are supposed to orient the assembly in the following order: the linker is placed immediately after the GST sequence, and the Medium FT is positioned just before the C-terminal His tag, as indicated in the following diagram:

[Diagram: order of the fragments in the assembly]

After preparing the vector and all fragments, the designed vector digestion cuts were defined as follows:

[Figure: vector digestion cuts]

The designed linker fragment sticky ends were defined as follows:

[Figure: linker fragment sticky ends]

The designed Medium FT sticky ends were defined as follows:

[Figure: Medium FT sticky ends]

Be careful! A critical point to consider during the design is the correct placement of BsaI restriction enzyme recognition sites. For the inserted fragments, the BsaI sites must be located outside of the sequences of interest so that they are removed during digestion and do not remain in the final construct. In contrast, for the backbone vector, the BsaI sites must be positioned within the region to be replaced, so that digestion removes this segment and allows the insertion of the designed fragments.

It is also essential to ensure that the BsaI recognition sites are oriented correctly (inverted orientation) to generate the desired overhangs and to cut the backbone precisely at the intended insertion site. Any incorrect placement or orientation of these sites can lead to incompatible sticky ends and result in assembly failure.

I imported all sequences into Benchling and created a new assembly using the Golden Gate cloning option. I selected the pET-28 plasmid as the backbone and added the designed fragments, including the linker and the fluorescent timer protein, as inserts. I specified the use of the BsaI restriction enzyme and defined the final construct as circular. Since all sequences were already designed with appropriate BsaI recognition sites, I selected the option to use existing restriction sites for fragment generation. I then attempted to run the assembly.


However, the assembly failed, and Benchling returned an error indicating that the sticky ends were incompatible. Specifically, the system showed a mismatch between the overhangs “AAGC” (from the vector) and “GAAG” (from the insert). This result indicated that the fragments could not ligate properly.

After analyzing this issue, I realized that the mistake came from misunderstanding how Golden Gate Assembly works. I initially assumed that the overhang corresponds directly to visible nucleotides in the sequence. In reality, the overhang is determined by the position of the BsaI cutting site, not simply by the sequence at the end of the gene. Since BsaI cuts outside of its recognition site, the actual generated overhang in the vector was “AAGC” and not “GAAG” as I had expected.

This mismatch between expected and real overhangs caused the failure of the assembly. Additionally, the cloning workflow in Benchling does not automatically correct or reinterpret overhangs; it strictly checks for compatibility. Therefore, any small design error leads to a complete assembly failure.
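The source of this kind of mismatch can be illustrated with a toy sequence. In the sketch below the overhang is derived from the position of the BsaI site rather than from the end of the coding sequence. The sequence `TOY` is hypothetical and is not the actual pET-28 vector sequence; it is only built so that the two answers differ in the same way as in the Benchling error.

```python
def overhang_before_reverse_site(seq):
    """4-nt top-strand overhang left by a reverse-oriented BsaI site (GAGACC).

    The enzyme cuts upstream of the site, 1 nt away on one strand and 5 nt
    away on the other, so the overhang is the four bases that end one base
    before the site.
    """
    seq = seq.upper()
    site = seq.find("GAGACC")
    if site < 5:
        raise ValueError("no reverse BsaI site with enough upstream sequence")
    return seq[site - 5:site - 1]


CODING_END = "ATGGAAG"                   # toy coding sequence ending in GAAG
TOY = CODING_END + "CA" + "GAGACC"       # two spacer bases, then the BsaI site

naive_guess = CODING_END[-4:]            # "last four nucleotides of the gene"
actual = overhang_before_reverse_site(TOY)

print("naive guess:", naive_guess)
print("overhang from cut position:", actual)
```

Shifting the BsaI site by a single base changes the overhang, which is why the cut position has to be counted from the recognition site in every fragment.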

In order to overcome the limitations encountered in the first approach, I tried another method available in Benchling by using the Assembly tool dedicated to multi-fragment cloning. This method is specifically designed to simulate Golden Gate Assembly in a more automated and flexible way, allowing better handling of fragment compatibility and overhang generation.

First, I opened the Assembly tool from the bottom toolbar and created a new assembly. I then added all the required DNA sequences, including the pET-28 plasmid as the backbone and the designed fragments (linker and Medium FT) as inserts. After that, I selected the BsaI restriction enzyme as the Type IIS restriction enzyme used for the assembly.

Unlike the previous method, this approach automatically analyzed the positions of the BsaI recognition sites and simulated the digestion process. It generated the correct sticky ends based on the actual cutting positions of the enzyme and evaluated the compatibility between fragments. This allowed the system to correctly align and assemble the different parts according to their matching overhangs.

After running the assembly, the construct was successfully generated as a circular plasmid. I carefully verified that all fragments were assembled in the correct order and orientation. I also confirmed that no frameshift was introduced across the junctions and that the reading frame was maintained from the GST sequence through the linker and into the fluorescent timer protein. In addition, I checked that no unwanted BsaI sites remained inside the final construct and that all restriction sites had been properly removed during the assembly process.


This is the direct link to Benchling for this assembly: using the assembly tool

In this homework, ChatGPT assisted me in organizing and clearly articulating my answers and descriptions, ensuring that the content is well-structured and easy to understand.

Assignment: Asimov Kernel

Assignees for this section

MIT/Harvard students Required

Committed Listeners Required

  1. Create a Repository for your work
  2. Create a blank Notebook entry to document the homework and save it to that Repository
  3. Explore the devices in the Bacterial Demos Repo to understand how the parts work together by running the Simulator on various examples, following the instructions for the simulator found in the “Info” panel (click the “i” icon on the right to open the Info panel)
  4. Create a blank Construct and save it to your Repository
    1. Recreate the Repressilator in that empty Construct by using parts from the Characterized Bacterial Parts repository
    2. Search the parts using the Search function in the right menu
    3. Drag and drop the parts into the Construct
    4. Confirm it works as expected by running the Simulator (“play” button) and compare your results with the Repressilator Construct found in the Bacterial Demos repository
    5. Document all of this work in your Notebook entry - you can copy the glyph image and the simulator graphs, and paste them into your Notebook
  5. Build three of your own Constructs using the parts in the Characterized Bacterial Parts Repo
    1. Explain in the Notebook Entry how you think each of the Constructs should function
    2. Run the simulator and share your results in the Notebook Entry
    3. If the results don’t match your expectations, speculate on why and see if you can adjust the simulator settings to get the expected outcome


Sources:

Week 07 HW: genetic circuits part-II

cover image cover image
Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

Assignees for this section

MIT/Harvard students Required

Committed Listeners Required

  1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?
  2. Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.
  3. Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation. Draw a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.

Traditional genetic circuits work using Boolean logic, where the output is binary (either ON or OFF). The output depends on whether the input signals pass a fixed threshold. For example, in a genetic AND gate, a protein is only produced when both transcription factors A and B are present above a certain level. If one of them is missing or below the threshold, the output is zero. This type of system is useful for simple decisions, but it has important limitations because real biological signals are usually continuous, variable, and noisy, not strictly ON or OFF.

Intracellular Artificial Neural Networks (IANNs) solve these limitations by mimicking artificial neural networks inside the cell. Instead of treating inputs as binary, IANNs assign a continuous weight to each input. These weighted inputs are then summed, and the result is passed through a biological activation mechanism (such as a riboswitch or a protease-regulated system) to generate a graded output.

image image

The IANN approach provides several important advantages:

  • Continuous output resolution: Unlike Boolean circuits that only produce ON or OFF outputs, IANNs generate different levels of expression depending on the strength of the inputs. This allows cells to respond in a more precise and dose-dependent way, which is important for applications like metabolic regulation or controlled therapeutic delivery.

  • Weighted signal integration: Each input does not contribute equally. Instead, every signal has a specific weight that determines how much it influences the final output. This allows the system to prioritize certain signals over others, which is not possible in traditional AND/OR gates where all inputs are treated equally.

  • Robustness to biological noise: Cellular environments are naturally noisy, and signals can vary between cells. Because IANNs work with continuous values rather than strict thresholds, they are more tolerant to noise and variability, making them more reliable in real biological conditions.

  • Greater computational power: A multilayer IANN can act as a universal function approximator, meaning it can represent very complex relationships between inputs and outputs. In contrast, Boolean circuits are limited to simple logical combinations, which restricts the complexity of decisions they can perform.

  • Rational tunability: The weights and biases in an IANN can be adjusted through DNA design (for example, by modifying promoters or regulatory elements) or improved through directed evolution. This makes it possible to “train” the system to recognize complex patterns, such as a specific combination of biomarkers, with much higher precision than traditional Boolean circuits.

Application of an IANN: Smart Lactase-Producing Probiotic System

As a practical application of an Intracellular Artificial Neural Network (IANN), I chose a probiotic bacterium engineered for context-aware lactase production to help manage lactose intolerance. Unlike conventional synthetic circuits that respond to a single input in a binary manner, this system integrates multiple physiological signals from the gastrointestinal environment to produce a graded and condition-dependent enzymatic response.

Input Layer: Multidimensional Environmental Sensing

The system incorporates multiple biologically relevant input signals, each representing a distinct physiological parameter of the gastrointestinal environment:

X1: Lactose concentration. This serves as the primary input signal, directly reflecting the presence and abundance of the substrate requiring enzymatic degradation.

X2: pH level. This input provides spatial context by distinguishing between different regions of the gastrointestinal tract. The acidic pH of the stomach versus the near-neutral pH of the intestine allows the system to restrict activation to physiologically appropriate locations, thereby preventing premature or energetically wasteful enzyme production.

X3: Inflammatory biomarkers. Molecules such as nitric oxide, reactive oxygen species, or cytokine-associated metabolites act as indicators of intestinal stress or dysbiosis. This input enables modulation of the system’s response based on host physiological state, allowing adaptive tuning of output under pathological conditions.

Lactose sensitivity can be increased using a strong promoter or high-affinity regulator, corresponding to a positive weight. pH sensitivity may be implemented through a regulatory element that suppresses output under acidic conditions, corresponding to a negative or inhibitory weight. Inflammatory signals could be integrated via modulatory promoters or regulatory RNAs that amplify output under stress conditions, acting as an adjustable positive or negative weight depending on the desired response.

At the molecular level, each input is transduced into regulatory signals (e.g., transcription factors, small RNAs, or protease-mediated regulators). These signals are then integrated through combinatorial gene regulation, where promoter strengths, ribosome binding site efficiencies, and degradation dynamics collectively encode the effective weights.

The aggregated signal undergoes a transformation through a biological activation function, which may be implemented via nonlinear regulatory elements such as riboswitches, cooperative transcriptional regulators, or proteolytic cascades. This step introduces thresholding and saturation effects analogous to activation functions in artificial neural networks, thereby enabling continuous and nonlinear input–output relationships.
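
The weighted-sum-plus-activation idea described above can be written as a minimal numerical sketch. Everything in it is an illustrative assumption rather than a measured parameter: the inputs are normalized to the range 0 to 1, the pH input is encoded as "acidity" (1 = stomach-like, 0 = near neutral), the weights (1.0, -1.5, 0.5) are arbitrary, and a Hill function stands in for the biological activation step.

```python
# Minimal perceptron sketch for the lactase circuit (all numbers are assumptions).

def hill(z: float, k: float = 1.0, n: float = 2.0) -> float:
    """Saturating activation: 0 for z <= 0, approaching 1 as z grows."""
    if z <= 0:
        return 0.0
    return z**n / (k**n + z**n)


def lactase_expression(lactose, acidity, inflammation,
                       weights=(1.0, -1.5, 0.5), bias=0.0):
    """Relative lactase output (0 to 1) from three normalized inputs."""
    w_lactose, w_acidity, w_inflammation = weights
    # Weighted sum: positive weight for lactose, inhibitory weight for acidity,
    # modulatory positive weight for inflammatory signals.
    z = (w_lactose * lactose + w_acidity * acidity
         + w_inflammation * inflammation + bias)
    return hill(z)


# Hypothetical scenarios: (lactose, acidity, inflammation)
scenarios = {
    "stomach, high lactose": (1.0, 1.0, 0.0),
    "intestine, moderate lactose": (0.5, 0.0, 0.0),
    "intestine, high lactose": (1.0, 0.0, 0.0),
    "intestine, high lactose + inflammation": (1.0, 0.0, 1.0),
}
results = {name: lactase_expression(*x) for name, x in scenarios.items()}
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

With these assumed weights, the acidic scenario gives no output, and the output rises in steps from moderate lactose to high lactose to high lactose with inflammation.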

Output Layer: Graded Lactase Expression

The final output of the system is the expression of the lactase enzyme, with expression levels determined by the integrated and nonlinearly transformed input signal.

This enables a spectrum of responses:

  • Sub-threshold activation: conditions such as low lactose concentration or inhibitory pH result in negligible or no enzyme production.
  • Intermediate activation: moderate input levels produce moderate enzyme expression.
  • High activation: conditions such as high lactose concentration under optimal pH, potentially combined with inflammatory signals, drive maximal enzyme production.

Functional Behavior and Decision-Making Capability

The system effectively implements a context-dependent decision-making process, wherein output is not determined by a single condition but by the weighted combination of multiple environmental cues. For example:

  • The presence of lactose alone is insufficient to trigger activation under acidic conditions, thereby preventing inappropriate expression in the stomach.
  • Under intestinal pH, lactose induces activation in a concentration-dependent manner.
  • In the presence of both high lactose and inflammatory signals, the system can upregulate lactase production, potentially enhancing digestive efficiency under stress conditions.

Limitations and Practical Constraints

Despite its conceptual advantages, the implementation of such an IANN-based system faces several challenges:

  • Stochastic gene expression: Intrinsic and extrinsic noise can introduce variability in circuit performance across individual cells.
  • Parameter tuning complexity: Precise calibration of weights and activation thresholds through genetic elements (e.g., promoters, RBSs) remains experimentally demanding.
  • Kinetic limitations: Transcriptional and translational processes impose temporal delays, limiting the speed of system response.
  • Regulatory crosstalk: Interactions between synthetic and endogenous pathways may lead to unintended behaviors.
  • Metabolic burden: The expression of complex regulatory networks can reduce host fitness and stability.
  • Environmental variability: Dynamic and heterogeneous gut conditions may challenge the robustness and predictability of the system.

Implementation of a Multilayer Perceptron Using Endoribonucleases

To implement a multilayer perceptron in a biological system, the output of one computational layer must regulate the activity of the next. This can be achieved using a cascade of endoribonucleases, where each layer processes inputs and produces a regulatory molecule that serves as the input for the subsequent layer.

image image
  • Input Representation

The system integrates multiple biological inputs represented as molecular signals:

X1: Csy4 endoribonuclease (constitutively or inducibly expressed)

X2: an additional regulatory signal (e.g., inducible promoter or transcriptional activator)

X3: environmental or metabolic signal (e.g., pH, or inflammatory markers such as nitric oxide)

These inputs are converted into regulatory effects at the gene expression level, analogous to numerical inputs in an artificial neural network.

image image
  • Layer 1: Intermediate Processing

In the first layer, the inputs jointly regulate the expression of an intermediate endoribonuclease (e.g., Cas6a). The mRNA encoding this enzyme is engineered to contain specific recognition sites for Csy4. As a result:

–> The presence of Csy4 (X1) induces cleavage of the mRNA, leading to repression of Cas6a expression

–> The second input (X2) can act as an activator, promoting transcription of the Cas6a gene

Thus, Layer 1 integrates activating and inhibitory signals. The resulting expression level of Cas6a reflects a balance between these opposing regulatory effects, analogous to a weighted sum followed by a nonlinear activation function in a perceptron.

  • Layer 2: Output Generation

The output of Layer 1 (Cas6a protein) serves as the regulatory input for Layer 2.

The mRNA encoding a reporter protein (e.g., GFP) is engineered to contain Cas6a recognition sites. Consequently:

–> High levels of Cas6a lead to cleavage of GFP mRNA and repression of fluorescence

–> Low levels of Cas6a allow GFP expression

This establishes a second computational layer in which the input is not external, but derived from the processed output of the first layer.

  • System-Level Behavior

This cascading architecture enables hierarchical signal processing within the cell.

–> When Csy4 levels are high, Cas6a production is suppressed, allowing GFP expression

–> When Csy4 levels are low and activation dominates, Cas6a is produced and represses GFP

Therefore, the final output depends on both the original inputs and the intermediate computation performed in Layer 1.
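
The two-layer behavior described above can be sketched as a steady-state toy model. This is only an illustration under stated assumptions: endoribonuclease cleavage is approximated by a repressive Hill function, all levels are in arbitrary normalized units, and the constants `k = 0.5` and `n = 2` are invented for the example rather than taken from any measurement.

```python
# Toy steady-state model of the Csy4 -> Cas6a -> GFP cascade (assumed parameters).

def repression(level: float, k: float = 0.5, n: float = 2.0) -> float:
    """Fraction of target mRNA that escapes cleavage at a given endoRNase level."""
    return 1.0 / (1.0 + (level / k) ** n)


def two_layer_output(csy4: float, activator: float):
    """Return (Cas6a level, GFP level) for given Layer 1 inputs."""
    # Layer 1: X2 activates Cas6a transcription, Csy4 (X1) cleaves its mRNA.
    cas6a = activator * repression(csy4)
    # Layer 2: Cas6a cleaves the GFP mRNA.
    gfp = repression(cas6a)
    return cas6a, gfp


cas6a_high_csy4, gfp_high_csy4 = two_layer_output(csy4=1.0, activator=1.0)
cas6a_low_csy4, gfp_low_csy4 = two_layer_output(csy4=0.0, activator=1.0)
print(f"High Csy4: Cas6a={cas6a_high_csy4:.2f}, GFP={gfp_high_csy4:.2f}")
print(f"Low Csy4:  Cas6a={cas6a_low_csy4:.2f}, GFP={gfp_low_csy4:.2f}")
```

In this sketch, high Csy4 keeps Cas6a low and GFP high, while low Csy4 with the activator present lets Cas6a accumulate and repress GFP, matching the qualitative behavior listed above.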

Assignment Part 2: Fungal Materials

Assignees for this section

MIT/Harvard students Required

Committed Listeners Required

  1. What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?
  2. What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?

Fungal Materials and Their Uses

Fungi, especially their root-like networks called mycelium, are becoming a surprisingly powerful source of sustainable materials. Unlike plastics or leather, which require heavy manufacturing and chemicals, mycelium-based materials are grown. The fungi take agricultural waste—things like sawdust, rice husks, or hemp—and weave it into solid, structured materials. It’s almost like nature is doing the 3D printing for us.

One of the most familiar uses today is in packaging. Mycelium can form lightweight, shock-absorbing foams that replace Styrofoam or plastic inserts. Fragile items like electronics, furniture, or delicate goods can be safely packed in these eco-friendly alternatives. Companies like Ecovative and IKEA have already begun experimenting with this approach, showing that sustainable materials don’t have to compromise practicality. image_ref image image

Fungi are also stepping into the fashion world through myco-leather. By processing mats of mycelium into flexible sheets, it’s possible to make shoes, bags, and even clothing without harming animals. Myco-leather is fully biodegradable, reduces chemical waste, and offers a much lower environmental footprint than traditional leather production. It’s a great example of how biology can meet design. image_ref image image

In construction, fungal materials are finding their place as well. Mycelium boards can provide thermal and acoustic insulation, or serve as lightweight panels for ceilings and walls. They are naturally fire-resistant, resistant to pests like termites, and completely biodegradable. This means that even in building applications, mycelium offers both functionality and sustainability.

The versatility doesn’t stop there. Designers and researchers are exploring fungal foams, textiles, and even furniture, taking advantage of mycelium’s ability to grow into complex shapes. People are also experimenting with specialty applications, like wearable electronics, wound dressings, filters, and acoustic panels. Fungi aren’t just materials—they’re living factories that can be shaped, molded, and sometimes even programmed to do more. image_ref image image

Advantages and Disadvantages of Fungal Materials

Advantages | Disadvantages
Made from renewable agricultural waste and fully biodegradable | Lower mechanical strength compared to plastics, metals, or treated leather
“Grown” in controlled conditions with minimal energy and no toxic chemicals | Sensitive to moisture; can deform or degrade if untreated
Naturally fire-resistant and termite-resistant | Slower production: growing a material takes days or weeks
Lightweight with good strength-to-weight ratio | Batch-to-batch variability due to biological growth
Can return nutrients to the soil after disposal | Limited durability under extreme conditions without extra treatment

Genetic Engineering and Synthetic Biology in Fungi

Fungi are not just fascinating organisms—they are also incredibly versatile tools for engineering. If I were to genetically engineer fungi, I would aim to enhance the properties that currently limit their use while maximizing their natural strengths. For instance, one limitation of mycelium-based materials is their mechanical strength, which can make them less competitive compared to plastics or synthetic foams. I would focus on modifying the cell wall composition or growth patterns to produce stronger, more durable materials, making fungi a realistic alternative for packaging, textiles, and construction.

Another area I would target is environmental resilience. Fungal materials are naturally biodegradable, which is a huge advantage, but they can degrade too quickly in humid or wet environments. By engineering fungi to better tolerate moisture or extreme temperatures, it would be possible to create materials that maintain their structure and functionality in a wider range of conditions, expanding their practical applications.

Beyond materials, fungi can also be engineered for functional enhancements. I would consider adding traits like pigmentation for natural coloring, antimicrobial properties to extend shelf life, or even self-healing abilities so that minor damage doesn’t ruin the product. These modifications could transform mycelium into “smart materials” that are not only sustainable but also highly functional.

Why Use Fungi Instead of Bacteria?

Fungi offer several important advantages over bacteria when it comes to synthetic biology. First, as eukaryotic organisms, they have more advanced cellular machinery. This allows them to properly fold and modify complex proteins through processes like glycosylation, which is essential for many pharmaceuticals and functional biomolecules.

Another major advantage is their filamentous growth. Many fungi grow as long branching structures (hyphae), which makes them very efficient at secreting enzymes and other products into their environment. This simplifies downstream processing because the desired product is often already outside the cell.

Fungi also have a much richer and more diverse metabolism compared to most bacteria. They naturally produce a wide range of secondary metabolites, which means they can be engineered to generate a broader variety of useful compounds, from drugs to pigments to bioactive molecules. In addition, fungi are generally more robust in industrial settings. They can grow on cheap, low-quality substrates like agricultural waste and tolerate harsher conditions than many bacteria, making them more practical for large-scale, sustainable production.

That said, working with fungi can be more complex. They tend to grow more slowly than bacteria, and genetic engineering tools are less standardized. However, despite these challenges, their unique capabilities make them extremely valuable for applications where bacteria fall short.

Assignment Part 3: First DNA Twist Order

Assignees for this section

MIT/Harvard students Required

Committed Listeners Required

  1. Review the Individual Final Project documentation guidelines.
  2. Submit this Google Form with your draft Aim 1, final project summary, HTGAA industry council selections, and shared folder for DNA designs. DUE MARCH 20 FOR MIT/HARVARD/WELLESLEY STUDENTS
  3. Review Part 3: DNA Design Challenge of the week 2 homework. Design at least 1 insert sequence and place it into the Benchling/Kernel/Other folder you shared in the Google Form above. Document the backbone vector it will be synthesized in on your website.

For this part, I have developed three potential ideas for my final project and would greatly appreciate your feedback to help refine my direction. While I am still open to suggestions, I currently find myself most aligned with my second idea, as it feels both biologically intuitive and well-matched to the techniques we have learned throughout the course.

The idea I am leaning toward is focused on engineering an E. coli reporter system to monitor protein aging during heterologous expression using a fluorescent timer protein. I am particularly drawn to this concept because it allows me to integrate multiple core synthetic biology tools, including DNA construct design, protein engineering, and computational structure prediction, while also remaining experimentally feasible within the scope of the course. In addition, the system is mechanistically clear, which makes it easier to design, test, and interpret.

I have further refined this idea into a more specific and functional design: a time-dependent protein quality control system in which a fluorescent timer regulates the exposure of a degron, leading to the selective degradation of aged proteins. In this system, a protein of interest is fused to a fluorescent timer and a C-terminal degron. As the protein matures and the timer shifts from its “young” to “old” fluorescence state, conformational or structural changes are expected to increase the accessibility of the degron. This, in turn, allows recognition by the host proteolytic machinery, enabling targeted degradation of older protein populations.

image image

The key modification from the original idea is the addition of a functional outcome (degradation) rather than only monitoring protein age. This transforms the system from a passive reporter into an active quality control mechanism. The purpose of this change is to address a limitation in current heterologous expression systems, where proteins can accumulate in misfolded or non-functional states over time. By selectively degrading older or potentially damaged proteins, this system could improve overall protein quality and stability.

The broader gap I am attempting to address is the lack of dynamic, time-resolved control over protein lifespan in bacterial systems. Most current approaches either measure protein expression statically or rely on constitutive degradation signals that do not account for protein age. This project introduces a strategy to link protein function, age, and degradation in a single genetically encoded system.

At this stage, I would greatly value any feedback on the conceptual design, feasibility, or potential improvements. In particular, I would appreciate input on whether the proposed mechanism for degron exposure is realistic, and whether there are alternative design strategies that could strengthen the system. Any suggestions on experimental design, protein choice, or construct optimization would also be extremely helpful.

Please feel free to share feedback through any preferred channel, including email or WhatsApp. Thank you for your time and guidance.

Designing the insert sequence in Benchling:

For this idea, I designed a genetic construct in Benchling that encodes a fusion protein consisting of GST as the protein of interest, followed by a flexible linker, a fluorescent timer protein, a second short linker, and a C-terminal ssrA degron. The sequences are listed in the following table:

Genetic Element | Function | DNA Sequence (5' → 3', on the line below each entry)
Start Codon | Initiates translation
ATG
Protein of Interest (GST - Schistosoma japonicum) | Reporter protein for studying protein aging and degradation
AGCCCGATCCTGGGTTACTGGAAAATCAAGGGCCTGGTGCAACCGACCCGCCTGCTGCTGGAATACCTGGAGGAAAAATACGAGGAACACCTGTATGAGCGTGACGAAGGCGATAAGTGGCGTAACAAGAAATTCGAGCTGGGTCTGGAATTTCCGAACCTGCCGTACTATATTGACGGCGATGTGAAACTGACCCAGAGCATGGCGATCATTCGTTACATCGCGGACAAACACAACATGCTGGGTGGCTGCCCGAAGGAGCGTGCGGAAATTAGCATGCTGGAGGGCGCGGTGCTGGATATTCGTTACGGTGTTAGCCGTATCGCGTATAGCAAAGACTTCGAAACCCTGAAGGTGGATTTTCTGAGCAAACTGCCGGAGATGCTGAAGATGTTCGAGGACCGTCTGTGCCACAAAACCTATCTGAACGGTGACCACGTTACCCACCCGGATTTTATGCTGTACGACGCGCTGGATGTGGTTCTGTATATGGACCCGATGTGCCTGGATGCGTTCCCGAAGCTGGTTTGCTTTAAGAAACGTATCGAGGCGATTCCGCAAATCGACAAGTACCTGAAAAGCAGCAAGTATATTGCGTGGCCGCTGCAAGGTTGGCAAGCGACCTTTGGTGGCGGTGATCACCCGCCGAAG
Linker 1 ((Gly₄Ser)₃) | Flexible linker between GST and timer
GGTGGCGGTGGCTCGGGCGGTGGTGGGTCGGGTGGCGGCGGATCT
Fluorescent Timer (Medium FT) | Reports protein age (blue to red transition)
ATGGTGAGCAAGGGCGAGGAGGATAACATGGCCATCATCAAGGAATTCATGCGTTTCAAGGTGCACCTGGAGGGCTCCGTGGACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGAGCGCCAAGCTGAAGGTGACCAAGGGTGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCTCAGTTCATGTACGGCTCCAGGGCCTACGTGAAGCACCCCGCCGACATCCCCGACTACTGGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGATGGCGGCGTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCGAGTTCATCTACAAGGTGAAGCTGCGCGGCACCAACTTCCCTTCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCTCCGAGCGGATATACCCCGAGGACGGCGCCCTGAAGGGCGAGATCAAGCAGAGGCTGAAGCTGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCACCTACAAGGCCAAGAAGCCCGTGCTGCTGCCCGGCGCCTACAACGTCAACATCAAGATGGACATCACCTCCCACAACGAGGACTACACCATCGTTGAACAGTGCGAACGCGCCGAGGGCCACCATTCCACCGGCGGCATGGACGAGCTGTACAAGTAA
Linker 2 (GGGGS) | Provides flexibility and enables degron exposure
GGTGGTGGTGGTAGC
Degron (ssrA tag) | Targets protein for degradation by ClpXP
GCTGCTAACGACGAAAACTACGCTCTGGCTGCT
Stop Codon | Terminates translation
TAA

This design enables time-dependent exposure of the degron, allowing selective degradation of aged proteins by the host proteolytic system.

The designed insert will be cloned into a pET28 expression vector for protein expression in Escherichia coli BL21(DE3). This vector provides a T7 promoter, ribosome binding site, transcription terminator, and an N-terminal His₆ tag for protein purification. Therefore, only the coding sequence of the fusion protein was designed in Benchling.
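
Because the construct is a single open reading frame built from several parts, it is worth checking each part for frame and premature stop codons before ordering. The sketch below is one hypothetical way to do that in plain Python, using the standard genetic code; the function names are my own and it is not a Benchling feature. It is demonstrated on the short elements from the table plus the last 15 nt of the Medium FT entry, which as listed appears to end in a TAA codon. If that codon is carried into the fusion, translation would be expected to stop before Linker 2 and the ssrA tag, so the full Benchling sequence should be checked the same way.

```python
# Sketch: per-part reading-frame and stop-codon check (hypothetical helpers).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def check_fusion(parts):
    """parts: ordered (name, dna) pairs, excluding the final stop codon.

    Each part is translated from its own first base, which is only valid
    when every upstream part has a length divisible by 3 (also checked).
    """
    problems = []
    for name, dna in parts:
        if len(dna) % 3 != 0:
            problems.append(f"{name}: length {len(dna)} is not a multiple of 3")
        protein = translate(dna)
        if "*" in protein:
            problems.append(f"{name}: stop codon at codon {protein.index('*') + 1}")
    return problems


demo_parts = [
    ("Linker 1", "GGTGGCGGTGGCTCGGGCGGTGGTGGGTCGGGTGGCGGCGGATCT"),
    ("Medium FT (last 15 nt)", "GAGCTGTACAAGTAA"),  # excerpt of the table entry
    ("Linker 2", "GGTGGTGGTGGTAGC"),
    ("ssrA degron", "GCTGCTAACGACGAAAACTACGCTCTGGCTGCT"),
]
problems = check_fusion(demo_parts)
for line in problems:
    print(line)
```

An empty `problems` list for the full set of parts (GST, Linker 1, Medium FT, Linker 2, ssrA) would indicate that the frame is maintained and that only the final TAA terminates translation.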

The direct link to Benchling: GST_Timer_Degron_Insert

In this homework, ChatGPT helped me structure and write the answers and descriptions clearly, while Cloud AI generated the diagrams comparing Boolean genetic circuits and IANNs, the example illustrating a multilayer perceptron application, and the diagram describing my final project idea proposal.


Sources:

  • Brophy, J. A. N., & Voigt, C. A. (2014). Principles of Genetic Circuit Design. Nature Methods, 11(5), 508–520. https://doi.org/10.1038/nmeth.2926
  • Differences Between SLP and MLP | PDF | Theoretical Computer Science | Machine Learning. (n.d.). Retrieved March 30, 2026, from https://fr.scribd.com/document/858039220/Single-Layer-Perceptron-and-Multilayer-Perceptron
  • Gandia, A., van den Brandhof, J. G., Appels, F. V. W., & Jones, M. P. (2021). Flexible Fungal Materials: Shaping the Future. Trends in Biotechnology, 39(12), 1321–1331. https://doi.org/10.1016/j.tibtech.2021.03.002
  • Halužan Vasle, A., & Moškon, M. (2024). Synthetic biological neural networks: From current implementations to future perspectives. BioSystems, 237, 105164. https://doi.org/10.1016/j.biosystems.2024.105164
  • Hinneburg, H., Gu, S., & Naseri, G. (2025). Fungal Innovations—Advancing Sustainable Materials, Genetics, and Applications for Industry. Journal of Fungi, 11(10). https://doi.org/10.3390/jof11100721
  • Lim, H. G., Jang, S., Jang, S., Seo, S. W., & Jung, G. Y. (2018). Design and optimization of genetically encoded biosensors for high-throughput screening of chemicals. Current Opinion in Biotechnology, Analytical Biotechnology, 54, 18–25. https://doi.org/10.1016/j.copbio.2018.01.011
  • Mattern, D. J., Valiante, V., Unkles, S. E., & Brakhage, A. A. (2015). Synthetic biology of fungal natural products. Frontiers in Microbiology, 6, 775. https://doi.org/10.3389/fmicb.2015.00775
  • Moorman, A., Samaniego, C. C., Maley, C., & Weiss, R. (2019). A Dynamical Biomolecular Neural Network. 2019 IEEE 58th Conference on Decision and Control (CDC), 1797–1802. https://doi.org/10.1109/CDC40024.2019.9030122
  • Parhizi, Z., Dearnaley, J., Kauter, K., Mikkelsen, D., Pal, P., Shelley, T., & Burey, P. (Polly). (2025). The Fungus Among Us: Innovations and Applications of Mycelium-Based Composites. Journal of Fungi, 11(8), 549. https://doi.org/10.3390/jof11080549
  • Seak, L. C. U., Lo, O. L. I., Suen, W. C.-W., & Wu, M.-T. (2021). Next-generation biocomputing: Mimicking artificial neural network with genetic circuits (p. 2021.03.12.435120). bioRxiv. https://doi.org/10.1101/2021.03.12.435120
  • Secret fungi in everyday life | Kew. (n.d.). Retrieved March 30, 2026, from https://www.kew.org/read-and-watch/everyday-fungi-food-medicine
  • Stock, C. H., Harvey, S. E., Ocko, S. A., & Ganguli, S. (2022). Synaptic balancing: A biologically plausible local learning rule that provably increases neural network noise robustness without sacrificing task performance. PLoS Computational Biology, 18(9), e1010418. https://doi.org/10.1371/journal.pcbi.1010418
  • Undecided with Matt Ferrell. (2021, June 22). Is Mycelium Fungus the Plastic of the Future? [Video recording]. https://www.youtube.com/watch?v=cApVVuuqLFY
  • van der Linden, A. J., Pieters, P. A., Bartelds, M. W., Nathalia, B. L., Yin, P., Huck, W. T. S., Kim, J., & de Greef, T. F. A. (2022). DNA Input Classification by a Riboregulator-Based Cell-Free Perceptron. ACS Synthetic Biology, 11(4), 1510–1520. https://doi.org/10.1021/acssynbio.1c00596
  • Wang, X., Chen, Y.-Z., Qiu, X.-D., Chen, L., Teng, Y.-M., Ding, C., Huang, Y.-T., Wang, S.-Y., Liu, S.-Y., Ding, B., Laborda, P., & Zhu, S.-Q. (2026). Bioactivity and mechanisms of Ewingella americana for the control of Alternaria leaf spot on peanut. Physiological and Molecular Plant Pathology, 142, 103088. https://doi.org/10.1016/j.pmpp.2025.103088
  • Yang, P., Condrich, A., Lu, L., Scranton, S., Hebner, C., Sheykhhasan, M., & Ali, M. A. (2024). Genetic Engineering in Bacteria, Fungi, and Oomycetes, Taking Advantage of CRISPR. DNA, 4(4), 427–454. https://doi.org/10.3390/dna4040030

Week 09 HW: Cell Free Systems

cover image cover image

Homework Part A: General and Lecturer-Specific Questions

General homework questions

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required
  1. Explain the main advantages of cell-free protein synthesis over traditional in vivo methods, specifically in terms of flexibility and control over experimental variables. Name at least two cases where cell-free expression is more beneficial than cell production.
  2. Describe the main components of a cell-free expression system and explain the role of each component.
  3. Why is energy provision regeneration critical in cell-free systems? Describe a method you could use to ensure continuous ATP supply in your cell-free experiment.
  4. Compare prokaryotic versus eukaryotic cell-free expression systems. Choose a protein to produce in each system and explain why.
  5. How would you design a cell-free experiment to optimize the expression of a membrane protein? Discuss the challenges and how you would address them in your setup.
  6. Imagine you observe a low yield of your target protein in a cell-free system. Describe three possible reasons for this and suggest a troubleshooting strategy for each.

  1. Main Advantages of Cell-Free Protein Synthesis (CFPS) Over Traditional In Vivo Methods

Cell-free protein synthesis removes the constraints of using living cells. You are working in a test tube, which gives you direct control over the reaction environment without worrying about cell viability. image ref cover image cover image

Flexibility and Control:

  • Direct manipulation: You can easily change pH, salt concentration, or redox potential, or add detergents, chaperones, or unnatural amino acids at any time. In living cells, such changes would often kill the cells, or the additives would fail to cross the cell membrane.
  • No cell walls or membranes: You add DNA directly to the extract. There is no need for transformation, selection, or cell lysis steps. This saves hours or days.
  • Toxic protein production: You can synthesize proteins that would kill living cells (e.g., membrane proteins, proteases, toxins).
  • Speed and efficiency: Protein production takes 2–4 hours instead of days. The reaction's energy and resources go into making your target protein rather than into cell growth.

Two cases where CFPS is more beneficial than cell production:
  1. High-throughput screening of enzyme variants or genetic circuits – Because reactions are fast and can be done in 96- or 384-well plates, you can test hundreds of conditions or mutants in a single afternoon.

  2. Production of toxic membrane proteins (e.g., GPCRs, viral ion channels) – These proteins kill E. coli or insect cells when produced in vivo. In CFPS, you can add detergents or nanodiscs directly to the reaction to keep the protein soluble and stable.

  2. Main Components of a Cell-Free Expression System and Their Roles

A cell-free system combines cellular machinery with necessary nutrients and energy. Below are the key components and what each does.

| Component | Role in the System |
| --- | --- |
| Cell extract (lysate) | Derived from broken cells (e.g., E. coli, wheat germ, rabbit reticulocytes). Provides ribosomes, tRNAs, aminoacyl-tRNA synthetases, initiation/elongation/termination factors, and native enzymes needed for transcription and translation. |
| Genetic template (DNA or mRNA) | The instruction manual. DNA (plasmid or linear PCR product) is transcribed into mRNA, then translated into protein. If you add mRNA directly, translation starts immediately without transcription. |
| Amino acids | The 20 building blocks that ribosomes link together to form the protein chain. |
| Energy source (ATP, GTP) | Provides the chemical energy needed for bond formation during translation, transcription, and tRNA charging. |
| Energy regeneration system | Converts spent ADP back to ATP. Without this, the reaction stops within 10–20 minutes. Common systems include creatine phosphate/creatine kinase or phosphoenolpyruvate (PEP)/pyruvate kinase. |
| RNA polymerase (e.g., T7 RNA polymerase) | If using a DNA template with a T7 promoter, you add this enzyme separately to transcribe DNA into mRNA efficiently. |
| Buffer solution (salts and cofactors) | Maintains optimal pH (usually 7.4–8.0) and ionic conditions. Magnesium (Mg²⁺) and potassium (K⁺) concentrations are critical – too little and ribosomes fall apart, too much and they stop working. |
| RNase and protease inhibitors | Protect your mRNA and protein from degradation by native enzymes present in the cell extract. |

These components are combined either as a crude extract (fast and cheap) or a PURE system (reconstituted from purified components, cleaner but more expensive).

  3. Why Energy Regeneration Is Critical and a Method to Ensure Continuous ATP Supply
  • Why is it critical?

Cell-free systems lack the metabolic networks of living cells that continuously generate ATP. Translation consumes energy rapidly: adding each amino acid costs roughly 4 ATP equivalents (2 for charging the tRNA and 2 GTP during elongation). Without regeneration, ATP is depleted within about 10–20 minutes, and protein synthesis stops. To produce protein for 2–6 hours, you need a way to keep making ATP from ADP.

Method for continuous ATP supply: Phosphoenolpyruvate (PEP) / Pyruvate Kinase system

  • What you add: Phosphoenolpyruvate (PEP) and the enzyme pyruvate kinase.
  • How it works: Pyruvate kinase transfers a high-energy phosphate group from PEP to ADP, regenerating ATP and producing pyruvate as a byproduct.
  • Why it works well: PEP has a higher phosphate transfer potential than ATP, so the reaction favors ATP formation. It is reliable and commonly used in E. coli systems.
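
To get a feel for why regeneration matters, the ATP demand of a reaction can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope estimate. The function names, the cost of about 4 ATP equivalents per residue (a commonly quoted textbook figure covering tRNA charging and the two GTPs used in elongation), the average residue mass of 110 Da, and the 1.5 mM starting ATP concentration are all assumptions for illustration, and the estimate ignores transcription and side reactions.

```python
def atp_demand_mM(yield_mg_per_mL, atp_eq_per_residue=4.0, avg_residue_Da=110.0):
    """Rough ATP-equivalent demand (mM) to reach a given protein yield.

    mg/mL equals g/L, so dividing by the average residue mass (g/mol)
    gives mol/L of polymerized amino acids; multiply by 1000 for mM.
    """
    residues_mM = yield_mg_per_mL / avg_residue_Da * 1000.0
    return residues_mM * atp_eq_per_residue


def pool_turnovers(demand_mM, starting_atp_mM=1.5):
    """How many times the starting ATP pool must be regenerated (assumed pool size)."""
    return demand_mM / starting_atp_mM


demand = atp_demand_mM(1.0)  # assumed target yield: 1 mg/mL
print(f"ATP equivalents needed: {demand:.1f} mM")
print(f"Pool turnovers needed:  {pool_turnovers(demand):.0f}")
```

Under these assumptions the demand is far larger than the ATP initially present, which is why a phosphate donor such as PEP is supplied at a much higher concentration than ATP itself.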

Alternative methods (if the PEP system presents limitations):

  • Creatine phosphate / creatine kinase: Converts ADP + creatine phosphate → ATP + creatine. Very common and stable.
  • Glucose / hexokinase or maltodextrin: Cheaper, but can cause pH drops.
  4. Comparison of Prokaryotic vs. Eukaryotic Cell-Free Expression Systems

Cell-free expression systems can be broadly divided into prokaryotic and eukaryotic platforms, and the choice between them mainly depends on the complexity of the target protein.

  • Prokaryotic system (e.g., Escherichia coli)

These systems are typically derived from E. coli and are widely used because they are fast, cost-effective, and produce high protein yields in a short time. However, they lack the machinery needed for post-translational modifications such as glycosylation, and they often have difficulty forming correct disulfide bonds and folding complex proteins properly.

  • Eukaryotic system (e.g., rabbit reticulocyte lysate, wheat germ extract)

These systems provide a more suitable environment for protein folding. They contain molecular chaperones and can support disulfide bond formation and, in some cases, post-translational modifications. However, they are generally more expensive, slower, and produce lower yields compared to prokaryotic systems.

  • Choosing proteins for each system

–> For prokaryotic systems:

The general rule is to choose proteins that are simple, relatively small, and do not require post-translational modifications or complex folding. These proteins should be able to fold easily in the cytoplasm. Based on these criteria, bacterial luciferase is a suitable choice. This enzyme produces a measurable light signal, making it very useful as a reporter protein. It does not require glycosylation and can be efficiently expressed and folded in E. coli, allowing easy detection through luminescence assays.

–> For eukaryotic systems:

The selection criteria are different. Proteins are usually more complex, may contain multiple domains, require disulfide bonds, or need chaperones for correct folding. Some are also membrane proteins and need a suitable environment to function. Membrane proteins, such as G protein-coupled receptors (GPCRs), are good examples. These proteins have complex structures with multiple transmembrane domains and require proper folding machinery and membrane-like conditions. Such requirements are difficult to meet in prokaryotic systems without added membrane mimics, while eukaryotic systems are better equipped to support correct folding and functionality.

| Feature | Prokaryotic (e.g., E. coli lysate) | Eukaryotic (e.g., wheat germ, rabbit reticulocyte, insect cell) |
| --- | --- | --- |
| Yield | High (up to 1–2 mg/mL) | Low to moderate (µg/mL range) |
| Speed | Fast – 2 to 4 hours | Slower – 4 to 12 hours |
| Cost | Low | High |
| Folding machinery | Limited chaperones; no natural membrane structures | Better chaperones; some systems contain microsomes (ER vesicles) |
| Post-translational modifications (PTMs) | None (no glycosylation, limited disulfide bonds) | Can perform glycosylation, phosphorylation, and efficient disulfide bonds (if microsomes present) |
| Best for | Simple cytoplasmic proteins, enzymes, high-throughput screening | Complex human proteins, antibodies, secreted proteins, membrane proteins requiring PTMs |
  5. Designing a Cell-Free Experiment to Optimize Membrane Protein Expression

Optimizing the expression of a membrane protein in a cell-free system requires careful consideration of the protein’s complexity, folding requirements, and membrane integration. Membrane proteins are challenging to produce because of their hydrophobic transmembrane domains, tendency to aggregate, and need for a membrane-like environment and proper chaperones.

  • Choosing the right Expression System

The choice of a cell-free system depends on the nature and complexity of the membrane protein:

  • Prokaryotic system (e.g., Escherichia coli): Suitable for simpler membrane proteins with few transmembrane domains that do not require complex folding or post-translational modifications. Advantages include fast expression, high yield, and low cost. However, proper folding must be supported using membrane mimics such as liposomes, nanodiscs, or mild detergents.

  • Eukaryotic system (e.g., rabbit reticulocyte lysate, wheat germ extract): Preferable for complex membrane proteins with multiple transmembrane domains or disulfide bonds. These systems contain molecular chaperones and provide a more natural folding environment, reducing aggregation and increasing the chance of functional protein production. Limitations include higher cost, slower expression, and lower yields.

  • Providing a Membrane-Like Environment

Membrane proteins require an environment that mimics a lipid bilayer. In both prokaryotic and eukaryotic systems, this can be achieved by:

  • Adding liposomes or nanodiscs
  • Using mild detergents carefully optimized to prevent aggregation

This ensures proper insertion of the protein into a membrane-like environment, which is critical for correct folding and functionality.

  • Optimizing Folding and Expression

To further improve expression and functionality:

  • Add chaperones if the protein is prone to misfolding
  • Adjust reaction conditions such as temperature, Mg²⁺ concentration, and DNA template concentration
  • Use a continuous ATP regeneration system (e.g., PEP/pyruvate kinase) to sustain protein synthesis
  • Employ a Continuous Exchange Cell-Free (CECF) setup to extend reaction time up to 24 hours. This setup constantly provides fresh energy (ATP/GTP) and removes inhibitory byproducts, which significantly improves protein yield and folding efficiency

Challenges and how to address them:

| Challenge | Why it happens | Solution |
| --- | --- | --- |
| Protein aggregation | Membrane proteins are hydrophobic and clump together in water. | Add liposomes or nanodiscs from the start. Test different detergents (0.1–1% DDM, Brij-35, or LMNG). |
| Low yield | Detergents can inhibit ribosomes. | Titrate detergent concentration – start low, increase until protein is soluble but yield remains acceptable. |
| Ribosome stalling | The hydrophobic nascent chain sticks to the ribosome exit tunnel. | Optimize the N-terminal sequence. Use a fusion tag like Mistic (from Bacillus subtilis) that helps membrane proteins fold. |
| No activity (misfolding) | Protein inserted incorrectly or in wrong lipid environment. | Test different lipid compositions (e.g., POPC, POPG, or E. coli polar lipids). Add chaperones (GroEL/GroES). |
| Short reaction time | Energy runs out or inhibitors accumulate. | Use CECF (dialysis) format. Double the energy regeneration components. |

Optimization checklist:

  • Titrate magnesium (8–16 mM) – critical for ribosome function.
  • Test temperatures (20°C, 25°C, 30°C, 37°C).
  • Try 2–3 different detergents or lipid preparations.
  • Run a small-scale (10 µL) screening reaction before scaling up.
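
The checklist above can be turned into an explicit screening matrix before pipetting. This is a minimal sketch, assuming a full factorial design over the magnesium range, temperatures, and detergents mentioned above. The 2 mM magnesium step, the variable names, and the grouping by temperature (so that each incubation temperature forms one batch) are my own choices for illustration, not a validated protocol.

```python
from itertools import product

# Assumed screening levels, taken from the ranges in the checklist.
mg_mM = [8, 10, 12, 14, 16]
temps_C = [20, 25, 30, 37]
additives = ["DDM", "Brij-35", "LMNG"]


def build_screen(mg_values, temperatures, additive_names):
    """Return every combination as a dict, ordered so each temperature is one batch."""
    return [
        {"Mg_mM": mg, "temp_C": temp, "additive": additive}
        for temp, additive, mg in product(temperatures, additive_names, mg_values)
    ]


conditions = build_screen(mg_mM, temps_C, additives)
reaction_uL = 10  # small-scale screening volume from the checklist
print(len(conditions), "conditions,", len(conditions) * reaction_uL, "µL of reaction mix in total")
for condition in conditions[:3]:
    print(condition)
```

A full factorial grows quickly, so in practice I would probably screen magnesium first and then test detergents only around the best magnesium value.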
  6. Low Yield of Target Protein – Three Possible Reasons and Troubleshooting

If your cell-free reaction produces a very low protein yield, check these common issues:

Reason 1: Low quality of DNA template

The DNA may contain inhibitors (salts, ethanol, phenol, agarose) or be degraded by nucleases. Without a good template, no mRNA is made.

  • Troubleshooting:

✅ Purify DNA using a spin column kit (not just alcohol precipitation).

✅ Avoid using DNA cut from agarose gels – re-extract if necessary.

✅ Check DNA concentration and run an agarose gel to see if it is intact.

✅ Use 10–20 µg of plasmid or 5–10 µg of linear PCR product per 1 mL reaction.
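
Because templates of different lengths contain different numbers of molecules per microgram, it can help to convert the mass amounts above into molar concentrations. The sketch below assumes an average mass of about 650 g/mol per base pair of double-stranded DNA, which is a common approximation. The function name and the example template lengths (a 5,000 bp plasmid and a 1,500 bp PCR product) are hypothetical.

```python
def dsdna_nM(ug_per_mL, length_bp, da_per_bp=650.0):
    """Convert a dsDNA mass concentration (µg/mL) to a molar concentration (nM).

    Assumes an average of ~650 g/mol per base pair (approximation).
    """
    grams_per_L = ug_per_mL * 1e-3           # µg/mL is the same as mg/L
    mol_per_L = grams_per_L / (length_bp * da_per_bp)
    return mol_per_L * 1e9                   # mol/L -> nM


# Hypothetical templates for illustration only.
print(f"Plasmid, 10 µg/mL, 5000 bp: {dsdna_nM(10, 5000):.2f} nM")
print(f"PCR product, 5 µg/mL, 1500 bp: {dsdna_nM(5, 1500):.2f} nM")
```

Under these assumptions, the shorter linear template is present at a higher molar concentration even though less mass was added.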

Reason 2: Codon bias (rare codons in the target gene)

If your gene contains many codons that are rare in the host (e.g., human gene expressed in E. coli extract), ribosomes stall or terminate early. This produces truncated or no protein.

  • Troubleshooting:

✅ Re-synthesize the gene with codons optimized for your extract (E. coli or wheat germ). Many online tools and services do this.

✅ Use an extract from a strain that supplies extra rare tRNAs (e.g., E. coli Rosetta or BL21 CodonPlus).

✅ Switch to a PURE system, which is less sensitive to codon bias.
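
Before ordering a re-synthesized gene, a quick scan can show how many rare codons a coding sequence contains and where they sit. This is a minimal sketch: the set of codons treated as rare in E. coli is an illustrative assumption (it should be checked against a codon usage table for the actual extract strain), and the example sequence is made up.

```python
# Illustrative set of codons often described as rare in E. coli (assumption).
RARE_E_COLI = {"AGG", "AGA", "CGA", "CGG", "ATA", "CTA", "CCC", "GGA"}


def rare_codon_report(cds, rare=RARE_E_COLI):
    """Return (list of (codon_position, codon), fraction of rare codons).

    The sequence is read in frame from its first base; a trailing
    partial codon is ignored.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    hits = [(pos, codon) for pos, codon in enumerate(codons, start=1) if codon in rare]
    fraction = len(hits) / len(codons) if codons else 0.0
    return hits, fraction


example_cds = "ATGAGGAAACTACCCGGATAA"  # hypothetical 7-codon sequence
hits, fraction = rare_codon_report(example_cds)
print("Rare codons (position, codon):", hits)
print(f"Fraction rare: {fraction:.2f}")
```

Clusters of rare codons near the start of the gene are the ones I would look at first, since those are the positions where stalling would stop translation earliest.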

Reason 3: Rapid energy depletion

ATP runs out after 30–60 minutes because the energy regeneration system is weak or missing. The reaction stops while plenty of template and amino acids remain.

  • Troubleshooting:

✅ Switch to a Continuous Exchange Cell-Free (CECF) format (dialysis membrane or two-chamber system). This constantly supplies fresh energy and removes waste.

✅ Increase the concentration of your energy regeneration components (e.g., double creatine phosphate from 50 mM to 100 mM).

✅ Use a more efficient energy source: PEP/pyruvate kinase or a maltodextrin-based system.

✅ Check the pH after the reaction – if it dropped below 7.0, your energy system may be producing acid. Switch to creatine phosphate (less pH drop).

Additional common reasons (if the above don’t help):

  • Protein aggregation: Lower temperature to 20–25°C. Add 0.5% detergent or 1 mM DTT.
  • RNase contamination: Use nuclease-free tubes, add RNase inhibitor (e.g., murine RNase inhibitor at 1 U/µL), and wear gloves.
  • Wrong magnesium concentration: Test a range from 8 to 16 mM Mg²⁺. Too low and ribosomes dissociate; too high and they lock up.
Homework question from Kate Adamala

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

Design an example of a useful synthetic minimal cell as follows:

  1. Pick a function and describe it. a. What would your synthetic cell do? What is the input and what is the output? b. Could this function be realized by cell-free Tx/Tl alone, without encapsulation? c. Could this function be realized by genetically modified natural cell? d. Describe the desired outcome of your synthetic cell operation.
  2. Design all components that would need to be part of your synthetic cell. a. What would be the membrane made of? b. What would you encapsulate inside? Enzymes, small molecules. c. Which organism your Tx/Tl system will come from? Is bacterial OK, or do you need a mammalian system for some reason? (hint: for example, if you want to use small molecule modulated promotors, like Tet-ON, you need mammalian) d. How will your synthetic cell communicate with the environment? (hint: are substrates permeable? or do you need to express the membrane channel?)
  3. Experimental details a. List all lipids and genes. (bonus: find the specific genes; for example, instead of just saying “small molecule membrane channel” pick the actual gene.) b. How will you measure the function of your system?

  1. Pick a function and describe it.

a. What would your synthetic cell do? What is the input and what is the output?

My synthetic minimal cell (SMC) is a “killer biosensor” that detects the presence of Staphylococcus aureus and responds by producing and secreting lysostaphin, a specific anti-staphylococcal enzyme.

  • Input: AIP-1 (autoinducing peptide-1), a quorum sensing molecule secreted by S. aureus (Group I strains) when it reaches high cell density.
  • Output: Lysostaphin (27 kDa zinc metalloprotease from Staphylococcus simulans), which specifically cleaves the pentaglycine cross-bridges in the S. aureus cell wall, causing bacterial lysis.

Overall function: The SMC acts as a sentinel that detects S. aureus quorum signaling and releases a targeted killer, preventing infection, biofilm formation, and the spread of antibiotic-resistant strains.

b. Could this function be realized by cell-free Tx/Tl alone, without encapsulation?

No. Without encapsulation, the cell-free reaction would produce lysostaphin immediately and continuously, regardless of whether AIP-1 is present. The SMC would release its output constitutively, wasting the enzyme and providing no sensing function. Encapsulation creates a barrier that allows the system to wait for the input signal before producing the output. Additionally, without a membrane, the membrane-bound receptor AgrC could not be properly inserted and oriented, and lysostaphin would diffuse away uncontrollably instead of being released only after S. aureus is detected.

c. Could this function be realized by genetically modified natural cell?

Yes, in principle, but with significant drawbacks compared to a synthetic minimal cell (SMC). Natural GMOs can grow, divide, and potentially spread in the environment, and they may transfer genes to other bacteria through horizontal gene transfer. They can also mutate over time and lose their function, and the produced antibacterial molecule (e.g., lysostaphin) might harm the host cell itself. In contrast, SMCs do not replicate, cannot transfer genes, and do not evolve, making them safer and more stable. Additionally, their activity is more controlled, since the toxic compound is produced only when needed and released outward, which makes SMCs more suitable for applications such as medical treatments or topical use.

d. Describe the desired outcome of your synthetic cell operation.

In the presence of S. aureus (which secretes AIP-1), the synthetic cell detects AIP-1 via the membrane-bound AgrC receptor. This triggers a phosphorylation cascade that activates AgrA, which then binds the P2 promoter and drives transcription of the lysostaphin gene. Lysostaphin is produced inside the vesicle and secreted into the environment. The released lysostaphin specifically cleaves the pentaglycine bridges in the S. aureus cell wall, causing bacterial lysis and death.

In the absence of AIP-1 (no S. aureus), the synthetic cell remains inactive. The P2 promoter is “off” (no leak), and no lysostaphin is produced. This ensures the toxin is only made when and where it is needed.

  2. Design all components that would need to be part of your synthetic cell.

a. What would the membrane be made of?

The membrane needs to be stable but also allow the AgrC histidine kinase (a transmembrane protein) to insert properly. A suitable choice is liposomes composed of:

  • POPC (1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine) – 60 mol% → Main structural lipid of the membrane
  • Cholesterol – 30 mol% → Increases membrane stability and reduces leakage
  • DOPG (1,2-dioleoyl-sn-glycero-3-phospho-(1’-rac-glycerol)) – 10 mol% → Adds negative charge, which helps the insertion and function of membrane proteins like AgrC
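
To prepare this mixture in practice, the mol% values have to be converted into masses. The sketch below does that conversion for a chosen total lipid amount. The molecular weights are approximate values that I am assuming here (DOPG as the sodium salt) and should be checked against the supplier's data sheet. The function name and the 10 µmol batch size are hypothetical.

```python
# name: (mol%, approximate molecular weight in g/mol) -- MWs are assumptions.
LIPIDS = {
    "POPC": (60.0, 760.08),
    "Cholesterol": (30.0, 386.65),
    "DOPG": (10.0, 797.03),
}


def lipid_masses_mg(total_umol, lipids=LIPIDS):
    """Mass of each lipid (mg) needed for `total_umol` of total lipid."""
    total_pct = sum(pct for pct, _ in lipids.values())
    if abs(total_pct - 100.0) > 1e-9:
        raise ValueError(f"mol% values sum to {total_pct}, expected 100")
    # µmol x g/mol gives µg; divide by 1000 to get mg.
    return {
        name: total_umol * pct / 100.0 * mw / 1000.0
        for name, (pct, mw) in lipids.items()
    }


for name, mg in lipid_masses_mg(10.0).items():  # hypothetical 10 µmol batch
    print(f"{name}: {mg:.2f} mg")
```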

b. What would you encapsulate inside? Enzymes, small molecules.

Inside the synthetic cell, I would encapsulate the basic components needed for protein production and function. First, a cell-free transcription–translation system from Escherichia coli is included, which contains the required machinery such as ribosomes, tRNAs, enzymes, and T7 RNA polymerase to make proteins.

I also add the DNA templates: agrC and agrA genes (from Staphylococcus aureus) under a constitutive promoter to sense the signal and activate the response, and the lysostaphin (lys) gene (from Staphylococcus simulans) under the P2 promoter to produce the antibacterial protein. A secretion signal is fused to the lys gene so the protein can be exported outside the cell.

In addition, I would include small molecules and cofactors such as:

  • ATP, GTP, CTP, UTP (nucleotide triphosphates for transcription)
  • 20 amino acids (building blocks for protein synthesis)
  • Creatine phosphate + creatine kinase (energy regeneration system)
  • Magnesium acetate (10–14 mM) – critical for ribosome function
  • Potassium glutamate (100–150 mM) – maintains ionic strength
  • DTT (1–2 mM) – maintains reducing environment
  • RNase inhibitors – protect mRNA from degradation

c. Which organism will your Tx/Tl system come from?

The Tx/Tl system will come from a bacterial source, specifically an Escherichia coli extract. This is because the AgrC/AgrA system is naturally bacterial and works well in an E. coli cell-free system, where AgrC can insert into liposomes properly. In addition, lysostaphin is a bacterial enzyme that does not require complex modifications, so it can be produced efficiently in this system. Finally, using a bacterial extract is simpler, faster, and cheaper than using a mammalian system, which is not needed in this case.

d. How will your synthetic cell communicate with the environment?

This synthetic cell communicates with its environment in a simple and efficient way using natural bacterial mechanisms:

For input, the signaling molecule AIP-1 does not need to enter the cell; instead, it binds directly to the external part of AgrC, a membrane protein embedded in the liposome. This means the sensor is already on the surface, so no channels are needed.

For output, lysostaphin (a relatively large protein, about 27 kDa) cannot pass through the membrane by diffusion. To solve this, a secretion signal peptide is added to lysostaphin, which directs it to the membrane during its synthesis. The protein is then transported across the membrane through the SecYEG translocon, a natural protein channel present in the Escherichia coli extract. This allows the protein to be released outside the synthetic cell in a controlled and efficient way, without needing artificial pores.

  3. Experimental details

a. List all lipids and genes (specific names).

  • Lipids:
| Lipid | Full name | mol% |
| --- | --- | --- |
| POPC | 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine | 60% |
| Cholesterol | Cholesterol | 30% |
| DOPG | 1,2-dioleoyl-sn-glycero-3-phospho-(1'-rac-glycerol) | 10% |
  • Genes:
| Gene | Source | Promoter | Function |
| --- | --- | --- | --- |
| agrC | S. aureus (Group I, for example strain RN6390) | Constitutive (T7) | Membrane histidine kinase that binds AIP-1 on the extracellular side |
| agrA | S. aureus (same strain) | Constitutive (T7) | Response regulator; when phosphorylated by AgrC, activates P2 promoter |
| lys (lysostaphin), fused to a Sec secretion signal | Staphylococcus simulans | P2 promoter (from S. aureus agr operon) | Zinc metalloprotease that kills S. aureus, directed to be secreted across the membrane via the Sec secretion signal |

Cell-free Tx/Tl system: All machinery for transcription and translation from E. coli extract (ribosomes, tRNAs, aminoacyl-tRNA synthetases, initiation/elongation/termination factors, T7 RNA polymerase).

b. How will you measure the function of your system?

  • Measurement 1: AIP-1 sensing (dose–response)

The synthetic cells can be exposed to different concentrations of AIP-1. After a few hours, lysostaphin production is measured using methods like ELISA, Western blot, or an enzyme activity assay. If the system works properly, higher AIP-1 levels should lead to higher lysostaphin production.

  • Measurement 2: Lysostaphin production (fluorescent reporter)

The lysostaphin gene can be replaced with GFP (green fluorescent protein) under the same promoter. The synthetic cells can then be monitored over time using a plate reader to measure fluorescence. Higher fluorescence indicates stronger gene expression.

  • Measurement 3: Killing of Staphylococcus aureus (functional assay)

The synthetic cells can be incubated with live bacteria in culture medium. After several hours, bacterial growth can be measured using OD600, colony counting (CFU), or live/dead staining. Reduced growth shows that the system is effective.

  • Measurement 4: Secretion efficiency

The synthetic cells can be centrifuged to separate them from the surrounding liquid. Lysostaphin activity is then measured both in the supernatant (outside) and inside the cells. A good system will show most of the protein in the supernatant.

  • Measurement 5: Promoter leakiness (control test)

The synthetic cells can be tested without adding AIP-1 to check background expression. Ideally, very little lysostaphin should be produced. If significant production is observed, the promoter may be leaky and require optimization.
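
The dose–response and leakiness measurements can be summarized with a simple model. As a sketch, I assume that lysostaphin (or GFP reporter) output follows a Hill function of AIP-1 concentration on top of a basal leak. The function names, the parameter values, and the choice of a Hill model are illustrative assumptions, not measured properties of the AgrC/AgrA system.

```python
def hill_response(aip_nM, basal, vmax, k_nM, n=1.0):
    """Predicted output: basal leak plus a Hill-type induced component.

    basal : output with no AIP-1 (promoter leakiness)
    vmax  : maximum induced output above basal
    k_nM  : AIP-1 concentration giving half-maximal induction
    n     : Hill coefficient (steepness)
    """
    if aip_nM < 0:
        raise ValueError("concentration must be non-negative")
    induced = vmax * aip_nM ** n / (k_nM ** n + aip_nM ** n)
    return basal + induced


def max_fold_induction(basal, vmax):
    """Ratio of fully induced output to leak-only output."""
    return (basal + vmax) / basal


# Hypothetical parameters, in arbitrary output units.
for conc in [0, 10, 50, 250, 1000]:
    print(conc, "nM AIP-1 ->", round(hill_response(conc, basal=1.0, vmax=9.0, k_nM=50.0), 2))
print("Maximum fold induction:", max_fold_induction(1.0, 9.0))
```

In this framing, Measurement 5 estimates the basal term, and Measurement 1 estimates the half-maximal concentration and the maximum output; a high basal value directly lowers the achievable fold induction.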

Homework question from Peter Nguyen

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

Freeze-dried cell-free systems can be incorporated into all kinds of materials as biological sensors or as inducible enzymes to modify the material itself or the surrounding environment. Choose one application field — Architecture, Textiles/Fashion, or Robotics — and propose an application using cell-free systems that are functionally integrated into the material. Answer each of these key questions for your proposal pitch:

  • Write a one-sentence summary pitch sentence describing your concept.
  • How will the idea work, in more detail? Write 3-4 sentences or more.
  • What societal challenge or market need will this address?
  • How do you envision addressing the limitation of cell-free reactions (e.g., activation with water, stability, one-time use)?

  1. One-sentence pitch

A wall paint containing synthetic minimal cells that detect toxic mold signals in damp walls and produce enzymes to neutralize mycotoxins and inhibit mold growth.

  2. How will the idea work?

The paint is embedded with microcapsules containing freeze-dried synthetic minimal cells (SMCs). When the wall becomes damp, the SMCs are activated by chemical signals released by mold, such as those from Stachybotrys chartarum. Once triggered, the SMCs produce enzymes or antimicrobial proteins that either degrade mycotoxins or prevent further mold growth. This creates a self-protecting coating that actively reduces mold and mycotoxin levels in real-time, improving indoor air safety.

  3. What societal challenge or market need does this address?

Damp walls and the toxic mold they support are a serious indoor environmental problem. Persistent dampness encourages the growth of black mold, which releases mycotoxins that have been associated with respiratory issues, chronic fatigue, and neurological problems. Current paints only act as passive barriers and do not remove toxins. This smart paint provides active protection, reducing health risks and the need for costly remediation.

  4. How will you address limitations of cell-free systems?

The SMCs are freeze-dried within protective microcapsules, remaining inactive until moisture activates them. Microcapsules shield the system during storage and paint application. Activation only occurs when mold is present, ensuring efficient use. The one-time-use limitation is addressed by applying fresh paint layers during regular maintenance, keeping the wall continuously protected.

Homework question from Ally Huang

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

Freeze-dried cell-free reactions have great potential in space, where resources are constrained. As described in my talk, the Genes in Space competition challenges students to consider how biotechnology, including cell-free reactions, can be used to solve biological problems encountered in space. While the competition is limited to only high school students, your assignment will be to develop your own mock Genes in Space proposal to practice thinking about biotech applications in space!

For this particular assignment, your proposal is required to incorporate the BioBits® cell-free protein expression system, but you may also use the other tools in the Genes in Space toolkit (the miniPCR® thermal cycler and the P51 Molecular Fluorescence Viewer). For more inspiration, check out https://www.genesinspace.org/ .

  1. Provide background information that describes the space biology question or challenge you propose to address. Explain why this topic is significant for humanity, relevant for space exploration, and scientifically interesting. (Maximum 100 words)
  2. Name the molecular or genetic target that you propose to study. Examples of molecular targets include individual genes and proteins, DNA and RNA sequences, or broader -omics approaches. (Maximum 30 words)
  3. Describe how your molecular or genetic target relates to the space biology question or challenge your proposal addresses. (Maximum 100 words)
  4. Clearly state your hypothesis or research goal and explain the reasoning behind it. (Maximum 150 words)
  5. Outline your experimental plan - identify the sample(s) you will test in your experiment, including any necessary controls, the type of data or measurements that will be collected, etc. (Maximum 100 words)

  1. One-sentence summary pitch

We will use a freeze-dried cell-free system to test how microgravity affects protein production using a GFP reporter, providing insight into reduced collagen synthesis in space.

  1. How the idea works

Freeze-dried cell-free reactions containing a GFP reporter gene will be prepared in sealed chambers. In space, they will be rehydrated and incubated under microgravity conditions. GFP fluorescence will act as a direct indicator of protein synthesis efficiency. By comparing fluorescence levels between space and Earth conditions, we can determine whether microgravity directly affects the molecular machinery responsible for producing proteins such as collagen.

  1. Societal challenge / market need

Long-duration space missions lead to bone loss and tissue weakening in astronauts, partly due to reduced production of structural proteins like collagen. Understanding whether this reduction is caused by fundamental limits in protein synthesis will help develop countermeasures for bone loss, injury prevention, and tissue regeneration, improving astronaut health during missions to Mars and beyond.

  1. Limitation of cell-free reactions and how to address them

Cell-free reactions are single-use and require activation by water. To overcome this, we will freeze-dry the reactions in sealed chambers, ensuring long-term stability. The experiment will be activated by rehydration in space, allowing controlled and efficient protein production measurements under microgravity conditions.

  1. Molecular or genetic target

Green Fluorescent Protein (GFP) gene used as a reporter to measure protein synthesis efficiency linked to collagen-related biological processes.

  1. How the target relates to the space biology challenge

Collagen is essential for maintaining bone and tissue structure, but its production decreases in microgravity. Instead of directly expressing collagen, which is complex, GFP is used as a reporter to measure overall protein synthesis efficiency. If microgravity reduces GFP production, it suggests that the basic machinery needed to produce proteins like collagen is affected. This helps determine whether tissue weakening in space is caused by direct physical effects on protein production or by cellular regulation, providing clearer insight into astronaut health challenges.

  1. Hypothesis or research goal

We hypothesize that microgravity reduces protein synthesis efficiency, which contributes to decreased production of structural proteins such as collagen in astronauts. The goal is to measure GFP production in a cell-free system under microgravity and Earth conditions. Since cell-free systems isolate transcription and translation from cellular signaling, any observed decrease in GFP fluorescence would indicate that physical factors—such as altered diffusion, molecular interactions, or protein folding—directly impact protein synthesis. This would suggest that microgravity imposes fundamental constraints on biological processes, helping explain tissue weakening. The results could guide the development of targeted countermeasures to maintain astronaut health during long-duration missions.

  1. Experimental plan

Freeze-dried BioBits® reactions containing GFP DNA will be used. Samples include: (1) microgravity test reactions, (2) Earth-based positive controls, and (3) negative controls without DNA. Reactions will be rehydrated and incubated using the miniPCR®. GFP fluorescence will be measured with the P51 Molecular Fluorescence Viewer. Fluorescence intensity will be compared between conditions to determine whether microgravity reduces protein synthesis efficiency.
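
The comparison between conditions could be expressed as a background-corrected ratio. The sketch below assumes that fluorescence has been converted to numeric intensities (for example, by analyzing photos taken in the viewer, which is my assumption about the workflow) and that the no-DNA negative controls serve as the blank. The function name and all readings are hypothetical.

```python
from statistics import mean


def relative_expression(test, reference, blank):
    """Background-corrected signal of `test` relative to `reference`.

    test      : replicate intensities from microgravity reactions
    reference : replicate intensities from Earth-based positive controls
    blank     : replicate intensities from no-DNA negative controls
    """
    background = mean(blank)
    test_signal = mean(test) - background
    reference_signal = mean(reference) - background
    if reference_signal <= 0:
        raise ValueError("reference signal is not above background")
    return test_signal / reference_signal


# Hypothetical intensity readings in arbitrary units.
ratio = relative_expression(test=[60, 70], reference=[110, 130], blank=[20, 20])
print(f"Microgravity signal relative to Earth control: {ratio:.2f}")
```

A ratio clearly below 1 across replicates would be consistent with the hypothesis, while a ratio near 1 would suggest that the core transcription and translation machinery is not strongly affected.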

  • For this homework, I used DeepSeek and Google as sources of information. ChatGPT was used to improve the structure and clarity of the writing, while Cloud AI was used to generate the illustration of the synthetic minimal cell function.
Homework Part B: Individual Final Project

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

We’d like students to start exploring their final project in depth this week! Of your three Aims, for this week you should have at least Aim 1 decided and written down.

  1. Put your chosen final project slide in the appropriate slide deck following the instructions on slide 1: MIT/Harvard/Wellesley ONE FINAL PROJECT IDEA Committed Listener ONE FINAL PROJECT IDEA
  2. Submit this Final Project selection form if you have not already.
  3. Begin planning how you will write your final project documentation based on these guidelines
  4. Prepare your first DNA order and put it in the “Twist (MIT)” or “Twist (Nodes)” tab of the 2026 HTGAA Ordering: DNA, Reagents, Consumables spreadsheet, as appropriate. First Twist order deadline for MIT/Harvard/Wellesley students is Friday, April 3 at 11PM ET First Twist order deadline for Committed Listeners is Friday, April 10 at 11PM ET. (Your Node Lead will place the Twist order, so please work with them to finalize your constructs and ordering decisions.)

Sources:

Week 10 HW: Imaging And Measurement

cover image cover image
Homework: Final Project

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

For your final project:

  • Please identify at least one (ideally many) aspect(s) of your project that you will measure. It could be the mass or sequence of a protein, the presence, absence, or quantity of a biomarker, etc.
  • Please describe all of the elements you would like to measure, and furthermore describe how you will perform these measurements.
  • What are the technologies you will use (e.g., gel electrophoresis, DNA sequencing, mass spectrometry, etc.)? Describe in detail.

My project aims to express the carbon monoxide dehydrogenase (CODH) pathway from Oligotropha carboxidovorans in Nicotiana tabacum (tobacco) using a two-plasmid system. I need to measure whether the system works at every level, from DNA integration to enzyme function to plant health. Below I describe what I will measure, how I will measure it, and the technologies I will use:

1. Confirming DNA Integration and Sequence

What I measure: Whether the seven CODH genes are present in the tobacco genome and whether their sequences are correct.

How I measure it:

  • Genomic PCR: Extract DNA from leaves, design primers specific to each of my seven codon-optimized genes, run PCR, and look for bands on an agarose gel.
  • Border-specific PCR: Use one primer in the T-DNA border (LB or RB) and one primer in my gene to confirm the entire T-DNA integrated.
  • Sanger sequencing: Send PCR products to a sequencing facility, align the returned sequences against my Benchling design using SnapGene.

Technologies: PCR thermocycler, agarose gel electrophoresis, UV transilluminator, Sanger sequencing service, sequence alignment software.

2. Confirming mRNA Transcription

What I measure: Whether the seven genes are being transcribed into mRNA, and whether the three structural subunits (CoxL, CoxM, CoxS) are expressed at balanced levels.

How I measure it:

  • Extract total RNA from leaves using an RNA extraction kit.
  • Treat with DNase to remove genomic DNA.
  • Convert mRNA to cDNA using reverse transcriptase.
  • Run qPCR with gene-specific primers and SYBR Green.
  • Include reference genes for normalization.
  • Compare Ct values across the three structural subunits.

Technologies: RNA extraction kit, DNase, reverse transcriptase, qPCR machine, SYBR Green.
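The Ct comparison across the three structural subunits can be sketched as a small calculation. This is only an illustration under stated assumptions: it uses the common 2^(-ΔCt) approximation, which assumes roughly 100% primer efficiency, and every Ct value below is hypothetical rather than measured. The function name `relative_expression` is introduced here for the example.

```python
def relative_expression(ct_target: float, ct_reference: float) -> float:
    """Expression of a target gene relative to a reference gene (2^-dCt).

    Assumes ~100% amplification efficiency for both primer pairs.
    """
    delta_ct = ct_target - ct_reference
    return 2.0 ** (-delta_ct)


# Hypothetical Ct values, for illustration only (not measured data).
ct_reference = 18.0
ct_values = {"CoxL": 22.0, "CoxM": 23.0, "CoxS": 22.5}

expression = {
    gene: relative_expression(ct, ct_reference) for gene, ct in ct_values.items()
}

# Express each subunit relative to the least abundant one to judge balance.
lowest = min(expression.values())
ratios = {gene: value / lowest for gene, value in expression.items()}

for gene, ratio in ratios.items():
    print(f"{gene}: {ratio:.2f}x the lowest-expressed subunit")
```

Ratios close to 1 for all three subunits would indicate balanced transcript levels; large ratios would flag a subunit that is over- or under-expressed.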

3. Confirming Protein Presence and Assembly

What I measure: Whether CoxL, CoxM, and CoxS are present, whether the chloroplast transit peptide was cleaved, and whether the three subunits assemble into the complex.

How I measure it:

  • Isolate intact chloroplasts using Percoll gradient centrifugation.
  • Lyse chloroplasts gently and perform Co-IP using anti-FLAG magnetic beads (FLAG is on CoxS).
  • Elute with FLAG peptide.
  • Split the eluate: run one part on Tricine-SDS-PAGE (silver stain) to see individual subunits at 88 kDa (CoxL), 32 kDa (CoxM), and 18 kDa (CoxS).
  • Run the other part on Blue Native PAGE (Coomassie stain) to see the assembled complex at ~280 kDa.
  • For maturation proteins: run anti-FLAG Western (detects CoxD) and anti-His Western (detects His-tagged CoxF) on total chloroplast extract.

Technologies: Ultracentrifuge, anti-FLAG magnetic beads, PAGE equipment, silver stain, Coomassie stain, Western blot transfer system, chemiluminescence imager.


4. Confirming Chloroplast Targeting

What I measure: Whether the chosen chloroplast transit peptides (CTPs) direct proteins to the chloroplast.

How I measure it:

  • Build a separate reporter construct: promoter + CTP + GFP + terminator.
  • Transform into tobacco, select on hygromycin.
  • Take fresh leaf samples, mount on slides with water.
  • Observe under confocal microscope: GFP channel (green) and chlorophyll autofluorescence (red).
  • Calculate Pearson’s correlation coefficient using ImageJ (target >0.7).

Technologies: Confocal laser scanning microscope, ImageJ software.
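For reference, the Pearson's correlation coefficient named in the last step can be computed by hand as follows. This is a minimal sketch and not the ImageJ implementation: it assumes the two channels have already been exported as flattened lists of pixel intensities of equal length, and the intensity values are hypothetical.

```python
import math


def pearson_coefficient(green, red):
    """Pearson's correlation between two equally sized lists of pixel intensities."""
    if len(green) != len(red) or len(green) < 2:
        raise ValueError("channels must have the same number of pixels (at least 2)")
    mean_g = sum(green) / len(green)
    mean_r = sum(red) / len(red)
    covariance = sum((g - mean_g) * (r - mean_r) for g, r in zip(green, red))
    spread_g = sum((g - mean_g) ** 2 for g in green)
    spread_r = sum((r - mean_r) ** 2 for r in red)
    if spread_g == 0 or spread_r == 0:
        raise ValueError("a channel with constant intensity has no defined correlation")
    return covariance / math.sqrt(spread_g * spread_r)


# Hypothetical flattened pixel intensities, for illustration only.
gfp_channel = [10, 40, 80, 120, 200, 30]
chlorophyll_channel = [12, 35, 90, 110, 210, 20]

r = pearson_coefficient(gfp_channel, chlorophyll_channel)
print(f"Pearson r = {r:.3f}, above 0.7 target: {r > 0.7}")
```

A value near 1 means the GFP signal rises and falls with chlorophyll autofluorescence pixel by pixel, which is what correct chloroplast targeting should produce.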

5. Confirming CO Oxidation Activity

What I measure: Whether the assembled CODH enzyme can oxidize CO to CO₂.

How I measure it:

  • Gas phase (whole plant): Place the transformed plant in a sealed transparent chamber, inject CO gas, and record the CO concentration at successive time points using an electrochemical CO sensor.
  • Methylene blue (purified enzyme): Purify the CODH complex via anti-FLAG Co-IP, add it to a reaction with methylene blue and CO in an anaerobic cuvette, and measure absorbance at 600 nm at successive time points. Calculate specific activity (μmol CO/min/mg protein).

Technologies: Sealed gas chamber, electrochemical CO sensor, spectrophotometer, anaerobic cuvettes.
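The specific activity calculation for the methylene blue assay could look roughly like the sketch below. Several things here are assumptions rather than part of the plan above: the extinction coefficient of methylene blue at the measurement wavelength must be taken from the literature (the value in the example is a placeholder), one CO is assumed to be oxidized per dye molecule reduced, and the absorbance readings are hypothetical.

```python
def slope(times_min, values):
    """Least-squares slope of values against time (units of value per minute)."""
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times_min, values))
    denominator = sum((t - mean_t) ** 2 for t in times_min)
    return numerator / denominator


def specific_activity(times_min, absorbance, epsilon_mM_cm, path_cm,
                      volume_ml, protein_mg, co_per_dye=1.0):
    """Specific activity in umol CO per min per mg protein.

    epsilon_mM_cm: extinction coefficient of the dye (mM^-1 cm^-1), from literature.
    co_per_dye: assumed CO molecules oxidized per dye molecule reduced.
    """
    # Absorbance falls as the dye is reduced, so the rate is the negative slope.
    absorbance_drop_per_min = -slope(times_min, absorbance)
    # Beer-Lambert law: 1 mM equals 1 umol/mL.
    dye_mM_per_min = absorbance_drop_per_min / (epsilon_mM_cm * path_cm)
    dye_umol_per_min = dye_mM_per_min * volume_ml
    return dye_umol_per_min * co_per_dye / protein_mg


# Hypothetical readings and a placeholder extinction coefficient.
activity = specific_activity(
    times_min=[0, 1, 2, 3],
    absorbance=[1.0, 0.9, 0.8, 0.7],
    epsilon_mM_cm=10.0,
    path_cm=1.0,
    volume_ml=1.0,
    protein_mg=0.01,
)
print(f"Specific activity = {activity:.2f} umol CO/min/mg")
```

Fitting a slope over several time points, instead of using only the first and last reading, makes the rate less sensitive to a single noisy measurement.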

6. Confirming Cofactor Incorporation

What I measure: Whether the CODH complex contains molybdenum, copper, and iron-sulfur clusters.

How I measure it:

  • ICP-MS: Send purified CODH complex to core facility. Measure Mo, Cu, and Fe content. Calculate metal-to-protein stoichiometry.
  • UV-Vis spectroscopy: Measure absorbance spectrum of purified complex from 300-700 nm. Look for peak at 420 nm (Fe-S clusters).

Technologies: ICP-MS instrument, UV-Vis spectrophotometer.

7. Confirming Electron Transfer Compatibility

What I measure: Whether electrons from CODH go to the photosynthetic electron transport chain or leak to oxygen.

How I measure it:

  • Compare CO oxidation rate in light vs. dark using the gas chamber setup.
  • Calculate light:dark ratio. Ratio >2 indicates electrons go to photosynthetic chain (requires light). Ratio ~1 indicates electrons go directly to oxygen (oxidative stress risk).

Technologies: Sealed gas chamber, electrochemical CO sensor, light source, dark cover.

8. Monitoring Plant Health

What I measure: Whether expressing CODH causes stress or benefits photosynthesis.

How I measure it:

  • Chlorophyll fluorescence (Fv/Fm): Dark-adapt leaf for 20 minutes, measure with PAM fluorometer. Healthy plant = 0.80-0.83.
  • CO₂ assimilation: Use infrared gas analyzer (IRGA) to measure net CO₂ uptake by leaf. Compare transformed vs. wild-type.
  • Biomass: Dry plants at 70°C for 48 hours, weigh shoot and root. Compare transformed vs. wild-type.
  • ROS detection: Stain leaf discs with NBT (detects superoxide, turns blue) and DAB (detects H₂O₂, turns brown). Photograph and quantify staining.

Technologies: PAM fluorometer, LI-COR IRGA, analytical balance, NBT/DAB staining, light microscope, ImageJ.

Image: Histochemical detection of H₂O₂ by DAB staining (a) and superoxide radical by NBT staining (b).
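The Fv/Fm value from the PAM fluorometer is the ratio of variable to maximal fluorescence, (Fm − F0)/Fm. A small sketch of the check against the 0.80-0.83 range quoted above, using hypothetical fluorescence readings and a helper name (`fv_fm`) introduced only for this example:

```python
def fv_fm(f0: float, fm: float) -> float:
    """Maximum quantum yield of photosystem II: (Fm - F0) / Fm."""
    if fm <= 0 or f0 < 0 or f0 > fm:
        raise ValueError("expected 0 <= F0 <= Fm and Fm > 0")
    return (fm - f0) / fm


# Hypothetical dark-adapted readings (arbitrary fluorescence units).
wild_type = fv_fm(f0=300.0, fm=1600.0)
transformed = fv_fm(f0=420.0, fm=1500.0)

for label, value in (("wild-type", wild_type), ("transformed", transformed)):
    in_range = 0.80 <= value <= 0.83
    print(f"{label}: Fv/Fm = {value:.3f}, within healthy range: {in_range}")
```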

9. Monitoring Silencing Over Time

What I measure: Whether expression remains stable across generations (T0 → T1 → T2).

How I measure it:

  • Grow T0 plants (primary transformants), measure mRNA by RT-qPCR.
  • Self-pollinate T0 to obtain T1 seeds.
  • Grow T1 plants, repeat RT-qPCR.
  • Grow T2 plants, repeat RT-qPCR.
  • Calculate silencing index = Expression(T1)/Expression(T0). Index >0.8 = stable.

Technologies: RT-qPCR, plant growth facilities.

Homework: Waters Part I — Molecular Weight

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

We will analyze an eGFP standard on a Waters Xevo G3 QTof MS system to determine the molecular weight of intact eGFP and observe its charge state distribution in the native and denatured (unfolded) states. The conditions for LC-MS analysis of intact protein cause it to unfold and be detected in its denatured form (due to the solvents and pH used for analysis).

  1. Based on the predicted amino acid sequence of eGFP (see below) and any known modifications, what is the calculated molecular weight? You can use an online calculator like the one at https://web.expasy.org/compute_pi/ eGFP Sequence: MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE HHHHHH Note: This contains a His-purification tag (HHHHHH) and a linker (the LE before it).
  2. Calculate the molecular weight of the eGFP using the adjacent charge state approach described in the recitation. Select two charge states from the intact LC-MS data (Figure 1) and:
  3. Determine z for each adjacent pair of peaks (n,n+1) using: n = (m/z(n+1)−1)/(m/z(n)−m/z(n+1))
  4. Determine the MW of the protein using the relationship between m/zn, MW and z.
  5. Calculate the accuracy of the measurement using the deconvoluted MW from 2.2 and the predicted weight of the protein from 2.1 using: Accuracy = (|Calculated MW – Theoretical MW|) / (Theoretical MW) x 1,000,000
  6. Can you observe the charge state for the zoomed-in peak in the mass spectrum for the intact eGFP? If yes, what is it? If no, why not? image image Figure 1. Mass Spectrum of intact eGFP protein from the Waters Xevo G3 LC-MS (a mass spectrometer with 30,000 resolution) with individual charge state peaks labeled with m/z values.

  1. Theoretical Molecular Weight Calculation

The theoretical molecular weight of eGFP was calculated using the ExPASy Compute pI/Mw tool (Swiss Institute of Bioinformatics). The full amino acid sequence of eGFP, including the C-terminal His-tag (HHHHHH) and linker (LE), was entered into the calculator. The computed molecular weight obtained from this tool was: 28006.60 Da. This value was used as the reference theoretical mass for comparison with the experimentally determined molecular weight obtained from LC-MS analysis.

  1. Calculating the Experimental Molecular Weight (MW)

    2.1. Identification of Adjacent Charge States

Step 1: Identifying Two Adjacent Peaks from Figure 1

Let’s use the following values from this figure:

m/z(n) = 903.7148

m/z(n+1) = 875.4421

Step 2: Solve for the Charge State (n)

The relationship between the two peaks is:

n = (m/z(n+1)−1)/(m/z(n)−m/z(n+1))

Let’s plug in our example numbers:

n = (875.4421 – 1) / (903.7148 – 875.4421)

n = 874.4421 / 28.2727

n = 30.93

Since the charge state must be a whole integer, we round this to the nearest whole number. Therefore, n = 31. This means the peak at m/z 903.7148 is the +31 charge state. From this value, we can extract the charge state for the second adjacent peak: n+1 = 32, which means the peak at m/z 875.4421 is the +32 charge state.
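The adjacent charge state formula used above follows from the definition of m/z. A short sketch of the derivation, assuming both peaks come from the same protein of neutral mass M, that every charge is one added proton of mass m_H, and that the peak at lower m/z carries exactly one more charge (the symbols M and m_H are introduced here for illustration):

```latex
\begin{align*}
\left(\tfrac{m}{z}\right)_{n} &= \frac{M + n\,m_H}{n} = \frac{M}{n} + m_H,
\qquad
\left(\tfrac{m}{z}\right)_{n+1} = \frac{M + (n+1)\,m_H}{n+1} = \frac{M}{n+1} + m_H \\[4pt]
\left(\tfrac{m}{z}\right)_{n} - \left(\tfrac{m}{z}\right)_{n+1}
  &= \frac{M}{n} - \frac{M}{n+1} = \frac{M}{n(n+1)} \\[4pt]
\left(\tfrac{m}{z}\right)_{n+1} - m_H &= \frac{M}{n+1} \\[4pt]
\frac{\left(\tfrac{m}{z}\right)_{n+1} - m_H}
     {\left(\tfrac{m}{z}\right)_{n} - \left(\tfrac{m}{z}\right)_{n+1}}
  &= \frac{M}{n+1}\cdot\frac{n(n+1)}{M} = n
\end{align*}
```

With m_H rounded to 1, the last line is the formula used in Step 2, and n is the charge of the peak at the higher m/z.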

2.2. Calculating (MW)

Now that we know n, we can calculate the molecular weight (MW) using the following formula, which accounts for the mass of the protons that add the charge:

m/z = (MW of protein + mass of all added protons) / total number of charges (n)

MW of protein = (m/z x total number of charges (n)) – (mass of all added protons)

Note: the mass of all added protons is the total number of charges (n) x the mass of a proton (H, approximately 1.0078 Da)

Using the +32 peak (m/z 875.4421):

MW = (m/z x z) – (z x H)

MW = (875.4421 x 32) - (32 x 1.0078)

MW = 28014.1472 – 32.2496

MW = 27981.8976 Da

Using the +31 peak (m/z 903.7148) in the same way, I found MW = 27983.917 Da, so the average experimental molecular weight of this protein is ≈ 27982.9073 Da. Compared with the theoretical weight from Step 1 (28006.60 Da), the experimental molecular weight is close but about 23.7 Da lower.
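The charge assignment and molecular weight calculation above can be reproduced with a few lines of code. This is a sketch that reuses the two m/z values and the proton mass from the text; the function names are introduced here for illustration.

```python
PROTON = 1.0078  # Da, the value used in the text above


def charge_of_higher_mz_peak(mz_n: float, mz_n_plus_1: float) -> float:
    """Unrounded charge n of the peak at higher m/z, from two adjacent peaks."""
    return (mz_n_plus_1 - PROTON) / (mz_n - mz_n_plus_1)


def neutral_mass(mz: float, z: int) -> float:
    """Neutral molecular weight from one peak: (m/z * z) - (z * proton mass)."""
    return mz * z - z * PROTON


mz_n, mz_n_plus_1 = 903.7148, 875.4421

n = round(charge_of_higher_mz_peak(mz_n, mz_n_plus_1))
mw_from_n = neutral_mass(mz_n, n)
mw_from_n_plus_1 = neutral_mass(mz_n_plus_1, n + 1)
mw_average = (mw_from_n + mw_from_n_plus_1) / 2

print(f"m/z {mz_n}: z = +{n}, MW = {mw_from_n:.4f} Da")
print(f"m/z {mz_n_plus_1}: z = +{n + 1}, MW = {mw_from_n_plus_1:.4f} Da")
print(f"Average MW = {mw_average:.4f} Da")
```

Averaging over more than two charge states from Figure 1 would reduce the influence of any single peak reading.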

2.3. Calculating the Measurement Accuracy

The formula for accuracy is:

Accuracy = (|Calculated MW – Theoretical MW|) / (Theoretical MW) x 1,000,000

Accuracy = (|27982.9073 – 28006.60|) / (28006.60) x 1,000,000

Accuracy = (23.6927) / (28006.60) x 1,000,000

Accuracy = 845.97 ppm > 50 ppm

The measured mass error (~846 ppm) is far above the acceptable threshold of 50 ppm.

Instrumental factors, such as imperfect calibration of the mass spectrometer, may contribute to this deviation by introducing slight inaccuracies in the measured m/z values. Since the theoretical mass was calculated directly from the provided amino acid sequence, it is unlikely that the discrepancy arises from errors in the protein sequence or its expression. However, the ExPASy value describes the unmodified sequence and does not include known modifications: chromophore maturation lowers the mass of mature GFP by about 20 Da, which would account for most of the 23.7 Da difference.
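The ppm calculation can be checked with a small helper. This sketch uses only the average experimental mass and the theoretical mass quoted in this section; the function name `ppm_error` is introduced for the example.

```python
def ppm_error(measured: float, theoretical: float) -> float:
    """Mass error in parts per million relative to the theoretical mass."""
    return abs(measured - theoretical) / theoretical * 1_000_000


experimental_mw = 27982.9073  # Da, average from the two charge states
theoretical_mw = 28006.60     # Da, from the ExPASy Compute pI/Mw tool

difference = abs(experimental_mw - theoretical_mw)
error = ppm_error(experimental_mw, theoretical_mw)

print(f"Mass difference = {difference:.4f} Da")
print(f"Mass error = {error:.2f} ppm")
```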

  1. Charge State Determination (Zoomed Peak)

No, we cannot. The charge state cannot be determined from the zoomed-in peak because of the relationship between isotope spacing and instrument resolution. Proteins are made of atoms that exist in different isotopic forms, such as 12C and 13C, so molecules of the same protein differ slightly in mass. Neighboring isotopic forms differ by about 1 Da. However, a mass spectrometer measures the mass-to-charge ratio (m/z), so the spacing between isotopic peaks becomes 1/z, where z is the charge. This means that as the charge increases, the spacing between peaks becomes smaller.

For large proteins like eGFP (approximately 28 kDa), the charge state is relatively high. As a result, the spacing between isotopic peaks becomes extremely small. For example, if the charge is around (z ≈ 19), the spacing between peaks is only about 0.05 (m/z). These very small differences are difficult for the instrument to detect.

The limitation comes from the resolution of the mass spectrometer. Resolution refers to the ability of the instrument to distinguish between two very close peaks. In this case, the required spacing (around 0.05 (m/z)) is smaller than what the instrument can clearly resolve. Instead of observing distinct isotopic peaks, the signals merge together and appear as a single broad and jagged peak.

Because the individual isotope peaks are not visible, it is not possible to measure their spacing and determine the charge state directly. Therefore, an alternative approach, such as the adjacent charge state method, must be used to calculate the charge and molecular weight.

Homework: Waters Part II — Secondary/Tertiary structure

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

We will analyze eGFP in its native, folded state and compare it to its denatured, unfolded state on a quadrupole time-of-flight MS. We will be doing MS-only analysis (no liquid chromatography, also known as “direct infusion” experiments) on the Waters Xevo G3-QToF MS.

  1. Based on learnings in the lab, please explain the difference between native and denatured protein conformations. For example, what happens when a protein unfolds? How is that determined with a mass spectrometer? What changes do you see in the mass spectrum between the native and denatured protein analyses (Figure 2)? image image Figure 2. Comparison of the mass spectra between denatured (top) and native (bottom) eGFP standard on the Waters Xevo G3 QTof MS.
  2. Zooming into the native mass spectrum of eGFP from the Waters Xevo G3 QTof MS (see Figure 3), can you discern the charge state of the peak at ~2800 ? What is the charge state? How can you tell? image image Figure 3. Native eGFP mass spectrum from the Waters Xevo G3 Q-Tof MS. The inset is a zoomed-in view of the charge state at ~2800 m/z on a mass spectrometer with 30,000 resolution.

  1. the difference between native and denatured protein conformations

What happens when a protein unfolds?

In its native state, a protein such as eGFP is folded into a compact three-dimensional structure (often described as a beta-barrel). In this conformation, many basic amino acid residues (such as lysine and arginine) are buried inside the protein and are not easily accessible. When the protein becomes denatured, typically due to acidic or organic solvents, it loses this structure and unfolds into a more extended chain. This unfolding exposes a larger surface area and reveals previously hidden basic sites.

How is this determined with a Mass Spectrometer?

Mass spectrometry detects the mass-to-charge ratio (m/z). Because an unfolded protein has more surface area and more exposed basic sites, it can pick up a much higher number of protons (H+) during Electrospray Ionization (ESI). In simple terms:

  • Native (folded) protein: Compacted structure → Fewer exposed basic sites → Binds fewer protons → low charge state (low z)
  • Denatured (unfolded) protein: Extended, flexible structure → More exposed basic sites → Binds more protons → high charge state (high z)

Changes Observed in the Mass Spectrum (Figure 2)

These differences in charge directly affect the mass-to-charge ratio (m/z): since m/z is the mass divided by the charge, a higher charge (z) results in a lower m/z.

  • Denatured (in Green): The peaks are shifted to the left (lower m/z). This is because the charge (z) is high. Since z is the denominator in m/z, a higher charge results in a lower m/z value. The distribution is also very broad, indicating many different charge states are possible for a flexible, unfolded chain.
  • Native (in Red): The peaks are shifted to the right (higher m/z). A folded protein is “shielded,” so it can only pick up a few protons. Fewer protons mean a lower z, which results in a much higher m/z value.
  1. When analyzing Figure 3 of the native mass spectrum of eGFP, I noticed a possible mismatch in the question. The prompt refers to a zoomed-in region around m/z ~2800, but the zoomed image shown in the figure is centered on the peak at m/z ~2545. Because of this mismatch, I analyzed the figure in two complementary ways to make sure the interpretation is complete and correct.

Case 1: Analysis of the zoomed-in region (m/z ~2545)

Although the question mentions ~2800, the zoomed panel clearly shows the peak at m/z ≈ 2545. In this zoomed region, individual isotopic peaks are visible. This is important because isotopic resolution allows us to determine the charge state using peak spacing.

→ Method used: isotopic spacing

  • In mass spectrometry, isotopic peaks of a given charge state are separated by: Δ(m/z) = 1/z
  • The spacing between adjacent isotopic peaks can be read from the labeled values in the zoomed spectrum around ~2544–2545:

2544.8552 → 2544.7637 ≈ 0.0915 m/z

2544.7637 → 2544.6719 ≈ 0.0918 m/z

Average spacing ≈ 0.092 m/z

Calculation: z = 1/0.092 ≈ 10.87, which rounds to 11

Taking into account the other labeled values shown in the figure (around 2545.03–2545.22), the spacing is most consistent with a charge state of +11.
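The isotope spacing method can be written as a short function. This sketch uses the three labeled isotope peaks quoted above; the function name `charge_from_isotopes` is introduced for the example, and the same approach applies to any peak whose isotopes are resolved.

```python
def charge_from_isotopes(peaks_mz):
    """Return (mean isotope spacing, rounded charge) from resolved isotope peaks."""
    peaks = sorted(peaks_mz)
    if len(peaks) < 2:
        raise ValueError("at least two isotope peaks are needed")
    spacings = [upper - lower for lower, upper in zip(peaks, peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    # Adjacent isotopes differ by about 1 Da, so spacing in m/z is about 1/z.
    return mean_spacing, round(1 / mean_spacing)


spacing, z = charge_from_isotopes([2544.6719, 2544.7637, 2544.8552])
print(f"Mean spacing = {spacing:.4f} m/z, charge state = +{z}")
```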

Case 2: Interpretation of the peak at m/z ~2800 (main spectrum)

In the full (non-zoomed) spectrum, there is also a broader peak around m/z ~2800, but it is not zoomed in and its isotopic pattern is not resolved. Therefore, the charge state cannot be read directly from the isotope spacing in this region.

What I did to solve this

Since isotopic resolution is not available at ~2800, I used the adjacent peak relationship between charge states in native mass spectrometry:

  • Neighboring charge states follow predictable shifts in m/z
  • Using the relationship between the 2545 peak and the 2799 peak, with m/z(n) = 2799 and m/z(n+1) = 2545:

n = (m/z(n+1)−1)/(m/z(n)−m/z(n+1))

n = (2545 – 1) / (2799 – 2545)

n = 2544 / 254

n = 10.02

Rounding gives n = 10, so the peak at ~2800 is the +10 charge state and the peak at ~2545 is the +11 charge state, in agreement with the isotope spacing in Case 1.

Homework: Waters Part III — Peptide Mapping - primary structure

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

We will digest the eGFP protein standard into peptides using trypsin (an enzyme that selectively cleaves the peptide bond after Lysine (K) and Arginine (R) residues). The resulting peptides will be analyzed on the Waters BioAccord LC-MS to measure their molecular weights and fragmented to confirm the amino acid sequence within each peptide – generating a “peptide map”. This process is used to confirm the primary structure of the protein.

There are a variety of tools available online to calculate protein molecular weight and predict a list of peptides generated from a tryptic digest. We will be using tools within the online resource Expasy (the bioinformatics resource portal of the Swiss Institute of Bioinformatics (SIB)) to predict a list of tryptic peptides from eGFP.

  1. How many Lysines (K) and Arginines (R) are in eGFP? Please circle or highlight them in the eGFP sequence given in Waters Part I question 1 above. (Note: adding the sequence to Benchling as an amino acid file and clicking biochemical properties tab will show you a count for each amino acid).
  2. How many peptides will be generated from tryptic digestion of eGFP?
    1. Navigate to https://web.expasy.org/peptide_mass/
    2. Copy/paste the sequence above into the input box in the PeptideMass tool to generate expected list of peptides.
    3. Use Figure 4 below as a guide for the relevant parameters to predict peptides from eGFP.
    4. Click “Perform the Cleavage” button in the PeptideMass tool and report the number of peptides generated when using trypsin to perform the digest. image image Figure 4. Example conditions for predicting the number of tryptic peptides from the eGFP standard. Please replicate all parameters shown above.
  3. Based on the LC-MS data for the Peptide Map data generated in lab (please use Figure 5a as a reference) how many chromatographic peaks do you see in the eGFP peptide map between 0.5 and 6 minutes? You may count all peaks that are >10% relative abundance. image image Figure 5a. Total ion chromatogram (TIC) of the eGFP peptide map. The peak at 2.78 minutes is circled, and its MS data is shown in the mass spectrum in Figure 5b, below.
  4. Assuming all the peaks are peptides, does the number of peaks match the number of peptides predicted from question 2 above? Are there more peaks in the chromatogram or fewer?
  5. Identify the mass-to-charge (m/z) of the peptide shown in Figure 5b. What is the charge (z) of the most abundant charge state of the peptide (use the separation of the isotopes to determine the charge state). Calculate the mass of the singly charged form of the peptide (M+H+) based on its m/z and z. image image Figure 5b. Mass spectrum figure to show m/z for the chromatographic peak at 2.78 min from Figure 5a above. The inset is a zoom-in of the peak at 525.76, to discern the isotope peaks. image image Figure 5c. Fragmentation spectrum of the peptide eluting at retention time 2.78 minutes in Figure 5a (above).
  6. Identify the peptide based on comparison to expected masses in the PeptideMass tool. What is mass accuracy of measurement? Please calculate the error in ppm. (Recall that Accuracy = (|Calculated MW – Theoretical MW|) / (Theoretical MW) x 1,000,000 )
  7. What is the percentage of the sequence that is confirmed by peptide mapping? (see Figure 6) image image Figure 6. Amino Acid Coverage Map of eGFP based on BioAccord LC-MS peptide identification data.

Bonus Peptide Map Questions

  1. Can you determine the peptide sequence for the peptide fragmentation spectrum shown in Figure 5c? (HINT: Use your results from Question 2 above to match the peptide molecular weight that is closest to that shown in Figure 5b. Copy and paste its sequence into this tool online to predict the fragmentation pattern based on its amino acid sequence: http://db.systemsbiology.net/proteomicsToolkit/FragIonServlet.html.) What is the sequence of the eGFP peptide that best matches the fragmentation spectrum in Figure 5c?
  2. Does the peptide map data make sense, i.e. do the results indicate the protein is the eGFP standard? Why or why not? Consult with Figure 6, which depicts the % amino acid coverage of peptides positively identified using their calculated mass and fragmentation pattern.

  1. Identification of Cleavage Sites (K and R residues)

To predict the tryptic digestion pattern of eGFP, I first analyzed the amino acid sequence and counted the number of lysine (K) and arginine (R) residues, since trypsin cleaves specifically after these amino acids. From the sequence analysis:

  • Number of Lysine (K): 20
  • Number of Arginine (R): 6
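These counts can be double-checked directly from the eGFP sequence given in Waters Part I. A minimal sketch, with the sequence copied from the prompt and whitespace removed:

```python
# eGFP sequence from Waters Part I, question 1 (spaces are removed below).
EGFP = "".join("""
MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL
VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV
NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD
HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE
HHHHHH
""".split())

lysines = EGFP.count("K")
arginines = EGFP.count("R")

# 1-based positions of every potential tryptic cleavage site (after K or R).
cleavage_sites = [i + 1 for i, residue in enumerate(EGFP) if residue in "KR"]

print(f"Lysine (K): {lysines}")
print(f"Arginine (R): {arginines}")
print(f"Potential cleavage sites: {len(cleavage_sites)}")
```

The number of peptides reported by a digestion tool can be lower than the number of cleavage sites suggests, because tools typically apply settings such as a minimum peptide mass.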
  1. Prediction of Tryptic Peptides

To determine the number of peptides generated after digestion, I used the ExPASy PeptideMass tool by inputting the full eGFP sequence and applying trypsin cleavage conditions. The tool predicted a total of: 19 peptides

The theoretical molecular weight of eGFP used for reference was: Mw (average mass): 28006.60 Da

  1. Chromatographic Peak Analysis

From the total ion chromatogram (TIC) shown in Figure 5a, I counted the number of peaks between 0.5 and 6 minutes, considering only peaks with a relative intensity greater than 10%. The number of observed peaks was: 18

  1. Comparison Between Predicted Peptides and Observed Peaks

The theoretical digestion predicted 19 peptides, while the chromatogram shows 18 peaks, so there are slightly fewer peaks in the chromatogram. The numbers are very close, indicating good agreement between the theoretical prediction and the experimental data.

  1. Peptide Mass and Charge Determination

From Figure 5b, the most abundant peak was observed at: m/z = 525.76

By analyzing the isotope spacing:

  • 526.25918 – 525.76712 = 0.49
  • 526.76845 - 526.25918 = 0.50

Δm/z ≈ 0.5 → z = 1/ Δm/z = 1/ 0.5 = 2

Thus, the peptide is doubly charged (z=2).

The neutral (uncharged) molecular weight was calculated using:

MW = (m/z x z) – (z x H)

MW = (525.76 x 2) – (2 x 1.0078)

MW = 1049.5044 Da

The singly charged form (M+H+) asked for in the question carries one proton:

(M+H+) = 1049.5044 + 1.0078 = 1050.5122

  1. Peptide Identification

Using the predicted peptide list from the ExPASy tool, I compared the singly charged experimental mass (1050.5122) with the theoretical peptide masses, which are listed as (M+H+) values. The closest match was:

  • Peptide sequence: FEGDTLVNR with Theoretical mass: 1050.5214 Da

This indicates that the detected peptide corresponds to this sequence.

Then the mass accuracy was calculated using:

Accuracy = (|Calculated MW – Theoretical MW|) / (Theoretical MW) x 1,000,000

Accuracy = (|1050.5122 – 1050.5214|) / (1050.5214) x 1,000,000

Accuracy = (0.0092) / (1050.5214) x 1,000,000

Accuracy ≈ 8.8 ppm < 10 ppm

  1. Sequence Coverage (Figure 6)

From the coverage map shown in Figure 6, approximately 88% of the eGFP sequence was identified. This high coverage indicates that most of the protein sequence was successfully confirmed through peptide mapping.

Bonus part:

  1. Peptide Sequence Confirmation Using Fragmentation

To confirm the identity of the peptide, I used the mass obtained from the LC-MS analysis and matched it with the predicted tryptic peptides. The peptide with the closest theoretical mass was FEGDTLVNR, with a theoretical mass of 1050.52149 Da. To validate this identification, I used a fragmentation prediction tool to generate the expected b- and y-ion fragments of this peptide.
I then compared these predicted fragments with the experimental MS/MS spectrum shown in Figure 5c. Several peaks in the spectrum matched the predicted fragments, especially the y-ions, such as 1050.52149, 903.45308, 774.41049, and 602.36208, which supports the sequence FEGDTLVNR. The experimental mass of the peptide was 1050.52438 Da, which is very close to the theoretical value. I calculated the mass accuracy using the ppm formula and obtained: accuracy ≈ 2.75 ppm
This very low error (well below 10 ppm) indicates high measurement accuracy and strong agreement between experimental and theoretical data.
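The predicted y-ions and the ppm error quoted above can be reproduced from standard monoisotopic residue masses. The sketch below is not the fragmentation tool used in the homework; it is a hand calculation that assumes singly charged y-ions and the usual rounded monoisotopic masses, so the last decimal place can differ slightly from the tool's output.

```python
# Monoisotopic residue masses (Da) for the amino acids in FEGDTLVNR.
RESIDUE_MASS = {
    "F": 147.06841, "E": 129.04259, "G": 57.02146, "D": 115.02694,
    "T": 101.04768, "L": 113.08406, "V": 99.06841, "N": 114.04293,
    "R": 156.10111,
}
WATER = 18.01056   # Da
PROTON = 1.00728   # Da


def y_ions(peptide: str):
    """Singly charged y-ion m/z values, from y1 up to the full peptide."""
    ions = []
    running_mass = WATER + PROTON
    for residue in reversed(peptide):
        running_mass += RESIDUE_MASS[residue]
        ions.append(running_mass)
    return ions


def ppm_error(measured: float, theoretical: float) -> float:
    """Mass error in parts per million relative to the theoretical mass."""
    return abs(measured - theoretical) / theoretical * 1_000_000


ions = y_ions("FEGDTLVNR")
for index, mz in enumerate(ions, start=1):
    print(f"y{index}: {mz:.4f}")

error = ppm_error(1050.52438, 1050.52149)
print(f"Mass error = {error:.2f} ppm")
```

In this numbering, the four values quoted in the text correspond to y9 (the full peptide), y8, y7, and y5.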

  1. Sequence Coverage and Protein Confirmation

To evaluate whether the results confirm the identity of the protein, I analyzed the sequence coverage shown in Figure 6. The coverage percentage was approximately 88%, indicating that a large portion of the eGFP sequence was successfully identified. Additionally, the identified peptide FEGDTLVNR (positions 115–123) is located within the covered regions of the sequence, confirming that this peptide contributes to the overall sequence identification. This high sequence coverage, along with the accurate peptide identification and fragmentation matching, confirms that the analyzed protein is indeed eGFP. Although some regions are not covered (likely due to peptides that are too small or poorly ionized), the overall results provide strong confidence in the protein identification.

Homework: Waters Part IV — Oligomers

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

We will determine Keyhole Limpet Hemocyanin (KLH)’s oligomeric states using charge detection mass spectrometry (CDMS). CDMS single-particle measurements of KLH allow us to make direct mass measurements to determine what oligomeric states (that is, how many protein subunits combine) are present in solution. Using the known masses of the polypeptide subunits (Table 1) for KLH, identify where the following oligomeric species are on the spectrum shown below from the CDMS (Figure 7):

  • 7FU Decamer
  • 8FU Didecamer
  • 8FU 3-Decamer
  • 8FU 4-Decamer

Polypeptide Subunit Name | Subunit Mass
7FU | 340 kDa
8FU | 400 kDa

Table 1: KLH Subunit Masses

image image Figure 7. Mass spectrum of Keyhole Limpet Hemocyanin (KLH) acquired on the CDMS.


Oligomer Identification Using CDMS

To determine the oligomeric states of Keyhole Limpet Hemocyanin (KLH), I used the subunit masses provided in Table 1 and calculated the expected total mass for each oligomeric form. The given subunit masses are:

  • 7FU = 340 kDa
  • 8FU = 400 kDa

Mass Calculations

For each oligomer, the total mass was calculated by multiplying the subunit mass by the number of subunits:

  • 7FU Decamer (10 subunits): 10 × 340 = 3400 kDa = 3.4 MDa
  • 8FU Didecamer (20 subunits): 20 × 400 = 8000 kDa = 8 MDa
  • 8FU 3-Decamer (30 subunits): 30 × 400 = 12000 kDa = 12 MDa
  • 8FU 4-Decamer (40 subunits): 40 × 400 = 16000 kDa = 16 MDa

Note: While assigning the oligomeric peaks in the CDMS spectrum (Figure 7), I noticed that for the first three oligomers there are clear red peaks, but for the fourth one (~16 MDa), there is only a small blue signal without a corresponding red peak. This made me question why there are two different colors in the spectrum and why the fourth oligomer does not have a red peak.

After looking into this, I understood that the two colors represent different types of data:

  • The blue line corresponds to the raw signal detected by the instrument. It includes all detected ions and therefore appears noisy and irregular.
  • The red peaks correspond to a fitted model (Gaussian fit) generated by the software. This fit is applied to the raw data to determine the most accurate position (center) of each mass peak.

This means that the red peaks represent the most reliable mass values, while the blue signal shows all detected data, including weaker or less clear signals.

Using this understanding, I assigned the oligomers as follows:

  • The peak at 3.4 MDa (red) corresponds to the 7FU decamer
  • The peak at 8.33 MDa (red) corresponds to the 8FU didecamer
  • The peak at 12.67 MDa (red) corresponds to the 8FU 3-decamer
  • For the fourth oligomer (~16.0 MDa), I observed only a small blue “hump” in the region between 16–17 MDa, without any red fitted peak.

This can be explained by the fact that:

  • The signal for this oligomer is much weaker compared to the others
  • There may be fewer particles detected at this mass
  • The signal may be too noisy or not well-defined

Because of this, the software was not able to confidently fit a Gaussian curve, and therefore no red peak was generated. Despite this, the presence of the blue signal at the expected mass range (~16 MDa) still indicates the existence of the 8FU 4-decamer, even if it is less abundant or less stable.
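The mass calculations and peak assignments above can be summarized in a short script. This sketch takes the subunit masses from Table 1 and the fitted peak positions quoted in this answer, and assigns each observed peak to the oligomer with the nearest expected mass; the nearest-mass rule and the helper names are assumptions made for illustration.

```python
SUBUNIT_KDA = {"7FU": 340, "8FU": 400}  # from Table 1

OLIGOMERS = [
    ("7FU decamer", "7FU", 10),
    ("8FU didecamer", "8FU", 20),
    ("8FU 3-decamer", "8FU", 30),
    ("8FU 4-decamer", "8FU", 40),
]


def expected_mass_mda(subunit: str, copies: int) -> float:
    """Expected oligomer mass in MDa from subunit mass (kDa) and copy number."""
    return SUBUNIT_KDA[subunit] * copies / 1000


expected = {name: expected_mass_mda(subunit, copies)
            for name, subunit, copies in OLIGOMERS}


def assign(observed_mda: float) -> str:
    """Name of the oligomer whose expected mass is closest to the observed mass."""
    return min(expected, key=lambda name: abs(expected[name] - observed_mda))


# Fitted peak positions (MDa) read from Figure 7, as listed above.
for observed in (3.4, 8.33, 12.67):
    name = assign(observed)
    deviation = (observed - expected[name]) / expected[name] * 100
    print(f"{observed} MDa -> {name} "
          f"(expected {expected[name]} MDa, {deviation:+.1f}%)")
```

The observed didecamer and 3-decamer peaks sit a few percent above the values calculated from the nominal subunit masses, which is why a nearest-mass assignment is used instead of an exact match.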

Homework: Waters Part V — Did I make GFP?

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required

Please fill out this table with the data you acquired from the lab work done at the Waters Immerse Lab in Cambridge, or else the data screenshots in this document if you were unable to have lab work done at Waters.

| Parameter | Theoretical | Observed / Measured (Intact LC-MS) | PPM Mass Error |
|---|---|---|---|
| Molecular weight (kDa) | 28.0066 | 27.9829 | 846 |
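The PPM mass error in the table can be reproduced from the two masses. The snippet below is a minimal sketch, assuming the usual definition (observed minus theoretical, divided by theoretical, times one million); the function name `ppm_error` is only illustrative.

```python
def ppm_error(theoretical, observed):
    """Signed mass error in parts per million, relative to the theoretical mass."""
    return (observed - theoretical) / theoretical * 1e6


theoretical_kda = 28.0066  # theoretical molecular weight from the table
observed_kda = 27.9829     # intact LC-MS value from the table

error = ppm_error(theoretical_kda, observed_kda)
print(f"Mass difference: {(observed_kda - theoretical_kda) * 1000:.1f} Da")
print(f"Mass error: {error:.0f} ppm")
```

The sign is negative because the observed mass is lower than the theoretical one; the table reports the magnitude (846 ppm).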

For this homework, I used AI tools such as ChatGPT and DeepSeek to help structure my ideas and improve the clarity of my writing. I also used NotebookLM to better understand the provided resources and supporting materials. For the final project measurements, DeepSeek suggested including the last four key measurements, which I integrated into my analysis.



Week 11 HW: Bioproduction & Cloud Labs

Part A: The 1,536 Pixel Artwork Canvas | Collective Artwork

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required
  1. Contribute at least one pixel to this global artwork experiment before the editing ends on Sunday 4/19 at 11:59 PM EST.
    • A personalized URL was sent to the email address associated with your Discourse account, and you can discuss the artwork on the Discourse.
    • If you did not have a chance to contribute, it’s okay, just make sure you become a TA this fall! 😉
  2. Make a note on your HTGAA webpages including:
    • what you contributed to the community bioart project (e.g., “I made part of the DNA on the bottom right plate”)
    • what you liked about the project, and
    • what about this collaborative art experiment could be made better for next year.

Contribution to the Collective Bioart Project


I contributed to several designs during the experiment. My final contribution was trying to create a geometric pattern inspired by Islamic geometric art in the bottom-right corner of the pixel canvas. The design did not stay until the end because other participants kept modifying it, but it was interesting to see how the artwork kept changing with everyone’s input.

What I Liked About the Project

I really liked the collaborative aspect of this project. It was fun to work with others at the same time, contribute to different designs, and watch them change in real time. The canvas was dynamic and creative, and it encouraged experimentation and shared participation.

Suggestions for Improvement

One improvement could be to limit each participant to only one or two pixels. This would encourage more collaboration, because people would need to work together to create designs instead of working alone on bigger parts. It could make the final artwork more coordinated and truly collaborative.

Part B: Cell-Free Protein Synthesis | Cell-Free Reagents

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required
  1. Referencing the cell-free protein synthesis reaction composition (the middle box outlined in yellow on the image above, also listed below), provide a 1-2 sentence description of what each component’s role is in the cell-free reaction.

E. coli Lysate

  • BL21 (DE3) Star Lysate (includes T7 RNA Polymerase)

Salts/Buffer

  • Potassium Glutamate
  • HEPES-KOH pH 7.5
  • Magnesium Glutamate
  • Potassium phosphate monobasic
  • Potassium phosphate dibasic

Energy / Nucleotide System

  • Ribose
  • Glucose
  • AMP
  • CMP
  • GMP
  • UMP
  • Guanine

Translation Mix (Amino Acids)

  • 17 Amino Acid Mix
  • Tyrosine
  • Cysteine

Additives

  • Nicotinamide

Backfill

  • Nuclease Free Water
  2. Describe the main differences between the 1-hour optimized PEP-NTP master mix and the 20-hour NMP-Ribose-Glucose master mix shown in the Google Slide above. (2-3 sentences)
  3. Bonus question: How can transcription occur if GMP is not included but Guanine is?

1. Component Roles (20-Hour NMP–Ribose–Glucose System)


  • E. coli Lysate

BL21 (DE3) Star Lysate: Provides the core cellular machinery required for gene expression, including ribosomes, tRNAs, aminoacyl-tRNA synthetases, and metabolic enzymes. The BL21 (DE3) strain also supplies T7 RNA polymerase, enabling strong transcription from T7 promoters.

  • Salts / Buffer

Potassium Glutamate: Maintains proper ionic strength and mimics the natural intracellular environment, helping stabilize proteins and support enzyme activity.

HEPES-KOH (pH 7.5): Acts as a buffering agent to keep the pH stable, which is essential for maintaining enzyme function during long incubations.

Magnesium Glutamate: Provides Mg²⁺ ions, which are essential cofactors for ribosome stability, RNA polymerase activity, and interactions with nucleic acids.

Potassium Phosphate (monobasic/dibasic): Serves as a secondary buffer and provides inorganic phosphate needed for ATP regeneration and nucleotide metabolism.

  • Energy / Nucleotide System

Ribose: Feeds into the pentose phosphate pathway to generate precursors (like PRPP) required for nucleotide synthesis.

Glucose: Acts as the main energy source, supporting ATP production through metabolic pathways such as glycolysis.

AMP, CMP, UMP: These nucleoside monophosphates (NMPs) are low-cost precursors that are enzymatically converted into NTPs (ATP, CTP, UTP) for RNA synthesis.

Guanine: Supplied as a nucleobase that is converted into GMP through salvage pathways, then further phosphorylated into GTP for transcription.

  • Translation Mix (Amino Acids)

17 Amino Acid Mix + Tyrosine + Cysteine: Together these provide all 19 amino acids listed for the reaction. Tyrosine and cysteine are added separately because tyrosine is poorly soluble at neutral pH and cysteine oxidizes readily, so neither keeps well in a standard combined stock.

  • Additives / Backfill

Nicotinamide: Acts as a precursor for NAD⁺, an important cofactor in metabolic reactions that support long-term energy regeneration.

Nuclease-Free Water: Serves as the solvent to adjust final concentrations while preventing degradation of DNA or RNA by nucleases.

2. Differences Between 1-Hour PEP-NTP and 20-Hour NMP–Ribose–Glucose Systems

The 1-hour PEP-NTP system is designed for rapid protein production by providing ready-to-use NTPs and a high-energy phosphate donor (PEP), allowing fast transcription and translation but for a short duration due to quick depletion of resources. In contrast, the 20-hour NMP–Ribose–Glucose system uses cheaper precursors (NMPs, ribose, glucose) and relies on the lysate’s metabolic pathways to gradually regenerate NTPs and energy, enabling longer and more sustained protein production.

3. Bonus Question

Transcription can still occur without GMP because the system includes guanine, which is converted into GMP through the salvage pathway. In this process, guanine is combined with PRPP (derived from ribose metabolism) to form GMP, which is then phosphorylated into GDP and GTP. The produced GTP is then used by T7 RNA polymerase for RNA synthesis.

Part C: Planning the Global Experiment | Cell-Free Master Mix Design

Assignees for this section

  • MIT/Harvard students Required
  • Committed Listeners Required
  1. Given the 6 fluorescent proteins we used for our collaborative painting, identify and explain at least one biophysical or functional property of each protein that affects expression or readout in cell-free systems. (Hint: options include maturation time, acid sensitivity, folding, oxygen dependence, etc) (1-2 sentences each)

The amino acid sequences are shown in the HTGAA Cell-Free Benchling folder.

  2. Create a hypothesis for how adjusting one or more reagents in the cell-free mastermix could improve a specific biophysical or functional property you identified above, in order to maximize fluorescence over a 36-hour incubation. Clearly state the protein, the reagent(s), and the expected effect.

  3. The second phase of this lab will be to define the precise reagent concentrations for your cell-free experiment. You will be assigned artwork wells with specific fluorescent proteins and receive an email with instructions this week (by April 24). You can begin composing master mix compositions here.

  4. The final phase of this lab will be analyzing the fluorescence data we collect to determine whether we can draw any conclusions about favorable reagent compositions for our fluorescent proteins. This will be due a week after the data is returned (date TBD!).

The reaction composition for each well will be as follows:

  • 6 μL of Lysate
  • 10 μL of 2X Optimized Master Mix from above
  • 2 μL of assigned fluorescent protein DNA template
  • 2 μL of your custom reagent supplements
  • Total: 20 μL reaction
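The per-well composition listed above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the dictionary keys are my own labels, and the dilution factor assumes a reagent is delivered entirely within the 2 μL supplement volume.

```python
# Volumes per well, in microlitres, as listed in the assignment.
volumes_ul = {
    "lysate": 6.0,
    "2X master mix": 10.0,
    "DNA template": 2.0,
    "custom supplements": 2.0,
}

total_ul = sum(volumes_ul.values())

# A 2X stock occupying half of the final volume ends up at 1X.
master_mix_final_x = 2.0 * volumes_ul["2X master mix"] / total_ul

# Anything delivered in the supplement volume is diluted by this factor.
supplement_dilution = total_ul / volumes_ul["custom supplements"]

# Fraction of the reaction volume that is lysate.
lysate_fraction = volumes_ul["lysate"] / total_ul

print(f"Total volume: {total_ul:.0f} uL")
print(f"Master mix final strength: {master_mix_final_x:.0f}X")
print(f"Supplement dilution: {supplement_dilution:.0f}-fold")
print(f"Lysate fraction: {lysate_fraction:.0%}")
```

In practice this means a supplement stock has to be ten times more concentrated than the final concentration it is meant to contribute to the well.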

1. Fluorescent Protein Properties Affecting Cell-Free Expression

The biophysical properties of fluorescent proteins (FPs), including folding efficiency, maturation time, pH sensitivity, oxygen dependence, and structural stability, play a critical role in determining their fluorescence output in cell-free systems, especially during extended incubations such as 36 hours.

sfGFP (Superfolder GFP)

This protein exhibits very fast folding and high structural stability, with efficient chromophore maturation that is oxygen-dependent. Its resistance to misfolding and aggregation allows it to maintain strong and consistent fluorescence over long incubation periods, making it a reliable reference protein in cell-free systems.

mRFP1 (Monomeric Red Fluorescent Protein 1)

It is characterized by slow maturation kinetics and incomplete chromophore formation, which delays the appearance of fluorescence. Additionally, it may form non-fluorescent intermediates, leading to lower overall signal intensity compared to more advanced red fluorescent proteins.

mKO2 (Monomeric Kusabira Orange 2)

This protein shows relatively fast maturation and high brightness, but its chromophore formation is strongly dependent on oxygen availability and can also be influenced by temperature. In conditions with limited oxygen or suboptimal temperature, its fluorescence intensity may be reduced.

mTurquoise2

It has a complex maturation mechanism and high quantum yield but is sensitive to environmental conditions such as pH and oxygen levels. Acidic conditions can reduce fluorescence, while insufficient oxygen may limit proper chromophore formation.

mScarlet-I

This protein is known for its very high brightness due to an excellent extinction coefficient and quantum yield. However, its performance depends on proper folding, and it can be sensitive to temperature or conditions that promote misfolding, which may reduce fluorescence output.

Electra2

Electra2 is engineered for rapid maturation and improved performance under reducing conditions commonly found in cell-free systems. Its stability in such environments allows it to maintain fluorescence where other proteins may struggle, although its long-term stability or photostability may vary depending on conditions.

2. Hypothesis (Electra2 Optimization)

For Electra2, fluorescence output over a 36-hour incubation may be limited by the availability of nucleotides and the sustainability of transcription in the cell-free system. I hypothesize that increasing the concentrations of ribose, the nucleoside monophosphates (AMP, CMP, UMP), and the nucleobase guanine will enhance the regeneration of nucleoside triphosphates (NTPs) through the lysate’s metabolic pathways. Ribose can be converted into phosphoribosyl pyrophosphate (PRPP), which is required for nucleotide synthesis, while the NMPs and guanine serve as precursors that are enzymatically converted into NTPs. With more of these components, the system should maintain a continuous supply of NTPs, sustaining transcription by T7 RNA polymerase and increasing mRNA production over time. This enhanced transcriptional activity is expected to support prolonged translation and lead to higher cumulative protein production and fluorescence intensity over the 36-hour period. The strategy is particularly suitable for Electra2, which is designed for rapid maturation and can efficiently convert increased protein synthesis into measurable fluorescence.


3. Master Mix Design: The Three-Well Strategy

To test this hypothesis, I designed three distinct reagent compositions to identify the “sweet spot” between fuel availability and metabolic stability.

  • Mix 1: The “Maximized” Fuel Mix (Well Q4-H20)

Goal: To test the absolute capacity of the system by pushing precursors to the high end.

Key Adjustments: Ribose was increased to 19.0 g/L (+63.4 %) and NMPs (AMP/CMP/UMP) were increased by 60-100 %. Guanine was doubled to 0.313 mM to provide a surplus of base molecules for the salvage pathway.

  • Mix 2: The “Intermediate” Mix (Well Q4-G21)

Goal: To establish a bridge between the standard mix and the maximum boost.

Key Adjustments: Ribose was set at 15.0 g/L (+29%) and NMPs/Guanine were increased by 20-33 %. This well helps determine if the “Max” mix is overkill or if a moderate increase is sufficient.

  • Mix 3: The “Direct Supply” Mix (Well Q4-I21)

Goal: To test if bypassing the enzymatic salvage of Guanine improves initial speed.

Key Adjustments: While keeping ribose and the NMPs at the Intermediate levels, I added 0.500 mM GMP and set guanine to 0.156 mM (half of the Mix 1 level). This tests whether providing a direct nucleotide (GMP) is more efficient for Electra2 than relying on Guanine-to-GMP conversion.

Final Concentration Comparison Table

| Component | Mix 1 (Max Fuel) | Mix 2 (Intermediate) | Mix 3 (Direct Supply) |
|---|---|---|---|
| Cell Lysate | 1X (6.00 µL) | 1X (6.00 µL) | 1X (6.00 µL) |
| DNA Template | 50 nM (2.00 µL) | 50 nM (2.00 µL) | 50 nM (2.00 µL) |
| Ribose | 19.000 g/L | 15.000 g/L | 15.000 g/L |
| AMP | 1.000 mM | 0.750 mM | 0.750 mM |
| CMP | 0.750 mM | 0.500 mM | 0.500 mM |
| UMP | 0.750 mM | 0.500 mM | 0.500 mM |
| GMP | 0.000 mM | 0.000 mM | 0.500 mM |
| Guanine | 0.313 mM | 0.188 mM | 0.156 mM |
| Potassium Glutamate | 312.563 mM | 312.563 mM | 312.563 mM |
| Magnesium Glutamate | 6.975 mM | 6.975 mM | 6.975 mM |
| HEPES-KOH (pH 7.5) | 45.000 mM | 45.000 mM | 45.000 mM |
| 17 Amino Acid Mix | 4.063 mM | 4.063 mM | 4.063 mM |
| Glucose | 1.250 g/L | 1.250 g/L | 1.250 g/L |
| Nicotinamide | 3.125 mM | 3.125 mM | 3.125 mM |
| Backfill (NF Water) | 0.175 µL | 1.225 µL | 1.150 µL |
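The percentage increases quoted for the mixes can be cross-checked by back-calculating the baseline they imply. The sketch below assumes each percentage is relative to the standard master mix concentration, which is not listed in this section; `implied_baseline` is an illustrative helper name.

```python
def implied_baseline(new_value, percent_increase):
    """Baseline concentration implied by a value quoted as 'baseline + X %'."""
    return new_value / (1 + percent_increase / 100)


# Ribose: 19.0 g/L is quoted as +63.4 %, and 15.0 g/L as +29 %.
ribose_from_mix1 = implied_baseline(19.0, 63.4)
ribose_from_mix2 = implied_baseline(15.0, 29.0)

# Guanine: 0.313 mM is described as "doubled", i.e. +100 %.
guanine_baseline = implied_baseline(0.313, 100.0)

print(f"Ribose baseline implied by Mix 1: {ribose_from_mix1:.2f} g/L")
print(f"Ribose baseline implied by Mix 2: {ribose_from_mix2:.2f} g/L")
print(f"Guanine baseline implied by Mix 1: {guanine_baseline:.4f} mM")
```

Both ribose figures point to the same baseline of about 11.63 g/L, and the implied guanine baseline matches the 0.156 mM used in Mix 3 to within rounding, so the quoted percentages are internally consistent.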

4. Data Analysis Strategy

Once the 36-hour fluorescence data is returned, I will compare the slopes and peak intensities of these three wells.

  • Validation: If Mix 1 > Mix 2 > Standard, this would indicate that raw fuel (ribose and nucleotide precursors) was the limiting factor.
  • Metabolic Insights: If Mix 3 reaches a plateau faster than Mix 2, this would suggest that the enzymatic conversion of Guanine to GMP was a kinetic bottleneck for Electra2 production.
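The slope and peak comparison could be done along the following lines. This is a sketch only: `summarize_curve` is a hypothetical helper, and the time points and readings are made-up placeholders to show the calculation, not measured data.

```python
def summarize_curve(times_h, fluorescence):
    """Return (peak value, time of first peak, steepest slope) for one well."""
    times_h = list(times_h)
    fluorescence = list(fluorescence)
    if len(times_h) != len(fluorescence) or len(times_h) < 2:
        raise ValueError("need at least two matching time/fluorescence points")

    peak = max(fluorescence)
    time_of_peak = times_h[fluorescence.index(peak)]

    # Finite-difference slope between each pair of consecutive readings.
    slopes = [
        (fluorescence[i + 1] - fluorescence[i]) / (times_h[i + 1] - times_h[i])
        for i in range(len(times_h) - 1)
    ]
    return peak, time_of_peak, max(slopes)


# Placeholder readings (arbitrary units), NOT experimental results.
times_h = [0, 6, 12, 18, 24, 30, 36]
placeholder_wells = {
    "Mix 2 (placeholder)": [0, 100, 300, 450, 500, 500, 500],
    "Mix 3 (placeholder)": [0, 200, 420, 490, 500, 500, 500],
}

for name, readings in placeholder_wells.items():
    peak, t_peak, max_slope = summarize_curve(times_h, readings)
    print(f"{name}: peak={peak} a.u. at {t_peak} h, max slope={max_slope:.1f} a.u./h")
```

With real data, the same summary per well would make it easy to see whether a mix reaches its plateau earlier (steeper maximum slope) or higher (larger peak).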

Part D: Build-A-Cloud-Lab | (optional) Bonus Assignment

Assignees for this section

  • MIT/Harvard students optional
  • Committed Listeners optional

Use this simulation tool to create an interesting looking cloud lab out of the Ginkgo Reconfigurable Automation Carts. This is just a minimal implementation so far, but I would love to see some fun designs!

Sources:

  • Banks, A. M., Whitfield, C. J., Brown, S. R., Fulton, D. A., Goodchild, S. A., Grant, C., Love, J., Lendrem, D. W., Fieldsend, J. E., & Howard, T. P. (2022). Key reaction components affect the kinetics and performance robustness of cell-free protein synthesis reactions. Computational and Structural Biotechnology Journal, 20, 218–229. https://doi.org/10.1016/j.csbj.2021.12.013
  • Burrington, L. R., Watts, K. R., & Oza, J. P. (2021). Characterizing and Improving Reaction Times for E. coli-Based Cell-Free Protein Synthesis. ACS Synthetic Biology, 10(8), 1821–1829. https://doi.org/10.1021/acssynbio.1c00195
  • Deng, H., Callender, R., Schramm, V. L., & Grubmeyer, C. (2010). Pyrophosphate Activation in Hypoxanthine-Guanine Phosphoribosyltransferase with Transition State Analogue. Biochemistry, 49(12), 2705–2714. https://doi.org/10.1021/bi100012u
  • Dopp, B. J. L., Tamiev, D. D., & Reuel, N. F. (2019). Cell-free supplement mixtures: Elucidating the history and biochemical utility of additives used to support in vitro protein synthesis in E. coli extract. Biotechnology Advances, 37(1), 246–258. https://doi.org/10.1016/j.biotechadv.2018.12.006
  • Dudzinska, W., Lubkowska, A., Dolegowska, B., Safranow, K., & Jakubowska, K. (2010). Adenine, guanine and pyridine nucleotides in blood during physical exercise and restitution in healthy subjects. European Journal of Applied Physiology, 110(6), 1155–1162. https://doi.org/10.1007/s00421-010-1611-7
  • Gregorio, N. E., Levine, M. Z., & Oza, J. P. (2019). A User’s Guide to Cell-Free Protein Synthesis. Methods and Protocols, 2(1), 24. https://doi.org/10.3390/mps2010024
  • Hashimura, H., Nakagawa, H., & Sawai, S. (2025). Use of blue fluorescent protein Electra2 for live-cell imaging in Dictyostelium discoideum. microPublication Biology. https://doi.org/10.17912/micropub.biology.001774
  • Hove-Jensen, B., Andersen, K. R., Kilstrup, M., Martinussen, J., Switzer, R. L., & Willemoës, M. (2016). Phosphoribosyl Diphosphate (PRPP): Biosynthesis, Enzymology, Utilization, and Metabolic Significance. Microbiology and Molecular Biology Reviews : MMBR, 81(1), e00040-16. https://doi.org/10.1128/MMBR.00040-16
  • Jiang, L., Zhao, J., Lian, J., & Xu, Z. (2018). Cell-free protein synthesis enabled rapid prototyping for metabolic engineering and synthetic biology. Synthetic and Systems Biotechnology, 3(2), 90–96. https://doi.org/10.1016/j.synbio.2018.02.003
  • Jiang, N., Ding, X., & Lu, Y. (2021). Development of a robust Escherichia coli-based cell-free protein synthesis application platform. Biochemical Engineering Journal, 165, 107830. https://doi.org/10.1016/j.bej.2020.107830
  • Krinsky, N., Kaduri, M., Shainsky-Roitman, J., Goldfeder, M., Ivanir, E., Benhar, I., Shoham, Y., & Schroeder, A. (2016). A Simple and Rapid Method for Preparing a Cell-Free Bacterial Lysate for Protein Synthesis. PLOS ONE, 11(10), e0165137. https://doi.org/10.1371/journal.pone.0165137
  • Vengut-Climent, E., Peñalver, P., Lucas, R., Gómez-Pinto, I., Aviñó, A., Muro-Pastor, A. M., Galbis, E., de Paz, M. V., Fonseca Guerra, C., Bickelhaupt, F. M., Eritja, R., González, C., & Morales, J. C. (2018). Glucose-nucleobase pairs within DNA: Impact of hydrophobicity, alternative linking unit and DNA polymerase nucleotide insertion studies. Chemical Science, 9(14), 3544–3554. https://doi.org/10.1039/c7sc04850e
  • Zhang, Y., Huang, Q., Deng, Z., Xu, Y., & Liu, T. (2018). Enhancing the efficiency of cell-free protein synthesis system by systematic titration of transcription and translation components. Biochemical Engineering Journal, 138, 47–53. https://doi.org/10.1016/j.bej.2018.07.001