Homework

Weekly homework submissions:

Week 1 HW: Principles and Practices
Class Assignment — DUE BY START OF FEB 10 LECTURE (1) First, describe a biological engineering application or tool you want to develop and why. This could be inspired by an idea for your HTGAA class project and/or something for which you are already doing in your research, or something you are just curious about.
Week 2 HW: Read, write & edit
Homework Week 2 Part 1: Benchling & In-silico Gel Art Import the Lambda DNA. Simulate Restriction Enzyme Digestion with the following Enzymes: EcoRI HindIII BamHI KpnI EcoRV SacI SalI Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks.
Week 3 HW: Lab automation
Homework Week 3 Assignment: Python Script for Opentrons Artwork One of the great parts about having an automated robot is being able to precisely mix, deposit, and run reactions without much intervention, and design and deploy experiments remotely. For this week, we’d like for you to do the following:

Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.

Week 4 HW: Protein Design Part I
Part A. Conceptual Questions Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons) Assuming the meat is a red meat like beef, there would be approximately 20-25g of protein per 100g of meat [1, 2].

Week 5 HW: Protein Design Part II
Part A: SOD1 Binder Peptide Design (From Pranam) Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc. Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.
Week 6 HW: Genetic Circuits Part I: Assembly Technologies
Assignment: DNA Assembly 1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose? The template DNA is the mUAV plasmid, used at 20 ng/µL, with 0.8 µL added to the reaction. The primers are “colour forward” and “colour reverse”. Given that the stock concentration is 5 µM, using 2.5 µL of each primer in a total reaction volume of 25 µL results in a final primer concentration of 0.5 µM.
Week 7 Genetic Circuits Part II
Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions? The advantages of IANNs over traditional circuits include: (i) Continuous processing which allows them to constantly measure changes in concentration gradients of cellular inputs rather than just their absolute presence or absence. (ii) Relatively easier to scale up. That is, new inputs can be programmed by integrating additional weighted connections to existing nodes without completely rewiring the circuit.

Week 9 — Cell-Free Systems
General homework questions

Explain the main advantages of cell-free protein synthesis over traditional in vivo methods, specifically in terms of flexibility and control over experimental variables. Name at least two cases where cell-free expression is more beneficial than cell production. Advantages of cell-free protein synthesis (CFPS) over traditional in-vivo methods: (i) Greater flexibility and control: Given that cells do not need to stay “alive” and the absence of a cell wall, it is possible to manipulate cells in real time; add chaperones, cofactors etc [1].

Week 10 — Imaging and measurement
Homework: Waters Part I — Molecular Weight We will analyze an eGFP standard on a Waters Xevo G3 QTof MS system to determine the molecular weight of intact eGFP and observe its charge state distribution in the native and denatured (unfolded) states. The conditions for LC-MS analysis of intact protein cause it to unfold and be detected in its denatured form (due to the solvents and pH used for analysis).
Week 11 — Bioproduction & Cloud Labs
Homework: Week 11 — Bioproduction & Cloud Labs Part A: The 1,536 Pixel Artwork Canvas | Collective Artwork

Contribute at least one pixel to this global artwork experiment before the editing ends on Sunday 4/19 at 11:59 PM EST. • A personalized URL was sent to the email address associated with your Discourse account, and you can discuss the artwork on the Discourse. • If you did not have a chance to contribute, it’s okay, just make sure you become a TA this fall! 😉

Week 1 HW: Principles and Practices

Class Assignment — DUE BY START OF FEB 10 LECTURE

(1) First, describe a biological engineering application or tool you want to develop and why. This could be inspired by an idea for your HTGAA class project and/or something for which you are already doing in your research, or something you are just curious about.

By leveraging biological engineering tools such as CRISPR systems, I would like to develop highly specific nucleic acid biosensors and synthetic circuits that detect M. tuberculosis and its resistance mutations quickly and precisely. The inspiration comes from my MSc project, where I studied the genomic epidemiology of multidrug-resistant tuberculosis (MDR-TB) using whole genome sequencing (WGS) data. My work focused on downstream analyses (phylogenetics, transmission clustering, regression, and machine learning), with particular attention to population structure and epidemiological interpretation. During that project, I found that MDR-TB genomic data are geographically imbalanced, which limits how well they represent global MDR-TB patterns and, ultimately, hinders timely detection and treatment. This is especially true in high-burden countries. I would therefore like to explore biosensors and genetic circuitry as an additional layer of surveillance alongside traditional methods: a biosensor or genetic circuit engineered to detect specific MDR-TB resistance markers or lineage-specific sequences, potentially using luminescence as a real-time readout to provide rapid, high-throughput signals.

Brief on the biology and possible mechanism for the tool: 🛠️ 🧬

Unlike many other bacteria, which can share drug-resistance genes through horizontal gene transfer, Mycobacterium tuberculosis mainly becomes drug resistant through mutations in its own DNA, namely single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) [1]. At the same time, the ability of M. tuberculosis to persist within human hosts exposes it to prolonged immune pressure, driving adaptive changes in virulence-associated loci such as phoR, mymA and the mce1 operon that can influence how different lineages transmit or interact with particular human populations [2–4]. The proposed tool could therefore take one of two forms: a CRISPR-based biosensor programmed to recognise TB resistance mutations, or an engineered genetic circuit that produces a light or electrical signal only when multiple resistance signatures are present. Either device would convert the presence of specific mutations into a measurable output that can be read rapidly and fed into surveillance models.

(2) Next, describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future, like ensuring non-malfeasance (preventing harm). Break big goals down into two or more specific sub-goals. Below is one example framework (developed in the context of synthetic genomics) you can choose to use or adapt, or you can develop your own. The example was developed to consider policy goals of ensuring safety and security, alongside other goals, like promoting constructive uses, but you could propose other goals for example, those relating to equity or autonomy.

Governance Goal 1: Prevent harm or misuse

Because genomic data can be geo-located and time-stamped, there are risks of community stigmatization and political duress. To mitigate these risks, the governance goal should implement frameworks that: (i) require ethical review and oversight of bio-sensor data and its secondary uses; (ii) establish strict guidelines on how precisely location data can be shared or publicized; (iii) establish clear accountability mechanisms for state and private actors.

Governance Goal 2: Promote equity in data collection, analysis and development

To avoid further exacerbating inequities in biological data collection and use, the framework will implement mechanisms that ensure: (i) control of locally generated data by the implementing country; (ii) inclusion of the implementing country as an equal partner in analysis and interpretation; (iii) prioritization of under-sampled regions to improve representativeness, with outputs linked to timely access to treatment and care.

(3) Next, describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”). Try to outline a mix of actions (e.g. a new requirement/rule, incentive, or technical strategy) pursued by different “actors” (e.g. academic researchers, companies, federal regulators, law enforcement, etc). Draw upon your existing knowledge and a little additional digging, and feel free to use analogies to other domains (e.g. 3D printing, drones, financial systems, etc.).

Purpose: What is done now and what changes are you proposing?
Design: What is needed to make it “work”? (including the actor(s) involved - who must opt-in, fund, approve, or implement, etc)
Assumptions: What could you have wrong (incorrect assumptions, uncertainties)?
Risks of Failure & “Success”: How might this fail, including any unintended consequences of the “success” of your proposed actions?

Governance Action 1: Regulation and creation of standards for early-stage bio-sensor development

Purpose: Early-stage bio-sensor development is currently guided by general bio/genetic engineering practice, but it carries safety and bio-security risks that need dedicated attention. I am proposing specific standards and regulatory requirements for early-stage biosensor design, ensuring safety, transparency, and responsible innovation before deployment. This could take the form of new regulatory support or reference diagnostics.

Design: Actors may include public health agencies, national regulators in science, and diagnostic developers. Establish validation criteria, accuracy thresholds, metadata standards, and geolocation safeguards. In addition, embed standards into existing public health TB surveillance programmes.

Assumptions: This initiative assumes that regulators will be quick to evaluate bio-sensor technologies. It also assumes that public health surveillance programmes will be quick to agree on and implement the technology across the existing surveillance system.

Risk of failure: Bureaucracy may hinder technological innovation and deployment. Unintended consequences include a premature reliance on bio-sensor technology which could lead to false positive cases and mis-directed public health strategies.

Governance Action 2: Pre-sequencing rapid signal regulatory pathways

Purpose: Currently, bio-sensor outputs such as CRISPR signals are not integrated with genomic data in the surveillance systems of low- and middle-income countries. I therefore propose creating formal pathways that enable rapid biosensor signals to feed into surveillance systems before whole genome sequencing (WGS), with defined quality, privacy, and data use standards.

Design: Actors include public health agencies, national regulators in science, and diagnostic developers. Actors may also include international bodies such as the WHO. There may be potential to expand the WHO’s ‘attributes and principles on genomic data-sharing platforms supporting surveillance of pathogens’ [5–7].

Assumptions: This assumes that developers implement the required standards and metadata. It also assumes that public health agencies can incorporate new signal streams effectively.

Risk of failure: There may be disagreements about integration into existing surveillance pathways. State agencies may lack the technical expertise to train workers to evaluate, interpret and act on rapid biosensor signals. This could lead to misinterpretation and/or delayed action.

Governance Action 3: Ethical data access and sharing standards (with local and community engagement requirements)

Purpose: Many genomic and bio-engineering projects lack consistent standards for privacy, consent, equity, and local engagement. One proposed change is the mandatory implementation of ethical standards for data access, combined with mandatory local/community engagement, to ensure transparency and equitable benefit-sharing.

Design: Develop standardised model data agreements which specify permissible uses, benefit-sharing obligations, and consent mechanisms. Furthermore, advisory boards and steering committees can be established to ensure engagement, feedback, and regular assessment of processes.

Assumptions: This assumes that communities where the technology is planned to be implemented will agree to engage meaningfully. It also assumes that cross-country coordination on ethical standards will be possible.

Risk of failure: Strict data provisions may slow down implementation, collection and action. Engagement may also fail if communities view the effort to involve them as superficial.

(4) Next, score (from 1-3 with, 1 as the best, or n/a) each of your governance actions against your rubric of policy goals. The following is one framework but feel free to make your own:

Does the option:	Option 1: Regulation and creation of standards for early-stage bio-sensor development	Option 2: Pre-sequencing rapid signal regulatory pathways	Option 3: Ethical data access and sharing standards
🦠🛡️Enhance Biosecurity
• By preventing incidents	1	2	2
• By helping respond	2	3	2
🧪Foster Lab Safety
• By preventing incidents	2	2	2
• By helping respond	3	3	2
🌱Protect the environment
• By preventing incidents	1	2	2
• By helping respond	2	1	1
⚖️Other considerations
• Minimizing costs and burdens to stakeholders	1	1	1
• Feasibility	2	2	2
• Not impede research	2	2	2
• Promote constructive applications	1	1	1
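
As a rough cross-check of the matrix above, the scores can be tallied per option. This is only a sketch: it assumes every criterion carries equal weight and that a plain sum is a fair aggregate (lower is better), neither of which the rubric itself specifies.

```python
# Scores copied row by row from the matrix above (1 = best, 3 = worst).
# Equal weighting and simple summation are assumptions made for illustration.
scores = {
    "Option 1": [1, 2, 2, 3, 1, 2, 1, 2, 2, 1],
    "Option 2": [2, 3, 2, 3, 2, 1, 1, 2, 2, 1],
    "Option 3": [2, 2, 2, 2, 2, 1, 1, 2, 2, 1],
}

# Lower totals indicate a better overall score under this assumption.
totals = {option: sum(values) for option, values in scores.items()}

for option, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{option}: {total}")
```

Under these assumptions Options 1 and 3 tie on the lowest total, with Option 2 slightly behind.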

(5) Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties. For this, you can choose one or more relevant audiences for your recommendation, which could range from the very local (e.g. to MIT leadership or Cambridge Mayoral Office) to the national (e.g. to President Biden or the head of a Federal Agency) to the international (e.g. to the United Nations Office of the Secretary-General, or the leadership of a multinational firm or industry consortia). These could also be one of the “actor” groups in your matrix.

Based on the inputs and ranking in the matrix above, I would prioritize the following:

(i) Regulation and creation of standards for early-stage bio-sensor development (ii) Ethical data access and sharing standards with local community engagement

Together, these two actions would address both the technical and the social foundations required for responsible deployment of biosensors. Standards would ensure that biosensors are developed safely and would create incentives to develop lab safety protocols and enforce biosecurity. Local community engagement, training, and capacity building would help build trust, protect rights, and enable effective use of surveillance data across settings.

References

1. Richard M. Jones, Kristin N. Adams, Hassan E. Eldesouky, and David R. Sherman “The evolving biology of Mycobacterium tuberculosis drug resistance.” Frontiers in Cellular and Infection Microbiology 2022.
2. Sebastien Gagneux “Ecology and evolution of Mycobacterium tuberculosis.” Nature Reviews Microbiology 2018.
3. Qingyun Liu, Jianhao Wei, Yawei Li, Mei Wang, Jun Su, et al. “Mycobacterium tuberculosis clinical isolates carry mutational signatures of host immune environments.” Science Advances 2020.
4. Á. Chiner-Oms, L. Sánchez-Busó, J. Corander, S. Gagneux, S. R. Harris, et al. “Genomic determinants of speciation and spread of the Mycobacterium tuberculosis complex.” Science Advances 2019.
5. World Health Organization. Attributes and principles of genomic data-sharing platforms supporting surveillance of pathogens with epidemic and pandemic potential. World Health Organization; 2025.
6. Carter L, Yu MA, Sacks J, Barnadas C, Pereyaslov D, Cognat S, et al. Global genomic surveillance strategy for pathogens with pandemic and epidemic potential 2022–2032. Bulletin of the World Health Organization. 2022 Apr 1;100(04):239–9A.
7. Trump BD, Florin MV, Perkins E, et al. Biosecurity for Synthetic Biology and Emerging Biotechnologies: Critical Challenges for Governance. 2021 Sep 8. In: Trump BD, Florin MV, Perkins E, et al., editors. Emerging Threats of Synthetic Biology and Biotechnology: Addressing Security and Resilience Issues [Internet]. Dordrecht (DE): Springer; 2021. Chapter 1. Available from: https://www.ncbi.nlm.nih.gov/books/NBK584259/ doi: 10.1007/978-94-024-2086-9_

Assignment (Week 2 Lecture Prep)

Homework Questions from Professor Jacobson

Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome. How does biology deal with that discrepancy?

The error rate of polymerase is roughly 1 error per 10⁶ nucleotides; reported error frequencies range from about 1 per 10⁴ to 1 per 10⁶ nucleotides [1]. The human genome has about 3 × 10⁹ base pairs, i.e. around 3 billion nucleotides, which is roughly 3000 times larger than the 10⁶ nucleotides copied per expected error. At that rate, polymerase alone would introduce thousands of errors every time the genome is copied. Biology deals with this discrepancy through error correction: cells combine polymerase proofreading with mismatch repair to reduce errors to just a few per genome per replication [2].
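
The comparison can be made concrete with a short calculation. This sketch only restates the figures quoted in the answer (a 3 × 10⁹ bp genome and one error per 10⁴ to 10⁶ nucleotides); the function name is illustrative.

```python
# Figures taken from the answer above; the helper name is hypothetical.
GENOME_BP = 3_000_000_000  # approximate length of the human genome

def expected_errors(nucleotides_per_error, genome_bp=GENOME_BP):
    """Expected number of uncorrected errors in one pass over the genome."""
    return genome_bp / nucleotides_per_error

for nucleotides_per_error in (10_000, 1_000_000):
    errors = expected_errors(nucleotides_per_error)
    print(f"1 error per {nucleotides_per_error:,} nt -> {errors:,.0f} errors per genome copy")
```

Even at the best quoted rate this leaves thousands of errors per replication, which is why proofreading and mismatch repair are needed on top of base selection.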

How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

Average human protein: 1036 bp of coding sequence. As 1 codon = 3 nucleotides,

∴ Total amino acids = 1036/3 ≈ 345

The 61 sense codons encode 20 amino acids, so each amino acid has about 3 synonymous codons on average. With one codon per amino acid, there are therefore roughly 3³⁴⁵ (on the order of 10¹⁶⁴) different DNA sequences that code for an average human protein.

Although roughly 3³⁴⁵ DNA sequences encode the same protein, only some of them work in practice because of codon preference and bias, repetitive or unstable sequences, and mRNA folding [3].
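
The 3³⁴⁵ figure is an average-case estimate. As a hedged sketch, the exact count for a specific protein can be computed by multiplying the number of synonymous codons of each residue in the standard genetic code; the table below is the standard codon degeneracy, and the short peptide used in the example is just the first five residues of TNF-α.

```python
import math

# Number of codons per amino acid in the standard genetic code (sums to 61).
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def count_codings(protein):
    """Exact number of DNA sequences that encode the given protein."""
    total = 1
    for amino_acid in protein:
        total *= CODONS_PER_AA[amino_acid]
    return total

# Average-case estimate used in the answer: about 3 codons per residue.
average_estimate = 3 ** 345
print(f"3^345 is about 10^{math.log10(average_estimate):.1f}")

# Exact count for a short example peptide.
print(count_codings("MSTES"))
```

Residues such as methionine and tryptophan contribute a factor of 1, while leucine, serine and arginine contribute 6, so the exact count depends on composition.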

References

1. Kunkel TA, Bebenek K. DNA replication fidelity. In: Brenner S, Miller JH, editors. DNA Replication and Human Disease. Bethesda (MD): National Center for Biotechnology Information (US); 2002. Available from: https://www.ncbi.nlm.nih.gov/books/NBK9940/
2. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. Molecular Biology of the Cell. 4th ed. New York: Garland Science; 2002. ISBN: 0-8153-3218-1, 0-8153-4072-9.
3. Lin J, Chen Y, Zhang Y, Lin H, Ouyang Z, et al. Deciphering the role of RNA structure in translation efficiency. BMC Bioinformatics. 2022;23:559.

Homework Questions from Dr. LeProust

What’s the most commonly used method for oligo synthesis currently?

Oligonucleotide synthesis is the chemical process of making short fragments of DNA or RNA with a defined sequence, typically using step‑by‑step addition of nucleotide building blocks on a solid support [1]. For enzyme-free synthesis, the process involves sequentially adding nucleotide units to a growing chain, typically using solid- or liquid-phase synthesis [2]. The most common method is solid-phase phosphoramidite synthesis. Because it is automated and reliably produces high-quality short sequences, it is widely used by biotech companies around the world [3–4].

Why is it difficult to make oligos longer than 200nt via direct synthesis?

As length is increased, chemical synthesis becomes less efficient. As a result, there is a loss in product yield, greater rate of error accumulation (higher substitution or deletion rates), and an increased difficulty in purifying the final product due to the introduction of truncated and mis-incorporated oligos [5].

Why can’t you make a 2000bp gene via direct oligo synthesis?

Because oligo synthesis adds one nucleotide at a time, errors (substitutions and deletions) accumulate as length increases, and the truncated or defective sequences become increasingly difficult to purify away [6]. Direct synthesis of a 2000 bp gene is therefore not practical, despite advances in surface-based methods and capture-based purification [7].
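
The length limit follows from compounding per-step losses. The sketch below assumes a fixed stepwise coupling efficiency of 99%, which is an illustrative value and not a figure from the cited references; real efficiencies vary by chemistry and platform.

```python
# Assumed value for illustration only; not taken from the references above.
ASSUMED_COUPLING_EFFICIENCY = 0.99

def full_length_fraction(length_nt, coupling_efficiency=ASSUMED_COUPLING_EFFICIENCY):
    """Fraction of chains that reach full length.

    A chain of n nucleotides needs n - 1 coupling steps after the first,
    support-bound nucleotide, and each step succeeds independently.
    """
    return coupling_efficiency ** (length_nt - 1)

for length in (20, 200, 2000):
    print(f"{length} nt: {full_length_fraction(length):.3g} of chains are full length")
```

Under this assumption a 200 nt oligo already loses most of its chains to truncation, and a 2000 nt product is essentially absent from the mixture, which is why genes are assembled from shorter oligos.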

References

1. Beaucage SL, Caruthers MH. Deoxynucleoside phosphoramidites—A new class of key intermediates for deoxypolynucleotide synthesis. Tetrahedron Letters. 1981;22(20):1859–62. doi:10.1016/S0040-4039(01)90461-7.
2. Bachem. What is oligonucleotide synthesis & how does it work? [Internet]. Bubendorf: Bachem; 2024 Aug 26 [cited 2026 Feb 10]. Available from: https://www.bachem.com/articles/oligonucleotides/how-does-oligonucleotide-synthesis-work/
3. ScienceDirect. Oligonucleotide synthesis [Internet]. Amsterdam: Elsevier; 2024 [cited 2026 Feb 10]. Available from: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/oligonucleotidesynthesis
4. ATDBio. Solid-phase oligonucleotide synthesis: The Phosphoramidite method [Internet]. Southampton: ATDBio; 2024 [cited 2026 Feb 10]. Available from: https://atdbio.com/nucleic-acids-book/Solid-phase-oligonucleotide-synthesis#The-Phosphoramidite-method
5. Kosuri S, Church GM. Large-scale de novo DNA synthesis: technologies and applications. Nat Methods. 2014;11:499–507. doi:10.1038/nmeth.2918.
6. Pichon M, Hollenstein M. Controlled enzymatic synthesis of oligonucleotides. Commun Chem. 2024;7:138. doi:10.1038/s42004-024-01216-0.
7. Yin Y, Arneson R, Yuan Y, Fang S. Long oligos: direct chemical synthesis of genes with up to 1728 nucleotides. Chem Sci. 2025;16:1966–73. doi:10.1039/D4SC06958G.

Homework Question from George Church

Choose ONE of the following three questions to answer; and please cite AI prompts or paper citations used, if any.

1. [Using Google & Prof. Church’s slide #4] What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

Essential amino acids are the amino acids that the animal body cannot synthesize and must therefore obtain from the diet. The essential amino acids in animals are: isoleucine, leucine, lysine, threonine, tryptophan, methionine, histidine, valine, and phenylalanine. Arginine is commonly counted as the tenth, because many animals cannot make enough of it, particularly during growth. In addition, cysteine and tyrosine are often described as conditionally essential because they cannot be synthesized de novo in animals and are instead produced from methionine and phenylalanine, respectively [1].

Because lysine is already an essential amino acid in all animals, the “Lysine Contingency” would not work as a real control mechanism: no animal can synthesize lysine to begin with, so an engineered lysine dependency adds nothing new. Any animal could easily obtain lysine from food such as meat, beans, or grains.

References

1. Hou Y, Wu G. Nutritionally essential amino acids. Adv Nutr. 2018;9(6):849–851. doi:10.1093/advances/nmy054

Week 2 HW: Read, write & edit

Homework Week 2

Part 1: Benchling & In-silico Gel Art

Import the Lambda DNA. Simulate Restriction Enzyme Digestion with the following Enzymes:

EcoRI
HindIII
BamHI
KpnI
EcoRV
SacI
SalI

Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks.

You might find Ronan’s website a helpful tool for quickly iterating on designs!
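
For quick iteration outside Benchling, a digest can also be approximated in a few lines. This is a minimal sketch that treats the DNA as linear and searches only the top strand, which is sufficient here because all seven recognition sites are palindromic; the `digest` helper and the toy sequence are illustrative, and the real Lambda sequence would need to be loaded separately.

```python
# Recognition site and top-strand cut offset for each enzyme in the assignment.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "KpnI": ("GGTACC", 5),     # GGTAC^C
    "EcoRV": ("GATATC", 3),    # GAT^ATC
    "SacI": ("GAGCTC", 5),     # GAGCT^C
    "SalI": ("GTCGAC", 1),     # G^TCGAC
}

def digest(sequence, enzyme_names):
    """Return fragment lengths (largest first) for a linear DNA molecule."""
    sequence = sequence.upper()
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        start = sequence.find(site)
        while start != -1:
            cuts.add(start + offset)
            start = sequence.find(site, start + 1)
    bounds = [0] + sorted(cuts) + [len(sequence)]
    return sorted((end - begin for begin, end in zip(bounds, bounds[1:])), reverse=True)

# Toy sequence with one EcoRI site and one BamHI site (not Lambda DNA).
toy = "AAAA" + "GAATTC" + "CCCCCCCC" + "GGATCC" + "TT"
print(digest(toy, ["EcoRI", "BamHI"]))
```

Each lane of a gel design corresponds to one call to `digest` with a different enzyme combination; the returned lengths are the band positions to compare against the ladder.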

Playing around with the digest enzymes

Getting an “S”, well…sort of:

Part 3: DNA Design Challenge

3.1. Choose your protein.: In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose. [Example from our group homework, you may notice the particular format — The example below came from UniProt]

I have chosen Tumor Necrosis Factor-Alpha (TNF-α).

Why:

I chose this protein because of my interest in dermatology and chronic diseases. TNF-α is a key inflammatory cytokine in many skin conditions and in insulin resistance. I am particularly interested in psoriasis, especially plaque psoriasis, and its relation to insulin resistance and diabetes [1]. My Mum has suffered from psoriasis for the last couple of years and has recently developed pre-diabetes.

Protein Sequence:

NP_000585.2 tumor necrosis factor [Homo sapiens]
MSTESMIRDVELAEEALPKKTGGPQGSRRCLFLSLFSFLIVAGATTLFCLLHFGVIGPQREEFPRDLSLI
SPLAQAVRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLVVPSEGLYLIYSQVLF
KGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSA
EINRPDYLDFAESGQVYFGIIAL

References

1. Moller DE. Potential role of TNF-α in the pathogenesis of insulin resistance and type 2 diabetes. Trends Endocrinol Metab. 2000 Aug;11(6):212-217. doi:10.1016/S1043-2760(00)00272-1.

3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.

To obtain the nucleotide sequence encoding TNF-α, I retrieved the validated human mRNA record (NCBI RefSeq: NM_000594.4) from NCBI. From this record, I extracted the coding sequence (CDS), which corresponds to the protein sequence NP_000585.2. Only the CDS was used for downstream codon optimization. See below:

ATGAGCACTGAAAGCATGATCCGGGACGTGGAGCTGGCCGAGGAGGCGCTCCCCAAGAAGACAGGGGGGCCCCAGGGCTCCAGGCGGTGCTTGTTCCTCAGCCTCTTCTCCTTCCTGATCGTGGCAGGCGCCACCACGCTCTTCTGCCTGCTGCACTTTGGAGTGATCGGCCCCAGAGGGAAGAGTTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAGGCAGTCAGATCATCTTCTCGAACCCCGAGTGACAAGCCTGTAGCCCATGTTGTAGCAAACCCTCAAGCTGAGGGGCAGCTCCAGTGGCTGAACCGCCGGGCCAATGCCCTCCTGGCCAATGGCGTGGAGCTGAGAGATAACCAGCTGGTGGTGCCATCAGAGGGCCTGTACCTCATCTACTCCCAGGTCCTCTTCAAGGGCCAAGGCTGCCCCTCCACCCATGTGCTCCTCACCCACACCATCAGCCGCATCGCCGTCTCCTACCAGACCAAGGTCAACCTCCTCTCTGCCATCAAGAGCCCCTGCCAGAGGGAGACCCCAGAGGGGGCTGAGGCCAAGCCCTGGTATGAGCCCATCTATCTGGGAGGGGTCTTCCAGCTGGAGAAGGGTGACCGACTCAGCGCTGAGATCAATCGGCCCGACTATCTCGACTTTGCCGAGTCTGGGCAGGTCTACTTTGGGATCATTGCCCTGTGA
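
Before optimizing, it is worth checking that a pasted CDS is in frame by translating it and comparing the result with the protein record. The sketch below uses the standard genetic code; the `translate` helper is illustrative, and only the first 24 nucleotides of the CDS are used in the example. A single dropped or extra base anywhere in the pasted sequence would show up as a mismatch from that position onwards.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# First eight codons of the CDS above, compared with the start of NP_000585.2.
cds_start = "ATGAGCACTGAAAGCATGATCCGG"
print(translate(cds_start))
print(translate(cds_start) == "MSTESMIR")
```

Running the same comparison on the full CDS against the full protein sequence is a cheap way to catch copy-and-paste errors before ordering or optimizing.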

3.3. Codon optimization. Once a nucleotide sequence of your protein is determined, you need to codon optimize your sequence. You may, once again, utilize google for a “codon optimization tool”. In your own words, describe why you need to optimize codon usage. Which organism have you chosen to optimize the codon sequence for and why?

For codon optimization, I chose the online codon optimizing tool:

https://en.vectorbuilder.com/tool/codon-optimization.html

From my input:

I got: Pasted Sequence: GC=59.84%, CAI=0.49

From my output:

Improved DNA[1]: GC=59.97%, CAI=0.92

A CAI (Codon Adaptation Index) of 0.92, up from 0.49, indicates strong expected expression in the target organism.

After optimization, the GC content remained near 60%, within a suitable range for Escherichia coli, supporting stable and efficient gene synthesis.
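
The GC figure reported by the tool can be reproduced independently. This is a minimal sketch; the `gc_content` helper is illustrative, and the CAI is not recomputed here because it requires a reference codon usage table for the target organism.

```python
def gc_content(sequence):
    """Return the percentage of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    gc_count = sum(1 for base in sequence if base in "GC")
    return 100.0 * gc_count / len(sequence)

# Toy examples; paste the input or optimized sequence to check the tool's figure.
print(gc_content("ATGC"))
print(gc_content("GGCCAT"))
```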

I selected Escherichia coli strain K-12 MG1655 as the target organism for codon optimization because it is a well-studied laboratory strain with a completely sequenced and annotated genome [1–2].

Codon-optimized TNF-α for improved expression in Escherichia coli

CTGAGCCCGTTCAACAACCCGCTGCTGCGCCCGTTTCTGATTCTGTATGAACATTAAAAACATGATCCGGGCCGTGGCGCAGGTCGCGGCGGCGCGCCGCAGGAAGATCGTGGCGCACCGGGCTTACAGGCCGTGCTGGTTCCGCAGCCGCTGCTGCTGCCGGATCGCGGCCGTCGTCACCATGCCCTGCTGCCGGCGGCCCTGTGGTCGGATCGTCCGCAGCGTGAAGAATTTCCGCGCGATCTGAGCCTGATTAGCCCGCTGGCGCAGGCCGTGCGTAGCAGCAGCCGCACCCCGTCAGATAAACCGGTGGCGCACGTGGTGGCAAATCCGCAGGCCGAAGGTCAGCTGCAGTGGCTGAATCGTCGCGCGAATGCCCTGTTAGCCAATGGTGTGGAACTGCGCGATAATCAGCTGGTGGTGCCGTCAGAAGGTCTGTACCTGATCTATTCGCAGGTGCTGTTTAAAGGCCAGGGCTGTCCGAGCACCCATGTGCTGCTGACCCACACCATTAGCCGCATTGCGGTGAGCTACCAGACCAAAGTGAACCTGCTTTCTGCGATTAAAAGCCCGTGCCAGCGTGAAACCCCGGAAGGCGCGGAAGCGAAACCGTGGTACGAACCGATTTATCTGGGCGGCGTGTTCCAGCTGGAAAAAGGCGATCGTCTGAGCGCGGAAATTAATCGCCCGGATTATCTGGATTTTGCGGAAAGCGGTCAGGTGTATTTCGGCATTATTGCCTTGTAA

References

1. Lukjancenko O, Wassenaar TM, Ussery DW. Comparison of 61 sequenced Escherichia coli genomes. Microb Ecol. 2010 Nov;60(4):708-20. doi:10.1007/s00248-010-9717-3. PMID:20623278; PMCID:PMC2974192.
2. Yannai A, Katz S, Hershberg R. The codon usage of lowly expressed genes is subject to natural selection. Genome Biol Evol. 2018 May;10(5):1237–46. doi:10.1093/gbe/evy084.

3.4. You have a sequence! Now what?

What technologies could be used to produce this protein from your DNA? Describe in your words the DNA sequence can be transcribed and translated into your protein. You may describe either cell-dependent or cell-free methods, or both.

After codon optimization, the TNF-α DNA sequence can be used to produce protein through either cell-dependent or cell-free systems.

For cell-dependent systems, the DNA is first cloned into an expression vector, which is then introduced into live host cells such as E. coli or eukaryotic cells. The cellular machinery transcribes the DNA into mRNA and translates the mRNA into TNF‑α protein during growth and metabolism, as in standard biotechnology production processes [1–2].

For cell-free systems, crude cell extracts provide all the machinery for transcription, translation, protein folding, and energy metabolism [3]. Therefore, when the codon optimized DNA is added, the TNF‑α protein will be produced in-vitro and under controlled conditions.

Both these methods rely on the flow of information from DNA to mRNA to protein; the Central Dogma of Molecular Biology.

References

1. Lukjancenko O, Wassenaar TM, Ussery DW. Comparison of 61 sequenced Escherichia coli genomes. Microb Ecol. 2010 Nov;60(4):708-20. doi:10.1007/s00248-010-9717-3. Epub 2010 Jul 11. PMID:20623278; PMCID:PMC2974192.
2. Swartz JR. Advances in Escherichia coli production of therapeutic proteins. Curr Opin Biotechnol. 2001 Oct;12(5):195–201. doi:10.1016/s0958-1669(00)00198-5. PMID:11513436.
3. Carlson ED, Gan R, Hodgman CE, Jewett MC. Cell-free protein synthesis: Applications come of age. Biotechnol Adv. 2012 Sep-Oct;30(5):1185-94. doi:10.1016/j.biotechadv.2011.09.016. PMID:22001003; PMCID:PMC3359644.

3.5. [Optional] How does it work in nature/biological systems?

1. Describe how a single gene codes for multiple proteins at the transcriptional level. 2. Try aligning the DNA sequence, the transcribed RNA, and also the resulting translated Protein!!! See example below. [Example shows the biomolecular flow in central dogma from DNA to RNA to Protein] Special note that all “T” were transcribed into “U” and that the 3-nt codon represents 1-AA.

Part 4: Prepare a Twist DNA Synthesis Order

4.2. Build Your DNA Insert Sequence

Link to the sequence (first attempt):

https://benchling.com/s/seq-92QKTmxOZ4NOBZloFYXH?m=slm-ih8RIVqVkxJpGYbdm50f

Link to corrected sequence:

https://benchling.com/s/seq-AKpYnuHnRmdf5XnJxSv8?m=slm-sqc6y4bFyGTTvcXYx3Q9

4.3-4.5. Building Expression Cassette and Plasmid

Plasmid with Expression Cassette

https://benchling.com/s/seq-dx10o3kwSJPyNLgmJDGo?m=slm-V5wHDO0G8ZxGTwWVp2A7

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank).

I would want to sequence Mycobacterium tuberculosis DNA. I would focus on virulence‑associated loci such as phoR, mymA and the mce1 operon, on resistance-associated loci such as rpoB, katG, the inhA promoter, gyrA and embB, and on lineage-defining SNPs.

To integrate with surveillance, I would potentially try to store drug resistance and mutation outputs from my detection bio-tool into a DNA-based archive. This could help build a long-term genomic repository.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions: 1. Is your method first-, second- or third-generation or other? How so? 2. What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps. 3. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)? 4. What is the output of your chosen sequencing technology?

To sequence drug-resistant M. tuberculosis DNA, I would use short-read sequencing (e.g. Illumina), a second-generation, sequencing-by-synthesis technology that is well suited to identifying the key resistance-driving genes for profiling and analysis [1]. The workflow involves DNA extraction, fragmentation, adapter ligation, cluster amplification, and sequencing by synthesis, with base-calling software decoding the sequence from fluorescent signals. The output includes high-quality short reads, aligned sequences, and variant calls for resistance and lineage analysis. In contrast, long-read sequencing (e.g. Oxford Nanopore) enables rapid detection, which is useful in high-burden regions, but has slightly lower accuracy and may require deeper coverage.

References

The CRyPTIC Consortium and the 100,000 Genomes Project. Prediction of Susceptibility to First-Line Tuberculosis Drugs by DNA Sequencing. N Engl J Med. 2018;379:1403–1415. doi:10.1056/NEJMoa1800474.

5.2 DNA Write

(i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :)

I would like to design a genetic circuit that could be integrated into a microbial chassis or a cell-free system, enabling it to detect molecular signatures of multidrug-resistant and extensively drug-resistant tuberculosis (MDR/XDR-TB) and to activate a fluorescent reporter when they are present in a sample. Comparable designs appear in research on biosensors that detect heavy metals in water through recombinase-based logic gates [1]. CRISPR-based detection systems can be programmed with guides targeting lineage-specific SNPs (e.g., Beijing/East Asian, Indo-American) [2] alongside resistance mutations, so that the circuit only activates the fluorescent reporter when both types of signature are present. Potentially, CRISPR-Cas12/13 coupled with allele-specific amplification can discriminate single-base changes for lineage and resistance detection with high specificity. There is also a possibility of integrating all of this into a microfluidic biosensor, enabling automated, low-volume, rapid, and multiplexed detection suitable for environmental and point-of-care surveillance [3].
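The AND-gate behaviour described above can be sketched as a toy truth table. This is only an illustration: the function name `reporter_on` and its boolean inputs are hypothetical stand-ins for the outcomes of the two CRISPR detection reactions, not part of any real circuit design.

```python
from itertools import product


def reporter_on(lineage_snp_detected: bool, resistance_mutation_detected: bool) -> bool:
    """Toy AND gate: the reporter fires only if both signatures are present."""
    return lineage_snp_detected and resistance_mutation_detected


# Print the full truth table for the two (hypothetical) detection inputs.
for lineage, resistance in product([False, True], repeat=2):
    state = reporter_on(lineage, resistance)
    print(f"lineage={lineage!s:5} resistance={resistance!s:5} -> reporter={state}")
```

Only the last row (both inputs true) switches the reporter on, which is the behaviour the circuit is meant to reproduce biochemically.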

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions:

Is your method first-, second- or third-generation or other? How so?
What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.
What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?
What is the output of your chosen sequencing technology?

For this synthesis, I would use commercial DNA synthesis platforms and include CRISPR guide sequences, promoters, and fluorescent reporter genes. These technologies allow quick prototyping and design flexibility, and automated synthesizers can accurately produce sequences up to multiple kilobases long.

Essential steps would include: specifying the full nucleotide sequence, including the CRISPR guides, promoters and reporter genes; synthesizing short oligonucleotides; and assembling them into the full construct through PCR or ligation. The assembled construct would then need to be tested and validated to ensure the circuit functions properly.

Limitations include time, error correction, and scalability. Large constructs may take longer to produce because of the complexity associated with multiple variants.
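The oligo assembly step can be illustrated by tiling a construct with overlapping oligos. This is a simplified sketch under stated assumptions: the 60-nt oligo length and 20-nt overlap are arbitrary, the helper `split_into_oligos` is hypothetical, and only one strand is tiled (real assembly designs usually alternate strands and balance melting temperatures).

```python
def split_into_oligos(sequence: str, oligo_len: int = 60, overlap: int = 20) -> list[str]:
    """Tile a sequence with fixed-length oligos that overlap their neighbours."""
    if overlap >= oligo_len:
        raise ValueError("overlap must be shorter than the oligo length")
    step = oligo_len - overlap
    oligos = []
    for start in range(0, len(sequence), step):
        oligos.append(sequence[start:start + oligo_len])
        if start + oligo_len >= len(sequence):
            break  # the last oligo already reaches the end of the construct
    return oligos


# Hypothetical 150-nt placeholder, not a real construct.
construct = "ATGC" * 37 + "AT"
for index, oligo in enumerate(split_into_oligos(construct)):
    print(index, len(oligo), oligo[:10] + "...")
```

Joining the first oligo with the non-overlapping tail of each later oligo gives back the original sequence, which is a quick sanity check before ordering.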

References

Mathur S, Singh D, Ranjan R. Genetic circuits in microbial biosensors for heavy metal detection in soil and water. Biochem Biophys Res Commun. 2023 Apr 16;652:131–137. doi:10.1016/j.bbrc.2023.02.031.
Napier, G., Campino, S., Merid, Y. et al. Robust barcoding and identification of Mycobacterium tuberculosis lineages for epidemiological and clinical studies. Genome Med 12, 114 (2020). https://doi.org/10.1186/s13073-020-00817-3
Didarian R, Azar MT. Microfluidic biosensors: revolutionizing detection in DNA analysis, cellular analysis, and pathogen detection. Biomed Microdevices. 2025;27:10. doi:10.1007/s10544-025-00741-6.

5.3 DNA Edit

(i) What DNA would you want to edit and why? In class, George shared a variety of ways to edit the genes and genomes of humans and other organisms. Such DNA editing technologies have profound implications for human health, development, and even human longevity and human augmentation. DNA editing is also already commonly leveraged for flora and fauna, for example in nature conservation efforts, (animal/plant restoration, de-extinction), or in agriculture (e.g. plant breeding, nitrogen fixation). What kinds of edits might you want to make to DNA (e.g., human genomes and beyond) and why?

For editing, I would use CRISPR-Cas systems to introduce lineage-specific SNPs and resistance mutations into safe mycobacterial strains or cell-free systems [1]. This would allow me to test the genetic circuit, validate the CRISPR guides, and generate controls for MDR-TB detection.

(ii) What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:

How does your technology of choice edit DNA? What are the essential steps?
What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?
What are the limitations of your editing methods (if any) in terms of efficiency or precision?

Essential steps would include: designing guide RNAs to target SNPs/loci related to drug resistance; integrating the editing components into cells or a cell-free platform; and functional testing to ensure the edited sequences properly activate the fluorescent reporter protein within the circuit.

Preparation would require designing the guide RNAs and providing either a cell-free system or microbial framework as the host.

Limitations include: possible off-target edits; increased complexity when introducing multiple edits or larger constructs, which can affect throughput and precision.
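Guide design usually starts by locating candidate target sites next to a PAM. The sketch below assumes an SpCas9-style NGG PAM and a 20-nt protospacer, scans the forward strand only, and uses a made-up sequence; other Cas enzymes use different PAMs, and `find_cas9_guides` is a hypothetical helper rather than an established tool.

```python
import re


def find_cas9_guides(sequence: str, guide_len: int = 20) -> list[tuple[int, str, str]]:
    """Return (start, protospacer, PAM) for every NGG PAM on the forward strand."""
    sequence = sequence.upper()
    guides = []
    # A lookahead is used so that overlapping PAM sites are all reported.
    for match in re.finditer(r"(?=([ACGT]GG))", sequence):
        pam_start = match.start()
        if pam_start >= guide_len:  # need room for a full-length protospacer
            protospacer = sequence[pam_start - guide_len:pam_start]
            guides.append((pam_start - guide_len, protospacer, match.group(1)))
    return guides


# Hypothetical toy sequence, not a real M. tuberculosis locus.
toy = "ATGCGTACCGATTACGCTAGCTAGGCTAACGTTAGCGG"
for start, protospacer, pam in find_cas9_guides(toy):
    print(start, protospacer, pam)
```

A real design would also scan the reverse strand and score each candidate for off-target matches, which relates to the off-target limitation noted above.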

References

Molla KA, Yang Y. CRISPR/Cas mediated base editing: technical considerations and practical applications. Trends Biotechnol. 2019 Oct;37(10):1121–1142. doi:10.1016/j.tibtech.2019.03.008. Review of CRISPR base editing systems and how they introduce precise nucleotide changes without double strand breaks.

Week 3 HW: Lab automation

Homework Week 3

Assignment: Python Script for Opentrons Artwork

One of the great parts about having an automated robot is being able to precisely mix, deposit, and run reactions without much intervention, and design and deploy experiments remotely. For this week, we’d like for you to do the following:

1. Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.

I initially tried to draw a silhouette of the Sonic 3 & Knuckles logo (a classic Sonic game). With some tweaks the design looked promising, but in the end I went with the Batman Beyond logo, as it was simple and only needed one colour (given the limitations of our node). The final code I used (written with the help of Gemini) is below:

    ### YOUR CODE HERE to create your design

    ##############################################################################
    # Simple Design: Batman Beyond Logo
    ##############################################################################

    spacing = 1.7
    design_points = []

    # We use absolute values abs(i) to ensure perfect left-right symmetry
    for i in range(-18, 19):      # Horizontal span (36 steps x 1.7 mm = ~61 mm total)
        for j in range(-15, 12):  # Vertical span
            x = abs(i)
            # 1. Top Wing Edge (slopes up to the points)
            if j < (0.5 * x) + 3:
                # 2. Bottom "V" Shape
                if j > (1.2 * x) - 16:
                    # 3. Inner Wing Cutouts (the 'U' shapes next to the head)
                    # If we are not in the cutout zone, add the point
                    is_cutout = (2 < x < 7) and (j > -2)
                    # 4. The Head (center spike)
                    is_head = (x <= 1) and (j < 5)
                    if not is_cutout or is_head:
                        design_points.append((i * spacing, j * spacing, 'Red'))

    # EXECUTION
    points_for_color = [p for p in design_points if p[2] == 'Red']

    if points_for_color:
        pipette_20ul.pick_up_tip()
        pipette_20ul.aspirate(15, location_of_color('Red'))

        for x, y, c in points_for_color:
            # Refill when there is not enough dye left for one more droplet
            if pipette_20ul.current_volume < 0.5:
                pipette_20ul.aspirate(15, location_of_color('Red'))

            target = center_location.move(types.Point(x=x, y=y))
            dispense_and_detach(pipette_20ul, 0.5, target)

        pipette_20ul.drop_tip()

    ##############################################################################
    # END OF CODE
    ##############################################################################
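Before running a design like this on the robot, it can help to estimate how many droplets and how much dye it needs. The sketch below recreates the same grid test without any Opentrons calls; `build_design_points` is a hypothetical helper, and the head test is left out because it never overlaps the cutout zone (x <= 1 versus 2 < x), so it does not change which points are kept.

```python
import math


def build_design_points(spacing: float = 1.7) -> list[tuple[float, float]]:
    """Recreate the grid test from the protocol above, without any robot calls."""
    points = []
    for i in range(-18, 19):
        for j in range(-15, 12):
            x = abs(i)  # mirror the shape around the vertical axis
            inside_outline = (j < 0.5 * x + 3) and (j > 1.2 * x - 16)
            is_cutout = (2 < x < 7) and (j > -2)
            if inside_outline and not is_cutout:
                points.append((i * spacing, j * spacing))
    return points


points = build_design_points()
drop_volume_ul = 0.5      # volume dispensed per point in the protocol
aspirate_volume_ul = 15   # volume drawn up per refill

total_volume_ul = len(points) * drop_volume_ul
refills = math.ceil(total_volume_ul / aspirate_volume_ul)
print(f"{len(points)} points, {total_volume_ul:.1f} uL of dye, {refills} aspirations")
```

The point count also gives a rough sense of run time, since each point is a separate dispense move.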

The output:

With some manual tweaks, what I initially wanted to do:

However, the end product was pretty atrocious and would take way too much time to fix, given I’m doing this at the last minute. So atrocious that I won’t paste it.

Post-Lab Questions

2. Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.

While your description/project idea doesn’t need to be set in stone, we would like to see core details of what you would automate. This is due at the start of lecture and does not need to be tested on the Opentrons yet.

(Answer 1)

Paper: An Automated Versatile Diagnostic Workflow for Infectious Disease Detection in Low-Resource Settings

Miren Urrutia Iturritza, Phuthumani Mlotshwa, Jesper Gantelius, Tobias Alfvén, Edmund Loh, Jens Karlsson, Chris Hadjineophytou, Krzysztof Langer, Konstantinos Mitsakakis, Aman Russom, Håkan N. Jönsson, Giulia Gaudenzi

https://doi.org/10.3390/mi15060708

This paper describes how researchers built an automated diagnostic workflow for detecting infectious diseases in low-resource settings [1]. Specifically, they tested for Neisseria meningitidis, a gram-negative bacterium that causes serious meningitis and blood infections in humans.

For their workflow, they used an Opentrons OT-One-S Hood, an open-source liquid-handling robot that can be bought at relatively low cost. The scripts for the workflow were created with custom software developed at the SciLifeLab Nanobiotechnology division [2].

Materials and reagents were arranged on the robot deck, including racks and tubes with primers, buffers, and enzymes, the MiniPCR® mini8 thermal cycler, magnetic bead racks, waste containers, and microarray holders, to analyze Neisseria meningitidis DNA in both clinical and spiked samples. “Clinical” samples refer to specimens collected from individuals, whereas “spiked” samples are lab-prepared samples to which a known amount of Neisseria meningitidis DNA has been added.

The robot then performs all the necessary pipetting steps: DNA amplification of the ctrA gene (a conserved, species-specific gene essential for capsule formation, which makes it a reliable marker [3]), enzymatic digestion, and deposition onto paper-based microarrays. The only manual steps were opening and closing the tube lids before and after the DNA amplification and exonuclease digestion steps on the MiniPCR® mini8 thermal cycler [1].

The study showed that automated liquid handling can detect Neisseria meningitidis in low-resource settings, though accuracy and reproducibility were not fully validated.

References

Urrutia Iturritza M, Mlotshwa P, Gantelius J, Alfvén T, Loh E, Karlsson J, Hadjineophytou C, Langer K, Mitsakakis K, Russom A, et al. An automated versatile diagnostic workflow for infectious disease detection in low-resource settings. Micromachines. 2024;15(6):708. doi:10.3390/mi15060708.
Langer K, Joensson HN. Rapid production and recovery of cell spheroids by automated droplet microfluidics. SLAS Technol. 2020;25:111–122.
Rivas L, Reuterswärd P, Rasti R, Herrmann B, Mårtensson A, Alfvén T, Gantelius J, Andersson-Svahn H. A vertical flow paper-microarray assay with isothermal DNA amplification for detection of Neisseria meningitidis. Talanta. 2018;183:192–200.

(Answer 2)

For my project, I plan to use automation tools to develop and test a CRISPR-based biosensor capable of detecting multi-drug-resistant tuberculosis (MDR-TB) signatures. This workflow would involve high-throughput liquid handling and cell-free protein synthesis. Possible steps would include:

(i) Module setup: Reagents, tip racks, thermal cyclers, magnetic bead racks, and microarray holders would be arranged on an Opentrons OT-2 deck [1], supplemented by temperature modules for incubation and heater-shaker modules for mixing and precise reaction control.

(ii) Automated reaction setup: The robot will pipette cell-free lysate, DNA templates, CRISPR guides, and cofactors into 96- or 384-well plates. Multiple combinations of lineage-specific SNP guides and resistance mutation guides will then be tested to evaluate ‘AND-gate logic’.

(iii) Incubation: Plates will be moved to external devices such as a miniPCR thermal cycler for amplification or a plate reader for measurement. Python scripts will be used to control timing, mixing, and incubation periods.

(iv) Signal detection and analysis: Fluorescent outputs will be measured using plate readers such as the Spark or PHERAstar FSX for high-throughput plate analysis [1]. A change in fluorescence would indicate successful target detection and amplification.

(v) Microfluidic integration (if possible): I will look to integrate 3D-printed holders for small microfluidic chips. These can serve as small test cartridges for running multiple tests at once while minimizing manual handling and contamination risk in low-resource settings.
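The reaction setup in step (ii) could start from a plate map that assigns every guide combination to a well. The sketch below is a planning aid only and makes no robot calls; the guide names, the "no_guide" controls, and the three replicates are all hypothetical placeholders.

```python
from itertools import product
from string import ascii_uppercase

# Hypothetical guide names, used only to illustrate the layout.
lineage_guides = ["lineage_A", "lineage_B", "no_guide"]
resistance_guides = ["rpoB", "katG", "inhA_promoter", "no_guide"]
replicates = 3


def well_names(rows: int = 8, cols: int = 12) -> list[str]:
    """Well names of a 96-well plate in row-major order: A1, A2, ..., H12."""
    return [f"{ascii_uppercase[r]}{c + 1}" for r in range(rows) for c in range(cols)]


# Every lineage/resistance guide pair, repeated for each replicate.
conditions = [
    (lineage, resistance, rep)
    for lineage, resistance in product(lineage_guides, resistance_guides)
    for rep in range(1, replicates + 1)
]
wells = well_names()
if len(conditions) > len(wells):
    raise ValueError("too many conditions for one 96-well plate")

layout = dict(zip(wells, conditions))
for well, (lineage, resistance, rep) in layout.items():
    print(well, lineage, resistance, f"rep{rep}")
```

A mapping like this can later be looped over inside an Opentrons protocol, so that the pipetting code and the analysis code share the same well assignments.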

References

[1] Course Recitation Slides. Lab Automation Overview. Course presentation [cited 2026 Feb 23]. Available from: https://docs.google.com/presentation/d/e/2PACX-1vQc3zo7Z0b6HK7YeC56p_n2RbHNjUHh1HI66DH0cHbFk0db1HlbF7gILE__NCvhUiYMjIGSOHwHPv2_/pub?start=false&loop=false&delayms=3000#slide=id.g2b9b763dcde_1_131

Project Ideas

Go to specific slide

Week 4 HW: Protein Design Part I

Part A. Conceptual Questions

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

Assuming the meat is a red meat like beef, there would be approximately 20-25g of protein per 100g of meat [1, 2].

So, taking the upper end of that range for 500g:

500g x 0.25 = 125g

Given that an average amino acid is ~100 Daltons and 1 Dalton ≈ 1 g/mol, the average molar mass of an amino acid is:

100 Daltons ≈ 100 g/mol

Converting grams to moles:

Moles = mass/molar mass = 125 g / 100 g/mol = 1.25 moles of amino acids

Converting moles to molecules using Avogadro’s constant [3]:

1.25 mol x 6.02214076 x 10²³ molecules/mol ≈ 7.5 x 10²³ amino acid molecules
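The arithmetic can be checked with a short script. This is a minimal sketch that assumes an average molar mass of 100 g/mol per amino acid (from the "~100 Daltons" in the question) and evaluates both ends of the 20-25% protein range; the function name is hypothetical.

```python
AVOGADRO = 6.02214076e23    # molecules per mole
AVG_AA_MOLAR_MASS = 100.0   # g/mol, from "~100 Daltons" in the question


def amino_acid_molecules(meat_g: float, protein_fraction: float) -> float:
    """Estimate the number of amino acid molecules in a piece of meat."""
    protein_g = meat_g * protein_fraction
    moles = protein_g / AVG_AA_MOLAR_MASS
    return moles * AVOGADRO


for fraction in (0.20, 0.25):
    n = amino_acid_molecules(500, fraction)
    print(f"{fraction:.0%} protein: {n:.2e} amino acid molecules")
```

For 500 g of meat this gives roughly 6.0 x 10²³ molecules at 20% protein and 7.5 x 10²³ at 25%.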

University Hospitals Sussex NHS Foundation Trust. Protein fact sheet [Internet]. West Sussex (UK): University Hospitals Sussex NHS Foundation Trust; [cited 2026 Mar 1]. Available from: https://www.uhsussex.nhs.uk/resources/protein-fact-sheet/
Nuffield Health. Best high protein foods [Internet]. Epsom (UK): Nuffield Health; [cited 2026 Mar 1]. Available from: https://www.nuffieldhealth.com/article/best-high-protein-foods
Metric System. Avogadro constant [Internet]. 2024 [cited 2026 Mar 1]. Available from: https://metricsystem.net/si/defining-constants/avogadro-constant/

2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

When we digest meat or fish, we break their proteins down into their basic constituents, the amino acids. Our ribosomes then use these amino acids, following the instructions encoded in our own DNA, to build human proteins rather than those of a cow or fish.

3. Why are there only 20 natural amino acids?

The 20 standard amino acids were most likely selected by evolutionary pressures for their contributions to protein folding, catalysis, and molecular recognition. They were probably adopted from those available through pre-biotic chemistry and early metabolism [1]. Once incorporated into the genetic code, the set became fixed, since further additions would likely have disrupted existing proteins and reduced survival.

Exceptions include pyrrolysine and selenocysteine, which are naturally occurring amino acids incorporated into proteins via specialized mechanisms: pyrrolysine is encoded by the UAG stop codon in certain archaea and bacteria, using a dedicated tRNA and biosynthetic enzymes, and selenocysteine is inserted at UGA codons with the help of a Selenocysteine Insertion Sequence (SECIS) in the mRNA.

Doig AJ. Frozen, but no accident – why the 20 standard amino acids were selected. FEBS Lett. 2016 Dec 7;590(21):3977–3985. doi:10.1111/febs.13982. Available from: https://doi.org/10.1111/febs.13982

4. Can you make other non-natural amino acids? Design some new amino acids.

Yes, it is possible to make non-natural amino acids and to incorporate them into proteins using engineered tRNA/aminoacyl-tRNA synthetase pairs with reassigned codons [1]. The general workflow would be: choose a base amino acid → modify its side chain to add a new function → synthesize it and introduce an engineered tRNA/synthetase pair so that the new amino acid can be recognized → insert it at a specific codon.

As a design example, one existing non-natural amino acid is para-azido-L-phenylalanine (pAzF), a phenylalanine derivative that carries an azide (-N₃) group. When pAzF is genetically incorporated into a protein at a chosen site, the azide can act as a chemical handle for attaching a fluorescent dye or imaging agent to that protein. This can help label or track proteins in cells and animals [2].

Bag SS, Saraogi I, Guo J. Editorial: Expansion of the Genetic Code: Unnatural Amino Acids and their Applications. Front Chem. 2022;10:958433. doi:10.3389/fchem.2022.958433.
Lightle HE, Kafley P, Lewis TR, Wang R. Site‑specific protein conjugates incorporating para‑azido‑L‑phenylalanine for cellular and in vivo imaging. Methods. 2023;219:95–101. doi:10.1016/j.ymeth.2023.10.001

5. Where did amino acids come from before enzymes that make them, and before life started?

Amino acids are hypothesized to have first formed on the primordial Earth [1] through abiotic synthesis under early environmental conditions (such as electrical discharges and impact-driven reactions during the Hadean Eon), before life existed. Over time, as organisms evolved in the Archean and Proterozoic Eons, they developed enzyme-mediated biosynthetic pathways to produce amino acids internally, eventually supporting the diversity of life seen in the three domains of Archaea, Bacteria, and Eukarya.

Nature Education. An evolutionary perspective on amino acids. Nature Scitable. 2014. Available from: https://www.nature.com/scitable/topicpage/an-evolutionary-perspective-on-amino-acids-14568445

6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

If an α-helix is made from D-amino acids instead of L-amino acids, it would form a left-handed helix [1]. Natural proteins use L-amino acids, which form right-handed α-helices. Because D-amino acids are the mirror images of L-amino acids, a helix built from them reverses that twist and is left-handed.

Perlego. Alpha helix. Perlego Chemistry Index. Available from: https://www.perlego.com/index/chemistry/alpha-helix
OpenAI. ChatGPT (version 5.2)

7. Can you discover additional helices in proteins?

Yes, it is possible to have additional helices in proteins, both natural and artificially designed. This includes the 3₁₀ helix, which is a secondary structure found in proteins and polypeptides.

Another is the pi/π helix, which is a secondary structure found in proteins.

8. Why are most molecular helices right-handed?

9. Why do β-sheets tend to aggregate?

β-sheets tend to aggregate because of their structure: their exposed edge strands have available hydrogen-bonding groups [1]. This leaves them susceptible to interactions with other β-sheets.

o What is the driving force for β-sheet aggregation?

The main driving force is the formation of intermolecular hydrogen bonds between the backbone groups of neighbouring strands. Once the strands are aligned, hydrophobic side-chain interactions and van der Waals forces between tightly packed residues further stabilize the β-sheet aggregates [1].

Richardson JS, Richardson DC. Natural β-sheet proteins use negative design to avoid edge-to-edge aggregation. Proc Natl Acad Sci U S A. 2002 Mar 5;99(5):2754–9. doi:10.1073/pnas.052706099.

10. Why do many amyloid diseases form β-sheets?

Given that β-sheets allow extensive intermolecular backbone hydrogen bonding, they enable multiple protein molecules to stick together. For example, in Alzheimer’s disease, amyloid-β peptides misfold and aggregate into fibrils that are rich in β-sheet structure. These facilitate plaque formation in the brain [1].

o Can you use amyloid β-sheets as materials?

Amyloid β-sheets can be used as materials because their cross-β structure forms highly stable, self-assembling nanofibers. These properties allow them to be developed into biomaterials such as hydrogels and nanofibers.

Ow SY, Dunstan DE. A brief overview of amyloids and Alzheimer’s disease. Protein Sci. 2014 Oct;23(10):1315–31. doi:10.1002/pro.2524.

Part B. Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

1. Briefly describe the protein you selected and why you selected it.

I selected the Sonic Hedgehog (SHH) protein, firstly because I was a massive Sonic the Hedgehog fan. Later, as I started studying and working, I became interested in biology, neuroscience, and mental health. I found out that the protein has important functions in information exchange at the fetal stage, central nervous system development, and tooth enamel growth, and it has also been suggested that it may have regenerative functions for hair growth. Dysregulation, on the other hand, has been linked to aging-related neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, and amyotrophic lateral sclerosis.

2. Identify the amino acid sequence of your protein.

From UniProt:

sp|Q15465|SHH_HUMAN Sonic hedgehog protein OS=Homo sapiens OX=9606 GN=SHH PE=1 SV=1

MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG RYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGV KLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAH IHCSVKAENSVAAKSGGCFPGSATVHLEQGGTKLVKDLSPGDRVLAADDQGRLLYSDFLT FLDRDDGAKKVFYVIETREPRERLLLTAAHLLFVAPHNDSATGEPEASSGSGPPSGGALG PRALFASRVRPGQRVYVVAERDGDRRLLPAAVHSVTLSEEAAGAYAPLTAQGTILINRVL ASCYAVIEEHSWAHRAFAPFRLAHALLAALAPARTDRGGDSGGGDRGGGGGRVALTAPGA ADAPGAGATAGIHWYSQLLYQIGTWLLDSEALHPLGMAVKSS

Results from the Colab notebook:

Length: 462 amino acids

Most frequent: A (57 times, 12.3%)

This matches the number provided on UniProt
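The length and composition figures can be reproduced with a few lines of Python. The sketch below runs on only the first 60 residues of the UniProt entry above, so its numbers differ from the notebook output; pasting the full sequence in place of `fragment` should reproduce them. The helper name `composition` is hypothetical.

```python
from collections import Counter


def composition(sequence: str) -> tuple[int, str, int, float]:
    """Return (length, most common residue, its count, its percentage)."""
    residues = "".join(sequence.split()).upper()  # drop spaces and line breaks
    residue, count = Counter(residues).most_common(1)[0]
    return len(residues), residue, count, 100 * count / len(residues)


# First 60 residues of the UniProt sequence above (illustration only).
fragment = "MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG"
length, residue, count, percent = composition(fragment)
print(f"Length: {length}, most frequent: {residue} ({count} times, {percent:.1f}%)")
```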

o How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

After inputting the sequence into UniProt’s BLAST tool, there are 244 homologs identified.

o Does your protein belong to any protein family?

It belongs to the hedgehog family.

3. Identify the structure page of your protein in RCSB o When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)

After the search, I selected:

8Z2V | pdb_00008z2v

Crystal structure of Sonic hedgehog in complex with antibody 5E1 mutant H-R102A with metals

Resolution: 1.89 Å

This was deposited on April 13, 2024 and released in the PDB on April 16, 2025.

A resolution of 1.89 Å indicates a good-quality structure: the lower the value, the more detail is resolved and the more accurately the atomic positions can be interpreted.

o Are there any other molecules in the solved structure apart from protein?

The solved structure 8Z2V includes the heavy and light chains of the antibody 5E1 to which it is bound, as well as several small molecules: glycerol, zinc ions, calcium ions, and a chloride ion.

o Does your protein belong to any structure classification family?

The RCSB entry is classified under “Immune System”.

4. Open the structure of your protein in any 3D molecule visualization software:

o PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)

After loading the protein, I got:

o Visualize the protein as “cartoon”, “ribbon” and “ball and stick”

Cartoon:

Ball & Stick:

Ribbon:

o Color the protein by secondary structure. Does it have more helices or sheets?

On PyMol I used:

    # color by secondary structure
    color red, ss h        # helices
    color yellow, ss s     # sheets
    color green, ss l+''   # loops/coils

Upon visual inspection, there seems to be more sheets.

o Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

To inspect holes/pockets, I made the surface partially transparent.

Setting: transparency set to 0.30000.

scene: scene stored as “004”.

I then restarted with the following code:

    fetch 3m1n
    show surface
    show spheres, organic
    set transparency, 0.3

This showed spheres:

This showed some deeply embedded pockets, with one (I think!) more towards the surface.

May need some help with this!!

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

1. Deep Mutational Scans

a. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.

b. Can you explain any particular pattern? (choose a residue and a mutation that stands out)

Dark vertical stripes in the heatmap indicate positions where nearly all mutations score negatively; these correspond to highly conserved residues critical for SHH function. Position 141 (His), part of the zinc-binding motif, shows strongly negative LLR scores for most substitutions, reflecting its essential role in zinc coordination. Interestingly, my ESMFold experiments showed that mutating this site (H141A/H142A) preserved the predicted backbone fold while likely abolishing function, consistent with the language model’s predictions. In contrast, position 39 showed a near-neutral score (-0.08) for arginine substitution, as expected given its location in the signal peptide, which is cleaved after translation and is therefore under weaker evolutionary pressure.
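The LLR scores in such a scan are log-likelihood ratios between a mutant residue and the wild-type residue at the same position. The sketch below shows that calculation on made-up numbers: the random array stands in for a language model's per-position output, and `llr_matrix` and the five-residue toy sequence are hypothetical, not the notebook's code.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def llr_matrix(log_probs: np.ndarray, wild_type: str) -> np.ndarray:
    """LLR[i, a] = log p(a at position i) - log p(wild-type residue at position i)."""
    wt_index = np.array([AMINO_ACIDS.index(aa) for aa in wild_type])
    wt_log_probs = log_probs[np.arange(len(wild_type)), wt_index]
    return log_probs - wt_log_probs[:, None]


# Made-up per-position scores standing in for a language model's output.
rng = np.random.default_rng(0)
wild_type = "GHHSE"
logits = rng.normal(size=(len(wild_type), len(AMINO_ACIDS)))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax

scores = llr_matrix(log_probs, wild_type)
position = 1  # zero-based index of the first H in the toy sequence
print({aa: round(float(s), 2) for aa, s in zip(AMINO_ACIDS, scores[position])})
```

By construction the wild-type residue scores exactly zero, so a column of strongly negative values means the model considers almost every substitution at that position unlikely.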

c. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.

As far as I could find, no systematic DMS dataset exists for SHH, though with more time I could search more deeply. I would need some help with this question.

2. Latent Space Analysis

a. Use the provided sequence dataset to embed proteins in reduced dimensionality.

After realising that SHH was not included in the code, with the help of ChatGPT, I coded:

    #####################################################################################################
    import numpy as np
    import pandas as pd
    import plotly.express as px
    import torch
    from sklearn.manifold import TSNE

    # Assumes tokenizer, esm2, protein_sequence, embeddings_array and
    # protein_sequence_annotations are already defined earlier in the notebook.

    # 1. Embed SHH sequence
    shh_tokens = tokenizer(
        [protein_sequence],  # your already-defined protein_sequence variable
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=tokenizer.model_max_length,
    )

    with torch.no_grad():
        shh_outputs = esm2(
            input_ids=shh_tokens['input_ids'],
            attention_mask=shh_tokens['attention_mask'],
            output_hidden_states=True,
        )

    # Mean pool the embedding
    shh_embedding = shh_outputs.hidden_states[-1][0]
    shh_mask = shh_tokens['attention_mask'][0]
    shh_mean_embedding = shh_embedding[shh_mask == 1].mean(dim=0).cpu().numpy()

    # 2. Stack with existing embeddings and re-run t-SNE
    all_embeddings = np.vstack([embeddings_array, shh_mean_embedding])

    # Note: newer scikit-learn releases call the n_iter argument max_iter
    tsne_3d_new = TSNE(n_components=3, perplexity=30, n_iter=300, random_state=42)
    embeddings_3d_new = tsne_3d_new.fit_transform(all_embeddings)

    # 3. Build dataframe
    tsne_df_new = pd.DataFrame(embeddings_3d_new, columns=['TSNE1', 'TSNE2', 'TSNE3'])

    # Add labels: SCOP proteins + SHH
    labels = protein_sequence_annotations[:len(embeddings_array)] + ['SHH (Sonic Hedgehog)']
    tsne_df_new['label'] = labels
    tsne_df_new['is_SHH'] = ['SHH' if i == len(embeddings_array) else 'Other' for i in range(len(tsne_df_new))]

    # Create a numerical column for marker size
    tsne_df_new['marker_size'] = tsne_df_new['is_SHH'].apply(lambda x: 10 if x == 'SHH' else 3)

    # 4. Plot with SHH highlighted
    fig_shh = px.scatter_3d(
        tsne_df_new,
        x='TSNE1', y='TSNE2', z='TSNE3',
        color='is_SHH',
        color_discrete_map={'SHH': 'red', 'Other': 'lightblue'},
        hover_name='label',
        title='3D t-SNE with SHH Highlighted',
        size='marker_size',  # use the new numerical size column
    )

    fig_shh.update_layout(height=800)
    fig_shh.show()
    #####################################################################################################

This produced:

b. Analyze the different formed neighborhoods: do they approximate similar proteins?

The 3D t-SNE plot shows a single continuous distribution of SCOP protein embeddings with no sharply defined clusters, suggesting protein sequence space varies gradually across structural families. Outlier points at the periphery represent the most divergent sequences, consistent with the known continuity of protein fold space.

c. Place your protein in the resulting map and explain its position and similarity to its neighbors

SHH appears as a distinct red point near the periphery of the t-SNE cloud, reflecting its unusual biochemical features, including autocatalytic processing and lipid modification, that makes it distinct from most SCOP representatives. Despite this, it remains within the main cloud boundary, indicating shared broad sequence features with neighbouring proteins. Its nearest neighbours would be expected to include other hedgehog family members (IHH, DHH), consistent with ESM2 capturing evolutionary relationships through sequence alone.
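Since t-SNE does not preserve distances faithfully, one way to check which proteins are actually closest to SHH is to rank them by cosine similarity in the original embedding space. The sketch below uses made-up 4-dimensional vectors and placeholder labels; `nearest_neighbours` is a hypothetical helper, and real ESM2 embeddings have hundreds of dimensions.

```python
import numpy as np


def nearest_neighbours(query: np.ndarray, embeddings: np.ndarray,
                       labels: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Rank stored embeddings by cosine similarity to the query vector."""
    unit_query = query / np.linalg.norm(query)
    unit_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit_embeddings @ unit_query
    top = np.argsort(similarity)[::-1][:k]  # indices of the k highest similarities
    return [(labels[i], float(similarity[i])) for i in top]


# Made-up vectors and labels, for illustration only.
labels = ["protein_A", "protein_B", "protein_C"]
embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.0, 0.1, 0.0])
print(nearest_neighbours(query, embeddings, labels, k=2))
```

In the notebook, `query` would be the mean-pooled SHH embedding and `embeddings` the array of SCOP embeddings, which would test the expectation about neighbours directly.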

C2. Protein Folding

Folding a protein

Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

Total sequence length: 462

Running ESMFold inference for sequence with length 462…

Prediction complete. ptm: 0.603 plddt: 78.225

Results saved to SHH_Fold_V1_3a3ca/

CPU times: user 1min 26s, sys: 8.6 s, total: 1min 35s

Wall time: 2min 8s

ESMFold predicted the SHH structure with a pTM of 0.603 and mean pLDDT of 78.2. The pTM score above 0.5 suggests the overall fold topology is likely correct, while the pLDDT of 78.2 indicates confident but not perfect local coordinate prediction. Regions of lower confidence likely correspond to flexible loops and the signal peptide. A full structural comparison via RMSD alignment to the crystal structure 1VHH would further quantify coordinate accuracy.
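The RMSD comparison mentioned above can be sketched with the Kabsch algorithm. This is a minimal sketch that assumes the two coordinate sets (for example Cα atoms from the prediction and from the crystal structure) have already been matched residue by residue; the random coordinates below are placeholders, not SHH data.

```python
import numpy as np


def kabsch_rmsd(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    a = coords_a - coords_a.mean(axis=0)  # remove translation
    b = coords_b - coords_b.mean(axis=0)
    # Kabsch algorithm: the SVD of the covariance matrix gives the best rotation.
    u, _, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(u @ vt))  # guard against an improper rotation (reflection)
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = a @ rotation - b
    return float(np.sqrt((diff ** 2).sum() / len(a)))


# Placeholder coordinates: a random point cloud and a rotated, shifted copy of it.
rng = np.random.default_rng(1)
predicted = rng.normal(size=(50, 3))
theta = np.deg2rad(30)
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
reference = predicted @ rot_z + np.array([5.0, -2.0, 1.0])
print(f"RMSD after superposition: {kabsch_rmsd(predicted, reference):.3f} A")
```

Because the second set is just a rotated and shifted copy of the first, the RMSD here is close to zero; real predicted and experimental structures would give a larger value.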

Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

Mutation:

MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG RYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGV KLRVTEGWDEDGAHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAH IHCSVKAENSVAAKSGGCFPGSATVHLEQGGTKLVKDLSPGDRVLAADDQGRLLYSDFLT FLDRDDGAKKVFYVIETREPRERLLLTAAHLLFVAPHNDSATGEPEASSGSGPPSGGALG PRALFASRVRPGQRVYVVAERDGDRRLLPAAVHSVTLSEEAAGAYAPLTAQGTILINRVL ASCYAVIEEHSWAHRAFAPFRLAHALLAALAPARTDRGGDSGGGDRGGGGGRVALTAPGA ADAPGAGATAGIHWYSQLLYQIGTWLLDSEALHPLGMAVKSS

Changed HH→AA at zinc-binding site (positions 141-142)

Total sequence length: 462

Running ESMFold inference for sequence with length 462…

Prediction complete. ptm: 0.602 plddt: 78.128

Results saved to test_2cd60/

CPU times: user 1min 25s, sys: 8.45 s, total: 1min 33s

Wall time: 2min 3s

A double point mutation at the zinc-binding site (H141A/H142A) had a negligible effect on the predicted structure (pTM 0.603 vs. 0.602, pLDDT 78.2 vs. 78.1), suggesting that SHH’s predicted fold is resilient to point mutations even at functionally critical residues. This reflects ESMFold’s prediction only; it does not measure the effect on zinc binding or stability.
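Before interpreting a mutant run, it can help to confirm which positions actually differ from the wild type, since hand-edited sequences are easy to get wrong. A minimal sketch follows, assuming both sequences are plain strings of equal length; the example uses a toy sequence, not SHH.

```python
def list_substitutions(wild_type, mutant):
    """Return substitutions as strings such as 'H3A' (1-based positions).

    Only valid for equal-length sequences, i.e. point substitutions
    without insertions or deletions.
    """
    if len(wild_type) != len(mutant):
        raise ValueError("Sequences differ in length; align them first.")
    return [
        f"{wt}{pos}{mt}"
        for pos, (wt, mt) in enumerate(zip(wild_type, mutant), start=1)
        if wt != mt
    ]

# Toy example (not the SHH sequence)
print(list_substitutions("MKHHSE", "MKAASE"))  # ['H3A', 'H4A']
```

Running this on the wild-type and mutated sequences (with whitespace removed) gives the exact list of changes that were submitted to ESMFold.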

Mutation:

Alanine substitution was chosen as it removes side chain functionality while preserving backbone geometry, representing a conservative but informative structural perturbation.

MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG RYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGV KLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAH IHCSVKAENSVAAKSGGCFPGSATVHLEQGGTKLVKDLSPGDRVLAADDQGRLLYSAAAAAAAAAAAAAAAAAAAAAAAAFLDRDDGAKKVFYVIETREPRERLLLTAAHLLFVAPHNDSATGEPEASSGSGPPSGGALG PRALFASRVRPGQRVYVVAERDGDRRLLPAAVHSVTLSEEAAGAYAPLTAQGTILINRVL ASCYAVIEEHSWAHRAFAPFRLAHALLAALAPARTDRGGDSGGGDRGGGGGRVALTAPGA ADAPGAGATAGIHWYSQLLYQIGTWLLDSEALHPLGMAVKSS

Replaced a short surface-region segment with a stretch of alanines (about 25 residues)

This resulted in:

Total sequence length: 482

Running ESMFold inference for sequence with length 482…

Prediction complete. ptm: 0.554 plddt: 72.026

Results saved to SHH_FinalMut_d0fdf/

CPU times: user 1min 40s, sys: 10.3 s, total: 1min 51s

Wall time: 2min 27s

A large segment mutation was introduced by replacing a short surface-region segment with a polyalanine stretch (about 25 residues). Because the stretch is longer than the segment it replaced, the sequence length increased from 462 to 482 residues. This caused a moderate reduction in predicted structural confidence (pTM 0.603 → 0.554, pLDDT 78.2 → 72.0), while the pTM remained above the 0.5 threshold, indicating overall resilience of the predicted fold.

C3. Protein Generation

sequence candidates via ProteinMPNN

Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

After installing necessary packages for ProteinMPNN, I input the latest PDB for SHH protein, 8Z2V.

Heat map:

Sequence comparison:

Generating sequences…

8Z2V, score=1.5464, fixed_chains=[], designed_chains=[‘A’], model_name=v_48_020

LTPLAYKQFIPNVAEKTLGASGRYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGVKLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAHIHCSVKAE

T=0.1, sample=0, score=0.8104, seq_recovery=0.4467

LTPLAPGERVPPVPEDSPEAAGPYLGRVERGDPRFADLVPDTDPDIEFADADGDGNDRLHTPKLVAVLRRLARLVREAWPGLRLRVLRGWSLDGDGSPRSHHYNGREADVTFSDEDAARLGALAALAVEAGADWVELASPDYVEIAVRPE

The ProteinMPNN probability heatmap shows that many positions along the SHH backbone are highly constrained, with single amino acids receiving probabilities exceeding 0.9 (yellow). This reflects strong structural determinism: the backbone geometry dictates specific residue preferences at key positions. A minority of positions, particularly around residues 95–105, show broader probability distributions across multiple amino acids, indicating structurally tolerant, surface-exposed regions. This mix of sharply peaked and broader positions is consistent with the 44.67% sequence recovery observed: slightly under half of the positions recover the native residue, while the remainder tolerate sequence variation.
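Sequence recovery can be reproduced by hand as the fraction of positions where the designed residue equals the native one. The sketch below assumes the two sequences are already aligned and of equal length, and that recovery is computed over all designed positions; the example strings are toy fragments.

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where designed and native residues match."""
    if len(native) != len(designed):
        raise ValueError("Sequences must be the same length.")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

# Toy fragments: 5 of 6 positions are identical
print(sequence_recovery("LTPLAY", "LTPLAP"))
```

Under that assumption, a recovery of 0.4467 over 150 residues would correspond to about 67 identical positions.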

Input this sequence into ESMFold and compare the predicted structure to your original.

Inputting designed sequence back into ESMFold:

Original:

LTPLAYKQFIPNVAEKTLGASGRYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGVKLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAHIHCSVKAE

Designed:

LTPLAPGERVPPVPEDSPEAAGPYLGRVERGDPRFADLVPDTDPDIEFADADGDGNDRLHTPKLVAVLRRLARLVREAWPGLRLRVLRGWSLDGDGSPRSHHYNGREADVTFSDEDAARLGALAALAVEAGADWVELASPDYVEIAVRPE

Total sequence length: 150

Running ESMFold inference for sequence with length 150…

Prediction complete. ptm: 0.910 plddt: 90.664

Results saved to SHH_Inverse_FinalMut_3f7ad/

CPU times: user 10.2 s, sys: 8.37 s, total: 18.6 s

Wall time: 47.2 s

The ProteinMPNN-designed sequence, when folded by ESMFold, achieved a pTM of 0.910 and a pLDDT of 90.664, substantially higher than the native SHH sequence (pTM 0.603, pLDDT 78.2). This improvement likely reflects two factors. First, the designed sequence covers only the structured core of SHH (150 residues vs. 462), excluding disordered regions such as the signal peptide that reduce confidence scores, so the two runs are not a like-for-like comparison. Second, ProteinMPNN explicitly optimises sequences for compatibility with the given backbone, which tends to produce sequences that structure predictors fold with high confidence; higher confidence does not by itself demonstrate greater thermodynamic stability than the evolutionarily derived native sequence.

Part D. Group Brainstorm on Bacteriophage Engineering

Proposal by: Sameen Nasar, Robert C Beck

Group Project Goal: Engineering a chaperone-independent efficient MS2 lysis protein

Project Rationale:

The efficacy of bacteriophage MS2 as an antibacterial agent is currently limited by the host’s ability to evolve resistance. Specifically, E. coli can mutate the molecular chaperone DnaJ (e.g., at position P330), disrupting the interaction that the MS2 lysis (L) protein needs in order to fold and function [1]. DnaJ binds to the N-terminal domain of the L protein and alleviates that domain’s inhibitory effect on lytic activity.

We propose engineering a self-activating L protein by replacing its inhibitory, chaperone-dependent N-terminal region with a computationally designed, thermodynamically stable scaffold. As this original domain is dispensable for actual lysis but creates the DnaJ dependency [2], our redesign conceptually eliminates the need for the molecular “handshake” between host and phage, allowing MS2 to fold independently and bypass bacterial control mechanisms entirely.

Schematic

MS2 Protein & DnaJ Sequences
↓
AlphaFold-Multimer
Map the DnaJ binding interface

↓
RFDiffusion
Design a stable, independent N-terminal scaffold

↓
ProteinMPNN
Generate amino acid sequences for the new scaffold

↓
ESMFold
Confirm the new single-chain mutant folds correctly

↓
AlphaFold-Multimer
Confirm the mutant no longer binds to DnaJ
↓
Final L Protein Mutant for Synthesis

References

Chamakura KR, Tran JS, Young R. MS2 lysis of Escherichia coli depends on host chaperone DnaJ. J Bacteriol. 2017;199(9):e00058-17. doi:10.1128/JB.00058-17.
Chamakura KR, Edwards GB, Young R. Mutational analysis of the MS2 lysis protein L. Microbiology (Reading). 2017;163(7):961–969. doi:10.1099/mic.0.000485.

Week 5 HW: Protein Design Part II

Part A: SOD1 Binder Peptide Design (From Pranam)

Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.

Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.

Your challenge:

Design short peptides that bind mutant SOD1.
Then decide which ones are worth advancing toward therapy.

You will use three models developed in our lab:

• PepMLM: target sequence-conditioned peptide generation via masked language modeling

• PeptiVerse: therapeutic property prediction

• moPPIt: motif-specific multi-objective peptide design using Multi-Objective Guided Discrete Flow Matching (MOG-DFM)

Part 1: Generate Binders with PepMLM

Note: I initially did this part incorrectly because I had not introduced the mutation into the sequence, so I repeated it. The following is the latest attempt.

1. Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.

From UniProt:

sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2

MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

After introducing the A4V mutation (position 5 in this sequence, because the initiator methionine is counted):

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
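The numbering offset is an easy place to slip, so the substitution can be applied programmatically with a check on the expected wild-type residue. This is a minimal sketch, assuming the mutation name uses mature-protein numbering (initiator methionine not counted) while the sequence string still starts with Met; the function name is hypothetical.

```python
def apply_mature_numbered_mutation(precursor_seq, mutation):
    """Apply a mutation such as 'A4V' given in mature-protein numbering
    to a sequence that still includes the initiator methionine."""
    wild_type = mutation[0]
    position = int(mutation[1:-1])
    new_residue = mutation[-1]
    # Mature position n is 1-based position n + 1 with Met counted,
    # which is 0-based index n in the Python string.
    index = position
    if precursor_seq[index] != wild_type:
        raise ValueError(
            f"Expected {wild_type} at mature position {position}, "
            f"found {precursor_seq[index]}"
        )
    return precursor_seq[:index] + new_residue + precursor_seq[index + 1:]

sod1_start = "MATKAVCVLKGDGPVQ"  # first 16 residues of the UniProt sequence above
print(apply_mature_numbered_mutation(sod1_start, "A4V"))  # MATKVVCVLKGDGPVQ
```

The wild-type check raises an error if the numbering convention is wrong, which would have caught an unmutated or mis-indexed input early.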

2. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:

Using PepMLM, four candidate peptides of length 12 were generated conditioned on the mutant SOD1 (A4V) sequence. The generated peptides were WRYGPYAIELAX (pseudo-perplexity 11.85), WRYYVAALEWWE (28.73), WHNYAAAIRLKX (15.20), and WHSYAAAAELKX (9.48). For comparison, the known SOD1-binding peptide FLYRWLPSRRGG was scored against the same target, yielding a pseudo-perplexity of 20.64. Lower pseudo-perplexity values indicate higher model confidence in the predicted binder. Three of the four generated peptides outperformed the known binder, with WHSYAAAAELKX achieving the lowest score of 9.48. Notably, three of the four generated peptides (WRYGPYAIELAX, WHNYAAAIRLKX, and WHSYAAAAELKX) end in an X residue, representing an unknown or masked amino acid; these are the same three peptides that outscored the known binder. This suggests a mismatch between the specified peptide length and the model’s generation process, and these sequences should be treated with caution or re-generated with corrected parameters before advancing to downstream evaluation.
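To make the score easier to interpret, here is a sketch of how pseudo-perplexity is usually defined for masked language models: the exponential of the negative mean log-probability assigned to each residue when that residue is masked. It is an assumption that the notebook's compute_pseudo_perplexity follows this definition; the per-residue log-probabilities below are illustrative, not model output.

```python
import math

def pseudo_perplexity(token_log_probs):
    """exp(-mean log-probability), with one log-probability per masked residue."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Illustrative case: a model guessing uniformly over the 20 standard amino acids
uniform = [math.log(1 / 20)] * 12
print(round(pseudo_perplexity(uniform), 2))  # 20.0
```

Under this definition, a value near 20 is roughly what uniform guessing over the standard amino acids would give, which offers a reference point for the scores reported here.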

3. Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.

Binder	Pseudo Perplexity
WRYGPYAIELAX	11.85063412
WRYYVAALEWWE	28.7286821
WHNYAAAIRLKX	15.20319465
WHSYAAAAELKX	9.482601001

4. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.

To find the perplexity score for FLYRWLPSRRGG, I added this code (with help from an LLM) to the Colab notebook to compute its pseudo-perplexity:

known_peptide = "FLYRWLPSRRGG"

ppl_score = compute_pseudo_perplexity(model, tokenizer, protein_seq, known_peptide)

print(f"Peptide: {known_peptide}")

print(f"Pseudo Perplexity: {ppl_score}")

This resulted in:

Binder	Pseudo Perplexity
WRYGPYAIELAX	11.85063412
WRYYVAALEWWE	28.7286821
WHNYAAAIRLKX	15.20319465
WHSYAAAAELKX	9.482601001
FLYRWLPSRRGG	20.63523127

5. Record the perplexity scores that indicate PepMLM’s confidence in the binders.

After generating the 12-amino-acid peptides with PepMLM on the mutant SOD1 sequence, I recorded the pseudo-perplexity score for each (lower scores indicate higher model confidence). I then added the known SOD1-binding peptide FLYRWLPSRRGG as a reference for comparison, yielding a pseudo-perplexity of 20.64. Of the four generated peptides, three outperformed the known binder: WHSYAAAAELKX (9.48), WRYGPYAIELAX (11.85), and WHNYAAAIRLKX (15.20), while WRYYVAALEWWE (28.73) scored worse. The best-performing generated peptide, WHSYAAAAELKX, achieved less than half the pseudo-perplexity of the known binder (9.48 vs. 20.64), suggesting strong model confidence in its predicted binding to the A4V mutant SOD1 target.
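The comparison can be reproduced from the recorded table with a few lines of Python. This sketch ranks the generated peptides, lists those that beat the reference, and flags any containing the unknown residue X; the variable names are illustrative.

```python
scores = {
    "WRYGPYAIELAX": 11.85063412,
    "WRYYVAALEWWE": 28.7286821,
    "WHNYAAAIRLKX": 15.20319465,
    "WHSYAAAAELKX": 9.482601001,
    "FLYRWLPSRRGG": 20.63523127,  # known binder, used as the reference
}
reference = "FLYRWLPSRRGG"

generated = {pep: s for pep, s in scores.items() if pep != reference}
ranked = sorted(generated, key=generated.get)  # lowest pseudo-perplexity first
better_than_reference = [p for p in ranked if generated[p] < scores[reference]]
contains_unknown = [p for p in ranked if "X" in p]

print("Ranked:", ranked)
print("Better than reference:", better_than_reference)
print("Contain X:", contains_unknown)
```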

Part 2: Evaluate Binders with AlphaFold3

1. Navigate to the AlphaFold Server: alphafoldserver.com

2. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.

3. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?

SOD1_ProtPeptide1 (WRYGPYAIELAX)

ipTM: 0.33

The predicted complex produced an ipTM score of 0.33, indicating low confidence in the interaction between the peptide and mutant SOD1. In the structural model, the peptide appears detached from the protein surface and does not localize near the N-terminal region where the A4V mutation occurs. Instead, it remains largely solvent-exposed and does not form clear contacts with the β-barrel region of SOD1.

SOD1_ProtPeptide2 (WRYYVAALEWWE)

ipTM: 0.28

The predicted complex produced an ipTM score of 0.28, indicating low confidence in the interaction between the peptide and mutant SOD1. In the structural model, the peptide appears detached from the protein surface, adopting a partially helical conformation in the periphery of the structure but failing to localize near the N-terminal region where the A4V mutation occurs. The peptide does not form clear contacts with the β-barrel core and remains largely solvent-exposed.

SOD1_ProtPeptide3 (WHNYAAAIRLKX)

ipTM: 0.39

Although this model has the highest ipTM score, it shows the same problems: the peptide is detached from the protein surface, forms no clear contacts, and remains solvent-exposed.

SOD1_ProtPeptide4 (WHSYAAAAELKX)

ipTM: 0.26

This peptide shows the same trends: it appears detached from the protein surface and largely solvent-exposed.

4. In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

AlphaFold3 predictions of the SOD1-peptide complexes produced relatively low ipTM scores, ranging from 0.26 to 0.39. This indicates low confidence in stable interactions between the generated peptides and mutant SOD1. In the predicted structures, the peptides generally appear surface-exposed and loosely structured, and they do not consistently localize near the N-terminal region where the A4V mutation occurs. They do not form clear interfaces with the β-barrel core or the dimer interface.

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:

Paste the peptide sequence.
Paste the A4V mutant SOD1 sequence in the target field.
Check the boxes:
Predicted binding affinity
Solubility
Hemolysis probability
Net charge (pH 7)
Molecular weight

Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties?

Choose one peptide you would advance and justify your decision briefly.

Candidate Peptides:

#	Binder	Pseudo Perplexity
0	WRYGPYAIELAX	11.85063412
1	WRYYVAALEWWE	28.7286821
2	WHNYAAAIRLKX	15.20319465
3	WHSYAAAAELKX	9.482601001

(0)

(1)

(2)

(3)

#	Binder	ipTM	Predicted binding affinity	Solubility	Hemolysis probability	Net charge (pH 7)	Molecular weight
0	WRYGPYAIELAX	0.33	5.791	1	0.084	-0.24	1320.7
1	WRYYVAALEWWE	0.28	7.750	1	0.190	-1.23	1671.8
2	WHNYAAAIRLKX	0.39	5.972	1	0.019	1.85	1324.8
3	WHSYAAAAELKX	0.26	5.972	1	0.019	1.85	1324.8

Higher ipTM scores do not consistently correspond to stronger predicted binding affinity in this dataset. For example, WHNYAAAIRLKX (ipTM 0.39) has a predicted binding affinity of 5.972, while WRYYVAALEWWE (ipTM 0.28) shows a higher affinity of 7.750 despite its lower structural confidence score. This suggests that ipTM and binding affinity capture different aspects of peptide-target interaction and should be considered together rather than in isolation.

All four generated peptides are highly soluble and show low hemolysis probabilities, indicating a favourable therapeutic safety profile. WHNYAAAIRLKX stands out as the most balanced candidate: it achieves the highest ipTM score (0.39), a competitive predicted binding affinity (5.972), the maximum solubility score (1), the lowest hemolysis probability in the dataset (0.019, shared with WHSYAAAAELKX), and a positive net charge (1.85), which may favour interaction with negatively charged surface regions of SOD1. However, it ends in an unknown residue (X), which is a problem for synthesis. Alternatively, WRYYVAALEWWE could also be a candidate because of its higher predicted binding affinity and the absence of an X residue. Given its higher structural confidence (ipTM of approximately 0.4) compared to the others, WHNYAAAIRLKX would be the most promising candidate to advance for further investigation, provided the terminal X is first resolved to a defined residue.

Part 4: Generate Optimized Peptides with moPPIt

Because the sliders in the notebook were static, I edited the code directly and used:

###########################################################################################################################################################
# To meet the new selections

SELECTED_OBJECTIVES = ["Hemolysis", "Solubility", "Affinity", "Motif"]
OBJECTIVE_WEIGHTS_DICT = {
    "Hemolysis": 1.0,
    "Solubility": 1.0,
    "Affinity": 1.5,
    "Motif": 1.0,
}
OBJECTIVE_WEIGHTS_LIST = [1.0, 1.0, 1.5, 1.0]
OBJECTIVES_CFG = {
    "selected_objectives": SELECTED_OBJECTIVES,
    "weights_dict": OBJECTIVE_WEIGHTS_DICT,
    "weights_list": OBJECTIVE_WEIGHTS_LIST,
    "motif_positions": "1-10",
}

print("Saved:")
print("SELECTED_OBJECTIVES =", SELECTED_OBJECTIVES)
print("OBJECTIVE_WEIGHTS_DICT =", OBJECTIVE_WEIGHTS_DICT)
print("OBJECTIVE_WEIGHTS_LIST =", OBJECTIVE_WEIGHTS_LIST)
print("motif_positions =", OBJECTIVES_CFG["motif_positions"])

###########################################################################################################################################################

Binder	Hemolysis	Solubility	Binding Affinity	Motif
KKKKYITECLVM	0.9795	0.6667	7.1776	0.6455
ECYYVWTEQGTT	0.9730	0.8333	6.3594	0.5220
KLKQKKFTEKVC	0.9676	0.7500	6.8998	0.7254
SFQKINEKVKNA	0.9104	0.6667	6.8614	0.6816

Peptides generated with moPPIt differ from those generated by PepMLM through controlled, residue-specific generation targeting positions 1-10 of the A4V mutant SOD1 sequence, with simultaneous optimisation of hemolysis, solubility, affinity, and motif objectives.

The four generated peptides show different balances across the optimised properties. KKKKYITECLVM achieves the highest affinity score (7.178) and a strong hemolysis score (0.979), though its solubility is moderate (0.667). KLKQKKFTEKVC shows the highest motif score (0.725) alongside competitive affinity (6.900), suggesting strong localisation near the targeted N-terminal residues. ECYYVWTEQGTT offers the best solubility (0.833) but the lowest affinity and motif scores of the four. SFQKINEKVKNA presents a balanced profile across all objectives with the lowest hemolysis score (0.910).
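The per-objective comparison above can be checked directly from the score table. In this sketch the values are rounded to four decimals from the moPPIt output, and the helper name best_by is hypothetical.

```python
moppit_scores = {
    "KKKKYITECLVM": {"hemolysis": 0.9795, "solubility": 0.6667, "affinity": 7.1776, "motif": 0.6455},
    "ECYYVWTEQGTT": {"hemolysis": 0.9730, "solubility": 0.8333, "affinity": 6.3594, "motif": 0.5220},
    "KLKQKKFTEKVC": {"hemolysis": 0.9676, "solubility": 0.7500, "affinity": 6.8998, "motif": 0.7254},
    "SFQKINEKVKNA": {"hemolysis": 0.9104, "solubility": 0.6667, "affinity": 6.8614, "motif": 0.6816},
}

def best_by(objective, highest=True):
    """Return the peptide with the highest (or lowest) score for one objective."""
    pick = max if highest else min
    return pick(moppit_scores, key=lambda name: moppit_scores[name][objective])

for objective in ("affinity", "motif", "solubility"):
    print(f"Highest {objective}: {best_by(objective)}")
print("Lowest hemolysis score:", best_by("hemolysis", highest=False))
```

No single peptide is best on every objective, which is what one would expect from a multi-objective optimisation that trades properties off against each other.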

Compared to the PepMLM-generated peptides, the moPPIt peptides benefit from explicit multi-objective optimisation, producing sequences with higher predicted affinities and targeted motif engagement rather than purely sequence-conditioned sampling.

Before advancing toward therapeutic development, these peptides would require further evaluation through in vitro binding assays to confirm SOD1 interaction, proteolytic stability testing to assess degradation resistance, and cytotoxicity screening to verify safety before progressing to in vivo studies. Special attention should be paid to hemolysis: the hemolysis scores reported by this model are high (0.91–0.98), and it is not clear from the output alone whether a high score indicates high predicted toxicity.

Part C: Final Project: L-Protein Mutants

After running the code for analysis between predicted mutations and the experimental dataset, there is little to no overlap.

Process:

The model evaluated mutations using a log-likelihood ratio (LLR) derived from the probability distribution predicted by the ESM-2 protein language model. Mutations were then ranked by their LLR score, and predicted mutations were compared with experimental mutations using dataset merging.
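As a rough illustration of the two steps, the sketch below computes an LLR from a per-position probability distribution and then finds the overlap between predicted and experimental mutations. The probabilities are toy values, not ESM-2 output; the mutation sets are a small subset taken from the tables and text in this section; the overlap is computed with a plain set intersection as a stand-in for the dataset merge.

```python
import math

def log_likelihood_ratio(position_probs, wild_type, mutant):
    """LLR = log p(mutant) - log p(wild type) at a single position."""
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])

# Toy probabilities for one position (hypothetical values)
toy_probs = {"K": 0.05, "L": 0.40, "R": 0.10}
llr = log_likelihood_ratio(toy_probs, "K", "L")
print(f"LLR(K->L) = {llr:.3f}")  # positive: the model prefers L over K here

# Subsets of the predicted and experimental mutations from this section
predicted = {"K50L", "C29R", "Y39L"}
experimental = {"P13L", "S15A", "R18G", "R18I", "C29R"}
overlap = predicted & experimental
print("Overlap:", sorted(overlap))
```

A positive LLR only says the model finds the mutant residue more likely in that sequence context; it is not a direct prediction of lysis activity.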

The mutation C29R is present in both datasets. The experimental data show no lysis activity for this mutant, which highlights a limitation of the model: its predictions do not always correspond to functional outcomes.

Initially I tried to generate the full-length sequences in Excel by updating mutations at specific positions:

This was very tedious, so I switched to Python on the desktop. A script was written to apply selected point mutations to the wild-type sequence by modifying specific residue positions. The code I used is below:

###########################################################################################################################################################

# HTGAA W5 HW Part C: Multimer Assembly

# Base (wild-type) sequence
base_seq = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"

# Mutations based on the experimental dataset and the model
# (K50L was originally done manually in MS Excel)
mutations = {
    "Variant1": {"50": "L"},
    "Variant2": {"39": "L"},
    "Variant3": {"29": "R"},
    "Variant4": {"13": "L"},
    "Variant5": {"15": "A"},
}

def apply_mutation(seq, mutation_dict):
    """Return a copy of seq with each 1-based position replaced by the given residue."""
    seq_list = list(seq)
    for pos, aa in mutation_dict.items():
        seq_list[int(pos) - 1] = aa  # -1 because Python is 0-indexed
    return "".join(seq_list)

# Store sequences
variant_sequences = {}
for name, mut in mutations.items():
    variant_sequences[name] = apply_mutation(base_seq, mut)

# Save variants in a text file
with open("Af2_variants.txt", "w") as f:
    for name, seq in variant_sequences.items():
        f.write(f"{name}: {seq}\n")

###########################################################################################################################################################

Position of the mutation in L	Base Pair Changed	Amino Acid Position	Amino Acid Change	Lysis	Protein Levels
38	C->T	13	P->L	1	1
38	C->T	13	P->L	1	1
43	T->G	15	S->A	1	1
52	A->G	18	R->G	1	1
53	G->T	18	R->I	1	1

From the experimental dataset, I chose the following:

Position of the mutation in L	Base Pair Changed	Amino Acid Position	Amino Acid Change	Lysis	Protein Levels
38	C->T	13	P->L	1	1
43	T->G	15	S->A	1	1

From the model, I then selected the mutations with the highest LLR scores, as these are the ones most strongly favoured by the model.

Position	Wild_Type_AA	Mutation_AA	LLR_Score
50	K	L	2.561468
29	C	R	2.395427
39	Y	L	2.24178

K50L and Y39L introduce hydrophobic residues that can help stabilize packed or core regions of the protein, consistent with the tendency for hydrophobic side chains to support structural integrity [1]. C29R adds a charged residue in a position the model favours, which may create new stabilizing interactions without disrupting folding [2]. Together these selections balance predicted stability, polarity, and structural compatibility, supporting the goal of designing functional L protein variants [3].

References

Pace CN, Fu H, Fryar KL, Landua J, Trevino SR, Shirley BA, et al. Contribution of hydrophobic interactions to protein stability. J Mol Biol. 2011;408(3):514-28.
Doig AJ, Williams DH. Is the hydrophobic effect stabilizing or destabilizing in proteins? The contribution of disulphide bonds to protein stability. J Mol Biol. 1991;217(2):389-98.
Hendsch ZS, Tidor B. Do salt bridges stabilize proteins? A continuum electrostatic analysis. Proteins. 1994;20(1):1-10.

Week 6 HW: Genetic Circuits Part I: Assembly Technologies

Assignment: DNA Assembly

1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?

The template DNA is the mUAV plasmid, used at 20 ng/µL, with 0.8 µL added to the reaction. The primers are “colour forward” and “colour reverse”. Given that the primer stock concentration is 5 µM, using 2.5 µL of each primer in a total reaction volume of 25 µL results in a final primer concentration of 0.5 µM.

The Phusion HF Master Mix is a solution containing Phusion DNA polymerase (which synthesises the new strands), dNTPs (the nucleotide building blocks), and a reaction buffer with magnesium ions (a cofactor the polymerase requires), which together enable accurate and efficient DNA amplification in PCR [1]. It was added at 12.5 µL from a 2X stock, resulting in a final concentration of 1X in the reaction.

Nuclease-free water (6.8 µL) is added to bring the total reaction volume up to 25 µL and ensure all components are at the correct final concentrations.
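The dilution arithmetic above follows C1·V1 = C2·V2 and can be checked with a short script. This sketch assumes the primer stock is 5 µM (the value consistent with the stated 0.5 µM final concentration) and treats 25 µL as the nominal reaction volume; the function name is illustrative.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Final concentration after dilution (C1 * V1 = C2 * V2)."""
    return stock_conc * stock_volume_ul / total_volume_ul

TOTAL_UL = 25.0  # nominal reaction volume

primer_um = final_concentration(5.0, 2.5, TOTAL_UL)      # µM, assuming a 5 µM stock
master_mix_x = final_concentration(2.0, 12.5, TOTAL_UL)  # fold concentration (X)
template_ng = 20.0 * 0.8                                 # ng of plasmid template
listed_volume_ul = 0.8 + 2.5 + 2.5 + 12.5 + 6.8          # sum of the listed volumes

print(f"Primer: {primer_um} µM each")
print(f"Master mix: {master_mix_x}X")
print(f"Template: {template_ng:.1f} ng")
print(f"Sum of listed volumes: {listed_volume_ul:.1f} µL")
```

The listed volumes sum to 25.1 µL, so the 25 µL total is approximate; the difference has a negligible effect on the final concentrations.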

References

New England Biolabs. Phusion High-Fidelity PCR Master Mix with HF Buffer [Internet]. Available from: https://www.neb.com/en-gb/products/m0531-phusion-high-fidelity-pcr-master-mix-with-hf-buffer

2. What are some factors that determine primer annealing temperature during PCR?

The annealing temperature depends primarily on the melting temperature (Tm) of the primers; annealing is generally performed about 5 °C below the primers’ melting temperature [1]. Tm in turn depends on primer length, base composition (GC content), and the salt and ion concentrations in the reaction (such as Mg2+ and monovalent salts).
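To show how length and GC content feed into Tm, here is a sketch using the Wallace rule, Tm ≈ 2(A+T) + 4(G+C), a rough estimate intended for short oligonucleotides. It ignores salt and Mg2+ effects, so it is not a substitute for a vendor Tm calculator; the primer sequences are hypothetical, not the ones used in this lab.

```python
def wallace_tm(primer):
    """Rough Tm estimate in °C: 2 * (A + T) + 4 * (G + C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc

def suggested_annealing(primer_fwd, primer_rev, offset=5):
    """Annealing temperature set `offset` °C below the lower of the two Tm values."""
    return min(wallace_tm(primer_fwd), wallace_tm(primer_rev)) - offset

# Hypothetical primers for illustration only
fwd = "ATGCATGCATGCATGCAT"
rev = "GCGCATATGCGCATAT"
print(wallace_tm(fwd), wallace_tm(rev), suggested_annealing(fwd, rev))
```

Each G or C contributes twice as much as an A or T in this estimate, reflecting the greater stability of GC base pairs.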

References

Integrated DNA Technologies. How do you calculate the annealing temperature for PCR? [Internet]. Available from: https://eu.idtdna.com/pages/support/faqs/how-do-you-calculate-the-annealing-temperature-for-pcr-?

3. There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.

PCR	Restriction Enzyme Digests
Amplifies any region of DNA	Requires presence of specific restriction sites
Can introduce mutations, insertions, deletions, and overhangs for Gibson assembly	The enzymes target and cleave near these sites
Highly flexible and does not require specific sequences to introduce mutations [1]	Ends produced are either sticky or blunt
	They allow for precise insertion of DNA fragments into vectors [2]

For this protocol, which involves cloning by Gibson assembly, PCR is preferable because it allows precise amplification of DNA fragments while introducing the mutations and overlaps required for Gibson assembly. In contrast, restriction enzyme digestion is limited to existing recognition sites and does not easily introduce sequence changes.

References

National Human Genome Research Institute. Polymerase chain reaction (PCR) [Internet]. Bethesda (MD): NHGRI. Available from: https://www.genome.gov/genetics-glossary/Polymerase-Chain-Reaction-PCR
Thermo Fisher Scientific. Restriction enzyme basics [Internet]. Waltham (MA): Thermo Fisher Scientific. Available from: https://www.thermofisher.com/uk/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/molecular-cloning/restriction-enzymes/restriction-enzyme-basics.html

4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?

DpnI is a restriction enzyme that selectively digests methylated DNA, leaving unmethylated PCR products untouched [1]. Therefore, following the lab protocol, 1 µL of DpnI is added to each sample to digest the methylated mUAV template, so that only newly synthesised PCR fragments are carried into the Gibson Assembly step.

References

University of Wisconsin–Madison. Lab 4: Background [Internet]. Biochemistry 551 Online Lab Manual; Available from: https://wisc.pb.unizin.org/biochemistry551online/chapter/lab-4-background/

5. How does the plasmid DNA enter the E. coli cells during transformation?

The most common forms of transformation are:

(i) Heat shock: Chemically competent cells are exposed to an abrupt temperature change, which transiently permeabilises the cell envelope

(ii) Electroporation: A short electrical pulse generates transient pores in the cell membrane

Both methods transiently open pores in the cell envelope, through which plasmids enter the bacteria. After the heat or electric shock, the pores close again, and the plasmids replicate inside the bacteria.

6. Describe another assembly method in detail (such as Golden Gate Assembly)

Explain the other method in 5 - 7 sentences plus diagrams (either handmade or online).
Model this assembly method with Benchling or Asimov Kernel!

(1) Golden Gate Assembly is a molecular cloning method that uses only the sequential or simultaneous activities of a single type IIS restriction enzyme and T4 DNA ligase [1]. This enables multiple inserts to be placed into the vector backbone in a single reaction.

Type IIS enzymes include BsaI, BsmBI, and BbsI [1, 2]. These cut DNA at a defined distance away from their recognition sites, rather than within them. This feature enables the generation of user-defined overhangs (fusion sites), which can be designed to be unique and complementary, guiding the ordered ligation of DNA parts with high specificity [2].

The reaction is done in one tube, where restriction digestion and ligation by T4 DNA ligase take place together, increasing efficiency and reducing the number of steps. Importantly, the recognition sites are removed during assembly, resulting in a seamless DNA construct [2]. Cycling between ligation (16 °C) and digestion (37 °C) repeatedly re-cuts incorrect assemblies, which still carry recognition sites, and so enriches the desired product.

I tried to summarise this whole process in this illustrated diagram:

To start, I tried entering the J23100 promoter sequence (35 nucleotides) into Benchling:

To facilitate Golden Gate Assembly, the promoter was then designed with flanking BsaI sites that allow the enzyme to create unique 4-base overhangs, ensuring the fragment inserts into the backbone in the correct orientation without leaving a ‘scar’ sequence. Therefore, I input the following sequence (generated with some help from Gemini):

GGTCTCATCCCttgacggctagctcagtcctaggtacagtgctagcTACTTGAGACC
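One way to sanity-check a part like this is to compute the overhangs BsaI would leave. BsaI recognises GGTCTC and cuts one nucleotide downstream on the top strand and five on the bottom strand, giving a 4-nt 5' overhang. The sketch below is a simplified model for a linear fragment (it does not handle circular sequences or ligation), and the function name is hypothetical.

```python
BSAI_SITE = "GGTCTC"
BSAI_SITE_RC = "GAGACC"  # reverse complement, i.e. a site on the bottom strand

def bsai_overhangs(seq):
    """List (orientation, site_start, overhang) for each BsaI site in a linear sequence.

    Overhangs are reported as read on the top strand, 5' to 3'.
    """
    s = seq.upper()
    found = []
    start = s.find(BSAI_SITE)
    while start != -1:
        cut = start + len(BSAI_SITE) + 1  # top-strand cut, 1 nt after the site
        if cut + 4 <= len(s):
            found.append(("forward", start, s[cut:cut + 4]))
        start = s.find(BSAI_SITE, start + 1)
    start = s.find(BSAI_SITE_RC)
    while start != -1:
        end = start - 1  # bottom-strand cut, 1 nt upstream of the site
        if end - 4 >= 0:
            found.append(("reverse", start, s[end - 4:end]))
        start = s.find(BSAI_SITE_RC, start + 1)
    return found

part = "GGTCTCATCCCttgacggctagctcagtcctaggtacagtgctagcTACTTGAGACC"
print(bsai_overhangs(part))
# [('forward', 0, 'TCCC'), ('reverse', 51, 'TACT')]
```

For the part to ligate, the destination backbone has to expose overhangs complementary to these two. Comparing them against the backbone's own BsaI overhangs may be a useful first check when an assembly tool reports sticky-end errors, although that cause is not confirmed here.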

I was still having problems, so I added:

https://www.addgene.org/44335/ (as per suggestion from Gemini, given it is used in the CIDAR MoClo Parts Kit | https://www.addgene.org/kits/densmore-cidar-moclo/)

Now I’m getting these sticky end errors, too tired to solve it. But also, genuinely a bit lost and would like some more support on it.

Would need more help on this at one point!!

References

New England Biolabs. Golden Gate Assembly [Internet]. Ipswich (MA): New England Biolabs. Available from: https://www.neb-online.de/en/cloning-synthetic-biology/dna-assembly/golden-gate-assembly/
Laboratory Notes. Golden Gate Assembly [Internet]. Available from: https://www.laboratorynotes.com/golden-gate-assembly/

Assignment: Asimov Kernel

As committed online listeners associated with Lifefabs, we did not have access to Asimov Kernel. Therefore, with the permission of the Node leads, this part was skipped.

Week 7 Genetic Circuits Part II

Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?

The advantages of IANNs over traditional circuits include:

(i) Continuous processing which allows them to constantly measure changes in concentration gradients of cellular inputs rather than just their absolute presence or absence.

(ii) Relatively easier to scale up. That is, new inputs can be programmed by integrating additional weighted connections to existing nodes without completely rewiring the circuit.

(iii) Better adapted to non-linear classification. Because IANNs process inputs continuously rather than with Boolean (on/off) logic, they can respond better to complex cell-state classification tasks (e.g. distinguishing highly specific cell types).

Britto Bisso F, Aguilar R, Shree D, Zhu Y, Espinoza M, Diaz B, Cuba Samaniego C. Pattern recognition in living cells through the lens of machine learning. Open Biol. 2025 Jul 16;15(7):240377. doi: 10.1098/rsob.240377

Moorman A. Machine learning inspired synthetic biology: neuromorphic computing in mammalian cells [thesis]. Cambridge (MA): Massachusetts Institute of Technology; 2020. Available from: https://dspace.mit.edu/handle/1721.1/129864

2. Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.

While researching papers related to the application of IANNs, I came across some interesting work exploring the use of bacteria as biosensors in soil or other agricultural media. For example, a paper by Del Valle and colleagues looked to engineer modular genetic circuits that allow microbes to process complex, multi-variable environmental signals from the soil matrix and dynamically convert them into measurable cellular outputs [1].

A potential idea could be to use engineered modular circuits to clean up arsenic in soil. The inputs would be:

X1 : Concentration of arsenic, measured by proteins such as ArsR, a naturally occurring arsenic-responsive transcription factor often borrowed from E. coli or Chromobacterium violaceum [2].

X2 : Soil pH, measured by pH-responsive promoters. As demonstrated by Bañares et al. [3], genetic sensors can be used to dynamically regulate cellular outputs based on changing pH levels. Here, we use pH sensors to create a “bandpass filter” for the circuit.

Process:

The IANN will serve as a weighted classifier that computes whether arsenic is high AND soil pH is within a safe zone.
OUTPUT: If conditions are met, the network activates the ArsR protein.
If soil pH rises above the threshold (too high), the IANN turns OFF.
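The decision rule above can be sketched as a single perceptron. This is only an illustration: the weights, the bias, and the 0 to 1 scaling of both inputs are hypothetical values chosen for the example, not measured parameters of any ArsR or pH-responsive part, and only the upper pH threshold from the process list is modelled.

```python
# Minimal single-perceptron sketch of the arsenic/pH classifier.
# All weights, the bias, and the input scaling are hypothetical.

def perceptron(x1_arsenic: float, x2_ph: float,
               w1: float = 1.0, w2: float = -1.0, bias: float = 0.0) -> bool:
    """Return True (output ON) when the weighted sum of the inputs is positive.

    x1_arsenic: arsenic signal scaled to 0..1 (1 = high arsenic)
    x2_ph:      pH signal scaled to 0..1 (1 = pH above the safe threshold)
    The positive weight on arsenic and the negative weight on pH mean that
    a high pH suppresses the output.
    """
    return w1 * x1_arsenic + w2 * x2_ph + bias > 0


for arsenic in (0.0, 1.0):
    for ph in (0.0, 1.0):
        state = "ON" if perceptron(arsenic, ph) else "OFF"
        print(f"arsenic={arsenic:.0f} pH_too_high={ph:.0f} -> {state}")
```

With these example weights the output is ON only for high arsenic with pH below the threshold. In a cell, the weights would correspond to tunable quantities such as promoter strength or part dosage.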

Del Valle, I., Fulk, E. M., Kalvapalle, P., Silberg, J. J., Masiello, C. A., & Stadler, L. B. (2021). Translating New Synthetic Biology Advances for Biosensing Into the Earth and Environmental Sciences. Frontiers in Microbiology, 11. https://doi.org/10.3389/fmicb.2020.618373
Berset Y, Merulla D, Joublin A, Hatzimanikatis V, van der Meer JR. Mechanistic modeling of genetic circuits for ArsR arsenic regulation. ACS Synth Biol. 2017;6(5):862–874. doi:10.1021/acssynbio.6b00364
Bañares AB, Valdehuesa KNG, Ramos KRM, Nisola GM, Lee WK, Chung WJ. A pH-responsive genetic sensor for the dynamic regulation of D-xylonic acid accumulation in Escherichia coli. Applied Microbiology and Biotechnology. 2020 Mar;104(5):2097-2108. doi: 10.1007/s00253-019-10297-0.

3. Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation.

Draw a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.

Assignment Part 2: Fungal Materials

1. What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?

Some examples include:

(i) Mycelium bio-composites, which include fungi-derived leather substitutes. Their advantages are that they avoid the killing of animals and avoid microplastic pollution in the long term.

(ii) In architecture and construction, there are mycelium panels and acoustic tiles. Companies that utilise mycelium include Biohm (https://www.biohm.co.uk/mycelium)

(iii) Protective packaging. MycoComposite is used by companies as a substitute for Styrofoam.

Bentangan M, Greetham D, Ross R, Kaplan-Bie L. Recent technological innovations in mycelium materials as leather substitutes: a patent review. Front Bioeng Biotechnol. 2023;11:1204861. https://doi.org/10.3389/fbioe.2023.1204861

Advantages

Animal free production
Quick turnaround given mushrooms have quick growth
Minimise agricultural waste
Low density and eco-friendly for building materials

Disadvantages

Easily biodegradable
Production scalability is low compared to traditional counterparts
Sensitivity to moisture may reduce applicability

2. What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?

I would engineer mycelium strains to overexpress stress-response genes, conferring drought and heat resistance on mycelium-based materials. This would be helpful beyond controlled laboratory environments, making fungal material manufacturing feasible in hotter, drier climates such as those found across.

The advantages of doing this in fungi are that, as eukaryotes, they possess the post-translational machinery needed to produce and properly fold complex structural proteins, which bacteria are unable to do. Downstream costs are also minimal because nothing would have to be extracted: the mycelium itself would be used.

Assignment Part 3: First DNA Twist Order

Review Part 3: DNA Design Challenge of the week 2 homework. Design at least 1 insert sequence and place it into the Benchling/Kernel/Other folder you shared in the Google Form above. Document the backbone vector it will be synthesized in on your website.

Going back, I saw that after codon optimisation in week 2, the sequence did not start with “ATG”. I added it, along with “CC” immediately upstream at the 5’ end (with help of claude.ai).

CCATGAGCCCGTTCAACAACCCGCTGCTGCGCCCGTTTCTGATTCTGTATGAACATTAAAAACATGATCCGGGCCGTGGCGCAGGTCGCGGCGGCGCGCCGCAGGAAGATCGTGGCGCACCGGGCTTACAGGCCGTGCTGGTTCCGCAGCCGCTGCTGCTGCCGGATCGCGGCCGTCGTCACCATGCCCTGCTGCCGGCGGCCCTGTGGTCGGATCGTCCGCAGCGTGAAGAATTTCCGCGCGATCTGAGCCTGATTAGCCCGCTGGCGCAGGCCGTGCGTAGCAGCAGCCGCACCCCGTCAGATAAACCGGTGGCGCACGTGGTGGCAAATCCGCAGGCCGAAGGTCAGCTGCAGTGGCTGAATCGTCGCGCGAATGCCCTGTTAGCCAATGGTGTGGAACTGCGCGATAATCAGCTGGTGGTGCCGTCAGAAGGTCTGTACCTGATCTATTCGCAGGTGCTGTTTAAAGGCCAGGGCTGTCCGAGCACCCATGTGCTGCTGACCCACACCATTAGCCGCATTGCGGTGAGCTACCAGACCAAAGTGAACCTGCTTTCTGCGATTAAAAGCCCGTGCCAGCGTGAAACCCCGGAAGGCGCGGAAGCGAAACCGTGGTACGAACCGATTTATCTGGGCGGCGTGTTCCAGCTGGAAAAAGGCGATCGTCTGAGCGCGGAAATTAATCGCCCGGATTATCTGGATTTTGCGGAAAGCGGTCAGGTGTATTTCGGCATTATTGCCTTGTAACTCGAG

Having problems with inserting backbone with digest and ligate:

Initial restriction enzyme setup caused incompatibility because the TNF-α insert did not generate matching sticky ends (NdeI site absent or not properly cut), leading to “left sticky end mismatch” errors.

In Benchling Digest & Ligate, manual sequence selection was invalid; only enzyme-generated digest fragments could be used for backbone and insert assignment, causing assembly to remain disabled.

In Gibson assembly, incorrect fragment selection (partial backbone instead of full pET-28a(+) plasmid) led to unset components and failed assembly preview errors. Therefore, should I use full plasmid??

Would like some help on this

Week 9 — Cell-Free Systems

General homework questions

1. Explain the main advantages of cell-free protein synthesis over traditional in vivo methods, specifically in terms of flexibility and control over experimental variables. Name at least two cases where cell-free expression is more beneficial than cell production.

Advantages of cell-free protein synthesis (CFPS) over traditional in-vivo methods:

(i) Greater flexibility and control: Because the system does not need to stay “alive” and there is no cell wall, the reaction can be manipulated directly in real time, e.g. by adding chaperones, cofactors etc. [1].

(ii) Rapid development of prototypes: Where in-vivo methods require cloning DNA into plasmids and transforming host cells, the CFPS allows us to essentially ‘drag and drop’ DNA with raw PCR products and observe protein expression in short periods of time (e.g. hours) [2]

Cases where CFPS provides benefits over in-vivo methods:

(i) Expression of toxic or dangerous products such as antimicrobial peptides, potent neurotoxins, or complex membrane proteins. In vivo, the host cell would usually ‘die’ before reaching a large protein yield; as the CFPS is technically dead, it can synthesize toxic therapeutics and viral vectors that would be impossible to harvest from living cultures [2]

(ii) The open environment lets you easily swap natural amino acids for synthetic ones, enabling efficient, site-specific incorporation of non-standard amino acids (nsAAs) without competing with host metabolism [2]

Khambhati K, Bhattacharjee G, Gohil N, Braddick D, Kulkarni V, Singh V. Exploring the potential of cell-free protein synthesis for extending the abilities of biological systems. Front Bioeng Biotechnol. 2019;7:248.
Silverman AD, Kelley-Loughnane N, Jewett MC. Cell-free gene expression: an expanded repertoire of applications. Nat Rev Genet. 2020;21(3):151-70.

2. Describe the main components of a cell-free expression system and explain the role of each component.

(i) Cell extract (machinery): Derived from lysed cells (like E. coli), this extract provides the core transcriptional and translational machinery, including ribosomes and RNA polymerase, required to build the protein

(ii) Genetic template (blueprint): The DNA plasmid or RNA template that contains the specific gene sequence of the target protein we want to express

(iii) Nucleotides and amino acids (building blocks): Nucleotides—Adenosine triphosphate (ATP), Guanosine triphosphate (GTP), Cytidine triphosphate (CTP), and Uridine triphosphate (UTP)—are supplied for ribonucleic acid (RNA) synthesis (transcription), while transfer RNAs (tRNAs) pair with messenger RNA (mRNA) to deliver the amino acids necessary for protein synthesis (translation)

(iv) Energy systems: immediate energy sources like adenosine triphosphate (ATP) are paired with intermediate metabolites like 3-phosphoglycerate (3-PGA) or phosphoenolpyruvate (PEP) to continuously regenerate energy and maintain reaction stability.

(v) Buffers & cofactors (Environmental conditions):

HEPES: 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (buffer to maintain a stable pH for optimal enzyme activity)

Mg: Magnesium (cofactor for transcription and translation enzymes)

DTT: Dithiothreitol (reducing agent that maintains a non-oxidizing environment to protect protein residues)

Sodium oxalate (Na₂C₂O₄; no abbreviation): prevents magnesium precipitation, stabilizing the ionic balance

3. Why is energy provision regeneration critical in cell-free systems? Describe a method you could use to ensure continuous ATP supply in your cell-free experiment.

Transcription and translation consume ATP/GTP rapidly, and the by-product inorganic phosphate chelates Mg²⁺, stalling ribosomes within roughly 1 hour without replenishment.

For continuous ATP supply, a regeneration system such as phosphoenolpyruvate (PEP) + pyruvate kinase, or creatine phosphate + creatine kinase, can be added to continuously re-phosphorylate ADP to ATP, sustaining synthesis for several hours.

Filippo Caschera, Vincent Noireaux. Synthesis of 2.3 mg/ml of protein with an all Escherichia coli cell-free transcription-translation system. Biochimie. 2014 Apr;99:162-168. doi: 10.1016/j.biochi.2013.11.025. Epub 2013 Dec 8. PMID: 24326247

4. Compare prokaryotic versus eukaryotic cell-free expression systems. Choose a protein to produce in each system and explain why.

Prokaryotic systems such as E. coli S30 give high yields, are low cost, and have fast turnaround but lack post-translational modifications (PTMs); eukaryotic systems (wheat germ, CHO, HeLa) yield less but support glycosylation, disulfide bonds, and complex folding.

Therefore choosing:

Prokaryotic choice: GFP, as it folds autonomously and needs no PTMs, making it ideal for rapid high-yield prototyping.

Eukaryotic choice: erythropoietin (EPO), as it requires N-glycosylation and disulfide bonds for activity, only achievable in a mammalian lysate.

Anne Zemella, Lena Thoring, Christian Hoffmeister, Stefan Kubick. Cell-free protein synthesis: Pros and cons of prokaryotic and eukaryotic systems. ChemBioChem. 2015 Nov;16(17):2420-2431. doi: 10.1002/cbic.201500340. Epub 2015 Oct 19. PMID: 26478227; PMCID: PMC4676933.

5. How would you design a cell-free experiment to optimize the expression of a membrane protein? Discuss the challenges and how you would address them in your setup.

Challenges:

(i) Hydrophobic domains aggregate in aqueous mixes, the protein needs a lipid-like environment to fold

(ii) This could be toxic in-vivo

Design:

(i) Template: T7-driven, His-tagged construct of the membrane protein

(ii) Extract: E. coli S30 lysate

(iii) Supplement (test in parallel): mild detergents (Brij-35, DDM), nanodiscs (MSP + lipids), or liposomes

(iv) Optimise: Mg²⁺, K⁺, and temperature in a small factorial screen

Validation:

(i) SDS-PAGE + anti-His Western blot to confirm expression

(ii) Ultracentrifugation to separate soluble vs membrane-inserted fractions

(iii) Functional or ligand-binding assay to confirm native folding

Daniel Schwarz, Friederike Junge, Florian Durst, Nadine Frölich, Birgit Schneider, Sina Reckel, Solmaz Sobhanifar, Volker Dötsch, Frank Bernhard. Preparative scale expression of membrane proteins in Escherichia coli-based continuous exchange cell-free systems. Nat Protoc. 2007;2(11):2945-2957.

6. Imagine you observe a low yield of your target protein in a cell-free system. Describe three possible reasons for this and suggest a troubleshooting strategy for each.

(i) Energy depletion: add a regeneration system (PEP/pyruvate kinase or creatine phosphate/creatine kinase) or use a continuous-exchange (CECF) reactor.

(ii) Misfolding or proteolysis: lower temperature to 25 °C, add chaperones (GroEL/ES, DnaK) and protease/RNase inhibitors

(iii) Inefficient setup: re-purify DNA, check integrity on gel, ensure a T7 promoter and strong RBS, titrate 5–20 ng/µL

Adam D. Silverman, Ashty S. Karim, Michael C. Jewett. Cell-free gene expression: an expanded repertoire of applications. Nat Rev Genet. 2020 Mar;21(3):151-170. doi: 10.1038/s41576-019-0186-3. Epub 2019 Nov 28. PMID: 31780816.

Homework question from Kate Adamala

Function

A synthetic minimal cell that expands gut-brain axis signalling.

Input: Tumor Necrosis Factor-alpha (TNF-α), elevated in intestinal inflammation.

Output of SMC: 5-hydroxytryptophan (5-HTP), a serotonin precursor. Output of whole system: increased serotonin production in enterochromaffin cells, improving mood-relevant signalling.

Could this be cell-free transcription/translation (Tx/Tl) without encapsulation?

No, encapsulation is required to spatially contain the enzymatic conversion of tryptophan to 5-HTP, preventing uncontrolled release and ensuring TNF-α-triggered production only.

Could a genetically modified natural cell do this?

Yes, but SMCs offer safer, non-replicating, controllable delivery without risk of horizontal gene transfer or colonisation.

Desired outcome: In the presence of intestinal inflammation, SMCs locally produce and release 5-HTP, dampening the inflammation-serotonin deficit link in the gut-brain axis.

Components

Membrane:

1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC) + cholesterol

Encapsulated:

Bacterial cell-free Tx/Tl; tryptophan hydroxylase 1 gene (TPH1) under TNF-α-responsive promoter; tryptophan (substrate)

Tx/Tl system:

Bacterial (TNF-α responsive elements achievable with engineered promoters)

Communication:

TNF-α diffuses into SMC; 5-HTP exits via alpha-hemolysin (aHL) pore expressed upon TNF-α sensing

Experimental details

Lipids: POPC, cholesterol

Genes: TPH1 (tryptophan hydroxylase 1); aHL (alpha-hemolysin membrane pore)

Small molecules: tryptophan (encapsulated substrate)

Measurement:

Enzyme-linked immunosorbent assay (ELISA) or high-performance liquid chromatography (HPLC) for 5-HTP output; serotonin levels in co-cultured enterochromaffin cells

Paul Strandwitz. Neurotransmitter modulation by the gut microbiota. Brain Res. 2018;1693(Pt B):128-133. doi:10.1016/j.brainres.2018.03.015.
Júlia Leão Batista Simões, Geórgia de Carvalho Braga, Charles Elias Assmann, Margarete Dulce Bagatini. Targeting the gut-immune-brain axis: pharmacological insights from depression in inflammatory bowel disease. Front Pharmacol. 2026 Apr 1;17:1793292. doi:10.3389/fphar.2026.1793292. PMID: 41993582; PMCID: PMC13079007.

Homework question from Peter Nguyen

Freeze-dried cell-free systems can be incorporated into all kinds of materials as biological sensors or as inducible enzymes to modify the material itself or the surrounding environment. Choose one application field — Architecture, Textiles/Fashion, or Robotics — and propose an application using cell-free systems that are functionally integrated into the material. Answer each of these key questions for your proposal pitch:

• Write a one-sentence summary pitch sentence describing your concept.

• How will the idea work, in more detail? Write 3-4 sentences or more.

• What societal challenge or market need will this address?

• How do you envision addressing the limitation of cell-free reactions (e.g., activation with water, stability, one-time use)?

Walls or tiles in homes in arsenic-endemic regions are embedded with cell-free biosensors that visibly change colour when water containing arsenic flows over or is applied to them.

• Freeze-dried cell-free systems containing ArsR biosensor embedded into a porous tile or panel surface coating

• Household (HH) member applies/splashes water sample onto the tile

• Water rehydrates the cell-free system

• If arsenic present –> colour change visible to the naked eye | can be a “testing wall”

This targets arsenic exposure, one contributor to chronic kidney disease. Prevalence is higher among communities dependent on communal wells. Additionally, no behaviour change would be needed, as testing would be part of regular chores/tasks.

Addressing cell-free system limitations:

Activation with water: Naturally solved, because applying a water sample is the intended use, making rehydration a feature rather than a limitation.

Stability: Freeze-drying into a protective hydrogel matrix embedded within the ceramic tile pores confers long shelf-life; tiles can be stored and installed in hot climates without refrigeration, as freeze-dried cell-free systems have demonstrated stability at ambient temperatures for extended periods.

Homework question from Ally Huang

Freeze-dried cell-free reactions have great potential in space, where resources are constrained. As described in my talk, the Genes in Space competition challenges students to consider how biotechnology, including cell-free reactions, can be used to solve biological problems encountered in space. While the competition is limited to only high school students, your assignment will be to develop your own mock Genes in Space proposal to practice thinking about biotech applications in space! For this particular assignment, your proposal is required to incorporate the BioBits® cell-free protein expression system, but you may also use the other tools in the Genes in Space toolkit (the miniPCR® thermal cycler and the P51 Molecular Fluorescence Viewer). For more inspiration, check out https://www.genesinspace.org/.

(1) Provide background information that describes the space biology question or challenge you propose to address. Explain why this topic is significant for humanity, relevant for space exploration, and scientifically interesting. (Maximum 100 words)

Future missions to Mars and icy moons such as Europa will be unable to return contaminated samples to Earth for analysis, requiring a rapid, equipment-minimal, on-site biological screening tool for astronaut safety and planetary protection. Freeze-dried cell-free systems offer a uniquely stable, rehydration-activated solution deployable without refrigeration or living cells across multi-year deep space missions.

(2) Name the molecular or genetic target that you propose to study. Examples of molecular targets include individual genes and proteins, DNA and RNA sequences, or broader -omics approaches. (Maximum 30 words)

Pathogen-specific messenger RNA (mRNA) sequences from Pseudomonas aeruginosa and Salmonella, detected via toehold switch riboregulators.

(3) Describe how your molecular or genetic target relates to the space biology question or challenge your proposal addresses. (Maximum 100 words)

Toehold switches are synthetic riboregulators that only trigger translation of a fluorescent reporter when a specific target RNA is present. Embedding these into the BioBits cell-free system creates a programmable, rehydration-activated biosensor that can be reprogrammed for different pathogens or extraterrestrial biosignatures.

(4) Clearly state your hypothesis or research goal and explain the reasoning behind it. (Maximum 150 words)

Freeze-dried BioBits cell-free systems incorporating pathogen-specific toehold switches can be rehydrated with extraterrestrial liquid samples and produce a detectable fluorescent signal within 2–3 hours, enabling rapid on-site pathogen screening without living cells or complex equipment, even under microgravity conditions.

(5) Outline your experimental plan - identify the sample(s) you will test in your experiment, including any necessary controls, the type of data or measurements that will be collected, etc. (Maximum 100 words)

(i) Rehydrate toehold switch BioBits reactions with pathogen RNA spiked into simulated Martian brine and Europa ocean analogue solutions

(ii) Pre-amplify trace nucleic acids using miniPCR where sample concentrations are too low for direct detection

(iii) Visualise and quantify fluorescent output using the P51 Molecular Fluorescence Viewer

(iv) Controls: sterile water (negative) and non-target RNA (specificity)

(v) Repeat all experiments under simulated microgravity to confirm performance consistency

Selin Kocalar, Bess M Miller, Ally Huang, Emily Gleason, Kathryn Martin, Kevin Foley, D Scott Copeland, Michael C Jewett, Ezequiel Alvarez Saavedra, Sebastian Kraves. Validation of cell-free protein synthesis aboard the International Space Station. ACS Synth Biol. 2024 Mar 15;13(3):942-950. doi:10.1021/acssynbio.3c00733

Week 10 — Imaging and measurement

Homework: Waters Part I — Molecular Weight

We will analyze an eGFP standard on a Waters Xevo G3 QTof MS system to determine the molecular weight of intact eGFP and observe its charge state distribution in the native and denatured (unfolded) states. The conditions for LC-MS analysis of intact protein cause it to unfold and be detected in its denatured form (due to the solvents and pH used for analysis).

1. Based on the predicted amino acid sequence of eGFP (see below) and any known modifications, what is the calculated molecular weight? You can use an online calculator like the one at https://web.expasy.org/compute_pi/

eGFP Sequence: MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE HHHHHH Note: This contains a His-purification tag (HHHHHH) and a linker (the LE before it).

After inputting the eGFP sequence into the online calculator I get:

Theoretical pI/Mw: 5.90 / 28006.60

2. Calculate the molecular weight of the eGFP using the adjacent charge state approach described in the recitation. Select two charge states from the intact LC-MS data (Figure 1) and:

I. Determine z for each adjacent pair of peaks using:

From figure 1 I picked:

m/zn: 933.7349

m/zn+1: 903.7148

Plugging in values:

z = 903.7148/(933.7349 - 903.7148)

z = 30.1037

II. Determine the MW of the protein using the relationship between m/z_n , MW, and z.

Using the deconvolution derivation for n:

Top: m/zn+1 − 1 = 903.7148 − 1 = 902.7148

Bottom: m/zn − m/zn+1 = 933.7349 − 903.7148 = 30.0201

Therefore,

n = Top/Bottom ≈ 30.07035

Therefore,

MW = (n × m/zn) − n

= (30.0703462 × 933.7349) − 30.07035

MW = 28047.66 Da

III. Calculate the accuracy of the measurement using the deconvoluted MW from 2.2 and the predicted weight of the protein from 2.1 using:

Accuracy = |28047.66 - 28006.60|Da/(28006.60)Da

= 0.001466131 (0.15%)
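The adjacent charge state arithmetic above can be reproduced with a short script. This is a sketch under stated assumptions: the function names are made up for illustration, and the proton is approximated as 1 Da to mirror the calculation above. Because a charge state is a whole number, the sketch also reports the mass obtained after rounding n to the nearest integer; that extra step is an assumption added here and is not part of the calculation above.

```python
def adjacent_charge_state(mz_n: float, mz_n1: float, proton: float = 1.0):
    """Estimate charge n and neutral mass from two adjacent charge-state peaks.

    mz_n   : peak carrying charge n (the higher m/z value)
    mz_n1  : peak carrying charge n + 1 (the lower m/z value)
    proton : mass added per charge; 1.0 Da is the approximation used here
    """
    n_fractional = (mz_n1 - proton) / (mz_n - mz_n1)
    mw_fractional = n_fractional * (mz_n - proton)   # n kept fractional
    n_integer = round(n_fractional)                  # assumed extra step
    mw_integer = n_integer * (mz_n - proton)
    return n_fractional, mw_fractional, n_integer, mw_integer


def relative_error(measured: float, predicted: float) -> float:
    """Accuracy as |measured - predicted| / predicted."""
    return abs(measured - predicted) / predicted


n_frac, mw_frac, n_int, mw_int = adjacent_charge_state(933.7349, 903.7148)
predicted_mw = 28006.60  # Expasy value from question 1

print(f"n (fractional) = {n_frac:.5f}, MW = {mw_frac:.2f} Da, "
      f"error = {relative_error(mw_frac, predicted_mw):.6f}")
print(f"n (rounded)    = {n_int}, MW = {mw_int:.2f} Da, "
      f"error = {relative_error(mw_int, predicted_mw):.6f}")
```

The two lines differ only in whether n is rounded before it is multiplied back into the m/z value, so comparing them shows how sensitive the deconvoluted mass is to that choice.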

3. Can you observe the charge state for the zoomed-in peak in the mass spectrum for the intact eGFP? If yes, what is it? If no, why not?

No, it is difficult to determine the charge state from the zoomed-in peak by itself. The isotopic peaks are not resolved, so the spacing needed to identify z cannot be measured.

Homework: Waters Part II — Secondary/Tertiary structure

Homework: Waters Part III — Peptide Mapping - primary structure

We will digest the eGFP protein standard into peptides using trypsin (an enzyme that selectively cleaves the peptide bond after Lysine (K) and Arginine (R) residues). The resulting peptides will be analyzed on the Waters BioAccord LC-MS to measure their molecular weights and fragmented to confirm the amino acid sequence within each peptide – generating a “peptide map”. This process is used to confirm the primary structure of the protein.

There are a variety of tools available online to calculate protein molecular weight and predict a list of peptides generated from a tryptic digest. We will be using tools within the online resource Expasy (the bioinformatics resource portal of the Swiss Institute of Bioinformatics (SIB)) to predict a list of tryptic peptides from eGFP.

How many Lysines (K) and Arginines (R) are in eGFP? Please circle or highlight them in the eGFP sequence given in Waters Part I question 1 above. (Note: adding the sequence to Benchling as an amino acid file and clicking biochemical properties tab will show you a count for each amino acid).

After adding to benchling, I got counts & frequencies for each amino acid:

Amino Acid	Code	Count	Percentage
Ala	A	8	3.2%
Arg	R	6	2.4%
Asn	N	13	5.3%
Asp	D	18	7.3%
Cys	C	2	0.8%
Gln	Q	8	3.2%
Glu	E	17	6.9%
Gly	G	22	8.9%
His	H	15	6.1%
Ile	I	12	4.9%
Leu	L	22	8.9%
Lys	K	20	8.1%
Met	M	6	2.4%
Phe	F	12	4.9%
Pro	P	10	4.0%
Ser	S	10	4.0%
Thr	T	16	6.5%
Trp	W	1	0.4%
Tyr	Y	11	4.5%
Val	V	18	7.3%
Pyl	O	0	0.0%
Sec	U	0	0.0%

Lysines (K) and Arginines (R) highlighted:

MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE HHHHHH

How many peptides will be generated from tryptic digestion of eGFP?

(i) Navigate to https://web.expasy.org/peptide_mass/

(ii) Copy/paste the sequence above into the input box in the PeptideMass tool to generate expected list of peptides.

(iii) Use Figure 4 below as a guide for the relevant parameters to predict peptides from eGFP.

(iv) Click “Perform the Cleavage” button in the PeptideMass tool and report the number of peptides generated when using trypsin to perform the digest.

Number of peptides generated: 19

Mass (Da)	Position	Peptide Sequence
4472.1752	170–210	HNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSK
2566.2931	217–239	DHMVLLEFVTAAGITLGMDELYK
2437.2608	5–27	GEELFTGVVPILVELDGDVNGHK
2378.2577	54–74	LPVPWPTLVTTLTYGVQCFSR
1973.9062	142–157	LEYNYNSHNVYIMADK
1503.6597	28–42	FSVSGEGEGDATYGK
1266.5783	87–97	SAMPEGYVQER
1083.4979	240–247	LEHHHHHH
1050.5214	115–123	FEGDTLVNR
982.4952	133–141	EDGNILGHK
821.3940	81–86	QHDFFK
790.3552	75–80	YPDHMK
769.3913	47–53	FICTTGK
711.2944	103–108	DDGNYK
655.3813	98–102	TIFFK
602.2780	211–215	DPNEK
579.3137	128–132	GIDFK
507.2925	164–167	VNFK
502.3235	124–127	IELK
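The K/R counts and the peptide list above can be cross-checked with a small in-silico digest. This is a sketch, not the PeptideMass implementation: it assumes the usual trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, unmodified cysteine, monoisotopic [M+H]+ masses, and a 500 Da lower cutoff. The cutoff is an assumption inferred from the smallest mass in the table, since the Figure 4 parameters are not reproduced here.

```python
import re

# Monoisotopic residue masses in Da (standard values, cysteine unmodified).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide
PROTON = 1.00728   # converts M to [M+H]+

# eGFP sequence as given in Part I, with the block spacing removed.
EGFP = "".join("""
MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL
VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV
NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD
HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE
HHHHHH
""".split())


def tryptic_digest(sequence):
    """Cleave after every K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def mh_mass(peptide):
    """Monoisotopic [M+H]+ mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


peptides = tryptic_digest(EGFP)
reported = [p for p in peptides if mh_mass(p) > 500.0]  # assumed cutoff

print("K:", EGFP.count("K"), "R:", EGFP.count("R"))
print("all fragments:", len(peptides), "| above 500 Da:", len(reported))
for p in sorted(reported, key=mh_mass, reverse=True):
    print(f"{mh_mass(p):10.4f}  {p}")
```

Short fragments such as MVSK or TR fall below the assumed cutoff, which is why the digest produces more fragments in total than the number of peptides listed in the table.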

3. Based on the LC-MS data for the Peptide Map data generated in lab (please use Figure 5a as a reference) how many chromatographic peaks do you see in the eGFP peptide map between 0.5 and 6 minutes? You may count all peaks that are >10% relative abundance

I counted only the tall peaks; tiny peaks (e.g. at 1.20) are likely impurities from non-specific binding to the His-tag column or peptides with a missed cleavage.

Given relative abundance and using the peak at 4.87 as the reference, there are about 20 peaks, including the peak at 5.43, which is borderline.

4. Assuming all the peaks are peptides, does the number of peaks match the number of peptides predicted from question 2 above? Are there more peaks in the chromatogram or fewer?

It is roughly the same: ~19–20 observed vs. 19 predicted, with perhaps slightly more peaks than predicted peptides. Any extra peaks could come from missed cleavages (trypsin doesn’t cut every K/R every time), non-specific cleavages, oxidation or other modifications producing multiple forms of the same peptide, or contaminants.

5. Identify the mass-to-charge (m/z) of the peptide shown in Figure 5b. What is the charge (z) of the most abundant charge state of the peptide (use the separation of the isotopes to determine the charge state). Calculate the mass of the singly charged form of the peptide ([M+H]+) based on its m/z and z.

Highest peak is 525.76712

Average spacing:

Peak Pair	Difference (m/z)
526.25918 − 525.76712	0.49206
526.76845 − 526.25918	0.50927
527.26098 − 526.76845	0.49253
Average	0.497953
Rounded Average (m/z)	0.5

Therefore, Z = 1/delta(m/z) = 1/0.5 = 2

With MW = (n × m/z) – n = (2 × 525.76712) – 2 = 1049.53424 Da

Therefore, approximating H = 1

[M+H]+ = 1049.53424 + 1 Da = 1050.53424 Da

6. Identify the peptide based on comparison to expected masses in the PeptideMass tool. What is mass accuracy of measurement? Please calculate the error in ppm. (Recall that Accuracy = (MWexperiment – MWtheory)/ MWtheory )

MW_Experiment = 1049.53424

MW_Theory = 1050.5215 – 1 = 1049.5215

Accuracy = {|1049.53424 – 1049.5215| Da / 1049.5215 Da} × 10^6 = 12.13886519 ~ 12.14 ppm
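The chain from isotope spacing to charge, neutral mass, and ppm error in questions 5 and 6 can be written out as a short script. It is a sketch with illustrative function names, and it keeps the approximation H = 1 Da used above so that the numbers line up with the hand calculation.

```python
def charge_from_isotopes(peaks):
    """Charge state from the mean spacing of consecutive isotope peaks (z = 1 / spacing)."""
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    return round(1 / mean_spacing), mean_spacing


def neutral_mass(mz: float, z: int, proton: float = 1.0) -> float:
    """MW = (z x m/z) - z x proton, with the proton approximated as 1 Da."""
    return z * (mz - proton)


def ppm_error(measured: float, theoretical: float) -> float:
    """Mass error in parts per million."""
    return abs(measured - theoretical) / theoretical * 1e6


# Isotope peaks read from Figure 5b (values from the table above).
isotope_peaks = [525.76712, 526.25918, 526.76845, 527.26098]

z, spacing = charge_from_isotopes(isotope_peaks)
mass = neutral_mass(isotope_peaks[0], z)
theory = 1050.5215 - 1.0  # predicted [M+H]+ of FEGDTLVNR minus H (approximated as 1)

print(f"mean spacing = {spacing:.6f}, z = {z}")
print(f"neutral mass = {mass:.5f} Da, [M+H]+ = {mass + 1.0:.5f} Da")
print(f"mass error   = {ppm_error(mass, theory):.2f} ppm")
```

Using the exact proton mass instead of 1 Da would shift both the measured and the theoretical values by the same amount per charge, so the ppm error changes only slightly.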

7. What is the percentage of the sequence that is confirmed by peptide mapping? (see Figure 6)

88%

Bonus Peptide Map Questions

8. Can you determine the peptide sequence for the peptide fragmentation spectrum shown in Figure 5c? (HINT: Use your results from Question 2 above to match the peptide molecular weight that is closest to that shown in Figure 5b. Copy and paste its sequence into this tool online to predict the fragmentation pattern based on its amino acid sequence: http://db.systemsbiology.net/proteomicsToolkit/FragIonServlet.html. What is the sequence of the eGFP peptide that best matches the fragmentation spectrum in Figure 5c?

After pasting FEGDTLVNR, I get:

Residue	Position	b-ion (m/z)	y-ion (m/z)
F	1	148.07574	1050.52149
E	2	277.11833	903.45308
G	3	334.13979	774.41049
D	4	449.16673	717.38902
T	5	550.21441	602.36208
L	6	663.29848	501.31440
V	7	762.36689	388.23034
N	8	876.40982	289.16192
R	9	1032.51093	175.11900

Ion Species	Monoisotopic Mass	Average Mass
(M)	1049.51422	1050.13629
(M+H)+	1050.52149	1051.14356
(M+2H)²+	525.76441	526.07544
(M+3H)³+	350.84538	351.05273
(M+4H)⁴+	263.38586	263.54138

The predicted y-ions for FEGDTLVNR (y3=388.23, y4=501.31, y5=602.36, y7=774.41, y8=903.45) match the peaks in figure 5c.

9. Does the peptide map data make sense, i.e. do the results indicate the protein is the eGFP standard? Why or why not? Consult with Figure 6, which depicts the % amino acid coverage of peptides positively identified using their calculated mass and fragmentation pattern.

The data confirms it’s eGFP because there is 88% sequence coverage; detected peptides map across most of the eGFP sequence. Also, there are fragmentation matches; b/y ions confirm the amino acid order.

Homework: Waters Part IV — Oligomers

We will determine Keyhole Limpet Hemocyanin (KLH)’s oligomeric states using charge detection mass spectrometry (CDMS). CDMS single-particle measurements of KLH allow us to make direct mass measurements to determine what oligomeric states (that is, how many protein subunits combine) are present in solution. Using the known masses of the polypeptide subunits (Table 1) for KLH, identify where the following oligomeric species are on the spectrum shown below from the CDMS (Figure 7):

• 7FU Decamer

• 8FU Didecamer

• 8FU 3-Decamer

• 8FU 4-Decamer

Polypeptide Subunit Name	Subunit Mass
7FU	340 kDa
8FU	400 kDa

Table 1: KLH Subunit Masses

Theoretical masses, converted from kDa to MDa:

Oligomer (FU Decamer)	Multiply	Theoretical Mass (MDa)
7FU Decamer	10 × 340 kDa = 3,400 kDa	3.4
8FU Didecamer	20 × 400 kDa = 8,000 kDa	8.0
8FU 3-Decamer	30 × 400 kDa = 12,000 kDa	12.0
8FU 4-Decamer	40 × 400 kDa = 16,000 kDa	16.0
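The table above is just subunit mass times copy number, divided by 1,000 to go from kDa to MDa. A minimal sketch of that arithmetic follows; the dictionary and function names are illustrative, and the copy numbers (10, 20, 30, 40) are the ones used in the table.

```python
# Subunit masses from Table 1, in kDa.
SUBUNIT_KDA = {"7FU": 340, "8FU": 400}

# (label, subunit type, number of subunits), as in the table above.
OLIGOMERS = [
    ("7FU Decamer", "7FU", 10),
    ("8FU Didecamer", "8FU", 20),
    ("8FU 3-Decamer", "8FU", 30),
    ("8FU 4-Decamer", "8FU", 40),
]


def oligomer_mass_mda(subunit, copies):
    """Theoretical oligomer mass in MDa (1 MDa = 1,000 kDa)."""
    return SUBUNIT_KDA[subunit] * copies / 1000


theoretical = {
    label: oligomer_mass_mda(subunit, copies)
    for label, subunit, copies in OLIGOMERS
}
for label, mass in theoretical.items():
    print(f"{label}: {mass:.1f} MDa")
```

Each species is then assigned to the CDMS peak whose measured mass lies closest to its theoretical value.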

Homework: Waters Part V — Did I make GFP?

Please fill out this table with the data you acquired from the lab work done at the Waters Immerse Lab in Cambridge, or else the data screenshots in this document if you were unable to have lab work done at Waters.

Two peaks (m/z):

(i) 2545.0388

(ii) 2799.4929

Average distance between peaks ≈ 0.092

z = 1/0.092 ≈ 10.87, so peak (ii) is assigned charge state 10 and peak (i), at the lower m/z, charge state 11.

Z	m/z	MW = (z × m/z) − z	Round
10	2799.4929	27984.9290	27984.93
11	2545.0388	27984.4268	27984.43
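The charge assignment and the masses in the table above can be cross-checked from the two m/z values alone, assuming they are adjacent charge states of the same protein (z for the higher m/z peak, z + 1 for the lower one). The sketch below solves z(m/z_high − p) = (z + 1)(m/z_low − p) for z. Function names are illustrative, and the proton mass p defaults to 1.007276 Da, whereas the table approximates it as 1.

```python
PROTON = 1.007276  # Da; the table above uses the approximation 1


def charge_of_higher_mz_peak(mz_low, mz_high, proton=PROTON):
    """Charge of the higher m/z peak, assuming the two peaks are
    adjacent charge states (z and z + 1) of the same molecule."""
    return (mz_low - proton) / (mz_high - mz_low)


def neutral_mass(mz, z, proton=PROTON):
    """Neutral molecular weight from m/z and charge: z * (m/z - proton)."""
    return z * (mz - proton)


peak_i, peak_ii = 2545.0388, 2799.4929

z_estimate = charge_of_higher_mz_peak(peak_i, peak_ii)
z_peak_ii = round(z_estimate)   # charge of peak (ii)
z_peak_i = z_peak_ii + 1        # charge of peak (i)

# With proton=1.0 this reproduces the table's MW = (z x m/z) - z.
for mz, z in ((peak_ii, z_peak_ii), (peak_i, z_peak_i)):
    print(z, round(neutral_mass(mz, z, proton=1.0), 4))
```

Using the more precise proton mass lowers each deconvoluted mass by roughly 0.07 to 0.08 Da, which is small compared with the 23 Da difference from the theoretical mass.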

MW_Experiment = 27,984.00

MW_Theory = 28,007.00

PPM = Accuracy = (|27,984.00 − 28,007.00| Da / 28,007.00 Da) × 10⁶ = 821.2232656 ≈ 821 ppm

Measurement	Theoretical	Observed / Measured on the Intact LC-MS	PPM Mass Error
Molecular weight (kDa)	28.007	27.984	821
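The ppm mass error is the absolute difference between observed and theoretical mass, divided by the theoretical mass and scaled by one million. A minimal sketch using the two values above (the function name `ppm_error` is illustrative):

```python
def ppm_error(observed, theoretical):
    """Mass error in parts per million, relative to the theoretical mass."""
    return abs(observed - theoretical) / theoretical * 1e6


mw_experiment = 27984.00  # Da, from the deconvoluted intact mass
mw_theory = 28007.00      # Da

ppm = ppm_error(mw_experiment, mw_theory)
print(f"{ppm:.1f} ppm")
```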

Week 11 — Bioproduction & Cloud Labs

Homework: Week 11 — Bioproduction & Cloud Labs

Part A: The 1,536 Pixel Artwork Canvas | Collective Artwork

1. Contribute at least one pixel to this global artwork experiment before the editing ends on Sunday 4/19 at 11:59 PM EST.

• A personalized URL was sent to the email address associated with your Discourse account, and you can discuss the artwork on the Discourse.

• If you did not have a chance to contribute, it’s okay, just make sure you become a TA this fall! 😉

Most of my contributions were in the top right quadrant, followed by the top left quadrant. Initially the top left was the LifeLabs logo, which has since become the “2026”. The top right is now the “MIT”. I also contributed to the bottom left, which now forms a bacteriophage. I forgot to take screenshots at the time.

I really like the collaborative aspect of this project. Also, it gives us the opportunity to see emergent art as it happens.

To improve on the project, I would maybe have 1 large contribution followed by lab-specific artwork; each lab across the world could make their own design. Obviously, this would be subject to time and financial constraints. But, if possible, it would be very cool!

Part B: Cell-Free Protein Synthesis | Cell-Free Reagents

1. Referencing the cell-free protein synthesis reaction composition (the middle box outlined in yellow on the image above, also listed below), provide a 1-2 sentence description of what each component’s role is in the cell-free reaction.

E. coli Lysate

(i) BL21 (DE3) Star Lysate (includes T7 RNA Polymerase):

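The BL21(DE3) strain produces high levels of protein from the T7 promoter and can be used with high or low copy number plasmids, making BL21(DE3) competent cells the preferred strains for protein expression in bacteria [1]. In the cell-free reaction, its lysate supplies the ribosomes, translation factors, and T7 RNA polymerase that carry out transcription and translation.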

Thermo Fisher Scientific. Competent cells for protein expression in E. coli BL21(DE3) and derivatives [Internet]. Waltham (MA): Thermo Fisher Scientific; [cited 2026 May 3]. Available from: https://www.thermofisher.com/uk/en/home/life-science/cloning/competent-cells-for-transformation/competent-cells applications/comp-cells-for-protein-expression.html

Salts/Buffer

(i) Potassium Glutamate

It is a salt that maintains ionic strength. It leads to transcriptional activation of sets of genes that allow the cell to achieve long-term adaptation to high osmolarity.

Gralla JD, Vargas DR. Potassium glutamate as a transcriptional inhibitor during bacterial osmoregulation. EMBO J. 2006;25(7):1515–1521. doi:10.1038/sj.emboj.7601041.

(ii) HEPES-KOH pH 7.5

HEPES-KOH is a buffering agent that maintains a stable physiological pH during the cell-free reaction. Maintaining pH near 7.5 is essential because transcription and translation enzymes are highly sensitive to pH fluctuations.

Good NE, Winget GD, Winter W, Connolly TN, Izawa S, Singh RMM. Hydrogen ion buffers for biological research. Biochemistry. 1966;5(2):467-477. doi:10.1021/bi00866a011.

(iii) Magnesium Glutamate

Magnesium glutamate supplies Mg²⁺ ions that stabilize ribosomes, RNA, and ATP-dependent enzymatic reactions during transcription and translation. Magnesium concentration strongly affects protein synthesis efficiency and overall fluorescence yield in cell-free systems.

Jewett MC, Swartz JR. Rapid expression and purification of 100 nmol quantities of active protein using cell-free protein synthesis. Biotechnol Prog. 2004;20(1):102-109. doi:10.1021/bp0342330.

(iv) Potassium phosphate monobasic

Potassium phosphate monobasic contributes to phosphate buffering and helps maintain intracellular-like ionic conditions in the reaction. It also supports ATP regeneration and metabolic stability during extended incubations.

Kim DM, Swartz JR. Regeneration of adenosine triphosphate from glycolytic intermediates for cell-free protein synthesis. Biotechnol Bioeng. 2001;74(4):309-316. doi:10.1002/bit.1110.

(v) Potassium phosphate dibasic

Potassium phosphate dibasic works together with monobasic phosphate to maintain buffering capacity and phosphate balance in the cell-free system. This helps stabilize enzymatic activity and sustain long-term transcription and translation reactions.

Kim DM, Swartz JR. Regeneration of adenosine triphosphate from glycolytic intermediates for cell-free protein synthesis. Biotechnol Bioeng. 2001;74(4):309-316. doi:10.1002/bit.1110.

Energy / Nucleotide System

(i) Ribose

D-ribose is a naturally occurring monosaccharide within the pentose pathway that assists with ATP (Adenosine Triphosphate) production. In cell-free systems, it helps sustain nucleotide regeneration and prolonged protein synthesis.

Mahoney DE, Hiebert JB, Thimmesch A, Pierce JT, Vacek JL, Clancy RL, et al. Understanding D-ribose and mitochondrial function. Adv Biosci Clin Med. 2018;6(1):1-5. doi:10.7575/aiac.abcmed.v.6n.1p.1.

(ii) Glucose

Glucose serves as a major energy source that is metabolized to generate ATP through glycolysis and related metabolic pathways. In cell-free protein synthesis systems, glucose supports ATP regeneration, helping sustain transcription and translation during long incubations.

Hantzidiamantis PJ, Awosika AO, Lappin SL. Physiology, glucose. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 [cited 2026 May 24]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK545201/

(iii) AMP

Adenosine monophosphate is a nucleotide involved in cellular energy metabolism and nucleotide biosynthesis. In cell-free protein synthesis systems, AMP contributes to ATP regeneration pathways and helps sustain transcriptional and translational activity during extended incubations.

Hardie DG. AMP-activated protein kinase: maintaining energy homeostasis at the cellular and whole-body levels. Annu Rev Nutr. 2014;34:31-55.

(iv) CMP

Cytidine monophosphate is a pyrimidine nucleotide involved in RNA synthesis and nucleotide metabolism. In cell-free protein synthesis systems, CMP helps maintain the nucleotide pools required for sustained transcription during extended reactions.

BOC Sciences. Comprehensive discussion on pyrimidine nucleotides [Internet]. Shirley (NY): BOC Sciences; [cited 2026 May 24]. Available from: https://www.bocsci.com/resources/comprehensive-discussion-on-pyrimidine-nucleotides.html

(v) GMP

Guanosine monophosphate is a purine nucleotide that serves as a precursor for guanosine triphosphate (GTP) synthesis, which is essential for transcription and translation elongation. In cell-free systems, GMP supplementation can help sustain nucleotide availability and prolonged protein synthesis.

ScienceDirect. Guanosine monophosphate [Internet]. Amsterdam: Elsevier; [cited 2026 May 24]. Available from: https://www.sciencedirect.com/topics/neuroscience/guanosine-monophosphate

https://pmc.ncbi.nlm.nih.gov/articles/PMC9620470/

(vi) UMP

Uridine monophosphate is a pyrimidine nucleotide involved in RNA biosynthesis and cellular nucleotide metabolism. In cell-free synthesis systems, UMP can support RNA production by contributing to the regeneration of uridine nucleotide pools.

ScienceDirect. Uridine monophosphate [Internet]. Amsterdam: Elsevier; [cited 2026 May 24]. Available from: https://www.sciencedirect.com/topics/neuroscience/uridine-monophosphate

(vii) Guanine

Guanine is one of the four nitrogenous bases found in nucleic acids and is an essential component of RNA and DNA. In cell-free protein synthesis systems, guanine can be converted through nucleotide salvage pathways into GMP and GTP, supporting continued transcription and translation activity.

National Human Genome Research Institute. Guanine [Internet]. Bethesda (MD): National Human Genome Research Institute; [cited 2026 May 24]. Available from: https://www.genome.gov/genetics-glossary/guanine

Translation Mix (Amino Acids)

(i) 17 Amino Acid Mix

A combined stock of the standard proteinogenic amino acids excluding tyrosine and cysteine, which are added separately due to solubility and oxidation issues. In cell-free protein synthesis systems, this mix supplies the substrates charged onto tRNAs for ribosomal elongation, sustaining translation during extended reactions.

Caschera F, Noireaux V. Synthesis of 2.3 mg/ml of protein with an all Escherichia coli cell-free transcription–translation system. Biochimie. 2014;99:162-8. doi:10.1016/j.biochi.2013.11.025.

(ii) Tyrosine

Tyrosine is an aromatic, polar amino acid with notably low aqueous solubility. In cell-free systems, it is supplemented separately to maintain accurate concentrations without precipitation, and it supports consistent incorporation into nascent polypeptides during translation.

ScienceDirect. Tyrosine [Internet]. Amsterdam: Elsevier; [cited 2026 May 24]. Available from: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/tyrosine

(iii) Cysteine

Cysteine is a sulfur-containing amino acid whose thiol side chain is prone to oxidation and disulfide cross-linking. For cell-free protein synthesis systems, it is added separately to preserve the free thiol pool and support correct incorporation and folding, particularly for cysteine-rich or disulfide-bonded proteins.

Sigma-Aldrich. Cysteine [Internet]. Burlington (MA): Merck KGaA; [cited 2026 May 24]. Available from: https://www.sigmaaldrich.com/GB/en/technical-documents/technical-article/cell-culture-and-cell-culture-analysis/mammalian-cell-culture/cysteine

Additives

(i) Nicotinamide

Nicotinamide is the amide form of vitamin B3 (niacin) and a precursor in the biosynthesis of NAD⁺ and NADP⁺. In cell-free systems, it supports the maintenance of nicotinamide cofactor pools required for energy regeneration reactions that sustain ATP supply during prolonged transcription and translation.

ScienceDirect. Nicotinamide [Internet]. Amsterdam: Elsevier; [cited 2026 May 24]. Available from: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/nicotinamide

Backfill

(i) Nuclease Free Water

Nuclease-free water is highly purified, deionized, filtered, and autoclaved water certified to be free of endonuclease, exonuclease, and RNase activity. In cell-free systems, it is used to adjust reaction volumes and dilute components without introducing contaminants that could degrade DNA, mRNA, or compromise reaction efficiency.

Thermo Fisher Scientific. Nuclease-free water [Internet]. Waltham (MA): Thermo Fisher Scientific; [cited 2026 May 24]. Available from: https://www.thermofisher.com/order/catalog/product/AM9930

2. Describe the main differences between the 1-hour optimized PEP-NTP master mix and the 20-hour NMP-Ribose-Glucose master mix shown in the Google Slide above. (2-3 sentences)

The 1-hour PEP/NTP mix supplies energy and nucleotides quickly via their ready-to-use forms (ATP, GTP, CTP, UTP plus PEP-Mono and maltodextrin), providing immediate phosphorylation power for fast transcription and translation but exhausting quickly. In comparison, the 20-hour NMP-Ribose mix is slow releasing. It feeds in low-cost precursors (NMPs, guanine, ribose, glucose, phosphate buffer) and relies on the lysate’s own metabolism to regenerate NTPs and ATP gradually, sustaining protein synthesis over a much longer window.

3. Bonus question: How can transcription occur if GMP is not included but Guanine is?

Transcription needs GTP, and the lysate can make it from guanine. The lysate’s enzymes use ATP to convert ribose into the activated sugar PRPP, and the purine salvage pathway attaches guanine to it to form GMP. Kinases then phosphorylate GMP to GDP and on to GTP for transcription. In short, the mix supplies the raw ingredients (guanine + ribose) and lets the cell extract’s own enzymes do the assembly.

Smith AA, Wong EL, Donovan RC, Chapman BA, Harry R, Tirandazi P, et al. Using a GPT-5-driven autonomous lab to optimize the cost and titer of cell-free protein synthesis. bioRxiv [Preprint]. 2026. Available from: https://www.biorxiv.org/content/10.64898/2026.02.05.703998v1

Part C: Planning the Global Experiment | Cell-Free Master Mix Design

1. Given the 6 fluorescent proteins we used for our collaborative painting, identify and explain at least one biophysical or functional property of each protein that affects expression or readout in cell-free systems. (Hint: options include maturation time, acid sensitivity, folding, oxygen dependence, etc) (1-2 sentences each)

sfGFP

This is a variant with fast, robust folding (maturation half-time ~10 min in E. coli) that tolerates misfolding-prone fusion partners. As a result, it’s the default CFPS reporter, giving the earliest and most reliable readout.

mRFP1

It is described on FPbase as a “somewhat slowly-maturing monomer” (~1 h half-time) with a low pKa (~4.5). The slow chromophore maturation means fluorescent readout lags well behind actual protein production in CFPS, and a residual green immature intermediate (inherited from DsRed) can complicate spectral readout.

mKO2

It is a coral-derived monomeric Kusabira-Orange variant specifically engineered for rapid maturation. In CFPS reactions that drift acidic during prolonged ATP regeneration, signal can drop without strong buffering.

mTurquoise2

It has the highest quantum yield (~93%) of any monomeric fluorescent protein and high photostability, giving a strong, stable signal in CFPS. This is useful when expression levels are modest, as is common with non-optimised constructs.

mScarlet_I

This variant evolved from the bright but very slow-maturing mScarlet (~132 min) down to a maturation half-time of ~36 min, trading a small brightness loss for much earlier red signal. This can be important for short CFPS incubations.

Electra2

A blue fluorescent protein derived from Entacmaea quadricolor. It is reported to form aggregates in multiple organisms (C. elegans, zebrafish, mice, Dictyostelium). This aggregation could compromise solubility and skew fluorescence readout in CFPS.

2. Create a hypothesis for how adjusting one or more reagents in the cell-free mastermix could improve a specific biophysical or functional property you identified above, in order to maximize fluorescence over a 36-hour incubation. Clearly state the protein, the reagent(s), and the expected effect.

In the sfGFP cell-free system, increasing glucose, magnesium glutamate, and nucleotide precursors (AMP/CMP/ribose) is expected to extend ATP and nucleotide regeneration capacity, thereby sustaining transcription and translation over the full 36-hour incubation. This should increase total sfGFP accumulation and result in higher final fluorescence due to prolonged protein synthesis rather than early energy depletion.

For mRFP1, increasing magnesium glutamate to 10 mM is expected to improve ribosome stability and translation efficiency, while supplementation with GMP at 0.625 mM may help sustain GTP pools required for transcription and translation elongation. Increasing cysteine to 6 mM may also support proper protein folding and help maintain a favorable redox environment during the extended incubation, improving maturation and accumulation of functional fluorescent protein.

Together, these adjustments are expected to sustain translation capacity and improve fluorescent protein maturation over the 36-hour reaction, reducing losses from energy depletion and inefficient folding.

3. The second phase of this lab will be to define the precise reagent concentrations for your cell-free experiment. You will be assigned artwork wells with specific fluorescent proteins and receive an email with instructions this week (by April 24). You can begin composing master mix compositions here.

Well 1: Q2-B9

In the sfGFP cell-free system, increasing glucose, magnesium glutamate, and nucleotide precursors (AMP/CMP/ribose) is expected to extend ATP and NTP regeneration capacity, thereby sustaining transcription and translation over the full 36-hour incubation. This should increase total sfGFP accumulation and result in higher final fluorescence due to prolonged protein synthesis rather than early energy depletion.

Well 2: Q4-K18

To maximize mRFP1 (monomeric red fluorescent protein 1) fluorescence over 36 hours, I increased magnesium glutamate to 10 mM to stabilize translation and protein folding, added guanosine monophosphate (GMP) at 0.625 mM to sustain guanosine triphosphate (GTP) pools for translation elongation, and increased cysteine to 6 mM to prevent aggregation-promoting disulfide bonds.

Together, these adjustments sustain both translation output and chromophore stability over the extended reaction, preventing energy depletion and misfolding that would limit fluorescence accumulation at the 36-hour timepoint.

4. The final phase of this lab will be analyzing the fluorescence data we collect to determine whether we can draw any conclusions about favorable reagent compositions for our fluorescent proteins. This will be due a week after the data is returned (date TBD!). The reaction composition for each well will be as follows:

6 μL of Lysate
10 μL of 2X Optimized Master Mix from above
2 μL of assigned fluorescent protein DNA template
2 μL of your custom reagent supplements

Total: 20 μL reaction
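The volumes above fix how concentrated each input must be. The sketch below checks that the components sum to 20 μL, that the 2X master mix ends up at 1X, and how concentrated a reagent must be in the 2 μL supplement to raise its final concentration by a given amount. The names are illustrative, and the calculation assumes the full 2 μL supplement carries the reagent at a single stock concentration.

```python
# Per-well reaction composition from the list above, in microlitres.
REACTION_UL = {
    "lysate": 6.0,
    "2X master mix": 10.0,
    "DNA template": 2.0,
    "custom supplements": 2.0,
}

total_ul = sum(REACTION_UL.values())

# Dilution of the 2X master mix in the final reaction (C1 * V1 = C2 * V2).
master_mix_final_x = 2.0 * REACTION_UL["2X master mix"] / total_ul


def supplement_stock_mm(final_increase_mm, supplement_ul=2.0, reaction_ul=20.0):
    """Concentration (mM) needed in the supplement so that adding
    supplement_ul raises the final concentration by final_increase_mm."""
    return final_increase_mm * reaction_ul / supplement_ul


print(f"Total volume: {total_ul} uL, master mix at {master_mix_final_x}X")
print(f"Stock for +0.625 mM final: {supplement_stock_mm(0.625)} mM")
```

Because the supplement is one tenth of the reaction volume, every supplemented reagent has to be prepared at ten times the concentration increase wanted in the well.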

We never received data for this part. With permission of Node leads, skipping this!
