<Sameen Nasar> — HTGAA Spring 2026

About me

🧬 London | LSHTM 📈

INTERESTS: Biology, Genomics, Epidemiology, Public Health, Genetic Engineering, Political Economy, Science Fiction, Comic Books, Nutrition

Contact info

Email

Homework

Labs

Projects

Subsections of <Sameen Nasar> — HTGAA Spring 2026

Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

    Class Assignment — DUE BY START OF FEB 10 LECTURE (1) First, describe a biological engineering application or tool you want to develop and why. This could be inspired by an idea for your HTGAA class project and/or something for which you are already doing in your research, or something you are just curious about.

  • Week 2 HW: Read, write & edit

    Homework Week 2 Part 1: Benchling & In-silico Gel Art Import the Lambda DNA. Simulate Restriction Enzyme Digestion with the following Enzymes: EcoRI HindIII BamHI KpnI EcoRV SacI SalI Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks.

  • Week 3 HW: Lab automation

    Homework Week 3 Post-Lab Questions One of the great parts about having an automated robot is being able to precisely mix, deposit, and run reactions without much intervention, and design and deploy experiments remotely. For this week, we’d like for you to do the following:

  1. Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.
  1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons) Assuming the meat is a red meat like beef, there would be approximately 20-25g of protein per 100g of meat [1, 2].
  • Week 5 HW: Protein Design Part II

    Part A: SOD1 Binder Peptide Design (From Pranam) Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc. Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.

  • Week 6 HW: Genetic Circuits Part I: Assembly Technologies

    Assignment: DNA Assembly 1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose? The template DNA is the mUAV plasmid, used at 20 ng/µL, with 0.8 µL added to the reaction. The primers are “colour forward” and “colour reverse”. Given that the stock concentration is 5 µM, using 2.5 µL of each primer in a total reaction volume of 25 µL results in a final primer concentration of 0.5 µM.

  • Week 7 Genetic Circuits Part II

    Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

  1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions? The advantages of IANNs over traditional circuits include: (i) Continuous processing which allows them to constantly measure changes in concentration gradients of cellular inputs rather than just their absolute presence or absence. (ii) Relatively easier to scale up. That is, new inputs can be programmed by integrating additional weighted connections to existing nodes without completely rewiring the circuit.
  1. Explain the main advantages of cell-free protein synthesis over traditional in vivo methods, specifically in terms of flexibility and control over experimental variables. Name at least two cases where cell-free expression is more beneficial than cell production. Advantages of cell-free protein synthesis (CFPS) over traditional in-vivo methods: (i) Greater flexibility and control: Given that cells do not need to stay “alive” and the absence of a cell wall, it is possible to manipulate cells in real time; add chaperones, cofactors etc [1].
  • Week 10 — Imaging and measurement

    Homework: Waters Part I — Molecular Weight We will analyze an eGFP standard on a Waters Xevo G3 QTof MS system to determine the molecular weight of intact eGFP and observe its charge state distribution in the native and denatured (unfolded) states. The conditions for LC-MS analysis of intact protein cause it to unfold and be detected in its denatured form (due to the solvents and pH used for analysis).

  • Week 11 — Bioproduction & Cloud Labs

Subsections of Homework

Week 1 HW: Principles and Practices

Class Assignment — DUE BY START OF FEB 10 LECTURE

(1) First, describe a biological engineering application or tool you want to develop and why. This could be inspired by an idea for your HTGAA class project and/or something for which you are already doing in your research, or something you are just curious about.

By leveraging biological engineering tools, such as CRISPR systems, I would like to develop highly specific nucleic acid biosensors and synthetic circuits to detect M. tuberculosis and its resistance mutations with high precision and speed. The inspiration for this comes from my MSc project, where I studied the genomic epidemiology of multi-drug-resistant tuberculosis (MDR-TB) using whole genome sequencing (WGS) data. My work focused on downstream analyses (phylogenetics, transmission clustering, regression, and machine learning), with particular attention to population structure and epidemiological interpretation. However, during the project I found that genomic data on MDR-TB are geographically imbalanced, which limits the representativeness of global MDR-TB patterns and, ultimately, timely detection and treatment. This is especially true in high-burden countries. As a result, I would like to explore the application of biosensors and genetic circuitry as an additional layer of surveillance alongside traditional methods: a biosensor or genetic circuit engineered to detect specific MDR-TB resistance markers or lineage-specific sequences, potentially using luminescence as a real-time readout to provide rapid, high-throughput signals.

Brief on the biology and possible mechanism for the tool: 🛠️ 🧬

Unlike many other bacteria that can share drug‑resistance genes with each other through horizontal gene transfer, Mycobacterium tuberculosis mainly becomes drug resistant through mutations in its own DNA (single nucleotide polymorphisms (SNPs) and insertions/deletions (indels)) [1]. At the same time, the ability of M. tuberculosis to persist within human hosts exposes it to prolonged immune pressure, driving adaptive changes in virulence‑associated loci such as phoR, mymA and the mce1 operon that can influence how different lineages transmit or interact with particular human populations [2–4]. The proposed bio-engineering tool could therefore take the form of a biosensor: either a CRISPR-based device programmed to recognise TB resistance mutations, or an engineered genetic circuit that only produces a light or electrical signal when multiple resistance signatures are present. Such a device would convert the presence of specific mutations into a measurable output that can be rapidly read and fed into surveillance models.

(2) Next, describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future, like ensuring non-malfeasance (preventing harm). Break big goals down into two or more specific sub-goals. Below is one example framework (developed in the context of synthetic genomics) you can choose to use or adapt, or you can develop your own. The example was developed to consider policy goals of ensuring safety and security, alongside other goals, like promoting constructive uses, but you could propose other goals for example, those relating to equity or autonomy.

Governance Goal 1: Prevent harm or misuse

As genomic data can be geo-located and time-stamped, there are risks for community stigmatization and political duress. Therefore, to mitigate against these risks, the governance goal should implement frameworks that: (i) Require ethical review and oversight of bio-sensor data and its secondary uses (ii) Establish strict guidelines on the limits of how precise location data can be shared or publicized (iii) Establish clear accountability mechanisms for state and private actors

Governance Goal 2: Promote equity in data collection, analysis and development

To prevent further exacerbation of inequities in biological data collection and usage, the framework will implement mechanisms that ensure: (i) Control of locally generated data by the implementing country (ii) Inclusion of the implementing country as an equal partner in analysis and interpretation (iii) Prioritization of under-sampled regions to improve representativeness, combined with timely access to treatment and care.

(3) Next, describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”). Try to outline a mix of actions (e.g. a new requirement/rule, incentive, or technical strategy) pursued by different “actors” (e.g. academic researchers, companies, federal regulators, law enforcement, etc). Draw upon your existing knowledge and a little additional digging, and feel free to use analogies to other domains (e.g. 3D printing, drones, financial systems, etc.).

  1. Purpose: What is done now and what changes are you proposing?
  2. Design: What is needed to make it “work”? (including the actor(s) involved - who must opt-in, fund, approve, or implement, etc)
  3. Assumptions: What could you have wrong (incorrect assumptions, uncertainties)?
  4. Risks of Failure & “Success”: How might this fail, including any unintended consequences of the “success” of your proposed actions?

Governance Action 1: Regulation and creation of standards for early-stage bio-sensor development

Purpose: Early-stage bio-sensor development research is guided by bio/genetic engineering practice but carries safety and bio-security risks that are not yet specifically regulated. I am proposing specific standards and regulatory requirements for early-stage biosensor design, ensuring safety, transparency, and responsible innovation before deployment. This could take the form of new regulatory support or reference diagnostics.

Design: Actors may include public health agencies, national regulators in science, and diagnostic developers. Establish validation criteria, accuracy thresholds, metadata standards, and geolocation safeguards. In addition, embed standards into existing public health TB surveillance programmes.

Assumptions: This initiative assumes that regulators will be quick to evaluate bio-sensor technologies. Also assumes public health surveillance will be quick to agree and implement technology across the existing surveillance system.

Risk of failure: Bureaucracy may hinder technological innovation and deployment. Unintended consequences include a premature reliance on bio-sensor technology which could lead to false positive cases and mis-directed public health strategies.

Governance Action 2: Pre-sequencing rapid signal regulatory pathways

Purpose: Currently, bio-sensor outputs such as CRISPR signals are not integrated with genomic data in low- and middle-income countries. Therefore, I would like to propose the creation of formal pathways that enable rapid biosensor signals to feed into surveillance systems before whole genome sequencing (WGS), with defined quality, privacy, and data use standards.

Design: Actors include public health agencies, national regulators in science, and diagnostic developers. Actors may also include international bodies such as the WHO. There may be potential to expand the WHO’s ‘attributes and principles on genomic data-sharing platforms supporting surveillance of pathogens’ [5–7].

Assumptions: This assumes developers implement required standards and metadata. Also assumes public health agencies can incorporate new signal streams effectively.

Risk of failure: There may be disagreements about implementation into existing surveillance pathways. State agencies may lack the technical expertise to train workers to evaluate, interpret and act on rapid biosensor signals. This could lead to misinterpretation and/or delayed action.

Governance Action 3: Ethical data access and sharing standards (with local and community engagement requirements)

Purpose: Many genomic and bio-engineering projects lack consistent standards for privacy, consent, equity, and local engagement. A proposed change could be the mandatory implementation of ethical standards for data access combined with mandatory local/community engagement, ensuring transparency, and equitable benefit-sharing.

Design: Develop standardised model data agreements which specify permissible uses, benefit-sharing obligations, and consent mechanisms. Furthermore, advisory boards and steering committees can be established to ensure engagement, feedback, and regular assessment of processes.

Assumptions: This assumes that communities where the technology is planned to be implemented will agree to engage meaningfully. It also assumes that cross-country coordination on ethical standards will be possible.

Risk of failure: Strict data provisions may slow down implementation, collection and action. There may be failure to engage communities as they may view the initiative to engage them as superficial.

(4) Next, score (from 1-3 with, 1 as the best, or n/a) each of your governance actions against your rubric of policy goals. The following is one framework but feel free to make your own:

| Does the option: | Option 1: Regulation and creation of standards for early-stage bio-sensor development | Option 2: Pre-sequencing rapid signal regulatory pathways | Option 3: Ethical data access and sharing standards |
|---|---|---|---|
| Enhance Biosecurity | | | |
| • By preventing incidents | 1 | 2 | 2 |
| • By helping respond | 2 | 3 | 2 |
| Foster Lab Safety | | | |
| • By preventing incidents | 2 | 2 | 2 |
| • By helping respond | 3 | 3 | 2 |
| Protect the environment | | | |
| • By preventing incidents | 1 | 2 | 2 |
| • By helping respond | 2 | 1 | 1 |
| Other considerations | | | |
| • Minimizing costs and burdens to stakeholders | 1 | 1 | 1 |
| • Feasibility | 2 | 2 | 2 |
| • Not impede research | 2 | 2 | 2 |
| • Promote constructive applications | 1 | 1 | 1 |
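One simple way to compare the three options is to add up each column of the matrix (lower is better, since 1 is the best score). The sketch below does this with the scores copied from the matrix; weighting every criterion equally is an assumption made for illustration, not part of the assignment rubric.

```python
# Scores copied from the matrix above: (Option 1, Option 2, Option 3).
# Equal weighting of all criteria is an illustrative assumption.
SCORES = {
    "Biosecurity: preventing incidents": (1, 2, 2),
    "Biosecurity: helping respond": (2, 3, 2),
    "Lab safety: preventing incidents": (2, 2, 2),
    "Lab safety: helping respond": (3, 3, 2),
    "Environment: preventing incidents": (1, 2, 2),
    "Environment: helping respond": (2, 1, 1),
    "Minimizing costs and burdens": (1, 1, 1),
    "Feasibility": (2, 2, 2),
    "Not impede research": (2, 2, 2),
    "Promote constructive applications": (1, 1, 1),
}


def column_totals(scores):
    """Sum each option's scores across all criteria."""
    return tuple(sum(column) for column in zip(*scores.values()))


if __name__ == "__main__":
    for option, total in enumerate(column_totals(SCORES), start=1):
        print(f"Option {option}: total score {total}")
```

Under equal weighting, Options 1 and 3 tie with the lowest totals and Option 2 scores slightly worse.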

5. Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties. For this, you can choose one or more relevant audiences for your recommendation, which could range from the very local (e.g. to MIT leadership or Cambridge Mayoral Office) to the national (e.g. to President Biden or the head of a Federal Agency) to the international (e.g. to the United Nations Office of the Secretary-General, or the leadership of a multinational firm or industry consortia). These could also be one of the “actor” groups in your matrix.

Based on the inputs and ranking in the matrix above, I would prioritize the following:

(i) Regulation and creation of standards for early-stage bio-sensor development (ii) Ethical data access and sharing standards with local community engagement

Together, these two actions would address both the technical and social foundations required for responsible deployment of biosensors. Standards would ensure that biosensors are developed safely, setting incentive structures to develop lab safety protocols and enforce biosecurity. Local community engagement, training, and capacity building would help build trust, protect rights, and enable effective use of surveillance data across settings.

References

  1. Richard M. Jones, Kristin N. Adams, Hassan E. Eldesouky, and David R. Sherman “The evolving biology of mycobacterium tuberculosis drug resistance.” Frontiers in Cellular and Infection Microbiology 2022.

  2. Sebastien Gagneux “Ecology and evolution of mycobacterium tuberculosis.” Nature Reviews Microbiology 2018.

  3. Qingyun Liu, Jianhao Wei, Yawei Li, Mei Wang, Jun Su, et al. “Mycobacterium tuberculosis clinical isolates carry mutational signatures of host immune environments.” Science Advances 2020.

  4. Á. Chiner-Oms, L. Sánchez-Busó, J. Corander, S. Gagneux, S. R. Harris, et al. “Genomic determinants of speciation and spread of the mycobacterium tuberculosis complex.” Science Advances 2019.

  5. World Health Organization. Attributes and principles of genomic data-sharing platforms supporting surveillance of pathogens with epidemic and pandemic potential. World Health Organization; 2025.

  6. Carter L, Yu MA, Sacks J, Barnadas C, Pereyaslov D, Cognat S, et al. Global genomic surveillance strategy for pathogens with pandemic and epidemic potential 2022–2032. Bulletin of the World Health Organization. 2022 Apr 1;100(04):239–9A.

  7. Trump BD, Florin MV, Perkins E, et al. Biosecurity for Synthetic Biology and Emerging Biotechnologies: Critical Challenges for Governance. 2021 Sep 8. In: Trump BD, Florin MV, Perkins E, et al., editors. Emerging Threats of Synthetic Biology and Biotechnology: Addressing Security and Resilience Issues [Internet]. Dordrecht (DE): Springer; 2021. Chapter 1. Available from: https://www.ncbi.nlm.nih.gov/books/NBK584259/ doi: 10.1007/978-94-024-2086-9_

Assignment (Week 2 Lecture Prep)

Homework Questions from Professor Jacobson

Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome. How does biology deal with that discrepancy?

The error rate of polymerase is roughly 1 error per 10⁶ nucleotides, with reported error frequencies ranging from 1 error per 10⁴ to approximately 1 error per 10⁶ nucleotides [1]. The human genome has about 3 × 10⁹ base pairs (around 3 billion nucleotides), which is roughly 3000 times larger than the 10⁶ nucleotides copied per error, so uncorrected replication would introduce on the order of 3000 errors per genome copy. Biology deals with this through error correction: cells use polymerase proofreading and mismatch repair to reduce errors to just a few per genome per replication [2].
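The comparison above is a single division, sketched here for both ends of the quoted error range. The function name and the rounded genome size of 3 × 10⁹ bp are illustrative choices, not values from a specific tool.

```python
GENOME_BP = 3_000_000_000  # approximate human genome size used in the answer


def expected_errors(genome_bp, bases_per_error):
    """Expected uncorrected errors when copying genome_bp bases once."""
    return genome_bp / bases_per_error


if __name__ == "__main__":
    for bases_per_error in (10**4, 10**6):
        errors = expected_errors(GENOME_BP, bases_per_error)
        print(f"1 error per {bases_per_error} nt: about {errors:.0f} errors per copy")
```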

How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

Average human protein-coding sequence: 1036 bp. As 1 codon = 3 nucleotides:

∴ Total amino acids = 1036/3 ≈ 345

Each codon is 3 nucleotides long and codes for 1 amino acid, and each amino acid is encoded by about 3 synonymous codons on average (61 sense codons for 20 amino acids), so there are roughly 3³⁴⁵ (on the order of 10¹⁶⁴) different ways to code for an average human protein.

Although roughly 3³⁴⁵ DNA sequences code for the same protein, only some of them work in practice because of codon preferences and bias, repetitive or unstable sequences, and mRNA folding [3].
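The estimate above can be checked numerically, and refined for a specific protein by multiplying the actual number of synonymous codons for each residue in the standard genetic code. This is a minimal sketch; the function names are hypothetical and the average of 3 codons per amino acid is the same simplification used in the answer.

```python
import math

# Number of synonymous codons per amino acid in the standard genetic code.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def estimated_encodings(n_amino_acids, avg_codons=3):
    """Rough count assuming avg_codons synonymous codons per residue."""
    return avg_codons ** n_amino_acids


def exact_encodings(protein):
    """Exact count of DNA sequences encoding a given protein (no stop codon)."""
    return math.prod(DEGENERACY[aa] for aa in protein.upper())


if __name__ == "__main__":
    estimate = estimated_encodings(345)
    print(f"3^345 is about 10^{math.log10(estimate):.1f}")
    print("Encodings of the peptide MSTES:", exact_encodings("MSTES"))
```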

References

  1. Kunkel TA, Bebenek K. DNA replication fidelity. In: Brenner S, Miller JH, editors. DNA Replication and Human Disease. Bethesda (MD): National Center for Biotechnology Information (US); 2002. Available from: https://www.ncbi.nlm.nih.gov/books/NBK9940/

  2. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. Molecular Biology of the Cell. 4th ed. New York: Garland Science; 2002. ISBN: 0-8153-3218-1, 0-8153-4072-9.

  3. Lin J, Chen Y, Zhang Y, Lin H, Ouyang Z, et al. Deciphering the role of RNA structure in translation efficiency. BMC Bioinformatics. 2022;23:559

Homework Questions from Dr. LeProust

What’s the most commonly used method for oligo synthesis currently?

Oligonucleotide synthesis is the chemical process of making short fragments of DNA or RNA with a defined sequence, typically by step‑by‑step addition of nucleotide building blocks on a solid support [1]. In enzyme-free synthesis, nucleotide units are added sequentially to a growing chain using solid- or liquid-phase chemistry [2]. The most common method is solid-phase phosphoramidite synthesis. Because it is automated and yields high-quality short sequences, it is widely used by biotech companies around the world [3–4].

Why is it difficult to make oligos longer than 200nt via direct synthesis?

As length is increased, chemical synthesis becomes less efficient. As a result, there is a loss in product yield, greater rate of error accumulation (higher substitution or deletion rates), and an increased difficulty in purifying the final product due to the introduction of truncated and mis-incorporated oligos [5].

Why can’t you make a 2000bp gene via direct oligo synthesis?

Because oligo synthesis adds one nucleotide at a time, errors (substitutions/deletions) accumulate as length increases, and the truncated or defective sequences become increasingly difficult to purify away [6]. Direct synthesis of a 2000bp gene is therefore not practical, despite advances in surface-based methods and capture-based purification [7].
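The loss of yield with length can be illustrated with a simple model: if each coupling step succeeds with a fixed probability, the fraction of full-length product is that probability raised to the number of couplings. The 99% per-step efficiency below is an assumed, illustrative value rather than a figure from the cited references, and the model ignores depurination and other side reactions.

```python
def full_length_fraction(length_nt, coupling_efficiency=0.99):
    """Fraction of chains that reach full length after length_nt - 1 couplings.

    coupling_efficiency is an assumed per-step success rate (illustrative).
    """
    return coupling_efficiency ** (length_nt - 1)


if __name__ == "__main__":
    for length in (20, 100, 200, 2000):
        fraction = full_length_fraction(length)
        print(f"{length:>5} nt: {fraction:.2e} of chains are full length")
```

Under this assumption, only about 14% of chains are full length at 200 nt, and essentially none at 2000 nt, which is why long genes are assembled from shorter oligos instead.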

References

  1. Beaucage SL, Caruthers MH. Deoxynucleoside phosphoramidites—A new class of key intermediates for deoxypolynucleotide synthesis. Tetrahedron Letters. 1981;22(20):1859–62. doi:10.1016/S0040-4039(01)90461-7.

  2. Bachem. What is oligonucleotide synthesis & how does it work? [Internet]. Bubendorf: Bachem; 2024 Aug 26 [cited 2026 Feb 10]. Available from: https://www.bachem.com/articles/oligonucleotides/how-does-oligonucleotide-synthesis-work/

  3. ScienceDirect. Oligonucleotide synthesis [Internet]. Amsterdam: Elsevier; 2024 [cited 2026 Feb 10]. Available from: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/oligonucleotidesynthesis

  4. ATDBio. Solid-phase oligonucleotide synthesis: The Phosphoramidite method [Internet]. Southampton: ATDBio; 2024 [cited 2026 Feb 10]. Available from: https://atdbio.com/nucleic-acids-book/Solid-phase-oligonucleotide-synthesis#The-Phosphoramidite-method

  5. Kosuri S, Church GM. Large-scale de novo DNA synthesis: technologies and applications. Nat Methods. 2014;11:499–507. doi:10.1038/nmeth.2918.

  6. Pichon M, Hollenstein M. Controlled enzymatic synthesis of oligonucleotides. Commun Chem. 2024;7:138. doi:10.1038/s42004-024-01216-0.

  7. Yin Y, Arneson R, Yuan Y, Fang S. Long oligos: direct chemical synthesis of genes with up to 1728 nucleotides. Chem Sci. 2025;16:1966–73. doi:10.1039/D4SC06958G.

Homework Question from George Church

Choose ONE of the following three questions to answer; and please cite AI prompts or paper citations used, if any.

1. [Using Google & Prof. Church’s slide #4] What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

Essential amino acids are defined as the amino acids that the animal body cannot synthesize and must therefore obtain from the diet. The essential amino acids in animals are: isoleucine, leucine, lysine, threonine, tryptophan, methionine, histidine, valine, and phenylalanine. In addition, cysteine and tyrosine are often described as conditionally essential because they cannot be synthesized de novo in animals and are instead produced from methionine and phenylalanine, respectively [1].

Because lysine is already an essential amino acid for all animals, the “Lysine Contingency” is not a real or unique control mechanism: no animal can make its own lysine in the first place. Even if such an engineered dependency existed, animals could easily obtain lysine from food such as meat, beans, or grains.

References

  1. Hou Y, Wu G. Nutritionally essential amino acids. Adv Nutr. 2018;9(6):849–851. doi:10.1093/advances/nmy054

Week 2 HW: Read, write & edit

Homework Week 2

Part 1: Benchling & In-silico Gel Art

Import the Lambda DNA. Simulate Restriction Enzyme Digestion with the following Enzymes:

EcoRI
HindIII
BamHI
KpnI
EcoRV
SacI
SalI

Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks.

You might find Ronan’s website a helpful tool for quickly iterating on designs!
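To iterate on gel patterns outside Benchling, fragment sizes from a digest can be computed directly from the recognition sites. The sketch below uses the standard recognition sequences and top-strand cut positions for the seven enzymes listed and treats the DNA as linear, as Lambda DNA is. The `digest` function and the toy sequence are hypothetical; the real Lambda sequence would need to be loaded separately.

```python
# Recognition site and top-strand cut offset (bases from the start of the site).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "BamHI": ("GGATCC", 1),
    "KpnI": ("GGTACC", 5),
    "EcoRV": ("GATATC", 3),
    "SacI": ("GAGCTC", 5),
    "SalI": ("GTCGAC", 1),
}


def digest(sequence, enzyme_names):
    """Return fragment lengths (largest first) for a linear DNA digest."""
    sequence = sequence.upper()
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        start = sequence.find(site)
        while start != -1:
            cuts.add(start + offset)
            start = sequence.find(site, start + 1)
    bounds = [0] + sorted(cuts) + [len(sequence)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


if __name__ == "__main__":
    toy = "AAAA" + "GAATTC" + "AAAAAA" + "GGATCC" + "AAAA"  # made-up sequence
    print("EcoRI only:", digest(toy, ["EcoRI"]))
    print("EcoRI + BamHI:", digest(toy, ["EcoRI", "BamHI"]))
```

Each lane of a gel design then corresponds to one call to `digest` with a different enzyme combination.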

Playing around with the digest enzymes

[Images: HW2_ImageG1, HW2_ImageG2]

Getting an “S”, well…sort of:

[Image: HW2_ImageG3]

Part 3: DNA Design Challenge

3.1. Choose your protein.: In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose. [Example from our group homework, you may notice the particular format — The example below came from UniProt]

I have chosen Tumor Necrosis Factor-Alpha (TNF-α).

Why:

I chose this protein because of my interest in dermatology and chronic diseases. TNF-α is a key inflammatory cytokine in many skin conditions and in conditions involving insulin resistance. I am interested in psoriasis, particularly plaque psoriasis, and its relation to insulin resistance and diabetes [1]. This is something my Mum has suffered from for the last couple of years, and she has recently developed pre-diabetes.

Protein Sequence:

NP_000585.2 tumor necrosis factor [Homo sapiens]
MSTESMIRDVELAEEALPKKTGGPQGSRRCLFLSLFSFLIVAGATTLFCLLHFGVIGPQREEFPRDLSLI
SPLAQAVRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLVVPSEGLYLIYSQVLF
KGQGCPSTHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSA
EINRPDYLDFAESGQVYFGIIAL

References

  1. Moller DE. Potential role of TNF-α in the pathogenesis of insulin resistance and type 2 diabetes. Trends Endocrinol Metab. 2000 Aug;11(6):212-217. doi:10.1016/S1043-2760(00)00272-1.

3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.

To obtain the nucleotide sequence encoding TNF-α, I retrieved the validated human mRNA record (NCBI RefSeq: NM_000594.4) from NCBI. From this record, I extracted the coding sequence (CDS), which corresponds to the protein sequence NP_000585.2. Only the CDS was used for downstream codon optimization. See below:

ATGAGCACTGAAAGCATGATCCGGGACGTGGAGCTGGCCGAGGAGGCGCTCCCCAAGAAGACAGGGGGGCCCCAGGGCTCCAGGCGGTGCTTGTTCCTCAGCCTCTTCTCCTTCCTGATCGTGGCAGGCGCCACCACGCTCTTCTGCCTGCTGCACTTTGGAGTGATCGGCCCCAGAGGGAAGAGTTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAGGCAGTCAGATCATCTTCTCGAACCCCGAGTGACAAGCCTGTAGCCCATGTTGTAGCAAACCCTCAAGCTGAGGGGCAGCTCCAGTGGCTGAACCGCCGGGCCAATGCCCTCCTGGCCAATGGCGTGGAGCTGAGAGATAACCAGCTGGTGGTGCCATCAGAGGGCCTGTACCTCATCTACTCCCAGGTCCTCTTCAAGGGCCAAGGCTGCCCCTCCACCCATGTGCTCCTCACCCACACCATCAGCCGCATCGCCGTCTCCTACCAGACCAAGGTCAACCTCCTCTCTGCCATCAAGAGCCCCTGCCAGAGGGAGACCCCAGAGGGGGCTGAGGCCAAGCCCTGGTATGAGCCCATCTATCTGGGAGGGGTCTTCCAGCTGGAGAAGGGTGACCGACTCAGCGCTGAGATCAATCGGCCCGACTATCTCGACTTTGCCGAGTCTGGGCAGGTCTACTTTGGGATCATTGCCCTGTGA
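A useful sanity check on any extracted CDS is to translate it and compare the result with the protein record. The sketch below implements the standard genetic code with a compact lookup and translates only the first 30 nucleotides of the CDS above. The `translate` function is a hypothetical helper, not part of NCBI or Benchling.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in frame 1; stop codons appear as '*'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    cds_start = "ATGAGCACTGAAAGCATGATCCGGGACGTG"  # first 30 nt of the CDS above
    print(translate(cds_start))
```

Running the same function over the full CDS and comparing against NP_000585.2 residue by residue would flag any copy-paste truncation or frameshift before the sequence goes into a codon optimizer.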

3.3. Codon optimization. Once a nucleotide sequence of your protein is determined, you need to codon optimize your sequence. You may, once again, utilize google for a “codon optimization tool”. In your own words, describe why you need to optimize codon usage. Which organism have you chosen to optimize the codon sequence for and why?

For codon optimization, I chose the online codon optimizing tool:

https://en.vectorbuilder.com/tool/codon-optimization.html

[Image: HW2_Image1_1]

From my input, I got:

Pasted Sequence: GC=59.84%, CAI=0.49

From my output:

Improved DNA[1]: GC=59.97%, CAI=0.92

The CAI (Codon Adaptation Index) rose from 0.49 to 0.92, which indicates strong expected expression in the target organism.

The GC content remained near 60% after optimization, within a suitable range for Escherichia coli, supporting stable and efficient gene synthesis.
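GC content is simple to recompute independently of the optimization tool, which helps when comparing an input and an optimized sequence. This is a minimal sketch with a hypothetical `gc_content` helper. It does not reproduce the CAI, which additionally needs a codon usage table for the target organism.

```python
def gc_content(dna):
    """Percentage of G and C bases in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("sequence is empty")
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)


if __name__ == "__main__":
    example = "ATGAGCACTGAAAGCATGATCCGGGACGTG"  # first 30 nt of the native CDS
    print(f"GC content: {gc_content(example):.2f}%")
```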

I selected Escherichia coli strain K-12 MG1655 as the target organism for codon optimization because it is a well-studied laboratory strain with a completely sequenced and annotated genome [1–2].

Codon-optimized TNF-α sequence for improved expression in Escherichia coli

CTGAGCCCGTTCAACAACCCGCTGCTGCGCCCGTTTCTGATTCTGTATGAACATTAAAAACATGATCCGGGCCGTGGCGCAGGTCGCGGCGGCGCGCCGCAGGAAGATCGTGGCGCACCGGGCTTACAGGCCGTGCTGGTTCCGCAGCCGCTGCTGCTGCCGGATCGCGGCCGTCGTCACCATGCCCTGCTGCCGGCGGCCCTGTGGTCGGATCGTCCGCAGCGTGAAGAATTTCCGCGCGATCTGAGCCTGATTAGCCCGCTGGCGCAGGCCGTGCGTAGCAGCAGCCGCACCCCGTCAGATAAACCGGTGGCGCACGTGGTGGCAAATCCGCAGGCCGAAGGTCAGCTGCAGTGGCTGAATCGTCGCGCGAATGCCCTGTTAGCCAATGGTGTGGAACTGCGCGATAATCAGCTGGTGGTGCCGTCAGAAGGTCTGTACCTGATCTATTCGCAGGTGCTGTTTAAAGGCCAGGGCTGTCCGAGCACCCATGTGCTGCTGACCCACACCATTAGCCGCATTGCGGTGAGCTACCAGACCAAAGTGAACCTGCTTTCTGCGATTAAAAGCCCGTGCCAGCGTGAAACCCCGGAAGGCGCGGAAGCGAAACCGTGGTACGAACCGATTTATCTGGGCGGCGTGTTCCAGCTGGAAAAAGGCGATCGTCTGAGCGCGGAAATTAATCGCCCGGATTATCTGGATTTTGCGGAAAGCGGTCAGGTGTATTTCGGCATTATTGCCTTGTAA

References

  1. Lukjancenko O, Wassenaar TM, Ussery DW. Comparison of 61 sequenced Escherichia coli genomes. Microb Ecol. 2010 Nov;60(4):708-20. doi:10.1007/s00248-010-9717-3. PMID:20623278; PMCID:PMC2974192.

  2. Yannai A, Katz S, Hershberg R. The codon usage of lowly expressed genes is subject to natural selection. Genome Biol Evol. 2018 May;10(5):1237–46. doi:10.1093/gbe/evy084.

3.4. You have a sequence! Now what?

What technologies could be used to produce this protein from your DNA? Describe in your words the DNA sequence can be transcribed and translated into your protein. You may describe either cell-dependent or cell-free methods, or both.

After codon optimization, the TNF-α DNA sequence can be used to produce protein through either cell-dependent or cell-free systems.

For cell-dependent systems, the DNA first needs to be cloned and inserted into an expression vector. The vector is then introduced into live host cells such as E. coli or eukaryotic cells, where the cellular machinery transcribes the DNA into mRNA and then translates the mRNA into TNF‑α protein during growth and metabolism. This is the approach used in standard biotechnology production processes [1–2].

For cell-free systems, crude cell extracts provide all the machinery for transcription, translation, protein folding, and energy metabolism [3]. Therefore, when the codon optimized DNA is added, the TNF‑α protein will be produced in-vitro and under controlled conditions.

Both these methods rely on the flow of information from DNA to mRNA to protein; the Central Dogma of Molecular Biology.

References

  1. Lukjancenko O, Wassenaar TM, Ussery DW. Comparison of 61 sequenced Escherichia coli genomes. Microb Ecol. 2010 Nov;60(4):708-20. doi: 10.1007/s00248-010-9717-3. Epub 2010 Jul 11. PMID: 20623278; PMCID: PMC2974192.

  2. Swartz JR. Advances in Escherichia coli production of therapeutic proteins. Curr Opin Biotechnol. 2001 Oct;12(5):195–201. doi:10.1016/s0958-1669(00)00198-5. PMID:11513436.

  3. Carlson ED, Gan R, Hodgman CE, Jewett MC. Cell-free protein synthesis: Applications come of age. Biotechnol Adv. 2012 Sep-Oct;30(5):1185-94. doi:10.1016/j.biotechadv.2011.09.016. PMID:22001003; PMCID:PMC3359644.

3.5. [Optional] How does it work in nature/biological systems?

1. Describe how a single gene codes for multiple proteins at the transcriptional level. 2. Try aligning the DNA sequence, the transcribed RNA, and also the resulting translated Protein!!! See example below. [Example shows the biomolecular flow in central dogma from DNA to RNA to Protein] Special note that all “T” were transcribed into “U” and that the 3-nt codon represents 1-AA.

Part 4: Prepare a Twist DNA Synthesis Order

4.2. Build Your DNA Insert Sequence

Link to the sequence (first attempt):

https://benchling.com/s/seq-92QKTmxOZ4NOBZloFYXH?m=slm-ih8RIVqVkxJpGYbdm50f

Link to corrected sequence:

https://benchling.com/s/seq-AKpYnuHnRmdf5XnJxSv8?m=slm-sqc6y4bFyGTTvcXYx3Q9

[Image: HW2_Image1_1]

4.3-4.5. Building Expression Cassette and Plasmid

Plasmid with Expression Cassette

https://benchling.com/s/seq-dx10o3kwSJPyNLgmJDGo?m=slm-V5wHDO0G8ZxGTwWVp2A7

[Image: HW2_Image2]

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank).

I would want to sequence Mycobacterium tuberculosis DNA. I would like to focus on virulence‑associated loci such as phoR, mymA and the mce1 operon, as well as drug-resistance loci (rpoB, katG, the inhA promoter, gyrA, embB) and lineage-defining SNPs.

To integrate with surveillance, I would potentially try to store drug resistance and mutation outputs from my detection bio-tool into a DNA-based archive. This could help build a long-term genomic repository.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions: 1. Is your method first-, second- or third-generation or other? How so? 2. What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps. 3. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)? 4. What is the output of your chosen sequencing technology?

For drug-resistant M. tuberculosis DNA, I would use short-read sequencing (e.g. Illumina), a second-generation, sequencing-by-synthesis method, because its high per-base accuracy is well suited to identifying the key resistance-driving mutations for profiling and analysis [1]. The input is extracted genomic DNA. The essential steps are DNA extraction, fragmentation, adapter ligation, cluster amplification, and sequencing by synthesis, with base-calling software decoding the sequence from fluorescent signals. The output includes high-quality short reads, aligned sequences, and variant calls for resistance and lineage analysis. In contrast, long-read sequencing (e.g. Oxford Nanopore, a third-generation method) enables rapid detection, which is useful in high-burden regions, but has slightly lower accuracy and may require deeper coverage.
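To illustrate what a "variant call" means at its simplest, here is a minimal sketch that lists single-base differences between a reference and a sample. It assumes the two sequences are already aligned and of equal length; real pipelines work from read alignments with quality scores. The function name and the toy sequences are hypothetical.

```python
def call_snps(reference: str, sample: str, start: int = 1):
    """Return (position, ref_base, alt_base) for each single-base difference.

    Assumes the sequences are pre-aligned and equal in length. Positions are
    1-based by default; ambiguous 'N' calls in the sample are ignored.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (start + i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference.upper(), sample.upper()))
        if ref != alt and alt != "N"
    ]


# Toy example, not real M. tuberculosis sequence
print(call_snps("ACGTAC", "ACGCAC"))
```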

References

  1. The CRyPTIC Consortium and the 100,000 Genomes Project. Prediction of Susceptibility to First-Line Tuberculosis Drugs by DNA Sequencing. N Engl J Med. 2018;379:1403–1415. doi:10.1056/NEJMoa1800474.

5.2 DNA Write

(i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :)

I would like to design a genetic circuit that could be integrated into a microbial chassis or a cell-free system, enabling it to detect molecular signatures of multidrug-resistant/extensively drug-resistant tuberculosis (MDR/XDR-TB) and activate a fluorescent reporter when they are present in a sample. Comparable designs have been described for biosensors that detect heavy metals in water through recombinase-based logic gates [1]. CRISPR‑based detection systems can be programmed with guides targeting lineage‑specific SNPs (e.g., Beijing/East Asian, Indo-American) [2] alongside resistance mutations, so that the circuit only activates the fluorescent reporter when both types of signature are present. Potentially, CRISPR‑Cas12/13 coupled with allele‑specific amplification can discriminate single‑base changes for lineage and resistance detection with high specificity. It may also be possible to integrate all of this into a microfluidic biosensor, enabling automated, low-volume, rapid, and multiplexed detection suitable for environmental and point-of-care surveillance [3].
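The intended "both signatures present" behaviour can be written as a simple AND gate. This is only a logical sketch: the normalised signal scale and the 0.5 threshold are assumptions made for illustration, not measured values.

```python
def reporter_on(lineage_signal: float, resistance_signal: float,
                threshold: float = 0.5) -> bool:
    """AND gate: the reporter fires only if both detection signals pass the threshold.

    The signals are assumed to be normalised to 0..1; the threshold is arbitrary.
    """
    return lineage_signal >= threshold and resistance_signal >= threshold


# Truth table for the four idealised cases
for lineage in (0.0, 1.0):
    for resistance in (0.0, 1.0):
        print(lineage, resistance, reporter_on(lineage, resistance))
```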

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions:

  1. Is your method first-, second- or third-generation or other? How so?
  2. What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.
  3. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?
  4. What is the output of your chosen sequencing technology?

For this synthesis, I would use synthetic DNA platforms and include CRISPR guide sequences, promoters, and fluorescent reporter proteins. These technologies would allow for quick prototyping, flexibility with design and would allow for automated printers to synthesize sequences up to multiple kilobases accurately.

Essential steps would include: specifying the full nucleotide sequence, including the CRISPR guides, promoters, and reporter genes; synthesizing short oligonucleotides and assembling them into the full construct by PCR or ligation; and testing and validating the assembled construct to ensure the circuit functions properly.

Limitations include time, error correction, and scaling. Large constructs may take longer to build because of the complexity associated with multiple variants.
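As a rough illustration of the oligonucleotide assembly step, this sketch splits a target sequence into overlapping pieces. It is a simplification under stated assumptions: the 60-nt length and 20-nt overlap are arbitrary, all pieces are taken from one strand, and real assembly designs alternate strands and balance melting temperatures.

```python
def split_into_oligos(sequence: str, oligo_len: int = 60, overlap: int = 20):
    """Split a sequence into consecutive oligos that overlap by `overlap` bases."""
    if not 0 <= overlap < oligo_len:
        raise ValueError("overlap must be non-negative and shorter than the oligo length")
    step = oligo_len - overlap
    oligos = []
    start = 0
    while True:
        oligos.append(sequence[start:start + oligo_len])
        if start + oligo_len >= len(sequence):
            break  # this oligo reaches the end of the target
        start += step
    return oligos


# Placeholder 200-nt sequence, not a real construct
target = "ACGT" * 50
pieces = split_into_oligos(target)
print(len(pieces), [len(p) for p in pieces])
```

Dropping the overlap from every piece after the first and joining them gives back the original sequence, which is a quick check that no bases were lost.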

References

  1. Mathur S, Singh D, Ranjan R. Genetic circuits in microbial biosensors for heavy metal detection in soil and water. Biochem Biophys Res Commun. 2023 Apr 16;652:131–137. doi:10.1016/j.bbrc.2023.02.031.

  2. Napier G, Campino S, Merid Y, et al. Robust barcoding and identification of Mycobacterium tuberculosis lineages for epidemiological and clinical studies. Genome Med. 2020;12:114. doi:10.1186/s13073-020-00817-3.

  3. Didarian R, Azar MT. Microfluidic biosensors: revolutionizing detection in DNA analysis, cellular analysis, and pathogen detection. Biomed Microdevices. 2025;27:10. doi:10.1007/s10544-025-00741-6.

5.3 DNA Edit

(i) What DNA would you want to edit and why? In class, George shared a variety of ways to edit the genes and genomes of humans and other organisms. Such DNA editing technologies have profound implications for human health, development, and even human longevity and human augmentation. DNA editing is also already commonly leveraged for flora and fauna, for example in nature conservation efforts, (animal/plant restoration, de-extinction), or in agriculture (e.g. plant breeding, nitrogen fixation). What kinds of edits might you want to make to DNA (e.g., human genomes and beyond) and why?

For editing, I would use CRISPR-Cas systems to introduce lineage-specific SNPs and resistance mutations into safe mycobacterial strains or cell-free systems [1]. This would allow me to test the genetic circuit, validate the CRISPR guides, and generate controls for MDR-TB detection.

(ii) What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:

  1. How does your technology of choice edit DNA? What are the essential steps?
  2. What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?
  3. What are the limitations of your editing methods (if any) in terms of efficiency or precision?

Essential steps would include: designing guide RNAs to target SNPs/loci related to drug resistance; integrating the editing components into cells or a cell-free platform; and functional testing to ensure the edited sequences properly activate the fluorescent reporter within the circuit.

Preparation would require designing the guide RNAs and providing either a cell-free system or microbial framework as the host.

Limitations include: possible off-target edits; increased complexity when introducing multiple edits or larger constructs, which can affect throughput and precision.

References

  1. Molla KA, Yang Y. CRISPR/Cas mediated base editing: technical considerations and practical applications. Trends Biotechnol. 2019 Oct;37(10):1121–1142. doi:10.1016/j.tibtech.2019.03.008. Review of CRISPR base editing systems and how they introduce precise nucleotide changes without double strand breaks.

Week 3 HW: Lab automation

Homework Week 3

Post-Lab Questions

One of the great parts about having an automated robot is being able to precisely mix, deposit, and run reactions without much intervention, and design and deploy experiments remotely. For this week, we’d like for you to do the following:

1. Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.

I initially tried to do a silhouette of the Sonic 3 & Knuckles logo (from the classic Sonic game). The design looked promising after some tweaks. In the end, I went with the Batman Beyond logo, as it was simple and only needed one colour (given the limitations of our node). The final code I used (written with the help of Gemini) is below:

    ### YOUR CODE HERE to create your design

    ##############################################################################
    # Simple Design: Batman Beyond Logo
    ##############################################################################

    spacing = 1.7
    design_points = []

    # We use absolute values abs(i) to ensure perfect left-right symmetry
    for i in range(-18, 19):       # Horizontal span (~61 mm total at 1.7 mm spacing)
        for j in range(-15, 12):   # Vertical span
            x = abs(i)
            # 1. Top Wing Edge (slopes up to the points)
            if j < (0.5 * x) + 3:
                # 2. Bottom "V" Shape
                if j > (1.2 * x) - 16:
                    # 3. Inner Wing Cutouts (The 'U' shapes next to the head)
                    is_cutout = (2 < x < 7) and (j > -2)
                    # 4. The Head (Center spike)
                    is_head = (x <= 1) and (j < 5)
                    # If we are not in the cutout zone, add the point
                    if not is_cutout or is_head:
                        design_points.append((i * spacing, j * spacing, 'Red'))

    # EXECUTION
    points_for_color = [p for p in design_points if p[2] == 'Red']

    if points_for_color:
        pipette_20ul.pick_up_tip()
        pipette_20ul.aspirate(15, location_of_color('Red'))

        for x, y, c in points_for_color:
            # Refill when there is not enough left for one 0.5 µL dot
            if pipette_20ul.current_volume < 0.5:
                pipette_20ul.aspirate(15, location_of_color('Red'))

            target = center_location.move(types.Point(x=x, y=y))
            dispense_and_detach(pipette_20ul, 0.5, target)

        pipette_20ul.drop_tip()

    ##############################################################################
    # END OF CODE
    ##############################################################################

The output:

HW2_PaperImage_BB1 HW2_PaperImage_BB1

With some manual tweaks, what I initially wanted to do:

HW2_PaperImage_S1 HW2_PaperImage_S1

However, the end product was pretty atrocious and would have taken far too much time to fix, given that I was doing this at the last minute. So atrocious that I won’t paste it.

2. Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.

While your description/project idea doesn’t need to be set in stone, we would like to see core details of what you would automate. This is due at the start of lecture and does not need to be tested on the Opentrons yet.

(Answer 1)

Paper: An Automated Versatile Diagnostic Workflow for Infectious Disease Detection in Low-Resource Settings

Miren Urrutia Iturritza, Phuthumani Mlotshwa, Jesper Gantelius, Tobias Alfvén, Edmund Loh, Jens Karlsson, Chris Hadjineophytou, Krzysztof Langer, Konstantinos Mitsakakis, Aman Russom, Håkan N. Jönsson, Giulia Gaudenzi

https://doi.org/10.3390/mi15060708

This paper describes how researchers built an automated diagnostic workflow for detecting infectious diseases in low-resource settings [1]. Specifically, they tested for Neisseria meningitidis, a Gram-negative bacterium that causes serious meningitis and bloodstream infections in humans.

For their workflow, they used the Opentrons OT-One-S Hood, an open-source liquid handling robot that can be bought at a relatively low cost. The scripts for the workflow were created with custom software developed at the SciLifeLab Nanobiotechnology division [2].

Materials and reagents were arranged on the OT-One-S Hood deck, including racks and tubes with primers, buffers, and enzymes, the miniPCR® mini8 thermal cycler, magnetic bead racks, waste containers, and microarray holders, to analyze Neisseria meningitidis DNA in both clinical and spiked samples. “Clinical” samples refer to specimens collected from individuals, whereas “spiked” samples were prepared in the lab by adding a known amount of Neisseria meningitidis DNA.

The robot then performs all the necessary pipetting steps: amplification of the ctrA gene (a conserved, species-specific gene essential for capsule formation, which makes it a reliable marker [3]), enzymatic digestion, and deposition onto paper-based microarrays. The only manual steps were opening and closing the tube lids before and after the DNA amplification and exonuclease digestion steps on the miniPCR® mini8 thermal cycler [1].

The study showed that automated liquid handling can detect Neisseria meningitidis in a workflow suited to low-resource settings, though accuracy and reproducibility were not fully validated.

HW2_PaperImage1 HW2_PaperImage1

References

  1. Urrutia Iturritza M, Mlotshwa P, Gantelius J, Alfvén T, Loh E, Karlsson J, Hadjineophytou C, Langer K, Mitsakakis K, Russom A, et al. An automated versatile diagnostic workflow for infectious disease detection in low-resource settings. Micromachines. 2024;15(6):708. doi:10.3390/mi15060708.

  2. Langer K, Joensson HN. Rapid production and recovery of cell spheroids by automated droplet microfluidics. SLAS Technol. 2020;25:111–122.

  3. Rivas L, Reuterswärd P, Rasti R, Herrmann B, Mårtensson A, Alfvén T, Gantelius J, Andersson-Svahn H. A vertical flow paper-microarray assay with isothermal DNA amplification for detection of Neisseria meningitidis. Talanta. 2018;183:192–200.

(Answer 2)

For my final project, I plan to use automation tools to develop and test a CRISPR-based biosensor capable of detecting multidrug-resistant tuberculosis (MDR-TB) signatures. The workflow would involve high-throughput liquid handling and cell-free protein synthesis. Possible steps would include:

(i) Module setup: Arrange reagents, tip racks, thermal cyclers, magnetic bead racks, and microarray holders on an Opentrons OT-2 deck [1]. This would be supplemented by temperature modules for incubation and heater-shaker modules for mixing and precise reaction control.

(ii) Automated reaction setup: The robot will pipette cell-free lysate, DNA templates, CRISPR guides, and cofactors into 96- or 384-well plates. Multiple combinations of lineage-specific SNP guides and resistance mutation guides will then be tested to evaluate ‘AND-gate logic’.

(iii) Incubation: Plates will be transferred to external devices such as a miniPCR thermal cycler for amplification or a plate reader. Python scripts will be used to control timing, mixing, and incubation periods.

(iv) Signal detection and analysis: Fluorescent outputs will be measured using plate readers such as the Spark or PHERAstar FSX for high-throughput plate analysis [1]. A change in fluorescence would indicate successful target detection and amplification.

(v) Microfluidic integration (if possible): I will look to integrate 3D-printed holders for small microfluidic chips. These can serve as small test cartridges for running multiple tests at once while minimizing manual handling and contamination risk in low-resource settings.
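As example pseudocode for step (ii), this sketch assigns every lineage-guide and resistance-guide combination to a well of a 96-well plate. It is plain Python and does not call the Opentrons API; the guide names are placeholders. The resulting mapping could later drive the pipetting commands in a protocol.

```python
from itertools import product
from string import ascii_uppercase


def plate_layout(lineage_guides, resistance_guides, rows: int = 8, cols: int = 12):
    """Map well names (A1, A2, ...) to (lineage_guide, resistance_guide) pairs.

    Wells are filled row by row. Raises ValueError if there are more
    combinations than wells on the plate.
    """
    wells = [f"{ascii_uppercase[r]}{c + 1}" for r in range(rows) for c in range(cols)]
    combos = list(product(lineage_guides, resistance_guides))
    if len(combos) > len(wells):
        raise ValueError("more guide combinations than wells on the plate")
    return dict(zip(wells, combos))


# Placeholder guide names for illustration
layout = plate_layout(["lineage2", "lineage4"], ["rpoB_guide", "katG_guide"])
for well, (lineage_guide, resistance_guide) in layout.items():
    print(well, lineage_guide, resistance_guide)
```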

References

[1] Course Recitation Slides. Lab Automation Overview. HTGAA Spring 2026 course presentation; 2026 [cited 2026 Feb 23]. Available from: https://docs.google.com/presentation/d/e/2PACX-1vQc3zo7Z0b6HK7YeC56p_n2RbHNjUHh1HI66DH0cHbFk0db1HlbF7gILE__NCvhUiYMjIGSOHwHPv2_/pub?start=false&loop=false&delayms=3000#slide=id.g2b9b763dcde_1_131

Project Ideas

Go to specific slide

Week 4 HW: Protein Design Part I

Part A. Conceptual Questions

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

Assuming the meat is a red meat like beef, there would be approximately 20-25g of protein per 100g of meat [1, 2].

So, taking the upper end of that range for 500g:

500g x 0.25 = 125g of protein

Given that on average 1 amino acid ≈ 100 Daltons, and 1 Dalton ≈ 1g/mol, the average molar mass of an amino acid is:

100 Daltons ≈ 100g/mol

Converting grams to moles:

Moles = mass/molar mass = 125g / 100g/mol = 1.25 moles of amino acids

Converting moles to molecules using Avogadro’s constant [3]:

1.25 moles x 6.02214076 x 10²³ molecules/mole ≈ 7.5 x 10²³ molecules of amino acids
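The conversion from grams of meat to a number of amino acid molecules can be checked numerically. This sketch uses the upper-end protein fraction (25%) and the ~100 g/mol average residue mass given in the question; it ignores the water released when peptide bonds form.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def amino_acid_molecules(meat_g: float, protein_fraction: float = 0.25,
                         aa_molar_mass: float = 100.0) -> float:
    """Estimate the number of amino acid molecules in a portion of meat."""
    protein_g = meat_g * protein_fraction   # grams of protein
    moles = protein_g / aa_molar_mass       # moles of amino acids
    return moles * AVOGADRO


print(f"{amino_acid_molecules(500):.2e}")
```

With the lower-end fraction (`protein_fraction=0.20`) the same function gives 1 mole, so the answer lies between roughly 6.0 × 10²³ and 7.5 × 10²³ molecules.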

  1. University Hospitals Sussex NHS Foundation Trust. Protein fact sheet [Internet]. West Sussex (UK): University Hospitals Sussex NHS Foundation Trust; [cited 2026 Mar 1]. Available from: https://www.uhsussex.nhs.uk/resources/protein-fact-sheet/

  2. Nuffield Health. Best high protein foods [Internet]. Epsom (UK): Nuffield Health; [cited 2026 Mar 1]. Available from: https://www.nuffieldhealth.com/article/best-high-protein-foods

  3. Metric System. Avogadro constant [Internet]. 2024 [cited 2026 Mar 1]. Available from: https://metricsystem.net/si/defining-constants/avogadro-constant/

2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

When we digest meat or fish, we break their proteins down into their basic constituents, amino acids. Our ribosomes then reuse these amino acids, following instructions encoded in our own DNA, to build human proteins rather than those of a cow or fish.

3. Why are there only 20 natural amino acids?

The standard 20 amino acids were selected through evolutionary pressures that favoured side chains with useful properties for folding, catalysis, and molecular recognition. They were most likely adopted through pre-biotic chemistry and early metabolism [1]. Once incorporated into the genetic code, the set became fixed, since further additions would likely have disrupted existing proteins and reduced survival.

Exceptions include pyrrolysine and selenocysteine, which are naturally occurring amino acids incorporated into proteins via specialized mechanisms: pyrrolysine is encoded by the UAG stop codon in certain archaea and bacteria using a dedicated tRNA and biosynthetic enzymes, and selenocysteine is inserted at UGA codons with the help of a Selenocysteine Insertion Sequence (SECIS) in the mRNA.

  1. Doig AJ. Frozen, but no accident – why the 20 standard amino acids were selected. FEBS J. 2017 May;284(9):1296–1305. doi:10.1111/febs.13982.

4. Can you make other non-natural amino acids? Design some new amino acids.

Yes, it is possible to make non-natural amino acids and to incorporate them into proteins using engineered aminoacyl-tRNA synthetase/tRNA pairs with reassigned codons [1]. The general approach is: choose a base amino acid → modify the side chain to add a new function → synthesize it → supply an engineered tRNA/synthetase pair so that the amino acid can be recognized → insert it at a specific codon.

As a design example, I would use the non‑natural amino acid para‑azido‑L‑phenylalanine (pAzF), which contains an azide (‑N₃) group. When pAzF is genetically incorporated into a protein at a chosen site, the azide can act as a chemical handle for attaching a fluorescent dye or imaging agent to that protein. This can help label or track proteins in cells and animals [2].

  1. Bag SS, Saraogi I, Guo J. Editorial: Expansion of the Genetic Code: Unnatural Amino Acids and their Applications. Front Chem. 2022;10:958433. doi:10.3389/fchem.2022.958433.

  2. Lightle HE, Kafley P, Lewis TR, Wang R. Site‑specific protein conjugates incorporating para‑azido‑L‑phenylalanine for cellular and in vivo imaging. Methods. 2023;219:95–101. doi:10.1016/j.ymeth.2023.10.001

5. Where did amino acids come from before enzymes that make them, and before life started?

Amino acids are hypothesized to have originated on the primordial Earth [1] through abiotic synthesis under early environmental conditions (such as electrical discharges and impact‑driven reactions during the Hadean Eon) before life existed. Over time, as organisms evolved in the Archean and Proterozoic Eons, they developed enzyme‑mediated biosynthetic pathways to produce amino acids internally, eventually supporting the diversity of life seen in the three domains of Archaea, Bacteria, and Eukarya.

  1. Nature Education. An evolutionary perspective on amino acids. Nature Scitable. 2014. Available from: https://www.nature.com/scitable/topicpage/an-evolutionary-perspective-on-amino-acids-14568445

6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

If an α-helix is made from D-amino acids instead of L-amino acids, it would form a left-handed helix [1]. Natural proteins use L-amino acids, which form right-handed α-helices. Because D-amino acids are the mirror images of L-amino acids, a helix built from them reverses that twist and is left-handed.

  1. Perlego. Alpha helix. Perlego Chemistry Index. Available from: https://www.perlego.com/index/chemistry/alpha-helix

  2. OpenAI. ChatGPT (version 5.2).

7. Can you discover additional helices in proteins?

Yes, proteins contain additional helix types beyond the α-helix, in both natural and artificially designed proteins. One is the 3₁₀ helix, a tighter secondary structure found in proteins and polypeptides, with backbone hydrogen bonds between residues i and i+3.

Another is the π-helix, a wider secondary structure with backbone hydrogen bonds between residues i and i+5.

8. Why are most molecular helices right-handed?

9. Why do β-sheets tend to aggregate?

β-sheets tend to aggregate because the strands at their edges have exposed backbone groups available for hydrogen bonding [1]. This leaves them susceptible to interactions with other β-sheets.

o What is the driving force for β-sheet aggregation?

The main driving force is the formation of intermolecular hydrogen bonds between the backbone groups of adjacent β-strands. Once the strands are aligned, hydrophobic side-chain interactions and van der Waals forces between tightly packed residues further stabilize the β-sheet aggregates [1].

  1. Richardson JS, Richardson DC. Natural β-sheet proteins use negative design to avoid edge-to-edge aggregation. Proc Natl Acad Sci U S A. 2002 Mar 5;99(5):2754–9. doi:10.1073/pnas.052706099.

10. Why do many amyloid diseases form β-sheets?

Because β-sheets allow extensive intermolecular backbone hydrogen bonding, they enable many protein molecules to stick together. For example, in Alzheimer’s disease, amyloid-β peptides misfold and aggregate into fibrils that are rich in β-sheet structure. These fibrils facilitate plaque formation in the brain [1].

o Can you use amyloid β-sheets as materials?

Amyloid β-sheets can be used as materials because their cross-β structure forms highly stable, self-assembling nanofibers. These properties allow them to be developed into biomaterials such as hydrogels and nanofibers.

  1. Ow SY, Dunstan DE. A brief overview of amyloids and Alzheimer’s disease. Protein Sci. 2014 Oct;23(10):1315–31. doi:10.1002/pro.2524.

Part B. Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

1. Briefly describe the protein you selected and why you selected it.

I selected the Sonic Hedgehog protein. First, because I was a massive Sonic the Hedgehog fan. Later, as I started studying and working, I became interested in biology, neuroscience, and mental health. I found out that the protein has important functions in cell-to-cell signalling at the fetal stage, central nervous system development, and tooth enamel growth, and it has also been suggested that it may have regenerative potential for hair growth. Conversely, its dysregulation has been linked to aging-related neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, and amyotrophic lateral sclerosis.

2. Identify the amino acid sequence of your protein.

From UniProt:

sp|Q15465|SHH_HUMAN Sonic hedgehog protein OS=Homo sapiens OX=9606 GN=SHH PE=1 SV=1

MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG RYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGV KLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAH IHCSVKAENSVAAKSGGCFPGSATVHLEQGGTKLVKDLSPGDRVLAADDQGRLLYSDFLT FLDRDDGAKKVFYVIETREPRERLLLTAAHLLFVAPHNDSATGEPEASSGSGPPSGGALG PRALFASRVRPGQRVYVVAERDGDRRLLPAAVHSVTLSEEAAGAYAPLTAQGTILINRVL ASCYAVIEEHSWAHRAFAPFRLAHALLAALAPARTDRGGDSGGGDRGGGGGRVALTAPGA ADAPGAGATAGIHWYSQLLYQIGTWLLDSEALHPLGMAVKSS

Results from the Colab notebook:

Length: 462 amino acids

Most frequent: A (57 times, 12.3%)

This matches the number provided on UniProt
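The length and most-frequent-residue figures can be reproduced with a few lines of Python. This is a minimal sketch rather than the course notebook's code; it assumes the sequence is pasted as a string, possibly containing spaces or line breaks.

```python
from collections import Counter


def composition_summary(sequence: str):
    """Return (length, most_common_residue, count, percent) for a protein sequence."""
    seq = "".join(sequence.split()).upper()  # drop spaces and line breaks
    counts = Counter(seq)
    residue, count = counts.most_common(1)[0]
    return len(seq), residue, count, 100 * count / len(seq)


# Short toy sequence; paste the full UniProt sequence here to check the numbers above
length, residue, count, percent = composition_summary("MLLLARCLLL VLVSSLLVCS")
print(f"Length: {length}, most frequent: {residue} ({count} times, {percent:.1f}%)")
```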

o How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

After inputting the sequence into UniProt’s BLAST tool, there are 244 homologs identified.

o Does your protein belong to any protein family?

It belongs to the hedgehog family.

3. Identify the structure page of your protein in RCSB o When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)

After the search, I selected:

8Z2V | pdb_00008z2v

Crystal structure of Sonic hedgehog in complex with antibody 5E1 mutant H-R102A with metals

Resolution: 1.89 Å

This was deposited on April 13, 2024 and released in the PDB on April 16, 2025.

A resolution of 1.89 Å indicates a good-quality structure: smaller values mean finer detail, so the atomic model can be interpreted more accurately.

o Are there any other molecules in the solved structure apart from protein?

The solved structure 8Z2V includes the heavy and light chains of the antibody 5E1 to which it is bound, as well as several small molecules: glycerol, zinc ions, calcium ions, and a chloride ion.

o Does your protein belong to any structure classification family?

RCSB classifies this entry as “Immune System”, presumably because of the bound antibody 5E1.

4. Open the structure of your protein in any 3D molecule visualization software:

o PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)

After loading the protein, I got:

HW4_PaperImage_S1 HW4_PaperImage_S1

o Visualize the protein as “cartoon”, “ribbon” and “ball and stick”

Cartoon:

HW4_PaperImage_S2 HW4_PaperImage_S2

Ball & Stick:

HW4_PaperImage_S3 HW4_PaperImage_S3

Ribbon:

HW4_PaperImage_S4 HW4_PaperImage_S4

o Color the protein by secondary structure. Does it have more helices or sheets?

On PyMol I used:

    # color by secondary structure
    color red, ss h      # helices
    color yellow, ss s   # sheets
    color green, ss l    # loops/coils

HW6_PaperImage_S5 HW6_PaperImage_S5

Upon visual inspection, there seems to be more sheets.

o Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

To inspect holes/pockets, I made the surface partially transparent.

Setting: transparency set to 0.30000.

scene: scene stored as “004”.

HW6_PaperImage_S6 HW6_PaperImage_S6

I then restarted with the following code:

    fetch 3m1n
    show surface
    show spheres, organic
    set transparency, 0.3

This showed spheres:

HW6_PaperImage_S7 HW6_PaperImage_S7

This showed some deeply embedded pockets, with one (I think!) more towards the surface.

I may need some help with this!

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

1. Deep Mutational Scans

a. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.

HW6_PaperImage_S8 HW6_PaperImage_S8

b. Can you explain any particular pattern? (choose a residue and a mutation that stands out)

Dark vertical stripes in the heatmap indicate positions where nearly all mutations score negatively; these are highly conserved residues that are likely critical for SHH function. Position 141 (His), part of the zinc-binding motif, shows strongly negative LLR scores for most substitutions, reflecting its essential role in zinc coordination. Interestingly, in the ESMFold experiment in Part C2, a histidine-to-alanine substitution in this region preserved the predicted backbone fold; this is compatible with the language model’s predictions, since such a mutation could abolish function without changing the fold. In contrast, position 39 showed a near-neutral score (-0.08) for arginine substitution, which would be expected if this N-terminal region, just downstream of the signal peptide, is under weaker evolutionary pressure.
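To make the LLR scores easier to interpret, this sketch uses the common definition LLR = log p(mutant) − log p(wild type) at a position. Whether the course notebook computes its scores in exactly this way is an assumption, and the probabilities in the example are made up.

```python
import math


def llr(p_mutant: float, p_wildtype: float) -> float:
    """Log-likelihood ratio of a substitution; negative means disfavoured."""
    return math.log(p_mutant) - math.log(p_wildtype)


def relative_likelihood(llr_score: float) -> float:
    """Convert an LLR back to the ratio p(mutant) / p(wild type)."""
    return math.exp(llr_score)


print(llr(0.01, 0.60))             # made-up probabilities: a strongly negative score
print(relative_likelihood(-0.08))  # ratio implied by a near-neutral score
```

Under this definition, a score of -0.08 corresponds to a mutant residue that the model rates about 92% as likely as the wild type, which is why it reads as near-neutral.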

c. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.

As far as I could find, no systematic DMS dataset exists for SHH, though with more time I could search more thoroughly. I would need some help with this question.

2. Latent Space Analysis

a. Use the provided sequence dataset to embed proteins in reduced dimensionality.

After realising that SHH was not included in the code, with the help of ChatGPT, I coded:

    #####################################################################################################

    # 1. Embed SHH sequence
    shh_tokens = tokenizer(
        [protein_sequence],  # your already-defined protein_sequence variable
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=tokenizer.model_max_length
    )

    with torch.no_grad():
        shh_outputs = esm2(
            input_ids=shh_tokens['input_ids'],
            attention_mask=shh_tokens['attention_mask'],
            output_hidden_states=True,
        )

    # Mean pool the embedding
    shh_embedding = shh_outputs.hidden_states[-1][0]
    shh_mask = shh_tokens['attention_mask'][0]
    shh_mean_embedding = shh_embedding[shh_mask == 1].mean(dim=0).cpu().numpy()

    # 2. Stack with existing embeddings and re-run t-SNE
    all_embeddings = np.vstack([embeddings_array, shh_mean_embedding])

    tsne_3d_new = TSNE(n_components=3, perplexity=30, n_iter=300, random_state=42)
    embeddings_3d_new = tsne_3d_new.fit_transform(all_embeddings)

    # 3. Build dataframe
    tsne_df_new = pd.DataFrame(embeddings_3d_new, columns=['TSNE1', 'TSNE2', 'TSNE3'])

    # Add labels: SCOP proteins + SHH
    labels = protein_sequence_annotations[:len(embeddings_array)] + ['SHH (Sonic Hedgehog)']
    tsne_df_new['label'] = labels
    tsne_df_new['is_SHH'] = ['SHH' if i == len(embeddings_array) else 'Other' for i in range(len(tsne_df_new))]

    # Create a numerical column for marker size
    tsne_df_new['marker_size'] = tsne_df_new['is_SHH'].apply(lambda x: 10 if x == 'SHH' else 3)

    # 4. Plot with SHH highlighted
    fig_shh = px.scatter_3d(
        tsne_df_new,
        x='TSNE1',
        y='TSNE2',
        z='TSNE3',
        color='is_SHH',
        color_discrete_map={'SHH': 'red', 'Other': 'lightblue'},
        hover_name='label',
        title='3D t-SNE with SHH Highlighted',
        size='marker_size'  # Use the new numerical size column
    )

    fig_shh.update_layout(height=800)
    fig_shh.show()

    #####################################################################################################

This produced:

HW6_PaperImage_S9 HW6_PaperImage_S9

b. Analyze the different formed neighborhoods: do they approximate similar proteins?

The 3D t-SNE plot shows a single continuous distribution of SCOP protein embeddings with no sharply defined clusters, suggesting protein sequence space varies gradually across structural families. Outlier points at the periphery represent the most divergent sequences, consistent with the known continuity of protein fold space.

c. Place your protein in the resulting map and explain its position and similarity to its neighbors

SHH appears as a distinct red point near the periphery of the t-SNE cloud, which may reflect its unusual biochemical features, including autocatalytic processing and lipid modification, that make it distinct from most SCOP representatives. Despite this, it remains within the main cloud boundary, indicating broad sequence features shared with neighbouring proteins. If other hedgehog family members (IHH, DHH) were included in the dataset, they would be expected to be its nearest neighbours, consistent with ESM2 capturing evolutionary relationships through sequence alone.
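Because t-SNE can distort distances, the neighbours of SHH are more reliably found in the original embedding space. This minimal sketch ranks proteins by cosine similarity to a query embedding using plain Python lists; the tiny 2-D vectors and labels are placeholders, and in the notebook the inputs would be the mean-pooled ESM2 embeddings.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, embeddings, labels, k: int = 3):
    """Return the k (label, similarity) pairs most similar to the query vector."""
    scored = [(label, cosine_similarity(query, emb))
              for label, emb in zip(labels, embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder data for illustration only
toy_labels = ["protein_a", "protein_b", "protein_c"]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(nearest_neighbours([1.0, 0.1], toy_embeddings, toy_labels, k=2))
```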

C2. Protein Folding

Folding a protein

  1. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

Total sequence length: 462

Running ESMFold inference for sequence with length 462…

Prediction complete. ptm: 0.603 plddt: 78.225

Results saved to SHH_Fold_V1_3a3ca/

CPU times: user 1min 26s, sys: 8.6 s, total: 1min 35s

Wall time: 2min 8s

ESMFold predicted the SHH structure with a pTM of 0.603 and mean pLDDT of 78.2. The pTM score above 0.5 suggests the overall fold topology is likely correct, while the pLDDT of 78.2 indicates confident but not perfect local coordinate prediction. Regions of lower confidence likely correspond to flexible loops and the signal peptide. A full structural comparison via RMSD alignment to the crystal structure 1VHH would further quantify coordinate accuracy.

  2. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

Mutation:

MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG RYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGV KLRVTEGWDEDGAHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAH IHCSVKAENSVAAKSGGCFPGSATVHLEQGGTKLVKDLSPGDRVLAADDQGRLLYSDFLT FLDRDDGAKKVFYVIETREPRERLLLTAAHLLFVAPHNDSATGEPEASSGSGPPSGGALG PRALFASRVRPGQRVYVVAERDGDRRLLPAAVHSVTLSEEAAGAYAPLTAQGTILINRVL ASCYAVIEEHSWAHRAFAPFRLAHALLAALAPARTDRGGDSGGGDRGGGGGRVALTAPGA ADAPGAGATAGIHWYSQLLYQIGTWLLDSEALHPLGMAVKSS

Changed HH→AA at zinc-binding site (positions 141-142)

Total sequence length: 462

Running ESMFold inference for sequence with length 462…

Prediction complete. ptm: 0.602 plddt: 78.128

Results saved to test_2cd60/

CPU times: user 1min 25s, sys: 8.45 s, total: 1min 33s

Wall time: 2min 3s

HW6_PaperImage_S10 HW6_PaperImage_S10

A double point mutation at the zinc-binding site (H141A/H142A) had negligible effect on predicted structure (pTM 0.603 vs. 0.602, pLDDT 78.2 vs. 78.13), suggesting SHH’s fold is resilient to point mutations even at functionally critical residues.
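Before interpreting a mutant fold, it helps to confirm which residues actually differ from the wild type. The sketch below is one simple way to do that for substitutions of equal length; the function name and the toy sequences are illustrative assumptions, not part of the ESMFold notebook.

```python
def list_substitutions(wild_type, mutant):
    """List point substitutions (1-indexed, e.g. 'H3A') between two
    sequences of equal length."""
    if len(wild_type) != len(mutant):
        raise ValueError("Sequences differ in length; align them before comparing.")
    return [
        f"{wt}{pos}{mt}"
        for pos, (wt, mt) in enumerate(zip(wild_type, mutant), start=1)
        if wt != mt
    ]

# Toy sequences for illustration (not the SHH sequence)
print(list_substitutions("MKHHSE", "MKAASE"))  # ['H3A', 'H4A']
```

Running the same function on the wild-type and mutated SHH sequences gives the exact positions to report alongside the pTM and pLDDT values.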

Mutation:

Alanine substitution was chosen as it removes side chain functionality while preserving backbone geometry, representing a conservative but informative structural perturbation.

MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG RYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGV KLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAH IHCSVKAENSVAAKSGGCFPGSATVHLEQGGTKLVKDLSPGDRVLAADDQGRLLYSAAAAAAAAAAAAAAAAAAAAAAAAFLDRDDGAKKVFYVIETREPRERLLLTAAHLLFVAPHNDSATGEPEASSGSGPPSGGALG PRALFASRVRPGQRVYVVAERDGDRRLLPAAVHSVTLSEEAAGAYAPLTAQGTILINRVL ASCYAVIEEHSWAHRAFAPFRLAHALLAALAPARTDRGGDSGGGDRGGGGGRVALTAPGA ADAPGAGATAGIHWYSQLLYQIGTWLLDSEALHPLGMAVKSS

Replaced 25 residues with Alanines in a surface region

This resulted in:

Total sequence length: 482

Running ESMFold inference for sequence with length 482…

Prediction complete. ptm: 0.554 plddt: 72.026

Results saved to SHH_FinalMut_d0fdf/

CPU times: user 1min 40s, sys: 10.3 s, total: 1min 51s

Wall time: 2min 27s

HW6_PaperImage_S10 HW6_PaperImage_S10

A large segment mutation was introduced by replacing a surface region with a polyalanine stretch (26 residues), resulting in a slight sequence length increase from 462 to 482 residues due to insertion. This caused a moderate reduction in predicted structural confidence (pTM 0.603 → 0.554, pLDDT 78.2 → 72.0), while the fold remained above the 0.5 pTM threshold, indicating overall structural resilience.

C3. Protein Generation

sequence candidates via ProteinMPNN

  1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

After installing necessary packages for ProteinMPNN, I input the latest PDB for SHH protein, 8Z2V.

Heat map:

HW6_PaperImage_S10 HW6_PaperImage_S10

Sequence comparison:

Generating sequences…

8Z2V, score=1.5464, fixed_chains=[], designed_chains=[‘A’], model_name=v_48_020

LTPLAYKQFIPNVAEKTLGASGRYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGVKLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAHIHCSVKAE

T=0.1, sample=0, score=0.8104, seq_recovery=0.4467

LTPLAPGERVPPVPEDSPEAAGPYLGRVERGDPRFADLVPDTDPDIEFADADGDGNDRLHTPKLVAVLRRLARLVREAWPGLRLRVLRGWSLDGDGSPRSHHYNGREADVTFSDEDAARLGALAALAVEAGADWVELASPDYVEIAVRPE

The ProteinMPNN probability heatmap shows that most positions along the SHH backbone are highly constrained, with single amino acids receiving probabilities exceeding 0.9 (yellow). This reflects strong structural determinism; the backbone geometry dictates specific residue preferences at key positions. A minority of positions, particularly around residues 95–105, show broader probability distributions across multiple amino acids, indicating structurally tolerant surface-exposed regions. The overall sparsity of high-probability assignments is consistent with the 44.67% sequence recovery observed, where roughly half of positions were confidently recovered while the remainder tolerate sequence variation.
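The sequence recovery figure can be reproduced by hand as the fraction of positions where the designed residue equals the native one. This is a minimal sketch under that assumption (ProteinMPNN computes its own `seq_recovery` over the designed positions); the function name and the short toy sequences are illustrative.

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where the designed residue
    matches the native residue."""
    if len(native) != len(designed):
        raise ValueError("Sequences must have the same length.")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

# Toy example: 5 of 6 positions are identical
print(round(sequence_recovery("LTPLAY", "LTPLAP"), 3))  # 0.833
```

Applied to the 150-residue native and designed sequences above, the result can be compared against the reported seq_recovery of 0.4467.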

  2. Input this sequence into ESMFold and compare the predicted structure to your original.

Inputting designed sequence back into ESMFold:

Original:

LTPLAYKQFIPNVAEKTLGASGRYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGVKLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAHIHCSVKAE

Designed:

LTPLAPGERVPPVPEDSPEAAGPYLGRVERGDPRFADLVPDTDPDIEFADADGDGNDRLHTPKLVAVLRRLARLVREAWPGLRLRVLRGWSLDGDGSPRSHHYNGREADVTFSDEDAARLGALAALAVEAGADWVELASPDYVEIAVRPE

Total sequence length: 150

Running ESMFold inference for sequence with length 150…

Prediction complete. ptm: 0.910 plddt: 90.664

Results saved to SHH_Inverse_FinalMut_3f7ad/

CPU times: user 10.2 s, sys: 8.37 s, total: 18.6 s

Wall time: 47.2 s

The ProteinMPNN-designed sequence, when folded by ESMFold, achieved a pTM of 0.910 and a pLDDT of 90.664, substantially higher than the native SHH sequence (pTM 0.603, pLDDT 78.2). This improvement likely reflects two factors: first, the designed sequence covers only the structured core of SHH (150 residues vs 462), excluding disordered regions such as the signal peptide that reduce confidence scores; second, ProteinMPNN explicitly optimises sequences for backbone compatibility, producing a sequence that the folding model finds more compatible with the given fold than the evolutionarily derived native sequence.

HW6_PaperImage_S10 HW6_PaperImage_S10

Part D. Group Brainstorm on Bacteriophage Engineering

Proposal by: Sameen Nasar, Robert C Beck

Group Project Goal: Engineering a chaperone-independent efficient MS2 lysis protein

Project Rationale:

The efficacy of bacteriophage MS2 as an antibacterial agent is currently limited by the host’s ability to evolve resistance. Specifically, E. coli can mutate the molecular chaperone DnaJ (e.g., at position P330), disrupting the interaction that the MS2 lysis (L) protein needs in order to fold and function [1]. DnaJ binds to the N-terminal domain of the MS2 lysis protein and alleviates its inhibitory effect on lytic activity.

We propose engineering a self-activating L protein by replacing its inhibitory, chaperone-dependent N-terminal region with a computationally designed, thermodynamically stable scaffold. As this original domain is dispensable for actual lysis but creates the DnaJ dependency [2], our redesign conceptually eliminates the need for the molecular “handshake” between host and phage, allowing MS2 to fold independently and bypass bacterial control mechanisms entirely.

Schematic

MS2 Protein & DnaJ Sequences
↓
AlphaFold-Multimer
Map the DnaJ binding interface

↓
RFDiffusion
Design a stable, independent N-terminal scaffold

↓
ProteinMPNN
Generate amino acid sequences for the new scaffold

↓
ESMFold
Confirm the new single-chain mutant folds correctly

↓
AlphaFold-Multimer
Confirm the mutant no longer binds to DnaJ

↓

Final L Protein Mutant for Synthesis

References

  1. Chamakura KR, Tran JS, Young R. MS2 lysis of Escherichia coli depends on host chaperone DnaJ. J Bacteriol. 2017;199(9):e00058-17. doi:10.1128/JB.00058-17.

  2. Chamakura KR, Edwards GB, Young R. Mutational analysis of the MS2 lysis protein L. Microbiology (Reading). 2017;163(7):961–969. doi:10.1099/mic.0.000485.

Week 5 HW: Protein Design Part II

Part A: SOD1 Binder Peptide Design (From Pranam)

Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.

Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.

Your challenge:

  1. Design short peptides that bind mutant SOD1.

  2. Then decide which ones are worth advancing toward therapy.

You will use three models developed in our lab:

• PepMLM: target sequence-conditioned peptide generation via masked language modeling

• PeptiVerse: therapeutic property prediction

• moPPIt: motif-specific multi-objective peptide design using Multi-Objective Guided Discrete Flow Matching (MOG-DFM)

Part 1: Generate Binders with PepMLM

*I initially did this part wrong because I did not introduce the mutation into the sequence, so I had to do it again. The following is the latest attempt.

1. Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.

From UniProt:

sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2

MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

After introducing the A4V mutation (position 5 when the initiator methionine is counted):

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
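The substitution can also be made programmatically, which guards against editing the wrong residue. This is a small sketch, assuming the UniProt sequence includes the initiator methionine so that residue 4 in the conventional SOD1 numbering sits at string index 4; the function name is illustrative.

```python
# Human SOD1 (UniProt P00441), including the initiator methionine
WILD_TYPE_SOD1 = (
    "MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS"
    "AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV"
    "HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ"
)

def introduce_a4v(sequence_with_met):
    """Apply A4V. Residue 4 (numbering without Met) is position 5 with Met,
    which is index 4 in a 0-indexed Python string."""
    index = 4
    if sequence_with_met[index] != "A":
        raise ValueError(f"Expected A at index {index}, found {sequence_with_met[index]}")
    return sequence_with_met[:index] + "V" + sequence_with_met[index + 1:]

mutant_sod1 = introduce_a4v(WILD_TYPE_SOD1)
print(mutant_sod1[:10])  # MATKVVCVLK
```

The wild-type check raises an error if the numbering offset is wrong, which is the mistake most likely to slip through a manual edit.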

2. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:

Using PepMLM, four candidate peptides of length 12 were generated conditioned on the mutant SOD1 (A4V) sequence. The generated peptides were WRYGPYAIELAX (pseudo-perplexity 11.85), WRYYVAALEWWE (28.73), WHNYAAAIRLKX (15.20), and WHSYAAAAELKX (9.48). For comparison, the known SOD1-binding peptide FLYRWLPSRRGG was scored against the same target, yielding a pseudo-perplexity of 20.64. Lower pseudo-perplexity values indicate higher model confidence in the predicted binder. Three of the four generated peptides outperformed the known binder, with WHSYAAAAELKX achieving the lowest score of 9.48. Notably, three of the four generated peptides (WRYGPYAIELAX, WHNYAAAIRLKX, and WHSYAAAAELKX) contained a terminal X residue, representing an unknown or masked amino acid. This suggests a mismatch between the specified peptide length and the model’s generation process, and these sequences should be treated with caution or re-generated with corrected parameters before advancing to downstream evaluation.

3. Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.

Binder | Pseudo Perplexity
WRYGPYAIELAX | 11.85063412
WRYYVAALEWWE | 28.7286821
WHNYAAAIRLKX | 15.20319465
WHSYAAAAELKX | 9.482601001

4. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.

To find the perplexity score for FLYRWLPSRRGG, I added this code (with help from an LLM) to the Colab notebook to compute its pseudo-perplexity:

known_peptide = "FLYRWLPSRRGG"
ppl_score = compute_pseudo_perplexity(model, tokenizer, protein_seq, known_peptide)
print(f"Peptide: {known_peptide}")
print(f"Pseudo Perplexity: {ppl_score}")

This resulted in:

Binder | Pseudo Perplexity
WRYGPYAIELAX | 11.85063412
WRYYVAALEWWE | 28.7286821
WHNYAAAIRLKX | 15.20319465
WHSYAAAAELKX | 9.482601001
FLYRWLPSRRGG | 20.63523127

5. Record the perplexity scores that indicate PepMLM’s confidence in the binders.

After generating the 12 amino acid peptides with PepMLM on the mutant SOD1 sequence, I recorded the pseudo-perplexity scores for each (lower scores indicate higher model confidence). I then added the known SOD1-binding peptide FLYRWLPSRRGG as a reference for comparison, yielding a pseudo-perplexity of 20.64. Of the four generated peptides, three outperformed the known binder: WHSYAAAAELKX (9.48), WRYGPYAIELAX (11.85), and WHNYAAAIRLKX (15.20), while WRYYVAALEWWE (28.73) scored worse. The best performing generated peptide, WHSYAAAAELKX, achieved less than half the pseudo-perplexity of the known binder, suggesting strong model confidence in its predicted binding to the A4V mutant SOD1 target.
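The comparison against the known binder can be made explicit with a short ranking script. This is a sketch using the scores recorded above; the variable names are illustrative, and the check for "X" simply flags peptides containing an unknown residue.

```python
# Pseudo-perplexity scores recorded from the PepMLM notebook
generated = {
    "WRYGPYAIELAX": 11.85063412,
    "WRYYVAALEWWE": 28.7286821,
    "WHNYAAAIRLKX": 15.20319465,
    "WHSYAAAAELKX": 9.482601001,
}
known_binder, known_score = "FLYRWLPSRRGG", 20.63523127

# Lower pseudo-perplexity means higher model confidence
ranked = sorted(generated.items(), key=lambda item: item[1])
better_than_known = [pep for pep, score in ranked if score < known_score]
contains_unknown = [pep for pep, _ in ranked if "X" in pep]

for pep, score in ranked:
    flag = " (contains X)" if "X" in pep else ""
    print(f"{pep}  {score:.2f}{flag}")
print("Outperform known binder:", better_than_known)
```

Flagging the X-containing peptides at this stage makes it easier to decide which sequences need regenerating before structural evaluation.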

Part 2: Evaluate Binders with AlphaFold3

1. Navigate to the AlphaFold Server: alphafoldserver.com

HW5_AF1 HW5_AF1

2. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.

3. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?

SOD1_ProtPeptide1 (WRYGPYAIELAX)

ipTM: 0.33

The predicted complex produced an ipTM score of 0.33, indicating low confidence in the interaction between the peptide and mutant SOD1. In the structural model, the peptide appears detached from the protein surface and does not localize near the N-terminal region where the A4V mutation occurs. Instead, it remains largely solvent-exposed and does not form clear contacts with the β-barrel region of SOD1.

HW5_ProPep1 HW5_ProPep1

SOD1_ProtPeptide2 (WRYYVAALEWWE)

ipTM: 0.28

The predicted complex produced an ipTM score of 0.28, indicating low confidence in the interaction between the peptide and mutant SOD1. In the structural model, the peptide appears detached from the protein surface, adopting a partially helical conformation in the periphery of the structure but failing to localize near the N-terminal region where the A4V mutation occurs. The peptide does not form clear contacts with the β-barrel core and remains largely solvent-exposed.

HW5_ProPep2 HW5_ProPep2

SOD1_ProtPeptide3 (WHNYAAAIRLKX)

ipTM: 0.39

Although this model has the highest ipTM score of the four, it shows the same problems: the peptide appears detached, forms no clear contacts with SOD1, and remains solvent-exposed.

HW5_ProPep3 HW5_ProPep3

SOD1_ProtPeptide4 (WHSYAAAAELKX)

ipTM: 0.26

Similar trends with this peptide.

HW5_ProPep3 HW5_ProPep3

4. In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

AlphaFold predictions of the SOD1-peptide complexes produced relatively low ipTM scores, ranging from 0.26 to 0.39. This indicates low confidence in stable interactions between the generated peptides and mutant SOD1. In the predicted structures, the peptides generally appear surface-exposed and do not consistently localize near the N-terminal region where the A4V mutation occurs. As a result, they are loosely structured and do not form clear interfaces with the β-barrel core or the dimer interface.

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:

  1. Paste the peptide sequence.

  2. Paste the A4V mutant SOD1 sequence in the target field.

  3. Check the boxes:

    • Predicted binding affinity

    • Solubility

    • Hemolysis probability

    • Net charge (pH 7)

    • Molecular weight

Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties?

Choose one peptide you would advance and justify your decision briefly.

*Candidate Peptides:

# | Binder | Pseudo Perplexity
0 | WRYGPYAIELAX | 11.85063412
1 | WRYYVAALEWWE | 28.7286821
2 | WHNYAAAIRLKX | 15.20319465
3 | WHSYAAAAELKX | 9.482601001

(0)

HW5_CanPep2 HW5_CanPep2

(1)

HW5_CanPep1 HW5_CanPep1

(2)

HW5_CanPep1 HW5_CanPep1

(3)

HW5_CanPep1 HW5_CanPep1
# | Binder | ipTM | Predicted binding affinity | Solubility | Hemolysis probability | Net charge (pH 7) | Molecular weight
0 | WRYGPYAIELAX | 0.33 | 5.791 | 1 | 0.084 | -0.24 | 1320.7
1 | WRYYVAALEWWE | 0.28 | 7.750 | 1 | 0.190 | -1.23 | 1671.8
2 | WHNYAAAIRLKX | 0.39 | 5.972 | 1 | 0.019 | 1.85 | 1324.8
3 | WHSYAAAAELKX | 0.26 | 5.972 | 1 | 0.019 | 1.85 | 1324.8

Higher ipTM scores do not consistently correspond to stronger predicted binding affinity in this dataset. For example, WHNYAAAIRLKX (ipTM 0.39) has a predicted binding affinity of 5.972, while WRYYVAALEWWE (ipTM 0.28) shows a higher affinity of 7.750 despite its lower structural confidence score. This suggests that ipTM and binding affinity capture different aspects of peptide-target interaction and should be considered together rather than in isolation.
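The lack of agreement between the two metrics can be quantified with a correlation coefficient. The sketch below computes a Pearson correlation by hand from the four ipTM and affinity values in the table; with only four peptides this is purely descriptive, and the function name is an illustrative choice.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values from the PeptiVerse / AlphaFold table (peptides 0 to 3)
iptm = [0.33, 0.28, 0.39, 0.26]
affinity = [5.791, 7.750, 5.972, 5.972]

r = pearson(iptm, affinity)
print(f"Pearson r between ipTM and predicted affinity: {r:.2f}")
```

A negative or near-zero value here supports treating ipTM and predicted affinity as separate lines of evidence rather than interchangeable scores.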

All four generated peptides are highly soluble and show low hemolysis probabilities, indicating a favourable therapeutic safety profile. WHNYAAAIRLKX stands out as the most balanced candidate; it achieves the highest ipTM score (0.39), a competitive predicted binding affinity (5.972), perfect solubility, the lowest hemolysis probability in the dataset (0.019), and a positive net charge (1.85) which may favour interaction with the negatively charged surface regions of SOD1. However, it also has an unknown terminal amino acid, which is a problem for synthesis. Alternatively, WRYYVAALEWWE could also be a candidate due to its higher binding affinity and absence of X residue. Given its higher structural confidence (approx. 0.4) compared to the others, WHNYAAAIRLKX would be the most promising candidate to advance for further investigation.

Part 4: Generate Optimized Peptides with moPPIt

To edit the code, given the sliders were static, I used:

###########################################################################################################################################################
# Set the new selections

SELECTED_OBJECTIVES = ["Hemolysis", "Solubility", "Affinity", "Motif"]
OBJECTIVE_WEIGHTS_DICT = {
    "Hemolysis": 1.0,
    "Solubility": 1.0,
    "Affinity": 1.5,
    "Motif": 1.0
}
OBJECTIVE_WEIGHTS_LIST = [1.0, 1.0, 1.5, 1.0]
OBJECTIVES_CFG = {
    "selected_objectives": SELECTED_OBJECTIVES,
    "weights_dict": OBJECTIVE_WEIGHTS_DICT,
    "weights_list": OBJECTIVE_WEIGHTS_LIST,
    "motif_positions": "1-10"
}

print("Saved:")
print("SELECTED_OBJECTIVES =", SELECTED_OBJECTIVES)
print("OBJECTIVE_WEIGHTS_DICT =", OBJECTIVE_WEIGHTS_DICT)
print("OBJECTIVE_WEIGHTS_LIST =", OBJECTIVE_WEIGHTS_LIST)
print("motif_positions =", OBJECTIVES_CFG["motif_positions"])

###########################################################################################################################################################

Binder | Hemolysis | Solubility | Binding Affinity | Motif
KKKKYITECLVM | 0.9794966895133257 | 0.6666666269302368 | 7.177585601806641 | 0.6455004811286926
ECYYVWTEQGTT | 0.9729829281568527 | 0.8333333134651184 | 6.359397888183594 | 0.5219646692276001
KLKQKKFTEKVC | 0.9676016941666603 | 0.7500000 | 6.8997617 | 0.7254035472869873
SFQKINEKVKNA | 0.9103980 | 0.6666666269302368 | 6.861388206481934 | 0.6815867

Peptides generated with moPPIt differ from those generated by PepMLM through controlled, residue-specific generation targeting positions 1-10 of the A4V mutant SOD1 sequence, with simultaneous optimisation of hemolysis, solubility, affinity, and motif objectives.

The four generated peptides show different balances across the optimised properties. KKKKYITECLVM achieves the highest affinity score (7.178) and a strong hemolysis score (0.979), though its solubility is moderate (0.667). KLKQKKFTEKVC shows the highest motif score (0.725) alongside competitive affinity (6.900), suggesting strong localisation near the targeted N-terminal residues. ECYYVWTEQGTT offers the best solubility (0.833) but the lowest affinity and motif scores of the four. SFQKINEKVKNA presents a balanced profile across all objectives with the lowest hemolysis score (0.910).

Compared to the PepMLM-generated peptides, the moPPIt peptides benefit from explicit multi-objective optimisation, producing sequences with higher predicted affinities and targeted motif engagement rather than purely sequence-conditioned sampling.

Before advancing toward therapeutic development, these peptides would require further evaluation through in vitro binding assays to confirm SOD1 interaction, proteolytic stability testing to assess degradation resistance, and cytotoxicity screening to verify safety before progressing to in vivo studies. Special emphasis should be placed on the haemolysis, given the high scores generated by this model; this may or may not indicate high toxicity.

Part C: Final Project: L-Protein Mutants

After running the code for analysis between predicted mutations and the experimental dataset, there is little to no overlap.

Process

The model evaluated mutations using a log-likelihood ratio (LLR) derived from the probability distribution predicted by the ESM-2 protein language model. Mutations were then ranked by their LLR score, and predicted mutations were compared with experimental mutations using dataset merging.
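The scoring and comparison steps can be sketched as follows. This assumes the LLR is the natural log of the ratio between the model's probability for the mutant residue and for the wild-type residue at the same position; the mutation names, probabilities, and lysis values below are hypothetical placeholders, not values from ESM-2 or the experimental dataset.

```python
import math

def log_likelihood_ratio(p_mutant, p_wild_type):
    """LLR > 0 means the model prefers the mutant residue over wild type."""
    return math.log(p_mutant / p_wild_type)

# Hypothetical (p_mutant, p_wild_type) pairs, for illustration only
model_predictions = {
    "mutA": (0.40, 0.05),
    "mutB": (0.10, 0.20),
    "mutC": (0.30, 0.10),
}
# Hypothetical experimental lysis calls (1 = lysis, 0 = no lysis)
experimental_lysis = {"mutB": 1, "mutC": 0}

# Rank mutations by LLR, highest first
ranked = sorted(
    ((name, log_likelihood_ratio(pm, pw)) for name, (pm, pw) in model_predictions.items()),
    key=lambda item: item[1],
    reverse=True,
)

# Keep only mutations present in both sources (an inner merge)
overlap = [(name, round(score, 3), experimental_lysis[name])
           for name, score in ranked if name in experimental_lysis]
print(overlap)
```

The inner merge is what exposes how few predicted mutations also appear in the experimental data.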

The mutation C29R is present in both datasets. Experimental data shows no lysis activity, highlighting the difficulty in modelling predictions, as they do not always correspond to functional outcomes.

Initially I tried to generate the full-length sequences in Excel by updating mutations at specific positions:

HW5C_Excel1 HW5C_Excel1

This was very tedious, so I switched to Python on the desktop instead of editing manually in Excel. I wrote a script that applies selected point mutations to the wild-type sequence by modifying specific residue positions. The code I used is below:

###########################################################################################################################################################

#### HTGAA W5_HW_Part C: Multimer Assembly ####

## Base sequence
base_seq = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"

## Mutations based on experimental dataset and model
# K50L was done manually in MS Excel
mutations = {
    "Variant1": {"50": "L"},
    "Variant2": {"39": "L"},
    "Variant3": {"29": "R"},
    "Variant4": {"13": "L"},
    "Variant5": {"15": "A"}
}

## Generate list
def apply_mutation(seq, mutation_dict):
    seq_list = list(seq)
    for pos, aa in mutation_dict.items():
        seq_list[int(pos) - 1] = aa  # -1 because Python is 0-indexed
    return "".join(seq_list)

# Store sequences
variant_sequences = {}
for name, mut in mutations.items():
    variant_sequences[name] = apply_mutation(base_seq, mut)

## Save variants in text file
with open("Af2_variants.txt", "w") as f:
    for name, seq in variant_sequences.items():
        f.write(f"{name}: {seq}\n")

###########################################################################################################################################################
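A variation on the script above is to write each mutation in standard notation (for example K50L) and check the wild-type residue before substituting, so that an off-by-one position fails loudly. This is a sketch rather than the script that was actually run; the function name `apply_point_mutation` is an illustrative assumption.

```python
import re

BASE_SEQ = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"

def apply_point_mutation(seq, mutation):
    """Apply a mutation written as <wild type><1-indexed position><new residue>,
    e.g. 'K50L', after checking that the wild-type residue matches."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Cannot parse mutation: {mutation}")
    wild_type, position, new_residue = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(seq):
        raise ValueError(f"Position {position} is outside the sequence")
    if seq[position - 1] != wild_type:
        raise ValueError(
            f"{mutation}: expected {wild_type} at position {position}, found {seq[position - 1]}"
        )
    return seq[:position - 1] + new_residue + seq[position:]

for mutation in ["K50L", "Y39L", "C29R", "P13L", "S15A"]:
    print(mutation, apply_point_mutation(BASE_SEQ, mutation))
```

All five mutations used here pass the wild-type check against the base sequence, which confirms the position numbering is consistent between the model output and the experimental table.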

Position of the mutation in L | Base Pair Changed | Amino Acid Position | Amino Acid Change | Lysis | Protein Levels
38 | C->T | 13 | P->L | 1 | 1
38 | C->T | 13 | P->L | 1 | 1
43 | T->G | 15 | S->A | 1 | 1
52 | A->G | 18 | R->G | 1 | 1
53 | G->T | 18 | R->I | 1 | 1

From the experimental dataset, I chose the following:

Position of the mutation in L | Base Pair Changed | Amino Acid Position | Amino Acid Change | Lysis | Protein Levels
38 | C->T | 13 | P->L | 1 | 1
43 | T->G | 15 | S->A | 1 | 1

From the model, I then selected mutations with the highest LLR scores as they are the most strongly predicted from the model.

Position | Wild_Type_AA | Mutation_AA | LLR_Score
50 | K | L | 2.561468
29 | C | R | 2.395427
39 | Y | L | 2.24178

K50L and Y39L introduce hydrophobic residues that can help stabilize packed or core regions of the protein, consistent with the tendency for hydrophobic side chains to support structural integrity [1]. C29R adds a charged residue in a position the model favours, which may create new stabilizing interactions without disrupting folding [2]. Together these selections balance predicted stability, polarity, and structural compatibility, supporting the goal of designing functional L protein variants [3].

References

  1. Pace CN, Fu H, Fryar KL, Landua J, Trevino SR, Shirley BA, et al. Contribution of hydrophobic interactions to protein stability. J Mol Biol. 2011;408(3):514-28.

  2. Doig AJ, Williams DH. Is the hydrophobic effect stabilizing or destabilizing in proteins? The contribution of disulphide bonds to protein stability. J Mol Biol. 1991;217(2):389-98.

  3. Hendsch ZS, Tidor B. Do salt bridges stabilize proteins? A continuum electrostatic analysis. Proteins. 1994;20(1):1-10.

Week 6 HW: Genetic Circuits Part I: Assembly Technologies

Assignment: DNA Assembly

1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?

The template DNA is the mUAV plasmid, used at 20 ng/µL, with 0.8 µL added to the reaction. The primers are “colour forward” and “colour reverse”. Given that the primer stock concentration is 5 µM, using 2.5 µL of each primer in a total reaction volume of 25 µL results in a final primer concentration of 0.5 µM.

The Phusion HF Master Mix is a solution containing DNA polymerase, nucleotides (dNTPs), and a buffer with magnesium ions, which together enable accurate and efficient DNA amplification in PCR [1]. It was added at 12.5 µL from a 2X stock, resulting in a final concentration of 1X in the reaction.

Nuclease-free water is added (6.8 µL) to bring the total reaction volume up to 25 µL and ensure all components are at the correct final concentrations.
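The final concentrations follow from the dilution relation C_final = C_stock × V_added / V_total. The sketch below works through the numbers for this reaction, assuming a primer stock of 5 µM (the value that reproduces the stated 0.5 µM final concentration) and the nominal 25 µL total; the variable names are illustrative.

```python
NOMINAL_TOTAL_UL = 25.0

# Volumes pipetted into the reaction, in microlitres
volumes_ul = {
    "template (20 ng/uL)": 0.8,
    "forward primer (5 uM)": 2.5,
    "reverse primer (5 uM)": 2.5,
    "Phusion master mix (2X)": 12.5,
    "nuclease-free water": 6.8,
}

def final_concentration(stock, volume_ul, total_ul=NOMINAL_TOTAL_UL):
    """Dilution: C_final = C_stock * V_added / V_total."""
    return stock * volume_ul / total_ul

primer_final_um = final_concentration(5.0, 2.5)     # micromolar
master_mix_final_x = final_concentration(2.0, 12.5)  # fold concentration
template_mass_ng = 20.0 * 0.8                        # ng of plasmid added
pipetted_total_ul = sum(volumes_ul.values())

print(f"Primer: {primer_final_um:.2f} uM each")
print(f"Master mix: {master_mix_final_x:.1f}X")
print(f"Template: {template_mass_ng:.0f} ng")
print(f"Sum of listed volumes: {pipetted_total_ul:.1f} uL")
```

The listed volumes add up to 25.1 µL, which is close enough to the nominal 25 µL that the final concentrations are effectively unchanged.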

References

  1. New England Biolabs. Phusion High-Fidelity PCR Master Mix with HF Buffer [Internet]. Available from: https://www.neb.com/en-gb/products/m0531-phusion-high-fidelity-pcr-master-mix-with-hf-buffer

2. What are some factors that determine primer annealing temperature during PCR?

The annealing temperature depends primarily on the melting temperature (Tm) of the primers; annealing is generally done about 5 °C below the primers’ Tm [1]. Other factors include primer length, base composition (GC content), and the salt and ion concentrations in the reaction (such as Mg2+ and monovalent salts).
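As a rough illustration of how length and GC content feed into Tm, the Wallace rule estimates Tm as 2 °C per A/T plus 4 °C per G/C for short oligos. It ignores salt and primer concentration, so a vendor calculator is the better source for real primers. The sketch below applies it with the 5 °C offset mentioned above; the two primer sequences are hypothetical and are not the primers used in this lab.

```python
def wallace_tm(primer):
    """Wallace rule estimate of Tm in degrees C: 2*(A+T) + 4*(G+C).
    A rough approximation intended for short oligonucleotides."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    if at + gc != len(p):
        raise ValueError("Primer contains characters other than A, C, G, T")
    return 2 * at + 4 * gc

def suggested_annealing(forward, reverse, offset=5):
    """Annealing temperature set `offset` degrees below the lower primer Tm."""
    return min(wallace_tm(forward), wallace_tm(reverse)) - offset

# Hypothetical 20-nt primers, for illustration only
forward_primer = "ATGCATGCATGCATGCATGC"  # 10 A/T, 10 G/C
reverse_primer = "GGCCATATGGCCATATGGCC"  # 8 A/T, 12 G/C

print(wallace_tm(forward_primer), wallace_tm(reverse_primer))   # 60 64
print(suggested_annealing(forward_primer, reverse_primer))      # 55
```

Using the lower of the two Tm values keeps both primers able to anneal at the chosen temperature.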

References

  1. Integrated DNA Technologies. How do you calculate the annealing temperature for PCR? [Internet]. Available from: https://eu.idtdna.com/pages/support/faqs/how-do-you-calculate-the-annealing-temperature-for-pcr-?

3. There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.

PCR | Restriction Enzyme Digests
Amplifies any region of DNA | Requires presence of specific restriction sites
Can introduce mutations, insertions, deletions, and overhangs for Gibson assembly | The enzymes target and cleave near these sites
Highly flexible and does not require specific sequences to introduce mutations [1] | Ends produced are either sticky or blunt
 | They allow for precise insertion of DNA fragments into vectors [2]

For this protocol, which involves cloning, PCR is preferable because it allows precise amplification of DNA fragments while introducing the mutations and overlaps required for Gibson assembly. In contrast, restriction enzyme digestion is limited to existing recognition sites and does not easily introduce sequence changes.

References

  1. National Human Genome Research Institute. Polymerase chain reaction (PCR) [Internet]. Bethesda (MD): NHGRI. Available from: https://www.genome.gov/genetics-glossary/Polymerase-Chain-Reaction-PCR

  2. Thermo Fisher Scientific. Restriction enzyme basics [Internet]. Waltham (MA): Thermo Fisher Scientific. Available from: https://www.thermofisher.com/uk/en/home/life-science/cloning/cloning-learning-center/invitrogen-school-of-molecular-biology/molecular-cloning/restriction-enzymes/restriction-enzyme-basics.html#:~:text=In%20cloning%2C%20restriction%20enzymes%20enable%20precise%20DNA,fundamental%20principle%20of%20recombinant%20DNA%20cloning%20technology.

4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?

DpnI is a restriction enzyme that selectively digests methylated DNA, leaving unmethylated PCR products untouched [1]. Therefore, according to the lab protocol, adding 1 µL of DpnI to each sample digests the methylated mUAV template, so that only newly synthesized PCR fragments are carried into the following Gibson Assembly step.

References

  1. University of Wisconsin–Madison. Lab 4: Background [Internet]. Biochemistry 551 Online Lab Manual; Available from: https://wisc.pb.unizin.org/biochemistry551online/chapter/lab-4-background/

5. How does the plasmid DNA enter the E. coli cells during transformation?

The most common forms of transformation are:

(i) Heat shock: Creating pores in cell wall through abrupt temperature changes

(ii) Electroporation: Generating pores via electrical voltage

These methods cause the wall to open up and create pores in the cell membrane, after which plasmids enter the bacteria through diffusion. After the initial heat/electric shock and entry, the pores eventually close up. Inside the bacteria, the plasmids replicate.

6. Describe another assembly method in detail (such as Golden Gate Assembly)

  1. Explain the other method in 5 - 7 sentences plus diagrams (either handmade or online).

  2. Model this assembly method with Benchling or Asimov Kernel!

(1) Golden Gate Assembly is a molecular cloning method that uses only the sequential or simultaneous activities of a single type IIS restriction enzyme and T4 DNA ligase [1]. This enables multiple inserts to be placed into the vector backbone in a single reaction.

Type IIS enzymes include BsaI, BsmBI, and BbsI [1, 2]. These cut DNA at a defined distance away from their recognition sites, rather than within them. This feature enables the generation of user-defined overhangs (fusion sites), which can be designed to be unique and complementary, guiding the ordered ligation of DNA parts with high specificity [2].

The reaction is done in one tube, where restriction digestion and ligation with T4 DNA ligase take place together, increasing efficiency and reducing steps. Importantly, the recognition sites are removed during assembly, resulting in a seamless DNA construct [2]. Cycling between ligation (16 °C) and digestion (37 °C) repeatedly re-cuts incorrect assemblies that still carry a recognition site, which enriches the desired product.

I tried to capture this whole process in this illustrated diagram:

HW6_PaperImage_1 HW6_PaperImage_1

To start, I tried entering J23100 Promoter sequence (35 nucleotides) into Benchling:

HW6_PaperImage_2 HW6_PaperImage_2

To facilitate Golden Gate Assembly, the promoter was then designed with flanking BsaI sites that allow the enzyme to create unique 4-base overhangs, ensuring the fragment inserts into the backbone in the correct orientation without leaving a ‘scar’ sequence. Therefore, I input the following sequence (generated with some help from Gemini):

GGTCTCATCCCttgacggctagctcagtcctaggtacagtgctagcTACTTGAGACC
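To see which overhangs this design produces, the BsaI cut can be simulated in a few lines. The sketch assumes the standard BsaI geometry (recognition site GGTCTC, cutting 1 nt downstream on the top strand and 5 nt downstream on the bottom strand, leaving a 4-nt overhang) and a part flanked by one forward and one reverse site; the function name is illustrative and this is not how Benchling performs the digest.

```python
def bsai_overhangs(seq):
    """Return (left_overhang, insert, right_overhang), read on the top strand,
    for a part flanked by a forward BsaI site (GGTCTC) and a reverse one (GAGACC)."""
    s = seq.upper()
    fwd = s.find("GGTCTC")    # forward site on the top strand
    rev = s.rfind("GAGACC")   # reverse-complement site near the 3' end
    if fwd == -1 or rev == -1 or rev < fwd + 16:
        raise ValueError("Expected a forward and a reverse BsaI site flanking the part.")
    left_start = fwd + 6 + 1   # skip the site plus 1 spacer nucleotide
    right_start = rev - 1 - 4  # 4-nt overhang, then 1 spacer, then the site
    left_overhang = s[left_start:left_start + 4]
    right_overhang = s[right_start:right_start + 4]
    insert = s[left_start + 4:right_start]
    return left_overhang, insert, right_overhang

PART = "GGTCTCATCCCttgacggctagctcagtcctaggtacagtgctagcTACTTGAGACC"
left, insert, right = bsai_overhangs(PART)
print(left, len(insert), right)  # TCCC 35 TACT
```

Comparing these two overhangs with the ones the destination backbone exposes after BsaI digestion is one way to narrow down a sticky-end mismatch.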

I was still having problems, so I added:

https://www.addgene.org/44335/ (as per suggestion from Gemini, given it is used in the CIDAR MoClo Parts Kit | https://www.addgene.org/kits/densmore-cidar-moclo/)

Now I’m getting these sticky end errors, too tired to solve it. But also, genuinely a bit lost and would like some more support on it.

HW6_PaperImage_3 HW6_PaperImage_3

References

  1. New England Biolabs. Golden Gate Assembly [Internet]. Ipswich (MA): New England Biolabs. Available from: https://www.neb-online.de/en/cloning-synthetic-biology/dna-assembly/golden-gate-assembly/

  2. Laboratory Notes. Golden Gate Assembly [Internet]. Available from: https://www.laboratorynotes.com/golden-gate-assembly/

Week 7 Genetic Circuits Part II

Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?

The advantages of IANNs over traditional circuits include:

(i) Continuous processing which allows them to constantly measure changes in concentration gradients of cellular inputs rather than just their absolute presence or absence.

(ii) Relatively easier to scale up. That is, new inputs can be programmed by integrating additional weighted connections to existing nodes without completely rewiring the circuit.

(iii) Better adapted to non-linear classifications. Given that IANNs process inputs continuously, as opposed to using Boolean (On/Off) logic, they can respond better to complex cell-state classification (e.g. distinguishing highly specific cell types).

Britto Bisso F, Aguilar R, Shree D, Zhu Y, Espinoza M, Diaz B, Cuba Samaniego C. Pattern recognition in living cells through the lens of machine learning. Open Biol. 2025 Jul 16;15(7):240377. doi: 10.1098/rsob.240377

Moorman A. Machine learning inspired synthetic biology: neuromorphic computing in mammalian cells [thesis]. Cambridge (MA): Massachusetts Institute of Technology; 2020. Available from: https://dspace.mit.edu/handle/1721.1/129864

2. Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.

From researching papers related to the application of IANNs, I came across some interesting papers exploring the use of bacteria as biosensors in soil or other agricultural media. For example, a paper by Del Valle and colleagues looked to engineer modular genetic circuits that allow microbes to process complex, multi-variable environmental signals from the soil matrix and dynamically convert them into measurable cellular outputs [1].

A potential idea could be to use engineered modular circuits to clean up arsenic in soil, where the inputs would be:

X1 : Concentration of arsenic, to be measured by proteins such as ArsR, a naturally occurring arsenic-responsive transcription factor often borrowed from E. coli or Chromobacterium violaceum [2].

X2 : Soil pH, measured by pH-responsive promoters. As demonstrated by Bañares et al. [3], genetic sensors can be used to dynamically regulate cellular outputs based on changing pH levels. Here, we use pH sensors to create a “bandpass filter” for the circuit.

Process:

  • The IANN will serve as a weighted classifier that computes whether arsenic is high AND soil pH is within a safe zone
  • OUTPUT: If conditions are met, the network activates the ArsR protein.
  • If the soil pH rises above the threshold (too high), the IANN turns OFF
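The input/output behaviour described above can be sketched as a single weighted node. This is only a toy numerical model, not a description of a real genetic circuit: the Hill-type arsenic response, the pH band (6 to 8), the weights and the bias are all made-up values chosen so that the output is high only when arsenic is high AND pH sits inside the band.

```python
import math


def sigmoid(x: float) -> float:
    """Smooth 0-to-1 activation."""
    return 1.0 / (1.0 + math.exp(-x))


def arsenic_signal(arsenic: float, k: float = 10.0, n: float = 2.0) -> float:
    """Hill-type response for X1 (0 to 1); k and n are hypothetical."""
    return arsenic**n / (k**n + arsenic**n)


def ph_bandpass(ph: float, low: float = 6.0, high: float = 8.0,
                steepness: float = 4.0) -> float:
    """Band-pass response for X2: near 1 inside [low, high], near 0 outside."""
    return sigmoid(steepness * (ph - low)) * sigmoid(steepness * (high - ph))


def iann_output(arsenic: float, ph: float, w_arsenic: float = 10.0,
                w_ph: float = 10.0, bias: float = -15.0) -> float:
    """Weighted sum of both input signals passed through one activation."""
    weighted_sum = (w_arsenic * arsenic_signal(arsenic)
                    + w_ph * ph_bandpass(ph)
                    + bias)
    return sigmoid(weighted_sum)


# Arbitrary example conditions: (arsenic level, soil pH)
for arsenic, ph in [(50, 7.0), (50, 10.0), (0, 7.0)]:
    state = "ON" if iann_output(arsenic, ph) > 0.5 else "OFF"
    print(f"arsenic={arsenic}, pH={ph}: {state}")
```

Because the bias is more negative than either weight alone can overcome, one satisfied input is not enough to switch the node ON, which is what gives the AND-like behaviour. Changing the weights changes how strictly each condition is enforced without changing the structure of the node.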
  1. Del Valle, I., Fulk, E. M., Kalvapalle, P., Silberg, J. J., Masiello, C. A., & Stadler, L. B. (2021). Translating New Synthetic Biology Advances for Biosensing Into the Earth and Environmental Sciences. Frontiers in Microbiology, 11. https://doi.org/10.3389/fmicb.2020.618373

  2. Berset Y, Merulla D, Joublin A, Hatzimanikatis V, van der Meer JR. Mechanistic modeling of genetic circuits for ArsR arsenic regulation. ACS Synth Biol. 2017;6(5):862–874. doi:10.1021/acssynbio.6b00364

  3. Bañares AB, Valdehuesa KNG, Ramos KRM, Nisola GM, Lee WK, Chung WJ. A pH-responsive genetic sensor for the dynamic regulation of D-xylonic acid accumulation in Escherichia coli. Applied Microbiology and Biotechnology. 2020 Mar;104(5):2097-2108. doi: 10.1007/s00253-019-10297-0.

3. Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation.

Draw a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.

Still trying to read and think this through

Assignment Part 2: Fungal Materials

1. What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?

Some examples include:

(i) Mycelium bio-composites, which include fungi-derived leather substitutes. Their advantages are that they avoid the killing of animals and avoid microplastic pollution in the long term.

(ii) In architecture and construction, there are mycelium panels and acoustic tiles. Companies that utilise mycelium include Biohm (https://www.biohm.co.uk/mycelium)

(iii) Protective packaging. MycoComposite is used by companies as a substitute for Styrofoam.

Bentangan M, Greetham D, Ross R, Kaplan-Bie L. Recent technological innovations in mycelium materials as leather substitutes: a patent review. Front Bioeng Biotechnol. 2023;11:1204861. https://doi.org/10.3389/fbioe.2023.1204861

Advantages

  • Animal free production

  • Quick turnaround, given that fungi grow quickly

  • Minimise agricultural waste

  • Low density and eco-friendly for building materials

Disadvantages

  • Easily biodegradable

  • Production scalability is low compared to traditional counterparts

  • Sensitivity to moisture may reduce applicability

2. What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?

Week 9 — Cell-Free Systems

General homework questions

1. Explain the main advantages of cell-free protein synthesis over traditional in vivo methods, specifically in terms of flexibility and control over experimental variables. Name at least two cases where cell-free expression is more beneficial than cell production.

Advantages of cell-free protein synthesis (CFPS) over traditional in-vivo methods:

(i) Greater flexibility and control: Because no cells need to be kept “alive” and there is no cell wall acting as a barrier, the reaction can be manipulated directly in real time, for example by adding chaperones or cofactors [1].

(ii) Rapid development of prototypes: Where in-vivo methods require cloning DNA into plasmids and transforming host cells, CFPS allows us to essentially ‘drag and drop’ DNA as raw PCR products and observe protein expression in short periods of time (e.g. hours) [2].

Cases where CFPS provides benefits over in-vivo methods:

(i) Expression of toxic or dangerous products such as antimicrobial peptides, potent neurotoxins, or complex membrane proteins. In vivo, the host cell would usually ‘die’ before reaching a large protein yield. Because a CFPS reaction is not alive, it can synthesize toxic therapeutics and viral vectors that would be impossible to harvest from living cultures [2].

(ii) Incorporation of non-standard amino acids (nsAAs). The open environment lets you easily swap natural amino acids for synthetic ones, enabling efficient, site-specific incorporation of nsAAs without competing with host metabolism [2].

  1. Khambhati K, Bhattacharjee G, Gohil N, Braddick D, Kulkarni V, Singh V. Exploring the potential of cell-free protein synthesis for extending the abilities of biological systems. Front Bioeng Biotechnol. 2019;7:248.
  2. Silverman AD, Kelley-Loughnane N, Jewett MC. Cell-free gene expression: an expanded repertoire of applications. Nat Rev Genet. 2020;21(3):151-70.

2. Describe the main components of a cell-free expression system and explain the role of each component.

(i) Cell extract (machinery): Derived from lysed cells (like E. coli), this extract provides the core transcriptional and translational machinery, including ribosomes and RNA polymerase, required to build the protein

(ii) Genetic template (blueprint): The DNA plasmid or RNA template that contains the specific gene sequence of the target protein we want to express

(iii) Nucleotides and amino acids (building blocks): Nucleotides—Adenosine triphosphate (ATP), Guanosine triphosphate (GTP), Cytidine triphosphate (CTP), and Uridine triphosphate (UTP)—are supplied for ribonucleic acid (RNA) synthesis (transcription), while transfer RNAs (tRNAs) pair with messenger RNA (mRNA) to deliver the amino acids necessary for protein synthesis (translation)

(iv) Energy systems: immediate energy sources like adenosine triphosphate (ATP) are paired with intermediate metabolites like 3-phosphoglycerate (3-PGA) or phosphoenolpyruvate (PEP) to continuously regenerate energy and maintain reaction stability.

(v) Buffers & cofactors (Environmental conditions):

HEPES: 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (buffer to maintain a stable pH for optimal enzyme activity)

Mg: Magnesium (cofactor for transcription and translation enzymes)

DTT: Dithiothreitol (reducing agent that maintains a non-oxidizing environment to protect protein residues)

Sodium oxalate: Na₂C₂O₄ (prevents magnesium precipitation, stabilizing the ionic balance)

Week 10 — Imaging and measurement

Homework: Waters Part I — Molecular Weight

We will analyze an eGFP standard on a Waters Xevo G3 QTof MS system to determine the molecular weight of intact eGFP and observe its charge state distribution in the native and denatured (unfolded) states. The conditions for LC-MS analysis of intact protein cause it to unfold and be detected in its denatured form (due to the solvents and pH used for analysis).

1. Based on the predicted amino acid sequence of eGFP (see below) and any known modifications, what is the calculated molecular weight? You can use an online calculator like the one at https://web.expasy.org/compute_pi/

eGFP Sequence: MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE HHHHHH

Note: This contains a His-purification tag (HHHHHH) and a linker (the LE before it).

After entering the eGFP sequence into the online calculator, I get:

Theoretical pI/Mw: 5.90 / 28006.60
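As a rough cross-check of the calculator result, the average molecular weight can also be summed by hand from residue masses. The sketch below is my own minimal version (the names `RESIDUE_MASS` and `average_mw` are illustrative). It assumes standard average residue masses plus one water for the free termini, and it does not model any modifications, so it should land close to the calculator value rather than reproduce it exactly.

```python
# Average residue masses in Da (amino acid minus one water lost in the peptide bond).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water added back for the N- and C-termini


def average_mw(sequence: str) -> float:
    """Average molecular weight (Da) of an unmodified protein sequence."""
    residues = "".join(sequence.split()).upper()  # drop the spacing between blocks
    return sum(RESIDUE_MASS[aa] for aa in residues) + WATER


EGFP_HIS = (
    "MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL "
    "VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV "
    "NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD "
    "HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE HHHHHH"
)

print(f"Residues: {len(''.join(EGFP_HIS.split()))}")
print(f"Average MW: {average_mw(EGFP_HIS):.2f} Da")
```

Keeping the sequence as a spaced string and stripping whitespace inside the function means it can be pasted exactly as given in the assignment.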

2. Calculate the molecular weight of the eGFP using the adjacent charge state approach described in the recitation. Select two charge states from the intact LC-MS data (Figure 1) and:

I. Determine for each adjacent pair of peaks using:

From Figure 1 I picked:

m/z_n: 933.7349

m/z_n+1: 903.7148

Plugging in values:

z = 903.7148 / (933.7349 - 903.7148)

z = 30.1037

II. Determine the MW of the protein using the relationship between m/z_n , MW, and z.

Using the derivation of the deconvolution for n:

n = (m/z_n+1 - 1) / (m/z_n - m/z_n+1)

Top: m/z_n+1 - 1 = 903.7148 - 1 = 902.7148

Bottom: m/z_n - m/z_n+1 = 933.7349 - 903.7148 = 30.0201

Therefore,

n = Top/Bottom = 902.7148 / 30.0201 ≈ 30.07035

Therefore,

MW = (n × m/z_n) - n

MW = (30.0703462 × 933.7349) - 30.07035

MW = 28047.66 Da

III. Calculate the accuracy of the measurement using the deconvoluted MW from 2.2 and the predicted weight of the protein from 2.1 using:

Accuracy = |28047.66 - 28006.60| Da / 28006.60 Da

= 0.001466131 (0.15%)
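The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch with function names of my own choosing. It follows the text in using a proton mass of 1 Da, and it also shows a variant that rounds n to a whole number first, since charge states are integers. Which of the two the recitation intends is an assumption on my part.

```python
def charge_from_adjacent(mz_n: float, mz_n1: float, proton: float = 1.0) -> float:
    """Charge n of the higher-m/z peak, from two adjacent peaks (mz_n > mz_n1)."""
    return (mz_n1 - proton) / (mz_n - mz_n1)


def mw_from_peak(mz: float, charge: float, proton: float = 1.0) -> float:
    """Neutral mass from one peak: MW = n * (m/z) - n * proton."""
    return charge * mz - charge * proton


def relative_error(measured: float, predicted: float) -> float:
    """Accuracy as |measured - predicted| / predicted."""
    return abs(measured - predicted) / predicted


mz_n, mz_n1 = 933.7349, 903.7148   # peaks picked from Figure 1
predicted = 28006.60               # calculator value from question 1

n = charge_from_adjacent(mz_n, mz_n1)
mw_raw = mw_from_peak(mz_n, n)              # non-integer n, as in the text
mw_int = mw_from_peak(mz_n, round(n))       # integer charge variant

print(f"n = {n:.5f}")
print(f"MW (raw n)     = {mw_raw:.2f} Da, error = {relative_error(mw_raw, predicted):.4%}")
print(f"MW (integer n) = {mw_int:.2f} Da, error = {relative_error(mw_int, predicted):.4%}")
```

The raw-n line reproduces the hand calculation, while the integer-n line gives a somewhat lower mass. Averaging the result over several adjacent peak pairs would be a natural next step for reducing the effect of peak-picking error.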

3. Can you observe the charge state for the zoomed-in peak in the mass spectrum for the intact eGFP? If yes, what is it? If no, why not?

No, it is difficult to determine the charge state from the zoomed-in peak by itself. The isotopic peaks are not clearly resolved, so the spacing between them that is needed to identify z cannot be measured.

Homework: Waters Part II — Secondary/Tertiary structure

Week 11 — Bioproduction & Cloud Labs

Subsections of Labs

Week 1 Lab: Pipetting

cover image cover image

Subsections of Projects

Individual Final Project

cover image cover image

Group Final Project

cover image cover image