Subsections of Florian Kroh — HTGAA Spring 2026

Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

    Homework Week 1: Class Assignment Biological Engineering Application First Steps towards “Intelligence in a (warehouse)-dish” Guided by the vision of building a biological general computing system, the goal of the proposed tool is to provide a minimal, yet replicable brain organoid based system, that can be engineered to exhibit controllable, learning-like signal processing behaviour. The system consists of 3 conceptual parts (input - computation - output), that manifest in 2 integrated physical devices.

  • Week 2 HW: DNA Read, Write & Edit

    Part 0: Basics of Gel Electrophoresis Watch Week 2 Lecture (Zoom) Watch Week 2 Recitation (Zoom) Watch BioBootcamp Day 1 - Day 3 (Zoom) Part 1: Benchling & In-silico Gel Art Make a free account at benchling.com Import the Lambda DNA. Simulate Restriction Enzyme Digestion with the following Enzymes: EcoRI HindIII BamHI KpnI EcoRV SacI SalI Artwork After struggling quite some time with the task of creating artwork with the limited amount of restriction enzymes, in the end decided to stick to a relatively easy and repetitive pattern that with a little imagination has a lot of versatile interpretations: It can be two friends hanging out It can be DNA (or at least a rought estimation of the firrst two loops) To help you visualize it a bit better i created some generative AI art

  • Week 3 HW: Lab Automation

    Python Script for Opentrons Artwork Review this week’s recitation and this week’s lab for details on the Opentrons and programming it. Generate an artistic design using the GUI at opentrons-art.rcdonovan.com. I generated a quick design using the above mentioned tool: BioPunk Initials See the: Using the coordinates from the GUI, follow the instructions in the HTGAA26 Opentrons Colab to write your own Python script which draws your design using the Opentrons. You may use AI assistance for this coding — Google Gemini is integrated into Colab (see the stylized star bottom center); it will do a good job writing functional Python, while you probably need to take charge of the art concept. If you’re a proficient programmer and you’d rather code something mathematical or algorithmic instead of using your GUI coordinates, you may do that instead. If the Python component is proving too problematic even with AI and human assistance, download the full Python script from the GUI website and submit that: If you use AI to help complete this homework or lab, document how you used AI and which models made contributions. As its good practice in Software Engineering, not to reinvent the wheel, I had a look at the provided examples, and figured Example 7 would be a good basis for my requirements. Nevertheless, significant updates needed to be done to make the code useful for my purposes. These include:

  • Week 4 HW: Protein Design I

    Part A: Conceptual Questions Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip) How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons) Assuming that approx 20-25% of the weight is protein, that leaves us with 100-125g protein in a 500 gram meat piece. The average amino acid molar mass is approx. 100 g/mol. With Avogadro’s number 6.022×1023 the number of molecules is between: 1 x 6.022×1023 = 6.022×1023 and 1.25 x 6.022×1023 = 7.5 ×1023 (approx)

  • Week 5 HW: Protein Design II

    Part A: SOD1 Binder Peptide Design (From Pranam) Part 1: Generate Binders with PepMLM Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation. The Protein can be found with this link The Protein sequence is:

  • Week 6 HW: Genetic Circuits I

    DNA Assembly Answer these questions about the protocol in this week’s lab: What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose? I didn’t find anything pertaining the matter in the protocol itself, though a quick websearch (Link 1, Link 2), revealed that the PCR Master Mix contains 4 main ingreadients.

  • Week 7 HW: Genetic Circuits Part II

    Part 1 What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions? The key advantage of intracellular analog neural networks (IANNs) over traditional genetic circuits lies in the shift from discrete logic to continuous computation. Classical genetic circuits are typically engineered as Boolean systems: inputs are interpreted as “on” or “off”. This abstraction is convenient for engineering and design, but it is fundamentally misaligned with how biology actually operates, where signals exist as continuously varying concentrations and reaction rates.

Subsections of Homework

Week 1 HW: Principles and Practices

hw-1-logo hw-1-logo

Homework Week 1: Class Assignment

Biological Engineering Application

First Steps towards “Intelligence in a (warehouse)-dish”

Guided by the vision of building a biological general computing system, the goal of the proposed tool is to provide a minimal, yet replicable brain organoid based system, that can be engineered to exhibit controllable, learning-like signal processing behaviour. The system consists of 3 conceptual parts (input - computation - output), that manifest in 2 integrated physical devices.

Firstly, there is the 3D organoid culture chamber that handles the computation. Brain organoids, based on iPSC cultures, with a diameter below 500 micrometer, containing less than 100,000 cells can be manufactured with a high degree of standardization and scalability. More recent research allowed the long-term culture of brain organoids exceeding one year, displaying spontaneous electrophysical (re)activity, show extensive myelination, and can be enriched with several relevant cell types, e.g. oligodendrocytes, microglia, and astrocytes [1, 2, 3, 4].

Second is the input-output system that handles the input and output functions. As current research focuses on avascular brain organoids that require delivery of nutrients via diffusion, a higher order, more complex brain organoid requires brain vasculature. The microfluidic system mimics above-mentioned vasculature and contributes to the development of higher-order brain organoids. Additionally, the system can deliver and record chemical signals in a spatiotemporal manner [1, 5, 6]. Another activation mechanism lies in 3D microelectronic arrays. These allow precise stimulation and recording of spatiotemporal signals across the entire surface of the brain organoid [7].

Therefore, combining advances in 3D culture of brain organoids, with a combination of microfluidics and microelectronic arrays, poses an exciting research avenue and aims to contribute to the research topic of Organoid Intelligence (OI) [1].

Citations

  1. Smirnova, L., Caffo, B.S., Gracias, D.H., Huang, Q., Morales Pantoja, I.E., Tang, B., Zack, D.J., Berlinicke, C.A., Boyd, J.L., Harris, T.D. and Johnson, E.C., 2023. Organoid intelligence (OI): the new frontier in biocomputing and intelligence-in-a-dish. Frontiers in Science, 1, p.1017235
  2. Pamies, D., Barrera, P., Block, K., Makri, G., Kumar, A., Wiersma, D., Smirnova, L., Zhang, C., Bressler, J., Christian, K.M. and Harris, G., 2016. A human brain microphysiological system derived from induced pluripotent stem cells to study neurological diseases and toxicity. Altex, 34(3), p.362.
  3. Monzel, A.S., Smits, L.M., Hemmer, K., Hachi, S., Moreno, E.L., van Wuellen, T., Jarazo, J., Walter, J., Brüggemann, I., Boussaad, I. and Berger, E., 2017. Derivation of human midbrain-specific organoids from neuroepithelial stem cells. Stem cell reports, 8(5), pp.1144-1154.
  4. Marton, R.M., Miura, Y., Sloan, S.A., Li, Q., Revah, O., Levy, R.J., Huguenard, J.R. and Pașca, S.P., 2019. Differentiation and maturation of oligodendrocytes in human three-dimensional neural cultures. Nature neuroscience, 22(3), pp.484-491.
  5. Grigoryan, B., Paulsen, S.J., Corbett, D.C., Sazer, D.W., Fortin, C.L., Zaita, A.J., Greenfield, P.T., Calafat, N.J., Gounley, J.P., Ta, A.H. and Johansson, F., 2019. Multivascular networks and functional intravascular topologies within biocompatible hydrogels. Science, 364(6439), pp.458-464.
  6. Sabatini, B.L. and Tian, L., 2020. Imaging neurotransmitter and neuromodulator dynamics in vivo with genetically encoded indicators. Neuron, 108(1), pp.17-32.
  7. Huang, Q., Tang, B., Romero, J.C., Yang, Y., Elsayed, S.K., Pahapale, G., Lee, T.J., Morales Pantoja, I.E., Han, F., Berlinicke, C. and Xiang, T., 2022. Shell microelectrode arrays (MEAs) for brain organoids. Science advances, 8(33), p.eabq5031.

Governance Goals

Harm reduction

Minimize the harm on the biological system through careful research design, consideration of biological responses and sensibility at the intersection of the research goal and societal norms.

Biological harm reduction emphasizes preserving the physiological integrity of the organoid system on the biological level. The intent is to avoid inducing unnecessary stress, damage or pathological stress in living tissue and to ensure that experimental interactions remain compatible with healthy biological function.

Limitation of harm arising from emergent properties seeks to prevent unintended transitions toward higher-order dynamics that could raise ethical concerns, as organoids exhibit more complex, self-organizing behavior. This includes constraining system complexity and maintaining clear boundaries on the duration and scope of experiments, as well as ensuring that organoids are not maintained beyond their justified research purpose.

Donor rights

Ensuring that individuals that contribute their biological samples for research retain fair rights, autonomy and benefits and are protected again extractive behaviour of third parties though the entire lifecycle and downstream applications.

Transparent disclosure of organoid intelligence research ensures that donors clearly understand that their biological material may be used to generate brain organoids for learning-like signal processing and computational experimentation.

Fairness in benefit sharing and protection against discrimination aims to include donors into the benefits generated by the research, nor expose them to harm through stigmatization, profiling, or inequitable use of data derived from their biological contributions. Donor participation should not create asymmetries where value is extracted without corresponding ethical consideration.

Privacy preservation protects donors from identity linkage, misuse or inappropriate inference of personal traits.

Access

Ensuring that access to the systems themselves as well as the associated knowledge and benefits are disseminated in a fair way, while extractive use attempts are actively prevented.

Equitable research access seeks to avoid concentration of organoid intelligence capabilities within a small number of research groups and institutions. The intent is to enable participation by a diverse range of (also non-scientific) communities.

Non-exclusive access tries to ensure that foundational biological systems and insights won’t be locked behind proprietary structures. The goal is to preserve openness at the level of core knowledge and enabling technologies.

Limiting extractive use ensures that access to sensitive biological data does not enable exploitation of contributing or downstream affected individuals. This goal emphasizes that organoid intelligence research should generate value that is aligned with societal benefit.

Governance Actions

Action 1: Technically Enforced Graduated Freedom

Purpose: Redefining the locus of governance from external to internal. The goal of this action is to embed harm reduction directly into the technical architecture, while still preserving scientific flexibility. Instead of imposing rigid limits, the system provides ethical “factory settings” that enable safe and broadly acceptable use by default, while allowing controlled exploration beyond these settings when justified.

Design: The organoid computing platform is developed with a set of default operating parameters, e.g. size, culture duration, stimulation intensity, and learning persistence. These can be used without additional ethical review. At the same time, a clearly defined subset of parameters is designated as research-variable, allowing researchers to intentionally explore higher complexity, longer duration, or altered learning dynamics. Deviations beyond default settings require explicit justification and appropriate ethical oversight, but are technically supported rather than prohibited. The system logs when and how parameters are modified.

Assumptions: It is assumed that most researchers will operate within default settings unless there is a genuine scientific reason to deviate.. It also assumes that technical transparency (rather than hard locks) is an effective governance lever.

Risks of Failure & “Success”: The model fails if defaults are treated without care rather than minimum safeguards, or if parameter variation becomes routine without oversight. Successful use of the system could create a false sense of ethical safety. There is also a risk that logging is perceived as surveillance, which in turn would discourage experimentation.

Action 2: Reciprocal Donor Stewardship

Purpose: Current consent frameworks mostly are a one-time action, offering limited protection against extractive use. This action proposes a reciprocal donor stewardship model, in which the collecting institution acts as a fiduciary to protect the donor’s interests, but also to maintain a two-way informational relationship. Donors are recognized as long-term stakeholders.

Design: Universities and biobanks adopt stewardship responsibilities as a condition for ethical approval and public funding. Donors opt into a structured relationship that includes regular high-level updates on relevant scientific developments as well as personalized notifications when findings derived from their samples may have health relevance.

Assumptions: This model assumes donors want an ongoing relationship. Furthermore it assumes that institutions can responsibly manage communication. It also assumes that research findings can be meaningfully categorized into general scientific updates versus personally relevant information.

Risks of Failure & “Success”: The model may fail if institutions lack the willingness to maintain long-term engagement. There is also a risk that donors misinterpret research signals as medical diagnoses, causing anxiety or harm. Successful implementation could blur the boundary between research and clinical care.

Action 3: Simple Public-Interest Licensing

Purpose: Biological computing moves toward commercialization, this creates the option that foundational technologies become locked behind exclusive or opaque licensing arrangements. The goal of this action is to preserve the public-interest while enabling rapid and practical commercialization, ensuring that ethical constraints do not themselves become barriers to innovation.

Design: Universities and spin-outs adopt standardized, plain-language public-interest licenses. These licenses are intentionally short, unambiguous, and easy to interpret, defining only a small number of clearly prohibited applications, while leaving all other commercial uses unrestricted. Investors and companies opt in upfront, gaining predictability.

Assumptions: This approach assumes that ethical constraints can be expressed in a small number of clear, enforceable prohibitions. It also assumes that companies and investors value legal certainty and speed of commercialization enough to accept modest limits on exclusivity and application scope.

Risks of Failure & “Success”: The model fails if prohibited-use categories are defined too broadly or too minimally. Conversely, “success” could lead to widespread adoption, which may normalize these constraints.

Scoring Matrix

Does the option:Tecnically EnforcedReciprocal StewardshipSimple Licensing
Harm Reduction1N/A2
• Biological harm reduction1N/A2
• Limitation of harm arising from emergent properties1N/A3
Donor rightsN/A12
• Transparent disclosure of organoid intelligence researchN/A13
• Fairness in benefit sharing and protection against discriminationN/A11
• Privacy preservationN/A12
AccessN/A11
• Equitable research accessN/A22
• Non-exclusive accessN/A21
• Limiting extractive useN/A22

Recommendation to ethics boards

For research on organoid intelligence, ethics boards should prioritize governance mechanisms that operate at the point of experimental design and focus on setting default use behaviour. Based on the policy goals of harm reduction, donor rights, and access, I recommend ethics boards focus primarily on Action 1 (Technically Enforced Graduated Freedom) for the experiment design and Action 2 (Reciprocal Donor Stewardship) ensuring a modern relationship management. Action 3 (Simple Public-Interest Licensing) will become more relevant in the near future, therefore it should be considered down the road.

First, ethics boards should require technically enforced ethical defaults for organoid intelligence systems. Rather than relying on lengthy binary approval decisions, projects should be judged on justification of any intended deviations. Second, ethics boards should transition from one-time consent towards reciprocal stewardship plans. These plans should treat donors as long-term stakeholders engaging in two-way communication when findings may be personally relevant. This strengthens donor autonomy and public trust without conflating research with clinical care. While ethics boards do not manage IP, they should recommend investigators to include public interest licensing in their research lifecycle.

The risk that overly cautious governance discourages legitimate research. These uncertainties argue for graduated, revisitable oversight rather than rigid prohibitions.

Use of Generative AI

Generative AI was used as a drafting aid throughout the development of this homework assignment. Specifically, it supported the structuring and refinement of complex ideas at the intersection of organoid intelligence, and governance, including the logical separation and articulation of policy goals and governance actions. The AI was used to iteratively clarify language and explore alternative framings, while all substantive ideas, judgments, and final decisions were made by the author.

Week 2 Lecture Prep

Prof. Dr. Jacobson

Nature’s machinery for copying DNA is called polymerase.

What is the error rate of polymerase?

The error rate is 1:10E6 (see page 8, right side “biological systhesis”)

How does this compare to the length of the human genome?

The human genome has a length of 3.2 Gbp or 3.2 x 10E9 (see page 10 diagram)

How does biology deal with that discrepancy?

The human body engages in a “proofreading” process (page 8). The error correction process, called “MutS Repair System” is described in detail in Lamers et al. Nature 407:711 (2000) (page 14)

How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

The average human protein is 1036 bp, equating to 1036 / 3 = 345 amino acids. On average 3 codons encode an amino acid, therefore the number of number of different DNA sequences encoding the same 345 amino acids is 3 x 10E345.

Dr. LeProust

What’s the most commonly used method for oligo synthesis currently?

Phosphoramidite chemistry. Page 2 illustrates the 4 step process. See the historical development from page 9.

Why is it difficult to make oligos longer than 200nt via direct synthesis?

Not directly found in the slides. Searching further ressources says the fundamental issue is cumulative coupling efficiency loss. Each cycle has a coupling efficiency of ~99–99.5%. The overall yield is the stepwise efficiency raised to the power of the oligo length: for a 200-mer at 99.5% coupling efficiency, the yield of full-length product is roughly 37%.

Why can’t you make a 2000bp gene via direct oligo synthesis?

Same rational as answer above. Decay is exponential, therefore the efficiency drops to negligible levels.

Prof. Dr. George Church

Using Google & Prof. Church’s slide #4: What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

Phenylalanine (F), Valine (V), Threonine (T), Tryptophan (W), Isoleucine (I), Methionine (M), Leucine (L), Lysine (K), Histidine (H), Arginine (R)

Opus 4.6 Prompt: How does this relate to Lysine Deficiency?

The connection to Jurassic Park wasn’t clear to me, as i haven’t seen the movie. Therefore here is Opus answer (which was an amusing read): “The “Lysine Contingency” from Jurassic Park

In the film/novel, the engineers claim dinosaurs were designed to be lysine-deficient — unable to produce lysine — so they’d die without supplemental lysine provided by the park, serving as a biocontainment mechanism.

Why It Doesn’t Work The “contingency” is essentially meaningless as a containment strategy because lysine is already an essential amino acid for all animals. No animal synthesizes its own lysine. The dinosaurs wouldn’t be any more dependent on external lysine than a cow, a bird, or a human. Any environment with protein-containing food sources (plants, prey, insects) would supply adequate lysine. The dinosaurs would simply eat and obtain lysine the same way every other animal does.

Connection to Church’s Work Interestingly, slide #4’s emphasis on “knowing your monomers” and the references to non-standard amino acids (NSAAs) and xenomicrobiology (genetic code engineering, referenced from the 2016 Microb. Biotech paper) point to what actual biocontainment might look like. As shown in slides 39–42, Church’s lab has built Genomically Recoded Organisms (GROs) that depend on NSAAs like BipA — amino acids that truly do not exist in nature. This is a real lysine contingency done right: organisms that require a synthetic amino acid unavailable in any natural food web, providing genuine genetic and metabolic isolation.

So Crichton’s fictional biocontainment was flawed at the basic biochemistry level, but Church’s lab has essentially built the version that actually works.”

Week 2 HW: DNA Read, Write & Edit

HW 2 Logo HW 2 Logo

Part 0: Basics of Gel Electrophoresis

  • Watch Week 2 Lecture (Zoom)
  • Watch Week 2 Recitation (Zoom)
  • Watch BioBootcamp Day 1 - Day 3 (Zoom)

Part 1: Benchling & In-silico Gel Art

  • Make a free account at benchling.com
  • Import the Lambda DNA. Benchling import DNA Benchling import DNA
  • Simulate Restriction Enzyme Digestion with the following Enzymes:
    • EcoRI
    • HindIII
    • BamHI
    • KpnI
    • EcoRV
    • SacI
    • SalI Benchling Screenshot Virtual Digest Benchling Screenshot Virtual Digest

Artwork

After struggling quite some time with the task of creating artwork with the limited amount of restriction enzymes, in the end decided to stick to a relatively easy and repetitive pattern that with a little imagination has a lot of versatile interpretations: It can be two friends hanging out Two Friends Two Friends It can be DNA (or at least a rought estimation of the firrst two loops) DNA Simple DNA Simple To help you visualize it a bit better i created some generative AI art DNA Cool DNA Cool

To create the loop like structure, I used the Restriction enzymees BamHI - KpnI - SalI - KpnI - BamHI

Part 2: Gel Art - Restriction Digests and Gel Electrophoresis

not relevant as I don’t have access to a lab

Part 3: DNA Design Challenge

3.1. Choose your protein.

For Week 2 Homework I choose the Growth/differentiation factor 8 (short GDF-8), also known as human myostatin protein. I choose myostatin for the inital reason, that it was the first protein that came to mind. Being known for the viral video of Jo Zayner injecting the DIY-Gene Therapy to knock out the myostatin associated gene, or the many pictures of muscled animals, like cattle and dogs.

Digging further, myostatin seemed to be a good choice, not only for it’s fame. It is a well studied protein, with a clear function to negatively regulate muscle growth, as seen with the example of the “jacked” bagle or the cattle. Furthermore myostatin is not only interesting to biohackers and instagram scrollers, it has actual therapeutical interest, and is actively researched to combat muscular dystrophy. In the up and coming field of Longevity, myostatin is researched to mitigate age-related muscle loss.

Sequence

>sp|O14793|GDF8_HUMAN Growth/differentiation factor 8 OS=Homo sapiens OX=9606 GN=MSTN PE=1 SV=1 MQKLQLCVYIYLFMLIVAGPVDLNENSEQKENVEKEGLCNACTWRQNTKSSRIEAIKIQI LSKLRLETAPNISKDVIRQLLPKAPPLRELIDQYDVQRDDSSDGSLEDDDYHATTETIIT MPTESDFLMQVDGKPKCCFFKFSSKIQYNKVVKAQLWIYLRPVETPTTVFVQILRLIKPM KDGTRYTGIRSLKLDMNPGTGIWQSIDVKTVLQNWLKQPESNLGIEIKALDENGHDLAVT FPGPGEDGLNPFLEVKVTDTPKRSRRDFGLDCDEHSTESRCCRYPLTVDFEAFGWDWIIA PKRYKANYCSGECEFVFLQKYPHTHLVHQANPRGSAGPCCTPTKMSPINMLYFNGKEQII YGKIPAMVVDRCGCS

3.2 Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.

The Central Dogma discussed in class and recitation describes the process in which DNA sequence becomes transcribed and translated into protein. The Central Dogma gives us the framework to work backwards from a given protein sequence and infer the DNA sequence that the protein is derived from. Using one of the tools discussed in class, NCBI or online tools (google “reverse translation tools”), determine the nucleotide sequence that corresponds to the protein sequence you chose above.

Using the Reverse Translate tool from Bioinformatics.org, I got the following results:

>reverse translation of sp|O14793|GDF8_HUMAN Growth/differentiation factor 8 OS=Homo sapiens OX=9606 GN=MSTN PE=1 SV=1 to a 1125 base sequence of most likely codons. atgcagaaactgcagctgtgcgtgtatatttatctgtttatgctgattgtggcgggcccg gtggatctgaacgaaaacagcgaacagaaagaaaacgtggaaaaagaaggcctgtgcaac gcgtgcacctggcgccagaacaccaaaagcagccgcattgaagcgattaaaattcagatt ctgagcaaactgcgcctggaaaccgcgccgaacattagcaaagatgtgattcgccagctg ctgccgaaagcgccgccgctgcgcgaactgattgatcagtatgatgtgcagcgcgatgat agcagcgatggcagcctggaagatgatgattatcatgcgaccaccgaaaccattattacc atgccgaccgaaagcgattttctgatgcaggtggatggcaaaccgaaatgctgctttttt aaatttagcagcaaaattcagtataacaaagtggtgaaagcgcagctgtggatttatctg cgcccggtggaaaccccgaccaccgtgtttgtgcagattctgcgcctgattaaaccgatg aaagatggcacccgctataccggcattcgcagcctgaaactggatatgaacccgggcacc ggcatttggcagagcattgatgtgaaaaccgtgctgcagaactggctgaaacagccggaa agcaacctgggcattgaaattaaagcgctggatgaaaacggccatgatctggcggtgacc tttccgggcccgggcgaagatggcctgaacccgtttctggaagtgaaagtgaccgatacc ccgaaacgcagccgccgcgattttggcctggattgcgatgaacatagcaccgaaagccgc tgctgccgctatccgctgaccgtggattttgaagcgtttggctgggattggattattgcg ccgaaacgctataaagcgaactattgcagcggcgaatgcgaatttgtgtttctgcagaaa tatccgcatacccatctggtgcatcaggcgaacccgcgcggcagcgcgggcccgtgctgc accccgaccaaaatgagcccgattaacatgctgtattttaacggcaaagaacagattatt tatggcaaaattccggcgatggtggtggatcgctgcggctgcagc

3.3. Codon optimization.

Once a nucleotide sequence of your protein is determined, you need to codon optimize your sequence. You may, once again, utilize google for a “codon optimization tool”. In your own words, describe why you need to optimize codon usage. Which organism have you chosen to optimize the codon sequence for and why?

Organisms have different procivities for using certain codons. Some codons are used more frequently in one orgamis. So for the same amino acide certain codons are “preferred” over others. If the inserted DNA matches the preferred codons of the organism more translation happens. In my case, as the protein I’m trying to express is of human origin, though expressed in e.coli (common for early experiments), many of the codons common in homo sapiens, are rare in e.coli.

Using the Codon Optimization tool from VectorBuilder, and choosing e.coli K-12 as the organism i get the following optimized codon:

ATGCAGAAACTGCAGCTGTGCGTTTACATTTATCTGTTCATGCTGATTGTGGCCGGCCCGGTGGATCTGAACGAAAACAGTGAACAGAAAGAAAACGTGGAAAAAGAAGGTCTGTGCAACGCCTGTACCTGGCGCCAGAATACCAAATCGAGCCGCATTGAAGCCATTAAAATTCAGATCCTGTCAAAACTGCGTCTGGAAACCGCGCCGAATATTAGCAAAGATGTGATCCGTCAGCTGCTGCCGAAAGCCCCGCCGCTGCGTGAACTGATTGATCAGTATGATGTGCAGCGCGATGATAGCAGCGATGGCAGCCTGGAAGATGATGATTATCACGCGACCACCGAAACCATTATTACCATGCCGACCGAAAGCGATTTTCTGATGCAGGTGGATGGCAAACCGAAATGCTGCTTCTTCAAATTTAGCTCGAAAATTCAATATAATAAAGTGGTGAAAGCGCAGCTGTGGATCTATCTGCGCCCGGTGGAAACCCCGACCACCGTGTTTGTGCAGATTCTGCGCCTGATTAAACCGATGAAAGATGGCACCCGCTACACCGGCATTCGCAGCCTGAAACTGGATATGAACCCGGGCACCGGCATCTGGCAGAGCATTGATGTGAAAACCGTTCTGCAGAATTGGCTGAAACAGCCGGAAAGCAACCTGGGCATTGAAATTAAAGCCCTGGATGAAAATGGCCATGATCTGGCAGTGACCTTTCCGGGCCCGGGCGAAGATGGCCTGAATCCGTTCCTGGAAGTGAAAGTGACCGATACCCCGAAACGCAGCCGCCGCGACTTTGGCCTGGATTGCGATGAACACAGCACCGAAAGCCGCTGCTGCCGCTACCCGCTGACCGTGGATTTTGAAGCGTTCGGCTGGGATTGGATTATTGCGCCGAAACGCTATAAGGCGAACTACTGCAGCGGTGAATGCGAATTTGTGTTTCTGCAGAAATATCCGCACACCCATCTGGTGCACCAGGCAAACCCGCGCGGCAGCGCGGGCCCGTGCTGTACCCCGACCAAAATGAGCCCGATTAACATGCTGTATTTTAACGGCAAAGAACAGATTATCTATGGCAAAATCCCGGCGATGGTTGTGGATCGCTGCGGTTGTAGC while avoiding cleavage sites of restriction enzymes: BamHI HindIII

Whether the e.coli is the proper host for this application is debateable, as e.coli lacks the post-translational modification capabilities. Other hosts like CHO cells or human cells, could prove to be a better choice, if one aims for a properly folded, functioning protein.

3.4. You have a sequence! Now what?

What technologies could be used to produce this protein from your DNA? Describe in your words the DNA sequence can be transcribed and translated into your protein. You may describe either cell-dependent or cell-free methods, or both.

Let me focus on cell-dependent methods. There are several expression systems like e.coli, yeast and mamalian cells. Each system has their pros and cons. E.coli is a procaryote, it’s cheap, fast and well established, though it lacks post translational modification abilities. Therefore the myostatin might be misfolded. Addtionally the cells have to be lysed to get to the myostatin. Yeast is eukaryotic, therefore it has some post-translational folding capabilities, additionally secretion is possible. Mammalian cells offer human-like folding, and secretion, though they grow slower, are a-lot harder to handle, offer lower yields and are significantly more expensive than e.coli or yeast.

To get the DNA into the orgamism:

  1. Perform PCR on the optimized myostatin DNA strand, to generate many copies of said gene. make sure the DNA strand is flanked with BanHI and HindIII sites.
  2. Open the plasmid at the previously avoided restriction sites (BamHI HindIII), creating compatible sticky ends.
  3. Cut the DNA strand with the same restriction enzymes
  4. Mix plasmid and DNA strand, introduce ligase enzymes and ATP, to connect the matching sticky ends.
  5. Transform the plasmid into e.coli using common transformation methods, e.g. heat-shock or electroporation
  6. (optional) screen for uptake of plasmid

Having chosen e.coli here the DNA would transscribe into mRNA, which in turn would translate into myostating facilitated by ribosomes. For lab-sized fermentation, a batch system is sufficient, on industrial scale a fed-batch system would be used.

Part 4: Prepare a Twist DNA Synthesis Order

4.1. Create a Twist account and a Benchling account

  • Create Twist Account
  • Create Benchling Account

4.2. Build Your DNA Insert Sequence

Following the example on the course site, this is the linear map of the sequence: Linear Map of myostatin gene Linear Map of myostatin gene

>Test_myostatin TTTACGGCTAGCTCAGTCCTAGGTATAGTGCTAGCCATTAAAGAGGAGAAAGGTACCATGCAGAAACTGCAGCTGTGCG TTTACATTTATCTGTTCATGCTGATTGTGGCCGGCCCGGTGGATCTGAACGAAAACAGTGAACAGAAAGAAAACGTGGA AAAAGAAGGTCTGTGCAACGCCTGTACCTGGCGCCAGAATACCAAATCGAGCCGCATTGAAGCCATTAAAATTCAGATC CTGTCAAAACTGCGTCTGGAAACCGCGCCGAATATTAGCAAAGATGTGATCCGTCAGCTGCTGCCGAAAGCCCCGCCGC TGCGTGAACTGATTGATCAGTATGATGTGCAGCGCGATGATAGCAGCGATGGCAGCCTGGAAGATGATGATTATCACGC GACCACCGAAACCATTATTACCATGCCGACCGAAAGCGATTTTCTGATGCAGGTGGATGGCAAACCGAAATGCTGCTTC TTCAAATTTAGCTCGAAAATTCAATATAATAAAGTGGTGAAAGCGCAGCTGTGGATCTATCTGCGCCCGGTGGAAACCC CGACCACCGTGTTTGTGCAGATTCTGCGCCTGATTAAACCGATGAAAGATGGCACCCGCTACACCGGCATTCGCAGCCT GAAACTGGATATGAACCCGGGCACCGGCATCTGGCAGAGCATTGATGTGAAAACCGTTCTGCAGAATTGGCTGAAACAG CCGGAAAGCAACCTGGGCATTGAAATTAAAGCCCTGGATGAAAATGGCCATGATCTGGCAGTGACCTTTCCGGGCCCGG GCGAAGATGGCCTGAATCCGTTCCTGGAAGTGAAAGTGACCGATACCCCGAAACGCAGCCGCCGCGACTTTGGCCTGGA TTGCGATGAACACAGCACCGAAAGCCGCTGCTGCCGCTACCCGCTGACCGTGGATTTTGAAGCGTTCGGCTGGGATTGG ATTATTGCGCCGAAACGCTATAAGGCGAACTACTGCAGCGGTGAATGCGAATTTGTGTTTCTGCAGAAATATCCGCACA CCCATCTGGTGCACCAGGCAAACCCGCGCGGCAGCGCGGGCCCGTGCTGTACCCCGACCAAAATGAGCCCGATTAACAT GCTGTATTTTAACGGCAAAGAACAGATTATCTATGGCAAAATCCCGGCGATGGTTGTGGATCGCTGCGGTTGTAGCCAT CACCATCACCATCATCACTAACCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTG TTGTTTGTCGGTGAACGCTCTCTACTAGAGTCACACTGGCTCACCTTCGGGTGGGCCTTTCTGCGTTTATA

Link to Test_myostatin

In Twist having chosen the Clonal Gene, with the vector: pTwist Amp High Copy, this is the output from Twist Twist order page Twist order page Following is the final Clonal Gene in Benchling Benchling Clonal Gene Benchling Clonal Gene

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank).

I’m interested in genetic origin for muscle growth. People have different outcomes for the similar inputs, I’m interested into the marginal influence Gene’s have on ones physiological development. While overall health is important, this also has clinical application, e.g. for patients with muscle loss diseases, age-related muscle atrophy. Therefore a number of genes can be studied. Using search and LLMs, these are relevant proteins for muscle growth,

  • Myostatin: given that my Design a Gene Challenge was about this, it makes sense to study this gene. I suspect mutations in the gene lead to enhanced or reduced function.
  • ACTN3: Determines between “fast” or “slow” muscle fiber twitches. They determine whether one is dispositioned for heavy lifting or endurance.
  • IGF1: This gene expresses a growth factor that aids muscle repair and their growtn after exercise. People with different versions might respond differently to certain training styles.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?

I’d choose NGS sequencing, as I’m interested in targeted genes. NGS allows for a good middle ground of cost, and accuracy, and also has established protocols for variant detetction.

Also answer the following questions:

Is your method first-, second- or third-generation or other? How so?

Second-generation sequencing. DNA fragments are sequenced simultaniously in massively paralellized fashion.

What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.

With NGS library prep is necessary. First DNA is extracted. Then the DNA is fragmented into smaller chunks (around 300-500 bp). Next adapter ligation is used to attache each DNA chunk to the coded oligonucleotide on the flow plate. PCR afterwards is optional

What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?

NGS sequencing takes DNA and fragments it, then ligates it to adapter sequences, which add to a sequencing library. The fragments bind to a flow cell, then aplified to create clusters of identical DNA strands. The gene strand is sequenced by synthesizing a complementary strand using labeled nucleotides, and only one base is incorporated per cycle. After incorporation the flow cell is imaged, and the specific fluoresence is used to identify the added base.

What is the output of your chosen sequencing technology?

A FASTQ files, with a large number of small DNA fragment readouts. Additionally these fragements are then aligned and referenced. These can then be screened for differences.

5.2 DNA Write

(i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :)

A synthetic DNA construct that can be injected locally, that will express miRNAs that silence myostatin mRNA, to allow for temporary enhanced muscle growth. This would pose a therapeutical intervention, either for biohackers or patients with degenerative muscle diseases.

(ii) What technology or technologies would you use to perform this DNA synthesis and why? Also answer the following questions:

I’d probably choose Phosphoramidite Chemistry, as its the current gold standard and ample for smaller DNA fragments.

What are the essential steps of your chosen sequencing methods? What are the limitations of your sequencing method (if any) in terms of speed, accuracy, scalability?

I’m a bit confused why sequencing instead of synthesis methods are mentioned. So assuming that synthesis is meant, I’ll talk about the synthesis based on Phosphoramidite Chemistry The essential steps with PC is repeating the so called Coupling cycle, that is repeated for each base. First the protecting group gets removed (deetritylation), theen the next phosphoramidite nucleeatide is added (coupling), the unreacted 5’ groups get blocked (cappping), lastly unstable phosphite is converted to stable phosphate (oxidation). This way fragemnts up to 200bp are synthesised and later assembled.

5.3 DNA Edit

(i) What DNA would you want to edit and why? In class, George shared a variety of ways to edit the genes and genomes of humans and other organisms. Such DNA editing technologies have profound implications for human health, development, and even human longevity and human augmentation. DNA editing is also already commonly leveraged for flora and fauna, for example in nature conservation efforts, (animal/plant restoration, de-extinction), or in agriculture (e.g. plant breeding, nitrogen fixation). What kinds of edits might you want to make to DNA (e.g., human genomes and beyond) and why?

Going with the theme a reversible knockout for myostatin expression. It could be an alternative treatment for patients with degenerative muscular diseases as well as biohackers.

(ii) What technology or technologies would you use to perform these DNA edits and why? To ensure reversability, I’d consider base editors, introducing stop codons into the myostatin coding sequence. I’d chose base editors as they make single-nucleotide changes without cutting DNA, to reduce the chance of unwanted off-target effects, compared to classical CRISPR-Cas9 technology. Also answer the following questions:

How does your technology of choice edit DNA? What are the essential steps?

First Guide RNA detects a specific myostatin sequence, then guides the Base editor there. Next a base conversion occures, where e.g. a cytosine base editor converts a C to a T. With this edit normal codons are converted to pre-mature stop codons. This leads to the myostatin being misformed and ideally not bioactive.

What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?

Select a target sequence to turn a normal codon into a premature stop codon, e.g. CGA to TGA. Next you have to design a Guide RNA to position the base editor at the chosen target. Next check whether there are any off target effects. I’d need the DNA of the base editor, the DNA Template for the guide RNA, an plasmid that integrates the base editor with the guideRNA, restricition and ligation enzymes. For injection into humans, probably also the targeted muscle cells as well as culture media and transfection equipment.

What are the limitations of your editing methods (if any) in terms of efficiency or precision?

Base editing has some key limitation, the most severe one being the low efficiency compared to classical CRISPR methods. Human cells are notoriously difficult to work with, muscle cells are some of the hardest to edit in humans. The delivery of the editor to muscle cells poses another challenge. The base editors have limited target options, based on their capability to make edits in a narrow basepair window. Off target effects are also a concern. Also it needs to be checked, that the editor only comes in contact with the intended C to T conversion as it converts Cs indiscriminantly.

Use of Generative AI

Generative AI was used as a drafting aid throughout the development of this homework assignment. Specifically, it supported the structuring and refinement of complex ideas at the of DNA Design as well as aiding the understanding of DNA Read, Write and Edit technologies. The AI was used to iteratively clarify language and explore alternative framings, while all substantive ideas, judgments, and final decisions were made by the author.

Week 3 HW: Lab Automation

Python Script for Opentrons Artwork

  • Review this week’s recitation and this week’s lab for details on the Opentrons and programming it.
  • Generate an artistic design using the GUI at opentrons-art.rcdonovan.com. I generated a quick design using the above mentioned tool: BioPunk Initials See the: BioPunk Initial BioPunk Initial
  • Using the coordinates from the GUI, follow the instructions in the HTGAA26 Opentrons Colab to write your own Python script which draws your design using the Opentrons.
    • You may use AI assistance for this coding — Google Gemini is integrated into Colab (see the stylized star bottom center); it will do a good job writing functional Python, while you probably need to take charge of the art concept.
    • If you’re a proficient programmer and you’d rather code something mathematical or algorithmic instead of using your GUI coordinates, you may do that instead.
    • If the Python component is proving too problematic even with AI and human assistance, download the full Python script from the GUI website and submit that:
    • If you use AI to help complete this homework or lab, document how you used AI and which models made contributions.

As its good practice in Software Engineering, not to reinvent the wheel, I had a look at the provided examples, and figured Example 7 would be a good basis for my requirements. Nevertheless, significant updates needed to be done to make the code useful for my purposes. These include:

  1. Importing necessary libraries
  2. Converting CSV file reading to Pandas Dataframe
  3. Removing inversion of y-data
  4. Changing the color scheme
  ###
  ### YOUR CODE HERE to create your design
  ###
  # Using nonstandard color setup - will not run on a robot with this lab's standard deck setup.
  well_colors = {
    'B1' : 'Green'
  }

  # Converting coordinates for the design to Opentrons usable dataframe
  bp_coord = [(-19.8,17.6), (-17.6,17.6), (-15.4,17.6), (-13.2,17.6), (-11,17.6), (2.2,17.6), (4.4,17.6), (6.6,17.6), (8.8,17.6), (11,17.6), (13.2,17.6), (-19.8,15.4), (-8.8,15.4), (2.2,15.4), (15.4,15.4), (-19.8,13.2), (-6.6,13.2), (2.2,13.2), (17.6,13.2), (-19.8,11), (-6.6,11), (2.2,11), (17.6,11), (-19.8,8.8), (-6.6,8.8), (2.2,8.8), (17.6,8.8), (-19.8,6.6), (-6.6,6.6), (2.2,6.6), (17.6,6.6), (-19.8,4.4), (-6.6,4.4), (2.2,4.4), (17.6,4.4), (-19.8,2.2), (-6.6,2.2), (2.2,2.2), (17.6,2.2), (-19.8,0), (-8.8,0), (2.2,0), (15.4,0), (-19.8,-2.2), (-17.6,-2.2), (-15.4,-2.2), (-13.2,-2.2), (-11,-2.2), (2.2,-2.2), (4.4,-2.2), (6.6,-2.2), (8.8,-2.2), (11,-2.2), (13.2,-2.2), (-19.8,-4.4), (-8.8,-4.4), (2.2,-4.4), (-19.8,-6.6), (-6.6,-6.6), (2.2,-6.6), (-19.8,-8.8), (-6.6,-8.8), (2.2,-8.8), (-19.8,-11), (-6.6,-11), (2.2,-11), (-19.8,-13.2), (-6.6,-13.2), (2.2,-13.2), (-19.8,-15.4), (-6.6,-15.4), (2.2,-15.4), (-19.8,-17.6), (-8.8,-17.6), (2.2,-17.6), (-19.8,-19.8), (-17.6,-19.8), (-15.4,-19.8), (-13.2,-19.8), (-11,-19.8), (2.2,-19.8)]
  data = pd.DataFrame(bp_coord, columns=["x", "y"])
  data.columns = ["x", "y"]

  #Get min and max x-/y-values from my coordinates
  raw_x_min = np.amin(data['x'])
  raw_x_max = np.amax(data['x'])
  raw_y_min = np.amin(data["y"])
  raw_y_max = np.amax(data["y"])


  #Shift data, so that the centerpoint 0/0 is at the center of my design
  bp_x_shifted = data['x']-((raw_x_min + raw_x_max)/2)
  bp_y_shifted = data["y"]-((raw_y_min + raw_y_max)/2)

  all_distances_to_center = np.sqrt(np.square(bp_x_shifted) + np.square(bp_y_shifted));

  bp_x_85mm_shifted = 40/np.amax(all_distances_to_center)*bp_x_shifted;
  bp_y_85mm_shifted = 40/np.amax(all_distances_to_center)*bp_y_shifted;

  # Get the top-center of the plate, make sure the plate was calibrated before running this
  center_location = agar_plate['A1'].top()
  cell_well = color_plate['B1'] # Change to location of green transformands

  # Aspirate
  pipette_20ul.pick_up_tip()

  for i in range(len(bp_x_85mm_shifted)):
    if i%20 == 0:
      # pick up more every 20 uL, but only as much as we're going to need!
      pipette_20ul.aspirate(min(20, len(bp_x_85mm_shifted)-i), cell_well)

    adjusted_location = center_location.move(types.Point(bp_x_85mm_shifted[i], bp_y_85mm_shifted[i]))
    pipette_20ul.dispense(1, adjusted_location)
    hover_location = adjusted_location.move(types.Point(z = 2))
    pipette_20ul.move_to(hover_location)

  # Don't forget to end with a drop_tip()
  pipette_20ul.drop_tip()
  • Submit your Python file via this form.

Post-Lab Questions

Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.

Given my background in tissue engineering I picked the paper “Fabrication of cell culture hydrogels by robotic liquid handling automation for high-throughput drug testing” published in “Communications Engineering” (2025)4:222 DOI

The paper introduces HYDRA (HYDrogels by Robotic liquid-handling Automation), a method for fabricating flat and thin hydrogel films directly in standard 96- and 384-well HTS plates using an Opentrons liquid handling machine. The paper tackles one of the main problems in modern drug discovery. Around 50% of compounds passing preclinical assays fail human trials. This has several proposed reasons, the paper tackles one of the widely mentioned problems. In standard HTS cells are grown on plastic dishes that do not resemble the ECM structure of the human body, nor have the dynamic context (mechanical, electrical, etc.) that the ECM provides. Organ on a chip improve the missing biomimicry, but are not compatible with HTS. Additionally, they are too complex too manufacture cheaply and are incompatible with automated screening pipelines. There are existing hydrogel coatings, but they either are so thin that they don’t provide the proper ECM environment mechanically, or so thick that they block high-resolution imaging. Both produce a curved meniscus that makes uniform cell seeding impossible.

This gap, hydrogel coatings thin enough for imaging and thick enough for mechanosensing, is tackled with HYDRA. HYDRA uses fish gelatin for its hydrogel base material, dissolved in PBS at 5-20% w/v and is cross-linked enzymatically with transglutaminase at 0.5-2% w/v. The method was demonstrated on an Opentrons, for easy accessibility, for scalability on a INTRGRA Assist Plus. The robot dispenses precalculated sub-contact volumes 96 well plates: 12 micro liter, 382 well plate: 1 micro liter), and immediately re-aspirates the volume, to archive a thin film. This is possible by using contact angle hysteresis. The volume was optimized using FE simulations. The process takes around 10 minutes for the whole plate, then the plates are gelled, and incubated at 37 degree Celsius, sterilized and swollen and rinsed in PBS. Afterwards cells can be seeded. The hydrogel thin films are between 10-50 micrometer thickness, with tunable stiffness. Lastly the authors validated several imaging platforms (digital holographic, widefield fluorescence, and high-resolution confocal microscopy) see Fig. 1 Figure 1 Figure 1

Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.

Developmental biology aims to understand how tissues and cells develop from dynamic, history-dependent processes governed by interacting biochemical, electrical, mechanical cues in a temporal manner. While these properties are well established theoretically, in practice experiments lack behind by what can be executed manually. Even if automation is deployed it only addresses one of the drawbacks of manual experimentation. This forced experiments to make simplifications, such as searching only low-parameter search spaces, coarse temporal guiding and open loop design of experiments. This results in heuristic sampling and interpretation of results with lacking search spaces. Using automation directly addresses these limitations by enabling precise temporal control, systematic exploration of parameter rich search spaces, and adaptive, feedback-driven intervention. This is especially relevant in organoid and organ-on-chip systems.

Mimicking the ECM in developmental biology

Many developmental signals only have their proper effect through the timing, duration, and frequency, rather than concentration alone. Temporal regimes are rarely explored experimentally, e.g. competence windows, pulsed signaling, and oscillatory dynamics due to the impracticality of executing. Automated liquid-handling systems can change that by enabling stable, repeatable temporal execution.

Exploring the design space

Cell fate and tissue morphology have very high-dimensional development search spaces. This comes from the interaction of multiple biochemical and mechanical variables that evolve over time. Contrasting this intuition or slow sequential tuning is currently used. Automation enables systematic, multi-parameter exploration of said search space. The goal of automation is to find non-linear responses and regime changes that are not obvious in manual experiments.

Guiding Experiments

Development proceeds through continuous feedback between tissue state and signaling environment, yet most experiments are designed in an open-end fashion, with data analysis and redesign being manual and in a discrete fashion. Automation should allow experiments to close the loop, where readouts inform the next steps of the experiment. The goal is to adaptively explore tissue formation.

Final Project Ideas

As explained in this week’s recitation, add 1-3 slides in your Node’s section of this slide deck with 3 ideas you have for an Individual Final Project. Be sure to put your name, city, and country on your slide!

  • slides are added

Use of Generative AI

Generative AI was used as a conceptual drafting aid during the development of this project. Specifically, it supported the structuring and refinement of complex ideas related to automated biological experimentation and experimental design in tissue engineering, as well as clarifying concepts around temporal control, high-dimensional experimentation, and reproducibility in laboratory automation. The AI was used to iteratively improve clarity of language and to explore alternative conceptual framings. No generative AI was used for the implementation or coding components of the project, and all substantive ideas, technical decisions, and final judgments were made by the author.

Week 4 HW: Protein Design I

Part A: Conceptual Questions

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

Assuming that approx 20-25% of the weight is protein, that leaves us with 100-125g protein in a 500 gram meat piece. The average amino acid molar mass is approx. 100 g/mol. With Avogadro’s number 6.022×1023 the number of molecules is between: 1 x 6.022×1023 = 6.022×1023 and 1.25 x 6.022×1023 = 7.5 ×1023 (approx)

Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Because the DNA of the food doesn’t get copied into the host. Digestion breaks proteins into amino acids or small peptides. Then the human body rebuilds new (human proteins) using DNA instructions and its own cell machinery.

Why are there only 20 natural amino acids?

As with all biology it’s probably an energy equilibrium. Biological life build itself with the smallest amount of chemical diversity it could find, e.g. charged, polar, hydrophobic, aromatic, sulfur, etc, that would make accurate encoding possible.

Can you make other non-natural amino acids? Design some new amino acids.

Chemists have shown that they can make other non-natural amino acide. Some useful ones would be

  1. Clickable amino acid: an alanine-like side chain with an azide or alkyne group (bio-orthogonal labeling).
  2. Photo-crosslinker: phenylalanine-like side chain bearing a benzophenone (forms covalent links under UV to map contacts).

Where did amino acids come from before enzymes that make them, and before life started?

Amino acids could have potentially made by

  1. Atmospheric / energy chemistry: classic spark/UV experiments can generate amino acids from simple gases.
  2. Hydrothermal / geochemical routes: mineral surfaces can catalyze formation and concentration.

If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

D-amino acids form the opposite handedness, so an α-helix made of D-amino acids is left-handed.

Can you discover additional helices in proteins?

Depends on the definition on the alpha-helices. One can define an alpha-helices, so that there are helices that don’t match that group. There are examples in nature like the 310-helices.

Why are most molecular helices right-handed?

The Chirality of building blocks is relevant, as biology uses mostly L-amino acids, which makes the right-handed α-helix sterically favorable and the left-handed version strained.

Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

β-strands have “sticky edges”, the backbone has H-bond donors/acceptors that like to be satisfied. Additionally β-sheets can often extend by adding another strand at an exposed edge. Many β-strands present alternating side chains, creating flat, complementary surfaces that pack well.

Part B: Protein Analysis and Visualization

Briefly describe the protein you selected and why you selected it.

As referenced in Week 2, I’m interested in the Protein Myostatin. It’s function is to downregulate new muscle growth. That makes it an attractive protein to inhibit for biohackers as well as clinicians treating patients with muscular dystrophy diseases.

Identify the amino acid sequence of your protein.

To answer these questions, I went to Uniprot, looking up the protein (Entry ID: “O14793, GDF8_HUMAN”), I copied the sequence from the subchapter “Sequence”.

>sp|O14793|GDF8_HUMAN Growth/differentiation factor 8 OS=Homo sapiens OX=9606 GN=MSTN PE=1 SV=1 MQKLQLCVYIYLFMLIVAGPVDLNENSEQKENVEKEGLCNACTWRQNTKSSRIEAIKIQI LSKLRLETAPNISKDVIRQLLPKAPPLRELIDQYDVQRDDSSDGSLEDDDYHATTETIIT MPTESDFLMQVDGKPKCCFFKFSSKIQYNKVVKAQLWIYLRPVETPTTVFVQILRLIKPM KDGTRYTGIRSLKLDMNPGTGIWQSIDVKTVLQNWLKQPESNLGIEIKALDENGHDLAVT FPGPGEDGLNPFLEVKVTDTPKRSRRDFGLDCDEHSTESRCCRYPLTVDFEAFGWDWIIA PKRYKANYCSGECEFVFLQKYPHTHLVHQANPRGSAGPCCTPTKMSPINMLYFNGKEQII YGKIPAMVVDRCGCS

How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

Using the Colab notebook that was provided by the HTGAA TAs (see link in the HW announcement). I copied the protein sequence into the variable protein_sequence in the notebook. After running the code block I got the answer: The length of the protein is: 375 aminoacids. The most common amino acid is: L, which appears 33 times.

How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

With a BLAST search, it was determined that the protein had 250 homologes (see Fig. 1)

ADD FIGURE 1

Does your protein belong to any protein family?

Myostatin belongs to the Transforming Growth Factor - beta (TGF-β) protein family.

Identify the structure page of your protein in RCSB

The protein I use to answer the following questions has the ID 5JI1 | pdb_00005ji1 in RCSB. This is not a protein from Homo Sapiens, but from mus musculus (a mouse). While techincally not the same protein, both proteins are very similar. The choice for the mouse protein is mainly, that there is no isolated myostatin protein from homo sapiens in RCSB.

When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)

According to the overview page of the protein, it was deposited on the 21 April 2016 and released on the 22 March 2017. The resolution of the structure is 2.25 Å, which is below the threshhold mentioned in the question text, therefore one can assume, that the resolution and therefore quality of the protein is good.

Are there any other molecules in the solved structure apart from protein?

As mentioned above the protein is isolated, therefore there are no other structures.

Does your protein belong to any structure classification family?

As mentioned myostatin is part of the TGF-β superfamily (growth factor, cystine-knot fold).

Open the structure of your protein in any 3D molecule visualization software:

PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)

I tried to install pymol with homebrew, this failed repeatedly due to errors with the QT5 library, that manages desktop applications in python. Initializing a conda environment and installing pymol was successful though. I read through the tutorial, though i decided to use the command line interface of pymol, as its better documented.

Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.

Cartoon Myostatin Cartoon Myostatin Cartoon Myostatin Cartoon Myostatin Cartoon Myostatin Cartoon Myostatin

Color the protein by secondary structure. Does it have more helices or sheets?

Secondary Structure Myostatin Secondary Structure Myostatin As one can clearly see the protein has more beta sheets

Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

Residue Myostatin Residue Myostatin Hydrophobic residues cluster in the interior of the protein, forming a stabilizing core, while hydrophilic and charged residues are predominantly surface-exposed. This distribution reflects a well-folded, soluble signaling protein whose surface properties are optimized for molecular recognition rather than catalysis.

Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Surface 1 Myostatin Surface 1 Myostatin Surface Pockets Myostatin Surface Pockets Myostatin The protein surface lacks deep, well-defined binding pockets. Instead, it displays shallow surface depressions and grooves, characteristic of protein–protein interaction interfaces rather than enzyme active sites. This indicates that myostatin binds interaction partners through extended surface patches rather than classical binding pockets.

Part C: Using ML-Based Protein Design Tools

Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.

I set up a copy of the notebook and set it up with a Google T4 GPU.

Choose your favorite protein from the PDB.

Consistent with the rest of my homework, I chose the myostatin protein, to keep it consistent with Part B, i chose to continue to use the mus musculus sequence. We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

Deep Mutational Scans

I decided to use the smallest model “esm2_t6_8M_UR50D” to conserve gpu ressources and avoid hitting rate limits, as the protein in question is rather large. Still I hit the rate limit.

Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
Mutation Scan Heatmap Myostatin Mutation Scan Heatmap Myostatin
Can you explain any particular pattern? (choose a residue and a mutation that stands out)

The protein has three super columns (7-13, 97-108, 225-235), with very low scores, indicating that they are very constrained, this could indicate a lynchipin in the protein, as the areas are structurally or functionally important.

A conserved cysteine in the mature TGF-β–like domain, mutated to Ser or Ala (Cys→Ser / Cys→Ala).

(Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.

I struggled to find the data necessary ot make the comparison.

Latent Space Analysis

I downloaded the model and processed the batches:

Processing batches: 0%| | 0/15177 [00:00<?, ?it/s]sdpa attention does not support output_attentions=True. Please set your attention to eager if you want any of these features. Processing batches: 100%|██████████| 15177/15177 [18:21<00:00, 13.77it/s] Finished forwarding sequences through the LLM and collecting mean embeddings.

Use the provided sequence dataset to embed proteins in reduced dimensionality.

Latent Space analysis Latent Space analysis Shape of embeddings array before 3D t-SNE: (15177, 320) Shape of embeddings array before 3D t-SNE: (15177, 3)

Analyze the different formed neighborhoods: do they approximate similar proteins?

Latent Space analysis with marker Latent Space analysis with marker There form of the latent space looks like a sphere, with one half being densly filled, while the other one is more sparsely filled. Nevertheless no clear neighborhoods are distinguishable.

When looking at one nearest neighbor neighborhood, I still fail to see any pattern in the data, as the organisms are different, as well as the type of the protein, and the function.

Place your protein in the resulting map and explain its position and similarity to its neighbors.

I was unable to place the myostatin in the resulting map.

C2. Protein Folding

Folding a protein I used Version 1 of ESMFold.

Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

I named the job “myostatin_v1”, with 1 copy and 3 num_recycles. To visualize i used the colorsheme “rainbow” ESMFold structure Myostatin ESMFold structure Myostatin Cartoon Myostating Cartoon Myostating While the overall structure can be described as V or Y shaped in both predicted and actual structure, both structures don’t match on closer look. The original is mostly beta-sheets and has a destinct V shape, while the predicted model is more Y shaped, has several alpha-helices and has a more complex shape to it.

Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

I repeatedly tried to change the structure, though i hit ratelimits with the GPU. Furthermore, as the predicted structure doesn’t match the database myostatin, either trying mutations or changing large segments will likely further remove the predicted structure from it experimental data. Proposed mutations:

  1. Splitting the protein in 4 equal parts and mutating the sequence at that point
  2. switching 20 aa for a randomly generated 20 aa section.
  3. Inverting the aa sequence.

C3. Protein Generation

Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN I used the model “v_48_020”, the pdb 5JI1, the homomer designed_chain A, num_seqs=1, sampling_temp=0.1

Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

Predicted Sequence Probabilities Predicted Sequence Probabilities Mutation Scan Heatmap Myostatin Mutation Scan Heatmap Myostatin Looking at the two heatmaps, they aren’t very similar, as the predicted structure, has selectively high probabilites in random spots and else rather low probabilites. Its completely missing the 3 super clusters of low confidence, like the original one Generating sequences… New Sequence:GQNIVAGPGAAITECSLWPKTVDFKAAGYDWVISPKSYERNYCSGTCTSSXXXXXXXXXXXXXXGPDNVTEKRCVPTETAPITMTYSLGDGKIITETVPNQIVKACGCV

The predicted sequence isn’t similar to the originally procured sequence.

Input this sequence into ESMFold and compare the predicted structure to your original.

Trying to input the newly generated sequence, led to an error. While I cannot determine the reason for said error, my best guess is the large sequence of “X”.

Part D: Group Brainstorm on Bacteriophage Engineering

Due to later start of our Node, we had limited time to find groups and set up a meeting, therefore the drafts of our group are mainly individual, and not discussed

Goal

We target two complementary objectives: (A) Increased stability of the L protein, specifically engineering DnaJ-independent variants that fold correctly without host chaperone assistance; and (B) Higher toxicity / faster lysis, by optimizing the transmembrane oligomerization interface to accelerate pore formation. Goal A is prerequisite to Goal B: a stable, chaperone-independent L is resistant to the most documented E. coli escape mechanism (DnaJ P330Q mutation), and faster lysis narrows the window for resistance acquisition.

Scientific Rational

Three findings define our design space.

  1. DnaJ binds the highly basic N-terminal domain (res. 1–36) of L and relieves a steric inhibition blocking target engagement; removing this domain eliminates DnaJ dependency and accelerates lysis (Chamakura, J Bacteriol 2017).
  2. Near-saturating mutagenesis shows the LS motif (Leu48-Ser49) and flanking residues form a heterotypic interface with an unknown target; exquisitely conservative mutations matter (L44V = dead, L44I = functional) and all are recessive, pointing to a specific binding event rather than membrane disruption (Chamakura, Microbiology 2017).
  3. MS2-L oligomerizes into 10+ mers in nanodisc membranes via its TM domain; cryo-EM shows large envelope lesions starting at the outer membrane (Mezhyrova et al., 2023).

Strategy: neutralize basic charges in Domain 1 so DnaJ is no longer required, while leaving Domains 2–4 (the lytic machinery) untouched.

Computational Tools

ToolApplicationWhy it helps
Clustal OmegaAlign L homologs to identify which aminoacids are freely mutableReproduces and extends the LS-motif alignment from Chamakura (2017). Essential first step: tells us where NOT to mutate.
ESMFoldPredict 3D structure and each designed variant; verify the TM helix remains intact after mutationsFast single-sequence predictor. For a 75 aa peptide with few homologs, much more practical than full AlphaFold for screening many candidates.
AlphaFold-MultimerModel the L–DnaJ complex; confirm charge-neutralized variants show reduced interface confidence. Also model L–L homodimers to check TM packing.Key validation for Goal A: if predicted L–DnaJ interface weakens for our variants, that supports DnaJ independence.
ProteinMPNNInverse folding: redesign Domain 1 (res. 1–36) to be uncharged while fitting the ESMFold-predicted backbone. Domains 2–4 fixed as hard constraints.new sequence for existing fold with position-specific constraints. Generates diverse candidates we can then filter with ESM-2.
ESMZero-shot fitness scoring: rank all candidate variants by pseudo-log-likelihood as a sequence-level sanity checkIndependent of structure prediction. Benchmarked first against known mutants — if it captures L biology, we use it to filter; if not, we rely on conservation alone.

Schematic

Schematic Bacteriophage Schematic Bacteriophage

Pitfalls

We cannot model the most critical interaction (L with its unidentified host target) computationally. ML models may not capture L biology, as L is a 75 aa phage toxin with very few homologs, far outside the training distribution of ESM-2 and AlphaFold

Use of Generative AI

Generative AI was used as a conceptual drafting aid during the development of this project. Specifically, it supported the structuring and refinement of complex ideas related to protein design and use of computational tools. It was instrumental in drafting the computational strategy of engineering the MS2 Lysis Protein L, as well as clarifying the scientific concepts in the related reading. The AI was used to iteratively improve clarity of language and to explore alternative conceptual framings. All final judgments were made by the author. The link for the prompts and responses is attached in the repository.

Week 5 HW: Protein Design II

Part A: SOD1 Binder Peptide Design (From Pranam)

Part 1: Generate Binders with PepMLM

Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.

The Protein can be found with this link The Protein sequence is:

>sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2 MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:

Colab Notebook was copied to my personal drive. I applied for access, even though that doesn’t seem to be necessary anymore.

Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.Record the perplexity scores that indicate PepMLM’s confidence in the binders.

I checked the single-sequence checkmark and pasted the protein sequence above. I gave it the job name “SOD1_fk_v1”. Afterwards I continued with the model and generated several sequences. This had to be done in different runs as most batches contained sequences with an “X”, which later would not be accepted by AlphaFold. In the end I decided to mix and match from different batches. These are the sequences (aa1 - aa4) with the reference sequence being named aa0. The following table recodes the sequence and the perplexity score.

IDSequencePseudo Perplexity
aa0FLYRWLPSRRGG-
aa1AHYGVLAAAVKWRRK15.4397
aa2SRYDVYVGRVKARAK18.3568
aa3WRYDPVTGRYAAKKA9.3430
aa4SWVPVYTAVVKLKRK20.8359

Part 2: Evaluate Binders with AlphaFold3

Navigate to the AlphaFold Server: alphafoldserver.com

For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.

aa0 aa0 aa1 aa1 aa2 aa2 aa3 aa3 aa4 aa4

Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?

IDSequencePseudo PerplexityipTMpTMBindung SiteBinding Format
aa0FLYRWLPSRRGG-0.320.79- In the same region as N-terminus, parallel to the beta barrel
- in the same region as the dimer interface (the dimer interface is where N-Terminus and C-Terminus meet, or easier, the beginning and the end of the protein. It has the form of a strand
Surface BoundSurface Bound
aa1AHYGVLAAAVKWRRK15.43970.430.82- On the opposite side of the protein from the N-terminus
- wrapping around the beta barrel from the side and the top
- on the opposite side of the protein from the N-terminus
Surface Bound
aa2SRYDVYVGRVKARAK18.35680.460.87- around a 90 degree away from N-Terminus and therefore the dimer interface
- paralell to the beta barrel, though it has the form of a C
Surface Bound
aa3WRYDPVTGRYAAKKA9.34300.230.84- around a 90 degree away from N-Terminus and therefore the dimer interface
- perpendicular to the beta barrel
- it has the form of a C, two beta sheets on either side with the belly of the C pointing towards the protein
mostly surface bound, though more burried than the others
aa4SWVPVYTAVVKLKRK20.83590.420.84- around a 120 degree away from N-Terminus and therefore the dimer interface
- paralell to the beta barrel
- shape is random, but wrapped into the protein in a 3D shape
mostly surface bound, though of all the generated the most burried than the others

In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

All ipTM values exceed the refrence linker, except aa3, the, though all pTM values exceed the reference sequence. After discussion with TA all values above 0.8 constitute a good confidence score that the overall structure is correct.

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:

Paste the peptide sequence. Paste the A4V mutant SOD1 sequence in the target field. Check the boxes Predicted binding affinity Solubility Hemolysis probability Net charge (pH 7) Molecular weight

Peptiverse Screenshot Peptiverse Screenshot

Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see.

All peptides show weak binding affinity. This is somewhat expected from the AlphaFold data, as the generated Peptides all show only surface level binding.

Do peptides with higher ipTM also show stronger predicted affinity?

According to a quick trendline analysis, the relationship is negative. The affinity scores scatter around 6.27 and with a standard deviation of around 200.

Are any strong binders predicted to be hemolytic or poorly soluble?

My predictions didn’t produce any strong binders, therefore I cannot answer the question. All predicted peptides are non-hemolytic (values range between 22 and 77) and soluable

Which peptide best balances predicted binding and therapeutic properties?

As non of my peptides have strong binding, while they have good therapeutic properties, the sequence with the highest binding affinity is the best balance. Which is aa3, that does have the lowest ipTM score.

Choose one peptide you would advance and justify your decision briefly.

I’d choose aa3, as the sample is doesn’t produce a predicted sequence that has a strong binding affinity, while the therapeutic properties are solid, the peptide with the highest binding affinity is chosen.

Part 4: Generate Optimized Peptides with moPPIt

Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

Open the moPPit Colab linked from the HuggingFace moPPIt model card
Make a copy and switch to a GPU runtime.
In the notebook:
Paste your A4V mutant SOD1 sequence.
Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
Set peptide length to 12 amino acids.
Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.

Unfortunately I wasn’t able to complete this part as google didn’t allow for the use of A100 or L4 GPU. I got the tried with T4 GPUs but the first cell ran in a infinite loop till my GPU credits ran out.

After generation, briefly describe how these moPPit peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

See above

Part B: BRD4 Drug Discovery Platform Tutorial (Gabriele) [optional for Committed Listeners]

Part 0: Sign-up to Boltz Lab

I signed up for access to the Boltz Lab, though as of writing the Homework i have not received any credentials

Part 1: Structural Predictions in the Sandbox

Part 2: Setting Up a BRD4 Design Project

Part 3: Running Your Virtual Screen

Part 4: Analysis and Discussion

Part C: Final Project: L-Protein Mutants

Option 1: Mutagenesis

Designing these mutants with good computational confidence is hard. It will show you limitations of some of the structure based models. Ultimately, you can pick various combinations of mutations and get lab results and then decide to pick the next round of mutations, but this assay will not be easy to run at scale in this class.

I copied the Colab notebook and worked through it to the best of my ability. Any experimental work will be done in the BioPunk Node at a later time, as per discussion with Eliott Roth (Our Node Leader).

Run this notebook to generate for each position in the amino acid sequence, a “score” for what would happen to the protein if you mutated into another amino acid. It can be positive or negative for the protein. We want to identify possible mutations that are “positive” If you run this notebook - you will see a .csv file in the sidebar. You can download it and look at it in the google sheets if that’s easier.

Running the Colab notebook, gave me several outputs, unfortunately they were quite badly documented therefore all information should be used with caution. The first run analyses the entire protein sequence and produces an exhaustive list of potential mutation sites: Model 1

PositionWild_Type_AAMutation_AALLR_Sccore
50KL2,561468
29CR2,395427
39YL2,24178
29CS2,04315
9SQ2,014325
29CQ1,997049
29CP1,971029
29CL1,960646
50KI1,928801
53NL1,864932
61EL1,818098
52TL1,813968
50KF1,802069
29CT1,797247
29CK1,795878
5FQ1,795244
5FR1,659717
29CA1,648656
27YR1,628061
22FR1,602028
5FP1,596891
50KV1,594576
50KS1,574557
5FT1,559024
5FS1,556417
45AL1,539248
39YS1,517457
27YS1,497053
40VL1,47763
27YL1,474637
22FS1,423358
29CE1,383281
39YA1,364999
29CN1,362601
50KA1,357795
29CI1,344121
5FL1,332615
17NR1,323651
39YI1,320103
39YT1,302804
26DR1,268762
29CH1,246107
39YF1,245851
39YV1,24439
23KR1,236555
25ER1,22935
24HR1,227779
50KT1,222131
27YQ1,218851
27YT1,215567

Model 2

IDAmino AcidPositionLLR_Score
0L502,561468
1L392,24178
2I501,928801
3L531,864932
4L521,813968
5F501,802069
6V501,594576
7S501,574557
8L451,539248
9S391,517457
10L401,47763
11A391,364999
12A501,357795
13I391,320103
14T391,302804
15F391,245851
16V391,24439
17T501,222131
18L541,12086
19R391,064191

Model 3

PositionWild_Type_AAMutation_AALLR_Score
50KL2,561468
29CR2,395427
39YL2,24178
29CS2,04315
9SQ2,014325
29CQ1,997049
29CP1,971029
29CL1,960646
50KI1,928801
53NL1,864932

Model 4

PositionWild_Type_AAMutation_AALLR_Score
50KL2,561468
29CR2,395427
29CR2,395427
39YL2,24178
29CS2,04315
29CS2,04315
9SQ2,014325
9SQ2,014325
29CQ1,997049
29CQ1,997049

Here is the Heatmap generated from the Colab Notebook. Phage Mutation Heatmap Phage Mutation Heatmap

Use the experimental data here. This dataset contains information about mutants of the L-Protein and their effect on lysis in the lab - L-Protein Mutants

I copied the csv file from the website to the googlesheet and the Colab.

First check, does the experimental data correlate with the scores from the notebook in (b)? This should give you a clue on how well these language embeddings capture information about this protein sequence.

Let’s first look at the experimental data. Here the positions, mutations are displayed, together with a binary, whether the lysis has happened. Therefore one needs to look at the lysis = 1. These are 35 entries. Now one needs to compare, whether the positions are in the model generated mutation sites.

I’m unsure what the instructors mean with correlating the scores from the experimental data with the predicted data. The above mentioned workflow is a comparison of targets. In a next step on can check, whether the identified targets have a positive LLR score.

Using information about the effect of protein mutations at these sites - both the scores and the experimental data in the drive, come up with 5 mutations for each student along with how you came up with them and why you believe they would work. 2 of the variants you submit must have mutations in the transmembrane region (refer to notes above on what amino acid positions these are) and 2 of them must be in the soluble region . Remember that you can also use the pBLAST to see which residues are conserved and not mutate them if you want to. One easy way to generate sequence mutations could be to look for residue positions and mutations that have a positive mutational effect either in the experimental or have a positive score from step 1. And pick a combination of those mutations.

IDModelPositionWildtypeMutation
1161EL
2150KF
3253NL
4252L
5353NL

You can utilize Af2_Multimer to generate a Multimeric Assembly; you can do this by making your query sequence as. We want to do this because - A running hypothesis for how this protein functions is that it assembles to make a perforation in the bacterial membrane.

No Colab was provided for that. If applicable the colab from option 2 could be utilized. This can be submitted at a later point.

Use of Generative AI

Generative AI was used as a conceptual drafting aid during the development of this project. Specifically, it supported the structuring and refinement of complex ideas related to protein design and use of computational tools. It was instrumental in explaining the methods used in the colab notebook to generate mutations of the lysis protein, as well as clarifying the scientific concepts in the related reading. The AI was used to iteratively improve clarity of language and to explore alternative conceptual framings. All final judgments were made by the author. The link for the prompts and responses is attached in the repository.

Week 6 HW: Genetic Circuits I

DNA Assembly

Answer these questions about the protocol in this week’s lab:

What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?

I didn’t find anything pertaining the matter in the protocol itself, though a quick websearch (Link 1, Link 2), revealed that the PCR Master Mix contains 4 main ingreadients.

  1. Phusion DNA Polymerase
    • Copying the Gene
  2. Deoxynucleotide Triphosphates (dNTPs)
    • Building blocks (ATCG) used by the polymerase to synthesize the new strand of DNA
  3. Reaction Buffer
    • Maintaining the ideal environment for the polymerase to function
  4. Magnesium Chloride
    • Cofactor for the polymerase reaction

What are some factors that determine primer annealing temperature during PCR?

The annealing temperature Ta is determined by the melting temperature Tm, around 3-5°C lower. The melting temperature is determined by the primer length and the GC content.

  1. Primer length and GC Content: The longer the primer length, the and the higher the GC content, the higher the annealing temperature can become. GC have three hydrogen bonds, that needs more energy (temperature) to break.
  2. Salt and Ion Concentration: DNA backbone is negatively charged and two strand would repel each other. The salt provides cations to cancel the negative charge.
  3. Primer Concentration: Similar logic as 1). The higher the concentration of the primer the higher the annealing temperature

There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.

In short PCR builds DNA up from scratch, while restriction digests cut existing DNA down. PCR binds to single stranded DNA and complements the strand with free flowting dtNs, effectively “negative copying” the DNA. It needs little input DNA. The polymerase docks at custom primers. It can output billions of strands of DNA. The PCR Protocol is thermocyclic and consists of denaturation (95°C), annealing (55-65°C) and extension (72°C). This process is then repeated 30-40 times. Restriction Enzyme Digests cuts DNA fragments by enzymatic cleavage. It needs large amounts of clean DNA fragments. The enzyme binds to restriction sites and cuts them. Protocol for restriction enzyme digest is isothermal. The main steps include mixing DNA, restriction enzyme and buffer, incubation at 37°C for 30 min, then inactivation of the enzyme (and the reation) through heat (around 65-85°C).

The two methods are very different, but from the description it gets clear when to pick each method. PCR should be used when you have little samples, when you want to add sequences, like tags and the sequence is known (essential for primer design) Restriction enzymes should be used when you are cloning (as you need sticky sites, checking the validity of a plasmid, or you’re working with unknown sequences.

How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?

Gibson Assembly operates with homologous overlapping ends, and doesn’t require specific restriction sites or leave behind “scars” like traditional restriction cloning. Therefore the main requirement for a successful Gibson Assembly is the correct design of the 20-40bp long overlap.

How does the plasmid DNA enter the E. coli cells during transformation?

There are several transformation methods, like heat shock, electroporation or conjugation, etc. The lab went into detail with heat shock, therefore this will be the focus of the answer. The cells and plasmids are suspended in a cold calcium chlorid solution (mostly in an ice bucket), then heated up in a waterbath or PCR machine (42°C) for 30-60 seconds.

Here is a detailed instruction video

Describe another assembly method in detail (such as Golden Gate Assembly)

I choose Golden Gate Assembly.

Explain the other method in 5 - 7 sentences plus diagrams (either handmade or online).

Golden Gate Assembly is a molecular cloning method that uses Type IIS restriction enzymes to assemble multiple DNA fragments into a single vector in one reaction. Type IIS enzymes (like BsaI or BbsI) cut at a specific distance away, leaving behind 4-base “overhangs” that are not resticted by the enzyme’s binding sequence, unlike traditional enzymes that cut within their recognition site. Plainly they have a recognition site, where they dock, but then instead of cutting in their recognition site, they reach over and cut a few basepairs away. By thoughtfully designing these overhangs, you can ensure that multiple fragments anneal only in a specific, predetermined order. Because the recognition sites are oriented to be “cut out” during the reaction, the final assembled product lacks the original enzyme sites, making the process irreversible . This “one-pot” reaction combines digestion and ligation in a single tube, allowing for the seamless assembly of multiple fragments simultaneously.

Model this assembly method with Benchling or Asimov Kernel!

Asimov Kernel

NOTICE 2026-03-17: As of the deadline, we have not received access to the Asimov Kernel

Create a Repository for your work

Create a blank Notebook entry to document the homework and save it to that Repository

Explore the devices in the Bacterial Demos Repo to understand how the parts work together by running the Simulator on various examples, following the instructions for the simulator found in the “Info” panel (click the “i” icon on the right to open the Info panel)

Create a blank Construct and save it to your Repository

Recreate the Repressilator in that empty Construct by using parts from the Characterized Bacterial Parts repository

Search the parts using the Search function in the right menu

Drag and drop the parts into the Construct

Confirm it works as expected by running the Simulator (“play” button) and compare your results with the Repressilator Construct found in the Bacterial Demos repository

Document all of this work in your Notebook entry - you can copy the glyph image and the simulator graphs, and paste them into your Notebook

Build three of your own Constructs using the parts in the Characterized Bacterials Parts Repo

Explain in the Notebook Entry how you think each of the Constructs should function

Run the simulator and share your results in the Notebook Entry

If the results don’t match your expectations, speculate on why and see if you can adjust the simulator settings to get the expected outcome

Week 7 HW: Genetic Circuits Part II

Part 1

What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?

The key advantage of intracellular analog neural networks (IANNs) over traditional genetic circuits lies in the shift from discrete logic to continuous computation. Classical genetic circuits are typically engineered as Boolean systems: inputs are interpreted as “on” or “off”. This abstraction is convenient for engineering and design, but it is fundamentally misaligned with how biology actually operates, where signals exist as continuously varying concentrations and reaction rates.

IANNs, in contrast, operate on these continuous biochemical signals directly. Instead of forcing a threshold-based decision, where a small fluctuation can flip the output entirely, they produce graded responses. In practice, this means a cell can scale its response to stimulus intensity rather than switching abruptly, which is far more consistent with natural regulatory systems.

A second major advantage is robustness as biological systems are inherently noisy. Boolean circuits are fragile under these conditions because they rely on precise thresholding. IANNs mitigate this by distributing computation across multiple interacting components. Much like artificial neural networks, they integrate signals through weighted interactions, which effectively averages out noise and reduces the likelihood of incorrect responses.

Closely related to this is the issue of scalability. As traditional genetic circuits grow in complexity, the number of required components increases rapidly, often becoming impractical to implement in living cells. IANNs compress complexity into tunable parameters such as binding affinities and interaction strengths. This allows them to represent much more complex input–output relationships with fewer physical parts, making them more feasible for real biological systems.

Another important distinction is the type of functions each approach can represent. Boolean circuits are naturally suited for discrete decision boundaries, but they struggle with complex, nonlinear mappings across many inputs. IANNs, on the other hand, can approximate arbitrary nonlinear functions. This enables cells to perform tasks that resemble pattern recognition, such as distinguishing subtle combinations of molecular signals that define a disease state or environmental condition.

Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.

Useful applications of an intracellular analog neural network (IANN) can be cell-based diagnostics and therapy of cancer, where a cell must interpret a complex molecular environment and make a precise decision about whether to release a therapeutic drug. This is a setting where Boolean genetic circuits struggle, because disease states are rarely defined by single markers or sharp thresholds; instead, they emerge from subtle, multidimensional patterns of signals,which are impractical to engineer with boolean logic gates.

  1. Concept
    Engineering an immune cell that can distinguish between healthy tissue and tumor tissue by integrating multiple intracellular and extracellular signals. The cell evaluates a combination of cues, such as surface receptors, cytokine levels, metabolic indicators, and stress signals (pick peer reviewed literature for input adoption) and determines whether the overall pattern corresponds to a malignant state. If the pattern matches, the cell releases a therapeutic molecule; if not, it remains inactive.
  2. Input The inputs to the IANN would be continuous biochemical signals, for example:
    • Expression levels of tumor-associated antigens
    • Cytokine concentrations such as IL-6 or TGF-β in the microenvironment
    • Indicators of hypoxia or altered metabolism
  3. Output
    The output is also continuous but can be coupled to thresholded downstream actions. For example:
    • Low output function → no response (cell remains inactive)
    • Intermediate output → mild response (e.g., secretion of signaling molecules to recruit immune cells)
    • High output → full activation (e.g., cytotoxic killing, release of therapeutic proteins, or induction of apoptosis in the target cell)
  4. Advantages This application benefits directly from the properties of IANNs. The system must interpret noisy, overlapping biological signals and map them to a decision boundary that is not easily expressible as a simple logical rule. An IANN can approximate this complex mapping, remain robust to fluctuations, and avoid false positives that would arise from single-marker detection.
  5. Limitations Two main limitations apply. First, there is the implementation issue: encoding precise weights and nonlinearities in biochemical systems is difficult, as binding affinities and expression levels are not infinitely tunable and are subject to drift over time. Second, training and calibration remain a major challenge; unlike digital circuits, the correct parameter set must often be found through iterative screening or evolution, which is experimentally intensive.

Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation.Draw a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.

Diagram multilayer perceptron Diagram multilayer perceptron

Part 2

What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?

Fungal materials usually based on mycelium (the filamentous network of fungi) are increasingly used as sustainable alternatives to plastics, foams, leather, and some construction materials.

  1. Mycelium packaging materials These are grown by letting mycelium bind loose biomass into a foam-like structure and are used for protective packaging (e.g., replacing Styrofoam). Their main advantage is that they are fully biodegradable, low-energy to produce, and utilize waste streams as feedstock. However, they suffer from lower mechanical consistency, reduced moisture resistance, and currently higher costs compared to petrochemical foams.
  2. Mycelium-based construction materials These include insulation panels, acoustic materials, and experimental bricks. They offer a very low carbon footprint, good thermal and sound insulation, and natural fire resistance. In contrast, they have much lower compressive strength than concrete or engineered wood, are sensitive to moisture, and face significant regulatory hurdles before widespread adoption in construction.
  3. Mycelium leather and emerging fungal textiles These materials are derived from fungal structures grown into dense mats or processed into fiber-like forms for use in fashion, upholstery, and potentially broader textile applications. They offer clear advantages, including avoiding animal use, significantly reducing environmental impact, and enabling tunable material properties during growth. Additionally, they hold promise as biodegradable alternatives to synthetic fibers produced through low-energy processes. However, they currently face several limitations: lower durability compared to high-quality leather, reliance on coatings to achieve desired performance characteristics, insufficient mechanical strength for many applications, and ongoing challenges in processing and scaling production with consistent quality.

What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?

Fungi represent an underexploited chassis for synthetic biology. Unlike bacteria, they are eukaryotic, naturally adapted to degrade complex substrates, and already widely used in industry. This makes them particularly attractive for engineering applications that require secretion, material formation, or complex metabolic processing.

  1. Enhancing material properties for fungal biomaterials One compelling direction is to engineer fungi to produce mycelium-based materials with improved mechanical strength, water resistance, or functionality. This could involve modifying cell wall composition (e.g., chitin, glucans), introducing crosslinking proteins, or enabling secretion of hydrophobic compounds.
  2. Programming fungi for advanced biomanufacturing Fungi could be engineered to produce high-value compounds such as pharmaceuticals, enzymes, or specialty chemicals, especially those requiring post-translational modifications. Compared to bacteria, fungi (as eukaryotes) can correctly fold complex proteins, perform glycosylation, and secrete products efficiently into the medium.
  3. Improving degradation of complex waste streams Fungi are naturally capable of breaking down lignocellulose, plastics, and other recalcitrant materials. This enables applications in waste valorization and circular bioeconomy systems.
  4. Developing functional living materials Fungi could be programmed to create living materials that respond to environmental stimuli, such as self-healing construction materials or responsive insulation. This might involve integrating sensing circuits with growth or material production pathways.

Subsections of Labs

Week 1 Lab: Pipetting

cover image cover image

Subsections of Projects

Individual Final Project

cover image cover image

Group Final Project

cover image cover image