I have included my Opentrons work, answers to post-lab questions, and 3 early-stage project ideas in the Week 3 lab section.
Subsections of Homework
Week 1 HW: Principles and Practices
First, describe a biological engineering application or tool you want to develop and why.
I want to develop a closed loop pipeline for peptide engineering that uses Feynman–Kac steering to control diffusion-based protein generation at inference time. The goal is to go beyond zero-shot prediction and instead build an automated engineering cycle that repeatedly:
uses FK steering to bias the next round of generative sampling toward better candidates without needing to retrain the underlying diffusion model
This is inspired by the FK-steering approach, which wraps a diffusion-based protein generator in a sampling scheme so that trajectories are continuously reweighted toward user-defined rewards. In this case, the reward is the experimental readout.
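To make the reweighting idea concrete, here is a minimal toy sketch of a steered sampling loop. It is not the published FK-steering implementation: the "generator" is a one-dimensional random walk standing in for a diffusion model, the reward is a made-up function standing in for an experimental readout, and the names fk_steer, toy_step and toy_reward are hypothetical. It uses a "difference" potential, so trajectories whose reward improved in the last step are more likely to be kept when the particles are resampled.

```python
import math
import random


def fk_steer(init_particles, step, reward, n_steps=10, lam=2.0, seed=0):
    """Toy particle-based steering: propagate, reweight by reward gain, resample."""
    rng = random.Random(seed)
    particles = list(init_particles)
    prev_r = [reward(x) for x in particles]
    for _ in range(n_steps):
        # 1) Propose: advance every particle with the (unmodified) generator.
        particles = [step(x, rng) for x in particles]
        r = [reward(x) for x in particles]
        # 2) Reweight: upweight particles whose reward went up in this step.
        weights = [math.exp(lam * (ri - pi)) for ri, pi in zip(r, prev_r)]
        # 3) Resample particles in proportion to their weights.
        idx = rng.choices(range(len(particles)), weights=weights, k=len(particles))
        particles = [particles[i] for i in idx]
        prev_r = [r[i] for i in idx]
    return particles


def toy_step(x, rng):
    # Hypothetical stand-in for one denoising step of a diffusion model.
    return x + rng.gauss(0.0, 0.3)


def toy_reward(x):
    # Hypothetical stand-in for a wet-lab readout; best value is x = 2.
    return -(x - 2.0) ** 2


start = [0.0] * 200
steered = fk_steer(start, toy_step, toy_reward)
mean_start = sum(toy_reward(x) for x in start) / len(start)
mean_steered = sum(toy_reward(x) for x in steered) / len(steered)
```

The generator itself is never retrained here; only the choice of which trajectories survive changes, which is the property that makes this kind of steering attractive for a lab loop.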
Peptides are a good choice for this project as they are often fast to synthesize and test, making them compatible with iterative lab loops. However, many properties of peptides we care about (solubility, stability, expression, off-target behavior) can be hard to optimize from prediction alone so a wet-lab loop is attractive. Functionally, they can serve as binders, inhibitors, diagnostic reagents, or modular parts in synthetic biology pipelines.
As a concrete MVP within this class, I hope to learn how to perform the wet-lab experiments associated with this project and finish at least 1 cycle. In the medium term, I would like to run comparisons between different computational approaches like simple finetuning or RL. In the long term, I would like to use this method to discover therapeutic proteins.
Next, describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future, like ensuring non-malfeasance (preventing harm). Break big goals down into two or more specific sub-goals.
Closed loop design could be repurposed to create harmful biomolecules. Governance should reduce the probability of both deliberate misuse and accidental creation of dangerous function. Thus, one major goal would be to prevent misuse. As sub goals, the following may be good options:
Ensure the system does not optimize toward harmful or restricted targets/functions.
Reduce the chance that hazardous sequences are synthesized without review.
Ensure that there are audit trails and responsible-use norms.
Next, describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”).
I propose three governance actions spanning institutional review, synthesis controls, and a logging infrastructure.
Option 1: Institutional Review
Purpose: Add structured risk assessment before synthesis, target changes, or new reward functions in academic protein design projects.
Assumptions: Small review gates can enforce good record-keeping practices
Risks: Could push students to under-report. If too strict, it may slow down R&D.
Option 2: Synthesis Controls
Purpose: Require synthesis vendors to use functional or homology-based screening.
Design: Institutions only purchase from vendors who screen orders and verify customers
Assumptions: It is possible to do screening meaningfully well to reduce risk
Risks: The screening needs to be highly accurate to catch edge cases which could have massive negative effects
Option 3: Logging Infrastructure
Purpose: Create a secure shared database that tracks when AI tools generate protein designs
Design: Logging of AI tools and cross-referencing of orders.
Assumptions: Confidentiality and transparency can be balanced
Risks: Security or confidentiality concerns from hacking or from sensitive IP
| Does the option: | Option 1 | Option 2 | Option 3 |
|---|---|---|---|
| Enhance Biosecurity | | | |
| • By preventing incidents | 2 | 1 | 2 |
| • By helping respond | 1 | 2 | 1 |
| Foster Lab Safety | | | |
| • By preventing incidents | 1 | 2 | 3 |
| • By helping respond | 1 | 2 | 1 |
| Protect the environment | | | |
| • By preventing incidents | 2 | 2 | 3 |
| • By helping respond | 2 | 2 | 1 |
| Other considerations | | | |
| • Minimizing costs and burdens to stakeholders | 2 | 2 | 2 |
| • Feasibility? | 1 | 2 | 3 |
| • Not impede research | 1 | 2 | 1 |
| • Promote constructive applications | 1 | 2 | 2 |
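As a quick sanity check on the scores, the columns can be totalled. This sketch assumes that 1 is the best score and 3 the worst, and that every criterion counts equally; both are my assumptions rather than part of the rubric.

```python
# Scores copied from the matrix, as (Option 1, Option 2, Option 3).
scores = {
    "Biosecurity: preventing incidents": (2, 1, 2),
    "Biosecurity: helping respond": (1, 2, 1),
    "Lab safety: preventing incidents": (1, 2, 3),
    "Lab safety: helping respond": (1, 2, 1),
    "Environment: preventing incidents": (2, 2, 3),
    "Environment: helping respond": (2, 2, 1),
    "Minimizing costs and burdens": (2, 2, 2),
    "Feasibility": (1, 2, 3),
    "Not impede research": (1, 2, 1),
    "Promote constructive applications": (1, 2, 2),
}

# Unweighted column sums; lower is better under the 1 = best assumption.
totals = [sum(column) for column in zip(*scores.values())]
for name, total in zip(("Option 1", "Option 2", "Option 3"), totals):
    print(name, total)
```

Under these assumptions Option 1 has the lowest total (14), while Options 2 and 3 tie at 19, so any ordering between those two has to come from judgment rather than from the sums.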
Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties.
In order of priority:
Option 1: This option can arguably be implemented the fastest. MIT already has the safety infrastructure (IBC, EHS) to build on. As a leading institution in AI protein design, MIT can set standards that others follow. A well-designed, lightweight review process could become a widely adopted model.
Option 2: The existing government framework provides a strong template with vendor screening, customer verification, and reporting requirements. However, this depends on federal action and industry cooperation beyond MIT’s control. MIT can help by researching better screening algorithms and influencing government gold standards.
Option 3: If this project becomes a widely used system, tracking who designed what becomes relatively easy. However, the system will have to be designed extremely well to be scalable, secure, and transparent yet confidential.
Tradeoffs:
Speed vs. safety
Open science vs. closed science
Transparent vs. confidential
Key Uncertainties:
How manageable it is to manually gate research directions.
How well screening actually works against deliberate misuse.
How feasible it is to design a logging system everyone is happy with.
Reflecting on what you learned and did in class this week, outline any ethical concerns that arose, especially any that were new to you. Then propose any governance actions you think might be appropriate to address those issues. This should be included on your class page for this week.
Unfortunately, I was ill this week so I was not able to attend class.
Week 2 HW: DNA Read, Write, & Edit
Gel Electrophoresis Designs
Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks
I have created an image of Mount Fuji with clouds in the sky. I have inverted the image so it is easier to visualize.
Note: Since we worked in groups during lab this week, we created a different design than the one shown above for the lab activity.
DNA Design Challenge
Choose your protein.
RES-701-3 is a tiny natural protein made by soil bacteria (Streptomyces). It belongs to a family called lasso peptides, named because their structure looks like a lasso or slipknot. The tail of the protein threads through a loop, creating a knot that is extremely hard to unravel.
This knotted shape makes lasso peptides unusually tough. They resist being broken down by digestive enzymes, heat, and harsh chemical environments. These are properties that most proteins lack, and that make them attractive as potential drugs.
RES-701-3 blocks a receptor on the surface of blood vessel cells called the endothelin type B receptor (ETB). The endothelin system controls blood vessel tightening and relaxation, and becomes dysregulated with age, contributing to high blood pressure and vascular disease. RES-701-3 acts as an inverse agonist, meaning it blocks the receptor and pushes it toward a less active state than its resting baseline.
In nature, the bacterium makes this peptide in two parts:
Leader section: MSDITLTPMDLLDLDELAAGGGRSTARE
Core peptide sequence: GNWHEPEIDGWNPHGW
An enzyme cleaves the leader from the core, which makes the core active.
Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.
The nucleotide sequences of the leader and the core are shown, in that order.
Because of evolution, each species has codons that it uses frequently and has abundant matching transfer RNAs (tRNAs) for, and codons that it rarely uses and has few tRNAs for.
RES-701-3 comes from Streptomyces, which strongly prefers codons loaded with G and C. Twist offers Streptomyces coelicolor as an organism for codon optimization.
However, it’s worth mentioning that a 2025 paper by Shihoya et al. used Streptomyces venezuelae as the host organism and achieved the highest reported yields. If I were in a real drug development setting, I might go with this.
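As a rough illustration of what GC-biased reverse translation does, the sketch below picks one G/C-ending codon per amino acid. The codon table is my own hand-picked choice for illustration, not a measured Streptomyces usage table and not the output of the Twist optimizer, which also weighs things like repeats and synthesis constraints.

```python
# One GC-rich codon per amino acid (illustrative choice, an assumption).
GC_RICH_CODON = {
    "A": "GCC", "R": "CGC", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCG",
    "S": "TCC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTC",
}

LEADER = "MSDITLTPMDLLDLDELAAGGGRSTARE"
CORE = "GNWHEPEIDGWNPHGW"


def reverse_translate(protein):
    """Map each amino acid to its chosen codon and join the codons."""
    return "".join(GC_RICH_CODON[aa] for aa in protein)


def gc_fraction(dna):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in dna) / len(dna)


dna = reverse_translate(LEADER + CORE)
print(dna)
print(f"length = {len(dna)} nt, GC = {gc_fraction(dna):.2f}")
```

Always using a single codon per amino acid is the simplest possible scheme; a real optimizer would sample among synonymous codons according to the host's usage frequencies.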
Here is the codon optimized variant for both leader and core together:
Ribosome Binding Site: We’re using the Shine-Dalgarno (SD) sequence, AAGGAG, which is supposed to be a good RBS for Streptomyces with leaders. It is supposed to be positioned 6 to 10 nucleotides upstream of the start codon, so we will use 7 nucleotides. We’re going to put spacers before and after the SD sequence, CGACG and ACAC respectively.
CGACGAAGGAGACAC
Start Codon: This is just going to be the usual ATG.
Coding Sequence: We are going to put both our leader and core peptide sequences together here.
His tag: This is a short string of six histidine amino acids added to the protein so you can fish it out of a mixture using a nickel column. The histidines stick to nickel, letting you pull your protein out of everything else the cell makes. However, in practice, a His tag is apparently not a good addition for RES-701-3 because it would interfere with binding to the ETB receptor.
CACCACCACCACCACCAC
Stop Codon: TGA tells the ribosome to stop building the protein here. TGA is the preferred stop codon in Streptomyces because it is, relatively speaking, GC-rich, matching the organism’s DNA preferences as discussed before. By comparison, a typical stop codon elsewhere is TAA.
Terminator: Tells the cell’s RNA-copying machinery to stop making mRNA. Without it, the cell would keep reading past your gene into random neighboring DNA. We’re using the fd terminator from a bacteriophage which is commonly used in Streptomyces expression vectors.
GGATCCAAACTCGAGTAAGGATCTCCAGGCATCAAATAAAACGAAAGGC
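The parts above can be joined in order with a small helper. This is a sketch under a few assumptions of mine: the coding sequence passed in is a short hypothetical placeholder (not the real optimized leader and core), a leading ATG in the coding sequence is treated as the start codon so it is not duplicated, and the His tag is optional since it may interfere with receptor binding.

```python
# Part sequences copied from the design above.
RBS = "CGACGAAGGAGACAC"           # spacer + SD (AAGGAG) + spacer
START = "ATG"
HIS_TAG = "CAC" * 6               # six histidine codons
STOP = "TGA"
TERMINATOR = "GGATCCAAACTCGAGTAAGGATCTCCAGGCATCAAATAAAACGAAAGGC"


def build_cassette(cds, with_his_tag=True):
    """Join RBS, start codon, CDS, optional His tag, stop codon, terminator."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    if cds.startswith(START):
        cds = cds[3:]             # avoid a duplicated start codon
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if any(codon in ("TAA", "TAG", "TGA") for codon in codons):
        raise ValueError("coding sequence contains an in-frame stop codon")
    parts = [RBS, START, cds]
    if with_his_tag:
        parts.append(HIS_TAG)
    parts += [STOP, TERMINATOR]
    return "".join(parts)


placeholder_cds = "ATGGGCAACTGG"  # hypothetical 4-codon stand-in (M G N W)
cassette = build_cassette(placeholder_cds)
untagged = build_cassette(placeholder_cds, with_his_tag=False)
```

Checking for in-frame stop codons before ordering is cheap and catches the most common copy-and-paste mistake in a design like this.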
Reagents
To produce this peptide, we also need some enzymes as reagents, namely LasB1, LasB2, and LasC. For this lasso peptide, LasB1 binds the leader and delivers the whole precursor to LasB2, which cuts the leader off, and then LasC closes the ring on the core. These reagents don’t seem easy to order, so this peptide wouldn’t be a great choice for the class. In addition, the yield is optimized by using Streptomyces venezuelae, which is also not too common.
Prepare a Twist DNA Synthesis Order
I prepared the lasso peptide order. Here is a picture of the expression cassette in Benchling.
Instead of a clonal gene, I used gene fragments, because the standard cloning vectors are designed for E. coli rather than for Streptomyces.
DNA Read/Write/Edit
5.1 DNA Read
What DNA would you want to sequence (e.g., read) and why?
I would want to sequence the whole genomes of all ~6,000 mammalian species. The largest current collection of mammalian genomes is the Zoonomia project, which contains around 250 whole genomes along with known maximum lifespan data for most of these species. However, expanding this to cover all mammals—paired with their maximum lifespan records—would allow us to train computational models that identify DNA patterns predicting how long a species can live. In short, more genomes means better predictions about which parts of DNA are linked to longevity.
In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?
Illumina short-read sequencing (second-generation): This produces highly accurate short reads (~150–300 base pairs) and is great for spotting small genetic differences between species.
Is your method first-, second-, or third-generation?
I am using second-generation Illumina sequencing. First-generation refers to older Sanger sequencing, which reads one fragment at a time and is too slow and expensive for whole genomes. Second-generation sequences millions of short fragments in parallel, making it fast and cheap.
What is your input? How do you prepare your input?
The input is genomic DNA extracted from tissue or blood samples of each mammalian species. The essential preparation steps are:
DNA extraction: Isolate high-quality DNA from the biological sample.
Fragmentation: Break the DNA into smaller pieces.
Adapter ligation: Attach short known DNA sequences (adapters) to the ends of each fragment so the sequencing machine can recognize and handle them.
PCR amplification (Illumina): Make many copies of each fragment to boost the signal.
Quality check: Verify the library is the right size and concentration before loading it onto the sequencer.
What are the essential steps of your chosen sequencing technology? How does it decode bases (base calling)?
Fragmented DNA is attached to the glass surface of a flow cell, amplified into clusters, and then sequenced one base at a time. In each cycle, fluorescently labeled nucleotides are added, with each of the four bases carrying a different color. A camera captures which color lights up at each cluster, and the machine records the base. This process repeats hundreds of times to read out each fragment.
What is the output?
The output is digital sequence files, typically in FASTQ format, containing millions of reads (short strings of A, T, C, and G letters) along with quality scores indicating how confident the machine is about each base call. These reads are then assembled and aligned computationally to reconstruct each species’ complete genome.
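To show what those files contain, here is a minimal reader for four-line FASTQ records. It assumes the Phred+33 quality encoding used by current Illumina instruments, where each quality character encodes Q = ord(char) - 33 and the estimated error probability is 10^(-Q/10). The example record is made up for illustration.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, phred_scores) from 4-line FASTQ records."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have exactly 4 lines each")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: subtract 33 from the ASCII code of each character.
        yield header[1:], seq, [ord(char) - 33 for char in qual]


def error_probability(q):
    """Estimated chance that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


example = "@read1\nACGT\n+\nII5!\n"   # made-up record for illustration
records = list(parse_fastq(example))
print(records)
```

A score of 40 corresponds to an estimated 1 in 10,000 chance of a wrong call, and a score of 20 to 1 in 100, which is why low-quality read ends are usually trimmed before assembly.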
5.2 DNA Write
What DNA would you want to synthesize (e.g., write) and why?
Based on the sequencing data above, I would use trained computational models to predict specific DNA sequences associated with high maximum lifespan. I would then synthesize these predicted longevity-linked sequences—for example, specific gene variants or regulatory elements found in long-lived species like bowhead whales or naked mole-rats—so they can be tested in cell cultures or animal models. The goal is to move from computational prediction to experimental validation: do these DNA sequences actually promote cellular health and longevity?
What technology or technologies would you use to perform this DNA synthesis and why?
Oligonucleotide synthesis from Twist Bioscience: For building short to medium DNA fragments (up to a few thousand base pairs). Twist uses chemical synthesis on microchips to build many sequences in parallel, making it fast and affordable.
Gibson Assembly or Golden Gate Assembly: For stitching shorter synthesized fragments together into larger constructs. These are molecular cloning methods that use enzymes to join DNA pieces seamlessly.
What are the essential steps of your chosen synthesis method?
Sequence design: Use computational models to design the target DNA sequences, optimizing codon usage for the target organism and avoiding problematic features (e.g., long repeats, extreme GC content).
Oligonucleotide synthesis: Short single-stranded DNA pieces (oligos, ~50–200 bases) are built base by base using chemical reactions on a solid support. Each cycle adds one nucleotide at a time.
Assembly: Overlapping oligos are combined and joined enzymatically into longer double-stranded fragments (a few hundred to a few thousand base pairs).
Cloning: The assembled fragments are inserted into a circular DNA carrier (plasmid vector) and introduced into bacteria, which copy the DNA as they grow.
Verification: The final constructs are sequenced to confirm the correct sequence was built.
Large construct assembly: Multiple verified fragments are stitched together using Gibson Assembly or Golden Gate Assembly to create larger genetic constructs.
What are the limitations of your synthesis method in terms of speed, accuracy, and scalability?
Speed: Synthesizing and assembling long constructs (>10,000 base pairs) can take weeks, since each fragment must be built, verified, and then joined together step by step.
Accuracy: Chemical synthesis introduces errors at a rate of roughly 1 in 200 bases per oligo. These errors can be corrected through screening and verification, but doing so adds time and cost.
Scalability: Very long or repetitive sequences are difficult to synthesize because the oligos may misassemble or fold in unwanted ways. Sequences with extreme GC content are also harder to build reliably.
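To see why that error rate matters, the fraction of error-free oligos can be estimated from the oligo length. This sketch assumes errors occur independently at each base with the same probability (1 in 200), which is a simplification.

```python
def error_free_fraction(length, error_rate=1 / 200):
    """Probability that every base in an oligo is correct,
    assuming independent errors with the same per-base rate."""
    return (1 - error_rate) ** length


for length in (50, 100, 200):
    print(f"{length}-mer: {error_free_fraction(length):.1%} error-free")
```

Under this assumption only a bit more than a third of 200-base oligos are perfect, which is why assembled fragments have to be sequence-verified and why longer constructs need more clones screened.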
5.3 DNA Edit
What DNA would you want to edit and why?
I would want to edit specific genes in model organisms (such as mice) to replace their native sequences with the longevity-associated sequences identified from the analysis above. For example, if the computational model predicts that a certain variant of a DNA repair gene is linked to longer lifespan in mammals, I would edit a mouse’s genome to carry that variant. This would let us test whether swapping in these predicted “long-life” DNA variants actually extends lifespan or improves age-related health outcomes like cancer resistance or cellular repair.
What technology or technologies would you use to perform these DNA edits and why?
I would use CRISPR-Cas9 gene editing, because it is the most precise, versatile, and widely used genome editing tool available. It can make targeted changes at specific locations in the genome of living cells and organisms, and it works well in mammalian systems including mice.
How does your technology edit DNA? What are the essential steps?
Target selection: Identify the exact location in the genome you want to edit.
Guide RNA design: Design a short RNA sequence that matches the target DNA site.
Cutting: The Cas9 protein, guided by the RNA, binds to the matching DNA site and makes a double-strand break.
Repair: The cell’s natural repair machinery fixes the break. If a DNA template with the desired new sequence is provided alongside the CRISPR components, the cell can use it as a blueprint to incorporate the new sequence, called homology-directed repair.
Screening: Edited cells are sequenced to confirm the desired change was made correctly.
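The guide RNA design step can be sketched as a search for candidate target sites. This assumes the commonly used SpCas9 enzyme, which needs an NGG PAM immediately 3' of a 20-nucleotide target; the function names and the example sequence are hypothetical, and a real design tool would also score off-target risk.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_guides(seq, guide_len=20):
    """List (strand, start, protospacer, pam) for every site followed by NGG.

    `start` is the 0-based position on the strand that was scanned.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        # i is the index of the PAM's first base (the "N" of NGG).
        for i in range(guide_len, len(s) - 2):
            if s[i + 1:i + 3] == "GG":
                hits.append((strand, i - guide_len, s[i - guide_len:i], s[i:i + 3]))
    return hits


example = "A" * 20 + "TGG"     # made-up sequence with exactly one site
guides = find_guides(example)
print(guides)
```

The donor template for homology-directed repair would then be designed around whichever of these cut sites sits closest to the intended edit.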
What preparation do you need to do, and what is the input?
Design inputs: The target DNA sequence, a custom guide RNA matching that sequence, and a DNA donor template carrying the desired new sequence flanked by regions that match the area around the cut site.
Molecular inputs: Cas9 protein or mRNA, synthesized guide RNA, donor template DNA, and delivery reagents.
Biological inputs: Target mouse cell.
What are the limitations of your editing method in terms of efficiency or precision?
Off-target edits: The guide RNA can sometimes bind to similar sites elsewhere in the genome, causing unintended cuts and mutations.
Low HDR efficiency: Only a fraction of edited cells may carry the precise desired change, requiring extensive screening.
Delivery challenges: Getting CRISPR components into every target cell efficiently, especially in living animals, remains difficult. Some tissues are harder to reach than others.
Week 3 HW: Lab Automation
I have included my Opentrons work, answers to post-lab questions, and 3 early-stage project ideas in the Week 3 lab section.
Week 4 HW: Protein Design
Part A: conceptual question: Answer any of the following questions from Shuguang Zhang
Why do beta-sheets tend to aggregate?
A beta-strand forms when a protein’s backbone, the repeating NH–Calpha–CO chain shared by every amino acid, stretches out into a nearly flat zigzag. When two or more of these strands line up next to each other and link through hydrogen bonds (where an N–H on one strand bonds to a C=O on the neighbor), you get a beta-sheet. The strands on the outer edges still have a full row of exposed N–H and C=O groups, so another strand can bind there, and then another, and so on.
What forces pull sheets together?
The hydrophobic effect is the biggest one. In a beta-strand, side chains stick out alternately above and below the sheet, so one face can be largely hydrophobic. Two sheets then stack such that the greasy surfaces are buried in the interior.
Hydrogen bonding gives the structure its regularity. Each new strand that joins the sheet edge contributes roughly one H-bond per amino acid along its length. Individually, H-bonds in water are not enormously strong because breaking one with a neighbor just lets you form one with a water molecule instead, but across a strand of ten or more residues, they add up meaningfully.
Van der Waals packing stabilizes sheets that have stacked together. Van der Waals forces are much weaker and shorter-range. They arise from temporary, fluctuating dipoles.
Part B: Protein Analysis and Design
Briefly describe the protein you selected and why you selected it.
I selected a macrocyclic peptide for the following reasons:
They have the ability to interfere with Protein-Protein Interactions (PPI), which is applicable to therapeutics
They have the ability to permeate membranes as they are small and can change conformation depending on the hydrophobicity of the environment
They can be programmed with ML for targeting purposes
Compared to linear peptides, they are more robust to proteases because the N-terminus and C-terminus are hidden from proteases
How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)
Are there any other molecules in the solved structure apart from protein?
Does your protein belong to any structure classification family?
Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
Color the protein by secondary structure. Does it have more helices or sheets?
Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?
Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?
Gel Electrophoresis Designs
Our group tried to make a design with the letters “HA”, the initials of one of our group members, Hines Alayah. Instead, we somehow ended up with “LU”. Discoveries in biology are sometimes made serendipitously, so we have decided that “LU” means Love U.
Here are a few photo highlights below.
Putting the restriction enzymes into the lanes
Preparing buffer
Performing PCR
Pipetting the dye
Separation of the dye
Machine for visualizing the gel electrophoresis results
Result!
Our team
Week 3 Lab: Lab Automation
Opentrons Designs
I tried to push the Opentrons robot to the limit and chose a fairly hard design. Specifically, I chose the Mitsudomoe design, a type of “Kamon” (traditional family crest) associated with my family. The design didn’t come out particularly well, but with higher resolution and/or non-sequential pipetting (for speed), it would be a more tractable design.
Mitsudomoe Design & Opentrons Version
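One possible reading of "non-sequential pipetting" is to dispense the dots in an order that shortens the pipette's travel instead of following the order in which they were drawn. The sketch below uses a greedy nearest-neighbor ordering on made-up coordinates; it does not call the Opentrons API, the function names are hypothetical, and greedy ordering is a simple heuristic rather than the shortest possible route.

```python
import math


def travel(points):
    """Total straight-line distance when visiting points in the given order."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def greedy_order(points, start=(0.0, 0.0)):
    """Visit the closest remaining point next, beginning from `start`."""
    remaining = list(points)
    ordered = []
    current = start
    while remaining:
        nearest = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nearest)
        ordered.append(nearest)
        current = nearest
    return ordered


# Made-up dot positions (mm) listed in a deliberately zig-zag drawing order.
dots = [(0.0, 0.0), (10.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
reordered = greedy_order(dots)
print(travel(dots), travel(reordered))
```

The reordered list could then be fed to whatever loop issues the dispense commands, so the design itself stays unchanged and only the visiting order differs.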
Lab Automation
Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.
The paper I chose is “AssemblyTron: flexible automation of DNA assembly with Opentrons OT-2 lab robots” by Bryant et al., published in Synthetic Biology (2023). The authors developed an open-source Python package called AssemblyTron that connects j5 DNA assembly design software to an Opentrons OT-2 liquid handling robot, allowing users to go from a digital DNA design to a physically assembled construct with minimal hands-on work.
What makes this paper compelling is that it automates the entire “Build” step of the Design–Build–Test–Learn cycle, which is traditionally the most manual and error-prone part. AssemblyTron handles PCR setup (including calculating optimal annealing temperature gradients), DpnI digestion, and final multi-fragment assembly, all on the OT-2. The authors validated the system by performing Golden Gate assemblies and in vivo assemblies of four-fragment chromoprotein reporter plasmids, achieving fidelity comparable to manual assembly. They also demonstrated automated site-directed mutagenesis. The key takeaway is that affordable, open-source automation can make DNA assembly more reproducible, less wasteful, and accessible to labs without expensive biofoundry infrastructure.
Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.
In general, I want to use my adaptive AI system for scientific discovery at a small scale, something realistic as a final project given the resources we have from Twist and Ginkgo Bioworks.
My first idea is a promoter design project to maximize expression. I would order oligos from Twist, clone them into reporters, and observe expression in E. coli. Fluorescence intensity would be recorded as the reward signal. I could possibly do two rounds of this.
As a second idea, the most feasible version would be to ditch the lab-in-the-loop entirely by performing validation in silico. This would also allow for much more complex protein designs since there wouldn’t be a constraint on what is physically feasible to test given the project budget.
As an ideal final project, which is totally not doable in this timeframe or budget, I would use my system to discover higher order transcription factor combinations that forward program iPSCs into a target cell type. The computational engine uses Bayesian optimization to predict TF combinations, balancing exploration and exploitation based on experimental results. To handle the cloning overhead, I would outsource synthesis of polycistronic lentiviral transfer vectors to Ginkgo Bioworks’ Nebula platform, which algorithmically assembles the DNA and returns plasmids in a high throughput 96 well format. Each vector can carry 3 to 4 TFs linked by 2A peptides, and co-transduction with multiple vectors allows testing of even larger combinations.
The OT-2 would then automate lentivirus production by dispensing transfection reagent into arrayed HEK293T packaging cells, harvesting viral supernatant, and transducing iPSC cultures. The robot would also handle the media change schedule post transduction. Because lentivirus integrates into the genome, TF expression is sustained throughout the differentiation window without repeated dosing. At the endpoint, high content phenotypic imaging quantifies differentiation efficiency in each well, and this data feeds directly back into the Bayesian model to predict a more refined batch of TF cocktails for the next automated run.
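The propose, test, and update cycle described above can be sketched in a few lines. To keep it self-contained, this uses Beta-Bernoulli Thompson sampling over a fixed list of cocktails, which is a much simpler stand-in for the Bayesian optimization I have in mind. The TF names, the mock assay, and its hidden success rates are all hypothetical placeholders for the real imaging readout.

```python
import itertools
import random


def run_loop(candidates, assay, rounds=20, wells_per_round=4, seed=0):
    """Closed loop: sample beliefs, pick a batch, test it, update beliefs."""
    rng = random.Random(seed)
    # Beta(1, 1) prior per cocktail, stored as [successes + 1, failures + 1].
    posterior = {c: [1, 1] for c in candidates}
    for _ in range(rounds):
        # Thompson sampling: draw one plausible efficiency per cocktail and
        # send the cocktails with the highest draws to the next plate.
        draws = {c: rng.betavariate(a, b) for c, (a, b) in posterior.items()}
        batch = sorted(candidates, key=draws.get, reverse=True)[:wells_per_round]
        for cocktail in batch:
            differentiated = assay(cocktail, rng)
            posterior[cocktail][0 if differentiated else 1] += 1
    return posterior


def mock_assay(cocktail, rng):
    # Hypothetical ground truth standing in for the imaging endpoint:
    # cocktails containing both TF_A and TF_C differentiate more often.
    p = 0.8 if {"TF_A", "TF_C"} <= set(cocktail) else 0.1
    return rng.random() < p


tfs = ["TF_A", "TF_B", "TF_C", "TF_D", "TF_E"]      # placeholder names
candidates = list(itertools.combinations(tfs, 3))   # 10 three-TF cocktails
posterior = run_loop(candidates, mock_assay)
n_tests = sum(a + b - 2 for a, b in posterior.values())
```

In the real workflow the assay call would be replaced by a full robot run plus imaging, so each round would take days; the point of the sketch is only the shape of the feedback loop.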