---
title: 'Week 1 HW: Principles & Practices'
weight: 10
---

Introduction and Motivation

This week emphasized that biological engineering is not only about what we can build, but how and why we choose to build it. The lectures and recitation highlighted that ethics, safety, and governance should not be treated as external constraints applied after a technology is developed, but rather as integral design dimensions from the earliest stages of a project.
Revisiting a previous biosensing project through the HTGAA framework allowed me to explicitly articulate design decisions that were originally motivated by technical performance, but which also carry strong ethical, safety, and governance implications. This exercise helped me move beyond a purely technical evaluation and reflect more deeply on responsibility, context, and downstream impact.
Biological Engineering Application
The biological engineering application I focus on is a cell-free biosensor based on a Pb²⁺-specific DNAzyme coupled to CRISPR-Cas12a, designed for the ultrasensitive detection of lead in water.
Lead contamination represents a serious public health concern, with no safe threshold for chronic exposure. While analytical techniques such as ICP-MS or atomic absorption spectroscopy provide high sensitivity, they require centralized laboratories, specialized equipment, and trained personnel, limiting their accessibility for frequent or decentralized monitoring.
Previous generations of biological sensors, including whole-cell bacterial biosensors, demonstrated the feasibility of biological detection but suffered from long response times, higher detection limits, and biosafety concerns related to the use of living genetically modified organisms. In contrast, this project deliberately adopts a cell-free, in vitro architecture, translating the presence of Pb²⁺ into a fluorescent signal in under one hour.
The motivation behind this application is to combine high sensitivity, portability, and safety by design, enabling environmental monitoring in settings where conventional laboratory infrastructure is unavailable, while minimizing biological risks.
Governance and Policy Goals
Reframing this project within the HTGAA framework led to the identification of several governance and policy goals that extend beyond technical performance.
Goal A – Prevent harm and misuse (non-malfeasance)
- Avoid enabling biological manipulation or amplification of hazardous agents.
- Prevent repurposing of the sensing platform for unintended or harmful biological activities.

Goal B – Enhance biosafety and biosecurity
- Minimize risks associated with handling living organisms by using a fully cell-free system.
- Reduce the likelihood of accidental environmental release or uncontrolled replication.

Goal C – Promote constructive and equitable use
- Enable access to sensitive environmental monitoring tools without requiring advanced infrastructure.
- Support public health and environmental decision-making rather than surveillance or coercive applications.
Option 1 – Safe-by-design, cell-free system architecture
Purpose: Many biosensing platforms rely on living cells, which introduce biosafety, containment, and regulatory challenges. This project replaces whole-cell systems with a fully cell-free, non-replicative architecture.
Design: This approach is implemented directly by academic researchers during the design phase and can be reinforced by funding agencies that prioritize safe-by-design technologies.
Assumptions
- Eliminating living components significantly reduces biosafety risks.
- Performance can be maintained or improved in vitro.

Risks of Failure and “Success”
- Failure: reduced robustness in complex environmental matrices.
- Success risk: overconfidence in technical safeguards without complementary governance measures.
Option 2 – Transparent documentation of limitations and failures
Purpose: Scientific reporting often emphasizes successful outcomes while underreporting failures. This project explicitly documents experimental failures, matrix effects, and design trade-offs.
Design: Implemented through detailed lab records and public documentation on the course website, supported by academic training and publication norms.
Assumptions
- Transparency improves reproducibility, safety awareness, and ethical reflection.

Risks of Failure and “Success”
- Failure: documentation becomes superficial or performative.
- Success risk: increased reporting burden for early-stage researchers.
Option 3 – Context-specific deployment guidelines
Purpose: Environmental biosensors may be deployed in diverse contexts with different ethical implications. This option proposes context-aware guidelines distinguishing research, environmental monitoring, and regulatory use.
Design: Developed by public health and environmental agencies in collaboration with researchers and adapted to local regulatory frameworks.
Assumptions
- Misuse risk depends strongly on deployment context.
- Local institutions have the capacity to enforce guidelines.

Risks of Failure and “Success”
- Failure: inconsistent enforcement across regions.
- Success risk: delayed deployment in high-need environments.
Scoring Matrix

| Policy Goal | Option 1 | Option 2 | Option 3 |
|---|---|---|---|
| Enhance biosecurity (prevention) | 1 | 2 | 2 |
| Foster lab safety | 1 | 1 | 2 |
| Protect the environment | 2 | 2 | 1 |
| Minimize costs and burdens | 1 | 3 | 2 |
| Feasibility | 1 | 2 | 2 |
| Not impede research | 1 | 2 | 3 |
| Promote constructive applications | 1 | 1 | 2 |
Prioritization and Recommendation
Based on this analysis, the highest priority should be given to Option 1 (cell-free, safe-by-design architecture), complemented by Option 2 (transparent documentation). Together, these strategies embed ethical and governance considerations directly into technical design and research practice, rather than relying solely on downstream regulation.
This combined approach is particularly relevant for academic research institutions and funding agencies, where early design choices strongly influence future applications. While these decisions may introduce additional development effort, they significantly enhance safety, trust, and long-term societal benefit.
Weekly Reflection
A key insight from this week is that biosensing technologies are not ethically neutral, even when developed for public health or environmental protection. Portability and accessibility, while beneficial, can also enable misuse if deployment contexts are not carefully considered.
Engaging with the recitation examples reinforced the importance of situating my project at the detection and prevention end of the biological intervention spectrum. This week shifted my perspective from asking only “can this work?” to also asking “should it work this way, and under what conditions?”, a mindset I intend to maintain throughout the course and into the final project.
Documentation Practice
In alignment with the course emphasis on documentation, I am recording all in-silico design steps, experimental iterations, failed conditions, and troubleshooting decisions. This documentation is intended to support reproducibility, collaborative learning, and ethical transparency, and to make visible the full experimental journey rather than only successful outcomes.
George Church – Homework Question
Question chosen: (AA:AA and NA:NA codes) What code would you suggest for AA:AA interactions?
Why we need a code (and what it can/can’t do)
Protein–protein interactions are not “pairwise letters” like Watson–Crick base pairing. They depend on 3D context (distance, solvent exposure, orientation, dynamics, PTMs, local environment). Still, a useful AA:AA “code” can exist as a coarse-grained interaction alphabet: a compact way to describe which residue pairs are likely to attract/repel or stabilize contacts, similar in spirit to how other biological codes map chemistry into discrete symbols.
So the goal is not a perfect predictor of structure, but a portable interaction language that is:
- symmetric (A–B = B–A),
- composable (many contacts → one interface),
- extendable (can include non-standard amino acids / PTMs),
- and human-usable (a small alphabet rather than a 20×20 table).
Proposed AA:AA interaction code (two-layer)
Layer 1 — Assign each amino acid to an “interaction class”
Define a small set of classes that reflect dominant chemistry:
- H = hydrophobic aliphatic (A, V, L, I, M)
- Ar = aromatic (F, Y, W)
- P = polar uncharged (S, T, N, Q)
- D+ = cationic / H-bond donor-leaning (K, R, H, plus N-termini)
- A− = acidic (D, E, plus C-termini)
- S = sulfur/thiol special (C)
- G = glycine (conformational special)
- Pro = proline (conformational breaker)
Note: H and Ar are separated because π-stacking and cation-π interactions are distinct modes; Cys is treated separately because it can form disulfides and participate in redox/metal interactions.
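The Layer 1 assignment and the symmetry property can be sketched in code. This is a minimal illustration, not a standard nomenclature: the class labels are this write-up's own convention (written in ASCII, e.g. `A-` for the acidic class).

```python
# Layer 1 sketch: map each standard amino acid (one-letter code) to an
# interaction class from the list above (labels are this write-up's convention).
INTERACTION_CLASS = {
    **{aa: "H" for aa in "AVLIM"},   # hydrophobic aliphatic
    **{aa: "Ar" for aa in "FYW"},    # aromatic
    **{aa: "P" for aa in "STNQ"},    # polar uncharged
    **{aa: "D+" for aa in "KRH"},    # cationic / H-bond donor-leaning
    **{aa: "A-" for aa in "DE"},     # acidic
    "C": "S",                        # sulfur/thiol special (disulfides, metals)
    "G": "G",                        # glycine (conformational special)
    "P": "Pro",                      # proline (conformational breaker)
}

def pair_code(a, b):
    """Symmetric class-pair label, so pair_code('K', 'D') == pair_code('D', 'K')."""
    return ":".join(sorted((INTERACTION_CLASS[a], INTERACTION_CLASS[b])))
```

Note that `pair_code('K', 'D')` and `pair_code('E', 'R')` reduce to the same salt-bridge-like class pair, which is exactly the compression the code aims for.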
Layer 2 — Use a compact “interaction operator” between classes
Use a small set of operators that describe the type of contact:
Example: Cys–Cys → S–S (only if oxidation state and geometry allow)
Why this code is useful
Small alphabet, big coverage: compresses 20×20 possibilities into a readable set of “interaction modes.”
Extendable to non-standard amino acids / PTMs: you can add classes/operators for modified residues (e.g., phospho-Ser behaving more A−-like; methyl-Lys tuning D+ strength).
Bridges to protein design: interface reasoning often uses these primitives (hydrophobic core + H-bond networks + salt bridges + cation-π + disulfides).
Known limitations (important)
Context dependence: the same pair can change behavior depending on burial, pH, dielectric, water mediation, and geometry.
Not a folding code: this is an interaction vocabulary, not a full structural specification.
Many-body effects: cooperative networks (packing + H-bond chains) are only approximated by pairwise labels.
Optional refinement (if more precision is needed)
Add an environment tag:
- (B) buried, (E) exposed

Example: D+ ± A− (B) is often stronger than D+ ± A− (E).
AI / Prompt citation
I used ChatGPT to draft and structure this answer. Prompt used: “Given Church’s lecture framing of codes beyond DNA→AA, propose a concise, extensible AA:AA interaction code that captures major interaction types (hydrophobic, salt bridges, H-bonds, cation-π, disulfide).”
Week 2 HW: DNA Read, Write, & Edit
Part 0 — Gel Electrophoresis Basics (Concepts)
This week, I reviewed how gel electrophoresis turns a DNA “mixture” into an interpretable pattern. In an agarose gel, DNA fragments migrate toward the positive electrode because DNA is negatively charged, and smaller fragments travel farther through the gel matrix than larger ones. A DNA ladder provides a size reference so unknown bands can be estimated in base pairs. When a restriction enzyme digest is performed, the DNA sequence is converted into a predictable set of fragment lengths, and those fragments appear as bands at specific positions. Band brightness is roughly related to how much DNA mass is in that fragment (longer fragments can look brighter if molar amounts are similar). Overall, the key idea is that restriction digests plus gels let you “read out” a cutting pattern, validate identity, and compare designs or conditions in a simple visual way.
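The “predictable set of fragment lengths” idea can be made concrete with a short sketch. The plasmid length and cut positions below are hypothetical, chosen only to illustrate the calculation, not taken from a real enzyme map.

```python
def digest_fragments(length, cut_sites, circular=False):
    """Predict fragment sizes (bp) from restriction cut positions.

    cut_sites: 0-based cut positions along the molecule.
    Circular DNA: n cuts give n fragments. Linear DNA: n cuts give n+1.
    """
    sites = sorted(set(cut_sites))
    if not sites:
        return [length]
    if circular:
        # fragments between consecutive cuts, plus the one wrapping the origin
        frags = [b - a for a, b in zip(sites, sites[1:])]
        frags.append(length - sites[-1] + sites[0])
        return frags
    edges = [0] + sites + [length]
    return [b - a for a, b in zip(edges, edges[1:])]

# Hypothetical 5,000 bp linearized plasmid with two cut sites:
print(digest_fragments(5000, [1200, 3100]))  # [1200, 1900, 1900]
```

Each predicted fragment then corresponds to one band position on the gel, which is what makes a digest a visual “identity check”.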
I created a “gel art” pattern inspired by the idea that restriction digests can produce recognizable visual signatures. The design uses symmetry and band density as the main visual elements: enzymes with few cuts generate sparse lanes (lighter), while enzymes with many cuts generate dense lanes (darker).
Lane plan (left → right): Ladder (Life 1 kb Plus), ApaI, EcoRI, HaeIII, EcoRI, ApaI.
HaeIII creates a high-density fragmentation pattern that acts as the “dark center,” while EcoRI and ApaI provide low-cut, high-molecular-weight bands that frame the pattern.
Part 3 — DNA Design Challenge
3.1 Protein choice
I chose sfGFP (superfolder GFP) as the target protein because it is a robust fluorescent reporter widely used to validate expression, folding, and cloning workflows. It provides an easy quantitative readout (fluorescence) and is a standard “sanity check” part in many synthetic biology builds.
3.2 Reverse translation (baseline CDS)
Starting from the sfGFP amino-acid sequence, I generated a DNA coding sequence (CDS) by back-translation using a codon-usage–matching approach (Benchling output). This produces a valid CDS encoding the same protein sequence.
Protein length: 246 aa
DNA CDS length (no stop codon): 738 bp
sfGFP amino-acid sequence (246 aa):

MSKGEELFTGVVPILVELDGDVNGHKFSVRGEGEGDATNGKLTLKFICTTGKLPVPWPTL
VTTLTYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTISFKDDGTYKTRAEVKFEGDTLV
NRIELKGIDFKEDGNILGHKLEYNFNSHNVYITADKQKNGIKANFKIRHNVEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSVLSKDPNEKRDHMVLLEFVTAAGITHGMDELYKGS
HHHHHH


Back-translated / codon-usage–matched CDS (low GC target):
ATGTCAAAAGGTGAGGAATTATTTACCGGAGTAGTACCAATACTGGTAGAATTAGATGGCG
ATGTTAATGGGCATAAGTTTTCAGTGCGTGGAGAAGGAGAAGGCGATGCTACAAATGGAAA
ATTAACGTTAAAATTTATTTGTACTACTGGGAAACTACCTGTACCTTGGCCAACTTTAGTT
ACAACCTTAACATATGGTGTACAATGTTTTTCTCGTTATCCAGATCATATGAAACGTCATG
ATTTTTTTAAAAGTGCGATGCCTGAAGGTTACGTTCAAGAAAGAACTATATCTTTTAAAGAT
GATGGTACATATAAAACACGAGCTGAAGTAAAATTTGAAGGTGATACTTTGGTTAATAGAAT
TGAACTTAAAGGGATTGATTTTAAGGAAGATGGAAATATTCTCGGACACAAATTAGAATACA
ATTTTAATTCACATAATGTTTACATAACAGCTGATAAACAAAAAAATGGCATAAAAGCAAAT
TTTAAAATAAGACATAATGTAGAAGATGGAAGTGTCCAATTAGCAGATCATTATCAGCAAAA
CACACCAATTGGTGATGGTCCTGTCCTTTTACCAGATAATCATTATTTATCAACCCAATCTG
TTTTGTCAAAAGATCCGAATGAAAAAAGAGATCATATGGTTTTATTGGAATTTGTAACAGCA
GCAGGTATTACTCATGGCATGGATGAATTATATAAAGGCTCTCATCATCATCATCATCAT
Codon optimization for E. coli
I then codon-optimized the CDS for Escherichia coli using a “use best codon” strategy. As expected, the amino-acid sequence is unchanged, but the nucleotide sequence changes due to synonymous codon choices that better match E. coli translation preferences.
Nucleotide identity (baseline vs optimized): 76.96%
GC content (baseline, codon-usage–matched): 33.0%
GC content (optimized, best-codon): 50.0%
Rare codons: 11 (baseline) vs 0 (optimized)
Hairpins (reported by the tool): 0 in both
Thymine fraction (reported by the tool): 0.30 (baseline) vs 0.21 (optimized)
ATGAGCAAAGGCGAAGAACTGTTTACCGGCGTGGTGCCGATTCTGGTGGAACTGGATGGCGAT
GTGAACGGCCATAAATTTAGCGTGCGCGGCGAAGGCGAAGGCGATGCGACCAACGGCAAACT
GACCCTGAAATTTATTTGCACCACCGGCAAACTGCCGGTGCCGTGGCCGACCCTGGTGACCA
CCCTGACCTATGGCGTGCAGTGCTTTAGCCGCTATCCGGATCATATGAAACGCCATGATTTT
TTTAAAAGCGCGATGCCGGAAGGCTATGTGCAGGAACGCACCATTAGCTTTAAAGATGATGG
CACCTATAAAACCCGCGCGGAAGTGAAATTTGAAGGCGATACCCTGGTGAACCGCATTGAAC
TGAAAGGCATTGATTTTAAAGAAGATGGCAACATTCTGGGCCATAAACTGGAATATAACTTT
AACAGCCATAACGTGTATATTACCGCGGATAAACAGAAAAACGGCATTAAAGCGAACTTTAA
AATTCGCCATAACGTGGAAGATGGCAGCGTGCAGCTGGCGGATCATTATCAGCAGAACACCC
CGATTGGCGATGGCCCGGTGCTGCTGCCGGATAACCATTATCTGAGCACCCAGAGCGTGCTG
AGCAAAGATCCGAACGAAAAACGCGATCATATGGTGCTGCTGGAATTTGTGACCGCGGCGGGC
ATTACCCATGGCATGGATGAACTGTATAAAGGCAGCCATCATCATCATCATCATCAT
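As a sanity check on the metrics reported above, GC content and nucleotide identity can be recomputed directly. This is a minimal sketch assuming the two CDSs are equal length, which holds here since both encode the same 246-aa protein.

```python
def gc_content(seq):
    """Fraction of G + C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def nucleotide_identity(a, b):
    """Fraction of positions where two equal-length sequences agree."""
    assert len(a) == len(b), "sequences must be the same length"
    return sum(x == y for x, y in zip(a.upper(), b.upper())) / len(a)
```

Running `gc_content` on the baseline and optimized CDS strings should approximately reproduce the 33% and 50% figures reported above, and `nucleotide_identity` the ~77% identity.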
Best way to obtain the DNA
For a ~0.74 kb CDS like sfGFP, the most straightforward approach is gene synthesis (ordering a dsDNA fragment). It is fast, accurate, and does not require an existing template. If a plasmid template is already available, an alternative is PCR amplification + cloning (e.g., restriction cloning or Gibson), but synthesis avoids PCR-introduced mutations and simplifies the workflow.
## Part 4 — DNA Write (Ordering + Construct Design)
### 4.1 Expression cassette design (what I would build)
To express **sfGFP in *E. coli***, I would build a standard bacterial expression cassette:
- **Promoter:** T7 promoter (for high expression in BL21(DE3)-like strains) or a strong constitutive promoter if T7 is not desired
- **RBS:** strong bacterial RBS (e.g., a consensus Shine–Dalgarno / gene10-like RBS)
- **CDS:** sfGFP coding sequence, codon-optimized for *E. coli* (AA sequence unchanged)
- **Tag / stop:** optional **C-terminal 6xHis** tag for purification + **stop codon**
- **Terminator:** strong transcription terminator (e.g., T7 terminator / bacterial terminator)
This design is simple, robust, and makes fluorescence an immediate readout for “does expression work?”.
### 4.2 What I would order (DNA “write” step)
Because the sfGFP CDS is short (~0.7–0.8 kb), the most straightforward approach is **DNA synthesis** (a dsDNA fragment or a cloned gene). Concretely, I would order one of these:
**Option A — Gene fragment (fast + flexible)**
- Order the **sfGFP insert as dsDNA** with flanking overlaps for Gibson/HiFi assembly (or with restriction sites).
- Then clone into an expression plasmid in the lab.
**Option B — Cloned gene in a plasmid (one-step ready)**
- Order **sfGFP already cloned** into a high-copy plasmid backbone.
### 4.3 Twist Bioscience access limitation (Argentina) + workaround plan
From my location (Argentina), the Twist ordering portal is not accessible and prompts me to contact a local operator. In a real order scenario, I would do one of the following:

1) **Contact Twist local sales/support** (as requested) and place the order via email (sequence + vector + cloning format).
2) Use an **alternative synthesis provider** that ships to my region (e.g., ordering a dsDNA fragment from another vendor) and then perform the same assembly into an equivalent plasmid backbone.
For the purposes of this homework, I describe the intended order and construct as if placing a standard synthesis + cloning order.
### 4.4 Vector choice and final construct
If using Twist’s catalog, I would choose a standard **high-copy AmpR plasmid backbone** (e.g., a pTwist Amp high-copy–type vector), and insert the sfGFP expression cassette into it.
Final construct conceptually looks like:
**[T7 promoter] – [RBS] – [sfGFP CDS (E. coli optimized)] – [6xHis] – [STOP] – [Terminator]**
### 4.5 How I would obtain protein from this DNA (high-level workflow)
1) **Assemble** the insert into the plasmid (Gibson/HiFi or restriction cloning).
2) **Transform** into *E. coli* (expression strain if using T7).
3) **Verify** by sequencing (to confirm sfGFP is correct and in-frame).
4) **Express** and measure fluorescence as a fast functional readout.
5) (Optional) **Purify** via His-tag if purification is required.
This approach separates “DNA write” (ordering/synthesis) from “DNA read” (sequencing verification) and “DNA function” (fluorescence output).
## Part 5 — DNA Read / Write / Edit (Dengue focus: Argentina)
### 5.1 DNA Read
**(i) What DNA/RNA would I want to sequence and why?**
I would focus on **genomic surveillance of Dengue virus (DENV) in Argentina**, integrating **clinical** and **environmental** sequencing to support public health decisions in real time.
Concretely, I would sequence:
1) **Clinical DENV genomes (RNA → cDNA)** from a **representative subset** of confirmed cases:
- **Across regions** (e.g., AMBA vs. northern provinces where dengue burden can be higher).
- **Across time** (weekly/biweekly sampling during season peaks).
- **Across epidemiological contexts** (outbreak clusters, travel-associated cases, and sporadic detections).
**Why:**
- To track **serotype dynamics** (DENV-1/2/3/4) and detect shifts that may correlate with outbreak intensity.
- To monitor **lineage introductions** (new clades entering a province) and infer **transmission connectivity** between regions.
- To support **molecular epidemiology**: identify clusters, potential superspreading contexts, and genomic signatures associated with rapid spread (without overclaiming causality).
- To generate local datasets that strengthen **regional capacity** and reduce dependence on external sequencing pipelines.
2) **Environmental DENV surveillance in Aedes aegypti pools** (and optionally wastewater as exploratory):
- **Mosquito pools** (RT-PCR confirmed) from vector surveillance programs: this can provide early hints of circulating serotypes/lineages even before clinical case counts surge.
- **Wastewater** is less standard for DENV than for enteric viruses, but could be explored as a research add-on; vector-based sampling is usually more direct for arboviruses.
**Why:**
- To get **earlier warning signals** and a broader picture of circulation beyond who shows up at clinics.
- To link **vector circulation** with **human cases**, improving outbreak models.
---
**(ii) What sequencing technology would I use and why?**
I would use a **two-tier strategy**:
- **Illumina short-read sequencing (2nd generation)** for routine surveillance:
- High per-base accuracy, scalable multiplexing, strong variant calling.
- Great for producing reliable consensus genomes and phylogenies.
- **Oxford Nanopore sequencing (3rd generation)** for rapid, field-forward situations:
- Faster turnaround when you need same-week answers (e.g., suspected new introduction or unusual outbreak).
- Useful for decentralized labs or mobile workflows, at the cost of higher raw read error (mitigated by coverage + consensus polishing).
This hybrid approach fits a realistic public health workflow: Illumina as the “gold standard backbone”, Nanopore as the “rapid response tool”.
---
**1) Is it first-, second-, or third-generation? How so?**
- **Illumina = second-generation**: massively parallel short reads (sequencing-by-synthesis).
- **Nanopore = third-generation**: single-molecule sequencing, long reads, electrical signal through nanopores.
---
**2) What is the input? How do you prepare your input? Essential steps.**
**Input:** Dengue is an **RNA virus**, so the primary input is **viral RNA** extracted from samples, then converted to **cDNA**.
A practical pipeline:
**Clinical samples (serum/plasma/whole blood, depending on stage):**
1. **Sample + metadata collection** (date, location, Ct value, suspected serotype if known, etc.).
2. **RNA extraction**.
3. **RT step → cDNA**.
4. **Target enrichment strategy** (choose one):
- **Amplicon tiling PCR** (common for viral genomes; efficient and cheap).
- OR **capture-based enrichment** (more flexible but more expensive).
5. **Library preparation**:
- Illumina: adapter ligation + indexes (multiplexing), optional PCR.
- Nanopore: end-repair + adapter ligation, optional barcoding.
6. **Sequencing run**.
7. **Bioinformatics**: QC → mapping → consensus → variants → phylogeny.
**Mosquito pool samples:**
1. **Pool preparation** (Aedes aegypti pools, ideally with RT-qPCR confirmation).
2. **RNA extraction** (often with inhibitors → extra QC).
3. RT → cDNA, then same as above.
**Key practical note:** For DENV, sampling time matters: early infection tends to have higher viremia (better genome recovery). Also, using Ct thresholds to select samples improves success rate.
---
**3) How does it decode the bases (base calling)?**
- **Illumina**: fluorescent signals from nucleotide incorporation per cycle → base calls + quality scores.
- **Nanopore**: ionic current shifts as molecules pass through the pore → signal-to-sequence base calling (model-based), then consensus polishing.
---
**4) What is the output?**
- **FASTQ** reads (with quality scores).
- **BAM/CRAM** alignments to a reference genome.
- **Consensus genome FASTA** per sample.
- **Variant calls (VCF)** (when appropriate).
- **QC reports** (coverage depth, % genome recovered, contamination checks).
- Downstream: **phylogenetic trees** and **lineage/cluster summaries** for epidemiological interpretation.
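One of the QC numbers above, % genome recovered, can be computed straightforwardly from per-position depth (e.g. parsed from `samtools depth` output). The 10x threshold below is an illustrative assumption, not a fixed standard.

```python
def genome_recovery(depths, min_depth=10):
    """Fraction of reference positions covered at >= min_depth reads.

    depths: per-position read depth across the reference genome.
    min_depth: illustrative consensus-calling threshold (assumption).
    """
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)

# Toy example: 8 of 10 positions are at >= 10x depth
depths = [0, 3, 12, 15, 20, 11, 10, 14, 30, 25]
print(genome_recovery(depths))  # 0.8
```

A per-sample recovery cutoff (e.g. requiring most of the genome at adequate depth) is what decides whether a consensus genome is usable for phylogenetics.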
---
### 5.2 DNA Write
**(i) What DNA would I want to synthesize and why? (Dengue-focused)**
I would “write” DNA that enables **faster and more deployable dengue diagnostics** and/or supports local R&D.
Three concrete synthesis targets:
1) **DENV diagnostic standards and controls** (safe, non-infectious):
- Synthetic **gene fragments** (e.g., conserved regions of DENV genome used in RT-qPCR/CRISPR assays).
- **Positive control templates** for assay development and QA/QC.
**Why:** robust controls are crucial for reliable diagnostics, especially across multiple labs and seasons.
2) **CRISPR-based dengue detection components** (research prototype):
- Synthetic DNA templates to generate **RNA targets** (IVT) or **reporter constructs** for assay benchmarking.
- If building cell-free or isothermal detection workflows, you can synthesize the necessary templates without needing infectious material.
**Why:** safer, faster iteration.
3) **Aedes-related biosensor modules** (optional):
- DNA parts for sensor chassis optimization (e.g., expression cassettes for reporters in E. coli cell-free systems).
**Why:** create modular “plug-and-play” parts to accelerate prototyping.
---
**(ii) What technology would I use for DNA synthesis and why?**
- For ~0.3–3 kb fragments: **commercial gene synthesis** (dsDNA fragments or cloned gene in a plasmid).
- For many variants: **oligo pools** (array-based synthesis) + assembly.
**Why:** speed + reliability, avoids PCR errors, and supports rapid iteration (especially when you want multiple versions: different primers, target regions, or assay designs).
---
**1) Essential steps (high-level)**
- Design sequence (include constraints: avoid repeats/extreme GC, include needed cloning sites/overlaps).
- Order as dsDNA fragment (or oligos + assembly).
- If needed: clone into plasmid backbone (Gibson/HiFi or restriction cloning).
- Verify by sequencing (at least Sanger for inserts, or NGS for pools).
- Use as template/control in downstream assays.
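The “design sequence” step above can include an automated constraint screen before ordering. The thresholds below (30–70% overall GC, homopolymer runs of 6+ bases) are illustrative assumptions; real vendor rules vary by provider and fragment length.

```python
import re

def synthesis_flags(seq, gc_low=0.30, gc_high=0.70, max_homopolymer=6):
    """Flag common gene-synthesis constraint violations.

    Thresholds are illustrative, not any specific vendor's rules.
    Flags runs of max_homopolymer or more identical bases.
    """
    seq = seq.upper()
    flags = []
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if not gc_low <= gc <= gc_high:
        flags.append(f"overall GC {gc:.0%} outside {gc_low:.0%}-{gc_high:.0%}")
    pattern = r"(A{%d,}|C{%d,}|G{%d,}|T{%d,})" % ((max_homopolymer,) * 4)
    run = re.search(pattern, seq)
    if run:
        flags.append(f"homopolymer run at position {run.start()}")
    return flags
```

A sequence that returns an empty list passes this coarse screen; anything flagged would be re-designed (synonymous codon swaps) before submitting the order.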
---
**2) Limitations (speed, accuracy, scalability)**
- **Length & complexity**: longer sequences or high repeat content may fail or take longer.
- **Error rate**: increases with length; sometimes error correction or clone screening is needed.
- **Sequence constraints**: extreme GC, hairpins, homopolymers can reduce success.
- **Regulatory/shipping**: international access can be limited; some vendors require regional sales contact.
- **Cost**: scales with length and number of variants.
---
### 5.3 DNA Edit
**(i) What DNA would I want to edit and why? (Dengue context)**
I would focus on edits that are **ethically appropriate, feasible, and beneficial**, avoiding speculative or high-risk human germline scenarios.
Two realistic editing directions:
1) **Editing lab strains (E. coli or cell-free chassis) to improve dengue diagnostic prototyping**
Examples (conceptual):
- Reduce background nuclease activity that can degrade reporters.
- Improve expression stability of reporter proteins or enzymes used in readouts.
**Why:** more robust, reproducible diagnostics and faster prototyping cycles.
2) **Vector biology research (Aedes aegypti) — in controlled research settings**
Examples (high-level):
- Knock-in/knock-out genes to study **vector competence** or immune pathways relevant to arbovirus replication.
**Why:** better understanding of transmission biology can support long-term control strategies (with strong oversight and biosafety/ethics review).
---
**(ii) What technology would I use and why?**
- **CRISPR-Cas9** for knock-outs and knock-ins in model systems.
- **Base editing** for precise point mutations (when you want to avoid double-strand breaks).
- **Prime editing** for flexible small edits (insertions/deletions/substitutions) with less HDR dependence.
Choice depends on the edit:
- Big insertions → Cas9 + HDR (or targeted integration strategies).
- Single base changes → base editor.
- Small flexible edits → prime editor.
---
**1) How does it edit DNA? (conceptual steps)**
- Guide RNA targets a specific locus.
- Editor performs cut or base conversion.
- Cellular repair/processing results in the desired change.
- Screen and validate clones/lines.
---
**2) What preparation is needed and what is the input?**
- Target selection + guide design + off-target risk assessment.
- Editor delivery strategy (plasmid, mRNA, RNP).
- Optional donor template for HDR edits.
- Validation plan:
- PCR across the locus, Sanger/NGS confirmation,
- phenotype/functional assay relevant to the edit,
- off-target screening where appropriate.
---
**3) Limitations (efficiency/precision)**
- **Delivery** limitations (some cell types/organisms are difficult).
- **Off-targets** and unintended edits (varies with editor/guide).
- **HDR efficiency** can be low; requires careful design and screening.
- Need for **strong controls**, replication, and transparent reporting.
Week 3 HW: Lab Automation
## What I built
I created a two-color agar-art pattern (hummingbird) using the Automation Art Interface to generate coordinate lists for red and green dots. I then implemented an Opentrons OT-2 protocol (Python API) that dispenses 1 µL droplets at each (x, y) coordinate on a black agar plate.
Key constraints and design choices
- **Units:** all coordinates are in mm.
- **Safety boundary:** all points are constrained within a 40 mm radius from (0, 0).
- **Droplet volume:** 1 µL per dot (default for black agar plates).
- **Anti-streaking:** used `dispense_and_detach()` motions to reduce streaking artifacts.
- **Contamination control:** used one tip per color (red tip, green tip).
- **Efficiency:** aspirated in chunks (up to 20 µL for the P20) to reduce overhead while avoiding waste.
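The chunked-aspiration bookkeeping can be sketched as a small planner (volumes in µL; the 20 µL cap matches the P20 mentioned above). This is an illustration of the scheduling logic only, separate from the actual Opentrons calls.

```python
def plan_chunks(n_points, dot_volume=1.0, max_aspirate=20.0):
    """Plan aspirate chunk volumes: each chunk covers several 1 uL dots,
    capped at the pipette's capacity, so the final chunk wastes nothing."""
    dots_per_chunk = int(max_aspirate // dot_volume)
    chunks = []
    remaining = n_points
    while remaining > 0:
        n = min(dots_per_chunk, remaining)
        chunks.append(n * dot_volume)
        remaining -= n
    return chunks

# e.g. 47 red dots at 1 uL each with a 20 uL P20:
print(plan_chunks(47))  # [20.0, 20.0, 7.0]
```

In the protocol, each planned chunk corresponds to one aspirate from the color well followed by that many `dispense_and_detach()` calls.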
How I validated
- I ran the provided Colab simulation and confirmed the visualized plate matches the intended design.
- I confirmed the protocol does not raise any “outside radius” errors.
- The simulator screenshot is saved in `assets/simulation.png`.
Files
- `protocol.py` — OT-2 run code (robot-run block)
- `post_lab.md` — mandatory post-lab questions (automation plan + paper summary)
- `weekly_questions.md` — questions + short answers for node presentation
- `ai_disclosure.md` — brief disclosure of AI assistance (if applicable)
```python
# Pass a color name, e.g. 'Red', and get back a Location that can be passed to aspirate()
def location_of_color(color_string):
    for well, color in well_colors.items():
        if color.lower() == color_string.lower():
            return color_plate[well]
    raise ValueError(f"No well found with color {color_string}")
```
For this lab, instead of calling `pipette.dispense(1, loc)`, use `dispense_and_detach(pipette, 1, loc)`:

```python
def dispense_and_detach(pipette, volume, location):
    """
    Move laterally 5 mm above the plate (to avoid smearing a drop); then drop down
    to the plate, dispense, move back up 5 mm to detach the drop, and stay high,
    ready for the next lateral move.
    """
    assert isinstance(volume, (int, float))
    # Location.move() applies a relative offset, so z=5 means "5 mm above";
    # adding location.point.z again would double-count the height.
    above_location = location.move(types.Point(z=5))
    pipette.move_to(above_location)     # go to 5 mm above the dispensing location
    pipette.dispense(volume, location)  # go straight down and dispense
    pipette.move_to(above_location)     # go straight up to detach the drop and stay high
```
```python
# YOUR CODE HERE to create your design
# -- Coordinates copied from the Automation Art Interface (units: mm) --
# Use ONLY these two lists to comply with "red + green only"

def assert_within_radius(points, max_r=40.0):
    for (x, y) in points:
        r = (x**2 + y**2) ** 0.5
        if r > max_r:
            raise ValueError(f"Point outside allowed radius: (x={x}, y={y}) has r={r:.2f} mm > {max_r} mm")
```
Q1) How would you use automation tools for your final project?
I plan to use automation (Opentrons OT-2 and/or cloud lab workflows) to accelerate the design-build-test-learn (DBTL) loop for a rapid biosensing platform aligned with my research interests (aptamers + CRISPR-based detection).
What I would automate:
High-throughput reaction setup (96-well): systematic screening of buffer composition (Mg2+, salt, pH), reporter concentration, enzyme concentrations (Cas12/Cas13), and incubation time/temperature.
Controls and calibration: automated no-target controls, positive controls, and dilution series to estimate LOD/LOQ and dynamic range.
Matrix robustness: testing sensor performance in different sample matrices (buffer vs. complex matrices) and common interferents.
Data capture and analysis: standardized plate-reader workflows + automated parsing/plotting scripts to compare conditions and select top-performing protocols.
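The dilution-series analysis mentioned above could be scripted directly. A minimal sketch using the common "mean blank + 3·SD" convention for the detection threshold (function name and readings are mine, not from any specific plate-reader API):

```python
import statistics

def limit_of_detection(blank_signals, k=3):
    """Estimate the detection threshold as mean(blank) + k * SD(blank).

    A calibration sample whose signal exceeds this threshold is considered
    distinguishable from the no-target control.
    """
    mean = statistics.mean(blank_signals)
    sd = statistics.stdev(blank_signals)  # sample standard deviation
    return mean + k * sd

blanks = [100, 104, 98, 102, 96]  # hypothetical no-target fluorescence readings
threshold = limit_of_detection(blanks)
```

Converting the threshold back to a concentration then requires the fitted calibration curve for the dilution series, which is assay-specific.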
Why automation matters:
It reduces pipetting variability, improves reproducibility, and enables exploration of larger experimental design spaces with fewer manual errors.
It makes protocols traceable and shareable as code (protocol + metadata), which supports reproducible science and scalability.
Success criteria:
Faster iteration (more conditions tested per unit time) compared to manual setup.
Improved reproducibility across replicates and across days.
Identification of robust assay conditions that preserve sensitivity under realistic sample conditions.
Q2) Summarize one published paper that uses Opentrons / lab automation
Paper
Title: Slowpoke: An Automated Golden Gate Cloning Workflow for Opentrons OT-2 and Flex
This paper introduces Slowpoke, an open-source, user-friendly automation workflow for Golden Gate-based cloning on the Opentrons OT-2 and Opentrons Flex. The motivation is that manual DNA assembly and downstream steps (transformation, plating, screening) become labor-intensive and error-prone at scale, and accessible automation can improve standardization and throughput while reducing hands-on time.
Overview
Slowpoke automates major steps of the DNA assembly pipeline, including cloning, E. coli transformation, plating, and colony PCR, with user intervention primarily for colony picking and plate transfers. The authors also provide a free GUI (Streamlit app) to generate robot protocols through simple file uploads, lowering the barrier for users who do not want to write code manually. The full suite (code and templates) is made available as open source.
Key findings
The workflow is validated using two Golden Gate toolkits: MoClo Yeast Toolkit (YTK) and SubtiToolKit (STK). Reported assembly outcomes include 17/17 positive colonies with YTK on OT-2, 11/12 on Flex, and 8/13 with STK on OT-2. For higher-throughput combinatorial assemblies on Flex (six-part assemblies), 55 out of 57 combinations resulted in correct constructs. Overall, the results support that affordable automation platforms can achieve robust cloning performance while improving reproducibility and scalability.
### Figures (1–2 maximum)
Figures included in my submission:
A workflow schematic figure showing the end-to-end automated pipeline (assembly → transformation → plating → colony PCR).
A results figure/table showing assembly success rates or validation outcomes across toolkits/platforms (including the high-throughput 55/57 result).
Week 3 — Questions Developed (Opentrons Artwork)
1) What are the core constraints for OT-2 agar art?
All coordinates are in millimeters, points must remain within a 40 mm radius from the center, and 1 µL drops are a safe default on black agar plates.
2) Why does spacing matter (e.g., 2.5 mm vs 3.5 or 5 mm)?
Smaller spacing increases resolution but increases the chance droplets merge; larger spacing reduces merging risk but lowers image detail.
3) What causes streaking and how do you prevent it?
If the tip moves laterally immediately after dispensing, it can drag liquid and create streaks. Using a dispense-and-detach motion (up/down) helps detach the droplet and reduces streaking.
4) Why use one tip per color?
Using one tip per color prevents cross-contamination of color wells and keeps fluorescence signals cleanly separated.
5) How do you minimize wasted reagents and time?
Aspirate in chunks (up to 20 µL for a P20) and only aspirate what you will dispense, while keeping tip usage minimal without cross-contaminating color wells.
6) What depends on TA calibration and why?
The agar plate labware calibration determines the true plate center location. If calibration is off, the entire pattern can shift and potentially hit the plate wall.
7) How did you validate your protocol before submission?
I ran the Colab simulator, confirmed the visualization matches the intended design, confirmed no “outside radius” errors, and ensured the protocol uses two tips (one per color).
8) What are the main failure modes to watch for?
Points outside radius, dot merging due to tight spacing, streaking due to motion, and permission issues (Colab link not shared as viewer).
Week 4 HW: Protein Design Part I
Part A — Conceptual Questions (9/11)
Selection note: The assignment allows answering 9 out of 11 questions. I focused on questions most directly connected to protein design: size/constraints, chirality and secondary structure, and why β-structures tend to aggregate.
Q1) How many amino acids are in a typical protein? How large is it?
It depends on the organism and the protein family, but a practical rule of thumb is:
Typical bacterial proteins: ~250–350 aa
Typical eukaryotic proteins: ~350–600 aa (more domains and regulation)
Real range: from microproteins <50 aa to very large proteins like titin (~30,000+ aa).
In terms of mass:
A rough average is ~110 Da per amino acid.
Therefore, a 300 aa protein is ~33 kDa (300 × 110 Da).
Key point: “typical size” is not a rule; it reflects tradeoffs among function, biosynthetic cost, folding constraints, and domain modularity.
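The mass rule of thumb is easy to encode. Since 110 Da per residue is only an average, this gives an estimate, not an exact mass:

```python
def approx_mass_kda(n_residues, avg_residue_mass=110.0):
    """Estimate protein mass in kDa from residue count (~110 Da per aa)."""
    return n_residues * avg_residue_mass / 1000.0

print(approx_mass_kda(300))  # ~33 kDa for a 300 aa protein
```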
Q2) Why can’t humans eat grass and become like cows? (i.e., why can’t we digest cellulose?)
Humans lack cellulases, the enzymes needed to hydrolyze the β(1→4) glycosidic bonds of cellulose.
We can digest starch (α(1→4) and α(1→6)) using amylases.
Cellulose is still glucose-based, but the bond stereochemistry changes polymer geometry and packing: it becomes crystalline and rigid, and our enzymes do not recognize/attack it effectively.
Cows are not “magical” either:
They rely on a rumen microbiome (bacteria/protozoa/fungi) that produces cellulases.
In practice, the cow hosts an internal bioreactor and absorbs the breakdown/fermentation products.
Q3) Why are there 20 amino acids (and not 10 or 50)?
The canonical set of 20 amino acids likely represents an evolutionary “sweet spot” balancing:
Sufficient chemical diversity
charged (+/−), polar, hydrophobic, aromatic, nucleophilic, sulfur-containing side chains, etc.
enough to build catalysis, recognition, and stable structures.
Translation cost and fidelity
more amino acids ⇒ more tRNAs, aminoacyl-tRNA synthetases, quality control
higher energetic cost and potentially higher error burden.
Genetic code robustness
the code is redundant; point mutations often yield chemically similar substitutions
supports robustness while still offering broad functional expressivity.
Also, biology already extends beyond 20 through:
selenocysteine (Sec, U) and pyrrolysine (Pyl, O), and
post-translational modifications (phosphorylation, glycosylation, etc.) that expand functional chemistry without rewriting the entire code.
Q4) What advantages would proteins with non-natural amino acids have?
Potential advantages include:
New chemistry: functional groups not available in the canonical 20 (azides, alkynes, photoreactive groups, bioorthogonal handles).
Greater stability: increased resistance to proteases, oxidation, or unfolding (context dependent).
External control: photoactivatable or chemically switchable residues.
Enhanced catalysis: introduce designed nucleophiles or metal-binding functionalities.
Main limitation: the cellular “stack” must support it (e.g., genetic code expansion with orthogonal tRNA/synthetase systems, and ribosomal compatibility).
Q5) Could amino acids form under prebiotic conditions? How?
Yes—there is classic experimental evidence:
Miller–Urey-type chemistry produces simple amino acids (e.g., glycine, alanine) from small molecules plus energy inputs (e.g., electrical discharge).
Plausible additional routes include meteoritic synthesis (amino acids detected in meteorites) and chemistry on mineral surfaces.
However, amino acids alone do not imply functional proteins. Key barriers include:
Polymerization: long peptide formation in water is thermodynamically challenging.
Functional folding: protein function requires information-rich sequences, not random polymers.
Q6) Can an α-helix form with D-amino acids?
Yes. The α-helix exists as a geometry; what changes is handedness.
With L-amino acids, α-helices are typically right-handed.
With D-amino acids, the corresponding helix tends to be left-handed.
Design relevance: D-peptides can preserve stable secondary structure while being highly protease-resistant, since most proteases are adapted to L-amino acid substrates.
Q8) Why are most α-helices in proteins right-handed?
Because proteins are made of L-amino acids, and for L-backbones the right-handed α-helix is energetically favored (reduced steric clashes in backbone and side-chain packing).
Left-handed helices can occur but are typically short, rare, and associated with specific constraints rather than being the default.
Q9) Why do β-sheets tend to aggregate?
β-structures are “sticky” because β-strands expose backbone hydrogen-bond donors/acceptors in a geometry that can pair with other β-strands.
If a β-prone region becomes exposed or partially unfolded, it can nucleate intermolecular β-pairing, leading to aggregation.
Additional contributors:
β-prone sequences are often hydrophobic or have low net charge, enabling stacking.
Aggregation is thermodynamically favorable because it satisfies backbone H-bonds and buries hydrophobic surface area.
Q10) Why do amyloids form so easily?
Amyloids (cross-β architecture) form readily because this state is an accessible energetic minimum for many sequences:
Stabilization comes from extensive backbone hydrogen-bond networks, not requiring very specific side-chain chemistry.
Once a nucleus forms, growth proceeds by templating: monomers add like bricks.
In energy landscape terms, native states can be kinetically stable, but stress, mutations, high concentration, or impaired proteostasis can redirect proteins into this alternative “valley.” This is why cells invest heavily in chaperones and quality-control pathways.
(Optional) Reflection — Why this matters for protein design
Many design failures come from confusing folding with function, especially for membrane-active or oligomeric systems.
β-aggregation highlights the need for negative design (avoid exposed β-edges and aggregation-prone motifs).
Language-model scoring can help rank mutations, but it may penalize sequences that are intentionally unusual (e.g., toxic or membrane-disruptive proteins).
Part B — Protein Analysis & Visualization (Cas12a)
Protein selected
Protein: Lb2Cas12a (Cas12a from Lachnospiraceae bacterium MA2020)
PDB ID: 8I54
Complex: Cas12a–crRNA–DNA (ternary complex)
Method / resolution: Cryo-EM, 3.95 Å
Chains: Protein A (1206 aa), RNA B (33-mer), DNA C (25-mer) and D (9-mer)
Why I chose it: Cas12a is a programmable CRISPR nuclease used in genome editing and diagnostics. This structure includes both guide RNA and target DNA, which makes it ideal to visualize the binding channel (“pocket”), the protein–nucleic acid interface, and design constraints for activity.
PyMOL visualizations
Figure 1 — Global view (cartoon + nucleic acids). Cas12a is shown in cartoon representation and the RNA/DNA strands are shown as sticks. The nucleic acids sit inside a prominent groove formed by the protein, highlighting that substrate positioning is a primary structural constraint for function.
Figure 2 — Surface representation reveals the binding channel (“pocket”). A semi-transparent surface view emphasizes a continuous channel accommodating the RNA–DNA duplex. This channel is the most obvious pocket-like feature in this complex and suggests that mutations lining the groove can strongly affect binding and activity.
Figure 3 — Alternative surface/channel view (second angle). A second viewpoint helps confirm that the nucleic acids traverse a well-defined channel rather than binding to a flat surface, reinforcing the interpretation of a structured binding path.
Figure 4 — Interface residues within ~4 Å of RNA/DNA. Residues located within ~4 Å of nucleic acids highlight the likely functional interface. This provides a rational set of positions expected to be more constrained in mutational scans (interface mutations can disrupt function even if the global fold remains stable).
Figure 5 — Qualitative “electrostatics-like” surface coloring (charged patches). A qualitative mapping of charged residues on the surface shows patches consistent with nucleic-acid binding, supporting the idea that electrostatics contributes to substrate recruitment and stabilization in the binding groove.
Figure 6 — Charged patches + channel view (combined). This combined view links charge distribution with geometry: charged surface regions are positioned near the nucleic-acid channel, consistent with a binding-and-positioning role.
Figure 7 — Secondary structure emphasis (helices). Cas12a is strongly helix-rich, consistent with many large nucleic-acid binding proteins that use extended helical scaffolds to shape binding channels and mediate conformational changes upon substrate binding.
Figure 8 — Coarse lobe/domain segmentation (REC vs NUC). A coarse two-color segmentation illustrates Cas12a’s modular architecture: a recognition lobe (REC-like region) and a nuclease lobe (NUC-like region) together shape the binding channel and position substrates for cleavage.
Key structural takeaways (summary)
The RNA–DNA duplex runs through a clear binding channel, which can be treated as the main “pocket” in the complex.
The ~4 Å interface highlights the most likely constrained region for function and provides candidate sites for mutational sensitivity (Part C).
Surface charge patches near the groove suggest electrostatics is important for nucleic-acid binding, emphasizing that function depends on local chemistry, not only global folding.
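The ~4 Å interface criterion from Figure 4 reduces to a simple distance cutoff. In practice I used PyMOL selections on the 8I54 coordinates; the toy sketch below (synthetic coordinates, hypothetical function name) only illustrates the underlying test:

```python
import math

def residues_near_ligand(protein_atoms, ligand_atoms, cutoff=4.0):
    """Return residue IDs with any atom within `cutoff` Å of any ligand atom.

    protein_atoms: list of (residue_id, (x, y, z)); ligand_atoms: list of (x, y, z).
    """
    hits = set()
    for res_id, (x, y, z) in protein_atoms:
        for (lx, ly, lz) in ligand_atoms:
            if math.dist((x, y, z), (lx, ly, lz)) <= cutoff:
                hits.add(res_id)
                break  # one contact is enough to flag the residue
    return sorted(hits)

# Toy example: residue 12 sits 3 Å from the nucleic acid; residue 99 sits 10 Å away.
protein = [(12, (0.0, 0.0, 3.0)), (99, (0.0, 0.0, 10.0))]
rna = [(0.0, 0.0, 0.0)]
```

A real analysis would iterate over all protein atoms of chain A against all RNA/DNA atoms of chains B/C/D.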
Part C — ML-Based Protein Design Tools
To keep runtime practical, I analyzed a subsequence of Cas12a from the 8I54 structure (chain A, residues 450–800; 351 aa).
C1 — ESM2: in silico mutational scan
I performed an in silico deep mutational scan (DMS-like) using ESM2 by masking each position and scoring all 20 substitutions (Δ log-prob = mutant − WT). More negative values indicate substitutions that are less compatible with the sequence context (more constrained positions), whereas values closer to zero indicate more tolerated substitutions.
Interpretation: The tolerance map shows heterogeneous constraint across the fragment, consistent with a folded scaffold containing both structurally constrained positions and more permissive regions. This provides a rational way to choose mutation sites (avoid strongly constrained positions; target tolerant ones) before structural screening.
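The Δ log-prob scoring itself is just a subtraction over a per-position log-probability matrix. A mock sketch follows; in the real scan the matrix comes from ESM2's masked-token logits, whereas here it is hand-made for illustration:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def delta_logprob(logprobs, wt_seq, position, mutant_aa):
    """Score a substitution as log P(mutant) - log P(wild type) at one position.

    logprobs[i] maps each amino acid to the model's log-probability of seeing
    it at masked position i. More negative deltas = less tolerated mutations.
    """
    wt_aa = wt_seq[position]
    return logprobs[position][mutant_aa] - logprobs[position][wt_aa]

# Mock log-probabilities for a 2-residue sequence "KL"
mock = [
    {"K": -0.5, "R": -0.9, "D": -6.0},  # position 0: R tolerated, D strongly disfavored
    {"L": -0.4, "I": -0.8, "P": -5.0},  # position 1
]
```

Running this over every position × 20 substitutions yields the tolerance map described above.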
C2 — ESMFold: folding filter (WT vs mutants)
I folded the WT fragment and two mutants with ESMFold: a conservative substitution (K518R) and a disruptive substitution (L706D). The goal is to use folding prediction as a rapid viability filter: keep variants that preserve the fold, and flag variants that reduce confidence or destabilize structure.
Structures
K518R (conservative):
L706D (disruptive):
Confidence / error diagnostics
Interpretation: Both variants produce a plausible global fold, but confidence metrics are generally low-to-moderate (pLDDT values mostly ~20–50) and the PAE matrix is broadly high off the diagonal, indicating uncertainty in the relative positioning of many regions. This is consistent with either (i) a fragment that is partially flexible outside its native context, or (ii) limited confidence for this isolated subsequence. Importantly, these results illustrate that ESMFold can screen gross misfolding, but folding confidence does not guarantee biological function.
C3 — ProteinMPNN (inverse folding)
Using the WT fragment backbone (Cas12a 8I54 chain A residues 450–800; 351 aa), I ran ProteinMPNN to generate 10 alternative sequences compatible with the same backbone (T=0.2). The designed sequences show low sequence recovery (~0.15–0.18), indicating substantial sequence diversity under a fixed-backbone constraint.
```text
>MPNN_T0.2_sample1_seq_recovery0.1652
IKIKNVDGKPIPPGLIVIVPDPRVLKLLDKLKLLKELIEKLLKGVPPTPVPLPPLLTPELLLLLLKPDDLYRELKILLKKDGKWYLLTIDVSKFPELKDLPLKKDPELLKDIPYPLKEIKPEEIPEYLLKNIPLDLSLPLLPLYQAIKAGKIPKGLVPTLADVLAFLALLALLLGALGLPLLLGAILRPDPTPLDLLLLALLLRALGLKIKPLPLSPALLELLKKLGLLLPLLPLLEELKKLKGLLPPRELLELLLQLSPELQESLLLILPKEGPLFLLPPPLTPDDILLPDPSVPLLPPDPSSLERPRLPSLLLPLLEDPDLDPDDPELSIPLDLDPTPEEIKELEEKLK
>MPNN_T0.2_sample2_seq_recovery0.1624
LEIRDVNGKPIPPGVILLVPDPLLALLLAALPLLLLLLLLAALGVPLPPIPLPLLLTPEVLGLLLLPLAPDVELKIILKENGKYYLLTLDLSKLPELLLPPPLPLPELLKDIPYEKILIPPSAIPLVLGVGLPIDLSDPLDPLYKLLKEGKIPPGLLPTPLLLKLYKERRKKRLEEKKELKKFGIVLKKNPTPEDILKALELLKKLGLKLVPRPLPLEELEELRKKNKVPPLIPLLEELLELLGLRPPLELLRLLLLLDPDRPADLVLVLLLGLPLPLLPPPVTPGLPLLPPPSLPPLSPLPELLALPLPLAPIVPLLKLPLLPPDVPLLLLPLLLLPTPEELLKLLREIL
>MPNN_T0.2_sample3_seq_recovery0.1766
PVIRDVNGRPIPPGLLVIFPVPLLLKLLKLLPLLLGLVKALREGIPPLPLPIPPLLSPLLLGGLLTPLLPLFELEIILKKDGKYYLATLDLSALPAILDPPPLDDPELLKDIPWTLTPIPPEDIPYVLSRFIPIDWSDPRSPLYKALKAGEIPKGKIPSKEDILKYLKSLLKLLLESDDLSELGIVLTPNPTLADLLALLGLLRSLGIEIRLLPLLPLVLLLLKLLNAVPPLLPLLVDLSSLAGLLPPLLVLLLLLLLSPEAPEAVILNLKDRGPLPPLPPPLTPDAPDLPPPLPPPPLPDPSLLQLPVIPLPLLLLLPLPLLPPLEPVLLLPLELLPTPEELAQLEALLK
```
Interpretation: ProteinMPNN proposes highly diverged sequences (low recovery) that are still compatible with the fixed backbone, suggesting this fragment is “designable” under the model assumptions. Notably, many designs are enriched in low-complexity / helix-favoring residues (e.g., Leu/Pro), which may reflect limitations of designing on an isolated fragment with uncertain confidence and without the full Cas12a context.
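The sequence-recovery numbers (~0.15–0.18) are simply the fraction of positions where a designed sequence matches the wild type; a minimal sketch:

```python
def sequence_recovery(wildtype, designed):
    """Fraction of aligned positions where the designed residue equals wild type."""
    if len(wildtype) != len(designed):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(wildtype, designed))
    return matches / len(wildtype)

# Toy example: 4 of 6 positions match
recovery = sequence_recovery("MKTLLV", "MATLIV")
```

Applied to each ProteinMPNN output against the 351 aa Cas12a fragment, this reproduces the per-design recovery values reported in the FASTA headers.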
Bacteriophage Engineering Proposal: L Protein Stabilization
Primary Goal: Increased stability (easiest).
Specific Approach: Engineering DnaJ-independence by reducing chaperone-recognition signals while preserving the structural scaffold of the L protein.
1. Computational Tools and Pipeline Justification To achieve this goal, we propose a three-step computationally efficient pipeline:
Step 1: Sequence-level Mutational Scanning using ESM2
Approach: We will perform a zero-shot in silico mutational scan across the L protein sequence using the ESM2 Protein Language Model (PLM). We aim to identify exposed hydrophobic patches (typical DnaJ recognition motifs) and propose polar/hydrophilic substitutions.
Why this helps: ESM2 has learned deep evolutionary constraints across millions of protein sequences. It allows us to rapidly differentiate between highly constrained residues (which are structurally vital and "untouchable") and mutation-tolerant positions. This ensures we only disrupt chaperone-binding motifs without breaking the core evolutionary scaffold of the protein, all at a fraction of the computational cost of molecular dynamics.
Step 2: Rapid Structural Filtering using ESMFold
Approach: The top candidate sequences from the ESM2 scan will be predicted using ESMFold. We will filter out any variants that collapse, show low pLDDT (confidence) scores, or have a high RMSD compared to the Wild-Type (WT) backbone.
Why this helps: While ESM2 evaluates sequence-level fitness, we need explicit 3D structural validation. ESMFold is significantly faster than AlphaFold2, making it ideal for high-throughput filtering. This step ensures that our hydrophilic mutations do not inadvertently destroy the L protein's ability to fold independently.
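The filtering logic of this step can be expressed as a simple threshold pass. The thresholds and variant data below are illustrative assumptions, not values from the proposal:

```python
def passes_fold_filter(mean_plddt, rmsd_to_wt, min_plddt=70.0, max_rmsd=2.0):
    """Keep a variant only if predicted confidence is high enough and the
    predicted backbone stays close to the wild-type structure."""
    return mean_plddt >= min_plddt and rmsd_to_wt <= max_rmsd

# Hypothetical (mean pLDDT, RMSD in Å) pairs for two candidate variants
candidates = {"variant_1": (82.0, 0.8), "variant_2": (55.0, 3.1)}
kept = [name for name, (p, r) in candidates.items() if passes_fold_filter(p, r)]
```

In the actual pipeline, the pLDDT and RMSD values would come from ESMFold outputs and a structural alignment against the WT backbone.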
Step 3: Complex Modeling using Boltz-1
Approach: We will model the L protein + DnaJ complex for both the WT and our top folded mutant candidates. We will analyze the predicted interface contacts and Predicted Aligned Error (PAE) to assess binding affinity.
Why this helps: Folding correctly in isolation is not enough; we must explicitly prove reduced chaperone dependency. By comparing the mutant-DnaJ interface against the WT-DnaJ interface, we can prioritize variants that maintain a stable fold but show a significantly weakened or abolished interaction with the DnaJ chaperone.
2. Potential Pitfalls
Pitfall 1: Overlapping Reading Frames and Genomic Constraints. Phage genomes are highly compact, meaning the DNA sequence encoding the L protein might also encode parts of other proteins or regulatory elements in alternative reading frames. Our targeted mutations could have unintended, fatal consequences for the phage's overall viability. While genomic foundation models like Evo could assess these genome-wide constraints, their computational cost is prohibitive for our current scope.
Pitfall 2: The Stability vs. Function Trade-off. ESMFold guarantees that the protein adopts a stable 3D conformation in solution, but it does not guarantee biological function (membrane lysis). Lytic activity heavily depends on complex factors like membrane insertion dynamics, oligomerization, and reaction kinetics. Furthermore, completely abolishing chaperone interaction might inadvertently prevent the L protein from being properly delivered to its target membrane.

Week 5 HW: Protein Design Part II
Part 1: Generate Binders with PepMLM
For this part, I first retrieved the human SOD1 sequence from UniProt (P00441) and then introduced the A4V mutation, which is a well-known ALS-associated substitution in superoxide dismutase 1. The canonical human SOD1 sequence is:
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
To generate the mutant form, I introduced the A4V substitution, yielding the following sequence:
MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
I then used the PepMLM Colab notebook linked from the HuggingFace model card to generate peptide binders conditioned on this mutant SOD1 sequence.
Note on peptide length
The assignment requested four peptides of length 12 amino acids. However, after repeatedly adjusting the peptide length setting in the public PepMLM notebook, the model consistently returned 15-mer peptides. Because I wanted to preserve the actual model output rather than manually trimming the sequences and introducing an artificial modification, I proceeded using the peptides exactly as generated by the notebook.
PepMLM-generated binders
The model returned the following four candidate binders:
| Binder | Sequence | Length | Pseudo-perplexity |
|---|---|---|---|
| P1 | SHWPVYVVRKAWRAX | 15 | 17.62794512 |
| P2 | ARVPELTARVELKKX | 15 | 16.37907539 |
| P3 | SRWGVYVGRVEWRRA | 15 | 16.19368433 |
| P4 | WRVGPVAAVYEWAKK | 15 | 11.62216745 |
For comparison, I also added the known SOD1-binding peptide provided in the assignment:
| Binder | Sequence | Length |
|---|---|---|
| Known binder | FLYRWLPSRRGG | 12 |
Interpretation of PepMLM output
To evaluate the PepMLM outputs, I used the reported pseudo perplexity values as a measure of the model’s internal confidence. Lower pseudo perplexity indicates that the peptide is more plausible according to the model in the context of the target sequence.
Based on this metric, P4 (WRVGPVAAVYEWAKK) was the strongest PepMLM candidate, with the lowest pseudo perplexity value (11.62216745). The next best fully specified peptide was P3 (SRWGVYVGRVEWRRA) with a pseudo perplexity of 16.19368433.
Two peptides, P1 and P2, contained an X residue, which indicates an ambiguous or unresolved amino acid identity. Because of that ambiguity, those two sequences are less reliable for downstream structural interpretation and comparison. For that reason, I prioritized P3 and P4 for the AlphaFold3 analysis.
Overall, this step produced a small set of candidate binders ranked by PepMLM confidence, with P4 emerging as the most promising candidate according to the model and P3 as the next most interpretable option.
Part 2: Evaluate Binders with AlphaFold3
To assess whether the generated peptides formed plausible structural complexes with mutant SOD1, I used the AlphaFold Server to model protein-peptide complexes. For each run, I submitted the A4V SOD1 sequence as one chain and the peptide sequence as a separate second chain. I then examined both the ipTM score and the predicted position of the peptide on the SOD1 structure.
Because P1 and P2 contained ambiguous residues (X), I focused the structural analysis on the two fully specified PepMLM-generated peptides, P3 and P4, and compared them against the known binder.
AlphaFold3 results
| Binder | Sequence | ipTM | Putative binding site | Notes |
|---|---|---|---|---|
| P3 | SRWGVYVGRVEWRRA | 0.37 | Surface of the β-barrel region | Surface-bound and elongated; not clearly localized near the N-terminal A4V region |
| P4 | WRVGPVAAVYEWAKK | 0.36 | Lateral surface of the β-barrel region | Surface-bound, no clear burial, and not strongly focused near the A4V site |
| Known binder | FLYRWLPSRRGG | 0.37 | External surface of the β-barrel region | Surface-bound and extended; does not appear deeply buried or strongly concentrated at the N-terminus |
Structural interpretation
The AlphaFold3 predictions gave very similar ipTM values for all three tested complexes. Peptide P3 and the known binder both produced an ipTM of 0.37, while P4 gave a slightly lower ipTM of 0.36. This indicates that none of the complexes stood out as having a dramatically stronger or more confident interface than the others.
When I visually inspected the predicted structures, all three peptides appeared to be mostly surface-bound rather than deeply buried into a defined pocket or groove. In each case, the peptide stretched across exposed regions of the SOD1 surface, particularly along areas consistent with the β-barrel exterior. The binding did not appear highly compact or tightly enclosed, which suggests relatively modest interface definition.
A key point from the assignment was to evaluate whether the peptides localized near the N-terminus, where the A4V mutation is located. In these models, none of the peptides showed a strong preference for that region. Instead, the peptides appeared to contact broader exposed surfaces of the protein, rather than specifically clustering around the mutant N-terminal site. Likewise, none of the models clearly suggested a deeply buried interaction or a highly specific approach to the dimer interface.
Comparison to the known binder
The known binder FLYRWLPSRRGG did not clearly outperform the PepMLM-generated peptides in this AlphaFold3 analysis. In fact, P3 matched the known binder exactly in ipTM (0.37), while P4 was only slightly lower at 0.36. This means that at least one PepMLM-generated peptide reached the same structural confidence score as the reference peptide.
However, the visual models also suggest that these interactions are likely modest and mostly surface-associated, rather than strong, sharply localized interfaces. So while P3 matched the known binder numerically, none of the tested peptides showed an obviously superior structural pose or a clear binding mode centered on the A4V mutation itself.
## Part 3: Evaluate Properties of Generated Peptides in PeptiVerse
Structural confidence alone is not sufficient for therapeutic development, so I next evaluated the PepMLM-generated peptides using **PeptiVerse**. For each peptide, I entered the peptide sequence as the binder and the **A4V mutant SOD1 sequence** as the target. I then collected the following predicted properties:
- binding affinity
- solubility
- hemolysis probability
- net charge at pH 7
- molecular weight
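The net charge at pH 7 can be approximated with the Henderson-Hasselbalch equation over the peptide's ionizable groups. The pKa values below are one common textbook set; PeptiVerse presumably uses its own set, so the numbers will not match its output exactly:

```python
PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "D": 3.9, "E": 4.1,
       "C": 8.3, "Y": 10.5, "N_term": 9.0, "C_term": 3.6}

def net_charge(sequence, ph=7.0):
    """Approximate peptide net charge at a given pH (Henderson-Hasselbalch)."""
    pos = lambda pka: 1.0 / (1.0 + 10 ** (ph - pka))    # basic groups
    neg = lambda pka: -1.0 / (1.0 + 10 ** (pka - ph))   # acidic groups
    charge = pos(PKA["N_term"]) + neg(PKA["C_term"])    # free termini
    for aa in sequence:
        if aa in "KRH":
            charge += pos(PKA[aa])
        elif aa in "DECY":
            charge += neg(PKA[aa])
    return charge
```

With this pKa set, P3 (SRWGVYVGRVEWRRA, with four arginines and one glutamate) comes out around +3, in the same direction as, but not identical to, the PeptiVerse value of 2.46.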
The mutant SOD1 sequence used as the target was:
```text
MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
```
PeptiVerse results
| Binder | Sequence | AlphaFold3 ipTM | Predicted binding affinity | Solubility | Hemolysis probability | Net charge (pH 7) | Molecular weight (Da) | Overall assessment |
|---|---|---|---|---|---|---|---|---|
| P1 | SHWPVYVVRKAWRAX | Not prioritized | Weak binding (6.692) | Soluble (1.000) | Non-hemolytic (0.039) | 2.55 | 1777.1 | Good developability profile, but contains ambiguous residue X |
| P2 | ARVPELTARVELKKX | Not prioritized | Weak binding (5.529) | Soluble (1.000) | Non-hemolytic (0.022) | 1.80 | 1692.0 | Lowest hemolysis risk, but weakest affinity and contains ambiguous residue X |
| P3 | SRWGVYVGRVEWRRA | 0.37 | Weak binding (6.964) | Soluble (1.000) | Non-hemolytic (0.092) | 2.46 | 1877.1 | Best affinity among the tested peptides and best structural support among resolved sequences |
| P4 | WRVGPVAAVYEWAKK | 0.36 | Weak binding (5.856) | Soluble (1.000) | Non-hemolytic (0.032) | 1.76 | 1760.0 | Clean sequence and favorable safety/solubility profile, but weaker predicted binding than P3 |
Comparison with AlphaFold3
The PeptiVerse analysis showed that structural confidence alone was not sufficient to rank the peptides, but it did help identify the strongest overall candidate. Among the two fully specified peptides that were also evaluated with AlphaFold3, P3 had the highest ipTM (0.37) and also the highest predicted binding affinity in PeptiVerse (6.964), whereas P4 had a slightly lower ipTM (0.36) and a weaker predicted affinity (5.856). This means that, for the two best-resolved candidates, the peptide with the better structural score also showed the stronger predicted binding signal. At the same time, all four peptides were predicted to be soluble and non-hemolytic, so none of them showed an obvious developability red flag. However, P1 and P2 both contained an ambiguous X residue, which makes them less reliable as lead candidates despite their otherwise acceptable PeptiVerse profiles. Overall, P3 provided the best balance between structural support and predicted binding, while still remaining soluble and non-hemolytic.
Peptide selected for advancement
I would advance P3 (SRWGVYVGRVEWRRA) because it showed the strongest overall combination of properties among the interpretable candidates. It matched the known binder in AlphaFold3 ipTM (0.37), gave the highest predicted binding affinity in PeptiVerse (6.964), and was still predicted to be soluble and non-hemolytic. Although its interaction with SOD1 still appeared mostly surface-bound rather than deeply buried, it showed the best overall compromise between predicted binding and therapeutic properties, making it the most reasonable peptide to prioritize for the next design or validation step.
## Part 0 — Assignment Overview and Objective
For this week, my main task is **Part C: Final Project: L-Protein Mutants**, which is the required section for committed listeners. The goal of this assignment is to improve the **stability** and **auto-folding** of the **MS2 phage lysis protein (L protein)**. This is biologically relevant because the L protein is essential for phage-mediated killing of *E. coli*, and bacterial resistance can emerge if the host alters the factors required for proper L-protein function.
In the MS2 system, the L protein is thought to contribute to bacterial lysis through membrane-associated activity. However, correct processing of the L protein depends on the bacterial chaperone **DnaJ**. If *E. coli* acquires a mutation in DnaJ that disrupts this interaction, the phage may lose infectivity. Therefore, the central design challenge is to propose L-protein mutants that may improve folding, reduce dependence on DnaJ, increase expression, or enhance lysis activity.
The assignment asks us to use a **mutational scoring notebook**, compare those computational predictions with **experimental mutational data**, and then propose **five mutations** supported by a clear rationale. In addition, at least **two proposed variants must contain mutations in the soluble region** and **two must contain mutations in the transmembrane region**.
Overall, I interpret this homework as a **rational mutagenesis exercise** combining computational prediction, prior experimental data, and biological reasoning. The final result is not proof that the mutants will work experimentally, but rather a justified proposal of promising L-protein variants for future testing.
---
## Part 1 — Understanding the L Protein Sequence and Defining Its Regions
The L-protein sequence provided in the homework is:
`METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT`
The full sequence is **75 amino acids** long. According to the homework notes, the **last 35 residues correspond to the transmembrane region**, while the N-terminal portion corresponds to the **soluble domain** involved in interaction with **DnaJ**.
Based on that definition, the sequence can be divided as follows:
- **Soluble region:** residues **1–40**
- **Transmembrane region:** residues **41–75**
This division is important because the final mutant proposal must include candidates from both structural and functional regions of the protein.
### Region map
| Position range | Sequence segment | Region |
|---|---|---|
| 1–40 | `METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV` | Soluble N-terminal domain |
| 41–75 | `LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT` | Transmembrane domain |
At this stage, this region map serves as the framework for all subsequent analysis. Mutations in the **soluble domain** are more likely to affect folding and interaction with DnaJ, whereas mutations in the **transmembrane region** are more likely to affect membrane insertion, oligomerization, or lysis-related activity.
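The region map above can be sanity-checked with a short script. The sequence and the 40/35 split come directly from the homework notes; the helper name `region_of` is my own convenience function.

```python
# Sketch: verify the L-protein region map by slicing the sequence.
# Sequence and the 40/35 residue split are taken from the homework notes.
L_PROTEIN = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"  # soluble N-terminal domain (1-40)
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"       # transmembrane domain (41-75)
)

def region_of(position: int) -> str:
    """Map a 1-based residue position to its structural region."""
    if not 1 <= position <= len(L_PROTEIN):
        raise ValueError(f"position {position} outside 1-{len(L_PROTEIN)}")
    return "soluble" if position <= 40 else "transmembrane"

soluble = L_PROTEIN[:40]
transmembrane = L_PROTEIN[40:]
print(len(L_PROTEIN), len(soluble), len(transmembrane))  # 75 40 35
```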
---
## Part 2 — Understanding the Mutational Scoring Step
After defining the soluble and transmembrane regions of the MS2 L protein, the next step is to understand the role of the **mutational scoring notebook** provided in the homework.
The purpose of this notebook is to assign a computational score to possible amino acid substitutions in the L-protein sequence. These scores are not direct measurements of biological activity. Instead, they are **predictive estimates** that help identify mutations that may be more favorable, better tolerated, or less disruptive.
This means the notebook should be used as a **prioritization tool**, not as final proof that a mutation improves the system. A favorable score does not guarantee improved lysis, correct folding, or DnaJ independence. Likewise, an unfavorable score does not prove that a mutation is impossible. The computational output is useful because it helps narrow the sequence space and identify candidate substitutions worth comparing with experimental evidence.
### Why this step matters
The number of possible amino acid substitutions across the full L-protein sequence is large, even for a small protein. Without a scoring step, mutant selection would be largely arbitrary. The notebook provides a rational first filter that makes the downstream design process more systematic.
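To make the size of this search space concrete, the single-substitution landscape can be enumerated directly. This sketch only counts variants; it does not reproduce the notebook's scoring model.

```python
# Sketch: enumerate every possible single-residue substitution in the
# 75-residue L protein to show why a scoring filter is needed.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
L_PROTEIN = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
             "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT")

mutations = [
    f"{wt}{pos}{aa}"                      # e.g. "M1A" in standard notation
    for pos, wt in enumerate(L_PROTEIN, start=1)
    for aa in AMINO_ACIDS
    if aa != wt                           # skip silent "substitutions"
]
print(len(mutations))  # 75 positions x 19 alternatives = 1425
```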
### What I want to extract from this step
From the mutational scoring output, I aim to identify:
1. positions that appear mutationally tolerant,
2. substitutions that seem favorable,
3. whether those substitutions fall in the soluble or transmembrane region,
4. and which candidates are worth carrying forward into comparison with the experimental dataset.
At this stage, I am not yet choosing the final five mutants. I am only generating a preliminary candidate list.
---
## Part 3 — Using Experimental Mutational Data to Evaluate the Computational Scores
After obtaining the computational mutational scores, the next essential step is to compare them with the available **experimental mutational data** for the MS2 L protein.
This comparison is important because the notebook only provides a **computational estimate** of how favorable or unfavorable each amino acid substitution might be. In contrast, the experimental dataset reflects what was actually observed in the lab. Since the main functional interest of this project is improved lysis-protein performance, the experimental effects on lysis are more directly relevant than sequence-model predictions alone.
I see this comparison as serving two main purposes.
First, it helps evaluate how informative the computational scoring approach is for this particular protein. If experimentally favorable mutations also tend to receive favorable computational scores, then the notebook is capturing useful information. If the agreement is weak, then the scores should be interpreted more cautiously.
Second, this step helps prioritize candidates for the final design proposal. Mutations that look favorable in the experimental dataset, the computational scores, or ideally both, become stronger candidates for the final set of proposed variants.
### Questions I will use to filter candidate mutations
For each mutation, I want to ask:
- Does the mutation have a favorable or at least non-disruptive experimental effect?
- Does the notebook assign it a favorable computational score?
- Is the mutation located in the soluble or transmembrane region?
- Is the site likely to be too conserved to mutate safely?
This comparison is the bridge between raw prediction and rational design. It allows me to move from a large set of possible substitutions to a smaller and more biologically plausible group of candidate mutants.
---
## Part 4 — Comparing Computational Scores with Experimental Mutational Data
To move from general prediction to actual mutant selection, I next compared the **computational mutational scores** from the notebook with the available **experimental mutational data** for the MS2 L protein. This step is explicitly required in the assignment and is important because the notebook only predicts whether a mutation may be favorable, while the experimental dataset reports how specific L-protein mutants affected lysis in the lab.
The main goal of this comparison is to determine whether the computational scores are actually informative for this protein. If mutations with favorable experimental effects also tend to receive favorable notebook scores, then the language-model-based scoring method is likely capturing meaningful constraints in the L-protein sequence. If the agreement is weak, then the scores should be treated more cautiously and used only as one supporting source of evidence rather than the main basis for mutant selection.
At this stage, I used the comparison as a filtering step. Instead of selecting mutations directly from the full sequence, I prioritized candidates by asking whether each mutation met one or more of the following criteria:
1. it showed a favorable or at least non-disruptive effect in the experimental lysis dataset,
2. it received a positive or relatively favorable score in the computational notebook,
3. it was located in the appropriate region of the protein for the final assignment requirements,
4. and it was not obviously at a highly conserved position that might be risky to mutate.
This approach is consistent with the recommendation in the homework, which suggests looking for positions and mutations with either a positive experimental effect or a positive score and then using combinations of those mutations to design candidate variants.
Because the L protein contains both a **soluble N-terminal domain** and a **transmembrane region**, I also considered the structural context of each mutation during this comparison. Mutations in the soluble domain are more likely to affect folding or interaction with DnaJ, whereas mutations in the transmembrane region are more likely to affect membrane-associated lysis activity. Therefore, I did not interpret all favorable scores in the same way; instead, I evaluated them in the context of where the residue is located in the protein.
At the end of this comparison step, the outcome is not yet a final mutant list, but rather a **shortlist of plausible candidates**. These candidates can then be narrowed down further using conservation analysis and biological reasoning before proposing the final five mutations required for submission.
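The filtering logic described above can be sketched as a simple intersection of evidence sources. All mutation names and numeric values below are hypothetical placeholders, since the actual notebook scores and experimental lysis effects are not reproduced here.

```python
# Sketch of the filtering step with made-up illustrative numbers.
# Convention assumed: higher score = more favorable prediction;
# exp_effect >= 0 = improved or tolerated lysis in the experimental data.
candidates = {
    # mutation: (notebook_score, exp_effect) -- hypothetical values
    "S9A":  ( 0.8,  0.5),
    "R20K": ( 0.3, -0.1),
    "L44I": ( 0.6,  0.2),
    "Q72E": (-1.2,  0.4),
}

def region_of(mutation: str) -> str:
    pos = int(mutation[1:-1])       # strip wild-type and mutant letters
    return "soluble" if pos <= 40 else "transmembrane"

shortlist = [
    (mut, region_of(mut))
    for mut, (score, effect) in candidates.items()
    if score > 0 and effect >= 0    # favorable score AND non-disruptive effect
]
print(shortlist)
```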
## Part 5 — Building a Shortlist of Candidate Mutations
After comparing the computational mutational scores with the available experimental mutational data, the next step is to build a **shortlist of candidate mutations** for the final design proposal.
At this stage, the goal is not yet to define the final five mutants, but rather to identify a smaller group of substitutions that appear promising enough to consider further. I approached this as a filtering problem: starting from many possible substitutions across the full L-protein sequence, I narrowed the list by combining computational, experimental, and biological criteria.
### Candidate selection criteria
I considered a mutation to be a strong candidate when it met one or more of the following conditions:
1. it showed a favorable or non-disruptive effect in the experimental lysis dataset,
2. it received a favorable computational score in the mutational scoring notebook,
3. it occurred at a residue that was not obviously too conserved to mutate safely,
4. and it fit one of the two required structural regions of the protein:
- the **soluble N-terminal domain**
- the **transmembrane domain**
This filtering strategy is important because not all favorable-looking mutations should be treated equally. A mutation with a strong score but poor experimental support is less convincing than one supported by both sources. Similarly, a mutation at a highly conserved position may be riskier even if the score looks favorable.
### Separating candidates by region
Because the assignment requires mutations from both major regions of the L protein, I separated candidate mutations into two categories:
- **soluble-domain candidates** (residues 1–40)
- **transmembrane-domain candidates** (residues 41–75)
This regional classification is biologically meaningful. Mutations in the soluble domain are more likely to affect folding, expression, or interaction with DnaJ, while mutations in the transmembrane domain are more likely to affect membrane insertion, oligomerization, or lysis-related activity.
By separating candidates this way, I can make sure that my final mutant proposal satisfies the homework requirements while also reflecting the different functional roles of the two parts of the protein.
### Why a shortlist is necessary
A shortlist is useful because the final design step should be based on a manageable set of plausible candidates rather than the full mutational landscape. It creates a structured transition from broad screening to focused design.
At the end of this step, I expect to have:
- a set of promising **soluble-domain mutations**,
- a set of promising **transmembrane-domain mutations**,
- and enough information to begin assembling the **final five proposed mutants** for submission.
### Interim conclusion
This shortlist-building step is the practical outcome of the earlier analysis. It converts general computational and experimental evidence into a focused pool of candidate mutations that can be used in the final rational design proposal.
## Part 6 — Strategy for Selecting the Final Five Mutants
After building a shortlist of candidate mutations, the next step is to define a clear strategy for selecting the **final five mutants** required for the assignment.
The homework does not simply ask for five random substitutions. Instead, it asks for a rationally chosen set of mutations supported by computational scoring, experimental evidence, and biological interpretation. For that reason, my selection strategy is based on combining multiple types of evidence rather than relying on a single ranking metric.
### Overall selection strategy
My goal is to choose five mutations that together satisfy both the **assignment constraints** and the **biological design goals** of the project.
To do this, I plan to:
1. select at least **two mutations in the soluble region**,
2. select at least **two mutations in the transmembrane region**,
3. and use the fifth mutation as either:
- an additional strong individual candidate, or
- part of a combined design if there is a good biological reason to combine favorable substitutions.
This ensures that the final design is balanced across both major functional regions of the protein.
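This quota-based strategy can be sketched as a small selection routine. The ranking and mutation names are hypothetical placeholders; only the constraint logic (at least two soluble and two transmembrane picks, plus a flexible fifth slot) reflects the plan above.

```python
# Sketch: pick the final five from a best-first ranking while
# guaranteeing >= 2 soluble and >= 2 transmembrane mutations.
def select_final_five(ranked):
    """ranked: list of (mutation, region) tuples ordered best-first."""
    chosen, quota = [], {"soluble": 2, "transmembrane": 2}
    for mut, region in ranked:          # first pass: satisfy region quotas
        if quota.get(region, 0) > 0:
            chosen.append(mut)
            quota[region] -= 1
    for mut, _ in ranked:               # second pass: fill the flexible slot
        if len(chosen) == 5:
            break
        if mut not in chosen:
            chosen.append(mut)
    return chosen

# Hypothetical best-first ranking from the earlier comparison step.
ranked = [("S9A", "soluble"), ("L44I", "transmembrane"),
          ("R20K", "soluble"), ("Q33H", "soluble"),
          ("F47L", "transmembrane"), ("T70S", "transmembrane")]
final_five = select_final_five(ranked)
print(final_five)
```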
### What makes a mutation strong enough for final selection
A mutation is more likely to be chosen for the final set if it meets several of the following conditions:
- it has a favorable or non-disruptive experimental effect,
- it has a favorable computational score,
- it occurs at a position that is not strongly constrained,
- it makes biological sense for the region where it occurs,
- and it contributes to a diverse final set rather than repeating the same logic multiple times.
This last point is important. I do not want all five mutations to reflect the exact same design idea. A stronger final proposal includes candidates that test different but plausible hypotheses about how L-protein performance might be improved.
### Region-specific reasoning
For **soluble-domain mutations**, I will prioritize candidates that could plausibly improve:
- folding,
- protein stability,
- expression,
- or interaction with DnaJ.
For **transmembrane-domain mutations**, I will prioritize candidates that could plausibly improve:
- membrane insertion,
- helix packing,
- oligomerization,
- or lysis-associated membrane activity.
This means that the same score value may be interpreted differently depending on whether the mutation lies in the soluble or transmembrane part of the protein.
### Why the fifth mutant matters
The fifth mutant gives some flexibility in the design strategy. It can be used in one of two ways.
One option is to choose the **single best remaining candidate** after selecting the required soluble and transmembrane mutations.
Another option is to use it as a **combined or more exploratory design**, for example by combining individually favorable substitutions if there is a reasonable hypothesis that their effects could be compatible or additive.
This makes the fifth choice especially useful because it can strengthen the overall design logic of the final proposal.
### Interim conclusion
At the end of this step, I should be ready to move from a broad shortlist to a final set of **five justified mutant designs**. The next stage will therefore be to present those final candidates and explain, for each one, why it was selected and what effect it is expected to have.
## Part 7 — Final Proposed Mutants
Based on the comparison between computational mutational scores, experimental mutational data, and region-specific biological reasoning, I selected the following **five candidate L-protein mutants** for the final proposal.
These candidates were chosen to satisfy the assignment requirement of including mutations in both the **soluble region** and the **transmembrane region**, while also prioritizing substitutions that appear favorable, non-disruptive, or biologically plausible.
### Final mutant set
| Mutant | Substitution | Region | Main rationale |
|---|---|---|---|
| Mutant 1 | `X##Y` | Soluble | Supported by favorable score and plausible effect on folding or DnaJ interaction |
| Mutant 2 | `X##Y` | Soluble | Supported by experimental tolerance and suitable location in the N-terminal domain |
| Mutant 3 | `X##Y` | Transmembrane | Plausible effect on membrane behavior or lysis-related activity |
| Mutant 4 | `X##Y` | Transmembrane | Supported by score and compatible with transmembrane-region design goals |
| Mutant 5 | `X##Y` or combined design | Soluble / TM / Combined | Selected as the strongest remaining candidate or exploratory combined variant |
### Mutant 1 — [insert mutation]
This mutation was selected as a **soluble-domain candidate** because it appears to be compatible with the design goal of improving folding, stability, or interaction with DnaJ. In addition, it showed either a favorable computational score, a favorable experimental effect, or both. Because this residue lies in the N-terminal soluble region, I interpret its potential effect mainly in terms of protein processing and chaperone-related behavior rather than membrane activity.
### Mutant 2 — [insert mutation]
This mutation was also selected from the **soluble region** to satisfy the assignment requirement and to provide a second independent candidate affecting the non-transmembrane portion of the protein. Compared with Mutant 1, this substitution may represent a different design logic, such as improving tolerance at a flexible site or reducing disruption at a position relevant to folding. Including more than one soluble-region candidate increases the diversity of the final proposal.
### Mutant 3 — [insert mutation]
This mutation was selected from the **transmembrane region** because it is a plausible candidate for altering membrane insertion, helix packing, oligomerization, or lysis-related membrane activity. Since the C-terminal portion of the L protein is transmembrane, mutations in this region were interpreted in a different biological context than soluble-region substitutions. This candidate was prioritized because it appears compatible with preserving or improving membrane-associated function.
### Mutant 4 — [insert mutation]
This is the second **transmembrane-domain candidate** in the final set. It was chosen to ensure that the final proposal includes more than one plausible way to alter the membrane-associated properties of the L protein. As with Mutant 3, this substitution was selected based on a combination of computational support, experimental tolerability, and regional biological interpretation.
### Mutant 5 — [insert mutation or combined design]
The fifth mutant was used as a flexible design choice. Depending on the final ranking of candidates, this position can be filled by either:
- the strongest remaining single mutation, or
- a combined design built from individually favorable substitutions.
I included this final slot to preserve some design flexibility while still keeping the overall proposal rational and biologically interpretable. If used as a combined variant, the goal would be to test whether individually tolerated substitutions might produce an additive or complementary effect.
### Summary of design logic
Overall, these five proposed mutants were selected to balance:
- the structural requirements of the assignment,
- the computational mutational scoring results,
- the experimental lysis data,
- and the biological differences between the soluble and transmembrane regions of the protein.
Rather than treating the full sequence as a uniform mutational landscape, I used a region-aware strategy to generate a final set of candidates that are diverse, interpretable, and suitable for future experimental testing.
# Week 6 HW: Genetic Circuits Part I — Assembly Technologies
## Assignment: DNA Assembly
### 1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?
In this week’s protocol, the PCR reactions are assembled using Phusion HF PCR Mix (2X) together with template plasmid, forward primer, reverse primer, and nuclease-free water. The role of the master mix is to provide the core PCR chemistry in a convenient premixed format, while the user adds the sequence-specific primers and DNA template separately.
Some key components typically found in a high-fidelity PCR master mix include:
- a high-fidelity DNA polymerase, which synthesizes new DNA strands with lower error rates than standard Taq polymerase
- dNTPs, which are the nucleotide building blocks used to extend the new DNA strands
- magnesium ions (Mg²⁺), which are required as cofactors for polymerase activity
- an optimized reaction buffer, which maintains pH, ionic strength, and enzyme performance
- stabilizing components that help preserve enzyme activity during thermocycling
The purpose of using a high-fidelity system in this lab is especially important because the PCR products are later used for Gibson Assembly, so sequence accuracy matters.
### 2. What are some factors that determine primer annealing temperature during PCR?
Primer annealing temperature is mainly determined by the melting temperature (Tm) of the primers. In practice, Tm depends on several sequence properties, including primer length, GC content, base composition, and whether there are mismatches or secondary structures such as hairpins or dimers.
According to the lab guidance, a good binding region is usually around 18–22 bp, with a target Tm of about 52–58 °C, and primer pairs should ideally be within 5 °C of each other. The protocol also recommends a modest GC clamp at the 3′ end, avoiding excessive G/C content in the final few bases. These features improve specific binding and reduce inefficient or nonspecific amplification.
In this specific cloning workflow, annealing temperature is also influenced by the fact that the primers contain two functional regions: a binding region to amplify the template and a 5′ overlap region used later for Gibson Assembly. The overlap helps with assembly, but the annealing behavior during PCR is mostly governed by the binding portion of the primer.
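As a rough illustration of how these sequence properties translate into a Tm estimate, the Wallace rule (2 °C per A/T, 4 °C per G/C) can be applied to a hypothetical 20-nt binding region. Real primer design should use a nearest-neighbor Tm calculator; this heuristic is only reasonable for short primers.

```python
# Sketch: two quick primer checks. The Wallace rule is a rough heuristic
# for primers under ~25 nt, not a substitute for a nearest-neighbor model.
def wallace_tm(primer: str) -> int:
    """Estimated Tm in degC: 2 degC per A/T plus 4 degC per G/C."""
    primer = primer.upper()
    at = sum(primer.count(b) for b in "AT")
    gc = sum(primer.count(b) for b in "GC")
    return 2 * at + 4 * gc

def gc_content(primer: str) -> float:
    """GC percentage of the primer."""
    primer = primer.upper()
    return 100.0 * sum(primer.count(b) for b in "GC") / len(primer)

binding = "ATGGCTAGCAAAGGAGAAGA"   # hypothetical 20-nt binding region
print(wallace_tm(binding), round(gc_content(binding), 1))  # 58 45.0
```

With 11 A/T and 9 G/C bases, this example lands at 58 °C, inside the 52–58 °C target window given in the lab guidance.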
### 3. There are two methods from this class that create linear fragments of DNA: PCR and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.
PCR and restriction enzyme digestion can both generate linear DNA fragments, but they do so in very different ways. PCR amplifies a defined region of DNA using primers, polymerase, nucleotides, and thermocycling. It is especially useful when you want to amplify a specific fragment, introduce mutations, add overlaps, or generate a fragment even when no convenient restriction sites are available.
In contrast, a restriction digest cuts DNA at pre-existing recognition sites using sequence-specific restriction enzymes. This is often simpler when the correct restriction sites already exist in the plasmid or insert and when you want a clean excision without introducing sequence changes. However, restriction digestion is constrained by the locations of those recognition sites and is less flexible than PCR for introducing new overlaps or mutations.
For this week’s Gibson workflow, PCR is particularly advantageous because it allows the experimenter to generate a backbone fragment and a color fragment while also incorporating sequence changes in the chromophore region through primer design. Restriction digestion is often preferable when the fragment boundaries are already defined by existing sites and no mutagenesis or custom overlap design is needed.
### 4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?
To be appropriate for Gibson cloning, the DNA fragments must have correctly designed overlapping ends so that adjacent fragments can anneal after exonuclease treatment. In this lab, the primer design guidance recommends overlaps of roughly 20–22 bp, while Gibson/HiFi assembly more broadly tolerates overlaps in the 20–40 bp range. The fragments must also be in the correct orientation and must cover the intended regions without missing or duplicating critical sequence.
It is also important to reduce background from the original plasmid template. The protocol therefore includes a DpnI digest after PCR, which selectively digests methylated parental plasmid DNA while leaving the newly amplified PCR products intact. After that, the fragments should be purified, quantified, and checked on a diagnostic gel to confirm the expected sizes.
Finally, Gibson reactions should be set up using an appropriate molar ratio, and this week’s lab recommends a 2:1 insert-to-vector ratio for efficient assembly. Good fragment quality, correct overlaps, proper concentration, and clean purification are all essential for successful cloning.
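Two of these checks, junction overlap identity and the 2:1 insert-to-vector molar ratio, are simple enough to sketch in code. The sequences and masses below are hypothetical stand-ins; the mass formula assumes molar amount scales as mass divided by fragment length.

```python
# Sketch: pre-Gibson sanity checks on hypothetical fragments.
def overlap_ok(frag_a: str, frag_b: str, n: int = 20) -> bool:
    """True if the last n bases of frag_a equal the first n bases of frag_b."""
    return len(frag_a) >= n and frag_a[-n:] == frag_b[:n]

def insert_mass_ng(vector_ng: float, vector_bp: int, insert_bp: int,
                   molar_ratio: float = 2.0) -> float:
    """ng of insert needed for a given insert:vector molar ratio."""
    return vector_ng * molar_ratio * insert_bp / vector_bp

overlap = "ATGCGTACGTTAGCCTAGGA"      # hypothetical 20-nt designed junction
frag_a = "GGGG" + overlap             # backbone fragment end
frag_b = overlap + "CCCC"             # insert fragment start
print(overlap_ok(frag_a, frag_b))     # True
print(insert_mass_ng(50, 3000, 750))  # 25.0 ng insert per 50 ng vector
```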
### 5. How does the plasmid DNA enter the *E. coli* cells during transformation?
In this lab, plasmid DNA enters *E. coli* through heat-shock transformation. First, chemically competent cells are thawed on ice, then mixed with the assembled DNA and kept on ice to allow the DNA to associate with the cell surface. The cells are then exposed briefly to 42 °C, which helps create a transient increase in membrane permeability, allowing plasmid DNA to enter.
After heat shock, the cells are returned to ice and then allowed to recover in SOC medium for about one hour. This recovery period helps the cells repair their membranes and begin expressing the antibiotic resistance marker carried by the plasmid. Finally, the cells are plated on selective agar, so only bacteria that received the plasmid can survive and form colonies.
### 6. Describe another assembly method in detail (such as Golden Gate Assembly)
A powerful alternative to Gibson Assembly is Golden Gate Assembly, which uses Type IIS restriction enzymes such as BsaI or BsmBI together with DNA ligase in a one-pot reaction. Unlike standard restriction enzymes, Type IIS enzymes cut outside of their recognition sequences, which allows the user to design custom overhangs that determine exactly how the DNA parts will assemble. During the reaction, the DNA is repeatedly digested and ligated, and correctly assembled products accumulate because the recognition sites are usually removed in the final construct. This makes Golden Gate especially useful for assembling multiple parts in a defined order with high efficiency. It is often preferred for modular cloning systems, standardized part libraries, and scar-minimized multi-fragment assembly workflows. Compared with Gibson, Golden Gate depends more strongly on careful restriction-site planning, but it can be extremely efficient for combinatorial and standardized DNA assembly workflows.
Figure 1. Conceptual Golden Gate Assembly workflow showing Type IIS digestion, custom overhang formation, and ligation into an ordered final construct.
### Modeling Golden Gate Assembly in Benchling
To model Golden Gate Assembly in Benchling, I created a simple design with a plasmid backbone and two insert fragments containing Type IIS restriction sites at their boundaries. I annotated the BsaI sites, the expected cut positions, and the custom overhangs that would be exposed after digestion. I then verified that the designed overhangs were compatible only with the intended neighboring fragments, which ensures ordered ligation. This model illustrates the core Golden Gate logic: digestion outside the recognition site, programmable overhang creation, fragment annealing in a defined order, and loss of the restriction sites in the final assembled construct.
Figure 2. Benchling-based conceptual model of Golden Gate Assembly showing Type IIS sites, fragment boundaries, and directed overhang compatibility.
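The overhang-compatibility logic behind this Benchling model can be sketched as a small check: the overhangs at every junction must match, and no overhang may be reusable at a second junction. Part names and overhang sequences below are hypothetical.

```python
# Sketch: verify that 4-nt Golden Gate overhangs enforce one ordered,
# circular assembly. Each part is (name, 5' overhang, 3' overhang).
parts = [
    ("vector",   "AATG", "GCTT"),
    ("promoter", "GCTT", "CATA"),
    ("cds",      "CATA", "AATG"),  # closes the circle back to the vector
]

def ordered_assembly_ok(parts) -> bool:
    # Every junction's overhangs must match (including the closing junction)...
    for (_, _, right), (_, next_left, _) in zip(parts, parts[1:] + parts[:1]):
        if right != next_left:
            return False
    # ...and each overhang must be unique so no part can ligate out of order.
    overhangs = [p[1] for p in parts]
    return len(set(overhangs)) == len(overhangs)

print(ordered_assembly_ok(parts))  # True
```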
### References
- HTGAA Spring 2026 — Week 6: Genetic Circuits Part I: Assembly Technologies.
- HTGAA 2026 Gibson Assembly Lab (updated).
- NEB Gibson Assembly overview.
## Assignment: Asimov Kernel
For the second part of Week 6, I used Asimov Kernel to explore the official Repressilator demo, recreate it in my own construct, and build three additional circuits to compare how different regulatory architectures affect simulated expression dynamics.
### Repressilator demo
I opened the official Repressilator construct from the Bacterial Demos repository and ran the simulator.
**Expected behavior**
I expected oscillatory behavior because the circuit is based on cyclic repression among three regulators.
**Observed behavior**
The simulator showed a short initial transient phase followed by sustained periodic oscillations in both protein concentrations and RNA concentrations over time. The oscillations appeared stable after the first several hours, which is consistent with the expected behavior of a repressilator circuit.
**Interpretation**
The simulation matched my expectation. The results support the idea that a three-node cyclic repression network can generate oscillatory dynamics rather than converging to a simple steady state.
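To see why this topology oscillates, a toy deterministic repressilator (in the style of the Elowitz–Leibler dimensionless model, not the Kernel simulator) can be integrated directly. The parameter values are illustrative choices in the known oscillatory regime, not Kernel's parameters.

```python
# Toy repressilator: three mRNA/protein pairs in a cyclic repression loop,
# integrated with simple Euler steps. Dimensionless Elowitz-Leibler form.
alpha, beta, n = 216.0, 0.2, 2     # promoter strength, decay ratio, Hill coeff
m = [0.0, 0.0, 0.0]                # mRNA levels for the three repressors
p = [1.0, 2.0, 3.0]                # slightly asymmetric start breaks symmetry
dt, steps = 0.01, 30_000           # integrate t = 0..300 dimensionless units
trace = []

for _ in range(steps):
    # Node i is repressed by the protein of the previous node in the cycle.
    dm = [alpha / (1 + p[(i - 1) % 3] ** n) - m[i] for i in range(3)]
    dp = [beta * (m[i] - p[i]) for i in range(3)]
    m = [m[i] + dt * dm[i] for i in range(3)]
    p = [p[i] + dt * dp[i] for i in range(3)]
    trace.append(p[0])

late = trace[len(trace) // 2:]     # discard the initial transient
swing = max(late) - min(late)
print(swing > 5)                   # large late-time swing = sustained cycling
```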
### Repressilator recreation
I recreated the repressilator in my own construct using the same overall cyclic repression logic as the official example.
**Expected behavior**
I expected oscillatory behavior again, since the recreated circuit preserves the three-node cyclic repression topology.
**Observed behavior**
In my recreated version, the simulator did not show sustained oscillations. Instead, the system converged to a non-oscillatory steady state in which LambdaCI accumulated strongly, while LacI and TetR remained at much lower levels. The RNA plots showed the same qualitative pattern, suggesting that one branch of the circuit dominated the overall dynamics rather than producing balanced cyclic repression.
**Interpretation**
My recreated construct did not match the official repressilator demo. A likely explanation is that the recreated version differs from the original in one or more important details, such as promoter-repressor matching, part order, parameterization, or regulatory balance. Another possibility is that the system is highly sensitive to initial conditions or simulation assumptions, so small differences can push the network into a stable steady state instead of an oscillatory regime.
**Possible explanation for the mismatch**
Since the pLacI/LambdaCI branch appears to dominate the final state, one possible issue is that repression strengths or expression balance are not equivalent to the official example. This could prevent the delayed cyclic repression required for oscillations and instead stabilize one dominant node.
The recreated repressilator did not reproduce the oscillatory dynamics of the official example. Instead, the simulation converged to a steady state in which the LambdaCI-associated branch dominated, while the LacI and TetR branches remained low. The RNA and flux plots supported the same qualitative conclusion, indicating an imbalanced regulatory architecture rather than sustained cyclic repression.
### Construct 1 — Single-gene LacI expression circuit
**Design idea**
This construct contains a simple transcriptional unit composed of pLacI, A1 RBS, LacI, and a bacterial terminator on a plasmid backbone.
**Expected behavior**
I expected a simple non-oscillatory expression pattern in which LacI concentration rises over time and then approaches a stable steady state. Since this construct does not include a cyclic feedback loop, I did not expect oscillations.
**Observed behavior**
The simulator showed a rapid increase in both LacI protein and LacI RNA levels during the initial phase, followed by a stable steady state over the rest of the simulation. No oscillatory behavior was observed. The endpoint RNAP flux and ribosome flux plots were also consistent with active expression of a single transcriptional unit.
**Interpretation**
The result matched my expectation. This construct behaves as a simple single-gene expression circuit with stable output rather than dynamic oscillatory behavior.
### Construct 2 — Cross-repression circuit
**Design idea**
This construct contains two transcriptional units: pTetR → LacI and pLacI → TetR. The goal was to create a simple two-node cross-repression circuit.
**Expected behavior**
I expected a more regulated and competitive behavior than in Construct 1, since each branch can influence the other indirectly through repressor-promoter interactions. I did not necessarily expect sustained oscillations, but I expected the system to favor one dominant steady state or a strong imbalance between the two nodes.
**Observed behavior**
The simulator showed that the TetR branch became dominant, reaching a much higher steady-state protein and RNA level than the LacI branch. LacI remained at a low concentration throughout the simulation, while TetR accumulated quickly and stabilized at a much higher level. The endpoint RNAP and ribosome flux plots were consistent with this asymmetry, showing that the pLacI → TetR branch was much more active than the pTetR → LacI branch.
Interpretation
The result matched the expectation that this circuit would behave differently from a single-gene expression system and would not produce balanced oscillations. Instead, the network converged to a steady state in which one regulatory branch strongly outcompeted the other.
Construct 3 — One-way repression cascade
Design idea
This construct contains two transcriptional units arranged as a simple repression cascade: pTetR → LacI and pLacI → LambdaCI. The goal was to build a directional regulatory cascade rather than a symmetric cross-repression circuit.
Expected behavior
I expected the first branch to express LacI strongly, since TetR is not present in this circuit to repress pTetR. I then expected LacI to repress pLacI, leading to lower expression of LambdaCI. Therefore, I expected a non-oscillatory steady state with high LacI and low LambdaCI.
Observed behavior
The simulator showed that both LacI and LambdaCI increased rapidly and then converged to very similar steady-state levels. The RNA plots showed the same qualitative behavior, with both transcripts reaching nearly identical stable concentrations. The endpoint RNAP and ribosome flux plots were also very similar for the two branches, indicating that both transcriptional units remained comparably active.
Interpretation
The result did not match my original expectation of a strongly directional repression cascade. Instead, the circuit behaved more like two balanced expression modules operating in parallel, with no strong suppression of the LambdaCI branch by LacI.
Possible explanation
A likely explanation is that the simplified simulation setup did not generate strong enough regulatory asymmetry for LacI to effectively suppress the second branch. Another possibility is that the promoter-repressor relationships in this model are not sufficient by themselves to create a clear cascade effect under the default simulation conditions.
Final reflection
This week helped me connect molecular cloning concepts with dynamic circuit behavior in simulation. The DNA assembly section clarified how fragment design, overlaps, and transformation logic affect experimental success, while the Kernel section showed how different circuit topologies can produce stable expression, dominant steady states, or oscillatory behavior depending on regulatory architecture and balance.
Week 7 HW: Genetic Circuits II, Fungal Materials, and First DNA Twist Order
Part 1: Intracellular Artificial Neural Networks (IANNs)
1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?
IANNs have important advantages over traditional Boolean genetic circuits because they can perform analog computation rather than only binary ON/OFF logic. Classical genetic circuits are useful for implementing logic gates such as AND, OR, and NOT, but they are limited when the biological problem depends on graded signal levels rather than strict binary states.
In contrast, IANNs can assign different weights to different intracellular inputs, combine them through addition or subtraction, and generate a nonlinear output. This makes them more suitable for interpreting real cellular states, where inputs often vary continuously in magnitude. Instead of forcing biology into rigid digital logic, IANNs can classify more subtle and realistic signal combinations.
Another important advantage is that intracellular artificial neurons can be composed into multilayer networks. A single perceptron is limited to linearly separable decision boundaries, but multilayer systems can produce more complex behaviors. In synthetic biology, this is valuable because cellular environments are noisy, multidimensional, and dynamic. An IANN therefore offers a more flexible and tunable framework for state classification than a conventional Boolean circuit.
2. Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.
A useful application for an IANN would be the intracellular classification of an infection-like cell state in mammalian cells. Instead of responding to just one biomarker, the circuit could integrate multiple molecular signals that together better represent whether a cell is truly infected or entering a suspicious pathological state.
For example, the system could receive three inputs:
X1: a signal associated with interferon pathway activation
X2: a signal associated with inflammatory signaling such as NF-κB activity
X3: a signal more directly linked to viral infection, such as a viral RNA sensing output
In an IANN, each of these inputs could be assigned a different weight. A viral signal could have the strongest positive weight, a general inflammatory signal could have a moderate weight, and a stress-associated signal could even be assigned a negative influence if it tends to create false positives. The output would behave like a classifier: only when the weighted sum crosses a threshold would the cell activate a fluorescent reporter or another downstream response.
This is more realistic than a strict Boolean circuit because infection-related biology is usually not binary. However, there are important limitations. Different plasmids may enter cells at different copy numbers, creating cell-to-cell variability. Different inputs may also rise and decay at different times, which can distort the intended weighted computation. Additional limitations include molecular burden, leakage in the OFF state, crosstalk between regulatory parts, and the fact that many biological neural-like systems still rely on weights that were optimized offline rather than learned directly inside the cell.
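A sketch of the weighted-threshold behavior described above. The weights, threshold, and input scaling are illustrative assumptions, not measured parameters; the point is only that graded inputs combine additively before a single decision.

```python
def classify(x1, x2, x3, w=(0.6, 0.8, 2.0), threshold=1.5):
    """Single weighted-sum 'neuron': the reporter fires only if the weighted
    sum of graded inputs crosses the threshold. Weights are hypothetical:
    the viral-RNA signal (x3) carries the strongest positive weight."""
    s = w[0] * x1 + w[1] * x2 + w[2] * x3
    return s >= threshold, s

# Graded inputs in [0, 1]: a strong viral signal triggers the output even with
# modest inflammation, while inflammation alone stays below threshold.
viral_state        = classify(0.2, 0.3, 0.9)   # -> (True, 2.16)
inflammation_only  = classify(0.9, 0.9, 0.0)   # -> (False, 1.26)
```

A negatively weighted fourth input (e.g., a stress signal prone to false positives) would simply subtract from the sum before the threshold comparison.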
3. Draw a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.
Below is a conceptual intracellular multilayer perceptron. In this architecture, layer 1 integrates two DNA inputs and produces an intermediate endoribonuclease output. That endoribonuclease regulates the reporter in layer 2.
Layer 1
X1 DNA ──Tx/Tl──> EndoRNase R1 ─┐
├── hidden node H1 ──Tx/Tl──> EndoRNase R3
X2 DNA ──Tx/Tl──> EndoRNase R2 ─┘
Layer 2
EndoRNase R3 ──regulates reporter mRNA──> Fluorescent protein (e.g., eGFP) ──> Output Y
Figure 1. Conceptual intracellular multilayer perceptron in which layer 1 integrates two DNA inputs and produces an intermediate endoribonuclease that regulates fluorescent output in layer 2.
Part 2: Fungal Materials
1. What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?
Existing fungal materials are mainly based on mycelium, the filamentous vegetative structure of fungi. One major category is mycelium-based composites, in which fungi grow through agricultural or industrial waste and bind the substrate into a lightweight solid material. These are being explored or used for protective packaging, thermal insulation, acoustic panels, and interior design elements.
Another important category is pure mycelium materials, which are produced with less dependence on a bulky plant substrate and can be processed into leather-like sheets, foam-like materials, and paper-like materials.
Their main advantages are related to sustainability. They can be grown from agricultural residues, usually require lower energy inputs than many conventional materials, and are often biodegradable or compostable. In addition, fungal materials can show useful properties such as low density, thermal insulation, acoustic absorption, and, in some cases, favorable fire-related behavior.
Their disadvantages are also important. Many fungal materials still have lower and more variable mechanical strength than conventional plastics, foams, or structural composites. They can absorb moisture, which may weaken performance over time. Long-term durability, reproducibility, and large-scale manufacturing consistency remain major challenges. For that reason, fungal materials are currently more realistic for packaging, insulation, acoustics, and leather alternatives than for demanding structural applications.
2. What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?
One application I find especially interesting would be to engineer fungi to create smart building materials that not only provide insulation or structure, but also sense environmental changes. For example, I would like to engineer a fungal material that could detect persistent moisture inside walls and respond with a visible color change or another easy-to-read signal.
This would be useful because hidden water damage is often detected too late, after microbial growth, structural problems, or health risks have already started. A fungal material that acts both as a material and as a living sensor could support more sustainable and safer buildings.
Fungi offer important advantages over bacteria for this type of application. Fungi naturally grow as extended hyphal networks, allowing them to form cohesive three-dimensional materials directly on solid substrates. Many fungi also grow on lignocellulosic or waste-derived feedstocks, which is attractive for low-cost and sustainable manufacturing. In addition, fungi are naturally well suited to material formation because their biology already supports macroscopic structure generation.
Compared with bacteria, fungi may therefore be better chassis for engineered living materials when the goal is to build a physical object rather than only produce a soluble molecule. However, fungi also have drawbacks: they often grow more slowly, can be harder to genetically manipulate than standard bacterial hosts, and may introduce variability in morphology and performance. Even so, they are especially promising for material-oriented synthetic biology.
For my individual final project, I selected the concept of an Automated Optimization of a DNAzyme–Cas12a Amplified Lead Sensor. The project is based on coupling a Pb²⁺-responsive DNAzyme to a CRISPR-Cas12a amplification step, so that substrate cleavage releases a trigger capable of activating Cas12a and generating a fluorescent signal.
In the short term, the project focuses on in-silico design and kinetic modeling. In the medium term, the goal is to optimize the assay experimentally using automated liquid handling. In the long term, the platform could be translated into a modular and portable environmental sensing format.
Aim 1 draft
The first aim of my final project is to computationally design and prioritize a modular DNAzyme–Cas12a lead sensor by optimizing nucleic acid architecture, assessing structural plausibility of the Cas12a activation complex, and building an ODE-based kinetic model to predict signal amplification, leakage, and theoretical sensitivity before wet-lab testing.
DNA design strategy for this assignment
For this first DNA synthesis design exercise, I chose to build a constitutive sfGFP expression cassette as a workflow control. Although my individual final project is focused on a DNAzyme–Cas12a amplified lead sensor, this Week 7 design is intended to document the full sequence design and cloning workflow in a simple and robust way.
The insert was designed as a linear expression cassette containing:
a constitutive promoter
an RBS
a start codon
the sfGFP coding sequence
a 7xHis tag
a stop codon
a terminator
Insert documentation
Backbone documentation
Backbone vector: pTwist Amp High Copy
Reflection
This exercise helped me connect sequence design, annotation, synthesis planning, and plasmid-level documentation into one workflow. In future iterations, I plan to replace the generic reporter cassette with a project-relevant construct connected to my DNAzyme–Cas12a sensing platform.
References
HTGAA 2026 Genetic Circuits II Lab Protocol.
Vasle, A. H., & Moškon, M. (2024). Synthetic biological neural networks: From current implementations to future perspectives. BioSystems, 237, 105164.
HTGAA Spring 2026 — Week 2: DNA Read, Write, & Edit.
HTGAA 2026: Final Project Selection.
HTGAA 2026: Individual Final Project Documentation.
Week 9 HW: Cell-free Systems
Homework Part A: General and Lecturer-Specific Questions
General homework questions
1. Explain the main advantages of cell-free protein synthesis over traditional in vivo methods, specifically in terms of flexibility and control over experimental variables. Name at least two cases where cell-free expression is more beneficial than cell production.
Cell-free protein synthesis offers major advantages over traditional in vivo expression because the reaction occurs outside living cells, in a simplified and highly controllable environment. Instead of relying on cell growth, viability, and intracellular regulation, the experimenter can directly tune DNA concentration, salts, cofactors, energy source, reaction time, temperature, and inducer concentration. This makes the system highly flexible for rapid prototyping, mechanistic studies, and controlled optimization of genetic constructs. Unlike cell-based production, cell-free systems do not require maintaining living hosts and reduce interference from the host’s own physiology and background protein production. This is one of the reasons they are widely used in synthetic biology, protein engineering, biosensing, and CRISPR-related research.
Cell-free expression is especially more beneficial than cell production in at least two important cases. First, it is very useful for rapid testing of synthetic circuits, because constructs can be evaluated without transformation, colony growth, and cellular induction. Second, it is advantageous for proteins that are toxic or difficult to express in vivo, since production is no longer tied to cell survival. A third strong case is portable biosensing, especially with freeze-dried reactions that can be rehydrated on demand in low-resource settings or even spaceflight contexts.
2. Describe the main components of a cell-free expression system and explain the role of each component.
A cell-free expression system contains the molecular machinery needed for transcription and translation but outside living cells. At the core of the system is either a whole-cell extract or a reconstituted PURE system. The extract or purified system provides ribosomes, translation factors, enzymes, and supporting biochemical machinery required for protein synthesis. In whole-cell extract systems, many metabolic enzymes and auxiliary cellular components are still present, while PURE systems contain only essential purified components.
The reaction also needs a buffering system, such as HEPES, to maintain stable pH and preserve enzyme activity. It requires nucleotides (ATP, GTP, CTP, UTP) for transcription and tRNAs for translation. It also needs amino acids, which are the building blocks of the protein product. Additional cofactors help maintain a productive biochemical environment. These include folinic acid, NAD, coenzyme A, spermidine, sodium oxalate, and salts such as magnesium glutamate and potassium glutamate. Magnesium is especially important because it acts as a cofactor for many enzymes involved in transcription and translation. DTT helps maintain reducing conditions and protects sensitive biomolecules.
The system also requires an energy source and a way to maintain energy availability during the reaction. Common energy substrates include 3-PGA or PEP. Finally, the system needs a template, usually DNA or RNA, that encodes the protein or biosensor of interest. In T7-based systems, T7 RNA polymerase may also be included, and RNase inhibitors can be added to protect transcripts from degradation. Together, these components support transcription, translation, RNA stability, enzymatic activity, and sustained protein production.
3. Why is energy provision and regeneration critical in cell-free systems? Describe a method you could use to ensure continuous ATP supply in your cell-free experiment.
Energy provision and regeneration are critical in cell-free systems because transcription and translation are highly energy-demanding processes. ATP is required directly for biosynthesis, and the reaction also depends on a stable biochemical environment to sustain RNA synthesis, protein synthesis, and associated enzymatic steps over time. Because there are no living cells continuously regenerating metabolites, the reaction can stall quickly if ATP and related energy intermediates are depleted. The lab notes explicitly include 3-PGA or PEP as energy-supporting substrates and explain that they help provide energy and intermediate metabolites for reaction stability.
One practical method to ensure continuous ATP supply is to include an energy regeneration substrate such as phosphoenolpyruvate (PEP) or 3-phosphoglycerate (3-PGA) in the reaction mixture. These compounds help sustain ATP production through the metabolic capability retained in the extract. In practice, I would test at least two energy conditions in parallel, for example PEP versus 3-PGA, and compare final yield and expression kinetics to determine which formulation better supports prolonged protein synthesis.
4. Compare prokaryotic versus eukaryotic cell-free expression systems. Choose a protein to produce in each system and explain why.
Prokaryotic and eukaryotic cell-free systems differ mainly in complexity, speed, post-translational capability, and the types of proteins they are best suited to express. Prokaryotic systems, especially E. coli-based systems, are typically fast, flexible, and relatively inexpensive. They are ideal for synthetic biology, fluorescent reporters, and proteins that do not require complex post-translational modifications. In contrast, eukaryotic systems such as wheat germ or rabbit reticulocyte extracts are better suited for proteins that require a more eukaryotic folding environment or more complex processing. The HTGAA lab notes directly compare PURE and whole-cell extract systems and note that whole-cell extracts can come from organisms including E. coli, wheat germ, and rabbit reticulocytes.
For a prokaryotic cell-free system, I would choose to produce amilGFP or deGFP, because fluorescent proteins are easy to detect, are commonly used as reporters, and generally do not require complex post-translational modifications. They are ideal for fast optimization and proof-of-concept experiments. In fact, the Week 9 lab demonstrates TX-TL functionality using a T7-IPTG-amilGFP plasmid and fluorescence monitoring across IPTG concentrations.
For a eukaryotic cell-free system, I would choose to produce an antibody fragment or a human secreted signaling protein, because these proteins are more likely to benefit from a eukaryotic translation environment, especially if proper folding, disulfide bonding, or more native-like processing is important.
5. How would you design a cell-free experiment to optimize the expression of a membrane protein? Discuss the challenges and how you would address them in your setup.
To optimize expression of a membrane protein in a cell-free system, I would design a small matrix experiment in which I systematically vary temperature, template concentration, reaction time, salt composition, and especially the presence of membrane-mimicking additives such as detergents, liposomes, or nanodiscs. I would begin with a screening-scale setup to identify conditions that maximize soluble or functional product, not just total expression. This kind of tuning is one of the major strengths of cell-free systems, since the reaction chemistry can be adjusted directly without the constraints of cell viability.
The main challenges with membrane proteins are poor solubility, aggregation, misfolding, and inefficient insertion into membrane-like environments. To address these, I would test a panel of membrane mimics in parallel and compare lower and higher expression temperatures, because slower synthesis often improves folding quality. I would also compare at least two DNA concentrations, because overexpression can worsen aggregation.
To evaluate success, I would not rely only on total protein amount. I would also use a functional readout if possible, such as ligand binding, channel activity, or detergent-stable recovery. In other words, the goal would be to optimize for correctly folded, functional protein, not just maximum yield.
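The screening matrix described above can be enumerated programmatically so no condition is missed. The factor levels here are placeholders for illustration, not validated reaction conditions.

```python
# Enumerate a full-factorial screening matrix for the membrane-protein
# optimization experiment. Levels below are illustrative assumptions.
from itertools import product

temperatures = [25, 30, 37]                             # reaction temperature, °C
dna_nM       = [2, 10]                                  # template concentration, nM
mimics       = ["none", "detergent", "liposome", "nanodisc"]  # membrane mimics

conditions = [
    {"temp_C": t, "dna_nM": d, "mimic": m}
    for t, d, m in product(temperatures, dna_nM, mimics)
]

# 3 temperatures x 2 DNA levels x 4 mimics = 24 reactions to screen
n_reactions = len(conditions)
```

Scoring each condition by a functional readout rather than total yield then reduces to ranking this list of dictionaries.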
6. Imagine you observe a low yield of your target protein in a cell-free system. Describe three possible reasons for this and suggest a troubleshooting strategy for each.
One possible reason is poor template quality or incorrect template concentration. If the DNA is degraded, impure, or present at a suboptimal concentration, transcription may be inefficient. A troubleshooting strategy would be to verify DNA quality, confirm concentration accurately, and test a small template titration series.
A second possible reason is suboptimal reaction chemistry, including energy limitation, salt imbalance, or insufficient cofactors. Cell-free systems are highly sensitive to magnesium, potassium, energy substrates, and overall reaction composition. A troubleshooting strategy would be to test several magnesium and energy-support conditions in parallel and compare both kinetics and final yield. The Week 9 lab explicitly emphasizes the importance of salts, nucleotides, cofactors, and energy substrates such as 3-PGA or PEP.
A third possible reason is RNA or protein instability. Transcripts may be degraded by RNases, or the protein itself may misfold, aggregate, or be unstable under the chosen conditions. A troubleshooting strategy would be to include RNase protection, reduce reaction temperature, shorten incubation time, or redesign the construct to improve translation and folding. The lab notes specifically include murine RNase inhibitor as a component used to protect mRNA from degradation.
Homework question from Kate Adamala
Design an example of a useful synthetic minimal cell
Pick a function and describe it.
I would design a lead-sensing synthetic minimal cell for environmental monitoring and remediation.
What would your synthetic cell do? What is the input and what is the output?
The synthetic cell would detect Pb²⁺ ions in a water sample and respond by producing a fluorescent readout together with a lead-binding sequestration protein inside the compartment. Input: Pb²⁺ in the surrounding environment. Output: fluorescence plus intracellular lead-capture activity.
Could this function be realized by cell-free Tx/Tl alone, without encapsulation?
Only partially. A purely open cell-free reaction could detect Pb²⁺ and produce a reporter signal, but it would not behave as a discrete synthetic cell and would have limited control over selective uptake, localization, and containment of the response. Encapsulation adds compartmentalization and makes the design more realistic as a minimal cell.
Could this function be realized by genetically modified natural cell?
Yes, it could be realized in a genetically engineered bacterium. However, using a synthetic minimal cell would reduce concerns related to growth, escape, biocontainment, and environmental release of living engineered organisms.
Describe the desired outcome of your synthetic cell operation.
In the presence of lead, the synthetic minimal cell should generate a clear and measurable fluorescent signal and retain part of the toxic metal within the compartment by expressing a sequestration module.
Design all components that would need to be part of your synthetic cell.
The system would require:
a membrane compartment
an internal TX-TL system
a lead-responsive sensing circuit
a fluorescent reporter
a sequestration module
sufficient salts, cofactors, amino acids, nucleotides, and energy substrate
What would be the membrane made of?
A phospholipid membrane made of POPC + cholesterol, with a small fraction of negatively charged lipid such as DOPG to improve stability and tunability.
What would you encapsulate inside? Enzymes, small molecules.
Inside the vesicle I would encapsulate:
an E. coli-based cell-free TX-TL system
nucleotides
amino acids
magnesium and potassium salts
an energy source such as PEP
a plasmid carrying a lead-responsive regulatory system
a fluorescent reporter gene such as sfGFP
a lead-binding protein gene such as smtA or pbrD
Which organism would your Tx/Tl system come from? Is bacterial OK, or do you need a mammalian system for some reason?
A bacterial system is sufficient here. An E. coli-derived TX-TL system is appropriate because the sensing circuit would be based on bacterial regulatory logic, and no mammalian-specific promoter or modification system is required.
How will your synthetic cell communicate with the environment?
Lead ions are not guaranteed to cross the membrane efficiently, so I would include a metal uptake or permeability strategy, such as a membrane transporter or pore. A candidate gene would be pbrT, a lead uptake transporter. The reporter signal would be measured optically from outside the vesicle.
Experimental details
Lipids:
POPC
cholesterol
DOPG
Genes:
pbrR (lead-responsive transcriptional regulator)
pbrT (lead uptake transporter)
sfGFP (fluorescent reporter)
pbrD or smtA (metal-binding/sequestration protein)
How will you measure the function of your system?
I would measure fluorescence as the primary output and compare signal across a Pb²⁺ concentration gradient. As a secondary assay, I would quantify residual lead in the external solution before and after incubation to assess whether sequestration occurred.
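The expected fluorescence-versus-Pb²⁺ readout could be prototyped with a simple Hill dose-response model before fitting real data. The Kd, Hill coefficient, background, and signal amplitude below are placeholder assumptions, not measured sensor parameters.

```python
# Hypothetical Hill dose-response curve for the lead-sensing minimal cell.
def hill_response(pb_uM, f_max=1000.0, kd_uM=5.0, n=1.5, background=50.0):
    """Predicted fluorescence (a.u.) as a function of Pb2+ concentration.
    All parameters are illustrative placeholders to be fitted to data."""
    return background + f_max * pb_uM**n / (kd_uM**n + pb_uM**n)

gradient = [0, 1, 5, 25, 100]                       # Pb2+ gradient, µM
curve = [round(hill_response(c), 1) for c in gradient]
# At pb = kd the response sits exactly halfway between background and maximum.
```

Fitting this model to the measured gradient would yield the sensor's apparent Kd and dynamic range, while the before/after external lead measurements quantify sequestration separately.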
Homework question from Peter Nguyen
Freeze-dried cell-free systems integrated into materials
Application field
Architecture
One-sentence summary pitch
I propose a freeze-dried cell-free wall patch that becomes fluorescent when exposed to lead-contaminated water from leaking pipes.
How will the idea work, in more detail?
The concept is a replaceable patch integrated into high-risk areas of buildings, such as behind sinks, near pipe junctions, or around old plumbing. The patch would contain a freeze-dried cell-free biosensor embedded in a porous material that activates when it becomes wet. If lead-containing water reaches the patch, the biosensor would produce a visible fluorescent or colorimetric signal that indicates contamination. The patch could be read by eye or with a simple handheld fluorescence viewer. Because the reaction is freeze-dried, storage and deployment would be easy, especially in older buildings, schools, or low-resource settings.
What societal challenge or market need will this address?
This addresses the need for fast, low-cost, decentralized detection of water contamination, especially in aging infrastructure where lead exposure remains a major public health problem. It could be especially valuable in schools, public buildings, rental housing, and remote communities.
How do you envision addressing the limitation of cell-free reactions (e.g., activation with water, stability, one-time use)?
The patch would be packaged in a moisture-protective housing until installation and would be designed as a single-use replaceable sensor. Stability would be improved by lyophilization and sealed storage. Since accidental hydration is the main activation trigger, the patch would only be exposed at the desired monitoring location. One-time use is acceptable here because the material is intended as a cheap diagnostic indicator rather than a reusable electronic sensor.
Homework question from Ally Huang
Mock Genes in Space proposal
Background information (maximum 100 words)
Long-duration space missions depend on safe recycled water and fast biological monitoring, but current detection workflows can be slow, equipment-intensive, or dependent on return-to-Earth analysis. A freeze-dried cell-free biosensor could provide a lightweight, low-maintenance method for detecting microbial contamination on orbit. This is significant for astronaut health, highly relevant for future missions with limited resupply, and scientifically interesting because it combines molecular detection, low-resource biotechnology, and space-compatible synthetic biology.
Molecular or genetic target (maximum 30 words)
A bacterial 16S rRNA-derived sequence amplified from recycled spacecraft water samples.
How your target relates to the space biology question (maximum 100 words)
If bacterial nucleic acids are detected in recycled spacecraft water, that indicates possible contamination or biofilm-related risk within the life-support system. Monitoring a bacterial nucleic acid target is therefore directly relevant to astronaut health and to the reliability of long-duration water recycling infrastructure. A sequence-based target is also practical because it can be amplified and then linked to a cell-free biosensor readout.
Hypothesis or research goal (maximum 150 words)
My hypothesis is that a freeze-dried BioBits® cell-free reaction coupled to a sequence-specific RNA sensing module can provide a simple and space-compatible readout for bacterial contamination in recycled water. I expect that if a bacterial target sequence is first enriched using the miniPCR® thermal cycler, then the amplified product can trigger a cell-free sensor and generate a visible fluorescence output in the P51 Molecular Fluorescence Viewer. The reasoning is that cell-free systems are lightweight, low-maintenance, and compatible with freeze-dried deployment, which makes them attractive for spaceflight where mass, storage, and user complexity are constrained.
Experimental plan (maximum 100 words)
I would test mock water samples containing either bacterial target DNA, non-target DNA, or no DNA. The target region would first be amplified using miniPCR. Amplified material would then be added to a BioBits® reaction containing a sequence-responsive sensing construct and reporter output. Controls would include a positive target control, a negative no-template control, and a non-target sequence control. The main measurements would be fluorescence intensity over time and endpoint signal discrimination between positive and negative samples.
Homework Part B: Individual Final Project
For this week, I focused on defining Aim 1 of my final project.
Final project title
Automated Optimization of a DNAzyme–CRISPR Amplified Lead Sensor
Aim 1
Design and computationally optimize a lead-responsive DNAzyme-to-Cas12a signal transduction architecture before wet-lab screening.
Aim 1 rationale
The first objective is to establish a robust in silico framework for the biosensor before experimental optimization. This includes designing the DNAzyme substrate and release trigger, tuning the coupling between DNAzyme cleavage and Cas12a activation, minimizing unintended secondary structures, and selecting reporter architectures that maximize signal gain while minimizing background. By defining these design constraints early, the wet-lab phase can focus on a smaller and more rational set of candidate constructs.
Initial experimental and design focus
Aim 1 will include:
sequence design and secondary structure analysis
trigger and reporter architecture comparison
specificity considerations for Pb²⁺-dependent activation
initial planning for automated parameter screening in later stages
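As a starting point for the ODE-based kinetic model in Aim 1, the cascade can be reduced to four mass-action steps: Pb²⁺-dependent substrate cleavage, trigger release, Cas12a activation, and reporter turnover. All species amounts and rate constants below are placeholder assumptions for illustration, not fitted parameters.

```python
# Minimal kinetic sketch of the DNAzyme -> Cas12a cascade (placeholder rates).
# S: DNAzyme substrate, T: released trigger, C/Cstar: inactive/active Cas12a,
# F: accumulated fluorescent signal (excess quenched reporter assumed).
def simulate_cascade(k_cleave, k_act=0.5, k_report=1.0, dt=0.001, t_end=10.0):
    S, T, C, Cstar, F = 100.0, 0.0, 10.0, 0.0, 0.0
    for _ in range(int(t_end / dt)):
        v1 = k_cleave * S          # Pb2+-dependent DNAzyme cleavage of substrate
        v2 = k_act * T * C         # trigger binds and activates Cas12a
        S     += -v1 * dt
        T     += (v1 - v2) * dt
        C     += -v2 * dt
        Cstar += v2 * dt
        F     += k_report * Cstar * dt   # collateral reporter cleavage
    return F

signal = simulate_cascade(k_cleave=0.2)     # lead present: fast cleavage
leak   = simulate_cascade(k_cleave=0.001)   # background cleavage only
```

Even this crude model exposes the key design trade-off for Aim 1: leakage (`leak`) grows with background cleavage, so signal gain must be evaluated as a ratio, not an absolute fluorescence value.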
Note
The slide deck submission, final project form, and ordering spreadsheet tasks will be completed through the required external course materials separately.
References
HTGAA 2026 Cell-free Systems Lab.
DNAdots: Cell-free protein synthesis.
Kocalar et al., 2024. Validation of Cell-Free Protein Synthesis Aboard the International Space Station.
In this homework, I analyzed eGFP using LC-MS and MS/MS data to evaluate its intact molecular weight, peptide map, and structural state under native versus denaturing conditions. The goal was to determine whether the measured protein is consistent with the expected eGFP standard, using intact-mass analysis, tryptic peptide mapping, and comparison of native and denatured charge state distributions.
Figure 1. Schematic overview of intact eGFP molecular-weight analysis by LC-MS, highlighting denaturation, charge-state distribution, and the adjacent charge-state method used to estimate protein molecular weight.
Waters Part 1 — Molecular Weight
The eGFP sequence provided in the assignment contains a linker and a C-terminal His tag. Based on the amino acid sequence, the calculated molecular weight is approximately 27,875 Da (about 27.875 kDa).
To estimate the molecular weight experimentally from the intact protein spectrum, I used two adjacent charge states from the BioAccord spectrum:
m/z = 1037.4927
m/z = 1077.3950
Using the adjacent charge-state relationship, these peaks correspond to approximately +27 and +26, respectively.
Using the equation:
MW = z × (m/z − 1.0073)
where 1.0073 Da is the mass of each added proton.
I obtain:
From the +27 charge state: MW = 27 × (1037.4927 − 1.0073) = 27,985.11 Da
From the +26 charge state: MW = 26 × (1077.3950 − 1.0073) = 27,986.08 Da
Figure 2. Illustration of the adjacent charge-state method used to assign neighboring peaks and calculate the experimental molecular weight of intact eGFP.
The average experimental molecular weight is therefore:
27,985.59 Da
or 27.986 kDa
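The adjacent charge-state arithmetic above can be packaged into a short helper. The two peak values are the BioAccord m/z readings used above; the charge assignment follows from requiring that both peaks give the same neutral mass.

```python
PROTON = 1.0073  # proton mass (Da), as used in the assignment's equation

def adjacent_charge_mw(mz_low, mz_high):
    """Assign charges to two adjacent peaks of the same protein
    (the lower-m/z peak carries one more charge) and return the
    charge of the lower-m/z peak plus the averaged molecular weight."""
    # Setting z*(mz_low - p) = (z-1)*(mz_high - p)... solved for the
    # higher-m/z peak's charge gives (mz_low - p) / (mz_high - mz_low).
    z_high_mz = round((mz_low - PROTON) / (mz_high - mz_low))
    z_low_mz = z_high_mz + 1
    mw1 = z_low_mz * (mz_low - PROTON)
    mw2 = z_high_mz * (mz_high - PROTON)
    return z_low_mz, (mw1 + mw2) / 2

z, mw = adjacent_charge_mw(1037.4927, 1077.3950)
# z = 27 for the 1037.49 peak; mw averages to ~27,985.6 Da, as above.
```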
To estimate mass accuracy relative to the theoretical sequence: the experimental average of 27,985.59 Da lies 27,985.59 − 27,875 ≈ 110.6 Da above the calculated mass, a deviation of roughly 0.4%.
Overall, the intact mass is very close to the expected eGFP mass range, although it appears slightly heavier than the theoretical sequence provided in the assignment. This excess may indicate a minor proteoform difference or a sequence/formulation-related mass contribution.
Waters Part 2 — Peptide Map Work (Primary Structure)
The eGFP sequence contains:
20 lysines (K)
6 arginines (R)
Using the PeptideMass workflow described in the assignment with Trypsin, 0 missed cleavages, and filtering peptides above 500 Da, the expected number of tryptic peptides is:
19 peptides
From the LC-MS chromatogram in Figure 3a, I counted the chromatographic peaks between 0.5 and 6.0 minutes and observed:
21 peaks
Therefore, the number of observed chromatographic peaks is slightly higher than the number of predicted tryptic peptides. This suggests that some peaks may correspond to additional peptide species such as modified peptides, partially digested species, adducts, or chromatographic separation of closely related forms.
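The tryptic-peptide prediction can be sketched in code. The sketch below applies the standard trypsin rule (cut after K or R, but not before P) and the 500 Da cutoff to a short toy sequence that embeds the FEGDTLVNR peptide discussed below; it uses average residue masses and is not the full eGFP sequence from the assignment.

```python
import re

def trypsin_digest(seq, min_mass=500.0):
    """In-silico trypsin digest: cut after K or R unless followed by P,
    then keep peptides above a mass cutoff (average residue masses)."""
    avg = {"G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
           "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
           "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
           "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21}
    peptides = re.split(r"(?<=[KR])(?!P)", seq)  # zero-width cut sites
    def mass(p):  # average peptide mass = residue masses + one water
        return sum(avg[a] for a in p) + 18.02
    return [p for p in peptides if mass(p) >= min_mass]

# Toy sequence (hypothetical, not eGFP): K|..., R|..., and one KP site
# that trypsin skips; "MK" falls below the 500 Da filter.
peps = trypsin_digest("MKFEGDTLVNRAKPLE")
```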
For the peptide shown in Figure 3b, the main observed ion is:
m/z = 525.76712
From the isotope spacing, the peak is consistent with a +2 charge state: adjacent isotope peaks are separated by approximately 1/z on the m/z axis, and the observed spacing of about 0.5 m/z indicates a doubly charged peptide.
To calculate the singly charged form [M+H]+:
[M+H]+ = z × (m/z) − (z − 1) × 1.0073
[M+H]+ = 2 × 525.76712 − 1.0073 = 1050.53 Da
So the peptide mass is:
[M+H]+ ≈ 1050.53 Da
Comparing this measured value with the predicted tryptic peptide masses, the best match is:
FEGDTLVNR
Its theoretical [M+H]+ mass is approximately:
1050.52 Da
Therefore, the mass error is very small, on the order of only a few ppm, indicating an excellent match between the observed peptide and the theoretical digest product.
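The charge-reduction arithmetic can be verified with a small helper; the ppm comparison uses the theoretical FEGDTLVNR [M+H]+ of 1050.52 Da quoted above.

```python
PROTON = 1.0073  # proton mass (Da), as used in the assignment

def singly_protonated(mz, z):
    """Convert an [M+zH]z+ ion to its singly protonated [M+H]+ value."""
    return z * mz - (z - 1) * PROTON

mh = singly_protonated(525.76712, 2)            # the observed doubly charged ion
ppm_error = (mh - 1050.52) / 1050.52 * 1e6      # vs. theoretical FEGDTLVNR [M+H]+
# mh lands at ~1050.53 Da, within ~10 ppm of the predicted peptide.
```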
Figure 3. Workflow of tryptic digestion and LC-MS peptide mapping of eGFP, showing cleavage after lysine and arginine residues and the generation of peptide peaks used to confirm primary structure.
Finally, the peptide map coverage shown in Figure 5 indicates that the identified peptides confirm:
88% amino acid sequence coverage
This high sequence coverage strongly supports that the analyzed sample is consistent with the expected eGFP standard.
Waters Part 3 — Secondary/Tertiary Structure
Native and denatured mass spectrometry provide information about protein conformation by revealing how many charges a protein can carry in each condition.
Under denaturing conditions, the protein unfolds because of the organic solvent and acidic environment. When the protein unfolds, more basic sites become exposed to solvent and can be protonated. As a result, the protein acquires more charges, giving a broader charge-state distribution and peaks at lower m/z values.
Under native conditions, the protein remains more compact and folded because the solvent system is milder and better preserves noncovalent interactions. Since fewer protonation sites are exposed, the protein acquires fewer charges, which produces a narrower charge-state distribution and peaks at higher m/z values.
This is exactly what is observed in the eGFP spectra. The native spectrum shows fewer charge states at higher m/z, whereas the denatured spectrum shows more charge states distributed across a wider m/z range.
Figure 4. Example of peptide identification by LC-MS/MS, showing the measured precursor ion, charge-state assignment from isotope spacing, and sequence confirmation from fragmentation analysis.
For the zoomed-in native peak around 2800 m/z in Figure 7, the charge state is approximately:
z = +10
This can be determined from the isotope spacing. In electrospray mass spectrometry, the distance between adjacent isotope peaks is approximately 1/z, so the observed spacing of about 0.1 m/z corresponds to a charge state of z = 10.
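The spacing-to-charge rule, plus a consistency check of where a +10 ion of the measured intact mass should appear, can be written as:

```python
def charge_from_isotope_spacing(spacing_mz):
    """Isotope peaks differ by ~1 Da in mass, so adjacent isotopes of a
    z-charged ion are separated by ~1/z on the m/z axis."""
    return round(1.0 / spacing_mz)

z = charge_from_isotope_spacing(0.1)          # ~0.1 m/z spacing -> z = 10
# Predicted position of a +10 ion of the measured intact mass (27,985.59 Da):
mz_predicted = (27985.59 + z * 1.0073) / z
# ~2,799.6 m/z, consistent with the native peak observed near 2,800.
```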
Overall, the comparison between native and denatured spectra supports the expected behavior of folded versus unfolded eGFP.
Figure 5. Conceptual comparison between native and denatured mass spectrometry of eGFP. Native protein remains compact and exhibits fewer charge states at higher m/z, whereas denatured protein unfolds and displays a broader distribution at lower m/z.
Did I make GFP?

| Measurement | Theoretical | Observed/measured on the BioAccord MS | BONUS! Observed/measured on the G3 Q-ToF MS |
| --- | --- | --- | --- |
| Molecular weight (kDa) | 27.875 | 27.986 | ~27.9 |
| Amino acid sequence coverage (%) | N/A | 88% | N/A |
Yes, the results are consistent with eGFP. The intact molecular weight is in the expected range, the peptide map identifies peptides matching the expected digest, and the sequence coverage reaches 88%, which strongly supports the identity of the protein as the eGFP standard.
Final Project
For my final project, I am developing an automated DNAzyme–Cas12a amplified biosensor for Pb²⁺ detection in water. The goal of the project is to create a modular sensing platform in which a Pb²⁺-responsive DNAzyme cleaves a substrate, releases a nucleic acid trigger, and activates Cas12a collateral cleavage to generate an amplified fluorescent signal.
The main aspects I want to measure in this project are:
Presence or absence of Pb²⁺ in water samples
Fluorescence signal intensity generated after activation of the DNAzyme–Cas12a cascade
ON/OFF signal separation, comparing Pb²⁺-containing samples versus no-target controls
Background leakage, meaning unwanted signal in the absence of Pb²⁺
Sensitivity and limit of detection, especially at low Pb²⁺ concentrations
Selectivity, by comparing Pb²⁺ response against other ions that may interfere
Reaction kinetics, including how quickly the signal appears and how strongly it amplifies over time
Reproducibility across different reaction conditions and replicate experiments
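The ON/OFF separation and reproducibility criteria above are often summarized with the Z'-factor, a standard assay-window statistic. The replicate values below are hypothetical placeholders to illustrate the calculation.

```python
import statistics

def z_prime(on, off):
    """Z' = 1 - 3*(sd_on + sd_off) / |mean_on - mean_off|.
    Values above ~0.5 indicate a well-separated assay window."""
    mu_on, mu_off = statistics.mean(on), statistics.mean(off)
    sd_on, sd_off = statistics.stdev(on), statistics.stdev(off)
    return 1 - 3 * (sd_on + sd_off) / abs(mu_on - mu_off)

# Hypothetical endpoint fluorescence replicates (a.u.)
on_replicates = [9100, 9400, 8900, 9250]   # Pb2+-containing samples
off_replicates = [320, 350, 310, 335]      # no-target controls
zp = z_prime(on_replicates, off_replicates)
# A Z' well above 0.5 would indicate clear ON/OFF separation.
```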
To perform these measurements, I would use a combination of computational design, automated experimental optimization, and fluorescence-based readout.
First, I would use Benchling to annotate and organize all DNA constructs and sensing modules. Then I would use NUPACK to evaluate nucleic acid folding and identify sequence architectures with lower OFF-state leakage and better trigger accessibility. I would also use ODE-based kinetic modeling to simulate the sensing cascade and predict how DNAzyme cleavage, trigger release, Cas12a activation, and reporter cleavage affect the final fluorescence output.
For experimental measurements, I would use an Opentrons OT-2 liquid handler to run multidimensional optimization screens across parameters such as pH, Mg²⁺ concentration, reporter concentration, and DNAzyme/Cas12a stoichiometry. The main readout would be measured using a fluorescence plate reader or a similar fluorescence detection instrument. If needed, complementary validation could also include gel electrophoresis to verify cleavage products or nucleic acid integrity.
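As a sketch of the ODE-based kinetic modeling described above, the cascade can be reduced to four forward-Euler updates: substrate cleavage, trigger release, Cas12a activation, and collateral reporter cleavage. Every rate constant and concentration below is an assumed placeholder, not a fitted or measured value.

```python
def simulate_cascade(pb_present, t_end=120.0, dt=0.1):
    """Forward-Euler sketch of: Pb2+-activated DNAzyme cleaves substrate ->
    trigger release -> Cas12a activation -> collateral reporter cleavage.
    All parameters are illustrative placeholders."""
    k_cleave, k_leak = 0.02, 1e-6   # substrate cleavage with/without Pb2+ (1/s)
    k_act, k_cat = 0.05, 0.01       # Cas12a activation and turnover (assumed)
    substrate, trigger, cas_active, signal = 100.0, 0.0, 0.0, 0.0
    cas_total, reporter = 10.0, 500.0
    for _ in range(int(t_end / dt)):
        v_cleave = (k_cleave if pb_present else k_leak) * substrate
        v_act = k_act * trigger * (cas_total - cas_active)
        v_rep = k_cat * cas_active * reporter
        substrate -= v_cleave * dt
        trigger += (v_cleave - v_act) * dt
        cas_active += v_act * dt
        reporter -= v_rep * dt
        signal += v_rep * dt
    return signal

on_signal = simulate_cascade(True)    # Pb2+ present: amplified output
off_signal = simulate_cascade(False)  # background leakage only
```

Even this minimal model reproduces the qualitative behavior the project cares about: the OFF trajectory is dominated by the leak rate, which is why background leakage appears as an explicit term to measure and minimize.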
Overall, the key technologies in this project are:
DNA construct design
Nucleic acid secondary-structure analysis
Kinetic simulation and modeling
Automated liquid handling
Fluorescence-based biosensing
Potential future portable assay formats for environmental monitoring
This measurement strategy is designed to evaluate whether the sensor is modular, sensitive, selective, and suitable for future translation into a portable lead-detection platform.
Figure 6. Proposed modular biosensor architecture for Pb²⁺ detection, in which a Pb²⁺-responsive DNAzyme releases a nucleic acid trigger that activates Cas12a collateral cleavage and generates an amplified fluorescent readout.
title: ‘Individual Final Project’
weight: 10
description: ‘Automated Optimization of a DNAzyme–Cas12a Amplified Lead Sensor’
Automated Optimization of a DNAzyme–Cas12a Amplified Lead Sensor
Abstract
Lead contamination in drinking water remains a major public health problem because even low-level chronic exposure can impair neurological development, cardiovascular health, and overall long-term wellbeing. Existing analytical methods such as ICP-MS are highly sensitive, but they usually require centralized laboratory infrastructure, trained personnel, and expensive instrumentation, which limits their accessibility for decentralized or field-based monitoring. The overall goal of this project is to develop a modular environmental biosensing platform that couples a Pb²⁺-responsive DNAzyme with CRISPR-Cas12a signal amplification in order to generate a rapid and amplified fluorescent readout. The central hypothesis is that a DNAzyme-triggered release of a programmable nucleic acid activator can be linked to Cas12a collateral cleavage to improve sensitivity while preserving modularity. To test this idea, the project is structured into three aims: first, computational design and kinetic modeling of the sensing cascade; second, automated experimental optimization using robotic liquid handling; and third, long-term translation into a portable and modular environmental sensing format. The methods include nucleic acid folding analysis, structural plausibility assessment, kinetic simulation, DNA construct design, and future automated wet-lab optimization. Together, this project aims to establish a scalable biosensing framework for environmental monitoring that is adaptable, programmable, and ultimately deployable outside centralized laboratories.
Project Aims
Aim 1: Experimental / Short-term Aim
The first aim of my final project is to computationally design and prioritize a modular DNAzyme–Cas12a lead sensor by optimizing nucleic acid architecture, assessing structural plausibility of the Cas12a activation complex, and building an ODE-based kinetic model to predict signal amplification, leakage, and theoretical sensitivity before wet-lab testing.
Aim 2: Development / Medium-term Aim
The second aim of my final project is to experimentally optimize and validate the sensor using automated liquid handling workflows. Following successful in-silico prioritization, this stage will use an Opentrons OT-2 platform to execute multidimensional parameter sweeps across reaction variables such as pH, Mg²⁺ concentration, reporter concentration, and DNAzyme/Cas12a stoichiometry in order to identify conditions that maximize sensitivity and reproducibility in real water samples.
Aim 3: Visionary / Long-term Aim
The third aim of my final project is to develop the sensing platform into a modular and field-deployable environmental monitoring technology. In the long term, the assay could be adapted into decentralized formats such as lyophilized or paper-based systems and extended to detect additional toxic metals by replacing the upstream recognition module while preserving the downstream CRISPR-based amplification architecture.
Recent literature supports the use of DNAzymes as mechanistically well-defined and highly selective recognition elements for lead sensing. Brown et al. described a lead-dependent 8-17 DNAzyme with a two-step catalytic mechanism, providing an important biochemical basis for Pb²⁺-responsive cleavage. Structural work later showed that RNA-cleaving DNAzymes such as 8-17 adopt compact active conformations with a pre-organized Pb²⁺ binding pocket, strengthening the connection between sequence, folding, and catalysis.
Beyond mechanistic studies, DNAzymes have also been successfully translated into sensing platforms. Li et al. reported a single-stranded fluorescent Pb²⁺ DNAzyme sensor with good performance across a broad temperature range, highlighting the practicality of DNAzyme-based environmental sensing. More recently, He et al. developed a DNAzyme-based CRISPR/Cas12a fluorescence sensor for Pb²⁺ detection, directly demonstrating the feasibility of coupling metal-responsive DNAzyme cleavage to CRISPR-mediated signal amplification. Together, these studies support the conceptual basis of my project while also revealing opportunities for improved modularity, automation, and optimization.
Brown, A. K., Li, J., Pavot, C. M.-B., & Lu, Y. (2003). A lead-dependent DNAzyme with a two-step mechanism. Biochemistry, 42(23), 7152–7161.
Liu, H. et al. (2017). Crystal structure of an RNA-cleaving DNAzyme. Nature Communications.
Li, H. et al. (2012). Single-stranded DNAzyme-based Pb²⁺ fluorescent sensor that can work well over a wide temperature range. Biosensors and Bioelectronics, 34(1), 159–164.
He, S. et al. (2025). A DNA concatemer-encoded CRISPR/Cas12a fluorescence sensor for sensitive detection of Pb²⁺ based on DNAzymes. Analyst, 150(9), 1778–1784.
Ethical Implications
This project raises several ethical considerations related to environmental health, public communication, and responsible biosensor development. At its core, the project is motivated by beneficence, because it aims to improve access to lead monitoring tools that could support earlier detection of unsafe water conditions and reduce long-term exposure to a major public health hazard. It also relates to justice, since communities with fewer resources are often the ones most affected by environmental contamination while also having the least access to centralized analytical testing. At the same time, the principle of non-maleficence is especially important here, because an inaccurate sensor could produce false negatives that give users unjustified confidence in contaminated water, or false positives that generate unnecessary alarm. Since the project is based on a modular synthetic biology sensing architecture, it must also be guided by responsibility in how claims are made, how performance is validated, and how limitations are communicated.
To ensure that this project is ethical, several measures should be taken both during research and in any future real-world deployment. First, the sensor should never be presented as a replacement for certified analytical methods unless its performance has been rigorously benchmarked under realistic environmental conditions. Second, all results should be reported transparently, including background leakage, false activation risks, matrix effects, and any uncertainty in the predicted or measured limit of detection. Third, the project should include safe handling and disposal practices for all reagents, especially if future versions use CRISPR components, fluorogenic reporters, lyophilized reaction mixes, or field-deployable formats. A further ethical requirement is to avoid overpromising accessibility if the assay still depends on conditions or materials that are difficult to standardize outside the laboratory. Alternatives such as using the platform only for preliminary screening, while confirming results with certified methods, should remain part of the deployment strategy. In this way, the project can remain aligned with public health goals while minimizing the risk of misuse, misinterpretation, or premature application.
Results & Quantitative Expectations
What aspect of the project did I choose to validate?
For this stage of the project, I chose to validate the design and computational prioritization workflow of the DNAzyme–Cas12a sensing cascade rather than a fully assembled wet-lab assay. This validation focuses on whether the sensing architecture can be rationally designed in a way that minimizes unwanted folding, preserves trigger accessibility, and supports a plausible downstream Cas12a activation logic. I selected this aspect because it is directly achievable within the current scope of the course and because a poor sequence architecture would undermine all later experimental optimization.
Validation protocol
I defined the overall sensing architecture as a modular cascade composed of a Pb²⁺-responsive DNAzyme, a cleavable substrate, a released trigger strand, a Cas12a-crRNA activation module, and a fluorescent reporter output.
I selected literature-supported DNAzyme designs relevant to Pb²⁺ sensing and used them as the mechanistic basis for the upstream recognition module.
I drafted candidate trigger-release strategies in which cleavage of the substrate would expose or release a DNA sequence capable of activating the downstream CRISPR module.
I annotated project-relevant sequence elements and organized the design logic in Benchling.
I evaluated sequence-level folding behavior using NUPACK to identify unwanted secondary structures that could interfere with cleavage, trigger release, or Cas12a activation.
I compared candidate designs by qualitatively prioritizing those with better trigger accessibility and lower predicted risk of OFF-state leakage.
I translated the sensing cascade into a reaction-level kinetic framework suitable for ODE-based simulation.
I defined the major kinetic steps as DNAzyme cleavage, trigger release, Cas12a activation, reporter cleavage, and background leakage.
I used the model structure to define which variables would most strongly affect sensitivity, including cleavage efficiency, trigger concentration, activation kinetics, reporter concentration, and background activity.
I documented a DNA design workflow compatible with future synthesis and screening steps, including Benchling annotation and plasmid-level documentation.
What synthetic biology techniques did I use in this validation?
This validation used several synthetic biology techniques even though it did not yet include a full wet-lab assay. The first was DNA construct design, because the project depends on clearly defined sequence modules and activation logic rather than on a vague conceptual pathway. The second was computational nucleic acid analysis, especially folding-based evaluation, because secondary structure directly affects accessibility and leakage in sequence-programmed sensing systems. The third was model-based analysis, since the reaction cascade must be understood not only structurally but also dynamically. Finally, the validation included Benchling-based design documentation and a Twist-compatible DNA workflow, which are essential for translating the concept into experimentally testable constructs later in the project.
What data will I present?
The main data for this stage of the project will be computational and design-derived data rather than experimental fluorescence measurements. These data may include NUPACK structure predictions, ranked candidate architectures, annotated sequence maps, and simulated kinetic trajectories from the ODE model. Together, these outputs will serve as an evidence-based justification for selecting one or more sensing architectures for future experimental optimization.
Quantitative expectations
At this stage, my quantitative expectations are focused on relative performance trends rather than final environmental performance claims. I expect that useful candidate designs will show lower predicted OFF-state leakage, improved trigger accessibility, and stronger separation between simulated ON and OFF trajectories compared with less optimized alternatives. In the next phase of the project, these computational outputs would be used to prioritize experimental conditions for automated screening and to narrow the design space before wet-lab validation.
Challenges, limitations, and alternative strategies
A major limitation of the current stage is that computational prioritization cannot prove that the full sensing cascade will behave as expected in real reaction conditions. Nucleic acid folding predictions and structural plausibility assessments are helpful, but they do not fully capture the complexity of reaction kinetics, matrix effects, incomplete cleavage, or unintended interactions between components. Another challenge is that the released trigger may be theoretically accessible in silico while still performing poorly in practice due to concentration effects, sequence context, or competing structures.
A second limitation is that the project currently depends on simplified assumptions about Cas12a activation and background behavior. These assumptions are useful for building an initial model, but they may underestimate leakage or overestimate amplification efficiency. To address this, future versions of the project should compare multiple trigger architectures and explicitly include background-cleavage scenarios in the modeling framework.
An additional challenge is that real environmental water samples may contain salts, competing ions, inhibitors, or contaminants that reduce the performance of both the DNAzyme and the CRISPR module. A promising alternative strategy would be to first optimize the system in buffered model solutions and only then move into increasingly complex matrices. Another useful alternative would be to compare several Pb²⁺-responsive DNAzyme configurations rather than relying on a single canonical design from the beginning.
References
Brown, A. K., Li, J., Pavot, C. M.-B., & Lu, Y. (2003). A lead-dependent DNAzyme with a two-step mechanism. Biochemistry, 42(23), 7152–7161.
Liu, H., Yu, X., Chen, Y., et al. (2017). Crystal structure of an RNA-cleaving DNAzyme. Nature Communications, 8, 2006.
Li, H., Zhang, Q., Cai, Y., Kong, D.-M., & Shen, H.-X. (2012). Single-stranded DNAzyme-based Pb²⁺ fluorescent sensor that can work well over a wide temperature range. Biosensors and Bioelectronics, 34(1), 159–164.
He, S., Lin, W., Liu, X., et al. (2025). A DNA concatemer-encoded CRISPR/Cas12a fluorescence sensor for sensitive detection of Pb²⁺ based on DNAzymes. Analyst, 150(9), 1778–1784.
HTGAA 2026 Genetic Circuits II Lab Protocol.
HTGAA Spring 2026 — Week 2: DNA Read, Write, & Edit.
HTGAA 2026: Final Project Selection.
HTGAA 2026: Individual Final Project Documentation.
Supply List and Budget
Core reagents and supplies
Pb²⁺-responsive DNAzyme oligonucleotides
Cleavable substrate oligonucleotides
Trigger strand oligonucleotides
crRNA for Cas12a activation
Cas12a enzyme
Fluorogenic ssDNA reporter
Reaction buffers
MgCl₂ and other salts for optimization
Nuclease-free water
Microcentrifuge tubes
PCR tubes or 96-well plates
Plate seals
Pipette tips
Benchling/Twist-compatible DNA design materials
Optional lyophilization consumables for future deployment studies
Equipment
Micropipettes
Mini centrifuge
Fluorescence plate reader or qPCR-style fluorescence instrument
Thermal block or incubator
Computer for design, simulation, and sequence analysis
Optional Opentrons OT-2 liquid handler for automated optimization
Estimated budget categories
Oligonucleotides: medium
Cas12a enzyme and reporter reagents: medium to high
Buffers and consumables: low to medium
Plate-based fluorescence readout: depends on local instrumentation access
Automation cost: low if institutional OT-2 access is available, high if new acquisition is required
Practical note
The most cost-sensitive components of this project are likely to be the CRISPR reagents, custom oligonucleotide sets, and any repeated optimization screens. Costs can be reduced by beginning with a computationally prioritized shortlist of designs before expanding into multidimensional wet-lab screening.