Week 1 HW: Principles and Practices
Biological Engineering Application
I am developing a computational design pipeline to miniaturize enzymes, specifically to find the shortest amino acid sequence that retains a functional fold. This builds on the work that I do as a part of my PhD, specifically the protein design pipeline also known as BAGEL (see published paper and github repo).
For this course, I want to experimentally test the pipeline on VioA, a tryptophan oxidase in the violacein pathway. Violacein is purple, so the assay gives a simple visual readout, which makes it well suited to validating my computational approach. Down the line, the long-term target is Cas9.
At over 1,300 residues, SpCas9 is a large protein that is difficult to package into AAV vectors, which makes it a real bottleneck for practical use in gene therapy. If I validate my approach with VioA, I will be more confident pursuing mini-Cas9 variants in a therapeutics context.
Governance Goals
Goal 1: Biosafety and dual-use risk. The pipeline I am developing is general-purpose — it can miniaturize any enzyme given a defined active site. This is useful, but it also means someone could use it to miniaturize harmful proteins (e.g., toxins) or make dangerous enzymes easier to deliver. One sub-goal is to figure out whether to implement some form of screening on the input enzyme sequence, similar to how DNA synthesis companies screen orders against pathogen sequences.
Goal 2: Equitable access to gene therapy. Currently, SpCas9 does not fit in a single AAV vector, forcing dual-vector or split-intein approaches that double manufacturing complexity and cost. A miniaturized Cas9 that packages into one AAV would simplify production, which matters especially for rare diseases and lower-resource healthcare settings where the economics of gene therapy are hardest to justify. To support this, I plan to keep both the pipeline and the generated sequences open-source, so that academic groups and smaller biotechs can use and build on the work without licensing barriers. However, I recognise this is in direct tension with the goal above — open-sourcing helps access but also maximises dual-use risk — and navigating that tradeoff is itself a policy challenge.
Governance Actions
Action 1: Build sequence screening into the pipeline itself
Purpose: Right now, my protein design pipeline takes any enzyme as input and miniaturizes it — there is no check on what is being miniaturized. I propose adding an automated screening step that flags inputs matching known toxins or select agent-associated enzymes, similar to how DNA synthesis companies screen orders against pathogen databases before synthesising them.
Design: This would sit at the pipeline level, so I would implement it as a pre-processing check in my computational pipeline. It would need a curated database of flagged sequences (something like the select agents list, but for protein sequences), and would need buy-in from the open-source community maintaining the tool. The analogy here is to 3D printers that detect currency patterns and refuse to print — the check is baked into the tool itself.
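As a rough illustration of where such a pre-processing check could sit, here is a minimal sketch. The flagged-sequence database and the `toy_toxin` entry are hypothetical, and shared k-mer fraction is used only as a crude stand-in for proper alignment-based screening (a real implementation would use something like BLAST or HMMER against a curated database):

```python
def kmers(seq: str, k: int = 8) -> set:
    """Return the set of length-k substrings of an amino acid sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def screen_input(seq: str, flagged_db: dict, k: int = 8, threshold: float = 0.5):
    """Flag the input if it shares too many k-mers with any flagged entry."""
    query = kmers(seq, k)
    hits = []
    for name, flagged_seq in flagged_db.items():
        ref = kmers(flagged_seq, k)
        overlap = len(query & ref) / max(len(ref), 1)
        if overlap >= threshold:
            hits.append((name, overlap))
    return hits  # an empty list means the input passes the screen

# toy example with a made-up "flagged" sequence
db = {"toy_toxin": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}
print(screen_input("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSR", db))
# → [('toy_toxin', 1.0)]
```

The point of the sketch is simply that the check runs before any design work happens, so a flagged input is rejected at the door rather than after compute has been spent.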
Assumptions: I am assuming a database of “dangerous” enzymes can be meaningfully defined, which is harder than it sounds. Many enzymes are dual-use by nature — a protease is a protease whether you use it in detergent or something harmful. The screen might also be trivially bypassed by someone who forks the code and removes the check.
Risks: If it works too aggressively, it blocks legitimate research (e.g., someone miniaturizing a toxin for vaccine development). If it is too easy to circumvent, it is security theatre that adds friction for honest users without stopping bad actors.
Action 2: Tiered open-source release for generated sequences
Purpose: In our work-in-progress paper, we open-source all generated mini-variant sequences. This is great for reproducibility and access, but for more sensitive targets (e.g., a future mini-Cas9 or enzymes with clear dual-use potential), a fully open release might not be appropriate. I propose a tiered model: pipeline code is fully open, but pre-computed sequences for sensitive targets are gated behind a lightweight access agreement.
Design: This follows the precedent set by AI model releases — Meta releases Llama weights under a responsible-use licence, and users agree to terms before downloading. For us, this could mean hosting sensitive sequences on a platform like Zenodo with access requests that require institutional affiliation and a brief statement of intended use. The academic community would need to adopt this as a norm, and funding bodies like the NIH or EPSRC could incentivise it through data-sharing policy requirements.
Assumptions: I am assuming that gating access actually deters misuse, which is debatable — determined bad actors can likely get sequences elsewhere or just run the pipeline themselves. I am also assuming the community would adopt this voluntarily, which requires consensus on what counts as “sensitive.”
Risks: If it fails, it just slows down legitimate researchers without stopping anyone malicious. If it succeeds too well, it fragments the field — labs with access publish faster, labs without access cannot reproduce results, and it undermines the open-science ethos that makes this work valuable in the first place.
Action 3: Include computational miniaturization in institutional biosafety review
Purpose: Institutional biosafety committees (IBCs) currently review work involving pathogens, select agents, and gain-of-function research, but there is no trigger for "we computationally miniaturized an enzyme to make it easier to deliver." I propose that miniaturization of certain enzyme classes — particularly nucleases, toxins, or anything with a clear delivery-advantage implication — should trigger enhanced IBC review, analogous to how gain-of-function research on influenza now triggers additional oversight.
Design: This would require federal regulators (e.g., NIH in the US, EPSRC/HSE in the UK) to update biosafety guidelines to include a category for computationally redesigned proteins with altered delivery properties. IBCs would need to be trained on what miniaturization means and why it matters. Funding agencies could enforce this by requiring a miniaturization risk assessment as part of grant applications, similar to how dual-use research of concern (DURC) assessments are currently required.
Assumptions: I am assuming regulators can keep up with the pace of computational tools, which historically they have not. I am also assuming IBCs have the expertise to evaluate these risks, which is a stretch — most IBCs are set up to assess wet-lab biosafety, not computational protein engineering outputs.
Risks: If it fails, it is because the rules exist on paper but nobody enforces them, or IBCs rubber-stamp approvals because they do not understand the technology. If it succeeds, it could slow down legitimate research with bureaucratic overhead, particularly in academic settings where IBC review is already a bottleneck. There is also a jurisdiction problem — if the UK regulates this and the US does not, research just moves.
Scoring
| Does the option: | Option 1: Sequence Screening | Option 2: Tiered Release | Option 3: IBC Review |
| --- | --- | --- | --- |
| **Enhance biosecurity** | | | |
| • By preventing incidents | 2 | 3 | 1 |
| • By helping respond | 3 | 2 | 1 |
| **Foster lab safety** | | | |
| • By preventing incidents | n/a | n/a | 1 |
| • By helping respond | n/a | n/a | 1 |
| **Protect the environment** | | | |
| • By preventing incidents | n/a | n/a | n/a |
| • By helping respond | n/a | n/a | n/a |
| **Other considerations** | | | |
| • Minimizing costs and burdens to stakeholders | 1 | 2 | 3 |
| • Feasibility | 1 | 2 | 3 |
| • Not impede research | 1 | 2 | 3 |
| • Promote constructive applications | 2 | 1 | 3 |
Recommendation
I would recommend prioritising Options 1 and 2 in combination, while deferring Option 3 to a later stage.
Option 1 can be implemented immediately at negligible cost. While it is not robust against deliberate circumvention, it establishes a normative expectation within the user community and introduces a meaningful friction point — analogous to the screening protocols adopted by DNA synthesis providers, which are widely acknowledged as imperfect but nonetheless valuable as a baseline safeguard. Option 2 provides a natural complement: the pipeline code remains fully open to preserve reproducibility, while pre-computed sequences for high-risk targets are gated behind a lightweight access mechanism. This recommendation is directed at funding bodies such as EPSRC and NIH, who could integrate tiered release requirements into existing data-sharing policies without the need for new legislation.
Option 3, while important in principle, is premature. Institutional biosafety committees currently lack the expertise to evaluate computational protein engineering outputs, and updating federal biosafety frameworks is a multi-year process. This option becomes more pressing once miniaturized protein variants begin entering preclinical development.
The central tradeoff in this recommendation is accepting weaker biosecurity guarantees in order to avoid impeding research at a stage where the technology remains early and the associated risks are largely theoretical. It should also be noted that pipeline-level controls address only one entry point; comparable outcomes could be achieved using alternative computational tools. Community-wide norms are therefore necessary in the longer term, though considerably more difficult to coordinate.
Reflections
A recurring ethical concern this week was equitable access — making sure the tools we develop in synthetic biology are provided in such a way that even low-income countries can benefit from them. This is not a new idea, but it became more concrete to me during the class discussion.
Hearing from a participant from Kenya on the Zoom call made it more salient to me how important public discourse is for introducing ideas from synthetic biology and making them acceptable to the wider community. We as scientists also have a duty to engage in public policy and outreach, so that the wider public can appreciate these technologies and support them. In the end, without public support, whatever we do in the lab will have limited impact, as none of these technologies will come to fruition.
We often lose sight of this in the lab, or behind the computer, believing that policymakers will take care of making the right decisions. But it is up to us, and only us. One governance action that could help here is creating public policy programmes and fellowships that bring technical people into government — similar to what the UK is currently doing with ARIA and the AI Safety Institute. If scientists with hands-on experience in synthetic biology were embedded in regulatory bodies, the resulting policy would be far better informed, and the gap between what we build in the lab and what reaches the public would narrow considerably.
Week 2 Lecture Prep
Homework Questions from Professor Jacobson
Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?
Polymerase has a raw error rate of roughly 1 in 10^6 base pairs (bp). The human genome is ~3.2 Gbp (giga = 10^9), so replicating the entire genome at that rate would produce ~3,200 errors. The polymerase is, however, error-correcting: its 3’→5’ exonuclease activity (proofreading) detects and removes misincorporated bases, reducing the error rate to roughly 1 in 10^7. Post-replication, mismatch repair enzymes further scan and correct errors, bringing the effective rate down to approximately 1 in 10^9–10^10, or roughly 0.3–3 mutations per genome per cell division.
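The arithmetic can be checked directly, using the approximate per-bp error rates quoted above at each stage of correction:

```python
genome_bp = 3.2e9  # human genome size, base pairs

# approximate per-bp error rates at each stage of error correction
stages = {
    "raw polymerase":       1e-6,
    "with proofreading":    1e-7,
    "with mismatch repair": 1e-9,  # effective rate; can reach ~1e-10
}

for label, rate in stages.items():
    print(f"{label}: ~{genome_bp * rate:,.0f} errors per genome replication")
```

This reproduces the ~3,200 → ~320 → ~3 progression, showing how each layer of correction buys roughly an order of magnitude.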
How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?
The average human protein is ~450 amino acids. With ~3 synonymous codons per amino acid on average, there are roughly 3^450 ≈ 10^215 possible DNA sequences encoding the same protein.
In practice, most do not work well because of codon usage bias (organisms prefer certain codons matched to their tRNA pools), mRNA secondary structures that stall translation, extreme GC content affecting stability, and accidental introduction of cryptic regulatory signals like splice sites or polyadenylation sequences.
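A quick back-of-the-envelope check of the combinatorics (450 residues, ~3 synonymous codons each, counted in log space since the number itself overflows any native type):

```python
import math

avg_len = 450    # average human protein length, residues
syn_codons = 3   # rough average number of synonymous codons per amino acid

# log10 of the number of distinct DNA sequences encoding one protein
log10_codes = avg_len * math.log10(syn_codons)
print(f"~10^{log10_codes:.0f} distinct coding sequences")  # → ~10^215
```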
Homework Questions from Dr. LeProust
What is the most commonly used method for oligo synthesis currently?
Phosphoramidite chemistry, developed by Caruthers and colleagues in 1981. It is a four-step cycle repeated N times on a solid support (originally CPG, now also silicon chips at Twist): deblocking (detritylation), coupling with a phosphoramidite monomer, capping unreacted sites, and oxidation. This has been the industry standard since ABI built the first automated synthesizer around it in 1983.
Why is it difficult to make oligos longer than 200nt via direct synthesis?
Because coupling efficiency is less than 100%. Even at 99% per-step efficiency, the fraction of full-length product decays exponentially: 0.99^200 ≈ 13%. Beyond ~200nt, most of what you harvest is truncated or deletion-containing sequences, not full-length oligo. Twist has recently pushed this to 700-mers with enhanced chemistry (97% full-length material).
Why can you not make a 2000bp gene via direct oligo synthesis?
The same coupling-efficiency problem makes it practically impossible. At 2,000 cycles, even 99.5% per-step efficiency gives 0.995^2000 ≈ 0.005% full-length yield, essentially zero. Instead, genes are assembled from short overlapping oligos (~40-mers) via overlap-extension PCR, as described by Stemmer (1995). You synthesize many short oligos that tile the gene with overlapping ends, anneal them, and extend with PCR to build the full-length product.
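The exponential yield decay behind both of the answers above is a one-liner to verify:

```python
def full_length_fraction(step_efficiency: float, n_cycles: int) -> float:
    """Fraction of chains still full length after n coupling cycles."""
    return step_efficiency ** n_cycles

print(f"{full_length_fraction(0.99, 200):.1%}")    # 200 nt at 99.0% per step → ~13.4%
print(f"{full_length_fraction(0.995, 2000):.4%}")  # 2,000 nt at 99.5% per step → ~0.0044%
```

Even pushing per-step efficiency from 99.0% to 99.5% cannot rescue a 2,000-cycle synthesis, which is why assembly from short oligos is the only practical route to gene-length products.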
Homework Question from George Church
What are the 10 essential amino acids in all animals, and how does this affect your view of the “Lysine Contingency”?
The 10 essential amino acids are histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, valine, and arginine — animals cannot synthesise these and must obtain them from their diet. The “Lysine Contingency” in Jurassic Park (Crichton, 1990) was a biocontainment strategy where the dinosaurs were engineered to be unable to synthesise lysine. This is fundamentally flawed: lysine is already essential in all animals and is abundant in normal food sources, so the dinosaurs would simply obtain it from their diet, rendering the containment useless.