Week 1 HW: Principles and Practices
First, describe a biological engineering application or tool you want to develop and why. This could be inspired by an idea for your HTGAA class project, something you are already doing in your research, or something you are just curious about.
I want to develop a biological engineering tool that combines biology, artificial intelligence, and space science to understand how life survives in extreme environments, and to use that knowledge both to protect life on Earth and to prepare for life beyond it.
My long-term vision is to build an AI system that can analyze extremophile microorganisms and predict how they adapt to harsh conditions like radiation, high salinity, drought, and low nutrients, conditions similar to those on Mars and in other extraterrestrial environments.
This tool would integrate:
• Genetic data
• Morphological traits
• Environmental stress factors
• Machine learning models
to identify survival patterns and adaptive mechanisms.
The purpose is twofold:
• Earth protection: By understanding how extremophiles survive, we can discover new biological compounds and mechanisms that help with bioremediation, climate resilience, sustainable agriculture, and medicine.
• Space exploration: The same data can guide astrobiology research, helping scientists predict whether life could survive on Mars, Europa, Titan, or exoplanets, and how humans might one day live there safely.
This idea is inspired by my ongoing research in astrobiology, molecular biology, and environmental biotechnology, and by my passion for NASA’s mission of exploring the unknown. My goal is not just to study life, but to use science and AI to protect it on Earth and beyond.
Next, describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future, like ensuring non-malfeasance (preventing harm). Break big goals down into two or more specific sub-goals. Below is one example framework (developed in the context of synthetic genomics) you can choose to use or adapt, or you can develop your own. The example was developed to consider policy goals of ensuring safety and security, alongside other goals, like promoting constructive uses, but you could propose other goals for example, those relating to equity or autonomy.
This tool combines biological data, artificial intelligence, and space research, so it must be governed by clear ethical principles to ensure it is used to protect life, not to exploit or harm it. The main governance goal is to ensure non-malfeasance, safety, equity, and responsible scientific use, while preventing misuse in harmful or unethical ways.
Ensure safety, security, and non-malfeasance This tool must never be used to design harmful organisms, support bioweapons, or damage ecosystems.
Access control & user authentication Only verified researchers, educators, and institutions should be allowed to upload genetic or environmental data. Public users can view educational outputs but cannot manipulate sensitive biological models.
Misuse detection & content filtering AI models should automatically block outputs that could suggest harmful genetic modifications, dangerous pathogens, or unethical biological experiments.
Ethical review integration Any high-risk project using the system must require approval from an institutional ethics committee (IRB or biosafety board) before analysis is allowed.
Promote constructive and peaceful use The tool must support science for life, sustainability, and space exploration, not military or exploitative purposes.
Application restrictions The system will prohibit use for military bioengineering, weaponization, or ecological manipulation.
Transparent purpose declarations All users must declare the purpose of their research, which is logged and reviewed to ensure alignment with peaceful, scientific, and humanitarian goals.
Protect biodiversity, local communities, and indigenous environments
Benefit sharing policies Any discoveries from local microbial data must credit and benefit the region of origin through shared publications, funding, or conservation programs.
Environmental consent protocols Sample collection and data use must follow national biodiversity laws and local permissions.
Equity, transparency, and responsible AI
Open access educational version Students and researchers from developing countries should have free access to learning and visualization features.
Explainable AI models Predictions should be transparent and interpretable, so users understand why the system gives certain results.
Long term planetary responsibility
Planetary protection All space research outputs must follow COSPAR and planetary protection standards to avoid contaminating other worlds.
Dual use Regular audits must be performed to ensure the tool is not being repurposed for harmful dual use research.
Next, describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”). Try to outline a mix of actions (e.g. a new requirement/rule, incentive, or technical strategy) pursued by different “actors” (e.g. academic researchers, companies, federal regulators, law enforcement, etc). Draw upon your existing knowledge and a little additional digging, and feel free to use analogies to other domains (e.g. 3D printing, drones, financial systems, etc.).
Purpose: What is done now and what changes are you proposing?
Design: What is needed to make it “work”? (including the actor(s) involved - who must opt-in, fund, approve, or implement, etc)
Assumptions: What could you have wrong (incorrect assumptions, uncertainties)?
Risks of Failure & “Success”: How might this fail, including any unintended consequences of the “success” of your proposed actions?
Action 1: Mandatory ethical access licensing system
Purpose Most bioinformatics tools are open access or lightly regulated, meaning sensitive genetic data and AI models can be used by anyone, including for harmful purposes.
Therefore, I propose creating a tiered licensing system for users, similar to drone pilot licensing or controlled chemical access, where only approved users can access high-risk features.
Design The key actors are universities, biotech companies, space agencies, and government regulators. Users must complete bioethics and biosafety training. Institutions verify users and issue digital licenses. The platform enforces role-based permissions and logs all activity. The system is funded through research grants and institutional subscriptions.
Assumptions Institutions will cooperate and share responsibility. Users will not try to bypass the system. Ethical training meaningfully changes behavior.
Risks of failure and “success” Failure: Black-market versions of the tool could emerge. Success risk: Over-regulation could slow down innovation or exclude researchers from low-income regions.
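The role-based permission check described in the Design above can be sketched in a few lines. This is a hypothetical illustration: the tier names, feature sets, and `License` record are my own assumptions, not part of any existing platform.

```python
# Hypothetical sketch of a tiered licensing check. Tier names, feature sets,
# and the License record are illustrative assumptions only.
from dataclasses import dataclass

# Features unlocked at each license tier (lowest to highest).
TIER_FEATURES = {
    "public": {"view_educational_outputs"},
    "verified_researcher": {"view_educational_outputs", "upload_data",
                            "run_models"},
    "high_risk_approved": {"view_educational_outputs", "upload_data",
                           "run_models", "edit_sensitive_models"},
}

@dataclass
class License:
    user_id: str
    tier: str
    training_complete: bool  # completed bioethics & biosafety training

def is_allowed(lic: License, feature: str) -> bool:
    """Grant a feature only if training is done and the tier includes it."""
    if not lic.training_complete:
        return False
    return feature in TIER_FEATURES.get(lic.tier, set())

# A verified researcher may upload data but not edit sensitive models.
researcher = License("user42", "verified_researcher", training_complete=True)
print(is_allowed(researcher, "upload_data"))            # True
print(is_allowed(researcher, "edit_sensitive_models"))  # False
```

In practice the platform would also log every call to such a check, which is what makes the third-party audits proposed later possible.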
Action 2: Automated dual-use risk monitoring
Purpose Dual-use risks are often discovered only after harm occurs.
Therefore, I propose embedding an AI monitoring system, similar to financial fraud detection or airport security screening, that flags suspicious biological queries or outputs.
Design The key actors are platform developers, cybersecurity teams, and independent ethics boards. The AI scans queries for risky patterns. High-risk activity triggers human review. Regular third-party audits check the system itself.
Assumptions AI can accurately distinguish legitimate from harmful research. Researchers accept monitoring as a safety feature.
Risks of failure & “success” Failure: False positives could block real research. Success risk: Surveillance concerns or misuse of logs by authorities.
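The flag-then-human-review idea can be illustrated with a toy filter. The pattern list below is a placeholder assumption; a real system would rely on vetted screening databases and trained classifiers, not keyword matching.

```python
# Toy illustration of flagging risky queries for human review. The pattern
# list is a placeholder assumption, not a real screening database.
import re

RISK_PATTERNS = [
    r"enhance\s+virulence",
    r"toxin\s+synthesis",
    r"evade\s+immune\s+detection",
]

def flag_for_review(query: str) -> bool:
    """Return True if any risk pattern matches the query (case-insensitive)."""
    return any(re.search(p, query, re.IGNORECASE) for p in RISK_PATTERNS)

print(flag_for_review("Predict salt-tolerance genes in halophiles"))  # False
print(flag_for_review("How to ENHANCE virulence of a pathogen"))      # True
```

Even this toy version shows the false-positive risk noted above: legitimate biosafety research would match the same patterns, which is why flagged queries go to human review rather than being blocked outright.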
Action 3: Planetary protection & biodiversity compliance gateway
Purpose Planetary protection and biodiversity laws exist but are not built into digital research tools.
Therefore, I propose to integrate legal and ethical compliance checks into the platform, similar to export control systems or medical ethics approvals.
Design The key actors are environmental agencies, space agencies (e.g., NASA), international bodies such as COSPAR, universities, and governments. Users must declare the sample origin, purpose, and destination of data. The system blocks projects that violate conservation or planetary protection rules. This requires global policy alignment.
Assumptions Countries will share standards. Researchers will truthfully declare their intentions.
Risks of failure and success Failure - Users may falsify data. Success risk - Strict rules may discourage open science.
Next, score (from 1-3 with, 1 as the best, or n/a) each of your governance actions against your rubric of policy goals. The following is one framework but feel free to make your own:
| Does the option: | Option 1 | Option 2 | Option 3 |
|---|---|---|---|
| Enhance biosecurity | | | |
| • By preventing incidents | 1 | 1 | 2 |
| • By helping respond | 2 | 1 | 2 |
| Foster lab safety | | | |
| • By preventing incidents | 1 | 2 | 2 |
| • By helping respond | 2 | 2 | 1 |
| Protect the environment | | | |
| • By preventing incidents | 2 | 2 | 1 |
| • By helping respond | 2 | 2 | 1 |
| Other considerations | | | |
| • Minimizing costs and burdens to stakeholders | 3 | 2 | 2 |
| • Feasibility | 2 | 2 | 3 |
| • Not impeding research | 2 | 2 | 3 |
| • Promoting constructive applications | 1 | 1 | 2 |
Option 1: Ethical access licensing
Option 2: AI monitoring
Option 3: Compliance gateway
Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties. For this, you can choose one or more relevant audiences for your recommendation, which could range from the very local (e.g. to MIT leadership or Cambridge Mayoral Office) to the national (e.g. to President Biden or the head of a Federal Agency) to the international (e.g. to the United Nations Office of the Secretary-General, or the leadership of a multinational firm or industry consortia). These could also be one of the “actor” groups in your matrix.
Audience: International research agencies, space agencies, and the global science community.
Recommended strategy: a hybrid governance model. Based on the scoring matrix, I would prioritize a combination of Option 1 (Ethical access licensing) and Option 2 (AI monitoring), with Option 3 (Compliance gateway) implemented gradually as an international standard.
This hybrid model provides the strongest balance between biosecurity, lab safety, environmental protection, and scientific freedom.
Why this combination?
Ethical access licensing This option scored best for preventing biosecurity and lab safety incidents. It ensures that only trained, verified users can access powerful features.
Prevention is more ethical and cost-effective than response. This system creates a culture of responsibility before harm can occur.
AI monitoring This option scored highest for helping respond to threats and promoting constructive use. It acts like a biosafety firewall that adapts as new risks emerge.
Even well-trained users can make mistakes. This system provides continuous protection without requiring constant human oversight.
Compliance gateway Although it scored lower for feasibility, it is essential for planetary protection and biodiversity ethics. It should be phased in through international agreements.
It requires legal alignment across nations and may slow innovation if enforced too early.
Homework Questions from Professor Jacobson
1. Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?
Error Rate of DNA Polymerase
In biological DNA replication, the primary enzymatic machinery is an error-correcting DNA polymerase, which operates via template-dependent 5’-3’ primer extension, supplemented by 5’-3’ error-correcting exonuclease and 3’-5’ proofreading exonuclease activities. This system achieves an error rate of approximately 1:10^6 (one error per 10^6 nucleotides incorporated). This fidelity is attained at a throughput of roughly 10 milliseconds per base addition, in stark contrast to chemical synthesis methods, which exhibit an error rate of 1:10^2 and lack inherent correction mechanisms.
Comparison to the Human Genome Length
The human genome is quantified in the slides as approximately 3.2 gigabase pairs (Gbp), equivalent to 3.2 × 10^9 base pairs. Applying the polymerase error rate of 1:10^6, a single replication cycle would theoretically introduce circa 3.2 × 10^3 errors (3.2 × 10^9 / 10^6). This disparity highlights a critical vulnerability: uncorrected errors at this scale could precipitate deleterious mutations, oncogenic transformations, or cellular inviability. The slides underscore biology’s adaptive advantage through a throughput-error rate product differential of ~10^8 relative to chemical approaches, facilitating the replication of extensive genomes with minimal disruption.
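The arithmetic above can be checked with a short back-of-the-envelope calculation, using the figures quoted in this answer:

```python
# Back-of-the-envelope check of the error counts quoted above.
genome_bp = 3.2e9        # human genome, ~3.2 Gbp
raw_rate = 1e-6          # polymerase alone: ~1 error per 10^6 bases
corrected_rate = 1e-9    # with proofreading and mismatch repair: ~1 per 10^9

errors_raw = genome_bp * raw_rate              # ~3,200 errors per replication
errors_corrected = genome_bp * corrected_rate  # ~3 errors per replication
print(errors_raw, errors_corrected)
```

The three-orders-of-magnitude gap between the two results is exactly what the proofreading and repair machinery described below must close.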
Biological Mechanisms for Error Mitigation
To reconcile this discrepancy, biological systems deploy multifaceted error-correction strategies, reducing the effective error rate to ~1:10^9 or lower in vivo.
Mechanisms include:
Intra-synthetic Proofreading The 3’-5’ proofreading exonuclease excises mismatched nucleotides concurrently with polymerization.
Post-incorporation Repair The 5’-3’ exonuclease activity enables excision and resynthesis of erroneous segments.
Ancillary Repair Pathways Mismatch repair systems, such as the MutS complex (Lamers et al., 2000, as cited in the error correction section), perform post-replicative surveillance and rectification.
These processes render biological synthesis inherently “error-correcting,” in opposition to the open-loop paradigms of chemical methods (e.g., phosphoramidite cycles). Consequently, organisms can faithfully replicate large genomes, such as the human 3.2 Gbp, sustaining cellular and evolutionary viability where raw error rates would otherwise prove untenable.
2. How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?
Number of Synonymous DNA Encodings for an Average Human Protein
The genetic code, characterized in the lecture slides as “Life’s Operating System,” consists of 64 codons specifying 20 amino acids and 3 termination signals, with degeneracy enabling multiple nucleotide triplets to encode identical amino acids. The slides indicate that an average human protein comprises 1,036 base pairs (bp), corresponding to approximately 345 codons (1,036 / 3 ≈ 345, excluding the stop codon).
The cardinality of synonymous DNA sequences for such a protein is contingent upon the amino acid composition, with individual residues encoded by 1–6 codons (e.g., 1 for methionine, 6 for leucine). Employing an average degeneracy of ~3 codons per amino acid (reflective of the code’s overall distribution), the theoretical number of encodings approximates 3^345, or on the order of 10^164 variants (log10(3^345) ≈ 164).
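This estimate can be made concrete with a short calculation. The per-amino-acid codon counts below are the standard genetic code's degeneracy; the 345-residue "average protein" and ~3-fold average degeneracy follow the estimate above.

```python
# Counting synonymous DNA encodings via the standard code's degeneracy.
import math

# Number of codons per amino acid in the standard genetic code.
DEGENERACY = {"M": 1, "W": 1, "C": 2, "D": 2, "E": 2, "F": 2, "H": 2,
              "K": 2, "N": 2, "Q": 2, "Y": 2, "I": 3, "A": 4, "G": 4,
              "P": 4, "T": 4, "V": 4, "L": 6, "R": 6, "S": 6}

def num_encodings_log10(protein: str) -> float:
    """log10 of the number of distinct DNA sequences encoding this protein."""
    return sum(math.log10(DEGENERACY[aa]) for aa in protein)

# Toy peptide Met-Lys-Leu-Val: 1 * 2 * 6 * 4 = 48 possible encodings.
print(round(10 ** num_encodings_log10("MKLV")))  # 48

# Average-protein estimate: 345 residues at ~3 codons each.
print(345 * math.log10(3))  # ~164.6, i.e. on the order of 10^164 encodings
```

Working in log space is deliberate: the actual count overflows any native numeric type long before 345 residues.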
Practical Limitations Precluding Functionality of Many Encodings
- Organismal preferences for synonymous codons can attenuate translation efficiency, inducing ribosomal stalling or suboptimal tRNA utilization. Recoding discussions in the slides (e.g., for phage resistance) imply that non-preferred codons may engender expression failure in heterologous hosts.
- Sequences predisposed to deleterious folding can impede ribosomal progression or compromise mRNA integrity. Illustrative cases from the slides depict minimum free energy (MFE) configurations at 25°C across GC contents of 10%, 50%, and 90%.
- Low GC content (e.g., 10%) yields labile structures prone to degradation.
- Elevated GC content (e.g., 90%) fosters hyperstable hairpins or loops, obstructing translation initiation.
- These phenomena are governed by base-pairing free energies (A/T ≈ -1.2 kcal/mol; G/C ≈ -2.0 kcal/mol), with GC-rich motifs exacerbating folding propensity and hindering mRNA processing.
- Motifs susceptible to endonucleases, such as RNase III in Escherichia coli, precipitate premature mRNA degradation.
- Cryptic elements (e.g., promoters, terminators, or splice junctions) can disrupt transcriptional or post-transcriptional regulation.
- Gene assembly challenges, particularly for repetitive or GC-biased sequences, amplify inaccuracies in chemical or enzymatic synthesis.
- Factors such as tRNA abundance or cellular milieu can preclude functional proteogenesis, necessitating optimization for applications like pharmaceutical or biofuel production.
Homework Questions from Dr. LeProust
1. What’s the most commonly used method for oligo synthesis currently?
The most commonly used method for oligonucleotide (oligo) synthesis currently is solid-phase phosphoramidite chemistry. This approach, developed by Marvin Caruthers in 1981, involves a cyclic process of deblocking, coupling with a phosphoramidite nucleoside, capping unreacted sites, and oxidation, repeated for each nucleotide addition. It is performed on a solid support, such as controlled pore glass (CPG), enabling automated synthesis and high efficiency for short to medium-length oligos.
2. Why is it difficult to make oligos longer than 200nt via direct synthesis?
Synthesizing oligos longer than 200 nucleotides (nt) via direct chemical synthesis is challenging primarily due to the imperfect coupling efficiency in each cycle, typically around 98-99%. As the chain length increases, the yield of full-length product decreases exponentially according to the formula: yield ≈ (efficiency)^(n-1), where n is the number of nucleotides. For example, at 99% efficiency, the yield for a 200 nt oligo is approximately 0.99^199 ≈ 0.135 (13.5%), and for longer sequences it drops significantly, leading to low yields, increased truncated products, and higher error rates from side reactions or depurination. Additionally, purification becomes more difficult, and secondary structures in long sequences can hinder synthesis.
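The exponential decay of full-length yield can be reproduced in a few lines, assuming the 99% per-cycle coupling efficiency used above:

```python
# Full-length yield of direct chemical synthesis: yield ≈ efficiency^(n-1),
# where n is the oligo length and each coupling cycle succeeds ~99% of the time.
def full_length_yield(n: int, efficiency: float = 0.99) -> float:
    """Fraction of chains that are full length after n-1 coupling cycles."""
    return efficiency ** (n - 1)

print(full_length_yield(200))   # ~0.135, i.e. ~13.5% for a 200 nt oligo
print(full_length_yield(2000))  # ~2e-9, effectively zero for a 2000 nt strand
```

The same one-liner explains the next question: at 2000 nt the full-length fraction is measured in parts per billion, which is why long genes are assembled enzymatically from short oligos instead.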
3. Why can’t you make a 2000bp gene via direct oligo synthesis?
Direct oligo synthesis cannot produce a 2000 base pair (bp) gene because the length far exceeds the practical limits of chemical synthesis methods like phosphoramidite chemistry, where yields become negligible (e.g., 0.99^1999 ≈ 10^-8.7, essentially zero). Genes of this size are double-stranded and require error-free sequences, but direct synthesis accumulates errors and impurities exponentially. Instead, long genes are assembled from shorter oligos (typically 40-300 nt) using enzymatic methods like PCR-based assembly or ligation, as illustrated in classical gene synthesis protocols, to achieve high fidelity and yield.
Homework Question from George Church
Choose ONE of the following three questions to answer; and please cite AI prompts or paper citations used, if any.
What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?
What code would you suggest for AA:AA interactions?
Given the one-paragraph abstracts for these real 2026 grant programs, sketch a response to one of them or devise one of your own:
The 10 essential amino acids required by all animals (those that cannot be synthesized endogenously in sufficient quantities and must be obtained through the diet) are: arginine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, and valine. This list is consistent across various species, including mammals like dogs, pigs, and horses, though some animals (e.g., cats) have an additional requirement for taurine. Slide #4 from Prof. Church’s lecture illustrates the standard genetic code mapping RNA codons to these amino acids (plus others), highlighting the ribosomal translation process that relies on this code to incorporate them into proteins.
The “Lysine Contingency” refers to a fictional genetic failsafe in Jurassic Park, where cloned dinosaurs were engineered to lack the ability to synthesize lysine, forcing dependence on park-supplied supplements to prevent survival if they escaped. Knowing that lysine is one of the 10 essential amino acids reinforces why this approach is inherently flawed: in nature, animals routinely obtain essential amino acids (including lysine) from dietary sources like plants (e.g., beans, soy) or other animals, without needing to synthesize them. Escaped dinosaurs could simply forage or hunt for lysine-rich foods, rendering the contingency ineffective, as depicted in the story where they thrive on Isla Nublar. This highlights the limitations of single-amino-acid dependency as a safety measure. More robust biocontainment, like full genome recoding to alter multiple codons (as in Syn61Δ3 bacteria), could create true barriers incompatible with wild-type biology, unlike the dietary workaround possible with lysine.