Week 1 HW: Principles and Practices
Homework # 1
I would like to train models which can generate genomes and reason or provide information about the function of the genome. Tools like this have already been developed, including Evo 2, which can identify functionally important mutations (for example, mutations in BRCA1 in humans). It has also been used to generate an entire bacteriophage genome.
Governance Goal and Sub-goals
The primary goal of governance should be to reduce harm (ensure non-maleficence). An AI model that can understand and generate genomes could potentially be used to design harmful viruses or bacteriophages, including ones that target bacteria in the human microbiome [3][4]. Even if a model is not initially trained for this purpose, a sufficiently capable model could plausibly be adapted through fine-tuning. The overall goal of reducing harm can be broken down into three sub-goals.
First, ensure that researchers who release biological foundation models test for misuse potential and notify relevant oversight bodies prior to wide release. Second, reduce the likelihood that harmful capabilities can be introduced through fine-tuning or downstream adaptation. Third, reduce the ability of bad actors to synthesize or deploy any harmful sequences generated by these models.
Governance Actions
To address the first sub-goal, awareness of potential risk, it is helpful to first understand the policies currently in place. The Evo 2 paper reports experiments evaluating whether the model has predictive or generative capability for viral genomes, but this appears to be an informal academic norm rather than a formal requirement. In recent years, legislation in New York and California has imposed new requirements on frontier language models [5]. However, these laws are restricted to the kind of chatbots created by large AI labs, rather than models trained solely on biological data. Because biological models are trained mostly by research labs without significant revenue and may require less compute than the large frontier models, these laws may fail to mitigate the risks posed by biological models.
I would propose federal legislation establishing oversight for the development and release of biological AI models. This could be housed in an agency such as NIST (through its AI standards work) or HHS (through organizations such as NIH and CDC). The legislation could require a good-faith evaluation of biosecurity risk prior to release, and mandatory notification to the government before widespread release if meaningful risk is identified. It would require collaboration between the federal government, university researchers, and private corporations, and Congress would need to appropriate additional funding to conduct oversight. There are two major shortcomings of this approach. First, it neglects international cooperation: models may be trained and distributed across borders, and a risky release in one country can create harm globally. Second, it places substantial responsibility on researchers to identify risks correctly before notifying an oversight body. A remaining open question is what actions researchers and governments should take when a model turns out to have harmful capabilities. More discussion and ethical guidelines are needed to determine what should be done in this scenario.
To address the second sub-goal, preventing biological models from being adapted to cause harm, we can take inspiration from the "know your customer" (KYC) rules employed by banks. Because training and fine-tuning large models requires significant compute, we could require that cloud providers and AI chip vendors implement KYC policies: verifying customer identity and intended use, and flagging suspicious activity. Through government collaboration we could maintain a blacklist of individuals and organizations who should not be allowed to purchase or rent significant amounts of compute. To do this properly, we would need to determine the threshold at which providers are required to verify customer identity and intended use. In reality, a simple threshold is likely to be insufficient: bad actors are practiced at evading restrictions ranging from sanctions to illegal arms sales, and access to compute would likely be no different. Even a simple threshold would require collaboration with the major chip providers (NVIDIA, AMD, and Google) as well as any cloud provider renting large amounts of compute to single users. They would need to integrate systems to verify who their customers are and cross-reference them against the blacklist. We would also need international cooperation to determine who belongs on this blacklist and to pursue those who smuggle chips into uncooperative countries.
There has already been some work to prevent bad actors from synthesizing potentially harmful DNA sequences. The International Gene Synthesis Consortium is a group of gene synthesis companies that have come together to develop a common protocol to screen DNA synthesis orders [1]. HHS has released guidelines for companies that process gene synthesis orders and companies that provide instruments for gene synthesis [2]. This level of collaboration between governments and industry is heartening to see. Without knowing more about how the screening procedures work, it is hard to say how this approach might fail. However, if the capabilities of biological AI models improve dramatically, the sequences they generate may not always be caught by existing screening systems. Perhaps these models will themselves become an important part of screening in the future.
Scoring Actions
| Does the option: | Action 1: Federal Oversight & Model Notification | Action 2: Compute KYC & Access Controls | Action 3: DNA Synthesis Screening |
| --- | --- | --- | --- |
| **Enhance Biosecurity** |  |  |  |
| By preventing misuse | 2 | 2 | 1 |
| By enabling detection or response | 1 | 2 | 1 |
| **Foster Lab Safety** |  |  |  |
| By preventing accidents | 2 | n/a | 1 |
| By improving incident response | 1 | n/a | 2 |
| **Protect the Environment** |  |  |  |
| By preventing environmental release | 2 | 3 | 1 |
| By enabling containment or remediation | 2 | 3 | 2 |
| **Other considerations** |  |  |  |
| Minimizing costs and burdens to stakeholders | 2 | 3 | 2 |
| Feasibility and enforceability | 2 | 3 | 1 |
| Does not impede research | 2 | 3 | 1 |
| Promote constructive applications | 1 | 2 | 1 |
Priorities
Among these goals, I would prioritize the first: creating awareness, standards, and oversight around biosecurity. There is still much we don't know about biological AI models, so collaboration between government and researchers is important for avoiding unintended harms. As these tools improve, they may lower the barrier to designing biological weapons or other harmful biological agents. Furthermore, a trained model can be compressed into a few gigabytes and distributed globally; once a model is released it is extremely difficult, if not impossible, to recall.
“Could you please find any spelling or grammatical errors in this homework I am writing”
“Can you remind me of the main requirements in SB53?”
“How are DNA synthesis companies regulated?”
Week 2: Lecture Prep
Professor Jacobson
1. Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?
The error rate of polymerase is about 1 error for every $10^6$ bases. The human genome contains approximately 3.2 billion base pairs [1], so we expect on average about 3,200 errors in every cell division. At first glance this seems extraordinarily high. However, nature deals with it through mismatch repair: the MutS system recognizes and corrects most of these errors.
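The arithmetic behind that estimate is a quick sanity check; this is a minimal sketch using the two figures above:

```python
# Expected replication errors per cell division, using the figures above:
# a polymerase error rate of ~1e-6 per base and a ~3.2e9 bp genome.
ERROR_RATE = 1e-6      # errors per base copied
GENOME_SIZE = 3.2e9    # base pairs in the human genome

expected_errors = ERROR_RATE * GENOME_SIZE
print(round(expected_errors))  # 3200 errors per division, before mismatch repair
```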
2. How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?
The average human protein-coding sequence is 1036 bp [2], or roughly 345 amino acids, and each amino acid is encoded on average by about 3 possible codons. So for an average protein there are on the order of $3^{345}$ distinct DNA sequences that could encode it. These are of course theoretical estimates; in practice, DNA sequences that theoretically code for the same protein don't behave exactly the same. They will have different GC content and therefore different levels of stability, which can affect the structure [3].
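The degeneracy estimate can be made exact for any given peptide. The sketch below uses the standard genetic code's synonymous-codon counts; the example peptide and the ~345-codon figure are just illustrative:

```python
import math

# Number of synonymous codons for each amino acid in the standard
# genetic code (stop codons excluded). These counts sum to 61.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2,
    "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4,
    "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}

def synonymous_sequences(peptide: str) -> int:
    """Exact number of distinct DNA coding sequences for a peptide."""
    count = 1
    for aa in peptide:
        count *= DEGENERACY[aa]
    return count

# Average degeneracy across the 20 amino acids is 61/20, which is where
# the "about 3 codings per amino acid" rule of thumb comes from.
avg = sum(DEGENERACY.values()) / len(DEGENERACY)

# Order of magnitude for a ~345-amino-acid protein (1036 bp / 3 per codon):
digits = 345 * math.log10(3)

print(synonymous_sequences("MW"))  # 1: Met and Trp each have a single codon
print(round(avg, 2))               # 3.05
print(round(digits))               # 3^345 has roughly 165 digits
```

For a real protein you would multiply the actual per-residue degeneracies rather than use the average, which is what `synonymous_sequences` does.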
Dr. LeProust
1. What’s the most commonly used method for oligo synthesis currently?
Phosphoramidite chemistry
2. Why is it difficult to make oligos longer than 200nt via direct synthesis?
Looking at the DNA synthesis cycle shown in the slides, we add one nucleotide per cycle. If each coupling step succeeds with some probability less than one, the fraction of strands that are full length and error-free after 200 cycles is that probability raised to the 200th power, so the yield of correct product decays exponentially with length.
3. Why can’t you make a 2000bp gene via direct oligo synthesis?
Following the same argument, for a 2000 bp sequence the errors compound over many more cycles, and the fraction of molecules synthesized without any error becomes vanishingly small. Therefore, the gene we create will almost certainly contain enough errors to be non-functional.
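Both answers follow from the same exponential-yield calculation. A small sketch, assuming a fixed 99% per-cycle coupling efficiency (a commonly cited figure for phosphoramidite chemistry; the exact value varies by process):

```python
def full_length_yield(p: float, n: int) -> float:
    """Fraction of strands with no coupling failures after n cycles,
    assuming each cycle independently succeeds with probability p."""
    return p ** n

# With an assumed 99% per-cycle efficiency:
print(f"{full_length_yield(0.99, 200):.3f}")   # 0.134 -> ~13% of 200-mers are full length
print(f"{full_length_yield(0.99, 2000):.2e}")  # 1.86e-09 -> essentially no perfect 2000-mers
```

This is why long genes are assembled from shorter oligos rather than synthesized directly.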
Professor Church
What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?
Histidine (His)
Isoleucine (Ile)
Leucine (Leu)
Lysine (Lys)
Methionine (Met)
Phenylalanine (Phe)
Threonine (Thr)
Tryptophan (Trp)
Valine (Val)
Arginine (Arg)
I am unsure if we are actually discussing the “Lysine contingency” from Jurassic Park, but I am going to assume we are. I don’t know enough about dinosaurs to say whether they could produce lysine, but that’s not particularly important. Lysine is an essential amino acid, meaning modern animals cannot produce it on their own, and yet modern animals are living. So there must exist food sources from which both carnivores and herbivores can obtain lysine, and a dinosaur would also be able to consume these sources (though perhaps they would require more?). Anyway, I am doubtful of this logic. Or perhaps I am a fan, because as a result of this logic I have control over all lions, tigers, and bears, oh my!