HW1 – Lecture 2: Homework Questions

Questions from Professor Jacobson

1. Polymerase error rate, genome length, and error correction

DNA polymerases copy DNA with a very low intrinsic error rate thanks to built-in proofreading. High-fidelity DNA polymerases typically make ~1 error per 10^6 base pairs incorporated during synthesis.

This error rate appears incompatible with the size of the human genome, which is approximately 3.2 x 10^9 base pairs long. If replication relied solely on polymerase accuracy, each cell division would introduce thousands of mutations (3.2 x 10^9 bp x 10^-6 errors/bp ≈ 3,200), which would be unsustainable.

Biology resolves this discrepancy through layered error-correction mechanisms:

  • Proofreading activity (3’ to 5’ exonuclease) within DNA polymerases removes misincorporated nucleotides during replication.
  • Post-replication mismatch repair systems detect and correct remaining base-pairing errors.
  • Biological redundancy and selection, where many mutations are neutral or eliminated over time.

Together, these mechanisms reduce the effective mutation rate to approximately 1 error per 10^9 to 10^10 base pairs per replication, allowing large genomes to be copied with high fidelity despite the physical limits of molecular machinery.
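The mismatch between polymerase fidelity and genome size can be made concrete with a quick calculation (a sketch using the illustrative rates quoted in this answer, not measured constants):

```python
# Expected mutations per human cell division at different fidelity levels.
# Rates are the illustrative figures from this answer.
GENOME_BP = 3.2e9

for label, error_rate in [
    ("polymerase alone (~1e-6)", 1e-6),
    ("plus mismatch repair (~1e-9)", 1e-9),
    ("upper-bound fidelity (~1e-10)", 1e-10),
]:
    expected_errors = GENOME_BP * error_rate
    print(f"{label}: ~{expected_errors:,.1f} errors per replication")
```

At 10^-6 this yields ~3,200 errors per division; layered repair brings it down to a handful or fewer.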

2. Coding capacity for an average human protein and practical constraints

An average human protein is encoded by roughly 1,000 base pairs, corresponding to a protein of ~300 to 350 amino acids. Because most amino acids are encoded by multiple synonymous codons, there are an astronomically large number of possible DNA sequences that could, in theory, encode the same protein.

For a typical human protein of ~300 amino acids, most residues have two to six synonymous codons (roughly three on average), so the number of valid nucleotide sequences is on the order of 3^300 ≈ 10^140 or more, reflecting the degeneracy of the genetic code.
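An order-of-magnitude check follows directly from the standard codon table (a minimal sketch; the uniform-composition 300-residue test protein is hypothetical, and real amino-acid frequencies, which favor highly degenerate residues like Leu and Ser, push the exponent higher):

```python
import math

# Synonymous codons per amino acid in the standard genetic code
# (stop codons excluded).
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def synonymous_sequences(protein: str) -> int:
    """Exact count of DNA coding sequences for a protein (ignoring the stop codon)."""
    return math.prod(DEGENERACY[aa] for aa in protein)

# A hypothetical 300-residue protein containing each amino acid 15 times:
n = synonymous_sequences("ARNDCQEGHILKMFPSTWYV" * 15)
print(f"~10^{len(str(n)) - 1} possible coding sequences")  # ~10^127
```

Even this deliberately conservative composition gives well over 10^100 sequences, illustrating why the combinatorics are "astronomical" regardless of the exact exponent.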

In practice, however, only a small fraction of these sequences function effectively. Key limiting factors include:

  • Codon usage bias: Host organisms preferentially use certain codons; rare codons can dramatically reduce translation efficiency.
  • mRNA secondary structure: GC content and sequence composition can form stable structures that interfere with ribosome loading, translation elongation, or, less commonly, transcription.
  • Embedded regulatory elements: Coding sequences may unintentionally contain splice sites, transcription terminators, or RNA processing signals.
  • Protein folding dynamics: Translation speed influences co-translational folding; synonymous changes can alter protein structure or activity.
  • DNA synthesis and assembly constraints: Repeats, extreme GC content, and secondary structure increase synthesis errors and reduce yield.

As a result, while the genetic code is formally degenerate, biological, regulatory, and manufacturing constraints drastically limit which DNA sequences are viable for expressing a functional protein in a given system.


Questions from Dr. LeProust

3. What’s the most commonly used method for oligo synthesis currently?

The most commonly used method for oligonucleotide synthesis is solid-phase phosphoramidite chemistry. In this approach, nucleotides are added sequentially to a growing DNA chain that is anchored to a solid support, typically controlled pore glass or polystyrene beads.

Each synthesis cycle consists of four main steps: deprotection (detritylation, removal of the 5’ DMT group), coupling, capping, and oxidation. This chemistry is highly optimized, automated, and scalable, making it the industry standard for producing oligos ranging from short primers to longer synthetic DNA fragments.
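The cycle can be sketched as a toy model (step names only; the chemistry is heavily simplified and `synthesis_steps` is a hypothetical helper, not a real instrument API):

```python
# One phosphoramidite cycle per added nucleotide.
CYCLE = ["deprotection (detritylation)", "coupling", "capping", "oxidation"]

def synthesis_steps(sequence: str) -> list[str]:
    """Ordered chemical steps to build `sequence` (given 5'->3').

    Solid-phase synthesis proceeds 3'->5'; the 3'-terminal base is
    pre-attached to the support, so cycles run for the remaining bases.
    """
    steps = []
    for base in reversed(sequence[:-1]):
        for step in CYCLE:
            steps.append(f"{step} -> add {base}")
    return steps

print(len(synthesis_steps("ATGCATGCATGCATGCATGC")))  # 19 cycles x 4 steps = 76
```

The point of the sketch is the bookkeeping: an n-mer requires n−1 full four-step cycles, which is why per-step efficiency matters so much (see question 4).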


4. Why is it difficult to make oligos longer than 200 nt via direct synthesis?

The main limitation arises from incomplete coupling at each synthesis step and the cumulative errors that result. Even with very high per-step efficiencies (e.g. 99.5–99.9%), losses compound geometrically as oligo length increases: the fraction of full-length product scales roughly as p^(n-1) for per-step coupling efficiency p and length n.

As a result:

  • The fraction of full-length, error-free molecules drops rapidly beyond ~150–200 nt
  • Truncations and point errors become dominant
  • Purification becomes increasingly inefficient and expensive

Additionally, longer oligos are more prone to secondary structure formation during synthesis, further reducing yield and fidelity.
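The geometric decay of full-length yield can be sketched directly (illustrative efficiencies; `full_length_yield` is a hypothetical helper):

```python
# Fraction of full-length product vs. oligo length, assuming a fixed
# per-step coupling efficiency.
def full_length_yield(length_nt: int, coupling_eff: float) -> float:
    # An n-mer requires n - 1 coupling steps.
    return coupling_eff ** (length_nt - 1)

for eff in (0.995, 0.999):
    for n in (50, 100, 200):
        print(f"eff={eff}, {n} nt -> {full_length_yield(n, eff):.1%} full-length")
```

At 99.5% coupling, a 200-mer comes out well under 40% full-length before any sequence errors are even counted, which is why ~200 nt is a practical ceiling.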


5. Why can’t you make a 2000 bp gene via direct oligo synthesis?

Direct synthesis of a 2000 bp gene is impractical because the compounded error rate would make the proportion of correct full-length molecules vanishingly small. Even at near-ideal coupling efficiencies, the probability of producing an error-free 2000 bp sequence via stepwise synthesis approaches zero.

Instead, long genes are produced by:

  • Synthesizing shorter oligos (typically 60–200 nt)
  • Assembling them hierarchically using enzymatic methods (e.g. PCR-based assembly, Gibson assembly)
  • Applying error correction and selection steps at intermediate stages

This modular approach reflects a core principle of synthetic biology: complex sequences are built from reliable, smaller components rather than synthesized monolithically.
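A quick calculation contrasts the two routes (illustrative parameters; the 60-nt oligos with 20-nt overlaps are an assumed tiling scheme, not a prescribed protocol):

```python
import math

COUPLING_EFF = 0.995  # assumed per-step efficiency

# Direct route: full-length fraction of a 2000-mer.
direct_yield = COUPLING_EFF ** (2000 - 1)
print(f"direct 2000-mer full-length yield: {direct_yield:.2e}")

# Hierarchical route: overlapping 60-nt oligos with 20-nt overlaps each
# contribute ~40 nt of unique sequence.
OLIGO_LEN, OVERLAP = 60, 20
n_oligos = math.ceil((2000 - OLIGO_LEN) / (OLIGO_LEN - OVERLAP)) + 1
per_oligo_yield = COUPLING_EFF ** (OLIGO_LEN - 1)
print(f"oligos needed: {n_oligos}, each ~{per_oligo_yield:.1%} full-length")
```

Direct synthesis leaves a full-length fraction on the order of 10^-5, whereas each short oligo is mostly full-length and can be error-filtered before assembly.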


Question from George Church

6. [Option 1.] What are the 10 essential amino acids in all animals, and how does this affect the “Lysine Contingency”?

I chose option (1) from the list of questions proposed by Prof. Church.

The ten amino acids generally considered essential for all animals are:
Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Threonine, Tryptophan, Valine, and Arginine (with arginine being conditionally essential in some species and life stages).

These amino acids cannot be synthesized de novo by animals and therefore must be obtained through diet or via microbial symbiosis. Among them, lysine occupies a particularly important position, as it is often limiting in plant-based staple crops such as cereals.

The term “Lysine Contingency” originates from Jurassic Park, where engineered organisms were made dependent on dietary lysine as a failsafe; the irony is that all animals are already lysine auxotrophs, since lysine is one of the essential amino acids listed above. In the broader food-system sense, the concept highlights a real systemic vulnerability: animals (including humans) depend on external biological sources for lysine, while many dominant agricultural crops are intrinsically lysine-poor. This creates nutritional, economic, and geopolitical dependencies that propagate from molecular biology up to global food security.

From a synthetic biology perspective, lysine can be seen as a strategic metabolic bottleneck rather than just a nutrient. This framing supports interventions such as:

  • Engineering crops with improved lysine biosynthesis or reduced lysine degradation
  • Microbial production of lysine for feed and food supplementation
  • Rethinking agrifood systems to reduce reliance on a small set of lysine-poor staples

Overall, the Lysine Contingency illustrates how constraints at the amino-acid level can scale into societal and planetary challenges, making it a compelling lens for applying synthetic biology to agrifood and climate-relevant problems.

Sources and AI assistance

This response was informed by Lecture 2 course materials, established knowledge in nutritional biochemistry, and AI-assisted reasoning used to structure and synthesize widely accepted concepts. AI tools were used to support clarity and integration of ideas; no proprietary or unpublished sources were used.