Week 2 Pre-Lecture: Homework

Homework Questions from Professor Jacobson

Nature’s machinery for copying DNA is DNA polymerase. According to the lecture slides, an error-correcting polymerase has an error rate of approximately 1 error per 10⁶ bases added.

The human genome is about 3.2 × 10⁹ base pairs long. Comparing these numbers, if replication relied only on polymerase accuracy, we would expect on the order of thousands of errors during replication of a single human genome. This highlights a discrepancy between the intrinsic error rate of polymerase and the need to faithfully copy very large genomes.

Biology resolves this by incorporating multiple layers of error correction. DNA polymerases include proofreading activity that detects and removes mismatched nucleotides during synthesis, and additional repair pathways (such as mismatch repair systems shown in the lecture) further correct errors after replication. Together, these mechanisms allow cells to maintain high fidelity despite the large size of the genome.

The lecture states that an average human protein corresponds to about 1036 base pairs. Since codons consist of three nucleotides, this corresponds to roughly a few hundred amino acids. The genetic code is degenerate, meaning that multiple codons can encode the same amino acid. Because there are 64 possible codons but only 20 amino acids, many different DNA sequences can theoretically encode the same protein sequence. The number of possible coding sequences therefore grows exponentially with protein length, so an average human protein can be encoded by a very large number of distinct DNA sequences.

In practice, not all synonymous sequences work equally well. The lecture shows that nucleotide composition (such as GC content) and sequence-dependent secondary structures affect molecular behavior. Different synonymous sequences can produce different RNA folding patterns or energetics, which can influence transcription, translation efficiency, and stability. As a result, biological and physical constraints limit which DNA sequences successfully produce the desired protein, even if they encode the same amino acid sequence.

Homework Questions from Dr. LeProust

The most commonly used method is solid-phase phosphoramidite chemical synthesis. In this approach, nucleotides are added sequentially to a growing DNA chain attached to a solid support. Each cycle consists of coupling a phosphoramidite nucleotide, capping unreacted sites, oxidation, and deprotection, and this cycle is repeated until the desired length is reached.

Direct oligo synthesis proceeds one base at a time, and each chemical addition step is not perfectly efficient. Because the synthesis is iterative, small inefficiencies compound with every cycle. As the sequence length increases:

The fraction of full-length molecules decreases.
Products accumulate.
Overall yield/purity drop significantly.

This makes it increasingly difficult to obtain high-quality long oligos directly.

A 2000 bp gene would require thousands of sequential chemical coupling steps. Since each step has less than 100% efficiency, the probability of producing a perfect full-length molecule becomes extremely low. Errors and truncations would dominate the product mixture.

Instead, long genes are typically made by synthesizing shorter oligos (example around 100–200 nt) and then assembling them enzymatically into longer fragments or full genes. This avoids the exponential loss in yield and accuracy associated with very long direct chemical synthesis.

Homework Question from George Church

Unlike NA:NA base pairing or the NA to AA genetic code, AA:AA interactions are not defined by a strict one-to-one symbolic mapping. Instead, an AA:AA code would be based on physico chemical compatibility between amino acid side chains. Key rules would include charge complementarity (positive interacting with negative residues), hydrogen-bond donor/acceptor matching, hydrophobic residues packing together, and steric shape complementarity for efficient packing. This is similar to lecture notes framing that different biological codes reflect interaction constraints: DNA basepairs emphasize specific pairing rules, while protein interactions emerge from chemical properties and geometry rather than fixed symbolic pairs.