Week 2 HW: Lecture Prep

Homework Questions from Professor Jacobson

Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?

Polymerase is an enzyme whose function is to produce a chain of nucleic acids by copying another chain that serves as a template. It is basically the enzyme that copies DNA during replication. The error made by DNA polymerase consists of incorporating an incorrect nucleotide, that is, placing a base that is not complementary to that of the template (for example, placing A where C should go).

The error rate of DNA polymerase is approximately 1 error per 10⁵ nucleotides incorporated when acting alone. However, thanks to its proofreading activity and cellular DNA repair mechanisms, the effective error rate during replication is reduced to approximately 1 error per 10⁹–10¹⁰ nucleotides.

The error rate of DNA polymerase, even with correction mechanisms, must be compared to the large length of the human genome, which is approximately 3 × 10⁹ base pairs. If the polymerase lacked correction systems, an uncorrected rate of about 1 error per 10⁵ nucleotides would produce on the order of 30,000 errors per genome replication. However, thanks to proofreading and DNA repair mechanisms, the final number of errors is reduced to roughly one mutation per genome per cell division.

Biology addresses the discrepancy between the error rate of DNA polymerase and the large length of the human genome through various control and correction mechanisms. These include the high specificity of the polymerase for the correct nucleotides, its error correction (proofreading) activity, post-replication error repair systems such as mismatch repair, and the elimination of severely damaged cells through apoptosis. Together, these mechanisms allow a very low mutation rate to be maintained despite the large size of the genome.


How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

An average human protein, with a length of approximately 400 amino acids, can theoretically be encoded by an extremely large number of different DNA sequences. This is due to the degeneracy of the genetic code, meaning that multiple codons can specify the same amino acid. On average, each amino acid is encoded by about three different codons, which implies that a protein of this size could be encoded by on the order of 3⁴⁰⁰ distinct nucleotide sequences, an astronomically large number.
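This order-of-magnitude estimate can be checked with a short calculation, assuming (as above) an average degeneracy of about 3 codons per amino acid:

```python
import math

# Rough count of DNA sequences that could encode one average protein.
protein_length = 400   # average human protein, amino acids
codons_per_aa = 3      # approximate average codon degeneracy

# 3**400 is too large to read directly, so report its order of magnitude.
log10_sequences = protein_length * math.log10(codons_per_aa)
print(f"~10^{log10_sequences:.0f} possible coding sequences")
```

The result, roughly 10¹⁹¹, dwarfs the number of atoms in the observable universe, which is why "astronomically large" is no exaggeration.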

However, in practice, not all of these sequences are functional for encoding the protein of interest. This is because biological systems impose several constraints on gene expression. Factors such as codon usage bias and tRNA availability influence the efficiency and accuracy of translation. In addition, mRNA secondary structures can interfere with translation initiation or elongation. Certain nucleotide sequences may also unintentionally introduce regulatory signals, such as alternative splicing sites or premature polyadenylation signals, which can disrupt proper mRNA processing. Furthermore, GC content, mRNA stability, and the presence of repetitive or toxic sequences further limit the set of sequences that are biologically viable.

Taken together, although many theoretical nucleotide sequences can encode the same human protein, only a small fraction are effective in a real cellular context.
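Two of the practical filters described above, GC content and avoidance of unwanted internal motifs, can be expressed as a simple screen. The thresholds and the banned motif (an EcoRI-like site, GAATTC, standing in for any problematic signal) are illustrative assumptions, not values from the text:

```python
# Minimal sketch of sequence-viability filters (illustrative thresholds).
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def is_viable(seq: str, gc_min=0.40, gc_max=0.60,
              banned=("GAATTC",)) -> bool:
    """Reject sequences with extreme GC content or a banned motif."""
    gc_ok = gc_min <= gc_content(seq) <= gc_max
    motif_ok = not any(m in seq.upper() for m in banned)
    return gc_ok and motif_ok

print(is_viable("ATGGCCGTACTGAAC"))   # balanced GC, no banned motif
print(is_viable("GAATTCGAATTCGG"))    # rejected: contains GAATTC
```

Real codon-optimization tools apply many more filters (splice sites, repeats, mRNA structure), but each works on the same principle: pruning the 10¹⁹¹-sized theoretical space down to a usable subset.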

Homework Questions from Dr. LeProust

What’s the most commonly used method for oligo synthesis currently?

The most commonly used method for oligonucleotide synthesis today is solid-phase chemical synthesis using phosphoramidite chemistry.

In this method, the oligonucleotide is constructed base by base on a solid support. Each synthesis cycle includes deprotection (detritylation) of the terminal nucleotide, coupling of an activated nucleotide (phosphoramidite), capping to block chains that failed to couple, and oxidation (or sulfurization) of the newly formed linkage. This process allows for highly automated, rapid, and efficient synthesis.
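The cycle can be summarized programmatically; the 99% coupling efficiency below is an illustrative figure in the range typically quoted for phosphoramidite chemistry, not a value from the text:

```python
# The four steps of one phosphoramidite synthesis cycle, plus the
# compounding effect of per-cycle coupling efficiency (illustrative 99%).
CYCLE_STEPS = [
    "detritylation (deprotect the terminal 5'-OH)",
    "coupling (add the activated phosphoramidite)",
    "capping (block chains that failed to couple)",
    "oxidation (convert phosphite to phosphate)",
]

def full_length_fraction(n_bases: int, coupling_eff: float = 0.99) -> float:
    """Fraction of chains still full-length after n_bases are added.
    Every base after the first requires one successful coupling."""
    return coupling_eff ** (n_bases - 1)

for step in CYCLE_STEPS:
    print(step)
print(f"Full-length yield for a 60-mer: {full_length_fraction(60):.1%}")
```

Even at 99% per step, a routine 60-mer retains only a bit over half of the chains at full length, which foreshadows the length limits discussed in the next questions.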


Why is it difficult to make oligos longer than 200nt via direct synthesis?

It is difficult to manufacture oligonucleotides longer than 200 nucleotides by direct chemical synthesis because the efficiency of each nucleotide incorporation cycle is not 100%. In phosphoramidite synthesis, each step has a small probability of error or coupling failure. As the length of the oligonucleotide increases, these errors accumulate, causing a drastic decrease in the yield of the complete product and an increase in truncated or incorrect sequences.

In addition, long oligonucleotides are more difficult to purify because the size differences between the correct sequence and incomplete sequences are small. Factors such as the formation of secondary structures during synthesis and chemical limitations associated with the stability of reagents and the solid support also play a role. For these reasons, direct synthesis of long oligonucleotides is inefficient and unreliable beyond approximately 200 nucleotides.
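The accumulated-loss argument above is easy to quantify. Assuming per-step coupling efficiencies in the range commonly cited for phosphoramidite synthesis (these exact values are illustrative):

```python
# Full-length yield of a 200-mer at several coupling efficiencies.
# A 200-nt oligo requires 199 successful couplings after the first base.
for eff in (0.98, 0.99, 0.995):
    yield_200 = eff ** 199
    print(f"coupling efficiency {eff:.1%} -> full-length yield {yield_200:.1%}")
```

Even an excellent 99.5% per-step efficiency leaves roughly a third of the material full-length at 200 nt, and at 98% almost nothing survives, which is why ~200 nt is the practical ceiling for direct synthesis.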


Why can’t you make a 2000bp gene via direct oligo synthesis?

A gene of approximately 2000 base pairs cannot be created by direct oligonucleotide synthesis because chemical DNA synthesis is less than 100% efficient in each nucleotide incorporation cycle. In phosphoramidite synthesis, each step has a small probability of failure, and the probability of obtaining a full-length product therefore falls exponentially with sequence length. As a result, the yield of the complete product becomes vanishingly small, and truncated or erroneous sequences predominate.

Furthermore, purification of such long fragments is extremely difficult, as the differences between the correct sequence and incomplete sequences are minimal. Added to this are chemical and physical limitations, such as reagent degradation, secondary structure formation, and the instability of long DNA during solid-phase synthesis. For these reasons, direct synthesis is not feasible for long genes, and in practice, genes of this size are obtained by assembling multiple shorter oligonucleotides using molecular biology techniques.
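Both halves of this answer, the collapse of direct-synthesis yield and the oligo-assembly workaround, can be put in numbers. The 99% efficiency and the 60-nt oligo / 20-nt overlap assembly parameters are illustrative assumptions, not values from the text:

```python
# Why a 2000-mer cannot be made directly, and roughly how many shorter
# oligos an overlap-based assembly would need (all numbers illustrative).
coupling_eff = 0.99
full_length_yield = coupling_eff ** 1999   # 1999 couplings for 2000 nt
print(f"Direct 2000-mer full-length yield: {full_length_yield:.2e}")

oligo_len, overlap = 60, 20                # typical assembly parameters
step = oligo_len - overlap                 # new sequence per oligo
n_oligos = -(-2000 // step)                # ceiling division
print(f"Oligos needed to tile a 2000 bp gene: ~{n_oligos}")
```

A direct yield on the order of 10⁻⁹ means essentially zero correct molecules, while tiling the gene with ~50 short, individually high-yield oligos and joining them enzymatically is entirely practical.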


Homework Question from George Church

Given slides #2 & 4 (AA:NA and NA:NA codes), what code would you suggest for AA:AA interactions?

For AA:AA (amino acid–amino acid) interactions, there is no discrete, symbolic code equivalent to the genetic code. Instead, the most appropriate type of “code” is one based on the physicochemical properties of amino acids, such as electrical charge, polarity, hydrophobicity, and residue size. These properties determine the non-covalent interactions between amino acids, including hydrophobic interactions, electrostatic interactions, and hydrogen bonds, as well as covalent disulfide bonds between cysteines, which together govern protein folding and stability.

Unlike NA:NA or NA:AA codes, where there is a defined correspondence between symbols, AA:AA interactions emerge from the three-dimensional context of the protein and the chemical characteristics of the residues involved. Therefore, a code based on physicochemical properties allows for a more realistic description and prediction of amino acid–amino acid interactions within a protein.

Source: Alberts, B. et al. Molecular Biology of the Cell. Garland Science. Chapters on protein structure and non-covalent interactions.