Week 2 HW: DNA read, write, and edit

Homework Questions from Professor Jacobson:

According to the Lecture 2 slides, the intrinsic error rate of biological DNA polymerase is approximately 1 error per 10⁶ base pairs. The slides also indicate that the human genome is approximately 3.2 × 10⁹ base pairs in length. At this error rate, replication of the human genome would result in thousands of errors per replication cycle if no additional correction mechanisms existed.

The slides explain that biology addresses this discrepancy through error-correcting mechanisms, including proofreading activity associated with DNA polymerase and post-replication mismatch repair systems, such as the MutS pathway. Together, these mechanisms reduce the effective mutation rate and allow large genomes to be stably maintained.

The Lecture 2 slides indicate that an average human protein is approximately 1036 base pairs in length. Because DNA consists of four possible nucleotides, the total number of possible nucleotide sequences of this length is 4¹⁰³⁶, which follows directly from basic combinatorics (four choices at each position).

The slides further show that, in practice, only a small subset of these sequences are usable. Constraints illustrated in the slides include GC content effects, secondary structure formation, and sequence-dependent synthesis limitations, all of which can interfere with DNA synthesis, transcription, or downstream use. As a result, most theoretically possible sequences are not viable in biological or synthetic contexts.

Homework Questions from Dr. LeProust:

According to the Lecture 2 slides, the most commonly used method for oligonucleotide synthesis is solid-phase phosphoramidite chemical synthesis. The slides describe this as a stepwise process in which nucleotides are added sequentially to a growing DNA strand attached to a solid support through repeated chemical cycles.

The historical overview in the slides also notes that the phosphoramidite method, developed by Caruthers (1981), remains the foundation of modern DNA synthesis technologies.

The Lecture 2 slides explain that direct oligonucleotide synthesis becomes difficult beyond approximately 200 nucleotides because errors accumulate with each synthesis cycle. Each nucleotide addition has a finite probability of failure, and as the number of synthesis steps increases, the yield of full-length, correct oligos decreases sharply.

The slides also show that longer oligos suffer from truncation products, base incorporation errors, and sequence-dependent effects, including high GC content and secondary structure formation. These factors reduce both yield and purity, making long oligos impractical to synthesize reliably in a single continuous process.

As shown in the Lecture 2 slides, a 2000 base-pair gene cannot be synthesized directly because the cumulative error rate of chemical synthesis over thousands of nucleotide additions would result in an extremely low fraction of correct, full-length molecules.

Instead, the slides describe classical gene synthesis, in which long genes are assembled from many shorter oligos using enzymatic methods such as PCR-based assembly and ligation. This hierarchical approach allows errors to be managed and corrected during assembly, making long gene construction feasible.

Homework Question from George Church:

Animals require a conserved set of essential amino acids that must be obtained through diet. These include histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, valine, and arginine.

Prof. Church’s Lecture 2 slide #4 highlights the genetic code as a fixed mapping between DNA codons and amino acids. In this context, the Lysine Contingency reflects a fundamental biological constraint: lysine is essential and encoded by the genetic code, yet animals cannot synthesize it. This suggests that biological systems are historically and chemically constrained, and that changing or replacing core amino acids such as lysine would require large-scale re-engineering of both metabolism and the genetic code.

References & Use of Tools:

The primary sources for all homework answers are the:

Jacobson, J. HTGAA Lecture 2: Gene Synthesis (MIT, 2026). Used for questions on DNA polymerase error rates, genome scale, protein length estimates, combinatorics of DNA sequences, GC content effects, secondary structure, and biological error correction.

LeProust, E. HTGAA Lecture 2: Oligonucleotide and Gene Synthesis (MIT, 2026). Used for questions on phosphoramidite oligonucleotide synthesis, limitations of long oligo synthesis, error accumulation, and classical gene assembly strategies.

Church, G. HTGAA Lecture 2: Reading & Writing Life (MIT, 2026), Slide #4. Used for conceptual framing of the genetic code (DNA → mRNA → amino acids) and interpretation of the “Lysine Contingency.”

All interpretations derived from lecture material are explicitly tied to the concepts presented in these slides.

ScienceDirect Topics – Essential Amino Acids https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/essential-amino-acid

ScienceDirect Topics – Lysine https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/lysine

(Used as background confirmation that lysine is classified as an essential amino acid.)

No external references were used for the Jacobson or LeProust questions beyond the lecture slides.

An AI-based writing tool (ChatGPT) was used solely to assist with wording, organization, and clarity. All factual content was derived from the cited lectures and explicitly listed external sources.

Part 0: Basics of Gel Electrophoresis

Gel electrophoresis is a technique used to separate DNA fragments based on their size by applying an electric field. I have performed gel electrophoresis multiple times during molecular biology laboratory work in Iraq, as well as during my internship at DGIST University in South Korea in the Molecular Neuroscience Department. Through these experiences, I became familiar with the practical workflow, while the HTGAA lectures helped reinforce the underlying concepts. In this technique, DNA migrates through an agarose gel toward the positive electrode because DNA is negatively charged. The gel is prepared using agarose, poured into a casting tray, and fitted with a comb to form wells. DNA samples are mixed with a loading dye and carefully loaded into the wells along with a DNA ladder for size reference. After applying an electric current, DNA fragments separate within the gel and are visualized using a gel imaging system to capture and document the results. In my previous lab work, gel electrophoresis was mainly used for genotyping and for verifying DNA samples prior to sequencing, helping confirm fragment size and sample quality before downstream analysis. Revisiting this technique in the context of HTGAA helped me better articulate how gel electrophoresis fits into broader molecular biology and sequencing workflows. In many workflows, DNA samples are PCR-amplified prior to gel electrophoresis to ensure sufficient DNA quantity and to analyze specific target fragments, particularly in genotyping and sequencing preparation.

Part 1: Benchling & In-silico Gel Art

Figure 1:

Figure 2:

Part 3: DNA Design Challenge

3.1. Choose your protein

Amyloid Beta (Aβ), specifically the Aβ(1–42) peptide derived from the human amyloid precursor protein (APP). I chose Amyloid-β because I want to understand how a normal peptide becomes harmful when it misfolds and aggregates, and how that process can disrupt brain function, memory, and overall body function in Alzheimer’s disease. Alzheimer’s affects many older adults and can gradually remove their ability to access their memories and daily independence, so I’m personally motivated to understand the molecular pathway that leads to these changes. In addition, Aβ is strongly linked to the classic Alzheimer’s pathology of extracellular plaques, which makes it a clear and widely studied starting point for connecting sequence → structure → disease mechanism.

Amyloid-beta is not encoded as an independent gene. Instead, it is generated by proteolytic cleavage of the amyloid-beta precursor protein (APP). Therefore, the APP protein sequence was used as the source sequence, with specific focus on the region that produces the Aβ peptide.

https://www.nature.com/articles/aps201728

Source organism: Homo sapiens (human) (Aβ comes from human APP; many mouse Alzheimer’s models express human APP/Aβ).

https://www.uniprot.org/uniprotkb/P05067/entry

Protein sequence source (UniProt):

sp|P05067|A4_HUMAN Amyloid-beta precursor protein OS=Homo sapiens OX=9606 GN=APP PE=1 SV=3 MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTK TCIDTKEGILQYCQEVYPELQITNVVEANQPVTIQNWCKRGRKQCKTHPHFVIPYRCLVG EFVSDALLVPDKCKFLHQERMDVCETHLHWHTVAKETCSEKSTNLHDYGMLLPCGIDKFR GVEFVCCPLAEESDNVDSADAEEDDSDVWWGGADTDYADGSEDKVVEVAEEEEVAEVEEE EADDDEDDEDGDEVEEEAEEPYEEATERTTSIATTTTTTTESVEEVVREVCSEQAETGPC RAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSAMSQSLLKTTQEPLARD PVKLPTTAASTPDAVDKYLETPGDENEHAHFQKAKERLEAKHRERMSQVMREWEEAERQA KNLPKADKKAVIQHFQEKVESLEQEAANERQQLVETHMARVEAMLNDRRRLALENYITAL QAVPPRPRHVFNMLKKYVRAEQKDRQHTLKHFEHVRMVDPKKAAQIRSQVMTHLRVIYER MNQSLSLLYNVPAVAEEIQDEVDELLQKEQNYSDDVLANMISEPRISYGNDALMPSLTET KTTVELLPVNGEFSLDDLQPWHSFGADSVPANTENEVEPVDARPAADRGLTTRPGSGLTN IKTEEISEVKMDAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIATVIVITL VMLKKKQYTSIHHGVVEVDAAVTPEERHLSKMQQNGYENPTYKFFEQMQN

https://rest.uniprot.org/uniprotkb/P05067.fasta

3.2. Reverse Translate: Protein → DNA

I used an online reverse translation tool (bioinformatics.org SMS2) to convert the amino acid sequence of my chosen protein (human APP; UniProt P05067) into a corresponding DNA coding sequence. Reverse translation is not unique because the genetic code is degenerate, meaning multiple codons can encode the same amino acid. Therefore, the sequence below represents one valid nucleotide sequence that could encode the same protein.

Reverse-translated DNA sequence (coding DNA; A/T/G/C only):

reverse translation of sp|P05067|A4_HUMAN Amyloid-beta precursor protein OS=Homo sapiens OX=9606 GN=APP PE=1 SV=3 to a 2310 base sequence of most likely codons. atgctgccgggcctggcgctgctgctgctggcggcgtggaccgcgcgcgcgctggaagtg ccgaccgatggcaacgcgggcctgctggcggaaccgcagattgcgatgttttgcggccgc ctgaacatgcatatgaacgtgcagaacggcaaatgggatagcgatccgagcggcaccaaa acctgcattgataccaaagaaggcattctgcagtattgccaggaagtgtatccggaactg cagattaccaacgtggtggaagcgaaccagccggtgaccattcagaactggtgcaaacgc ggccgcaaacagtgcaaaacccatccgcattttgtgattccgtatcgctgcctggtgggc gaatttgtgagcgatgcgctgctggtgccggataaatgcaaatttctgcatcaggaacgc atggatgtgtgcgaaacccatctgcattggcataccgtggcgaaagaaacctgcagcgaa aaaagcaccaacctgcatgattatggcatgctgctgccgtgcggcattgataaatttcgc ggcgtggaatttgtgtgctgcccgctggcggaagaaagcgataacgtggatagcgcggat gcggaagaagatgatagcgatgtgtggtggggcggcgcggataccgattatgcggatggc agcgaagataaagtggtggaagtggcggaagaagaagaagtggcggaagtggaagaagaa gaagcggatgatgatgaagatgatgaagatggcgatgaagtggaagaagaagcggaagaa ccgtatgaagaagcgaccgaacgcaccaccagcattgcgaccaccaccaccaccaccacc gaaagcgtggaagaagtggtgcgcgaagtgtgcagcgaacaggcggaaaccggcccgtgc cgcgcgatgattagccgctggtattttgatgtgaccgaaggcaaatgcgcgccgtttttt tatggcggctgcggcggcaaccgcaacaactttgataccgaagaatattgcatggcggtg tgcggcagcgcgatgagccagagcctgctgaaaaccacccaggaaccgctggcgcgcgat ccggtgaaactgccgaccaccgcggcgagcaccccggatgcggtggataaatatctggaa accccgggcgatgaaaacgaacatgcgcattttcagaaagcgaaagaacgcctggaagcg aaacatcgcgaacgcatgagccaggtgatgcgcgaatgggaagaagcggaacgccaggcg aaaaacctgccgaaagcggataaaaaagcggtgattcagcattttcaggaaaaagtggaa agcctggaacaggaagcggcgaacgaacgccagcagctggtggaaacccatatggcgcgc gtggaagcgatgctgaacgatcgccgccgcctggcgctggaaaactatattaccgcgctg caggcggtgccgccgcgcccgcgccatgtgtttaacatgctgaaaaaatatgtgcgcgcg gaacagaaagatcgccagcataccctgaaacattttgaacatgtgcgcatggtggatccg aaaaaagcggcgcagattcgcagccaggtgatgacccatctgcgcgtgatttatgaacgc atgaaccagagcctgagcctgctgtataacgtgccggcggtggcggaagaaattcaggat gaagtggatgaactgctgcagaaagaacagaactatagcgatgatgtgctggcgaacatg attagcgaaccgcgcattagctatggcaacgatgcgctgatgccgagcctgaccgaaacc aaaaccaccgtggaactgctgccggtgaacggcgaatttagcctggatgatctgcagccg tggcatagctttggcgcggatagcgtgccggcgaacaccgaaaacgaagtggaaccggtg gatgcgcgcccggcggcggatcgcggcctgaccacccgcccgggcagcggcctgaccaac attaaaaccgaagaaattagcgaagtgaaaatggatgcggaatttcgccatgatagcggc tatgaagtgcatcatcagaaactggtgttttttgcggaagatgtgggcagcaacaaaggc gcgattattggcctgatggtgggcggcgtggtgattgcgaccgtgattgtgattaccctg gtgatgctgaaaaaaaaacagtataccagcattcatcatggcgtggtggaagtggatgcg gcggtgaccccggaagaacgccatctgagcaaaatgcagcagaacggctatgaaaacccg acctataaattttttgaacagatgcagaac

3.3 Codon optimization

To optimize the codon usage of the nucleotide sequence obtained in the previous step, I used the VectorBuilder online codon optimization tool. The reverse-translated DNA sequence was entered into the tool and Mus musculus (mouse) was selected as the target organism. Codon optimization is necessary because, although the genetic code is universal, different organisms preferentially use specific codons due to differences in tRNA abundance and translation efficiency. If a gene contains codons that are rarely used in the host organism, translation can be inefficient and protein expression levels may be reduced. Codon optimization replaces rare codons with synonymous codons that are more frequently used by the host, while preserving the amino acid sequence of the protein. After optimization, improvements were observed in sequence quality metrics. The GC content increased slightly from 55.37% to 57.62%, remaining within an optimal range for stability and transcription. In addition, the Codon Adaptation Index (CAI) increased from 0.74 to 0.92, indicating a substantially improved match between the codon usage of the sequence and the translational machinery of the mouse host. These changes suggest a higher likelihood of efficient translation and protein expression in mouse-based experimental systems. Mouse was chosen as the target organism because Alzheimer’s disease research is commonly conducted using mouse models, including APP-related transgenic and knock-in models. Optimizing the codon usage for mouse therefore increases the biological relevance of the designed sequence.

Codon-optimized DNA sequence: Improved DNA[1]: GC=57.62%, CAI=0.92 ATGCTGCCAGGCCTGGCCCTGCTGCTGCTCGCCGCCTGGACAGCCCGGGCCCTGGAAGTGCCAACCGACGGCAACGCTGGACTGCTGGCTGAGCCTCAGATCGCCATGTTTTGTGGGCGGCTGAATATGCACATGAATGTGCAGAACGGAAAGTGGGACTCTGACCCCTCCGGCACCAAAACCTGTATCGATACAAAGGAAGGCATTCTGCAGTACTGTCAGGAGGTGTATCCCGAGCTGCAGATCACCAACGTGGTGGAGGCCAACCAGCCTGTGACCATCCAAAATTGGTGCAAAAGGGGTAGAAAGCAGTGTAAGACACACCCACACTTTGTGATCCCATATAGATGTCTGGTGGGGGAGTTCGTGTCCGACGCCCTGCTGGTGCCCGACAAGTGCAAGTTTCTGCACCAGGAGAGAATGGACGTGTGCGAGACACACCTGCACTGGCACACAGTGGCTAAGGAGACCTGTAGTGAGAAGAGCACCAACCTGCACGACTACGGGATGCTGCTGCCCTGCGGTATCGACAAGTTTAGAGGTGTGGAATTCGTGTGCTGTCCTCTGGCCGAGGAGTCCGACAATGTGGATAGCGCCGACGCCGAGGAGGACGACAGCGACGTGTGGTGGGGCGGCGCCGATACAGACTACGCCGATGGCTCCGAAGACAAGGTGGTGGAGGTGGCCGAGGAAGAGGAAGTGGCCGAGGTGGAGGAGGAGGAGGCTGACGACGACGAGGACGATGAGGACGGCGATGAGGTTGAGGAGGAGGCCGAGGAGCCTTACGAGGAAGCCACCGAGCGGACTACTTCCATTGCTACCACCACCACCACCACTACCGAGAGCGTGGAGGAGGTGGTGAGAGAGGTGTGCAGCGAGCAGGCCGAGACCGGCCCTTGTAGAGCCATGATCTCCCGGTGGTATTTCGATGTGACCGAGGGAAAGTGCGCCCCTTTCTTCTACGGAGGCTGTGGAGGCAACAGGAACAATTTTGACACTGAGGAGTACTGTATGGCCGTGTGTGGCTCCGCCATGAGCCAGTCCCTGCTGAAGACCACTCAGGAGCCCCTGGCACGGGACCCTGTGAAGCTGCCCACCACCGCCGCTAGCACACCCGACGCCGTGGACAAGTATTTGGAGACCCCAGGAGACGAGAATGAGCACGCACACTTTCAGAAGGCTAAGGAGCGCCTGGAGGCTAAGCACCGAGAAAGGATGTCTCAGGTGATGCGCGAGTGGGAGGAAGCCGAGAGGCAGGCTAAGAACCTGCCTAAAGCTGACAAAAAAGCCGTGATCCAGCATTTCCAGGAGAAGGTGGAGAGCCTGGAACAGGAGGCTGCCAACGAGAGACAGCAGCTGGTGGAGACTCACATGGCTCGAGTGGAGGCCATGCTGAACGACAGGAGGAGGCTGGCCCTGGAGAACTACATCACCGCTCTGCAGGCCGTGCCTCCCAGGCCAAGGCATGTGTTTAACATGCTGAAGAAGTACGTGAGGGCAGAACAGAAGGACCGGCAACACACCCTGAAACACTTCGAGCACGTTAGAATGGTGGATCCTAAGAAAGCCGCTCAGATTAGAAGCCAGGTGATGACCCACCTGAGAGTGATTTACGAGAGAATGAACCAAAGCCTGTCTCTGCTGTATAATGTGCCCGCCGTCGCCGAGGAGATCCAGGACGAGGTGGACGAACTGCTGCAGAAGGAGCAAAATTACTCAGATGACGTGCTGGCAAACATGATCAGCGAACCACGCATCTCCTACGGCAACGACGCCCTGATGCCTTCCCTGACCGAAACTAAGACCACTGTGGAGCTGCTCCCAGTGAACGGCGAATTCTCCCTCGACGACCTGCAGCCTTGGCACAGCTTCGGGGCCGACTCCGTGCCTGCAAACACTGAAAACGAGGTGGAGCCTGTGGACGCAAGACCTGCCGCCGATAGAGGACTGACAACAAGACCTGGCAGCGGACTGACCAACATCAAGACCGAGGAGATTAGTGAGGTGAAGATGGATGCCGAGTTCAGGCACGATAGCGGGTACGAGGTACACCACCAGAAGCTGGTGTTCTTCGCTGAGGATGTGGGCAGCAATAAAGGAGCCATTATCGGCCTGATGGTGGGAGGGGTGGTGATCGCCACAGTGATCGTTATCACCCTGGTGATGCTGAAGAAGAAGCAGTACACCTCCATTCACCATGGGGTCGTCGAAGTGGATGCCGCCGTGACTCCAGAGGAGAGACACCTGAGCAAGATGCAGCAGAACGGGTATGAGAACCCAACCTATAAGTTCTTCGAGCAGATGCAGAAC

https://en.vectorbuilder.com/tool/codon-optimization/0e325451-10a9-4a2b-a574-6d246ae5e506.html

3.4. You have a sequence! Now what?

Once a codon-optimized DNA sequence is obtained, the protein can be produced using either cell-dependent or cell-free expression technologies. Cell-dependent expression In a cell-dependent system, the codon-optimized DNA sequence is first inserted into an expression vector that contains a promoter and other regulatory elements required for transcription. This vector is then introduced into host cells, such as mouse or mammalian cells. Inside the cell, the DNA sequence is transcribed into messenger RNA (mRNA) by RNA polymerase. The mRNA is subsequently translated by ribosomes, which read the nucleotide codons and use transfer RNAs (tRNAs) to assemble the corresponding amino acids into the protein. This approach allows protein production in a biologically relevant cellular environment and is commonly used in disease-related research. Cell-free expression Alternatively, the DNA sequence (or the corresponding mRNA) can be used in a cell-free expression system. These systems contain purified ribosomes, enzymes, and translation factors, allowing transcription and translation to occur in vitro without living cells. Cell-free expression enables rapid protein production and precise experimental control, although it may not fully replicate cellular processes such as protein trafficking or degradation.

3.5 (Optional) How does it work in nature / biological systems?

How a single gene can code for multiple proteins In biological systems, a single gene can give rise to multiple proteins through transcriptional and post-transcriptional mechanisms. One major mechanism is alternative splicing, where different combinations of exons are joined from the same pre-mRNA to produce multiple mRNA transcripts. Additional diversity can arise from alternative transcription start sites or alternative polyadenylation. In the case of the amyloid precursor protein (APP), different processing pathways generate distinct fragments, including amyloid-beta peptides, demonstrating how one gene can produce multiple biologically relevant protein products.

DNA → RNA → Protein alignment To visualize the flow of genetic information, the DNA sequence was translated using the ExPASy Translate tool, which displays all possible reading frames. The biologically relevant open reading frame was identified in the 5’→3’ Frame 1, which begins with a start codon (ATG) and produces a continuous amino acid sequence.