Week 2 HW: DNA read, write, and edit

Homework Questions

Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome. How does biology deal with that discrepancy?

Polymerase has a very low error rate of approximately 10^{-4} to 10^{-6} errors per nucleotide without proofreading. Thus, comparing to the human genome length of approximately 3 billion base pair, it represents a minimal part. Thanks to other molecular features, errors produced by the sythesis machinery can be corrected by mechanisms applied like read proof, including the polymerase’s intrinsic proofreading domain, post-replication mismatch repair (MMR) pathways involving proteins like MutS and MutL that scan and excise mismatches, and DNA damage response mechanisms such as base excision repair (BER) and nucleotide excision repair (NER).

How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

An average human protein is approximately 345 amino acids long (based on an average coding sequence of ~1,036 bp, divided by 3 nucleotides per codon). Given the degeneracy of the genetic code, where most of the 20 standard amino acids are encoded by 2–6 synonymous codons (with exceptions like methionine and tryptophan, each with 1 codon), the theoretical number of distinct DNA sequences encoding such a protein is enormous—typically on the order of 10^{100} to 10^{200} or more, calculated as the product of the number of codon options for each amino acid position. In practice, however, not all sequences function effectively due to factors such as codon usage bias (where certain codons are preferred for efficient translation in specific organisms, affecting tRNA availability and ribosomal speed), mRNA secondary structure formation that can impede translation initiation or elongation, regulatory elements embedded in the coding sequence (e.g., splicing signals, miRNA binding sites, or transcription factor motifs), GC content imbalances leading to replication or expression issues, and potential toxicity from repetitive sequences or unintended open reading frames.

What’s the most commonly used method for oligo synthesis currently?

The most commonly used method for oligonucleotide (oligo) synthesis is solid-phase phosphoramidite chemistry, often performed on microarray chips for high-throughput, multiplex production. This enables the parallel synthesis of up to 1 million oligos per chip at significantly reduced cost

Why is it difficult to make oligos longer than 200nt via direct synthesis?

Synthesizing oligos longer than 200 nucleotides via direct chemical methods is challenging due to cumulative errors from imperfect coupling efficiency (typically 98–99% per nucleotide addition), leading to exponential yield decay and increased heterogeneity in the product pool. Side reactions, such as depurination or incomplete deprotection, further accumulate, resulting in truncated or mutated sequences that require extensive purification.

Why can’t you make a 2000bp gene via direct oligo synthesis?

Direct oligo synthesis cannot produce a 2,000 bp gene because current chemical methods are limited to ~200–300 nt maximum lengths with acceptable yield and fidelity; beyond this, error rates and incomplete extensions make the process infeasible. Instead, longer genes are assembled from shorter oligos using techniques like PCR amplification, ligation, or enzymatic assembly (like Gibson assembly)

What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

The 10 essential amino acids required by all animals, which cannot be synthesized endogenously and must be obtained through diet, are arginine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, and valine. Regarding “Lysine Contingency,” while lysine auxotrophy is a clever single-point failure mechanism, relying on just one could be vulnerable to evolutionary escape or environmental supplementation. A more robust “contingency” might involve engineering dependencies on multiple essentials (e.g., via codon recoding for serine and leucine as in the slides) or all 10, enhancing safety in applications like industrial microbes or gene drives, as escape risks are minimized when multiple biosynthetic pathways are disrupted.

What code would you suggest for AA:AA interactions?

I suggest a code analogous to basepairing but based on complementary physicochemical properties of amino acid side chains. This could be a modular interaction code categorizing the 20 amino acids into groups like charged, hydrophobic, polar, and special. Pairings would follow rules like positive-negative for ionic bonds or hydrophobic-hydrophobic for van der Waals, enabling predictable design of interfaces (e.g., coiled-coils or beta-sheets).

Homework Week 2

Part 1: Benchling & In-silico Gel Art

resembles a cactus figure

Part 2: Gel Art - Restriction Digests and Gel Electrophoresis

No access to lab in person

Part 3: DNA Design Challenge

3.1. Choose your protein

Nitrosomonas marina is a marine ammonia-oxidizing bacterium, and its AmoA (ammonia monooxygenase subunit A) is the catalytic core of the ammonia monooxygenase complex that initiates nitrification by oxidizing ammonia (NH₃) to hydroxylamine. Studying the AmoA amino acid sequence from N. marina is relevant because it is a well-annotated, reference AmoA that is widely used for comparative and phylogenetic analyses of ammonia-oxidizing bacteria. Its sequence provides a reliable benchmark for identifying conserved functional motifs, distinguishing AmoA from closely related PmoA proteins, and comparing canonical aerobic ammonia oxidation with putative alternative or iron-coupled ammonium oxidation pathways in other bacterial systems.

protein sequence

CAB96447.1 amoA, partial [Nitrosomonas marina] YPINFVLPSTMIPGALMLDTIMLLTGNWLITALLGGGFWGLFFYPGNWPIFGPTHLPVVVEGVLLSVADY TGFLYVRTGTPEYVRLIEQGSLRTFGGHTTVIAAFFSAFVSMLMFCVWWYFGKIYCTAFYYVRGERGRIS QKHDVTAFG

3.2. Reverse Translate: Protein sequence to DNA sequence

I used the following tool for reverse translation: https://www.bioinformatics.org/sms2/rev_trans.html

reverse translation of sample sequence to a 447 base sequence of most likely codons. tatccgattaactttgtgctgccgagcaccatgattccgggcgcgctgatgctggatacc attatgctgctgaccggcaactggctgattaccgcgctgctgggcggcggcttttggggc ctgtttttttatccgggcaactggccgatttttggcccgacccatctgccggtggtggtg gaaggcgtgctgctgagcgtggcggattataccggctttctgtatgtgcgcaccggcacc ccggaatatgtgcgcctgattgaacagggcagcctgcgcacctttggcggccataccacc gtgattgcggcgttttttagcgcgtttgtgagcatgctgatgttttgcgtgtggtggtat tttggcaaaatttattgcaccgcgttttattatgtgcgcggcgaacgcggccgcattagc cagaaacatgatgtgaccgcgtttggc

3.3 Codon optimization

Codon optimization is necessary because the genetic code is degenerate, meaning most amino acids are encoded by multiple codons and organisms exhibit significant variation in their synonymous codon usage preferences due to evolutionary pressures related to translational efficiency. Expressing a heterologous gene like Nitrosomonas marina amoA in a foreign host without optimization can lead to poor expression due to several molecular constraints:

the presence of codons that are rare in the host’s genome, which correlate with low abundance of their corresponding cognate tRNAs, leading to ribosomal stalling, translation attenuation, and premature termination
the formation of unfavorable mRNA secondary structures that impede ribosome binding and translocation; and
Extreme GC content that affects transcriptional efficiency and mRNA stability.

For AmoA protein, I incline for optimizing the sequence for expression in Escherichia coli. This choice is justified by the organism’s well-characterized genetics, rapid growth kinetics, high cell density cultivation, and the extensive availability of compatible expression vectors and purification tags. Furthermore, E. coli possesses a highly annotated codon usage database so this allows for precise adjustment of codon frequencies to match its endogenous tRNA pool.

Optimized Sequence (Optimized Sequence Length: 447bp, GC%:52.13%). TATCCCATAAATTTTGTACTACCATCAACAATGATCCCGGGTGCACTGATGTTGGACACCATTATGCTCTTGACCGGCAACTGGCTGATCACTGCGTTGTTAGGTGGTGGCTTC TGGGGTCTGTTTTTCTACCCGGGTAATTGGCCAATTTTTGGTCCGACCCATCTGCCGGTGGTGGTTGAAGGCGTTCTGCTGTCCGTGGCCGATTATACCGGTTTTCTGTATGT GCGCACCGGTACGCCGGAGTATGTTCGTCTGATCGAGCAAGGTAGCCTGCGTACGTTTGGCGGCCACACCACCGTTATTGCGGCGTTCTTTAGCGCATTCGTGAGCATGTTG ATGTTTTGTGTCTGGTGGTATTTCGGCAAGATCTACTGCACGGCTTTCTACTACGTACGCGGTGAAAGAGGCCGTATTTCTCAGAAACACGACGTCACCGCGTTCGGC

3.4 Technologies to produce the protein

The production of a recombinant protein such as AmoA from its corresponding DNA sequence can be achieved through two primary technological platforms: cell-dependent (in vivo) expression systems and cell-free (in vitro) expression systems.

In a typical bacterial expression system, the codon-optimized gene is first cloned into a plasmid vector under the control of a strong inducible promoter. This recombinant plasmid is then introduced into a host bacterium, most commonly Escherichia coli, through a process called transformation. The bacteria are cultivated in nutrient medium within shake flasks or bioreactors until an optimal cell density is reached, at which point an inducer molecule, like IPTG, is added to initiate transcription of the gene by host RNA polymerase. The resulting mRNA transcripts are subsequently bound by bacterial ribosomes, which, with the assistance of transfer RNAs charged with amino acids, catalyze the polymerization of the polypeptide chain according to the codon sequence. Following an appropriate expression period, the bacteria are harvested by centrifugation and lysed to release the intracellular contents, after which the target protein is purified using techniques such acell-free protein synthesis offers a rapid and open in vitro platform for protein production. In this approach, crude cellular extracts are prepared from cultured bacteria (or other organisms) by mechanical lysis and centrifugation to remove cell debris and genomic DNA, retaining the essential macromolecular machinery including ribosomes, tRNA molecules, aminoacyl-tRNA synthetases, and translation factors. A reaction mixture is assembled by combining this extract with the DNA template encoding AmoA, an energy regeneration system (typically containing ATP, GTP, and phosphoenolpyruvate or creatine phosphate), amino acids, and necessary cofactors. Transcription and translation proceed simultaneously in the tube, with the endogenous RNA polymerase transcribing the gene and the ribosomes immediately translating the nascent mRNA. This system offers distinct advantages for challenging proteins like membrane enzymes, as it allows for the direct supplementation of detergents, lipids, or chaperones to facilitate proper folding, and eliminates concerns regarding cellular toxicity or inclusion body formation. Additionally, cell-free reactions can be performed in small volumes for high-throughput screening or scaled up in bioreactor configurations to increase protein yields affinity chromatography, leveraging genetically encoded affinity tags fused to the protein of interest.

On the other hand, cell-free protein synthesis is a rapid and open in vitro platform for protein production. In this approach, crude cellular extracts are prepared from cultured bacteria by mechanical lysis and centrifugation to remove cell debris and genomic DNA, retaining the essential macromolecular machinery including ribosomes, tRNA molecules, aminoacyl-tRNA synthetases, and translation factors. A reaction mixture is assembled by combining this extract with the DNA template encoding AmoA, an energy regeneration system (typically containing ATP and others), amino acids, and necessary cofactors. Transcription and translation proceed simultaneously in the tube, with the endogenous RNA polymerase transcribing the gene and the ribosomes immediately translating the nascent mRNA. This system offers distinct advantages for challenging proteins like membrane enzymes, as it allows for the direct supplementation of detergents, lipids, or chaperones to facilitate proper folding, and eliminates concerns regarding cellular toxicity or inclusion body formation. Additionally, cell-free reactions can be performed in small volumes for high-throughput screening or scaled up in bioreactor configurations to increase protein yield

Part 4: Prepare a Twist DNA Synthesis Order

create a Twist and Benchling account: check! :)
Build Your DNA Insert Sequence

#amoAoptimized-insert tatccgattaactttgtgctgccgagcaccatgattccgggcgcgctgatgctggataccattatgctgctgaccggcaactggctgattaccgcgctgctgggcggcggcttttggggcctgtttttttatccgggcaactggccgatttttggcccgacccatctgccggtggtggtggaaggcgtgctgctgagcgtggcggattataccggctttctgtatgtgcgcaccggcaccccggaatatgtgcgcctgattgaacagggcagcctgcgcacctttggcggccataccaccgtgattgcggcgttttttagcgcgtttgtgagcatgctgatgttttgcgtgtggtggtattttggcaaaatttattgcaccgcgttttattatgtgcgcggcgaacgcggccgcattagccagaaacatgatgtgaccgcgtttggc

Part 5 - DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence (e.g., read) and why?

To investigate the intrinsic control of neuron formation, I would sequence the FOXG1 gene, including its coding regions, regulatory promoter elements, and conserved non-coding sequences. FOXG1 is a critical transcription factor that acts as a master regulator of telencephalic development . It balances the proliferation of neural progenitors with their differentiation into neurons . Importantly, mutations, copy number variations, or deletions in FOXG1 lead to FOXG1 syndrome, a severe neurodevelopmental disorder characterized by impaired brain development and structural abnormalities . Sequencing this gene can reveal variants that disrupt its regulatory function, thereby affecting neuronal formation and transition.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?

I would use Illumina sequencing due to its high accuracy and throughput, which is ideal for detecting single nucleotide variants and small insertions/deltions, indels, within the FOXG1 gene across many samples.

Generation: This is a second-generation sequencing (next-generation sequencing) technology. It relies on massively parallel sequencing of clonally amplified DNA fragments . Input: The input is high-quality genomic DNA (gDNA) isolated from the sample of interest (e.g., neural cells or tissue) . Input Preparation: Essential steps include: 1) Fragmentation of gDNA into smaller pieces (200-500 bp), 2) Adapter ligation, where platform-specific oligonucleotides (P5 and P7) are attached to fragment ends, and 3) Library amplification by PCR to enrich for fragments with adapters . Essential Steps of Technology/Base Calling: 1) Clonal amplification: Fragments are amplified on a flow cell via bridge PCR to form distinct clonal clusters . 2) Sequencing by Synthesis (SBS): Fluorescently labeled, terminator-bound nucleotides are incorporated one at a time. After each cycle, the flow cell is imaged to detect the emitted fluorescence from each cluster, identifying the base . The sequencing instrument’s software performs base calling by converting these light signals into nucleotide sequences with quality scores . Output: The primary output is FASTQ files containing millions of high-quality short reads (e.g., 150 bp paired-end), along with quality scores for each base call

I would also use Nanopore sequencing to complement Illumina data, specifically to resolve complex structural variants, copy number variations, and haplotypes in the FOXG1 region that are difficult to detect with short reads.

Generation: This is a third-generation sequencing technology. It sequences single DNA molecules in real time without requiring amplification . Input: The input is high-molecular-weight (HMW) genomic DNA, which is essential for constructing long reads . Input Preparation: Steps involve: 1) Optional fragmentation if a specific size is desired, though shearing is often minimized to preserve long fragments. 2) End-prep to repair DNA ends and add a poly-A tail. 3) Adapter ligation, where sequencing adapters (containing a motor protein) are attached to the template . Essential Steps of Technology/Base Calling: 1) The prepared library is loaded onto a flow cell containing thousands of protein nanopores embedded in a membrane . 2) A motor protein guides a single DNA strand through the nanopore. 3) As the DNA passes through, it disrupts an ionic current in a characteristic way for each nucleotide. 4) Base calling is performed in real-time by sophisticated algorithms (like recurrent neural networks) that interpret these raw electrical current changes to determine the DNA sequence . Output: The output is FASTQ files containing ultra-long reads (often >10 kb, capable of spanning entire repetitive regions)

5.2 DNA Write

Questions (i) What DNA would you want to synthesize (e.g., write) and why? (ii) What technology or technologies would you use to perform this DNA synthesis and why? Also answer the following questions: What are the essential steps of your chosen sequencing methods? What are the limitations of your sequencing method (if any) in terms of speed, accuracy, scalability?

Answer

To investigate the regulatory function of FOXG1 in neuronal development, I would synthesize a compact genetic circuit consisting of the FOXG1 gene under the control of a neural-specific inducible promoter, along with a fluorescent reporter like GFP, linked via a self-cleaving peptide sequence to enable simultaneous visualization of FOXG1-expressing cells. This synthetic construct would be designed to test the hypothesis that precise FOXG1 dosage regulates the balance between neural progenitor proliferation and neuronal differentiation, as FOXG1 haploinsufficiency causes FOXG1 syndrome—a severe neurodevelopmental disorder characterized by impaired telencephalic development . By delivering this circuit into induced pluripotent stem cell-derived neural progenitors and modulating FOXG1 expression levels, one could observe downstream effects on neuronal fate specification and maturation, providing a platform to screen for therapeutic modulators of FOXG1 expression.

For DNA synthesis, I would employ enzymatic DNA synthesis technology rather than traditional chemical synthesis. Enzymatic synthesis, utilizing engineered terminal deoxynucleotidyl transferase (TdT) enzymes, offers superior performance for constructing this genetic circuit due to several advantages . The essential steps of this method include: (1) oligonucleotide design, where the target sequence is computationally designed and codon-optimized for expression in human cells; (2) enzymatic synthesis, where TdT enzymes incorporate modified nucleotides base-by-base onto a growing DNA strand under aqueous conditions, achieving coupling efficiencies above 99.7% ; (3) assembly, where overlapping oligonucleotides are assembled into the full-length gene construct using PCR-based or enzymatic assembly methods ; and (4) sequence verification through Sanger or next-generation sequencing to confirm 100% accuracy before cloning into an expression vector . This approach eliminates the need for a natural DNA template, enabling complete design freedom for incorporating the neural-specific promoter and reporter elements.

The primary limitation of enzymatic synthesis lies in its current commercial availability and throughput compared to established chemical synthesis platforms . While enzymatic methods can produce longer contiguous sequences (exceeding 700 bases) with lower error rates and without toxic waste generation, the technology is still emerging and may not yet match the overall yield of chemical synthesis for very large-scale projects . Additionally, secondary structures in complex sequences, such as the GC-rich regulatory elements in neural promoters, can interfere with enzymatic activity, requiring careful sequence design and validation.

5.3 DNA Edit

Questions

(i) What DNA would you want to edit and why?

(ii) What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:

How does your technology of choice edit DNA? What are the essential steps?

What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?

What are the limitations of your editing methods (if any) in terms of efficiency or precision?

Answer

I would like to investigate the therapeutic potential of modulating FOXG1 expression in neurodevelopmental disorders, therefore I would edit the endogenous FOXG1 locus in human induced pluripotent stem cells (iPSCs) to correct pathogenic mutations associated with FOXG1 syndrome . FOXG1 syndrome results from heterozygous mutations causing haploinsufficiency, leading to severe brain structural abnormalities including corpus callosum agenesis, hippocampal malformation, and myelination deficits . Correcting the mutated allele to wild-type sequence in patient-derived iPSCs would restore normal FOXG1 dosage and allow subsequent differentiation into neural progenitors and neurons, providing a platform to study whether genetic correction rescues the cellular and molecular phenotypes observed in FOXG1-deficient neural development . This approach has direct relevance to regenerative medicine, as it could establish proof-of-concept for autologous cell replacement therapies and disease modeling.

I would employ the CRISPR/Cas9 system delivered via adeno-associated virus vectors to perform these edits . The technology edits DNA by introducing a targeted double-strand break at the FOXG1 locus, which is then repaired through homology-directed repair using an exogenous donor DNA template containing the corrected sequence .

The essential steps include:

design and synthesis of variant-specific single-guide RNAs (sgRNAs) complementary to the target mutation site and a donor DNA template encoding the wild-type FOXG1 sequence ; (
cloning of sgRNA and donor template into AAV vectors (with AAV9 preferred for neuronal applications)
delivery of AAV-CRISPR components into patient-derived iPSCs or fibroblasts
validation of successful editing through next-generation sequencing of the target locus. The input materials include sgRNAs, Cas9 endonuclease (delivered as protein or encoded in the vector), donor DNA template, and the target cells for editing .

Regarding to limitations, this editing approach include efficiency and precision challenges. Homology-directed repair in iPSCs and neurons is relatively inefficient, with reported correction rates of 20-35% in successfully transfected cells . Additionally, off-target activity remains a concern despite careful sgRNA design, as Cas9 can cleave at genomic sites with sequence similarity to the target . Delivery efficiency varies by cell type and AAV serotype, with AAV9 showing optimal transduction in fibroblasts and iPSC-derived neurons but lower efficiency in undifferentiated iPSCs . Furthermore, the risk of unintended insertions or deletions at the target site from non-homologous end joining competing with homology-directed repair can introduce additional mutations . These limitations necessitate rigorous validation through deep sequencing and clonal selection to ensure precise correction without off-target effects before therapeutic application.