Week 2 HW: DNA Read, Write, & Edit

Part 1: Benchling & In-silico Gel Art

For this exercise, I imported the lambda DNA reference sequence into Benchling and performed in-silico restriction digests with the required enzymes: EcoRI, HindIII, BamHI, KpnI, EcoRV, SacI, and SalI.

Using Benchling’s digestion simulation tool, I iteratively tested different enzyme combinations to generate fragment patterns with distinct size distributions.
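The digest simulation can be approximated in a few lines of Python. This is a minimal sketch, not Benchling's implementation: the function names are my own, the template is treated as linear, and only top-strand cut positions are used to compute fragment sizes. The recognition sites and cut offsets are the standard ones for these seven enzymes.

```python
# Recognition site and top-strand cut offset (bases from the start of the site).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "KpnI": ("GGTACC", 5),     # GGTAC^C
    "EcoRV": ("GATATC", 3),    # GAT^ATC (blunt)
    "SacI": ("GAGCTC", 5),     # GAGCT^C
    "SalI": ("GTCGAC", 1),     # G^TCGAC
}


def cut_positions(seq, enzyme):
    """Return top-strand cut positions for one enzyme on a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = seq.find(site, start + 1)  # step by 1 to catch overlapping sites
    return cuts


def digest(seq, enzymes):
    """Return fragment lengths (in order along the sequence) for a multi-enzyme digest."""
    cuts = sorted({c for e in enzymes for c in cut_positions(seq, e)})
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Toy template with one EcoRI site and one BamHI site (not lambda DNA).
toy = "AAAGAATTCTTTTGGATCCAA"
print(digest(toy, ["EcoRI", "BamHI"]))
```

Because all seven sites are palindromic, scanning the top strand alone finds every site. Running each candidate enzyme combination through `digest` on the lambda sequence gives the band sizes to compare before committing to a lane layout.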

The resulting simulated gel is shown below:

Part 3: DNA Design Challenge

3.1 Choose your protein

For this assignment, I selected the human LINE-1 retrotransposable element ORF2 protein (ORF2p) (UniProt: O00370). ORF2p is the catalytic core of LINE-1 retrotransposition and contains both an endonuclease domain and a reverse transcriptase domain. These enzymatic activities enable target-primed reverse transcription and genomic integration, making ORF2p central to LINE-1 mobility in the human genome.

I chose this protein because my research focuses on engineering LINE-1–based gene writing systems. ORF2p is the key effector molecule that determines integration efficiency, specificity, and safety. Understanding its sequence, structure, and functional constraints is therefore essential for developing programmable genome engineering tools in mammalian systems.

Protein sequence (UniProt O00370.1):

>sp|O00370.1|LORF2_HUMAN RecName: Full=LINE-1 retrotransposable element ORF2 protein; Short=ORF2p; Includes: RecName: Full=Reverse transcriptase; Includes: RecName: Full=Endonuclease
MTGSNSHITILTLNVNGLNSPIKRHRLASWIKSQDPSVCCIQETHLTCRDTHRLKIKGWRKIYQANGKQKKAGVAILVSDKTDFKPTKIKRDKEGHYIMVKGSIQQEELTILNIYAPNTGAPRFIKQVLSDLQRDLDSHTLIMGDFNTPLSILDRSTRQKVNKDTQELNSALHQTDLIDIYRTLHPKSTEYTFFSAPHHTYSKIDHIVGSKALLSKCKRTEIITNYLSDHSAIKLELRIKNLTQSRSTTWKLNNLLLNDYWVHNEMKAEIKMFFETNENKDTTYQNLWDAFKAVCRGKFIALNAYKRKQERSKIDTLTSQLKELEKQEQTHSKASRRQEITKIRAELKEIETQKTLQKINESRSWFFERINKIDRPLARLIKKKREKNQIDTIKNDKGDITTDPTEIQTTIREYYKHLYANKLENLEEMDTFLDTYTLPRLNQEEVESLNRPITGSEIVAIINSLPTKKSPGPDGFTAEFYQRYKEELVPFLLKLFQSIEKEGILPNSFYEASIILIPKPGRDTTKKENFRPISLMNIDAKILNKILANRIQQHIKKLIHHDQVGFIPGMQGWFNIRKSINVIQHINRAKDKNHVIISIDAEKAFDKIQQPFMLKTLNKLGIDGMYLKIIRAIYDKPTANIILNGQKLEAFPLKTGTRQGCPLSPLLFNIVLEVLARAIRQEKEIKGIQLGKEEVKLSLFADDMYLENPIVSAQNLLKLISNFSKVSGYKINVQKSQAFLYNNNRQTESQIMGELPFTIASKRIKYLGIQLTRDVKDLFKENYKPLLKEIKEDTNKWKNIPCSWVGRINIVKMAILPKVIYRFNAIPIKLPMTFFTELEKTTLKFIWNQKRARIAKSILSQKNKAGGITLPDFKLYYKATVTKTAWYWYQNRDIDQWNRTEPSEIMPHIYNYLIFDKPEKNKQWGKDSLLNKWCWENWLAICRKLKLDPFLTPYTKINSRWIKDLNVKPKTIKTLEENLGITIQDIGVGKDFMSKTPKAMATKDKIDKWDLIKLKSFCTAKETTIRVNRQPTTWEKIFATYSSDKGLISRIYNELKQIYKKKTNNPIKKWAKDMNRHFSKEDIYAAKKHMKKCSSSLAIREMQIKTTMRYHLTPVRMAIIKKSGNNRCWRGCGEIGTLVHCWWDCKLVQPLWKSVWRFLRDLELEIPFDPAIPLLGIYPKDYKSCCYKDTCTRMFIAALFTIAKTWNQPNCPTMIDWIKKMWHIYTMEYYAAIKNDEFISFVGTWMKLETIILSKLSQEQKTKHRIFSLIGGN

The full-length ORF2p is 1,275 amino acids long.

3.2 Reverse Translation

To determine a nucleotide sequence corresponding to the ORF2p amino acid sequence, I used Benchling’s reverse translation tool guided by human codon usage preferences. Because the genetic code is degenerate, multiple codons can encode the same amino acid, meaning that many distinct DNA sequences can theoretically encode ORF2p.

Reverse translation produces one valid open reading frame that corresponds to the protein sequence. At three nucleotides per codon, the 1,275 amino acids of ORF2p correspond to a coding sequence of 3,825 nucleotides (3,828 with a stop codon), or approximately 3.8 kilobases. At this stage, the sequence encodes the correct protein but is not yet optimized for efficient expression in a specific host system.
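The reverse-translation step and the length arithmetic can be illustrated with a short Python sketch. This is not Benchling's algorithm: it uses a naive "one preferred codon per amino acid" rule, and the `PREFERRED` table is my own assumption based on commonly cited high-frequency human codons. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumption: one frequently used human codon per amino acid (illustrative only).
PREFERRED = {
    "A": "GCC", "R": "AGA", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTG",
}


def back_translate(protein, stop="TGA"):
    """Return one of the many DNA sequences that encode `protein`, plus a stop codon."""
    return "".join(PREFERRED[aa] for aa in protein.upper()) + stop


def translate(dna):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def cds_length(n_amino_acids, with_stop=True):
    """Coding sequence length in nucleotides for a protein of the given length."""
    return 3 * n_amino_acids + (3 if with_stop else 0)


peptide = "MTGSNSHITIL"  # first 11 residues of the ORF2p sequence above
cds = back_translate(peptide)
print(cds, translate(cds) == peptide, cds_length(1275))
```

Translating the back-translated sequence and comparing it with the input is a cheap sanity check that the chosen codons really encode the intended protein.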

3.3 Codon Optimization

Although multiple DNA sequences can encode the same protein, not all synonymous codons are used equally across organisms. Codon optimization is therefore necessary to improve expression efficiency in a chosen host.

I optimized the ORF2 coding sequence for expression in human cells, specifically HEK293T cells, which are commonly used in LINE-1 retrotransposition assays. Codon optimization can improve translation efficiency and mRNA stability, and it reduces the likelihood of ribosomal pausing caused by rare codons. It also allows removal of undesirable sequence features such as cryptic splice sites, extreme GC content, internal polyadenylation signals, or problematic restriction enzyme recognition sites.

In addition, optimization can eliminate Type IIS restriction enzyme sites (such as BsaI or BsmBI), facilitating modular cloning strategies and downstream synthetic assembly.
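A simple post-optimization check for the features mentioned above might look like the following sketch. It scans both strands for the BsaI (GGTCTC) and BsmBI (CGTCTC) recognition sequences and reports overall GC content. The function names and the toy sequence are hypothetical, and a real optimizer would also check splice sites, polyadenylation signals, and local GC windows.

```python
# Type IIS recognition sequences (top strand, 5' to 3').
TYPE_IIS = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(seq, site):
    """Return sorted (position, strand) hits for a non-palindromic site on both strands."""
    seq = seq.upper()
    hits = []
    for motif, strand in ((site, "+"), (reverse_complement(site), "-")):
        start = seq.find(motif)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(motif, start + 1)
    return sorted(hits)


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def scan(seq):
    """Map each Type IIS enzyme name to its hits in `seq`."""
    return {name: find_sites(seq, site) for name, site in TYPE_IIS.items()}


# Toy sequence: a BsaI site on the top strand and a BsmBI site on the bottom strand.
toy = "ATGGGTCTCAAAGAGACGTT"
print(scan(toy), gc_content(toy))
```

Any hit inside the coding sequence can usually be removed by swapping one codon for a synonymous one, after which the scan is repeated.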

3.4 From DNA to Protein

Once synthesized, the codon-optimized ORF2 sequence can be expressed using either cell-dependent or cell-free systems.

In a cell-dependent system, the ORF2 coding sequence would be cloned into a mammalian expression plasmid under a strong promoter, such as the CMV promoter. Following transfection into HEK293T cells, the host RNA polymerase II transcribes the DNA into mRNA. The mRNA is then exported to the cytoplasm, where ribosomes translate it into ORF2 protein. The protein folds and, when expressed in the context of a full LINE-1 element, participates in retrotransposition via target-primed reverse transcription.

Alternatively, the optimized ORF2 sequence could be expressed in a cell-free transcription–translation system. In this case, the DNA template is transcribed in vitro into mRNA, which is subsequently translated into protein in a controlled biochemical environment. This approach enables direct characterization of ORF2 enzymatic activity, such as reverse transcriptase function, without the complexity of genomic integration.

Together, these approaches illustrate how a protein sequence can be reverse translated, optimized, synthesized, and ultimately expressed for functional investigation.

Part 4: Prepare a Twist DNA Synthesis Order

For this exercise, I designed a bacterial construct for expressing the human LINE-1 ORF2 protein in E. coli.

Insert Design (Benchling)

In Benchling, I created a linear DNA insert containing:

Constitutive promoter (BBa_J23106)
RBS (BBa_B0034)
Start codon (ATG)
Codon-optimized ORF2 coding sequence
C-terminal 7×His tag
Stop codon (TAA)
Double terminator (BBa_B0015)

Each region was annotated, and the final sequence was exported as a FASTA file.
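The insert layout can also be expressed programmatically, which makes it easy to confirm that the parts are in the right order and the open reading frame is in frame before ordering. The sketch below is an illustration only: the promoter, RBS, and terminator are runs of N with arbitrary lengths, and the ORF2 coding sequence is a three-codon stand-in. The real part sequences would come from the iGEM Registry and the optimized CDS. All names are hypothetical.

```python
# (label, sequence) in 5' to 3' order. Sequences marked "placeholder" are NOT the real parts.
PARTS = [
    ("promoter BBa_J23106", "N" * 10),            # placeholder, arbitrary length
    ("RBS BBa_B0034", "N" * 10),                  # placeholder, arbitrary length
    ("start codon", "ATG"),
    ("ORF2 CDS (codon-optimized)", "ACCGGCAGC"),  # placeholder: 3 codons only
    ("7xHis tag", "CAC" * 7),                     # CAC encodes histidine
    ("stop codon", "TAA"),
    ("terminator BBa_B0015", "N" * 10),           # placeholder, arbitrary length
]


def assemble(parts):
    """Concatenate parts; return the sequence and 1-based inclusive annotations."""
    sequence, annotations, position = [], [], 1
    for label, seq in parts:
        annotations.append((label, position, position + len(seq) - 1))
        sequence.append(seq)
        position += len(seq)
    return "".join(sequence), annotations


def to_fasta(name, seq, width=70):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


insert, annotations = assemble(PARTS)
coords = {label: (start, end) for label, start, end in annotations}
orf = insert[coords["start codon"][0] - 1 : coords["stop codon"][1]]
print(annotations)
print(to_fasta("ORF2_expression_cassette", insert))
```

Checking that `orf` starts with ATG, ends with TAA, and has a length divisible by three catches frame errors introduced when the tag is appended.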

Twist Order

On Twist, I selected:

Genes → Clonal Genes

I uploaded the nucleotide sequence and chose the pTwist Amp High Copy backbone, which provides:

Ampicillin resistance
High-copy origin of replication

This results in a circular plasmid ready for transformation into E. coli for ORF2 expression and purification.

Final Construct

The final plasmid contains the ORF2 expression cassette cloned into the pTwist Amp High Copy vector.

Plasmid map attached below:

Part 5: DNA Read, Write, & Edit

5.1 DNA Read

(i) What DNA would you want to sequence and why?

I would aim to sequence LINE-1 (L1) retrotransposon insertions in engineered human cell lines expressing synthetic L1-based gene writing systems. The objective would be to map insertion sites genome-wide, quantify insertion frequency, and detect potential off-target integrations or structural rearrangements. Because L1 retrotransposition can generate full-length insertions, 5′ truncations, and genomic rearrangements, detailed genomic characterization is essential to assess both efficiency and safety. Understanding the integration landscape at high resolution would provide insights into insertion bias, genomic context preferences, and risks associated with programmable genome writing systems.

(ii) What technology would you use and why?

To achieve this, I would use a hybrid sequencing strategy combining third-generation long-read sequencing (Oxford Nanopore Technologies) with second-generation short-read sequencing (Illumina). Nanopore sequencing is particularly suitable for detecting large insertions and structural variants because it generates long reads capable of spanning entire retrotransposon insertions together with their flanking genomic regions. Illumina sequencing would complement this approach by providing high base-calling accuracy and reliable validation of insertion breakpoints and small variants.

For Nanopore sequencing, the input would consist of high molecular weight genomic DNA extracted from HEK293T cells transfected with L1 constructs. Preparation includes DNA extraction preserving long fragments, optional size selection, end repair and dA-tailing, adapter ligation, and loading onto a flow cell. As DNA passes through a nanopore, changes in ionic current are measured and decoded by neural network-based base-calling algorithms.

For Illumina sequencing, genomic DNA would be fragmented, end-repaired, ligated to adapters, PCR amplified, and loaded onto a flow cell for cluster generation. Sequencing-by-synthesis uses fluorescently labeled reversible terminator nucleotides, and fluorescence signals are recorded at each cycle to determine the incorporated base.

The output includes FASTQ files containing base calls and quality scores, alignment files (SAM/BAM), and downstream analyses identifying insertion sites and structural variants.
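As a small illustration of what a FASTQ file contains, the sketch below parses four-line records and decodes the quality string, assuming the Phred+33 encoding used by current Illumina and Nanopore output. The function names and the two toy reads are hypothetical. Real pipelines would use an established parser and aligner.

```python
def parse_fastq(text):
    """Parse four-line FASTQ records into (read_id, sequence, quality_string) tuples."""
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        records.append((header[1:], seq, qual))
    return records


def phred_scores(qual):
    """Decode a Phred+33 quality string into integer Q scores."""
    return [ord(c) - 33 for c in qual]


def error_probabilities(qual):
    """Convert Q scores to per-base error probabilities: P = 10^(-Q/10)."""
    return [10 ** (-q / 10) for q in phred_scores(qual)]


# Two toy reads: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
toy_fastq = "@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!5\n"
for read_id, seq, qual in parse_fastq(toy_fastq):
    print(read_id, seq, phred_scores(qual))
```

A Q score of 20 corresponds to a 1% chance that the base call is wrong and Q40 to 0.01%, which is why short-read data are useful for confirming breakpoints called from long reads.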

5.2 DNA Write

(i) What DNA would you want to synthesize and why?

I would synthesize a full-length LINE-1 retrotransposition reporter construct designed to compare the activity of species-specific ORF2 sequences (for example, primate versus whale variants) in HEK293T cells. The construct would contain the canonical L1 architecture, including the 5′ UTR with its promoter, ORF1, and a species-specific ORF2 encoding the endonuclease and reverse transcriptase domains, followed by the 3′ UTR.

Within the 3′ UTR, I would place an eGFP-based retrotransposition indicator cassette in antisense orientation relative to L1. The cassette would be driven by a CMV promoter and interrupted by a γ-globin intron oriented in the sense direction of the L1 transcript, so that the intron can be spliced only from the L1 RNA. GFP is therefore expressed only after the L1 transcript has been spliced, reverse transcribed, and integrated into chromosomal DNA, which allows quantitative measurement of retrotransposition events. The plasmid backbone would include bacterial propagation elements and a puromycin resistance marker for mammalian cell selection.

The goal of synthesizing this construct is to systematically compare retrotransposition efficiency across different ORF2 variants while keeping the reporter system constant, isolating the contribution of ORF2 sequence variation.

(ii) What technology would you use and why?

Given the length and repetitive nature of the L1 construct, I would use solid-phase phosphoramidite oligonucleotide synthesis combined with enzymatic DNA assembly methods such as Gibson Assembly or Golden Gate cloning. Multi-kilobase constructs cannot be chemically synthesized in one piece: each coupling cycle is slightly less than 100% efficient, so the yield of full-length product falls and errors accumulate as the oligonucleotide grows. The sequence would therefore be divided into smaller fragments for chemical synthesis. These fragments would be assembled enzymatically into a plasmid backbone, transformed into bacteria for amplification, sequence-verified by full plasmid sequencing, purified, and transfected into HEK293T cells.
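The length limit and the fragment-based strategy can be made concrete with a short calculation. In the sketch below, the 99% per-cycle coupling efficiency, the 1,000 nt fragment size, and the 30 nt overlap are illustrative assumptions, not vendor specifications, and the function names are my own.

```python
def full_length_yield(length_nt, coupling_efficiency):
    """Fraction of strands that are full length after (length - 1) coupling cycles."""
    return coupling_efficiency ** (length_nt - 1)


def split_with_overlaps(seq, max_len, overlap):
    """Split `seq` into fragments of at most `max_len` nt that share `overlap` nt ends."""
    if overlap >= max_len:
        raise ValueError("overlap must be shorter than max_len")
    step = max_len - overlap
    fragments, start = [], 0
    while True:
        fragments.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            return fragments
        start += step


def reassemble(fragments, overlap):
    """Rejoin fragments by dropping the duplicated overlap from each later fragment."""
    return fragments[0] + "".join(f[overlap:] for f in fragments[1:])


# Yield of a 200-mer versus a full 3,828 nt CDS at 99% coupling efficiency (assumed).
print(full_length_yield(200, 0.99), full_length_yield(3828, 0.99))

# Split a toy 3,828 nt sequence into Gibson-style fragments with 30 nt overlaps.
toy = "ACGT" * 957
fragments = split_with_overlaps(toy, max_len=1000, overlap=30)
print(len(fragments), reassemble(fragments, 30) == toy)
```

Under these assumptions only about one strand in seven is full length for a 200-mer, and essentially none would be for the whole coding sequence, which is why short oligonucleotides are synthesized and then assembled. The shared ends between adjacent fragments are what Gibson Assembly uses to join them in order.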

Limitations of this approach include increased error rates with fragment length, instability of repetitive elements in bacterial hosts, assembly challenges in highly repetitive regions, and increasing cost with construct size and complexity.

5.3 DNA Edit

(i) What DNA would you want to edit and why?

I would edit both engineered LINE-1 ORF2 sequences and specific genomic loci in human cells to investigate determinants of retrotransposition efficiency and integration specificity. At the construct level, I would introduce targeted mutations into ORF2, including alterations within the endonuclease and reverse transcriptase domains or reconstructed ancestral variants, to identify residues and motifs that modulate activity.

At the genomic level, I would edit defined safe-harbor loci, such as AAVS1-like regions, to introduce landing pads or sequence contexts that might bias L1 integration. This would allow controlled testing of whether retrotransposition can be redirected or made more predictable, contributing to the development of programmable genome writing systems.

(ii) What technology would you use and why?

I would primarily use CRISPR-based genome editing technologies. For precise insertions or locus engineering, I would use CRISPR-Cas9 combined with homology-directed repair (HDR). For subtle sequence modifications within ORF2, base editing or prime editing would be preferable due to their ability to introduce targeted changes without generating double-strand breaks.

In the CRISPR-Cas9 approach, a guide RNA is designed to target a specific genomic locus. Cas9, delivered together with the guide RNA, cuts both DNA strands at the target site, creating a double-strand break. If a donor DNA template is provided, the cell can repair the break via HDR, incorporating the desired sequence. Base editors and prime editors instead use a Cas9 nickase or catalytically inactive Cas9 fused to an enzymatic domain (a deaminase for base editors, a reverse transcriptase for prime editors) that directly converts or rewrites specific nucleotides.

Preparation includes guide RNA design with off-target analysis, donor template design when required, preparation of CRISPR components, transfection into HEK293T cells, and validation using PCR and sequencing.
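The first step of guide RNA design, enumerating candidate target sites, can be sketched as follows. The code assumes SpCas9 with its NGG PAM and 20 nt protospacers, and it covers only site discovery: off-target scoring against the genome would require a dedicated tool. The function names and toy target are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_guides(seq, spacer_len=20):
    """List (protospacer, PAM, forward-strand start, strand) for every NGG PAM site."""
    seq = seq.upper()
    site_len = spacer_len + 3  # protospacer followed by a 3 nt PAM
    guides = []
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(template) - site_len + 1):
            pam = template[i + spacer_len : i + site_len]
            if pam[1:] == "GG":
                # Report where the 23 nt site begins in forward-strand coordinates.
                start = i if strand == "+" else len(seq) - (i + site_len)
                guides.append((template[i:i + spacer_len], pam, start, strand))
    return guides


# Toy target with a single forward-strand site (not a real genomic locus).
spacer = "GATTACAGATTACAGATTAC"
target = "AAAAA" + spacer + "TGG" + "AAAA"
print(find_guides(target))
```

Each candidate returned here would then be filtered by predicted off-target sites and by its distance from the intended edit before a donor template is designed.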

Limitations of CRISPR-based editing include variable HDR efficiency, potential off-target effects, locus-dependent variability, and technical challenges in achieving high efficiency for large insertions. Despite these constraints, CRISPR technologies offer the most versatile and precise framework for engineering and interrogating retrotransposon-based genome systems.