Week 2 HW: DNA Reading, writing and editing

Part 1

DNA reading, editing and design

Gel Art: Restriction Digests and Gel Electrophoresis

Opening an account in Benchling and importing Lambda

To open my Benchling account, I used my institutional e-mail.

To import the Lambda DNA, I visited the website provided and looked at the FASTA file. Then I selected DNA/RNA Sequence in Benchling and copy-pasted the FASTA information there.

Simulating Restriction Enzyme Digestion with the following Enzymes

For this exercise, I ran each digestion individually and then all of them together, in order to understand the whole process and how different digestions create different patterns in the digital gel. The results are the following:
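Outside Benchling, the logic of a simulated digest can be sketched in a few lines of Python. This is only a minimal sketch for a linear molecule: the three enzymes and the toy sequence are assumptions chosen for illustration (the recognition sites and cut offsets are the standard ones for EcoRI, HindIII and BamHI), and the function names are hypothetical.

```python
# Recognition site and top-strand cut offset for a few common enzymes.
# EcoRI: G^AATTC, HindIII: A^AGCTT, BamHI: G^GATCC (all palindromic sites).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "BamHI": ("GGATCC", 1),
}


def cut_positions(seq, site, offset):
    """Return every top-strand cut position of one enzyme in a linear sequence."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, enzyme_names):
    """Return fragment lengths (largest first) after cutting with all enzymes."""
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        cuts.update(cut_positions(seq, site, offset))
    bounds = [0] + sorted(cuts) + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


# Toy sequence (an assumption, not Lambda): one EcoRI site and one BamHI site.
toy = "AAAGAATTCTTTTGGATCCAA"
print(digest(toy, ["EcoRI"]))           # single digest
print(digest(toy, ["EcoRI", "BamHI"]))  # double digest
```

Sorting the fragment lengths from largest to smallest mirrors the order of the bands on a gel, from the top of the lane to the bottom.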

Part 2

Digestion art

I really like butterflies, and I thought that something symmetrical might be the easiest way to generate a proper art piece with gels. So I created this butterfly using the Lambda sequence:

Part 3

DNA design challenge

In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose.

Protein selection

For this, I selected the TAP1 protein, as it is important in the context of my work. I'm currently working with MHC I and II associated peptides in AML (Acute Myeloid Leukemia), and I'm trying to understand how two different drugs affect the transport of MHC I and II peptides for their presentation on the cell surface.

The sequence is:

>tr|A0A8V8TM76|A0A8V8TM76_HUMAN ABC-type antigen peptide transporter OS=Homo sapiens OX=9606 GN=TAP1 PE=1 SV=1

MASSRCPAPRGCRCLPGASLAWLGTVLLLLADWVLLRTALPRIFSLLVPTALPLLRVWAV

GLSRWAVLWLGACGVLRATVGSKSENAGAQGWLAALKPLAAALGLALPGLALFRELISWG

APGSADSTRLLHWGSHPTAFVVSYAAALPAAALWHKLGSLWVPGGQGGSGNPVRRLLGCL

GSETRRLSLFLVLVVLSSLGEMAIPFFTGRLTDWILQDGSADTFTRNLTLMSILTIASAV

LEFVGDGIYNNTMGHVHSHLQGEVFGAVLRQETEFFQQNQTGNIMSRVTEDTSTLSDSLS

ENLSLFLWYLVRGLCLLGIMLWGSVSLTMVTLITLPLLFLLPKKVGKWYQLLEVQVRESL

AKSSQVAIEALSAMPTVRSFANEEGEAQKFREKLQEIKTLNQKEAVAYAVNSWTTSISGM

LLKVGILYIGGQLVTSGAVSSGNLVTFVLYQMQFTQAVEVLLSIYPRVQKAVGSSEKIFE

YLDRTPRCPPSGLLTPLHLEGLVQFQDVSFAYPNRPDVLVLQGLTFTLRPGEVTALVGPN

GSGKSTVAALLQNLYQPTGGQLLLDGKPLPQYEHRYLHRQVAAVGQEPQVFGRSLQENIA

YGLTQKPTMEEITAAAVKSGAHSFISGLPQGYDTEVDEAGSQLSGGQRQAVALARALIRK

PCVLILDDATSALDANSQLQSLMKQRVCGEVLRMGNVGVLGVVSRASSDPVRWSSSCTKA

LSGTPAQCFSSPSTSAWWSRLTTSSFWKEALSGRGEPTSSSWRKRGATGPWCRLLQMLQN

ESLLRPAHSISLPFLLSVVENHSCRVGSCLQDELLEICLECVTSFPSSS

Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.

For this, the step-by-step process can be found in the image below, but the explanation is: I used the NCBI protein resources and selected BLAST. In BLAST, I selected tblastn, which compares a protein query against a nucleotide database translated in all six reading frames. I pasted the TAP1 FASTA sequence shown above and selected the Human RefSeqGene sequences (RefSeq_Gene) database. I ran the search and selected the first hit, as it had the best match. The nucleotide sequence (ORIGIN section) of this hit is:

1 ctcaggtgga gcagctcctg tacgaaagcc ctgagcggta ctcccgctca gtgcttctca

61 tcacccagca cctcagcctg gtggagcagg ctgaccacat cctctttctg gaaggaggcg

121 ctatccggga ggggggaacc caccagcagc tcatggagaa aaaggggtgc tactgggcca

181 tggtgcaggc tcctgcagat gctccagaat gaaagccttc tcagacctgc gcactccatc

241 tccctccctt ttcttctctc tgtggtggag aaccacagct gcagagtagg cagctgcctc

301 caggatgagt tacttgaaat ttgccttgag tgtgttacct cctttccaag ctcctcg
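As a sanity check, the nucleotide hit can be translated back with the standard genetic code and compared with the protein sequence. The sketch below is one way to do that; the helper names are hypothetical, and it assumes the sequence is pasted in the GenBank layout shown above (position numbers and spaces are stripped before translating).

```python
# Standard genetic code, built from the usual TCAG ordering of the codon table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def clean(genbank_text):
    """Keep only A/C/G/T, dropping position numbers, spaces and newlines."""
    return "".join(ch for ch in genbank_text.upper() if ch in "ACGT")


def translate(dna, frame=0):
    """Translate a DNA string in the given reading frame (0, 1 or 2)."""
    dna = clean(dna)
    protein = []
    for pos in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[pos:pos + 3]])
    return "".join(protein)


# First 30 bases of the line that starts at position 181 above.
print(translate("181 tggtgcaggc tcctgcagat gctccagaat"))  # WCRLLQMLQN
```

The printed peptide, WCRLLQMLQN, appears near the C-terminal end of the TAP1 sequence listed earlier, which suggests the hit is in frame with the protein.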

Codon optimization

Once a nucleotide sequence of your protein is determined, you need to codon optimize your sequence. You may, once again, utilize google for a “codon optimization tool”. In your own words, describe why you need to optimize codon usage. Which organism have you chosen to optimize the codon sequence for and why?

For this optimization, I used human (Homo sapiens) as the organism, as the cells I'm currently working with are cell lines derived from human patients, so this will be more useful in the context of my current work. The tool I used was Benchling, and the step-by-step process is in the figure below.
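To illustrate the idea behind codon optimization, here is a deliberately naive sketch that back-translates a protein by always picking one preferred codon per amino acid. The PREFERRED_CODON table is an assumption: it lists codons commonly cited as frequent in human genes and should be checked against a real codon usage table before use. Real optimizers usually also balance GC content and avoid repeats and unwanted restriction sites, which this sketch does not attempt.

```python
# Assumed "preferred" human codon per amino acid (illustrative only; verify
# against a real codon usage table before relying on any of these choices).
PREFERRED_CODON = {
    "A": "GCC", "R": "AGA", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTG",
    "*": "TGA",
}


def back_translate(protein):
    """Return a DNA sequence that encodes the protein with one codon per residue."""
    return "".join(PREFERRED_CODON[aa] for aa in protein.upper())


# First ten residues of the TAP1 sequence above.
dna = back_translate("MASSRCPAPR")
print(dna, len(dna))
```

Because every amino acid maps to exactly one codon here, the output is always three times the protein length and encodes the same protein as the input.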

You have a sequence! Now what?

What technologies could be used to produce this protein from your DNA? Describe in your own words how the DNA sequence can be transcribed and translated into your protein. You may describe either cell-dependent or cell-free methods, or both.

To produce the selected protein (TAP1) in good yield, I would choose a cell-dependent approach: recombinant expression in mammalian cells such as HEK293, CHO or HeLa. I would clone the codon-optimized cDNA into an expression vector with a strong promoter such as CMV, deliver it by transfection (lipofection or electroporation), and use either stable selection or transient expression.

There are other approaches, such as the baculovirus expression system or bacterial expression in E. coli, but they have drawbacks, as some of the extraction practices are incompatible with further experimentation with AML cell lines.

How does the DNA sequence become TAP1? The first step is transcription. The optimized TAP1 DNA sequence is inserted into an expression vector under the control of a strong promoter (CMV, for example). Inside the cell, RNA polymerase binds to the promoter, reads the DNA template strand in the 3' to 5' direction and synthesizes a complementary mRNA strand in the 5' to 3' direction. In eukaryotic systems, the pre-mRNA undergoes 5' capping, polyadenylation and splicing. Then the mature mRNA is exported to the cytoplasm. The second step is translation. The ribosome binds to the 5' cap and scans for the start codon (AUG), tRNAs match codons with anticodons, and peptide bonds are formed at the peptidyl transferase center of the ribosome. The growing polypeptide emerges co-translationally. After this step, the protein folds, associates with TAP2 and becomes part of the antigen-processing machinery.
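The direction rules in the transcription step can be made concrete with a small sketch. It assumes the template strand is written in the 3' to 5' direction, so the mRNA comes out 5' to 3' simply by complementing each base; the function names and the toy sequences are hypothetical.

```python
# DNA template base -> complementary RNA base.
COMPLEMENT_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5):
    """Template strand written 3'->5' in, mRNA written 5'->3' out."""
    return "".join(COMPLEMENT_RNA[base] for base in template_3to5.upper())


def find_start(mrna):
    """Index of the first AUG (a simplified view of ribosome scanning), or -1."""
    return mrna.find("AUG")


mrna = transcribe("CCTACGGT")  # toy template, read 3'->5'
print(mrna)                    # GGAUGCCA
print(find_start(mrna))        # 2
```

Everything downstream of the AUG would then be read three bases at a time by the ribosome.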

Part 4

Prepare a Twist DNA Synthesis Order

By following the instructions, I created my expression cassette. I then followed the instructions to create the Twist file for the protein of interest and generated the final product. You can find the final construct here. Also, you can see the results in the figures below.

Part 5

DNA Read/Write/Edit

DNA Read

(i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank).

My main DNA of interest is genes coding for tumor-associated antigens, in order to be able to quickly diagnose cancer in some patients. At the same time, I would also love to do rapid bacterial DNA sequencing (coliforms) in remote places like the countryside of my country (Guatemala) for rapid diagnosis of infections. Diarrhea is still one of the top causes of infant death in my country and normally comes from undiagnosed infections.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why? Also answer the following questions:

Is your method first-, second- or third-generation or other? How so?
What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.
What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?
What is the output of your chosen sequencing technology?

Because I am interested in two different applications — (1) tumor-associated antigen (TAA) genes for cancer diagnostics and (2) rapid detection of coliform bacteria in remote areas of Guatemala — I would choose two complementary sequencing technologies, each optimized for a different context: high-accuracy clinical genomics and portable field diagnostics.

For cancer-related sequencing, I would use Illumina Sequencing-by-Synthesis (SBS), widely implemented in platforms like NovaSeq or NextSeq. This is a second-generation sequencing technology. The input would be genomic DNA from a tumor biopsy or cfDNA from a liquid biopsy.

The steps needed to do this are:

DNA extraction
Fragmentation (mechanical or enzymatic)
End repair and A-tailing
Adapter ligation
PCR amplification (optional depending on protocol)
Cluster generation on flow cell (bridge amplification)

Illumina technology works by sequencing-by-synthesis with reversible terminators. Each nucleotide carries a fluorescent tag and a reversible terminator that blocks further extension (in four-color chemistry, each base has a different color). In each cycle, one base is incorporated and the flow cell is imaged. The fluorescent signal identifies the incorporated base, and then the terminator and dye are cleaved. This cycle repeats until the read is complete; base calling is performed by detecting the fluorescence at each cycle.
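The per-cycle base calling described above can be sketched as picking the brightest channel in each imaging cycle. This is a toy model assuming a four-channel readout with one intensity per base; the intensity values are made up for illustration, and real base callers also correct for effects such as phasing and channel cross-talk.

```python
CHANNELS = "ACGT"  # assumed channel order: one fluorescence channel per base


def call_bases(cycles):
    """Call one base per cycle by taking the channel with the highest intensity."""
    read = []
    for intensities in cycles:
        brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Hypothetical intensities for one cluster over three cycles (A, C, G, T).
cluster = [
    (0.90, 0.10, 0.05, 0.02),
    (0.10, 0.05, 0.80, 0.10),
    (0.02, 0.03, 0.10, 0.95),
]
print(call_bases(cluster))  # AGT
```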

As an output, we obtain millions to billions of short reads as FASTQ files, which support high-accuracy variant detection. This is what makes the technology suitable for identifying mutations in TAA genes: it is well suited to detecting SNVs and low-frequency tumor mutations with high-quality sequencing.

For rapid bacterial DNA sequencing in remote areas, I think the best technology to use is the Oxford Nanopore Technologies MinION. The input for this technology is bacterial genomic DNA from water samples, either as 16S amplicons or as whole metagenomic DNA. This is a third-generation sequencing technology, as it sequences single molecules, doesn't require clonal amplification, can read very long DNA fragments and works in real time.

To do this, a library must first be prepared (the details depend on the protocol used). Library preparation involves DNA extraction with a rapid field kit, optional PCR amplification of the 16S region, adapter ligation, and loading the library onto a flow cell.
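Since FASTQ is the output format, a short sketch of how a record is read may help. It assumes the common four-line record layout with Phred+33 quality encoding, where a quality score Q corresponds to an error probability of 10^(-Q/10); the read name and sequence in the example are made up.

```python
def parse_fastq(text):
    """Yield (name, sequence, phred_scores) for each 4-line FASTQ record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Phred+33: quality = ASCII code of the character minus 33.
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


record = "@read1\nACGT\n+\nII5!\n"  # hypothetical record
for name, seq, quals in parse_fastq(record):
    print(name, seq, quals)  # read1 ACGT [40, 40, 20, 0]
    print([error_probability(q) for q in quals])
```

A score of 40 means roughly one error in 10,000 calls, which is the kind of per-base accuracy that low-frequency variant detection relies on.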

The steps for this sequencing technology are:

A DNA molecule passes through a protein nanopore.
An electric current runs across the pore.
Each k-mer alters the ionic current in a characteristic (unique) way and this current change is measured, which leads to the detection of the specific base.
Machine learning algorithms are used to decode the current signals into base sequences, which allows a direct reading of the DNA in real time, detection of modified bases, and generating very long reads.
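The k-mer idea in the steps above can be illustrated with a toy model. In this sketch every 2-mer is assigned its own made-up current level, a sequence is turned into a series of levels, and decoding picks the nearest level for each measurement and stitches the overlapping k-mers back together. The levels, the value of K and the function names are all assumptions; real basecallers use neural networks on noisy raw signal with longer k-mers, so this only shows the principle.

```python
from itertools import product

K = 2
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
# Hypothetical current level (in pA) for each k-mer, spaced 5 pA apart.
LEVEL = {kmer: 60.0 + 5.0 * i for i, kmer in enumerate(KMERS)}


def simulate_signal(seq):
    """One idealized current value per overlapping k-mer passing through the pore."""
    return [LEVEL[seq[i:i + K]] for i in range(len(seq) - K + 1)]


def decode(signal):
    """Map each measurement to the nearest k-mer level and stitch the k-mers."""
    kmers = [min(KMERS, key=lambda km: abs(LEVEL[km] - s)) for s in signal]
    if not kmers:
        return ""
    return kmers[0] + "".join(km[-1] for km in kmers[1:])


signal = simulate_signal("GATTACA")
noisy = [s + 1.0 for s in signal]  # small constant offset as stand-in noise
print(decode(noisy))  # GATTACA
```

With levels 5 pA apart, an offset of 1 pA still decodes correctly; larger noise would start to confuse neighbouring k-mers, which is one reason real basecalling needs statistical models.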

As an output, we obtain long reads, real-time data, FASTQ files and immediate taxonomic classification. This is ideal for coliform detection, as it allows rapid pathogen detection and field use and can work in low-infrastructure environments.

DNA Write

(i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :).

In line with what I would like to accomplish in the future, I would like to synthesize a genetic sensing circuit capable of detecting if a tumor-associated-antigen peptide in an RNA-based cancer vaccine is being correctly expressed and presented on MHC class I (or II) molecules. The specific purpose of this construct would be to function as a biological validation system and confirm the correct antigen processing and presentation of the antigen of interest. This would help labs (like mine) as a quality-control step and quantify the functional antigen presentation of the peptide of interest.

(ii) What technology or technologies would you use to perform this DNA synthesis and why? Also answer the following questions: 1. What are the essential steps of your chosen synthesis methods? 2. What are the limitations of your synthesis method (if any) in terms of speed, accuracy, scalability?