Week 2 HW: DNA Read, Write, and Edit

Homework

Part 0: Basics Of Gel Electrophoresis

Attended lecture and watched recitation video

Part 1: Benchling & In-silico Gel Art

Link to Benchling Project , not sure how can see this link I asked to join HTGAA group but doesn’t seem like my invite was accepted yet?
My drawing of an “E Gel Person” and associated enzymes in each lane are also in screenshot below

Part 2: Gel Art - Restriction Digests and Gel Electrophoresis

I don’t have access to these enzymes and DNA in my local makerspace lab.

Part 3: DNA Design Challenge

Protein Choice

For the frugal bioreactor project I thought it would be ideal to have an organism and protein combination that are easily detected to help people (including myself) learn how to do yield optimization experiments. To that end I am choosing a protein that has a easily visible color, so intensity of color can be used to judge protein expression. To find a list of proteins I

Asked Gemini “Are there proteins that have visible color and can be expressed by e coli?”.
Selected EforRed from the list, since it will probably be easiest to measure intentisity of primary colors
Found EforRead in FPBase though common alias is apparently eforCP. There is spectrum data on FPBase also which could be useful for calibration and detection.

1 MSVIKQVMKT KLHLEGTVNG HDFTIEGKGE GKPYEGLQHM KMTVTKGAPL PFSVHILTPS HMYGSKPFNK YPADIPDYHK 
81 QSFPEGMSWE RSMIFEDGGV CTASNHSSIN LQENCFIYDV KFHGVNLPPD GPVMQKTIAG WEPSVETLYV RDGMLKSDTA 
161 MVFKLKGGGH HRVDFKTTYK AKKPVKLPEF HFVEHRLELT KHDKDFTTWD QQEAAEGHFS PLPKALP

Reverse Translate

Encdoe the protein as a DNA string. There are lots of tools to this, but for fun I just wrote python code using a mapping from RNA to Amino Acids I took out of some old Rosalind code that mapped RNA to proteins.

# Want to find a DNA encoding of a protein
# This is RNA to DNA mapping from Rosalind
prot_string = """UUU F             
UUC F             
UUA L             
UUG L             
UCU S             
UCC S             
UCA S             
UCG S             
UAU Y             
UAC Y             
UAA Stop          
UAG Stop          
UGU C             
UGC C             
UGA Stop          
UGG W             
CUU L 
CUC L 
CUA L 
CUG L 
CCU P 
CCC P 
CCA P 
CCG P 
CAU H 
CAC H 
CAA Q 
CAG Q 
CGU R 
CGC R 
CGA R 
CGG R 
AUU I 
AUC I 
AUA I 
AUG M 
ACU T 
ACC T 
ACA T 
ACG T 
AAU N 
AAC N 
AAA K 
AAG K 
AGU S 
AGC S 
AGA R 
AGG R 
GUU V
GUC V
GUA V
GUG V
GCU A
GCC A
GCA A
GCG A
GAU D
GAC D
GAA E
GAG E
GGU G
GGC G
GGA G
GGG G"""

# Do the reverse mapping just let the last on mapped win
prot_rna_map = {x.split(" ")[1]: x.split(" ")[0] for x in prot_string.split('\n')}
# skip spaces
prot_rna_map[" "] = ""
print(len(prot_rna_map), prot_rna_map)

my_protein = """MSVIKQVMKT KLHLEGTVNG HDFTIEGKGE GKPYEGLQHM KMTVTKGAPL PFSVHILTPS HMYGSKPFNK YPADIPDYHK QSFPEGMSWE RSMIFEDGGV CTASNHSSIN LQENCFIYDV KFHGVNLPPD GPVMQKTIAG WEPSVETLYV RDGMLKSDTA MVFKLKGGGH HRVDFKTTYK AKKPVKLPEF HFVEHRLELT KHDKDFTTWD QQEAAEGHFS PLPKALP"""
my_rna = "".join([prot_rna_map[l] for l in my_protein])
print("-----")
print(my_rna)

my_dna = my_rna.replace("U", "T")
print("-----")
print(my_dna)

The result I got (out of the many possible encodings) was:

ATGAGCGTGATAAAGCAGGTGATGAAGACGAAGCTGCACCTGGAGGGGACGGTGAACGGGCACGACTTCACGATAGAGGGGAAGGGGGAGGGGAAGCCGTACGAGGGGCTGCAGCACATGAAGATGACGGTGACGAAGGGGGCGCCGCTGCCGTTCAGCGTGCACATACTGACGCCGAGCCACATGTACGGGAGCAAGCCGTTCAACAAGTACCCGGCGGACATACCGGACTACCACAAGCAGAGCTTCCCGGAGGGGATGAGCTGGGAGAGGAGCATGATATTCGAGGACGGGGGGGTGTGCACGGCGAGCAACCACAGCAGCATAAACCTGCAGGAGAACTGCTTCATATACGACGTGAAGTTCCACGGGGTGAACCTGCCGCCGGACGGGCCGGTGATGCAGAAGACGATAGCGGGGTGGGAGCCGAGCGTGGAGACGCTGTACGTGAGGGACGGGATGCTGAAGAGCGACACGGCGATGGTGTTCAAGCTGAAGGGGGGGGGGCACCACAGGGTGGACTTCAAGACGACGTACAAGGCGAAGAAGCCGGTGAAGCTGCCGGAGTTCCACTTCGTGGAGCACAGGCTGGAGCTGACGAAGCACGACAAGGACTTCACGACGTGGGACCAGCAGGAGGCGGCGGAGGGGCACTTCAGCCCGCTGCCGAAGGCGCTGCCG

Codon Optimization

For a given protein string of amino acids there are many possible DNA codes that could generate that protein. While all these codes are equivalent in an abstract sense in reality the RNA machinery like ribosomes and T-RNA in each organism is differen and the physical characterstics of equivalent codons can be different which can impact an organism’s ability to actual translate the DNA into a protein. For example the organism could be more efficient with certain T-RNA encodings than others impacting the rate or expresion or an organism could give special meanings to sequences of codons that it doesn’t use in its own proteiens. To avoid this you want to optimize your encoding for expression in your chosen organism.

I tried to optimize my sequence on twist, but I got 404 whenever I went to their codon optimization calculator. Instead I googled for codon optimization calculator and used another free one, VectorBuilder to get below optimization. I used E. Coli as my organism and didnt’ specify any restriction enzymes to avoid:

Pasted Sequence: GC=62.26%, CAI=0.60

ATGAGCGTGATAAAGCAGGTGATGAAGACGAAGCTGCACCTGGAGGGGACGGTGAACGGGCACGACTTCACGATAGAGGGGAAGGGGGAGGGGAAGCCGTACGAGGGGCTGCAGCACATGAAGATGACGGTGACGAAGGGGGCGCCGCTGCCGTTCAGCGTGCACATACTGACGCCGAGCCACATGTACGGGAGCAAGCCGTTCAACAAGTACCCGGCGGACATACCGGACTACCACAAGCAGAGCTTCCCGGAGGGGATGAGCTGGGAGAGGAGCATGATATTCGAGGACGGGGGGGTGTGCACGGCGAGCAACCACAGCAGCATAAACCTGCAGGAGAACTGCTTCATATACGACGTGAAGTTCCACGGGGTGAACCTGCCGCCGGACGGGCCGGTGATGCAGAAGACGATAGCGGGGTGGGAGCCGAGCGTGGAGACGCTGTACGTGAGGGACGGGATGCTGAAGAGCGACACGGCGATGGTGTTCAAGCTGAAGGGGGGGGGGCACCACAGGGTGGACTTCAAGACGACGTACAAGGCGAAGAAGCCGGTGAAGCTGCCGGAGTTCCACTTCGTGGAGCACAGGCTGGAGCTGACGAAGCACGACAAGGACTTCACGACGTGGGACCAGCAGGAGGCGGCGGAGGGGCACTTCAGCCCGCTGCCGAAGGCGCTGCCG

Improved DNA[1]: GC=51.84%, CAI=0.95

ATGAGCGTGATTAAACAGGTAATGAAAACCAAACTGCATCTGGAAGGCACGGTGAACGGCCATGATTTCACCATTGAAGGCAAAGGCGAAGGCAAACCGTATGAAGGCCTGCAGCATATGAAAATGACCGTGACCAAAGGCGCGCCGCTGCCGTTTAGCGTGCATATTCTGACGCCGAGCCACATGTACGGCAGCAAACCGTTTAACAAATACCCGGCGGACATCCCGGATTATCACAAACAAAGCTTCCCGGAAGGCATGAGCTGGGAACGTAGCATGATCTTCGAGGATGGCGGCGTGTGCACTGCGAGCAATCATAGCAGCATTAACCTGCAGGAAAACTGCTTCATTTACGATGTGAAATTTCATGGCGTGAATCTGCCGCCGGATGGTCCGGTGATGCAGAAAACCATTGCGGGCTGGGAACCGAGCGTGGAAACCCTGTACGTGCGTGATGGCATGCTGAAAAGCGATACCGCCATGGTGTTTAAACTGAAAGGCGGCGGCCATCATCGCGTGGACTTCAAAACCACCTATAAAGCGAAAAAACCGGTGAAACTGCCGGAATTTCACTTTGTGGAACATCGCCTGGAACTGACCAAACATGATAAAGATTTCACCACCTGGGATCAGCAGGAAGCGGCGGAAGGCCATTTTTCACCGCTGCCGAAAGCGCTGCCG

I knew GC was ratio of those bases in the sequence which varies form species to species and can impact structure, but I didn’t know what CAI was so I googled it and learned it is the “Codon Adaption Index”, which is a measure for a specific codon of how likely your codon’s are to be the organism’s preferred choice in terms of T-RNA frequency.

Now What

I still have a lot to learn about how to express proteins, but at a high level I know with cell culture

Pick an organism to express, I will asumme it is a bacteria like E. Coli
Manufacture this DNA segment (with a service like twisted) in side a plasmid in a way that

The plasmid has some other trait like antibiotic resistance that we can use to select for bacteria that have taken up the plasmid
Some promoter that we can use to make sure our protein is expressed.

Use some kind of transformation method to get the plasmid inside of the bacteria
Select for bacteria that have the plasmid by culturing and then putting on antibiotic gel and selecting survivors
Culture and grow these bacteria at scale and triggering conditions (if any) for our protein to be expressed.

Part 4: Twist Synthesis Order

Followed the homework directions pretty closely here using the same prefix/suffix sequences they recommend even the purification histadines though I probably don’t need to extract my color protein? This is my linear sequence.

After that I followed the instructions to create a twist order, downloaded in genebank format, and imported plasmid into benchling

I used Gemini freely to understand Benchling and standard practices in DNA construction and editing. Prompts where:

“IN benchlig once I create a DNA sequence, can I add more DNA to it in benchling?”
“When I right click it asks me if I want to insert bases or parts. What is the difference?”
“When we talk about DNA sequences are we typically using the 5 to 3 or 3 to 5 strand? What is each called and which one is the primary thing we are editing and viewing in benchling?”
Is it best practice to separately annotate start and stop codons on their strands?

Part 5: DNA Read, Write, Edit

DNA Read

What DNA?

If I could sequence and DNA I would sequence Valonia Ventriscosa which is an where a single cell (multiple nuclei) that can grow to be centimeters across! My local DIY bio lab has just started working on doing synthetic biology with them and the first step would be to sequence the entire genome which hasn’t been done yet.

What Sequencing?

I think to assmeble the overall structure of the genome from scratch we need the longest possible reads, so we would want to use a third generation sequencing method like nanapore for that. Once we have the overall structure we might want to use a high-bandwidth sequencing like illumini to get better resolution and coverage, including how much variation there is between indviduals.
To get started we need a decent amount of DNA. This is roughly the amount in 1 million nuclei? We have estimated (assuming that a 1mm cell has ~1000 nuclei) that we need around 1000 1mm cells. Once we have the cells we would need to separate out the DNA from the rest of the cell.
Since the lecture didn’t go into much detail on prep with a nanapore I asked gemini “What kind of preparation do you need to use a nanapore sequencer”. The summary is:
- Gentle separation of DNA so you keep the strands long
- High purity (non phenols/salts)
- Repair the DNA ends so taht they are blunt and have a single A overhang
- Attach a motor protein to DNA to move strand through the membrane.
- Setup the flow cell with buffer
- Put your sample in
Again I didn’t know the output format so I asked Geminin “What is the raw output format of nanapore”. Summary is
- The data is large (can be a terabyte)
- Nanapore recrods the raw signal as electrical voltage changes as DNA goes through pore. The wiggles are stored in in binary files in either POD4 or FAST5 format
- The DNA sequence implied from raw electrical is also output as FASTQ files

DNA Write

If I was going to synthesize DNA independent of editing I would encode wikipedia as DNA. In particular, it would be fun to encode the main text of the wikipedia page about DNA as a DNA strand. Using wc shell utility on a copy of the file I see it is about 64K bytes, which is something that could be encoded as DNA.
I checked with Gemini “What are the abilities and limitations of cell-free DNA assembly? How long can strands be?” and it looks like this size is possible to do with cell-free assmbly and pushing the upper end. I think from the lecture to pull this off you would need to synthesize small pieces (in the range of 500 bps) with olgio synthesis so that they have correctly overlapping ends and then repeatedly use Gibson assembly. If you arrange the reactions you can take advantage of the fact that fragments double in size to do this in 9 rounds of repeated glueing fromt the intiail 128 segements you start with

DNA Edit

If I had the capability to do any DNA edits I wanted, I would want to try to edit a Eukaroyte like algae or yeast do genetically support symbosis with some kind of hydrogen oxidizing bacteria. This would be a massive amount of editing to both the bacteria and the host, but would unlock the ability to directly produce interesting products at scale with only water, sunglight, and atomsphere to produce proteins at scale. For example it would enable things like celluar agiculture and space habitats to functions without existing agriculture inputs.
This is such a massive edit I am not even sure what technologies would work. On the bacteria side you would probalby want to use the edit techniques used in the construction of the minimal living cell though you would want to possibly cut even deeper because you want to make sure some of the bacterias fundamental functions are produced by the host. My earlier Gemini query on cell-free assembly volunteered that the Minimal Bacterial Genom project used a combination of cell-free assembly and yeast-based (TAR Cloning). I don’t know anything about TAR cloning so I asked Gemini “Can you tell me more about TAR cloning?” and it sounds like it the pefect thing to do large scale and complex construction and edits of DNA, so it is probably what would end up being used.