Week 2 HW: DNA Read, Write, and Edit

Part 1: Benchling Gel Art

Virtual restriction digest of Lambda DNA (J02459) using EcoRI, HindIII, BamHI, PstI, SalI, and XhoI, visualized in NEBcutter.

Part 3: DNA Design Challenge

3.1 Choose Your Protein

I chose endoglucanase A (CelCCA) from the bacterium Clostridium cellulolyticum. This organism has since been reclassified as Ruminiclostridium cellulolyticum. The crystal structure of its catalytic domain is in the PDB as entry 1EDG. It was solved at 1.6 angstrom resolution.

Why this protein? Cellulases break down cellulose, the most abundant organic polymer on Earth. They are important for biomass conversion, biofuel production, and the paper and textile industries. CelCCA belongs to glycosyl hydrolase family 5. It folds into a classic (alpha/beta)8 TIM barrel, one of the most common enzyme folds in nature. The protein has clear industrial relevance. It also has a high-resolution structure that will be useful for computational analysis in later weeks.

The protein sequence comes from the RCSB PDB (entry 1EDG, Chain A, UniProt: P17901). The catalytic domain is 380 amino acids long:

>pdb|1EDG|A Endoglucanase A catalytic domain, Ruminiclostridium cellulolyticum H10
MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG
IKTTKQMIDAIKQKGFNTVRIPVSWHPHVSGSDYKISDVWMNRVQEVVNYCIDNKMYVIL
NTHHDVDKVKGYFPSSQYMASSKKYITSVWAQIAARFANYDEHLIFEGMNEPRLVGHANE
WWPELTNSDVVDSINCINQLNQDFVNTVRATGGKNASRYLMCPGYVASPDGATNDYFRMP
NDISGNNNKIIVSVHAYCPWNFAGLAMADGGTNAWNINDSKDQSEVTWFMDNIYNKYTSR
GIPVIIGECGAVDKNNLKTRVEYMSYYVAQAKARGILCILWDNNNFSGTGELFGFFDRRS
CQFKFPEIIDGMVKYAFGLIN

3.2 Reverse Translate: Protein to DNA

I converted the 380-amino-acid sequence back into DNA. The genetic code is degenerate, meaning most amino acids can be encoded by multiple codons. There is no single “correct” reverse translation. I used the most common E. coli codons for each amino acid.

The resulting coding sequence is 1,140 bp (380 aa x 3 nt/aa). Adding a TAA stop codon gives 1,143 bp total:

ATGTATGATGCGAGCCTGATTCCGAACCTGCAGATTCCGCAGAAAAACATTCCGAACAAC
GATGGTATGAACTTCGTGAAAGGTCTGCGTCTGGGTTGGAACCTGGGTAACACCTTCGAT
GCGTTCAACGGTACCAACATTACCAACGAACTGGATTATGAAACCAGCTGGAGCGGTATT
AAAACCACCAAACAGATGATTGATGCGATTAAACAGAAAGGTTTCAACACCGTGCGTATT
CCGGTGAGCTGGCATCCGCATGTGAGCGGTAGCGATTATAAAATTAGCGATGTGTGGATG
AACCGTGTGCAGGAAGTGGTGAACTATTGCATTGATAACAAAATGTATGTGATTCTGAAC
ACCCATCATGATGTGGATAAAGTGAAAGGTTATTTTCCGAGCAGCCAGTATATGGCGAGC
AGCAAAAAATATATTACCAGCGTGTGGGCGCAGATTGCGGCGCGTTTCGCGAACTATGAT
GAACATCTGATTTTCGAAGGTATGAACGAACCGCGTCTGGTGGGTCATGCGAACGAATGG
TGGCCGGAACTGACCAACAGCGATGTGGTGGATAGCATTAACTGCATTAACCAGCTGAAC
CAGGATTTCGTGAACACCGTGCGTGCGACCGGTGGTAAAAACGCGAGCCGTTATCTGATG
TGCCCGGGTTATGTGGCGAGCCCGGATGGTGCGACCAACGATTATTTCCGTATGCCGAAC
GATATTAGCGGTAACAACAACAAAATTATTGTGAGCGTGCATGCGTATTGCCCGTGGAAC
TTCGCGGGTCTGGCGATGGCGGATGGTGGTACCAACGCGTGGAACATTAACGATAGCAAA
GATCAGAGCGAAGTGACCTGGTTCATGGATAACATTTATAACAAATATACCAGCCGTGGT
ATTCCGGTGATTATTGGTGAATGCGGTGCGGTGGATAAAAACAACCTGAAAACCCGTGTG
GAATATATGAGCTATTATGTGGCGCAGGCGAAAGCGCGTGGTATTCTGTGCATTCTGTGG
GATAACAACAACTTCAGCGGTACCGGTGAACTGTTCGGTTTCTTCGATCGTCGTAGCTGC
CAGTTCAAATTCCCGGAAATTATTGATGGTATGGTGAAATATGCGTTCGGTCTGATTAAC
TAA
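
As a rough illustration, the reverse translation could be sketched in Python as follows. The codon table is the one implied by the sequence above (one preferred E. coli codon per amino acid). The function name `reverse_translate` is hypothetical, and real codon usage tables differ slightly between strains and gene sets.

```python
# One preferred E. coli codon per amino acid, as used in the sequence above.
# This table is an assumption for illustration, not an authoritative usage table.
PREFERRED_CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTC", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def reverse_translate(protein: str, stop: str = "TAA") -> str:
    """Map each amino acid to its preferred codon and append a stop codon."""
    return "".join(PREFERRED_CODON[aa] for aa in protein.upper()) + stop


# First ten residues of the CelCCA catalytic domain.
peptide = "MYDASLIPNL"
dna = reverse_translate(peptide)
print(dna)       # ATGTATGATGCGAGCCTGATTCCGAACCTGTAA
print(len(dna))  # 33 (10 codons + stop)
```

Applied to the full 380-residue sequence, the same function returns 1,143 nt, since every residue contributes exactly one codon.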

3.3 Codon Optimization

Why codon optimization is needed: Different organisms prefer different codons for the same amino acid. A codon that is common in C. cellulolyticum might be rare in E. coli. Rare codons cause the ribosome to stall because the matching tRNA is scarce. This slows translation and reduces protein yield. Codon optimization swaps rare codons for ones the host uses frequently. This keeps the ribosome moving and increases output.

Other factors also matter. Stable mRNA hairpins near the start codon can block ribosome binding. Extreme GC content reduces expression. Certain restriction enzyme sites need to be removed for cloning compatibility.

Organism chosen: I optimized for Escherichia coli K-12. E. coli is the standard host for recombinant protein production. It grows fast, is cheap to culture, and has well-characterized genetics. CelCCA has been successfully expressed in E. coli before (Fierobe et al., 1991), so this is a validated choice.

The optimized sequence uses the highest-frequency E. coli codon for each amino acid. GC content is 47.4%, which falls in the ideal range of 40-60% for E. coli. The sequence does not contain recognition sites for common Type IIs restriction enzymes like BsaI, BsmBI, or BbsI. This keeps the sequence compatible with Golden Gate cloning.
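
The GC content and restriction-site checks described above can be reproduced with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, the recognition sequences are the standard ones for BsaI, BsmBI, and BbsI, and both strands are scanned because a site on the reverse strand also gets cut.

```python
# Recognition sequences (5'->3') for the Type IIS enzymes named above.
TYPE_IIS_SITES = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC", "BbsI": "GAAGAC"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def find_sites(seq: str, sites: dict = TYPE_IIS_SITES) -> dict:
    """Return {enzyme: [0-based positions]} for sites on either strand."""
    seq = seq.upper()
    hits = {}
    for name, site in sites.items():
        positions = []
        # A site on the reverse strand appears as its reverse complement here.
        for pattern in {site, reverse_complement(site)}:
            start = seq.find(pattern)
            while start != -1:
                positions.append(start)
                start = seq.find(pattern, start + 1)
        if positions:
            hits[name] = sorted(positions)
    return hits


# First 60 nt of the coding sequence from section 3.2.
fragment = "ATGTATGATGCGAGCCTGATTCCGAACCTGCAGATTCCGCAGAAAAACATTCCGAACAAC"
print(f"GC content: {gc_content(fragment):.1f}%")
print("Type IIS sites:", find_sites(fragment) or "none found")
```

Running the same two functions on the full optimized sequence would be one way to confirm the GC figure and the absence of Golden Gate sites.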

3.4 You Have a Sequence! Now What?

I would produce CelCCA using a cell-based expression system in E. coli. Here is the process:

Step 1: Build an expression construct. The codon-optimized gene goes into an expression cassette. The cassette has a promoter, a ribosome binding site (RBS), a start codon, the coding sequence, a His-tag, a stop codon, and a terminator. This cassette is cloned into a plasmid vector with an antibiotic resistance gene and an origin of replication.

Step 2: Transform into E. coli. The plasmid is introduced into competent E. coli cells through heat shock or electroporation. Cells that take up the plasmid survive on antibiotic selection plates.

Step 3: Transcription. RNA polymerase binds to the promoter. It reads the template DNA strand from 3’ to 5’ and builds a complementary mRNA strand from 5’ to 3’. Thymine (T) in DNA becomes uracil (U) in the mRNA. With a constitutive promoter like BBa_J23106, transcription runs continuously. No inducer is needed.

Step 4: Translation. The ribosome binds the RBS on the mRNA and starts at the AUG start codon. It reads the mRNA in triplets (codons). Each codon is matched by a tRNA carrying the right anticodon and amino acid. The ribosome adds each amino acid to the growing chain. Translation stops when the ribosome reaches the UAA stop codon. The finished protein is then released.

Step 5: Protein folding and purification. The polypeptide folds into its functional 3D structure (the TIM barrel). The C-terminal His-tag enables purification by immobilized metal affinity chromatography (IMAC). Ni-NTA resin binds the histidine residues. The protein is eluted with imidazole.
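
Steps 3 and 4 can be mimicked in code. The sketch below is a simplification that ignores the promoter, RBS, and His-tag: it takes the coding strand, swaps T for U to get the mRNA, and reads codons from the first AUG to the first stop. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """The mRNA matches the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


# First six codons of the CelCCA gene plus the TAA stop.
coding = "ATGTATGATGCGAGCCTGTAA"
mrna = transcribe(coding)
print(mrna)             # AUGUAUGAUGCGAGCCUGUAA
print(translate(mrna))  # MYDASL
```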

Alternative: Cell-free expression. The same construct could be used in a cell-free TX-TL system. These systems use cell extracts containing ribosomes, tRNAs, RNA polymerase, and energy sources. Cell-free expression is faster (hours instead of days) and works for toxic proteins. However, yields are lower and costs are higher at scale.

Part 4: Twist Order

For the Benchling and Twist exercise, here are the components of my expression cassette:

Component	Part	Length
Promoter	BBa_J23106	35 bp
RBS	BBa_B0034 (with spacers)	22 bp
Start Codon	ATG	3 bp
Coding Sequence	CelCCA catalytic domain (codon-optimized, start codon listed separately)	1,137 bp
His Tag	7x His	21 bp
Stop Codon	TAA	3 bp
Terminator	BBa_B0015	129 bp
Total insert		~1,350 bp
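
A quick arithmetic check of the table, as a small Python sketch. The lengths are copied from the rows above. The coding sequence row is 1,137 bp, which presumably reflects the 1,140 bp gene minus the ATG that has its own row.

```python
# Component lengths in bp, copied from the cassette table.
cassette = [
    ("Promoter BBa_J23106", 35),
    ("RBS BBa_B0034 with spacers", 22),
    ("Start codon ATG", 3),
    ("CelCCA coding sequence", 1137),
    ("7x His tag", 21),
    ("Stop codon TAA", 3),
    ("Terminator BBa_B0015", 129),
]

total = sum(length for _, length in cassette)
print(total)  # 1350
```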

I selected pTwist Amp High Copy as the vector. It carries ampicillin resistance and a high-copy-number origin (pUC ori) for good plasmid yields.

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence and why?

I would sequence environmental metagenomes from habitats where cellulose is actively broken down, including extreme ones. Good sources include the rumen of herbivorous animals and cellulose-degrading hot springs. These places harbor thousands of uncultured microorganisms that make novel cellulases. Most of these organisms cannot be grown in the lab, so their enzymes remain hidden. Sequencing the total DNA from these environments can reveal new enzyme variants with useful properties. These include higher thermostability, different pH optima, and better activity on crystalline cellulose.

This connects to my protein of interest. CelCCA works under moderate conditions. But industrial biomass conversion often needs enzymes that tolerate 60-80 °C and low pH. Metagenomic sequencing is the fastest way to find such enzymes without culturing each organism one by one.

(ii) What sequencing technology would you use?

I would use two platforms together: Illumina short-read sequencing for accuracy and Oxford Nanopore long-read sequencing for assembly.

Illumina (second-generation sequencing):

Illumina uses sequencing-by-synthesis. The input is fragmented DNA, typically 300-600 bp fragments for metagenomic work. Library preparation adds sequencing adapters to both ends and amplifies the fragments by PCR.

The essential steps are:

1. Adapter-ligated fragments bind to complementary oligos on a flow cell surface.
2. Each fragment undergoes bridge amplification. This creates a cluster of about 1,000 identical copies.
3. Fluorescent nucleotides with reversible terminators are washed over the flow cell. One nucleotide is added per strand per cycle. A camera records which color lights up at each cluster. Then the terminator is removed so the next cycle can proceed.
4. This repeats for 150-300 cycles. The result is paired-end reads of 150-300 bp from each end of each fragment.

The output is millions to billions of short reads in FASTQ format. Each read has per-base quality scores. Error rates are about 0.1%. This is great for detecting variants and measuring abundance. The main weakness is read length. Short reads make it hard to assemble full genes from complex samples where many organisms share similar sequences.
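
To make the link between quality scores and error rates concrete, here is a small Python sketch. It assumes the common Phred+33 FASTQ encoding, where a score Q corresponds to an error probability of 10^(-Q/10). The read below is invented for illustration.

```python
def decode_quality(qual: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string (Phred+33 assumed) to integer scores."""
    return [ord(char) - offset for char in qual]


def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# A made-up four-line FASTQ record: header, bases, separator, qualities.
record = "@read1\nACG\n+\nI5?\n"
header, bases, _, qualities = record.strip().split("\n")

for base, q in zip(bases, decode_quality(qualities)):
    print(base, q, phred_to_error(q))
```

A score of Q30 corresponds to an error probability of 0.001, which is the 0.1% figure quoted above.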

Oxford Nanopore (third-generation sequencing):

Nanopore sequences single DNA molecules. It passes each strand through a protein pore in an electrically resistant membrane. The few bases inside the pore at any moment (a k-mer) disrupt the ionic current in a characteristic way. A neural network translates these current patterns into nucleotide sequences.

Input preparation is simple. Adapters are ligated to native DNA. No PCR is needed. This is a big advantage because it avoids GC-bias and preserves DNA methylation patterns.

Reads can be very long. Typical reads are 10 kb, and records reach 4 Mb. A whole cellulase gene cluster (often 10-30 kb) can fit in one read. This removes assembly ambiguity. The downside is accuracy. Raw single-read accuracy is about 92-98%. Recent chemistry improvements (R10.4 pores) and better base callers now push consensus accuracy above 99%.

The output is FASTQ or FAST5 files with base calls and raw signal data.

Why combine them? Nanopore reads provide scaffolding for complete gene cluster assembly. Illumina reads then “polish” the assembly to fix remaining errors. This hybrid approach gives both the length of long reads and the accuracy of short reads.

5.2 DNA Write

(i) What DNA would you want to synthesize and why?

I would synthesize a library of CelCCA variants with designed mutations. I would target residues near the catalytic glutamates (Glu170 and Glu307) and the aromatic residues lining the substrate-binding cleft. The goal is to engineer variants with better thermostability and broader substrate range for industrial use.

Rather than making one gene at a time, I would design 50-100 variants as a gene fragment library. Each variant would carry 3-10 mutations predicted by computational tools (covered in HTGAA Week 4). Each sequence would be about 1,140 bp, codon-optimized and flanked by standard assembly overlaps.

(ii) What synthesis technology would you use?

I would use chip-based oligonucleotide synthesis followed by enzymatic assembly. Twist Bioscience is one company that offers this service.

Essential steps:

1. Oligo synthesis: Short oligos (150-300 nt) are made in parallel on a silicon chip using phosphoramidite chemistry. Each cycle adds one nucleotide in four steps. First, the DMT protecting group is removed from the 5’-OH. Second, the next phosphoramidite monomer is coupled. Third, unreacted chains are capped to prevent deletions. Fourth, the new bond is oxidized for stability. This cycle repeats once per base.
2. Assembly: Overlapping oligos are combined and joined by overlap-extension PCR or Gibson Assembly. Adjacent oligos share complementary overlaps. They anneal together and polymerase extends them to build the full-length gene.
3. Error correction: Each coupling step is about 99.0-99.5% efficient. Errors accumulate in longer oligos. Enzymatic mismatch cleavage or sequencing-based selection removes bad sequences. The final gene is cloned and verified by sequencing.
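
The effect of coupling efficiency on oligo length can be estimated with a simple model. The sketch below assumes every coupling succeeds independently with the same probability, so an n-nt oligo (n - 1 couplings after the first, support-bound base) is full length with probability efficiency^(n - 1). Real synthesis is messier, so treat the numbers as rough.

```python
def full_length_fraction(length_nt: int, coupling_efficiency: float) -> float:
    """Fraction of chains that survive every coupling step (simple model)."""
    return coupling_efficiency ** (length_nt - 1)


for efficiency in (0.990, 0.995):
    for length in (100, 200, 300):
        fraction = full_length_fraction(length, efficiency)
        print(f"{efficiency:.3f} efficiency, {length} nt: {fraction:.1%} full length")
```

Under this model, a 200 nt oligo is full length in roughly 14% of chains at 99.0% efficiency and roughly 37% at 99.5%. That steep drop is why oligo length is capped and why error correction follows assembly.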

Limitations:

Length is the main constraint. Individual oligos work up to about 200-300 nt. Full genes up to about 5 kb can be assembled, but cost rises with length. My 1,140 bp CelCCA gene is well within the standard range.

Accuracy is also a factor. Synthesis error rates are about 1 in 300 bases before correction. After correction and clonal selection, you get essentially perfect sequences. This verification adds time and cost.

Cost has dropped a lot. Twist currently charges about $0.07 per bp for standard genes. My 1,140 bp gene costs about $80 per variant. A library of 100 variants would run about $8,000.
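
The cost estimate is simple arithmetic. The sketch below uses the approximate price quoted above, which is an assumption that will change over time.

```python
price_per_bp = 0.07    # USD, approximate figure quoted above
gene_length_bp = 1140  # CelCCA coding sequence
n_variants = 100

cost_per_variant = price_per_bp * gene_length_bp
library_cost = cost_per_variant * n_variants

print(f"Per variant: ${cost_per_variant:.2f}")  # Per variant: $79.80
print(f"Library:     ${library_cost:,.0f}")     # Library:     $7,980
```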

Turnaround is typically 2-3 weeks for clonal genes.

5.3 DNA Edit

(i) What DNA would you want to edit and why?

I would edit the genome of Clostridium cellulolyticum to improve its biomass conversion ability. The wild-type strain already makes a cellulosome, a large multi-enzyme complex on its cell surface that degrades cellulose. But its productivity is limited. I would make three types of edits:

1. Knock out carbon catabolite repression genes. This lets the organism use cellulose and other sugars at the same time instead of preferring one carbon source. It would speed up overall biomass conversion.
2. Insert a metabolic pathway for ethanol or butanol production. This would turn C. cellulolyticum into a consolidated bioprocessing organism. It could both break down cellulose and ferment the sugars in one step. Current industrial processes need separate organisms for each step.
3. Modify the cellulosome scaffolding protein (CipC). Adding slots for more enzyme types would let the organism degrade a wider range of plant polymers.

These edits would advance next-generation biofuel production. Going directly from raw plant waste to fuel in one organism would cut the cost of cellulosic biofuels significantly.

(ii) What editing technology would you use?

I would use CRISPR-Cas9 with homology-directed repair (HDR).

How CRISPR-Cas9 works:

1. Design a guide RNA (sgRNA). The first 20 nucleotides are the spacer. They match the target DNA site. The rest of the RNA forms a scaffold that binds Cas9. The target must sit next to a PAM sequence. For SpCas9, the PAM is NGG (any nucleotide then two guanines).
2. Prepare the components. The sgRNA and Cas9 protein (or a plasmid encoding both) are delivered into the cell. For C. cellulolyticum, delivery would use electroporation or conjugation. A homology donor template is also provided. This template carries the desired edit flanked by 500-1000 bp homology arms matching the regions around the cut site.
3. Cas9 cuts the DNA. The Cas9-sgRNA complex scans the genome for matching sequences next to a PAM. When it finds a match, it unwinds the DNA and checks for complementarity. If the match is good, Cas9 makes a double-strand break 3 bp upstream of the PAM.
4. Repair introduces the edit. The cell repairs the break using one of two pathways. Non-homologous end joining (NHEJ) is error-prone. It creates random insertions or deletions, useful for knockouts. Homology-directed repair (HDR) uses the donor template as a blueprint. This allows precise insertions, replacements, or corrections.
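
The guide-design step can be sketched as a scan for NGG PAMs. The Python below is a minimal illustration, not a replacement for tools like CRISPOR: it lists every 20-nt spacer followed by NGG on either strand and the predicted cut position 3 bp upstream of the PAM. It does no on-target or off-target scoring. The function name and the example target (the first 60 nt of the CelCCA coding sequence) are chosen only for demonstration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_guides(seq: str, spacer_len: int = 20) -> list:
    """List candidate SpCas9 guides: a spacer immediately followed by NGG."""
    guides = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for i in range(len(s) - spacer_len - 2):
            pam = s[i + spacer_len:i + spacer_len + 3]
            if pam[1:] == "GG":
                guides.append({
                    "strand": strand,
                    "spacer": s[i:i + spacer_len],
                    "pam": pam,
                    # Break falls just before this index, 3 bp upstream of the
                    # PAM, in coordinates of the strand being scanned.
                    "cut_index": i + spacer_len - 3,
                })
    return guides


target = "ATGTATGATGCGAGCCTGATTCCGAACCTGCAGATTCCGCAGAAAAACATTCCGAACAAC"
for guide in find_guides(target):
    print(guide["strand"], guide["spacer"], guide["pam"], guide["cut_index"])
```

In practice each candidate would then be ranked for on-target activity and checked against the whole genome for near matches.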

Inputs needed: Cas9 protein or plasmid, the sgRNA (designed computationally and synthesized), the homology donor template, and competent cells. Guide design uses tools like Benchling or CRISPOR. These tools pick sites with high on-target activity and low off-target risk.

Limitations:

Efficiency varies by organism. CRISPR works well in E. coli and many model organisms. Clostridia are harder to edit. They have low transformation efficiency and restriction-modification systems that destroy foreign DNA. Genetic tools are also limited. Recent work on Clostridial CRISPR systems (using Cas9 or Cas12a on shuttle vectors) has improved results. But editing efficiency is still around 10-50% per target, compared to 50-90% in model organisms.

Off-target cutting is another concern. The Cas9-sgRNA complex can tolerate a few mismatches. It might cut at unintended sites. This is managed by careful guide design, high-fidelity Cas9 variants (like eSpCas9 or HiFi Cas9), and whole-genome sequencing of edited clones.

For the multiplex edits I described, I would do multiple rounds of editing in sequence. Each round targets one edit and selects for success before moving on. An alternative for E. coli would be MAGE (Multiplex Automated Genome Engineering), which makes many edits at once. But MAGE is not established in Clostridia yet, so sequential CRISPR is the practical approach.