Week 2 HW: Gel Electrophoresis & DNA Design

Part 0: Basics of Gel Electrophoresis

(Completed via lecture and recitation videos.)

Part 1: Benchling & In-silico Gel Art

(In progress — to be submitted after lab access.)

Part 3: DNA Design Challenge

3.1 — Protein Choice

Protein: Haloacid dehalogenase, Rhodococcus jostii RHA1 (gene: RHA1_ro00230)
UniProt accession: Q0SK70

I chose this enzyme because it is directly relevant to my proposed project: a living biofilter for PFAS/PFOS transformation. R. jostii RHA1 is one of the best-characterized bacteria capable of initial defluorination of persistent fluorinated compounds. The haloacid dehalogenase catalyzes the cleavage of carbon–halogen bonds, making it a candidate enzyme for the first transformation step of PFAS molecules. Identifying and expressing this gene in a safe chassis is the core engineering goal of my project.

Protein sequence (from UniProt):

>tr|Q0SK70|Q0SK70_RHOJR Haloacid dehalogenase OS=Rhodococcus jostii
(strain RHA1) OX=101510 GN=RHA1_ro00230 PE=1 SV=1
MAGVPFRSPSTGRNVRAVLFDTFGTVVDWRTGIATAVADYAARHQLEVDAVAFADRWRAR
YQPSMDAILSGAREFVTLDILHRENLDFVLRESGIDPTNHDSGELDELARAWHVLTPWPD
SVPGLTAIKAEYIIGPLSNGNTSLLLDMAKNAGIPWDVIIGSDINRKYKPDPQAYLRTAQ
VLGLHPGEVMLAAAHNGDLEAAHATGLATAFILRPVEHGPHQTDDLAPTGSWDISATDIT
DLAAQLRAGSTGFR

3.2 — Reverse Translation: Protein → DNA

The DNA sequence below corresponds to the haloacid dehalogenase gene (RHA1_ro00230) from the R. jostii RHA1 genome, obtained via NCBI (GenBank accession: ABG92066.1).

ATGGCGGGCGTGCCGTTTCGTAGCCCGAGCACCGGCCGTAACGTGCGTGCGGTGCTGTTT
GATACCTTTGGCACCGTGGTGGATTGGCGTACCGGCATTGCGACCGCGGTGGCGGATTAT
GCGGCGCGTCATCAGCTGGAAGTGGATGCGGTGGCGTTTGCGGATCGTTGGCGTGCGCGT
TATCAGCCGAGCATGGATGCGATTCTGAGCGGCGCGCGTGAATTTGTGACCCTGGATATT
CTGCATCGTGAAAACCTGGATTTTGTGCTGCGTGAAAGCGGCATTGATCCGACCAACCAT
GATAGCGGCGAACTGGATGAACTGGCGCGTGCGTGGCATGTGCTGACCCCGTGGCCGGAT
AGCGTGCCGGGCCTGACCGCGATTAAAGCGGAATATATTATTGGCCCGCTGAGCAACGGC
AACACCAGCCTGCTGCTGGATATGGCGAAAAACGCGGGCATTCCGTGGGATGTGATTATT
GGCAGCGATATTAACCGTAAATATAAACCGGATCCGCAGGCGTATCTGCGTACCGCGCAG
GTGCTGGGCCTGCATCCGGGCGAAGTGATGCTGGCGGCGGCGCATAACGGCGATCTGGAA
GCGGCGCATGCGACCGGCCTGGCGACCGCGTTTATCCTGCGTCCGGTGGAACATGGCCCG
CATCAGACCGATGATCTGGCGCCGACCGGCAGCTGGGATATTAGCGCGACCGATATTACC
GATCTGGCGGCGCAGCTGCGTGCGGGCAGCACCGGCTTTCGT
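
The protein-to-DNA step can be sketched in a few lines of Python. This is only a minimal illustration, assuming a hypothetical one-codon-per-amino-acid table (`PREFERRED_CODON`, my own choice of codons that are common in E. coli); real tools choose among synonymous codons using full usage tables.

```python
# Hypothetical one-codon-per-amino-acid table (an assumption for illustration):
# each residue is mapped to a single codon that is common in E. coli.
PREFERRED_CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def back_translate(protein: str) -> str:
    """Return one possible DNA sequence that encodes `protein`."""
    return "".join(PREFERRED_CODON[aa] for aa in protein.upper())


# First 15 residues of Q0SK70, taken from the UniProt record in 3.1.
n_term = "MAGVPFRSPSTGRNV"
print(back_translate(n_term))
```

Because the genetic code is degenerate, this is one of many valid back-translations; swapping any codon for a synonymous one gives a different DNA sequence for the same protein.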

3.3 — Codon Optimization

Tool used: Twist Bioscience Codon Optimization Tool
Chassis organism: Escherichia coli K-12

Why codon optimization is necessary:

The genetic code is degenerate — multiple codons can encode the same amino acid, but different organisms preferentially use certain codons over others (codon usage bias). R. jostii RHA1 has a genomic GC content of ~67%, so its native genes are heavily biased toward GC-rich codons. E. coli K-12, with a GC content of ~51%, uses a very different codon distribution. If the native Rhodococcus gene is expressed in E. coli without optimization, the ribosome will frequently encounter rare codons, causing translation to slow, stall, or produce misfolded protein. Codon optimization replaces rare codons with the E. coli-preferred synonymous codons, improving translation efficiency without changing the amino acid sequence.
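
Codon usage bias can be made concrete by counting codons in a coding sequence. The sketch below is a minimal example; the function name `codon_counts` is my own, and the fragment is simply the first 60 nt of the sequence in 3.2.

```python
from collections import Counter


def codon_counts(cds: str) -> Counter:
    """Count how often each codon occurs in an in-frame coding sequence."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


# First 60 nt (20 codons) of the sequence in section 3.2.
fragment = "ATGGCGGGCGTGCCGTTTCGTAGCCCGAGCACCGGCCGTAACGTGCGTGCGGTGCTGTTT"
counts = codon_counts(fragment)
for codon, n in counts.most_common(3):
    print(codon, n)
```

Comparing such counts against a host's codon usage table is, roughly, how rare codons are flagged for replacement.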

Why E. coli K-12 as chassis:

E. coli K-12 was chosen over B. subtilis because its GC content (~51%) is more compatible with the donor sequence (~67%), requiring fewer codon substitutions. It is BSL-1, extremely well characterized for heterologous protein expression, and all major codon optimization tools have highly curated E. coli codon tables. This aligns with the safety and containment goals outlined in my PFAS biofilter project proposal.

Codon-optimized DNA sequence for E. coli K-12:

ATGGCGGGCGTGCCGTTTCGTAGCCCGAGCACCGGCCGTAACGTGCGTGCGGTGCTGTTT
GATACCTTTGGCACCGTGGTGGATTGGCGTACCGGCATTGCGACCGCGGTGGCGGATTAT
GCGGCGCGTCATCAGCTGGAAGTGGATGCGGTGGCGTTTGCGGATCGTTGGCGTGCGCGT
TATCAGCCGAGCATGGATGCGATTCTGAGCGGCGCGCGTGAATTTGTGACCCTGGATATT
CTGCATCGTGAAAACCTGGATTTTGTGCTGCGTGAAAGCGGCATTGATCCGACCAACCAT
GATAGCGGCGAACTGGATGAACTGGCGCGTGCGTGGCATGTGCTGACCCCGTGGCCGGAT
AGCGTGCCGGGCCTGACCGCGATTAAAGCGGAATATATCATTGGCCCGCTGAGCAACGGC
AACACCAGCCTGCTGCTGGATATGGCGAAAAACGCGGGCATTCCGTGGGATGTGATCATT
GGCAGCGATATTAACCGTAAATATAAACCGGATCCGCAGGCGTATCTGCGTACCGCGCAG
GTGCTGGGCCTGCATCCGGGCGAAGTGATGCTGGCGGCGGCGCATAACGGCGATCTGGAA
GCGGCGCATGCGACCGGCCTGGCGACCGCGTTTATCCTGCGTCCGGTGGAACATGGCCCG
CATCAGACCGATGATCTGGCGCCGACCGGCAGCTGGGATATTAGCGCGACCGATATTACC
GATCTGGCGGCGCAGCTGCGTGCGGGCAGCACCGGCTTTCGT

GC content after optimization: 60.6% (reduced from ~67% in the native Rhodococcus sequence)
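
GC content is simply the fraction of G and C bases in a sequence, so it can be checked independently of the optimization tool. A minimal sketch (the helper name `gc_content` is my own; whitespace and line breaks in a pasted sequence are ignored):

```python
def gc_content(seq: str) -> float:
    """Return the GC content of a DNA sequence as a percentage."""
    seq = "".join(seq.split()).upper()  # drop spaces and line breaks
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(base in "GC" for base in seq)
    return 100.0 * gc / len(seq)


# Usage: paste the full sequence (line breaks are fine) between the quotes.
example = """
ATGGCGGGCGTGCCGTTTCGTAGCCCGAGCACCGGCCGTAACGTGCGTGCGGTGCTGTTT
"""
print(f"{gc_content(example):.1f}% GC")
```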

3.4 — From DNA to Protein: Production Methods

Once the codon-optimized expression cassette is assembled (Promoter → RBS → ATG → optimized gene → 7×His-tag → STOP → Terminator), the haloacid dehalogenase protein can be produced by two main approaches:

Cell-based (in vivo) — E. coli K-12: The plasmid carrying the expression cassette is transformed into chemically competent E. coli K-12 cells. RNA polymerase recognizes the promoter and transcribes the gene into mRNA. The ribosome binds at the Shine-Dalgarno sequence (RBS), positions itself at the ATG start codon, and translates the mRNA codon-by-codon until the TAA stop codon. The polypeptide folds into the functional enzyme. The 7×His-tag enables purification by nickel-affinity chromatography (Ni-NTA).

Cell-free (in vitro): The plasmid is added directly to a cell-free transcription-translation system (e.g., PURExpress, NEB). The gene is transcribed and translated without any living cell, producing functional protein in a tube. This is faster than cell-based methods, avoids containment concerns for engineered organisms, and is especially useful for early-stage functional testing — directly aligned with the biosafety goals of my PFAS biofilter project.

3.5 — How Does It Work in Biological Systems? (Optional)

How a single gene codes for multiple proteins:

In biological systems, a single gene can give rise to multiple distinct protein products through several mechanisms:

Alternative transcription start sites: RNA polymerase can initiate transcription at different promoters, producing transcripts with different 5’ ends and therefore different protein N-termini.
Alternative splicing (eukaryotes): The spliceosome can include or exclude different exon combinations, generating multiple mRNA isoforms and protein variants from the same gene. The Drosophila Dscam gene can theoretically produce over 38,000 isoforms this way.
Translational frameshifting: In prokaryotes and viruses, the ribosome can slip into a different reading frame, producing a distinct protein from the same mRNA — as seen in the MS2 phage lysis protein discussed in class.
Post-translational cleavage: A single polyprotein precursor can be cleaved by proteases into multiple functional subunits.

DNA → RNA → Protein alignment (first 15 codons of Q0SK70):

Every T in DNA becomes U in mRNA during transcription. Each 3-nucleotide codon encodes exactly one amino acid.

DNA:  ATG GCG GGC GTG CCG TTT CGT AGC CCG AGC ACC GGC CGT AAC GTG
mRNA: AUG GCG GGC GUG CCG UUU CGU AGC CCG AGC ACC GGC CGU AAC GUG
AA:    M   A   G   V   P   F   R   S   P   S   T   G   R   N   V

#	DNA Codon	mRNA Codon	AA	Name
1	ATG	AUG	M	Met (Start)
2	GCG	GCG	A	Ala
3	GGC	GGC	G	Gly
4	GTG	GUG	V	Val
5	CCG	CCG	P	Pro
6	TTT	UUU	F	Phe
7	CGT	CGU	R	Arg
8	AGC	AGC	S	Ser
9	CCG	CCG	P	Pro
10	AGC	AGC	S	Ser
11	ACC	ACC	T	Thr
12	GGC	GGC	G	Gly
13	CGT	CGU	R	Arg
14	AAC	AAC	N	Asn
15	GTG	GUG	V	Val

Full protein N-terminus confirmed: M-A-G-V-P-F-R-S-P-S-T-G-R-N-V-R-A-V-L-F-D… — matches Q0SK70 exactly.
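
The same DNA → mRNA → protein check can be reproduced in code. The sketch below builds the standard genetic code from the usual TCAG-ordered amino acid string; the function names `transcribe` and `translate` are my own, and the input is assumed to be the coding (sense) strand.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA: every T becomes U."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna: str) -> str:
    """Translate codon by codon from position 0, stopping at a stop codon."""
    dna = coding_dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


first_15_codons = "ATGGCGGGCGTGCCGTTTCGTAGCCCGAGCACCGGCCGTAACGTG"
print(transcribe(first_15_codons))
print(translate(first_15_codons))
```

Running `translate` on a full-length sequence and comparing the result with the UniProt entry is a quick way to catch frameshifts or accidental amino acid changes.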

Part 4: Twist DNA Synthesis Order

(Pending — will be submitted after feedback on expression cassette design.)

Part 5: DNA Read / Write / Edit

5.1 — DNA Read

(i) What DNA would I sequence and why?

In the context of my PFAS biofilter project, I would sequence the metagenome of PFAS-contaminated soil and water microbiomes — specifically from sites near industrial discharge zones, firefighting training areas, and agricultural land treated with PFAS-containing biosolids.

The rationale: we currently know very little about which microorganisms naturally perform any degree of PFAS transformation. Shotgun metagenomic sequencing of these communities would reveal novel genes and enzymatic pathways associated with defluorination, without needing to culture every organism individually. This is a “read first, engineer later” strategy — sequencing environmental DNA to discover what biology is already doing, then extracting candidate genes (such as haloacid dehalogenases) for transfer into a safe chassis. This also serves environmental monitoring and biodiversity assessment goals directly relevant to the governance framework in my project proposal.

(ii) Technology: Illumina Short-Read Sequencing (Second-Generation)

Generation: Second-generation (Next-Generation Sequencing, NGS).

Why: For metagenomic samples containing millions of different organisms and genes, Illumina offers the best balance of throughput, accuracy, and cost. It generates billions of 150 bp paired-end reads per run with an error rate of ~0.1% — sufficient for reliable gene identification and annotation across diverse communities.

Input preparation — essential steps:

Environmental DNA extraction from soil/water samples (e.g., Qiagen PowerSoil kit).
Mechanical fragmentation by sonication to ~300–500 bp fragments.
End repair to generate blunt, 5'-phosphorylated ends, followed by A-tailing to add a single 3' A overhang for adapter ligation.
Adapter ligation — double-stranded adapters containing primer binding sites and multiplexing indices are ligated to both ends.
Limited-cycle PCR to enrich adapter-ligated fragments.
Library QC by Bioanalyzer to verify fragment size distribution.
Loading onto flow cell for sequencing.

How base calling works (Sequencing by Synthesis, SBS): DNA fragments are loaded onto a glass flow cell and amplified in place by bridge amplification, creating clusters of identical copies. In each cycle, all four fluorescently labeled reversible-terminator nucleotides are added together, and the 3' blocking group ensures that only one base is incorporated per strand. A laser excites the fluorophore on each incorporated base; a camera captures the emission color (A, T, G, or C); the fluorophore and blocking group are chemically cleaved; and the next cycle begins. This repeats for every position in the read, decoding the sequence base by base.

Output: FASTQ files containing billions of short reads (150 bp paired-end), each with a per-base Phred quality score. These are assembled or aligned against reference genomes and annotated for gene function.
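
To make the Phred scores concrete: a score Q corresponds to an error probability of 10^(-Q/10), so Q30 matches the ~0.1% error rate quoted above. The sketch below parses one four-line FASTQ record, assuming the common Phred+33 ASCII encoding; the record itself is made up for illustration.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines, offset=33):
    """Parse one 4-line FASTQ record; return (read_id, sequence, scores).

    `offset=33` assumes Phred+33 encoding (an assumption about the input).
    """
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(ch) - offset for ch in qual]
    return header[1:], seq, scores


# Made-up record for illustration only.
record = ["@read1", "ATGGCG", "+", "IIII?5"]
read_id, seq, scores = parse_fastq_record(record)
for base, q in zip(seq, scores):
    print(base, q, phred_to_error_prob(q))
```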

5.2 — DNA Write

(i) What DNA would I synthesize and why?

The DNA I would synthesize is the full expression cassette for the codon-optimized haloacid dehalogenase (Q0SK70) in E. coli K-12, as designed in Part 3. Beyond this gene, I would also synthesize a complete genetic circuit consisting of:

A biosensor module — a fluoride-responsive riboswitch (crcB riboswitch, Rfam: RF01734) that activates gene expression in response to intracellular fluoride ions released during PFAS defluorination.
The haloacid dehalogenase downstream of the riboswitch, so enzyme expression is automatically induced when PFAS are present in the bioreactor.
An sfGFP reporter as a real-time readout of system activity, detectable as green fluorescence under blue-light excitation (~488 nm).

This complete circuit — sense PFAS → express dehalogenase → report activity — is the minimal viable design for the living biofilter described in my project proposal.

The codon-optimized haloacid dehalogenase sequence is provided in Part 3.3. The crcB riboswitch sequence is publicly available in Rfam (RF01734).

(ii) Technology: Twist Bioscience Clonal Gene Synthesis

Essential steps:

Sequence design and codon optimization — the target sequence is processed to remove secondary structures and avoid restriction sites (BsaI, BsmBI, BbsI).
Computational division into overlapping ~200 bp oligonucleotides.
Silicon chip-based parallel oligo synthesis at high density.
PCR-based assembly of overlapping oligos into the full-length gene.
Cloning into the selected vector backbone (e.g., pTwist Amp High Copy).
Transformation into E. coli and Sanger sequencing verification.
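
The restriction-site check in the first step can be sketched as a simple scan of both strands. This is only an illustration of the idea, not Twist's actual pipeline; the function names are my own, and the recognition sequences used are GGTCTC (BsaI), CGTCTC (BsmBI), and GAAGAC (BbsI).

```python
# Recognition sequences of the Type IIS enzymes named above.
TYPE_IIS_SITES = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC", "BbsI": "GAAGAC"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq: str, sites=TYPE_IIS_SITES):
    """Return (enzyme, strand, 0-based start) for every site on either strand."""
    seq = seq.upper()
    hits = []
    for enzyme, site in sites.items():
        for strand, pattern in (("+", site), ("-", reverse_complement(site))):
            start = seq.find(pattern)
            while start != -1:
                hits.append((enzyme, strand, start))
                start = seq.find(pattern, start + 1)
    return sorted(hits, key=lambda hit: hit[2])


# Made-up test sequences, each containing one BsaI site.
print(find_sites("AAGGTCTCAA"))
print(find_sites("TTGAGACCTT"))
```

An empty list means the sequence is free of these sites; any hit would need a synonymous codon change before ordering.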

Limitations:

Maximum insert length ~5 kb per clonal gene; larger constructs require fragment ordering and in-lab assembly.
Highly repetitive sequences or extreme GC content can reduce synthesis success.
Turnaround time is typically 2–3 weeks.
Per-base cost is low but scales with construct length.

5.3 — DNA Edit

(i) What DNA would I edit and why?

For my PFAS biofilter project, the most valuable editing target is the haloacid dehalogenase gene itself — to engineer improved variants with higher catalytic efficiency toward long-chain perfluorinated PFAS substrates. The native R. jostii enzyme has modest activity on PFAS; directed evolution or rational active-site redesign could produce variants that defluorinate PFAS more rapidly and across a broader range of chain lengths.

A second target is the E. coli K-12 chromosome — specifically, deleting competing non-specific dehalogenases that might generate toxic intermediates, and inserting the biosensor-dehalogenase circuit into a stable chromosomal locus rather than a plasmid, to prevent plasmid loss during long-term bioreactor operation.

(ii) Technology: CRISPR-Cas9

How it edits DNA — essential steps:

gRNA design — a 20-nucleotide spacer sequence is designed to match the target genomic site, adjacent to a PAM sequence (5’-NGG-3’ for SpCas9).
Construct preparation — the gRNA is cloned into a plasmid expressing both the guide and Cas9. A donor DNA template (the desired insert or correction, flanked by ~500 bp homology arms) is also prepared.
Delivery — the Cas9/gRNA plasmid and donor template are co-transformed into E. coli by electroporation.
Cleavage and HDR — Cas9 binds the target via the gRNA and makes a double-strand break. The cell repairs the break by homology-directed repair (HDR) using the donor template, precisely inserting the sequence of interest.
Screening — colonies are screened by colony PCR and confirmed by Sanger sequencing.
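
The gRNA design step can be sketched as a search for 20-nt spacers that sit immediately 5' of an NGG PAM. This minimal version scans only the given strand (pass in the reverse complement to search the other one) and does no off-target scoring; the function name `find_spcas9_spacers` and the test sequence are my own.

```python
import re


def find_spcas9_spacers(seq: str, spacer_len: int = 20):
    """Return (start, spacer, pam) for each spacer followed by an NGG PAM.

    A lookahead is used so that overlapping candidate sites are all reported.
    """
    seq = seq.upper()
    pattern = re.compile(rf"(?=([ACGT]{{{spacer_len}}})([ACGT]GG))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Made-up target sequence for illustration only.
target = "ACGT" * 5 + "AGG" + "C"
for start, spacer, pam in find_spcas9_spacers(target):
    print(start, spacer, pam)
```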

Inputs required:

Designed 20 nt gRNA sequences
Cas9 expression plasmid
Donor DNA with homology arms (~500 bp each side)
Competent E. coli cells
Selectable markers

Limitations:

Off-target cuts at sites with partial gRNA complementarity (mitigated by high-fidelity Cas9 variants such as eSpCas9 or HiFi Cas9).
Native homologous recombination with linear donor DNA is inefficient in E. coli, and an unrepaired Cas9 cut is usually lethal; lambda Red recombineering is commonly combined with CRISPR to improve efficiency.
Multiplexing multiple edits simultaneously increases risk of unintended genomic rearrangements.
Delivery remains challenging in organisms without established transformation protocols.