Week 2 HW: DNA Read, Write, & Edit

My protein this week is rhodopsin (RHO, UniProt P08100). It is a photon-sensing G-protein-coupled receptor in the rod cells of the retina. As someone who works professionally in photography, this protein is basically my biological counterpart: a single 11-cis-retinal molecule sits in the middle of a 7-transmembrane GPCR and isomerizes to all-trans on absorbing one photon, triggering the entire phototransduction cascade. It is the sensor in the world’s oldest and most refined camera.

Part 0: Basics of Gel Electrophoresis

Negatively charged DNA is pulled toward the positive electrode (anode) through an agarose mesh; shorter fragments thread through faster and travel farther. Run a ladder of known fragment sizes in a parallel lane and you can read fragment sizes off the photo.
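Reading sizes off the ladder is an interpolation on a log scale: over the useful range of a gel, migration distance is roughly linear in log10(fragment length). A minimal Python sketch, assuming made-up ladder distances (the numbers are hypothetical, not measured from any gel):

```python
import math

def estimate_size_bp(distance_mm, ladder):
    """Estimate fragment size by interpolating log10(bp) against distance.

    ladder: list of (size_bp, distance_mm) pairs read off the gel photo.
    """
    pts = sorted(ladder, key=lambda p: p[1])  # sort by migration distance
    for (s1, d1), (s2, d2) in zip(pts, pts[1:]):
        if d1 <= distance_mm <= d2:
            frac = (distance_mm - d1) / (d2 - d1)
            log_size = math.log10(s1) + frac * (math.log10(s2) - math.log10(s1))
            return 10 ** log_size
    raise ValueError("band lies outside the ladder range")

# Hypothetical ladder: (size in bp, distance migrated in mm)
LADDER = [(3000, 12.0), (1500, 20.0), (1000, 25.0), (500, 34.0)]
print(round(estimate_size_bp(22.5, LADDER)))
```

A band outside the ladder range raises an error rather than extrapolating, since the log-linear relation breaks down at the extremes.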

Part 3: DNA Design Challenge

3.1. Choose your protein

Rhodopsin (RHO, UniProt P08100, Homo sapiens, 348 aa). I chose it because:

- It is the cleanest example of biological signal transduction I know: one photon in, one G-protein activation out, with an amplification cascade behind it that lets a dark-adapted rod cell detect a single photon.
- The chromophore (11-cis-retinal) is bound covalently via a Schiff base to Lys296. The photon does not act on the protein directly; it isomerizes the retinal, and the retinal then strains the protein into its active conformation (metarhodopsin II). The protein is a mechanical lever, not a photon absorber. That distinction was a real “oh” moment for me.
- Mutations in RHO cause autosomal-dominant retinitis pigmentosa (~25% of adRP cases), so this is also a tractable target for gene therapy, which connects nicely to Part 5.3 below.

>sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens OX=9606 GN=RHO PE=1 SV=1
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA

3.2. Reverse Translate

I reverse-translated the protein sequence with the EMBOSS Backtranseq tool, then checked the result against the native human RHO mRNA (NCBI RefSeq NM_000539.3) as a sanity check. The native mRNA already carries a real human codon usage profile, but I used the back-translated sequence as the starting point so that the codon optimization in 3.3 is a clean step rather than a re-optimization of the native codon choices.

A short prefix of the back-translated DNA (the first 60 aa, 180 nt) looks like this before optimization:

ATGAATGGTACTGAAGGTCCTAATTTTTATGTTCCTTTTAGTAATGCTACTGGTGTTGTT
CGTAGTCCTTTTGAATATCCTCAATATTATTTAGCTGAACCTTGGCAATTTAGTATGTTA
GCTGCTTATATGTTTTTATTAATTGTTTTAGGTTTTCCTATTAATTTTTTAACTTTATAT

(Full 1,047 nt sequence in the Benchling file RHO_backtranslated.)
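The back-translation above amounts to a table lookup with one codon per amino acid. A minimal Python sketch, assuming a codon table that I inferred from the prefix shown above (it is not taken from Backtranseq's own tables and covers only the residues present in the first 60 aa):

```python
# One codon per amino acid, inferred from the back-translated prefix above.
# Assumption: covers only residues that occur in the first 60 aa of RHO.
FIXED_CODON = {
    "M": "ATG", "N": "AAT", "G": "GGT", "T": "ACT", "E": "GAA", "P": "CCT",
    "F": "TTT", "Y": "TAT", "V": "GTT", "S": "AGT", "A": "GCT", "R": "CGT",
    "Q": "CAA", "L": "TTA", "W": "TGG", "I": "ATT",
}

def back_translate(protein):
    """Naive reverse translation: look up one fixed codon per residue."""
    return "".join(FIXED_CODON[aa] for aa in protein)

first_20 = "MNGTEGPNFYVPFSNATGVV"
dna = back_translate(first_20)
print(dna)
print(len(dna))
```

Because every residue always gets the same codon, the output ignores host codon usage entirely, which is why optimization is a separate step.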

3.3. Codon optimization

I codon-optimized for E. coli K-12 as the expression chassis, because:

- It is the chassis we will actually be using in lab (cell-free lysate + plasmid transformation).
- Human codon usage differs significantly from E. coli in several places, notably Arg (AGG/AGA are rare codons in E. coli and abundant in human), Leu (CTA is rare in E. coli), and Ile (ATA is rare in E. coli). Without optimization, ribosomes can stall at rare-codon tracts and you get truncated products or little to no expression.
- E. coli coding sequences average ~50% GC, while human coding sequences tend to be more GC-rich, especially at third codon positions. Stretches of extreme GC content can form stable mRNA secondary structure (hairpins) and slow translation.
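To make the rare-codon and GC points concrete, here is a small Python sketch that flags the four rare codons named above and reports GC fraction. The rare-codon set is just those four, not a full E. coli usage table, and the input ORF is a hypothetical toy sequence:

```python
RARE_IN_ECOLI = {"AGG", "AGA", "CTA", "ATA"}  # only the codons named above

def codons(orf):
    """Split an ORF into complete codons, dropping any trailing partial codon."""
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]

def rare_codon_positions(orf):
    """Return 1-based codon positions that use a rare E. coli codon."""
    return [i for i, c in enumerate(codons(orf.upper()), start=1)
            if c in RARE_IN_ECOLI]

def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

toy = "ATGAGGCTAATAGGC"  # hypothetical 5-codon ORF
print(rare_codon_positions(toy), round(gc_fraction(toy), 2))
```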

I used the Twist Codon Optimization Tool, set host = E. coli K-12, and excluded recognition sites for BsaI, BsmBI, BbsI (Type IIS enzymes used in Golden Gate assembly) and also EcoRI, HindIII, BamHI so my downstream cloning options stay open. The optimizer also smoothed out long homopolymer runs (>6 nt) and removed internal Shine-Dalgarno-like motifs that could cause internal translation starts.
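The exclusion checks can be reproduced independently of the optimizer. A minimal Python sketch, assuming the standard recognition sequences for the six enzymes listed and scanning both strands; it illustrates the check and is not a description of how the Twist tool works internally:

```python
import re

SITES = {
    "BsaI": "GGTCTC", "BsmBI": "CGTCTC", "BbsI": "GAAGAC",
    "EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC",
}

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sites(seq):
    """Map enzyme name -> sorted 0-based start positions on either strand."""
    seq = seq.upper()
    hits = {}
    for name, site in SITES.items():
        positions = set()
        for motif in {site, revcomp(site)}:  # palindromes collapse to one motif
            start = seq.find(motif)
            while start != -1:
                positions.add(start)
                start = seq.find(motif, start + 1)
        if positions:
            hits[name] = sorted(positions)
    return hits

def long_homopolymers(seq, max_run=6):
    """Return (start, run) for every single-base run longer than max_run."""
    pattern = r"A{%d,}|C{%d,}|G{%d,}|T{%d,}" % ((max_run + 1,) * 4)
    return [(m.start(), m.group()) for m in re.finditer(pattern, seq.upper())]

# First 60 nt of the codon-optimized RHO from section 3.5
prefix = "ATGAACGGCACCGAAGGCCCGAATTTCTATGTGCCGTTCAGCAACGCGACCGGCGTGGTG"
print(find_sites(prefix), long_homopolymers(prefix))
```

Running the same two functions over the full optimized ORF is a cheap way to confirm the optimizer's exclusions before ordering.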

Caveat about rhodopsin specifically: rhodopsin is a 7-transmembrane integral membrane GPCR. It does not natively express well in E. coli without engineering – the bacterial membrane lacks the right lipid composition, the reducing cytoplasm does not favor formation of the disulfide bond (Cys110-Cys187), and there is no machinery for palmitoylation (Cys322/323). In a real project I would either (a) express the full-length protein in HEK293 or Sf9 (Spodoptera frugiperda) cells, with codons optimized for that host, or (b) express only a soluble cytoplasmic loop in E. coli for an antibody-generation experiment. For this homework, codon-optimizing for E. coli is the assigned exercise.

3.4. Two pathways to get protein from the optimized DNA:

Cell-dependent (in vivo): Clone the optimized RHO ORF into an expression vector with a T7 promoter, RBS, start codon, and terminator (the pTwist Amp High Copy vector from Part 4 works for cloning; for expression I’d move it into pET-28a in BL21(DE3)). Transform into competent E. coli, plate on kanamycin (pET-28a carries a kanamycin-resistance marker, not ampicillin), pick colonies, grow to OD600 ~0.6, induce with IPTG, harvest, lyse, and purify via the His-tag on Ni-NTA. The cell does transcription via T7 RNAP and translation via its ribosomes; the protein folds (or, for rhodopsin, mostly misfolds into inclusion bodies, which then need refolding with retinal added in vitro to reconstitute the holoprotein).

Cell-free (in vitro): Use a TXTL lysate-based system like the one from Week 11. Add the linear or circular DNA template directly to the lysate + energy mix, and transcription/translation happen in the tube over 1-20 h. For a membrane protein like rhodopsin, the cell-free pathway has a real advantage: you can supplement the reaction with nanodiscs or detergent micelles so the nascent rhodopsin inserts into a membrane-mimetic environment rather than aggregating. Add 11-cis-retinal to the reaction and the holoprotein reconstitutes in situ. This is increasingly how membrane GPCRs are produced for structural biology.

3.5. How does it work in nature?

(a) transcriptional/post-transcriptional mechanisms:

- Alternative splicing. One pre-mRNA can be spliced into many mature mRNAs by including or skipping exons. Dscam in Drosophila can in principle produce >38,000 isoforms from one gene. Opsins are a useful contrast: Drosophila gets its spectral diversity from separate opsin genes (ninaE encodes Rh1; the other rhodopsins are distinct genes) rather than from splicing one gene.
- Alternative promoters. Different transcription start sites produce mRNAs with different 5’ UTRs and sometimes different N-terminal coding regions.
- Alternative polyadenylation. Different 3’ ends change mRNA stability and localization.
- RNA editing. ADAR enzymes convert A->I (read as G), which can change codons. It is especially extensive in cephalopod nervous systems, where, for example, editing of potassium-channel transcripts in octopus correlates with the water temperature the species lives in.

(b) DNA -> RNA -> Protein alignment (first 60 nt / 20 aa of the codon-optimized RHO):

DNA:     ATG AAC GGC ACC GAA GGC CCG AAT TTC TAT GTG CCG TTC AGC AAC GCG ACC GGC GTG GTG
RNA:     AUG AAC GGC ACC GAA GGC CCG AAU UUC UAU GUG CCG UUC AGC AAC GCG ACC GGC GUG GUG
Protein:  M   N   G   T   E   G   P   N   F   Y   V   P   F   S   N   A   T   G   V   V
Pos:      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
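The alignment above can be regenerated programmatically. A minimal Python sketch using the standard genetic code; the compact table construction (bases in TCAG order zipped against a 64-letter amino-acid string) is a common idiom, and the function names are my own:

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna):
    """Coding-strand DNA -> mRNA (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(dna):
    """Translate complete codons in frame 1; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

optimized_prefix = "ATGAACGGCACCGAAGGCCCGAATTTCTATGTGCCGTTCAGCAACGCGACCGGCGTGGTG"
print(transcribe(optimized_prefix))
print(translate(optimized_prefix))
```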

Part 5: DNA Read / Write / Edit

5.1 DNA Read

(i) What DNA would I want to sequence and why?

I would sequence the RHO gene in patients with autosomal-dominant retinitis pigmentosa (adRP) to identify the specific causative mutation in each patient before deciding on a therapy. There are >150 known pathogenic RHO mutations, and they fall into mechanistically distinct classes (Class I = trafficking-defective, Class II = misfolding, etc.). The right therapy – allele-specific knockdown, base editing, prime editing, or gene replacement – depends on which class the mutation belongs to. Sequencing is the upstream diagnostic that determines everything downstream.

Beyond clinical use, I would also like to do metagenomic sequencing of cephalopod skin to find new opsins. Octopus skin appears to “see” through dermal opsins, and characterizing the opsin diversity across cephalopods could yield new optogenetic tools.

(ii) Which technology and why?

I would use Oxford Nanopore (third generation, long-read) for the clinical RHO case, complemented by Illumina short-read for accuracy.

Generation: Nanopore is third-generation – single-molecule, real-time, no amplification, long reads (often 10-100 kb, can exceed 1 Mb). Sanger is first generation (one read per reaction, dye-terminator, ~800 bp); Illumina is second generation (massively parallel short reads, ~150-300 bp, with each fragment clonally amplified into a cluster on the flow cell).
Input prep:
- Extract genomic DNA from a blood draw (column-based or magnetic-bead based extraction).
- Skip fragmentation for long reads – the whole point is to preserve length.
- Adapter ligation: ligate Nanopore’s sequencing adapters (which include a motor protein that controls translocation speed) to the ends of the genomic DNA.
- Optional: enrich for the RHO locus by Cas9 cleavage + adapter ligation (targeted long-read sequencing) so you don’t waste reads on the rest of the genome.
- Load onto the MinION/PromethION flow cell.
Base calling: DNA is pulled through a protein nanopore in a membrane by an applied voltage. The few bases sitting in the pore constriction at any moment (a short k-mer, not a single base) modulate the ionic current, so different sequence contexts give different current signatures, and modified bases like 5mC shift the signal further. A neural network (Guppy / Dorado) reads the current trace and translates it to a base sequence. This is called “basecalling” and is essentially audio-to-text transcription, where the audio is the ionic current.
Output: FASTQ files (sequence + per-base quality scores), plus optionally direct methylation calls (5mC, 6mA) without bisulfite conversion. For my use case, I align the reads to the human reference, call variants in the RHO locus, and report the patient’s genotype.
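The per-base quality scores in a FASTQ file are Phred values encoded as ASCII characters offset by 33. A minimal Python sketch of decoding one record, assuming the common four-line layout and a made-up read (the read ID and sequence are hypothetical):

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, seq, plus, qual = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    scores = [ord(ch) - 33 for ch in qual]  # Phred+33 encoding
    return header[1:], seq, scores

def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)

record = ["@read_1", "ACGT", "+", "5?I+"]  # hypothetical read
read_id, seq, scores = parse_fastq_record(record)
for base, q in zip(seq, scores):
    print(base, q, error_probability(q))
```

For variant calling in the RHO locus, these per-base probabilities are what lets a caller weigh a low-confidence mismatch differently from a high-confidence one.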

5.2 DNA Write

(i) What DNA would I want to synthesize and why?

I would synthesize a codon-engineered RHO variant carrying silent mutations across the entire coding region, designed to be invisible to an shRNA / siRNA that targets the wild-type sequence. This is the classic “knockdown and replace” strategy for adRP:

- An shRNA that targets a region of the native RHO mRNA silences the patient’s endogenous RHO, the dominant-negative mutant allele and the wild-type allele alike.
- A “hardened” replacement RHO – silently re-coded so the shRNA can no longer bind – is co-delivered.
- Net result: both alleles of native RHO are silenced, and the hardened replacement provides wild-type protein.

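Resistance itself comes from the synonymous changes inside the shRNA target site, which at ~19-21 nt spans only about 7 codons. Recoding more broadly (~30+ synonymous changes across the coding region) is exactly the kind of thing Twist’s gene synthesis is good at. The replacement is ~1 kb, well within clonal gene size.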
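The "hardening" step can be sketched as a synonymous recoding of the codons that overlap the shRNA target site. The Python below is a minimal illustration under stated assumptions: the target window (nt 9-30) is arbitrary, the input is the 60-nt prefix from section 3.5 rather than the native human sequence, and the synonym choice ignores codon usage, which a real design would weigh:

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - len(dna) % 3, 3))

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def harden(orf, start, end):
    """Swap every codon overlapping [start, end) for a synonymous alternative."""
    out = []
    for i in range(0, len(orf), 3):
        codon = orf[i:i + 3]
        if i < end and i + 3 > start:  # codon overlaps the target window
            aa = CODON_TABLE[codon]
            synonyms = [c for c, a in CODON_TABLE.items() if a == aa and c != codon]
            if synonyms:  # Met and Trp have no synonym and stay unchanged
                codon = max(synonyms, key=lambda c: mismatches(c, codon))
        out.append(codon)
    return "".join(out)

orf = "ATGAACGGCACCGAAGGCCCGAATTTCTATGTGCCGTTCAGCAACGCGACCGGCGTGGTG"
hardened = harden(orf, 9, 30)  # hypothetical 21-nt shRNA target site
print(hardened)
print(mismatches(orf[9:30], hardened[9:30]), translate(hardened) == translate(orf))
```

The final check is the important one: the protein must be identical while the target site carries as many mismatches to the shRNA as the genetic code allows.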

Beyond therapeutics, I would also love to synthesize opsin variants with shifted spectral sensitivity – e.g., a red-shifted human rhodopsin built by transplanting microbial opsin tuning residues, for optogenetic use in deep-tissue stimulation.

(ii) Technology – which synthesis platform?

I would use silicon-based microarray DNA synthesis (Twist’s platform) – this is the standard for gene-length synthesis at high throughput and low error.

Essential steps:
- Phosphoramidite chemistry: starting from a solid support, add one nucleotide at a time using a 4-step cycle (deblock -> couple -> cap -> oxidize), repeated for each base.
- On a silicon chip, thousands to millions of distinct oligos are synthesized in parallel, each in a tiny well, with the order of bases controlled by where reagents are spotted.
- Oligos (typically ~200 nt each) are then cleaved off the chip, pooled by gene, and assembled into full genes by polymerase cycling assembly (PCR-based) or by enzymatic methods such as Gibson or Golden Gate assembly.
- Error correction by enzymatic mismatch cleavage (e.g., T7 endonuclease I) or by sequencing-and-cherry-picking error-free clones.
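One reason oligos stay around ~200 nt is that each coupling cycle is slightly less than 100% efficient, so the full-length fraction decays geometrically with length. A minimal Python sketch; the per-step efficiencies are assumed round numbers for illustration, not Twist specifications:

```python
def full_length_fraction(length_nt, coupling_efficiency):
    """Fraction of chains that are full length after length_nt - 1 couplings."""
    return coupling_efficiency ** (length_nt - 1)

for eff in (0.99, 0.995, 0.999):  # assumed per-step coupling efficiencies
    print(eff, round(full_length_fraction(200, eff), 3))
```

Small differences in per-step efficiency dominate the outcome at 200 nt, which is why assembly from shorter oligos plus error correction is preferred to synthesizing genes in one pass.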
Limitations:
- Speed: ~4-7 business days for express clonal genes at Twist; ~10 days standard. Faster than the old column-based oligo + PCR workflow, but still not “instant.”
- Accuracy: ~1 error per 5,000-10,000 bp after error correction. For genes >5 kb, error rate compounds and yield drops; this is why genome-scale synthesis is hard.
- Length: Clonal genes up to ~5 kb routinely; longer constructs require hierarchical assembly. Gene fragments are typically 300 bp - 5 kb.
- Sequence constraints: very high or very low GC, long homopolymers, large repeats, and strong hairpins remain difficult or impossible to synthesize. This is why the codon optimizer flags and edits these.
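The "error rate compounds" point can be put in numbers. A minimal Python sketch, assuming independent per-base errors and taking 1 error per 7,500 bp as a value inside the 5,000-10,000 range quoted above (the independence assumption and the specific rate are mine):

```python
def p_error_free(length_bp, errors_per_bp):
    """Probability a construct has zero errors, assuming independent per-base errors."""
    return (1 - errors_per_bp) ** length_bp

for length in (1047, 5000, 20000):  # RHO ORF, a 5 kb gene, a 20 kb construct
    print(length, round(p_error_free(length, 1 / 7500), 3))
```

Under these assumptions most ~1 kb clones are correct, while for tens of kilobases the error-free fraction drops sharply, so more clones must be sequenced per correct construct.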

5.3 DNA Edit

(i) What DNA would I want to edit and why?

I want to edit the P23H mutation in RHO – the single most common cause of adRP in North America, where a CCC (Pro) codon at position 23 is mutated to CAC (His). It is a misfolding mutation: the mutant protein aggregates in the ER, kills the rod cell, and the rod death spreads to cones – leading to progressive blindness over decades.

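On paper this looks like a base-editing target, but the reversion needed, CAC -> CCC, is an A -> C transversion (T -> G on the antisense strand). Base editors make only transitions: an adenine base editor converts A:T -> G:C and a cytosine base editor converts C:G -> T:A, so neither can restore the proline codon, and an ABE acting on this A would give CGC (Arg) instead. Precise correction of P23H therefore calls for prime editing; the ABE workflow in (ii) is what I would use for RHO mutations that are G:C -> A:T transitions, which an ABE can revert without making a double-strand break.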

(ii) Which editing technology?

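I would use an adenine base editor (ABE) – specifically ABE8e fused to a SpCas9 nickase (D10A) – delivered by subretinal injection in AAV5 (which transduces photoreceptors efficiently). Because an SpCas9-based ABE8e with its promoter and sgRNA cassette exceeds the ~4.7 kb cargo limit of one AAV, I would expect to split it across two vectors with a split intein.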

How it works:
- The Cas9 nickase is guided to the RHO locus by a single-guide RNA (sgRNA) whose 20-nt spacer matches a protospacer covering the mutated base, with a PAM (5’-NGG-3’) immediately 3’ of the protospacer.
- Cas9n binds without cutting both strands; instead it exposes the non-target strand as ssDNA in the “R-loop.”
- The tethered TadA deaminase domain (ABE8e) deaminates a target A on the exposed strand to inosine (I), which DNA polymerase reads as G. So an A:T base pair becomes a G:C base pair.
- The Cas9n nicks the unedited strand to bias mismatch repair toward keeping the edit.
Inputs and preparation:
- Guide RNA design: identify a PAM that starts ~13-17 nt 3’ of the target A, and design the 20-nt spacer so that the target A falls in the editing window (roughly positions 4-8, counting from the PAM-distal end). For the mutation being corrected, find a PAM in the surrounding sequence and design the sgRNA in silico (CRISPOR, Benchling CRISPR tool). Check for off-target sites with mismatch tolerance.
- Editor construct: ABE8e-Cas9n with a tissue-specific promoter (rhodopsin promoter itself, for rod-cell specificity).
- Delivery vector: AAV5 packaged with the editor + sgRNA. Two-vector dual-AAV split-intein systems are needed if the editor is too big for one AAV (~4.7 kb cargo limit).
- Cells/tissue: delivered in vivo to retina by subretinal injection.
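The guide-design rule above (NGG PAM, target A at positions 4-8 from the PAM-distal end) can be sketched as a simple scan. This Python is a minimal illustration, not a replacement for CRISPOR or Benchling: it scans only the given strand, does no off-target scoring, and the input is a hypothetical toy sequence rather than the RHO locus:

```python
def candidate_guides(seq, target_index, window=(4, 8)):
    """List (protospacer, position, bystanders) for NGG guides that place
    the A at seq[target_index] inside the editing window.

    Positions count from 1 at the PAM-distal end of the 20-nt protospacer.
    bystanders = other As inside the window that could also be edited.
    """
    seq = seq.upper()
    if seq[target_index] != "A":
        raise ValueError("an ABE needs an A at the target position")
    guides = []
    for pam_start in range(20, len(seq) - 2):
        if seq[pam_start + 1:pam_start + 3] != "GG":  # PAM is N-G-G
            continue
        proto_start = pam_start - 20
        position = target_index - proto_start + 1
        if window[0] <= position <= window[1]:
            proto = seq[proto_start:pam_start]
            bystanders = proto[window[0] - 1:window[1]].count("A") - 1
            guides.append((proto, position, bystanders))
    return guides

toy = "TTCTCACTTCTCTTCCTTCTTGGTTC"  # hypothetical sequence, target A at index 5
print(candidate_guides(toy, 5))
```

A full design would also scan the reverse complement and rank candidates by bystander count and predicted off-targets.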
Limitations:
- PAM dependence: SpCas9 requires NGG nearby. Not every mutation has one in range. Engineered PAM-flexible Cas9 variants (SpRY, etc.) help but reduce specificity.
- Editing window: ABEs can only flip A in a narrow window. Bystander edits (other As in the window) can introduce silent or unwanted changes – need to check.
- Off-targets: even with high-fidelity Cas9 variants, low-level off-target editing happens at sites with similar sequence. Whole-genome sequencing post-treatment is the gold standard for measuring this.
- Delivery efficiency: AAV reaches only ~5-30% of photoreceptors at typical doses. So even with 100% editing efficiency in transduced cells, you don’t fix every rod.
- One mutation per edit: base editing reverts only the specific A:T -> G:C (or C:G -> T:A for CBE). For different RHO mutations, you need different editors / guides. Prime editing is more flexible but less efficient.
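The "one mutation per edit" limit can be made concrete by listing which codons a single base-editor conversion can reach. A minimal Python sketch using the standard genetic code; it ignores PAM and editing-window constraints, so it gives an upper bound on what is reachable, and the function name is my own:

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Read on the coding strand: an ABE gives A->G (or T->C when it edits the
# other strand); a CBE gives C->T (or G->A when it edits the other strand).
TRANSITIONS = {"A": "G", "G": "A", "C": "T", "T": "C"}

def base_editable_codons(codon):
    """Codons reachable from `codon` by one base-editor transition."""
    results = []
    for i, base in enumerate(codon):
        edited = codon[:i] + TRANSITIONS[base] + codon[i + 1:]
        editor = "ABE" if base in "AT" else "CBE"
        results.append((edited, CODON_TABLE[edited], editor))
    return results

for edited, aa, editor in base_editable_codons("CAC"):  # His codon of P23H
    print(edited, aa, editor)
```

For CAC this lists TAC (Tyr), CGC (Arg), and CAT (His). CCC (Pro) is not among them because A -> C is a transversion, which is the case where prime editing's flexibility matters.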