Abhinav Rajendran — HTGAA Spring 2026


About me

Hi guys, I’m Abhi. I’m based in London, joining via the LifeFabs node. I currently spend most of my time computationally designing proteins — building AI tools for protein engineering. I’m taking HTGAA to get hands-on wet lab experience and start translating those computational designs into real biology.

Contact info

X.com

Homework

Labs

Projects

Subsections of Abhinav Rajendran — HTGAA Spring 2026

Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

  • Week 2 HW: DNA Read, Write, and Edit

  • Week 3 HW: Lab Automation

Subsections of Homework

Week 1 HW: Principles and Practices

Application: An AI Agent for Protein and Molecular Design

I’m developing an AI agent for protein and molecular design: an autonomous system that can take a high-level design brief (e.g. “design a protein that binds target X with nanomolar affinity”) and execute the full computational design pipeline of searching structure databases, running generative models, evaluating candidates, iterating on designs, and preparing sequences for synthesis. Unlike standalone models, an agent orchestrates multiple tools and makes decisions across the design cycle with minimal human intervention.

The promise is enormous: compressing weeks of expert computational work into hours, democratising access to protein engineering capabilities, and enabling rapid iteration on drug candidates, industrial enzymes, and biosensors. But agency amplifies dual-use risk. A standalone generative model requires a knowledgeable user to interpret and act on outputs. An agent that autonomously navigates the full design-to-synthesis pipeline lowers the expertise barrier dramatically. In 2022, Urbina et al. demonstrated a related concern — they inverted a drug discovery model’s objective function and generated ~40,000 molecules predicted to be more toxic than VX nerve agent, in under 6 hours. An agentic system could, in principle, not only generate such candidates but evaluate, optimise, and prepare them for ordering — all without the user needing deep domain knowledge.

Policy Goals

Primary Goal: Prevent misuse of generative biological AI while preserving its benefits

Sub-goals:

  1. Biosecurity — Prevent AI-designed biological agents (proteins, toxins, pathogens) from being created or used to cause harm
  2. Maintaining open science — Avoid governance structures so restrictive that they get in the way of legitimate research and fair access to these tools
  3. Accountability — Ensure clear chains of responsibility so that when something goes wrong, the failure can be traced to its source

Governance Actions

Action 1: Technical Screening Layer — Automated Hazard Flagging on Agent Outputs

Purpose: Currently, most generative bio-AI systems have no built-in safety filters, including emerging agentic pipelines. A user can instruct an agent to design any sequence or molecule without any check on whether the output is potentially dangerous. Many foundation model providers have some guardrails in place, but these mostly police intent rather than dangerous molecules. The problem is worse with agents than with standalone models because an agent may autonomously evaluate, refine, and prepare dangerous designs for synthesis without a human reviewing intermediate steps. I’m proposing a technical screening layer, analogous to content moderation in LLMs, that automatically flags outputs with high predicted toxicity, homology to known threat agents (select agents, toxins), or dual-use concern at multiple checkpoints in the agent’s pipeline.

Design: This requires:

  • A curated database of known threat sequences and molecular scaffolds, drawing from select agent lists and known toxin families
  • Lightweight classifier models trained to flag outputs above a risk threshold
  • Integration at the API level, so screening happens before results are returned
  • Model developers (companies like the one I work at, plus academic labs releasing open models) would need to implement this. Funding could come from existing biosecurity programmes such as UK AISI and US BARDA

Assumptions:

  • That dangerous outputs are detectable computationally. This is partially true (homology to known agents is searchable) but novel threats with no known analogues would slip through
  • That model developers will adopt this voluntarily or can be incentivised to do so
  • That the databases of known threats are comprehensive and kept current

Risks of Failure & “Success”:

  • Failure: Screening is trivially bypassed, for example by users running open-source models locally without the filter. Creates a false sense of security
  • “Success”: Over-sensitive filters block legitimate research. Researchers designing novel antimicrobials might constantly trigger toxicity flags. Could push users toward unfiltered open-source alternatives, defeating the point of the policy

Action 2: Industry API-Gated Access with Tiered Permissions

Purpose: Currently, access to powerful generative bio-AI is relatively open, including agentic systems that can autonomously execute multi-step design pipelines. Many underlying models are available as downloadable weights or through APIs with minimal identity verification. An agent that chains these models together amplifies risk because it reduces the expertise needed to go from intent to synthesis-ready design. I’m proposing a tiered access system where the level of capability scales with the user’s credentials and intended use:

  • Tier 1 (Open): In silico exploration. Anyone can query models for general protein properties, structure prediction, and basic design
  • Tier 2 (Verified): Full generative capability. Requires institutional affiliation, identity verification, and a stated research purpose
  • Tier 3 (Screened): Synthesis-coupled design. When a user wants to order synthetic DNA or protein based on AI-generated designs, synthesis providers (Twist, IDT, etc.) run additional biosecurity screening on the sequences

Design: This requires:

  • Identity verification infrastructure, which could piggyback on existing systems like ORCID for academics or institutional credentials
  • Coordination between AI model providers and DNA synthesis companies. The International Gene Synthesis Consortium (IGSC) already screens orders, but integration with upstream AI tools is new
  • Industry buy-in from model providers to gate their APIs. Companies like Anthropic have shown this is viable for language models (Claude was initially waitlisted)

Assumptions:

  • That tiering is enforceable. If model weights are open-source, gating the API is moot
  • That institutional affiliation is a reasonable proxy for trustworthiness. It’s not perfect, as state-sponsored actors have institutional credentials
  • That synthesis providers are the right chokepoint. This only works if physical synthesis remains the bottleneck, which may not hold as benchtop synthesis becomes easier

Risks of Failure & “Success”:

  • Failure: Determined bad actors route around the system entirely. Tiering only inconveniences legitimate researchers
  • “Success”: Creates a two-tier research ecosystem where well-resourced institutions have full access and smaller labs or Global South researchers are locked out, exacerbating existing inequities in biotech

Action 3: Regulatory Mandatory Dual-Use Review for Generative Bio-AI Publications and Releases

Purpose: Currently, there is no systematic requirement to assess dual-use risk before publishing generative bio-AI models, agentic systems, or their underlying datasets. The Urbina paper was itself a demonstration of how easily a published model could be repurposed, and agentic systems that chain multiple models into autonomous pipelines compound this risk by making misuse more accessible. I’m proposing mandatory dual-use risk assessments, similar to Institutional Biosafety Committee (IBC) review for wet lab work, before any generative bio-AI model, agent framework, training dataset, or capability benchmark is publicly released.

Design: This requires:

  • Expanding the remit of existing biosafety/biosecurity review bodies (such as IBCs or the UK’s ACDP) to cover computational tools, not just physical experiments
  • Developing standardised dual-use risk assessment frameworks specific to AI-bio. The existing frameworks are designed for gain-of-function wet lab work and don’t map cleanly
  • Journals and preprint servers (Nature, bioRxiv) could require evidence of dual-use review as a condition of publication, similar to ethics approval for human subjects research
  • Government funding agencies (UKRI, NIH, DARPA) could mandate dual-use review as a grant condition

Assumptions:

  • That review bodies have the technical expertise to evaluate AI model capabilities. Currently most IBCs do not
  • That pre-publication review is fast enough not to fatally slow down a fast-moving field
  • That the definition of “dual-use” can be operationalised clearly enough for consistent review decisions

Risks of Failure & “Success”:

  • Failure: Review becomes a rubber stamp. Committees lack expertise, approve everything, and the process adds bureaucratic overhead without improving safety
  • “Success”: Slows the pace of open publication enough that research moves to private industry where there’s less oversight. Creates a perverse incentive to not publish, reducing the transparency that currently helps the security community track developments

Scoring

| Does the option:                                 | Option 1 (Screening) | Option 2 (Tiered API) | Option 3 (Dual-Use Review) |
| ------------------------------------------------ | -------------------- | --------------------- | -------------------------- |
| Enhance Biosecurity                              |                      |                       |                            |
| • By preventing incidents                        | 2                    | 1                     | 2                          |
| • By helping respond                             | 2                    | 2                     | 1                          |
| Foster Lab Safety                                |                      |                       |                            |
| • By preventing incidents                        | n/a                  | n/a                   | 2                          |
| • By helping respond                             | n/a                  | n/a                   | n/a                        |
| Protect the environment                          |                      |                       |                            |
| • By preventing incidents                        | 2                    | 2                     | 2                          |
| • By helping respond                             | 3                    | 3                     | 2                          |
| Other considerations                             |                      |                       |                            |
| • Minimizing costs and burdens to stakeholders   | 1                    | 2                     | 3                          |
| • Feasibility                                    | 1                    | 2                     | 3                          |
| • Not impeding research                          | 2                    | 3                     | 2                          |
| • Promoting constructive applications            | 1                    | 2                     | 2                          |

(1 = best, 3 = worst, n/a = not applicable)

Recommendation

I would recommend prioritising a combination of Actions 1 and 2 (technical screening integrated with tiered API access), addressed to an organisation such as the UK AI Safety Institute, which I consider world-leading in this space.

Action 1 (automated screening) scores highest on feasibility and cost because it’s a technical solution that model developers can implement without legislative change. It’s the lowest-friction intervention. However, it’s insufficient alone because it’s bypassable with open-source models.

Action 2 (tiered access) addresses that gap by creating identity-linked accountability, and by integrating with the existing DNA synthesis screening infrastructure (IGSC). Together, these two actions create defence in depth: screening catches inadvertent misuse, and tiered access raises the bar for deliberate misuse.

Action 3 (mandatory dual-use review) scores well on response capability — a paper trail of risk assessments is valuable after an incident — but is the hardest to implement. The expertise gap in review bodies is real, and the risk of pushing research into less transparent private settings is significant. I’d recommend this as a medium-term goal, starting with voluntary frameworks that build capacity before mandating compliance.

Key trade-off: All three actions risk disadvantaging smaller labs and researchers who lack institutional infrastructure. Any implementation should include capacity-building provisions — for example, free verified access tiers for researchers from lower-income institutions.

Key uncertainty: The biggest unknown is how long DNA/protein synthesis remains the effective bottleneck, and whether it can even be considered one in 2026. If benchtop synthesis becomes cheap and accessible, Actions 1 and 2 lose much of their enforcement power, and the governance challenge shifts fundamentally toward the wet lab.

Week 1 Ethical Reflection

The “halfpipe of doom” was an interesting framing: the observation that powerful technologies simultaneously promise to save and to destroy the world. This isn’t new. Nuclear physics gave us both energy and bombs. Every transformative technology has this yin and yang.

This dynamic is going to accelerate in biology, and I think we are at the pivot point right now. The tools we’re learning in this course — DNA synthesis, CRISPR, protein design, autonomous AI agents that chain these together — are the biological equivalent of splitting the atom. The constructive applications are huge, but so is the potential for misuse. And unlike nuclear technology, where the materials and infrastructure required act as natural barriers, the barriers in biology are collapsing fast. AI compresses the knowledge barrier, synthesis costs keep dropping, and the biological “materials” literally self-replicate.

This reinforces why governance can’t be an afterthought bolted on after the technology matures. It needs to be designed in parallel!

Week 2 Lecture Prep

Jacobson Questions

Q1: What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?

Polymerase with its built-in proofreading has an error rate of about 1 in 10⁶ per base. The human genome is ~3.2 billion bp, so each replication would introduce ~3,200 errors. Biology fixes this with post-replication mismatch repair systems (such as MutS), which bring the effective error rate down to roughly 1 in 10⁹–10¹⁰.
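
The arithmetic here is easy to verify, using the approximate rates quoted above:

```python
# Approximate figures from the answer above: proofreading polymerase
# error rate ~1e-6 per base, human genome ~3.2e9 bp, and an effective
# post-mismatch-repair rate of ~1e-9 per base.
GENOME_BP = 3.2e9

errors_proofreading_only = GENOME_BP * 1e-6   # errors per replication, proofreading only
errors_after_repair = GENOME_BP * 1e-9        # after mismatch repair

print(round(errors_proofreading_only))  # 3200
print(round(errors_after_repair))       # 3
```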

Q2: How many different ways are there to code for an average human protein? Why don’t all of these work in practice?

An average human protein is ~345 amino acids. Most amino acids have ~3 synonymous codons, giving roughly 3³⁴⁵ possible DNA sequences for the same protein. In practice most won’t work because of codon usage bias (organisms prefer codons matched to their tRNA abundance), mRNA secondary structure affecting translation, and RNA cleavage rules.
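
The size of that sequence space can be computed exactly:

```python
import math

# Rough model from the answer above: ~3 synonymous codons per position
# over 345 residues.
n_encodings = 3 ** 345

print(len(str(n_encodings)))          # number of decimal digits: 165
print(round(345 * math.log10(3), 1))  # same magnitude via logs: 164.6
```

So there are on the order of 10¹⁶⁴ DNA sequences for one average protein.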

LeProust Questions

Q1: What’s the most commonly used method for oligo synthesis currently?

Phosphoramidite chemistry, developed by Caruthers in 1981. A four-step cycle (coupling, capping, oxidation, deblocking) repeated for each base. Used in both traditional column synthesisers and modern chip-based platforms like Twist’s silicon platform.

Q2: Why is it difficult to make oligos longer than 200nt via direct synthesis?

Coupling efficiency compounds over length. Even at ~99% per step, (0.99)²⁰⁰ ≈ 13% full-length product. Longer oligos are dominated by truncations and errors.

Q3: Why can’t you make a 2000bp gene via direct oligo synthesis?

At 2000 cycles the full-length yield is essentially zero, and with ~1:200 per-base error rate you’d average ~10 errors per molecule. Instead, genes are built by assembling shorter overlapping oligos (60–200nt) using methods like Gibson assembly, then error-corrected.
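
The yield math behind both of these answers is a one-liner; a sketch using the ~99% coupling efficiency and the error rate quoted above:

```python
# Full-length yield after n phosphoramidite coupling cycles at
# per-step efficiency p. Error count uses the ~1/200 per-base rate
# quoted above.
def full_length_yield(n, p=0.99):
    return p ** n

print(round(full_length_yield(200), 3))   # ~0.134, the ~13% quoted
print(full_length_yield(2000))            # effectively zero (~1.9e-09)
print(2000 / 200)                         # ~10 expected errors per 2000 bp molecule
```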

Church Question

Q1: What are the 10 essential amino acids in all animals, and how does this affect your view of the “Lysine Contingency”?

The 10 essential amino acids are: histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, valine, and arginine. Animals cannot synthesise these and must get them from diet.

The “Lysine Contingency” in Jurassic Park was a biocontainment strategy where dinosaurs were engineered to not synthesise lysine. The problem is that lysine is already essential in all animals. Plus lysine is abundant in normal food sources, making it useless as containment.

Week 2 HW: DNA Read, Write, and Edit

Part 1: Benchling Gel Art


Virtual restriction digest of Lambda DNA (J02459) using EcoRI, HindIII, BamHI, PstI, SalI, and XhoI, visualized in NEBcutter.

Part 3: DNA Design Challenge

3.1 Choose Your Protein

I chose endoglucanase A (CelCCA) from the bacterium Clostridium cellulolyticum. This organism has since been reclassified as Ruminiclostridium cellulolyticum. The crystal structure of its catalytic domain is in the PDB as entry 1EDG. It was solved at 1.6 angstrom resolution.

Why this protein? Cellulases break down cellulose, the most abundant organic polymer on Earth. They are important for biomass conversion, biofuel production, and the paper and textile industries. CelCCA belongs to glycosyl hydrolase family 5. It folds into a classic (alpha/beta)8 TIM barrel, one of the most common enzyme folds in nature. The protein has clear industrial relevance. It also has a high-resolution structure that will be useful for computational analysis in later weeks.

The protein sequence comes from the RCSB PDB (entry 1EDG, Chain A, UniProt: P17901). The catalytic domain is 380 amino acids long:

>pdb|1EDG|A Endoglucanase A catalytic domain, Ruminiclostridium cellulolyticum H10
MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG
IKTTKQMIDAIKQKGFNTVRIPVSWHPHVSGSDYKISDVWMNRVQEVVNYCIDNKMYVIL
NTHHDVDKVKGYFPSSQYMASSKKYITSVWAQIAARFANYDEHLIFEGMNEPRLVGHANE
WWPELTNSDVVDSINCINQLNQDFVNTVRATGGKNASRYLMCPGYVASPDGATNDYFRMP
NDISGNNNKIIVSVHAYCPWNFAGLAMADGGTNAWNINDSKDQSEVTWFMDNIYNKYTSR
GIPVIIGECGAVDKNNLKTRVEYMSYYVAQAKARGILCILWDNNNFSGTGELFGFFDRRS
CQFKFPEIIDGMVKYAFGLIN

3.2 Reverse Translate: Protein to DNA

I converted the 380-amino-acid sequence back into DNA. The genetic code is degenerate, meaning most amino acids can be encoded by multiple codons. There is no single “correct” reverse translation. I used the most common E. coli codons for each amino acid.

The resulting coding sequence is 1,140 bp (380 aa x 3 nt/aa). Adding a TAA stop codon gives 1,143 bp total:

ATGTATGATGCGAGCCTGATTCCGAACCTGCAGATTCCGCAGAAAAACATTCCGAACAAC
GATGGTATGAACTTCGTGAAAGGTCTGCGTCTGGGTTGGAACCTGGGTAACACCTTCGAT
GCGTTCAACGGTACCAACATTACCAACGAACTGGATTATGAAACCAGCTGGAGCGGTATT
AAAACCACCAAACAGATGATTGATGCGATTAAACAGAAAGGTTTCAACACCGTGCGTATT
CCGGTGAGCTGGCATCCGCATGTGAGCGGTAGCGATTATAAAATTAGCGATGTGTGGATG
AACCGTGTGCAGGAAGTGGTGAACTATTGCATTGATAACAAAATGTATGTGATTCTGAAC
ACCCATCATGATGTGGATAAAGTGAAAGGTTATTTTCCGAGCAGCCAGTATATGGCGAGC
AGCAAAAAATATATTACCAGCGTGTGGGCGCAGATTGCGGCGCGTTTCGCGAACTATGAT
GAACATCTGATTTTCGAAGGTATGAACGAACCGCGTCTGGTGGGTCATGCGAACGAATGG
TGGCCGGAACTGACCAACAGCGATGTGGTGGATAGCATTAACTGCATTAACCAGCTGAAC
CAGGATTTCGTGAACACCGTGCGTGCGACCGGTGGTAAAAACGCGAGCCGTTATCTGATG
TGCCCGGGTTATGTGGCGAGCCCGGATGGTGCGACCAACGATTATTTCCGTATGCCGAAC
GATATTAGCGGTAACAACAACAAAATTATTGTGAGCGTGCATGCGTATTGCCCGTGGAAC
TTCGCGGGTCTGGCGATGGCGGATGGTGGTACCAACGCGTGGAACATTAACGATAGCAAA
GATCAGAGCGAAGTGACCTGGTTCATGGATAACATTTATAACAAATATACCAGCCGTGGT
ATTCCGGTGATTATTGGTGAATGCGGTGCGGTGGATAAAAACAACCTGAAAACCCGTGTG
GAATATATGAGCTATTATGTGGCGCAGGCGAAAGCGCGTGGTATTCTGTGCATTCTGTGG
GATAACAACAACTTCAGCGGTACCGGTGAACTGTTCGGTTTCTTCGATCGTCGTAGCTGC
CAGTTCAAATTCCCGGAAATTATTGATGGTATGGTGAAATATGCGTTCGGTCTGATTAAC
TAA
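
As a sanity check on the mapping, the reverse translation can be sketched in a few lines. The codon table below is one reasonable “most frequent E. coli codon” choice; exact picks vary by usage table:

```python
# One high-frequency E. coli codon per amino acid (illustrative choice;
# "most common" depends on the codon-usage table consulted).
ECOLI_TOP_CODON = {
    'A': 'GCG', 'R': 'CGT', 'N': 'AAC', 'D': 'GAT', 'C': 'TGC',
    'E': 'GAA', 'Q': 'CAG', 'G': 'GGT', 'H': 'CAT', 'I': 'ATT',
    'L': 'CTG', 'K': 'AAA', 'M': 'ATG', 'F': 'TTC', 'P': 'CCG',
    'S': 'AGC', 'T': 'ACC', 'W': 'TGG', 'Y': 'TAT', 'V': 'GTG',
}

def reverse_translate(protein, stop='TAA'):
    """Map each residue to a fixed codon and append a stop codon."""
    return ''.join(ECOLI_TOP_CODON[aa] for aa in protein) + stop

# The first residues of CelCCA (MYDASLIP...) reproduce the start of the CDS:
print(reverse_translate('MYDASLIP'))  # ATGTATGATGCGAGCCTGATTCCGTAA
```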

3.3 Codon Optimization

Why codon optimization is needed: Different organisms prefer different codons for the same amino acid. A codon that is common in C. cellulolyticum might be rare in E. coli. Rare codons cause the ribosome to stall because the matching tRNA is scarce. This slows translation and reduces protein yield. Codon optimization swaps rare codons for ones the host uses frequently. This keeps the ribosome moving and increases output.

Other factors also matter. Stable mRNA hairpins near the start codon can block ribosome binding. Extreme GC content reduces expression. Certain restriction enzyme sites need to be removed for cloning compatibility.

Organism chosen: I optimized for Escherichia coli K-12. E. coli is the standard host for recombinant protein production. It grows fast, is cheap to culture, and has well-characterized genetics. CelCCA has been successfully expressed in E. coli before (Fierobe et al., 1991), so this is a validated choice.

The optimized sequence uses the highest-frequency E. coli codon for each amino acid. GC content is 47.4%, which falls in the ideal range of 40-60% for E. coli. The sequence does not contain recognition sites for common Type IIs restriction enzymes like BsaI, BsmBI, or BbsI. This keeps the sequence compatible with Golden Gate cloning.
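
The checks described above (GC content in the 40–60% window, absence of Type IIS sites) are easy to script; a minimal sketch:

```python
def gc_content(seq):
    """Fraction of G/C bases, as a percentage."""
    return 100 * (seq.count('G') + seq.count('C')) / len(seq)

def has_bsai_site(seq):
    """BsaI recognises GGTCTC; check both strands via reverse complement."""
    rc = seq.translate(str.maketrans('ACGT', 'TGCA'))[::-1]
    return 'GGTCTC' in seq or 'GGTCTC' in rc

demo = 'ATGTATGATGCGAGCCTGATTCCG'   # first 24 bp of the CDS above
print(round(gc_content(demo), 1))   # 50.0 -> inside the 40-60% window
print(has_bsai_site(demo))          # False -> Golden Gate compatible
```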

3.4 You Have a Sequence! Now What?

I would produce CelCCA using a cell-based expression system in E. coli. Here is the process:

Step 1: Build an expression construct. The codon-optimized gene goes into an expression cassette. The cassette has a promoter, a ribosome binding site (RBS), a start codon, the coding sequence, a His-tag, a stop codon, and a terminator. This cassette is cloned into a plasmid vector with an antibiotic resistance gene and an origin of replication.

Step 2: Transform into E. coli. The plasmid is introduced into competent E. coli cells through heat shock or electroporation. Cells that take up the plasmid survive on antibiotic selection plates.

Step 3: Transcription. RNA polymerase binds to the promoter. It reads the template DNA strand from 3’ to 5’ and builds a complementary mRNA strand from 5’ to 3’. Thymine (T) in DNA becomes uracil (U) in the mRNA. With a constitutive promoter like BBa_J23106, transcription runs continuously. No inducer is needed.

Step 4: Translation. The ribosome binds the RBS on the mRNA and starts at the AUG start codon. It reads the mRNA in triplets (codons). Each codon is matched by a tRNA carrying the right anticodon and amino acid. The ribosome adds each amino acid to the growing chain. Translation stops when the ribosome reaches the UAA stop codon. The finished protein is then released.

Step 5: Protein folding and purification. The polypeptide folds into its functional 3D structure (the TIM barrel). The C-terminal His-tag enables purification by immobilized metal affinity chromatography (IMAC). Ni-NTA resin binds the histidine residues. The protein is eluted with imidazole.
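
The transcription and translation steps (3 and 4) can be sketched in miniature. This uses a toy codon subset, not the full genetic code:

```python
# Toy codon subset for the demo; a real implementation needs all 64 codons.
CODON_TABLE = {'AUG': 'M', 'UAU': 'Y', 'GAU': 'D', 'GCG': 'A', 'UAA': '*'}

def transcribe(coding_strand):
    """The mRNA matches the coding strand with T replaced by U."""
    return coding_strand.replace('T', 'U')

def translate(mrna):
    """Read codons in triplets from the start until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == '*':          # stop codon: release the chain
            break
        protein.append(aa)
    return ''.join(protein)

mrna = transcribe('ATGTATGATGCGTAA')
print(mrna)             # AUGUAUGAUGCGUAA
print(translate(mrna))  # MYDA
```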

Alternative: Cell-free expression. The same construct could be used in a cell-free TX-TL system. These systems use cell extracts containing ribosomes, tRNAs, RNA polymerase, and energy sources. Cell-free expression is faster (hours instead of days) and works for toxic proteins. However, yields are lower and costs are higher at scale.

Part 4: Twist Order


For the Benchling and Twist exercise, here are the components of my expression cassette:

| Component       | Part                                      | Length    |
| --------------- | ----------------------------------------- | --------- |
| Promoter        | BBa_J23106                                | 35 bp     |
| RBS             | BBa_B0034 (with spacers)                  | 22 bp     |
| Start Codon     | ATG                                       | 3 bp      |
| Coding Sequence | CelCCA catalytic domain (codon-optimized) | 1,137 bp  |
| His Tag         | 7x His                                    | 21 bp     |
| Stop Codon      | TAA                                       | 3 bp      |
| Terminator      | BBa_B0015                                 | 129 bp    |
| Total insert    |                                           | ~1,350 bp |

I selected pTwist Amp High Copy as the vector. It carries ampicillin resistance and a high-copy-number origin (pUC ori) for good plasmid yields.

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence and why?

I would sequence environmental metagenomes from extreme environments. Good sources include the rumen of herbivorous animals and hot springs where cellulose is actively broken down. These places harbor thousands of uncultured microorganisms that make novel cellulases. Most of these organisms cannot be grown in the lab, so their enzymes remain hidden. Sequencing the total DNA from these environments reveals new enzyme variants with useful properties. These include higher thermostability, different pH optima, and better activity on crystalline cellulose.

This connects to my protein of interest. CelCCA works under moderate conditions. But industrial biomass conversion often needs enzymes that tolerate 60-80 C and low pH. Metagenomic sequencing is the fastest way to find such enzymes without culturing each organism one by one.

(ii) What sequencing technology would you use?

I would use two platforms together: Illumina short-read sequencing for accuracy and Oxford Nanopore long-read sequencing for assembly.

Illumina (second-generation sequencing):

Illumina uses sequencing-by-synthesis. The input is fragmented DNA, typically 300-600 bp fragments for metagenomic work. Library preparation adds sequencing adapters to both ends and amplifies the fragments by PCR.

The essential steps are:

  1. Adapter-ligated fragments bind to complementary oligos on a flow cell surface.
  2. Each fragment undergoes bridge amplification. This creates a cluster of about 1,000 identical copies.
  3. Fluorescent nucleotides with reversible terminators are washed over the flow cell. One nucleotide is added per strand per cycle. A camera records which color lights up at each cluster. Then the terminator is removed so the next cycle can proceed.
  4. This repeats for 150-300 cycles. The result is paired-end reads of 150-300 bp from each end of each fragment.

The output is millions to billions of short reads in FASTQ format. Each read has per-base quality scores. Error rates are about 0.1%. This is great for detecting variants and measuring abundance. The main weakness is read length. Short reads make it hard to assemble full genes from complex samples where many organisms share similar sequences.
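
The ~0.1% error rate corresponds to a Phred quality score of about Q30. FASTQ (Sanger/Illumina 1.8+) encodes per-base scores as ASCII characters with an offset of 33:

```python
def phred_to_error_prob(q):
    """Phred score Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def qual_char_to_error_prob(ch):
    """FASTQ (Phred+33) quality character -> per-base error probability."""
    return phred_to_error_prob(ord(ch) - 33)

print(phred_to_error_prob(30))       # 0.001 -> the ~0.1% quoted
print(qual_char_to_error_prob('I'))  # 'I' encodes Q40 -> 0.0001
```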

Oxford Nanopore (third-generation sequencing):

Nanopore sequences single DNA molecules. It passes each strand through a protein pore in an electrically resistant membrane. Each base disrupts the ionic current in a specific way. A neural network translates these current patterns into nucleotide sequences.

Input preparation is simple. Adapters are ligated to native DNA. No PCR is needed. This is a big advantage because it avoids GC-bias and preserves DNA methylation patterns.

Reads can be very long. Typical reads are 10 kb, and records reach 4 Mb. A whole cellulase gene cluster (often 10-30 kb) can fit in one read. This removes assembly ambiguity. The downside is accuracy. Raw single-read accuracy is about 92-98%. Recent chemistry improvements (R10.4 pores) and better base callers now push consensus accuracy above 99%.

The output is FASTQ or FAST5 files with base calls and raw signal data.

Why combine them? Nanopore reads provide scaffolding for complete gene cluster assembly. Illumina reads then “polish” the assembly to fix remaining errors. This hybrid approach gives both the length of long reads and the accuracy of short reads.

5.2 DNA Write

(i) What DNA would you want to synthesize and why?

I would synthesize a library of CelCCA variants with designed mutations. I would target residues near the catalytic glutamates (Glu170 and Glu307) and the aromatic residues lining the substrate-binding cleft. The goal is to engineer variants with better thermostability and broader substrate range for industrial use.

Rather than making one gene at a time, I would design 50-100 variants as a gene fragment library. Each variant would carry 3-10 mutations predicted by computational tools (covered in HTGAA Week 4). Each sequence would be about 1,140 bp, codon-optimized and flanked by standard assembly overlaps.

(ii) What synthesis technology would you use?

I would use chip-based oligonucleotide synthesis followed by enzymatic assembly. Twist Bioscience is one company that offers this service.

Essential steps:

  1. Oligo synthesis: Short oligos (150-300 nt) are made in parallel on a silicon chip using phosphoramidite chemistry. Each cycle adds one nucleotide in four steps. First, the DMT protecting group is removed from the 5’-OH. Second, the next phosphoramidite monomer is coupled. Third, unreacted chains are capped to prevent deletions. Fourth, the new bond is oxidized for stability. This cycle repeats once per base.

  2. Assembly: Overlapping oligos are combined and joined by overlap-extension PCR or Gibson Assembly. Adjacent oligos share complementary overlaps. They anneal together and polymerase extends them to build the full-length gene.

  3. Error correction: Each coupling step is about 99.0-99.5% efficient. Errors accumulate in longer oligos. Enzymatic mismatch cleavage or sequencing-based selection removes bad sequences. The final gene is cloned and verified by sequencing.

Limitations:

Length is the main constraint. Individual oligos work up to about 200-300 nt. Full genes up to about 5 kb can be assembled, but cost rises with length. My 1,140 bp CelCCA gene is well within the standard range.

Accuracy is also a factor. Synthesis error rates are about 1 in 300 bases before correction. After correction and clonal selection, you get essentially perfect sequences. This verification adds time and cost.

Cost has dropped a lot. Twist currently charges about $0.07 per bp for standard genes. My 1,140 bp gene costs about $80 per variant. A library of 100 variants would run about $8,000.

Turnaround is typically 2-3 weeks for clonal genes.
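
The error-rate and cost figures above reduce to quick arithmetic; all numbers are the rough estimates quoted in the text, not vendor quotes:

```python
# Back-of-envelope numbers from the text: ~1/300 per-base raw error
# rate, ~$0.07/bp, a 1,140 bp gene, and a 100-variant library.
GENE_BP, PRICE_PER_BP, ERROR_RATE = 1140, 0.07, 1 / 300

error_free_fraction = (1 - ERROR_RATE) ** GENE_BP  # before correction
cost_per_gene = GENE_BP * PRICE_PER_BP
library_cost = 100 * cost_per_gene

print(round(error_free_fraction, 3))  # ~0.022: why error correction matters
print(round(cost_per_gene, 2))        # 79.8 -> "about $80 per variant"
print(round(library_cost))            # 7980 -> "about $8,000" for the library
```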

5.3 DNA Edit

(i) What DNA would you want to edit and why?

I would edit the genome of Clostridium cellulolyticum to improve its biomass conversion ability. The wild-type strain already makes a cellulosome, a large multi-enzyme complex on its cell surface that degrades cellulose. But its productivity is limited. I would make three types of edits:

  1. Knock out carbon catabolite repression genes. This lets the organism use cellulose and other sugars at the same time instead of preferring one carbon source. It would speed up overall biomass conversion.

  2. Insert a metabolic pathway for ethanol or butanol production. This would turn C. cellulolyticum into a consolidated bioprocessing organism. It could both break down cellulose and ferment the sugars in one step. Current industrial processes need separate organisms for each step.

  3. Modify the cellulosome scaffolding protein (CipC). Adding slots for more enzyme types would let the organism degrade a wider range of plant polymers.

These edits would advance next-generation biofuel production. Going directly from raw plant waste to fuel in one organism would cut the cost of cellulosic biofuels significantly.

(ii) What editing technology would you use?

I would use CRISPR-Cas9 with homology-directed repair (HDR).

How CRISPR-Cas9 works:

  1. Design a guide RNA (sgRNA). The first 20 nucleotides are the spacer. They match the target DNA site. The rest of the RNA forms a scaffold that binds Cas9. The target must sit next to a PAM sequence. For SpCas9, the PAM is NGG (any nucleotide then two guanines).

  2. Prepare the components. The sgRNA and Cas9 protein (or a plasmid encoding both) are delivered into the cell. For C. cellulolyticum, delivery would use electroporation or conjugation. A homology donor template is also provided. This template carries the desired edit flanked by 500-1000 bp homology arms matching the regions around the cut site.

  3. Cas9 cuts the DNA. The Cas9-sgRNA complex scans the genome for matching sequences next to a PAM. When it finds a match, it unwinds the DNA and checks for complementarity. If the match is good, Cas9 makes a double-strand break 3 bp upstream of the PAM.

  4. Repair introduces the edit. The cell repairs the break using one of two pathways. Non-homologous end joining (NHEJ) is error-prone. It creates random insertions or deletions, useful for knockouts. Homology-directed repair (HDR) uses the donor template as a blueprint. This allows precise insertions, replacements, or corrections.

Inputs needed: Cas9 protein or plasmid, the sgRNA (designed computationally and synthesized), the homology donor template, and competent cells. Guide design uses tools like Benchling or CRISPOR. These tools pick sites with high on-target activity and low off-target risk.
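
The core of guide design can be sketched in a few lines: scan a target sequence for NGG PAMs and report the 20-nt spacer immediately 5′ of each. A minimal forward-strand-only illustration (the toy sequence is invented; real tools like CRISPOR also scan the reverse complement and score on-target activity and off-target risk):

```python
import re

def find_spacers(seq, spacer_len=20):
    """Return (position, spacer, PAM) for every NGG PAM with room for a full spacer.

    Only the forward strand is scanned here; a real guide-design tool also
    scans the reverse complement and scores each candidate.
    """
    seq = seq.upper()
    hits = []
    for m in re.finditer(r"(?=([ACGT]GG))", seq):  # lookahead catches overlapping PAMs
        pam_start = m.start()
        if pam_start >= spacer_len:
            spacer = seq[pam_start - spacer_len:pam_start]
            hits.append((pam_start - spacer_len, spacer, m.group(1)))
    return hits

# Toy 30-bp target region (invented for illustration)
target = "ATGCATGCATGCATGCATGCATTTAACGGG"
for pos, spacer, pam in find_spacers(target):
    print(f"spacer at {pos}: {spacer} | PAM {pam}")
```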

Limitations:

Efficiency varies by organism. CRISPR works well in E. coli and many model organisms. Clostridia are harder to edit. They have low transformation efficiency and restriction-modification systems that destroy foreign DNA. Genetic tools are also limited. Recent work on Clostridial CRISPR systems (using Cas9 or Cas12a on shuttle vectors) has improved results. But editing efficiency is still around 10-50% per target, compared to 50-90% in model organisms.

Off-target cutting is another concern. The Cas9-sgRNA complex can tolerate a few mismatches. It might cut at unintended sites. This is managed by careful guide design, high-fidelity Cas9 variants (like eSpCas9 or HiFi Cas9), and whole-genome sequencing of edited clones.
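
At its simplest, off-target screening means counting mismatches between the spacer and near-matching genomic sites. A toy sketch (sequences are invented; real pipelines use genome-wide alignment and position-weighted scoring, since PAM-proximal mismatches matter more than distal ones):

```python
def mismatches(spacer, site):
    """Count mismatched positions between a spacer and an equal-length site."""
    assert len(spacer) == len(site)
    return sum(a != b for a, b in zip(spacer, site))

def flag_off_targets(spacer, candidate_sites, max_mm=3):
    """Return sites Cas9 might still cut (tolerating up to max_mm mismatches)."""
    return [s for s in candidate_sites if 0 < mismatches(spacer, s) <= max_mm]

spacer = "GATTACAGATTACAGATTAC"  # invented 20-mer
sites = [
    "GATTACAGATTACAGATTAC",  # perfect match: the intended target, not an off-target
    "GATTACAGATTACAGATTAG",  # 1 mismatch: risky
    "CCCCCCCCCCCCCCCCCCCC",  # unrelated sequence: safe
]
print(flag_off_targets(spacer, sites))
```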

For the multiplex edits I described, I would do multiple rounds of editing in sequence. Each round targets one edit and selects for success before moving on. An alternative for E. coli would be MAGE (Multiplex Automated Genome Engineering), which makes many edits at once. But MAGE is not established in Clostridia yet, so sequential CRISPR is the practical approach.

Week 3 HW: Lab Automation

Part 1: Opentrons Art

I designed my artwork using the Automation Art GUI at opentrons-art.rcdonovan.com. I uploaded a bat image and the tool pixelated it into dispensing coordinates for three fluorescent proteins: mClover3 (green, 41 points), mRFP1 (red, 393 points), and Azurite (blue, 40 points). The design uses 0.75 µL droplet sizes at 2.2 mm spacing.

Design preview from the GUI:

Screenshot 2026-02-27 at 18.02.52.png

The GUI generated coordinate lists for each color, which were exported as a complete Python script for the Opentrons OT-2 using the 96 Deep-Well Plate download option. The script uses the Opentrons Python API (v2.20) with a P20 single-channel pipette. For each color, the robot picks up a tip, aspirates fluorescent protein from a deep-well source plate, and dispenses 0.75 µL at each coordinate relative to the center of an agar plate. It automatically refills when the pipette runs dry. Tips are changed between colors to avoid cross-contamination.

Full Opentrons Python Script
from opentrons import types
import string

metadata = {
    'protocolName': 'Abhinav Rajendran - Opentrons Art - HTGAA',
    'author': 'HTGAA',
    'source': 'HTGAA 2026',
    'apiLevel': '2.20'
}

Z_VALUE_AGAR = 2.0
POINT_SIZE = 0.75

mclover3_points = [(-7.7,34.1), (7.7,34.1), (-20.9,25.3), (20.9,25.3), (-20.9,23.1), (20.9,23.1), (-25.3,20.9), (-23.1,20.9), (23.1,20.9), (25.3,20.9), (-20.9,12.1), (20.9,12.1), (-20.9,9.9), (20.9,9.9), (-34.1,7.7), (-31.9,7.7), (-20.9,7.7), (-25.3,5.5), (-23.1,5.5), (23.1,5.5), (25.3,5.5), (-25.3,3.3), (-23.1,3.3), (23.1,3.3), (25.3,3.3), (-27.5,-3.3), (-7.7,-3.3), (27.5,-3.3), (-27.5,-5.5), (-7.7,-5.5), (27.5,-5.5), (-34.1,-7.7), (-31.9,-7.7), (20.9,-7.7), (34.1,-7.7), (-27.5,-16.5), (-27.5,-18.7), (-18.7,-20.9), (-16.5,-20.9), (16.5,-20.9), (18.7,-20.9)]
mrfp1_points = [(-5.5,34.1), (-3.3,34.1), (-1.1,34.1), (1.1,34.1), (3.3,34.1), (5.5,34.1), (-14.3,31.9), (-12.1,31.9), (-9.9,31.9), (9.9,31.9), (12.1,31.9), (14.3,31.9), (-14.3,29.7), (-12.1,29.7), (-9.9,29.7), (9.9,29.7), (12.1,29.7), (14.3,29.7), (-20.9,27.5), (-18.7,27.5), (-16.5,27.5), (16.5,27.5), (18.7,27.5), (20.9,27.5), (-25.3,25.3), (-23.1,25.3), (23.1,25.3), (25.3,25.3), (-25.3,23.1), (-23.1,23.1), (23.1,23.1), (25.3,23.1), (-27.5,20.9), (27.5,20.9), (-27.5,18.7), (27.5,18.7), (-27.5,16.5), (27.5,16.5), (-29.7,14.3), (29.7,14.3), (31.9,14.3), (-29.7,12.1), (29.7,12.1), (31.9,12.1), (-29.7,9.9), (29.7,9.9), (31.9,9.9), (-25.3,7.7), (-23.1,7.7), (-18.7,7.7), (-16.5,7.7), (16.5,7.7), (18.7,7.7), (23.1,7.7), (25.3,7.7), (34.1,7.7), (-34.1,5.5), (-31.9,5.5), (-20.9,5.5), (-18.7,5.5), (-16.5,5.5), (-12.1,5.5), (-9.9,5.5), (-7.7,5.5), (-5.5,5.5), (-3.3,5.5), (-1.1,5.5), (1.1,5.5), (3.3,5.5), (5.5,5.5), (7.7,5.5), (9.9,5.5), (12.1,5.5), (16.5,5.5), (18.7,5.5), (20.9,5.5), (34.1,5.5), (-34.1,3.3), (-31.9,3.3), (-20.9,3.3), (-18.7,3.3), (-16.5,3.3), (-12.1,3.3), (-9.9,3.3), (-7.7,3.3), (-5.5,3.3), (-3.3,3.3), (-1.1,3.3), (1.1,3.3), (3.3,3.3), (5.5,3.3), (7.7,3.3), (9.9,3.3), (12.1,3.3), (16.5,3.3), (18.7,3.3), (20.9,3.3), (34.1,3.3), (-34.1,1.1), (-31.9,1.1), (-18.7,1.1), (-16.5,1.1), (-14.3,1.1), (-12.1,1.1), (-9.9,1.1), (-7.7,1.1), (-5.5,1.1), (-3.3,1.1), (-1.1,1.1), (1.1,1.1), (3.3,1.1), (5.5,1.1), (7.7,1.1), (9.9,1.1), (12.1,1.1), (14.3,1.1), (16.5,1.1), (18.7,1.1), (34.1,1.1), (-34.1,-1.1), (-31.9,-1.1), (-14.3,-1.1), (-5.5,-1.1), (-3.3,-1.1), (-1.1,-1.1), (1.1,-1.1), (3.3,-1.1), (5.5,-1.1), (14.3,-1.1), (34.1,-1.1), (-34.1,-3.3), (-31.9,-3.3), (-25.3,-3.3), (-23.1,-3.3), (-20.9,-3.3), (-18.7,-3.3), (-16.5,-3.3), (-14.3,-3.3), (-12.1,-3.3), (-9.9,-3.3), (-5.5,-3.3), (-3.3,-3.3), (-1.1,-3.3), (1.1,-3.3), (3.3,-3.3), (5.5,-3.3), (7.7,-3.3), (9.9,-3.3), (12.1,-3.3), (14.3,-3.3), (16.5,-3.3), (18.7,-3.3), (20.9,-3.3), (23.1,-3.3), (25.3,-3.3), (34.1,-3.3), 
(-34.1,-5.5), (-31.9,-5.5), (-25.3,-5.5), (-23.1,-5.5), (-20.9,-5.5), (-18.7,-5.5), (-16.5,-5.5), (-14.3,-5.5), (-12.1,-5.5), (-9.9,-5.5), (-5.5,-5.5), (-3.3,-5.5), (-1.1,-5.5), (1.1,-5.5), (3.3,-5.5), (5.5,-5.5), (7.7,-5.5), (9.9,-5.5), (12.1,-5.5), (14.3,-5.5), (16.5,-5.5), (18.7,-5.5), (20.9,-5.5), (23.1,-5.5), (25.3,-5.5), (34.1,-5.5), (-14.3,-7.7), (-12.1,-7.7), (-9.9,-7.7), (-7.7,-7.7), (-5.5,-7.7), (-3.3,-7.7), (-1.1,-7.7), (1.1,-7.7), (3.3,-7.7), (5.5,-7.7), (7.7,-7.7), (9.9,-7.7), (12.1,-7.7), (14.3,-7.7), (-29.7,-9.9), (-12.1,-9.9), (-9.9,-9.9), (-7.7,-9.9), (-5.5,-9.9), (-3.3,-9.9), (-1.1,-9.9), (1.1,-9.9), (3.3,-9.9), (5.5,-9.9), (7.7,-9.9), (9.9,-9.9), (12.1,-9.9), (29.7,-9.9), (31.9,-9.9), (-29.7,-12.1), (-12.1,-12.1), (-9.9,-12.1), (-7.7,-12.1), (-5.5,-12.1), (-3.3,-12.1), (-1.1,-12.1), (1.1,-12.1), (3.3,-12.1), (5.5,-12.1), (7.7,-12.1), (9.9,-12.1), (12.1,-12.1), (29.7,-12.1), (31.9,-12.1), (-29.7,-14.3), (-12.1,-14.3), (-9.9,-14.3), (-7.7,-14.3), (-5.5,-14.3), (-3.3,-14.3), (-1.1,-14.3), (1.1,-14.3), (3.3,-14.3), (5.5,-14.3), (7.7,-14.3), (9.9,-14.3), (12.1,-14.3), (27.5,-14.3), (29.7,-14.3), (31.9,-14.3), (-12.1,-16.5), (-9.9,-16.5), (-7.7,-16.5), (-5.5,-16.5), (-3.3,-16.5), (-1.1,-16.5), (1.1,-16.5), (3.3,-16.5), (5.5,-16.5), (7.7,-16.5), (9.9,-16.5), (12.1,-16.5), (23.1,-16.5), (25.3,-16.5), (27.5,-16.5), (-12.1,-18.7), (-9.9,-18.7), (-7.7,-18.7), (-5.5,-18.7), (-3.3,-18.7), (-1.1,-18.7), (1.1,-18.7), (3.3,-18.7), (5.5,-18.7), (7.7,-18.7), (9.9,-18.7), (12.1,-18.7), (23.1,-18.7), (25.3,-18.7), (27.5,-18.7), (-27.5,-20.9), (-12.1,-20.9), (-9.9,-20.9), (-7.7,-20.9), (-5.5,-20.9), (-3.3,-20.9), (-1.1,-20.9), (1.1,-20.9), (3.3,-20.9), (5.5,-20.9), (7.7,-20.9), (9.9,-20.9), (12.1,-20.9), (20.9,-20.9), (23.1,-20.9), (25.3,-20.9), (27.5,-20.9), (-25.3,-23.1), (-23.1,-23.1), (-18.7,-23.1), (-16.5,-23.1), (-14.3,-23.1), (-12.1,-23.1), (-9.9,-23.1), (-7.7,-23.1), (-5.5,-23.1), (-3.3,-23.1), (-1.1,-23.1), (1.1,-23.1), (3.3,-23.1), (5.5,-23.1), (7.7,-23.1), 
(9.9,-23.1), (12.1,-23.1), (14.3,-23.1), (16.5,-23.1), (18.7,-23.1), (20.9,-23.1), (23.1,-23.1), (25.3,-23.1), (-25.3,-25.3), (-23.1,-25.3), (-18.7,-25.3), (-16.5,-25.3), (-14.3,-25.3), (-12.1,-25.3), (-9.9,-25.3), (-7.7,-25.3), (-5.5,-25.3), (-3.3,-25.3), (-1.1,-25.3), (1.1,-25.3), (3.3,-25.3), (5.5,-25.3), (7.7,-25.3), (9.9,-25.3), (12.1,-25.3), (14.3,-25.3), (16.5,-25.3), (18.7,-25.3), (20.9,-25.3), (23.1,-25.3), (25.3,-25.3), (-20.9,-27.5), (-18.7,-27.5), (-16.5,-27.5), (-14.3,-27.5), (-12.1,-27.5), (-9.9,-27.5), (-7.7,-27.5), (-5.5,-27.5), (-3.3,-27.5), (-1.1,-27.5), (1.1,-27.5), (3.3,-27.5), (5.5,-27.5), (7.7,-27.5), (9.9,-27.5), (12.1,-27.5), (14.3,-27.5), (16.5,-27.5), (18.7,-27.5), (20.9,-27.5), (-14.3,-29.7), (-12.1,-29.7), (-9.9,-29.7), (-7.7,-29.7), (-5.5,-29.7), (-3.3,-29.7), (-1.1,-29.7), (1.1,-29.7), (3.3,-29.7), (5.5,-29.7), (7.7,-29.7), (9.9,-29.7), (12.1,-29.7), (14.3,-29.7), (-5.5,-31.9), (-3.3,-31.9), (-1.1,-31.9), (1.1,-31.9), (3.3,-31.9), (5.5,-31.9), (-5.5,-34.1), (-3.3,-34.1), (-1.1,-34.1), (1.1,-34.1), (3.3,-34.1), (5.5,-34.1)]
azurite_points = [(-18.7,14.3), (-16.5,14.3), (-18.7,12.1), (-16.5,12.1), (16.5,12.1), (18.7,12.1), (-18.7,9.9), (-16.5,9.9), (16.5,9.9), (18.7,9.9), (-27.5,7.7), (-1.1,7.7), (1.1,7.7), (20.9,7.7), (27.5,7.7), (-14.3,5.5), (14.3,5.5), (-14.3,3.3), (14.3,3.3), (-20.9,1.1), (20.9,1.1), (-25.3,-1.1), (-23.1,-1.1), (-18.7,-1.1), (-16.5,-1.1), (-7.7,-1.1), (7.7,-1.1), (16.5,-1.1), (18.7,-1.1), (23.1,-1.1), (25.3,-1.1), (-20.9,-7.7), (16.5,-7.7), (18.7,-7.7), (23.1,-14.3), (25.3,-14.3), (-7.7,-31.9), (7.7,-31.9), (-7.7,-34.1), (7.7,-34.1)]

point_name_pairing = [("mclover3", mclover3_points), ("mrfp1", mrfp1_points), ("azurite", azurite_points)]

TIP_RACK_DECK_SLOT = 9
COLORS_DECK_SLOT = 6
AGAR_DECK_SLOT = 5
PIPETTE_STARTING_TIP_WELL = 'A1'

well_colors = {
    'A1': 'sfGFP', 'A2': 'mRFP1', 'A3': 'mKO2', 'A4': 'Venus',
    'A5': 'mKate2_TF', 'A6': 'Azurite', 'A7': 'mCerulean3', 'A8': 'mClover3',
    'A9': 'mJuniper', 'A10': 'mTurquoise2', 'A11': 'mBanana', 'A12': 'mPlum',
    'B1': 'Electra2', 'B2': 'mWasabi', 'B3': 'mScarlet_I', 'B4': 'mPapaya',
    'B5': 'eqFP578', 'B6': 'tdTomato', 'B7': 'DsRed', 'B8': 'mKate2',
    'B9': 'EGFP', 'B10': 'mRuby2', 'B11': 'TagBFP', 'B12': 'mChartreuse_TF',
    'C1': 'mLychee_TF', 'C2': 'mTagBFP2', 'C3': 'mEGFP', 'C4': 'mNeonGreen',
    'C5': 'mAzamiGreen', 'C6': 'mWatermelon', 'C7': 'avGFP', 'C8': 'mCitrine',
    'C9': 'mVenus', 'C10': 'mCherry', 'C11': 'mHoneydew', 'C12': 'TagRFP',
    'D1': 'mTFP1', 'D2': 'Ultramarine', 'D3': 'ZsGreen1', 'D4': 'mMiCy',
    'D5': 'mStayGold2', 'D6': 'PA_GFP'
}

volume_used = {'mclover3': 0, 'mrfp1': 0, 'azurite': 0}

def update_volume_remaining(current_color, quantity_to_aspirate):
    rows = string.ascii_uppercase
    for well, color in list(well_colors.items()):
        if color == current_color:
            if (volume_used[current_color] + quantity_to_aspirate) > 250:
                row = well[0]
                col = well[1:]
                next_row = rows[rows.index(row) + 1]
                next_well = f"{next_row}{col}"
                del well_colors[well]
                well_colors[next_well] = current_color
                volume_used[current_color] = quantity_to_aspirate
            else:
                volume_used[current_color] += quantity_to_aspirate
            break

def run(protocol):
    protocol.home()
    tips_20ul = protocol.load_labware('opentrons_96_tiprack_20ul', TIP_RACK_DECK_SLOT, 'Opentrons 20uL Tips')
    pipette_20ul = protocol.load_instrument("p20_single_gen2", "right", [tips_20ul])
    temperature_plate = protocol.load_labware('nest_96_wellplate_2ml_deep', COLORS_DECK_SLOT)
    agar_plate = protocol.load_labware('htgaa_agar_plate', AGAR_DECK_SLOT, 'Agar Plate')
    agar_plate.set_offset(x=0.00, y=0.00, z=Z_VALUE_AGAR)
    center_location = agar_plate['A1'].top()
    pipette_20ul.starting_tip = tips_20ul.well(PIPETTE_STARTING_TIP_WELL)

    def dispense_and_jog(pipette, volume, location):
        assert(isinstance(volume, (int, float)))
        above_location = location.move(types.Point(z=2))  # hover 2 mm above the dispense point
        pipette.move_to(above_location)
        pipette.dispense(volume, location)
        pipette.move_to(above_location)

    def location_of_color(color_string):
        for well, color in well_colors.items():
            if color.lower() == color_string.lower():
                return temperature_plate[well]
        raise ValueError(f"No well found with color {color_string}")

    for current_color, point_list in point_name_pairing:
        if not point_list:
            continue
        pipette_20ul.pick_up_tip()
        max_aspirate = int(18 // POINT_SIZE) * POINT_SIZE
        quantity_to_aspirate = min(len(point_list) * POINT_SIZE, max_aspirate)
        update_volume_remaining(current_color, quantity_to_aspirate)
        pipette_20ul.aspirate(quantity_to_aspirate, location_of_color(current_color))

        for i, (x, y) in enumerate(point_list):
            adjusted_location = center_location.move(types.Point(x, y))
            dispense_and_jog(pipette_20ul, POINT_SIZE, adjusted_location)
            if pipette_20ul.current_volume == 0 and i + 1 < len(point_list):
                # Refill with enough volume for the points still remaining
                quantity_to_aspirate = min(len(point_list[i + 1:]) * POINT_SIZE, max_aspirate)
                update_volume_remaining(current_color, quantity_to_aspirate)
                pipette_20ul.aspirate(quantity_to_aspirate, location_of_color(current_color))
        pipette_20ul.drop_tip()

AI use: Claude (Anthropic) was used to help document this assignment. The Python script and coordinates were generated by the opentrons-art GUI tool.

Part 2: Post-Lab Questions

Published Paper Using Automation for Biological Applications

Greenhalgh et al. (2024) published “Enabling high-throughput enzyme discovery and engineering with a low-cost, robot-assisted pipeline” in Scientific Reports. The paper describes a generalizable pipeline for high-throughput protein expression and purification using small-scale E. coli cultures and an affordable liquid-handling robot.

The platform purifies 96 proteins in parallel in deep-well plate format. The robot automates the most tedious and error-prone steps: cell lysis, Ni-NTA magnetic bead binding for His-tagged proteins, wash cycles, and elution. Each step was miniaturized from bench-scale protocols to work within 96-well plates, reducing reagent waste and eliminating manual pipetting of hundreds of small volumes. The authors demonstrated reproducibility across replicate experiments and achieved yields up to 400 µg of purified protein per well, which was sufficient for both thermostability and activity assays.

As a test case, the authors used their platform to express and purify the leading PET hydrolases (plastic-degrading enzymes) from the literature. They generated a standardized benchmark dataset comparing these enzymes under identical conditions, something that had not been done before because each enzyme had originally been characterized in a different lab with different protocols. This highlights a key advantage of automation: it removes lab-to-lab variability and lets you make fair comparisons across a protein library.

What makes this paper relevant to protein binder engineering is the generalizability of the approach. The same pipeline could be applied to screen computationally designed protein binders. AI tools like RFdiffusion and ProteinMPNN can now generate hundreds of candidate binder designs in silico, but the experimental bottleneck is expressing and testing them all. A robot-assisted pipeline like this one turns what would be weeks of manual work into a few days of automated runs, closing the gap between computational design throughput and experimental validation throughput.

The platform is built on open-source code (available on GitHub) and uses equipment accessible to most labs, making it a practical model for anyone looking to scale up protein screening.

Reference: Greenhalgh JC, Fahlberg SA, Pfleger BF, Romero PA. Enabling high-throughput enzyme discovery and engineering with a low-cost, robot-assisted pipeline. Sci Rep. 2024;14:14449. doi:10.1038/s41598-024-64938-0

Automation Plan for Final Project

My final project centers on validating computationally designed protein binders through automated experimental screening. The core idea is to use AI protein design tools to generate candidate binders against a target of interest, then use liquid-handling automation to express, purify, and assay them in high throughput.

Here is the automation workflow I would implement:

Step 1: Construct assembly (Opentrons OT-2). Synthesized binder genes (ordered from Twist as clonal fragments) are cloned into an E. coli expression vector using Golden Gate assembly. The OT-2 sets up all reactions in a 96-well plate: picking the correct gene fragment from a source plate, adding vector backbone, BsaI, T4 ligase, and buffer. This eliminates the most tedious manual step of pipetting dozens of small-volume reactions.
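
As a sanity check before loading the robot, the per-reaction reagent volumes can be scaled into a master mix in a few lines. A sketch with assumed volumes (the 10 µL reaction recipe and 10% dead-volume excess are illustrative placeholders, not a validated protocol):

```python
# Assumed 10 µL Golden Gate reaction (illustrative volumes, not a validated recipe)
RECIPE_UL = {
    "vector_backbone": 1.0,
    "BsaI": 0.5,
    "T4_DNA_ligase": 0.5,
    "T4_ligase_buffer_10x": 1.0,
    "water": 5.0,
    # the remaining 2.0 µL is the variable insert, added per well
}

def master_mix(n_reactions, excess=0.10):
    """Total µL of each shared reagent for n reactions, plus pipetting excess."""
    scale = n_reactions * (1 + excess)
    return {reagent: round(vol * scale, 1) for reagent, vol in RECIPE_UL.items()}

print(master_mix(96))  # shared-reagent volumes for a full 96-well plate
```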

Step 2: Transformation and expression (semi-manual). Assembled constructs are transformed into E. coli BL21(DE3) and plated on selective media. After colony picking, cultures are grown in 96-deep-well plates. The OT-2 handles inoculation and induction (adding IPTG at the right OD), ensuring consistent volumes across all 96 wells.

Step 3: Protein purification (Opentrons OT-2). Following the approach from Greenhalgh et al., cells are lysed and His-tagged binder proteins are purified using Ni-NTA magnetic beads in plate format. The robot performs all bead binding, wash, and elution steps. This is where automation saves the most time: manual bead purification of 96 samples takes a full day, while the robot does it in about two hours with better consistency.

Step 4: Binding assay (Opentrons OT-2 + plate reader). Purified binders are dispensed into an ELISA or bio-layer interferometry plate with immobilized target protein. The OT-2 handles serial dilutions for dose-response curves and dispenses detection reagents. A plate reader measures binding signal. Hits are ranked by apparent affinity.
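
Ranking hits from the dose-response data follows a one-site binding model, fraction bound = [L] / (Kd + [L]). A minimal pure-Python sketch that estimates an apparent EC50 by log-linear interpolation at the half-maximal signal (the dilution series and signals are invented; a real analysis would fit a 4-parameter logistic curve):

```python
import math

def ec50(concs_nM, signals):
    """Estimate apparent EC50 by log-linear interpolation at half-maximal signal.

    concs_nM must be ascending; signals are background-subtracted and
    assumed monotonic.
    """
    half = max(signals) / 2
    pairs = list(zip(concs_nM, signals))
    for (c1, s1), (c2, s2) in zip(pairs, pairs[1:]):
        if s1 <= half <= s2:
            frac = (half - s1) / (s2 - s1)
            return 10 ** (math.log10(c1) + frac * (math.log10(c2) - math.log10(c1)))
    return float("inf")  # half-max not bracketed by the tested concentrations

# Invented 8-point serial dilution and signals for one binder
concs = [0.1, 0.3, 1, 3, 10, 30, 100, 300]
signals = [0.02, 0.05, 0.15, 0.35, 0.62, 0.85, 0.95, 0.98]
print(f"apparent EC50 = {ec50(concs, signals):.1f} nM")
```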

Pseudocode:

for each binder in designed_library:
    # Build
    opentrons.golden_gate(
        insert=binder_gene_fragment,
        vector=pET29b_backbone,
        destination=assembly_plate[binder.index]
    )

    # After transformation, colony picking, growth (semi-manual)

    # Purify (fully automated)
    opentrons.lyse_cells(culture_plate[binder.index])
    opentrons.add_magnetic_beads(lysate_plate[binder.index])
    opentrons.wash(lysate_plate[binder.index], wash_buffer, n=3)
    opentrons.elute(lysate_plate[binder.index], elution_buffer)

    # Assay (fully automated)
    opentrons.serial_dilute(purified_binder, assay_plate, n_points=8)
    opentrons.add_target_protein(assay_plate)
    opentrons.add_detection_reagent(assay_plate)
    plate_reader.measure(absorbance_450nm)

# Rank binders by binding signal, feed back to computational model

This pipeline closes the design-build-test-learn loop for AI-designed protein binders. Each round of 96 binders could be screened in about one week, with the computational design of the next round starting immediately from the binding data. For a cloud lab alternative, this workflow maps well onto platforms like Ginkgo Nebula, which could handle all wet lab steps in a fully automated facility at higher throughput but with longer turnaround and higher cost per experiment.

Part 3: Final Project Ideas

Idea 1: De Novo Protein Binder Design Using AI

De Novo Protein Binders Against Snake Venom Three-Finger Toxins: Design synthetic protein binders targeting three-finger toxins (3FTx), the dominant lethal component in elapid snake venoms (cobras, kraits, mambas). 3FTx share a conserved disulfide-rich β-sheet scaffold despite sequence divergence across species, making them a good target for a broad-spectrum binder. The project would use computational protein design strategies and protein language model embeddings to generate and rank candidate binders against the conserved 3FTx fold, with top candidates validated via an automated ELISA binding screen. Snakebite envenoming kills >100,000 people/year and is a WHO-listed neglected tropical disease; current antivenoms are animal-derived, expensive, and species-specific. Computationally designed binders could enable a synthetic, broad-spectrum, low-cost alternative.

Idea 2: AI-Guided Thermostabilisation of Carbonic Anhydrase

Carbonic anhydrase catalyses the reversible hydration of CO₂ (CO₂ + H₂O ⇌ HCO₃⁻ + H⁺), a reaction with major industrial applications in carbon capture and sequestration. However, the enzyme denatures at the elevated temperatures found in industrial flue gas streams (>50°C), limiting its practical use. This project would use AI thermostability prediction models — such as ThermoMPNN (a graph neural network trained on ProteinMPNN embeddings to predict the ΔΔG of point mutations) and TemStaPro (which uses protein language model embeddings to classify thermostability across temperature thresholds) — to computationally identify stabilising mutations. The approach would generate a ranked library of single and combinatorial mutants, predict their stability profiles across a temperature range, then experimentally validate top candidates using automated activity assays at increasing temperatures. The goal is to shift the enzyme's functional temperature window above 60°C while retaining catalytic efficiency.
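
The combinatorial ranking step can be sketched by assuming point-mutation ΔΔG values combine additively, a common first approximation that ignores epistasis. The ΔΔG values below are invented placeholders, not real predictor output:

```python
from itertools import combinations

# Invented per-mutation stability predictions (kcal/mol; negative = stabilising).
# Real values would come from a predictor such as ThermoMPNN.
ddg = {"A23V": -0.8, "S71P": -0.5, "G102A": -0.3, "T145I": 0.4, "N200D": -0.2}

def rank_combinations(ddg, k=2, top=3):
    """Rank k-mutation combinations by summed ΔΔG, assuming additivity."""
    scored = [
        (sum(ddg[m] for m in combo), combo)
        for combo in combinations(sorted(ddg), k)
    ]
    return sorted(scored)[:top]  # most stabilising (most negative) first

for total, combo in rank_combinations(ddg, k=2):
    print(f"{'+'.join(combo)}: predicted ΔΔG = {total:+.1f} kcal/mol")
```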

Idea 3: Improving Cellulase Catalytic Efficiency via Directed Evolution

Engineer a cellulase with improved catalytic efficiency (kcat/Km) for cellulose hydrolysis using a directed evolution approach. Cellulases break down cellulose into fermentable sugars and are a key bottleneck in biofuel production from lignocellulosic biomass. Starting from a wild-type endoglucanase, the project would use error-prone PCR or combinatorial saturation mutagenesis to generate variant libraries, then screen them in 96-well format using an automated DNS (dinitrosalicylic acid) reducing sugar assay on the Opentrons OT-2. Top-performing variants from each round would be recombined and re-screened over multiple DBTL cycles. Unlike the other two projects which are primarily computational design-driven, this project takes a classical directed evolution approach but uses lab automation to dramatically increase screening throughput.
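
Variants from the screen would ultimately be ranked by kcat/Km, which can be extracted from rate-vs-substrate data. A pure-Python sketch using a Lineweaver-Burk linearisation (1/v = (Km/Vmax)(1/[S]) + 1/Vmax) on invented, noise-free data; a real analysis would fit the Michaelis-Menten curve directly, since double-reciprocal plots amplify noise:

```python
def michaelis_menten_fit(S, v, enzyme_conc):
    """Estimate Km, kcat, and kcat/Km from a Lineweaver-Burk least-squares line."""
    xs = [1 / s for s in S]
    ys = [1 / r for r in v]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean      # = 1/Vmax
    vmax = 1 / intercept
    km = slope * vmax                        # slope = Km/Vmax
    kcat = vmax / enzyme_conc
    return km, kcat, kcat / km

# Invented noise-free data generated from Km = 2 mM, Vmax = 10 µM/s
S = [0.5, 1, 2, 4, 8]                        # substrate, mM
v = [10 * s / (2 + s) for s in S]            # initial rates, µM/s
km, kcat, eff = michaelis_menten_fit(S, v, enzyme_conc=0.1)  # 0.1 µM enzyme
print(f"Km = {km:.2f} mM, kcat = {kcat:.1f} /s, kcat/Km = {eff:.1f}")
```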

AI use: Claude (Anthropic) was used to help draft and format this homework documentation. The Opentrons art Python script and coordinates were generated by the opentrons-art GUI tool. Web search was used to identify relevant published papers and AI thermostability tools.

Subsections of Labs

Week 1 Lab: Pipetting

cover image

Subsections of Projects

Individual Final Project

cover image

Group Final Project

cover image