<Abhinav Rajendran> — HTGAA Spring 2026


About me

Hi guys, I’m Abhi. I’m based in London, joining via the LifeFabs node. I currently spend most of my time computationally designing proteins — building AI tools for protein engineering. I’m taking HTGAA to get hands-on wet lab experience and start translating those computational designs into real biology.

Contact info

X.com

Homework

Labs

Projects

Subsections of <Abhinav Rajendran> — HTGAA Spring 2026

Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

    Application: An AI Agent for Protein and Molecular Design I’m developing an AI agent for protein and molecular design - an autonomous system that can take a high-level design brief (e.g. “design a protein that binds target X with nanomolar affinity”) and execute the full computational design pipeline: searching structure databases, running generative models, evaluating candidates, iterating on designs, and preparing sequences for synthesis. Unlike standalone models, an agent orchestrates multiple tools and makes decisions across the design cycle with minimal human intervention.

  • Week 2 HW: DNA Read, Write, and Edit

    Part 1: Benchling Gel Art Virtual restriction digest of Lambda DNA (J02459) using EcoRI, HindIII, BamHI, PstI, SalI, and XhoI, visualized in NEBcutter. Part 3: DNA Design Challenge 3.1 Choose Your Protein I chose endoglucanase A (CelCCA) from the bacterium Clostridium cellulolyticum. This organism has since been reclassified as Ruminiclostridium cellulolyticum. The crystal structure of its catalytic domain is in the PDB as entry 1EDG. It was solved at 1.6 angstrom resolution.

  • Week 3 HW: Lab Automation

    Part 1: Opentrons Art I designed my artwork using the Automation Art GUI at opentrons-art.rcdonovan.com. I uploaded a bat image and the tool pixelated it into dispensing coordinates for three fluorescent proteins: mClover3 (green, 41 points), mRFP1 (red, 393 points), and Azurite (blue, 40 points). The design uses 0.75 µL droplet sizes at 2.2 mm spacing.

  • Week 4 HW: Protein Design Part I

    Part A: Conceptual Questions (Shuguang Zhang) Q1: How many molecules of amino acids do you take with a piece of 500 grams of meat? Meat is roughly 25% protein by weight, so 500g of meat contains ~125g of protein. The average amino acid has a molecular weight of ~110 Da (daltons), i.e. ~110 g/mol. So 125g ÷ 110 g/mol ≈ 1.14 mol of amino acid residues. Multiply by Avogadro’s number (6.022 × 10²³): roughly 6.8 × 10²³ amino acid molecules, just over one mole.

  • Week 5 HW: Protein Design Part II

    Part 1: Generate Binders with PepMLM Sequence Preparation Retrieved WT SOD1 from UniProt P00441 and introduced A4V (position 5 in full sequence, position 4 in mature protein after Met cleavage): WT: M A T K A V C V L K … A4V: M A T K V V C V L K … ^ PepMLM Results Model: PepMLM-650M (ESM-2 based, Colab, T4 GPU) | Parameters: length = 12, num_binders = 4, top_k = 3

  • Week 6 HW: Genetic Circuits Part I

    Assignment: DNA Assembly 1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose? Phusion DNA Polymerase: high-fidelity polymerase with 3’→5’ proofreading activity, so it corrects errors during extension. Much lower error rate than standard Taq. dNTPs (dATP, dTTP, dCTP, dGTP): the nucleotide building blocks the polymerase adds to the growing strand. MgCl₂: magnesium ions are an essential cofactor for polymerase function. Concentration affects stringency. Reaction buffer: maintains optimal pH and salt conditions across the three PCR temperature steps. The user adds template DNA and two primers (forward and reverse) to complete the reaction.

  • Week 7 HW: Genetic Circuits Part II

    Assignment Part 1: Intracellular Artificial Neural Networks (IANNs) 1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions? Boolean circuits force biological signals into binary (high/low), but real biomarkers exist at continuous concentrations. IANNs operate on analog values, performing weighted summation and nonlinear activation (ReLU), so they can compute complex continuous functions like bandpass filters and diagonal decision boundaries. These are the kinds of input-output shapes actually needed for problems like cancer classification, where you care about relative concentration levels, not just on/off.

Subsections of Homework

Week 1 HW: Principles and Practices

Application: An AI Agent for Protein and Molecular Design

I’m developing an AI agent for protein and molecular design - an autonomous system that can take a high-level design brief (e.g. “design a protein that binds target X with nanomolar affinity”) and execute the full computational design pipeline: searching structure databases, running generative models, evaluating candidates, iterating on designs, and preparing sequences for synthesis. Unlike standalone models, an agent orchestrates multiple tools and makes decisions across the design cycle with minimal human intervention.

The promise is enormous: compressing weeks of expert computational work into hours, democratising access to protein engineering capabilities, and enabling rapid iteration on drug candidates, industrial enzymes, and biosensors. But agency amplifies dual-use risk. A standalone generative model requires a knowledgeable user to interpret and act on outputs. An agent that autonomously navigates the full design-to-synthesis pipeline lowers the expertise barrier dramatically. In 2022, Urbina et al. demonstrated a related concern — they inverted a drug discovery model’s objective function and generated ~40,000 molecules predicted to be more toxic than VX nerve agent, in under 6 hours. An agentic system could, in principle, not only generate such candidates but evaluate, optimise, and prepare them for ordering — all without the user needing deep domain knowledge.

Policy Goals

Primary Goal: Prevent misuse of generative biological AI while preserving its benefits

Sub-goals:

  1. Biosecurity — Prevent AI-designed biological agents (proteins, toxins, pathogens) from being created or used to cause harm
  2. Maintaining open science — Avoid governance structures so restrictive that they get in the way of legitimate research and fair access to these tools
  3. Accountability — Ensure clear responsibility chains so that when things go wrong, there are mechanisms for tracking where things went wrong

Governance Actions

Action 1: Technical Screening Layer — Automated Hazard Flagging on Agent Outputs

Purpose: Currently, most generative bio-AI systems, including emerging agentic pipelines, have no built-in safety filters. A user can instruct an agent to design any sequence or molecule without any check on whether the output is potentially dangerous. Many foundation model providers have some guardrails in place, but these mostly police user intent rather than screening for dangerous molecules. The problem is worse with agents than with standalone models because the agent may autonomously evaluate, refine, and prepare dangerous designs for synthesis without a human reviewing intermediate steps. I’m proposing a technical screening layer, analogous to content moderation in LLMs, that automatically flags outputs with high predicted toxicity, homology to known threat agents (select agents, toxins), or dual-use concern at multiple checkpoints in the agent’s pipeline.

Design: This requires:

  • A curated database of known threat sequences and molecular scaffolds, drawing from select agent lists and known toxin families
  • Lightweight classifier models trained to flag outputs above a risk threshold
  • Integration at the API level, so screening happens before results are returned
  • Model developers (companies like the one I work at, plus academic labs releasing open models) would need to implement this. Funding could come from existing biosecurity programmes such as UK AISI and US BARDA

Assumptions:

  • That dangerous outputs are detectable computationally. This is partially true (homology to known agents is searchable) but novel threats with no known analogues would slip through
  • That model developers will adopt this voluntarily or can be incentivised to do so
  • That the databases of known threats are comprehensive and kept current

Risks of Failure & “Success”:

  • Failure: Screening is trivially bypassed, for example by users running open-source models locally without the filter. Creates a false sense of security
  • “Success”: Over-sensitive filters block legitimate research. Researchers designing novel antimicrobials might constantly trigger toxicity flags. Could push users toward unfiltered open-source alternatives, defeating the point of the policy

Action 2: Industry API-Gated Access with Tiered Permissions

Purpose: Currently, access to powerful generative bio-AI is relatively open, including agentic systems that can autonomously execute multi-step design pipelines. Many underlying models are available as downloadable weights or through APIs with minimal identity verification. An agent that chains these models together amplifies risk because it reduces the expertise needed to go from intent to synthesis-ready design. I’m proposing a tiered access system where the level of capability scales with the user’s credentials and intended use:

  • Tier 1 (Open): In silico exploration. Anyone can query models for general protein properties, structure prediction, and basic design
  • Tier 2 (Verified): Full generative capability. Requires institutional affiliation, identity verification, and a stated research purpose
  • Tier 3 (Screened): Synthesis-coupled design. When a user wants to order synthetic DNA or protein based on AI-generated designs, synthesis providers (Twist, IDT, etc.) run additional biosecurity screening on the sequences

Design: This requires:

  • Identity verification infrastructure, which could piggyback on existing systems like ORCID for academics or institutional credentials
  • Coordination between AI model providers and DNA synthesis companies. The International Gene Synthesis Consortium (IGSC) already screens orders, but integration with upstream AI tools is new
  • Industry buy-in from model providers to gate their APIs. Companies like Anthropic have shown this is viable for language models (Claude was initially waitlisted)

Assumptions:

  • That tiering is enforceable. If model weights are open-source, gating the API is moot
  • That institutional affiliation is a reasonable proxy for trustworthiness. It’s not perfect, as state-sponsored actors have institutional credentials
  • That synthesis providers are the right chokepoint. This only works if physical synthesis remains the bottleneck, which may not hold as benchtop synthesis becomes easier

Risks of Failure & “Success”:

  • Failure: Determined bad actors route around the system entirely. Tiering only inconveniences legitimate researchers
  • “Success”: Creates a two-tier research ecosystem where well-resourced institutions have full access and smaller labs or Global South researchers are locked out, exacerbating existing inequities in biotech

Action 3: Regulatory Mandatory Dual-Use Review for Generative Bio-AI Publications and Releases

Purpose: Currently, there is no systematic requirement to assess dual-use risk before publishing generative bio-AI models, agentic systems, or their underlying datasets. The Urbina paper was itself a demonstration of how easily a published model could be repurposed, and agentic systems that chain multiple models into autonomous pipelines compound this risk by making misuse more accessible. I’m proposing mandatory dual-use risk assessments, similar to Institutional Biosafety Committee (IBC) review for wet lab work, before any generative bio-AI model, agent framework, training dataset, or capability benchmark is publicly released.

Design: This requires:

  • Expanding the remit of existing biosafety/biosecurity review bodies (such as IBCs or the UK’s ACDP) to cover computational tools, not just physical experiments
  • Developing standardised dual-use risk assessment frameworks specific to AI-bio. The existing frameworks are designed for gain-of-function wet lab work and don’t map cleanly
  • Journals and preprint servers (Nature, bioRxiv) could require evidence of dual-use review as a condition of publication, similar to ethics approval for human subjects research
  • Government funding agencies (UKRI, NIH, DARPA) could mandate dual-use review as a grant condition

Assumptions:

  • That review bodies have the technical expertise to evaluate AI model capabilities. Currently most IBCs do not
  • That pre-publication review is fast enough not to fatally slow down a fast-moving field
  • That the definition of “dual-use” can be operationalised clearly enough for consistent review decisions

Risks of Failure & “Success”:

  • Failure: Review becomes a rubber stamp. Committees lack expertise, approve everything, and the process adds bureaucratic overhead without improving safety
  • “Success”: Slows the pace of open publication enough that research moves to private industry where there’s less oversight. Creates a perverse incentive to not publish, reducing the transparency that currently helps the security community track developments

Scoring

| Does the option: | Option 1 (Screening) | Option 2 (Tiered API) | Option 3 (Dual-Use Review) |
|---|---|---|---|
| Enhance Biosecurity | | | |
| • By preventing incidents | 2 | 1 | 2 |
| • By helping respond | 2 | 2 | 1 |
| Foster Lab Safety | | | |
| • By preventing incidents | n/a | n/a | 2 |
| • By helping respond | n/a | n/a | n/a |
| Protect the environment | | | |
| • By preventing incidents | 2 | 2 | 2 |
| • By helping respond | 3 | 3 | 2 |
| Other considerations | | | |
| • Minimizing costs and burdens to stakeholders | 1 | 2 | 3 |
| • Feasibility? | 1 | 2 | 3 |
| • Not impede research | 2 | 3 | 2 |
| • Promote constructive applications | 1 | 2 | 2 |

(1 = best, 3 = worst, n/a = not applicable)

Recommendation

I would recommend prioritising a combination of Actions 1 and 2 (technical screening integrated with tiered API access), addressed to an organization like the AI Safety Institute, which is in my opinion world-leading.

Action 1 (automated screening) scores highest on feasibility and cost because it’s a technical solution that model developers can implement without legislative change. It’s the lowest-friction intervention. However, it’s insufficient alone because it’s bypassable with open-source models.

Action 2 (tiered access) addresses that gap by creating identity-linked accountability, and by integrating with the existing DNA synthesis screening infrastructure (IGSC). Together, these two actions create defence in depth: screening catches inadvertent misuse, and tiered access raises the bar for deliberate misuse.

Action 3 (mandatory dual-use review) scores well on response capability — a paper trail of risk assessments is valuable after an incident — but is the hardest to implement. The expertise gap in review bodies is real, and the risk of pushing research into less transparent private settings is significant. I’d recommend this as a medium-term goal, starting with voluntary frameworks that build capacity before mandating compliance.

Key trade-off: All three actions risk disadvantaging smaller labs and researchers who lack institutional infrastructure. Any implementation should include capacity-building provisions — for example, free verified access tiers for researchers from lower-income institutions.

Key uncertainty: The biggest unknown is how long DNA/protein synthesis remains the effective bottleneck, and whether it can even be considered a bottleneck in 2026. If benchtop synthesis becomes cheap and accessible, Actions 1 and 2 lose much of their enforcement power, and the governance challenge shifts fundamentally toward the wet lab.

Week 1 Ethical Reflection

The “halfpipe of doom” idea was interesting: the observation that powerful technologies simultaneously promise to save and destroy the world. This isn’t new. Nuclear physics gave us both energy and bombs. Every transformative technology has this yin and yang.

This is definitely going to accelerate in biology, and I think we are at the pivot point right now. The tools we’re learning in this course (DNA synthesis, CRISPR, protein design, and autonomous AI agents that chain these together) are the biological equivalent of splitting the atom. The constructive applications are huge, but so is the potential for misuse. And unlike nuclear technology, where the materials and infrastructure required act as natural barriers, the barriers in biology are collapsing fast. AI compresses the knowledge barrier, synthesis costs keep dropping, and the biological “materials” literally self-replicate.

This reinforces why governance can’t be an afterthought bolted on after the technology matures. It needs to be designed in parallel!

Week 2 Lecture Prep

Jacobson Questions

Q1: What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?

Polymerase with its built-in proofreading has an error rate of about 1 in 10⁶. The human genome is ~3.2 billion bp, so each replication would introduce ~3,200 errors. Biology fixes this with post-replication mismatch repair systems like MutS, which bring the effective error rate down to roughly 1 in 10⁹–10¹⁰.
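The arithmetic in this answer can be checked with a short Python sketch. The function and variable names are my own, and the rates are the approximate figures quoted above, not measured values.

```python
# Rough per-replication error counts for the human genome.
GENOME_BP = 3_200_000_000  # ~3.2 billion bp, as quoted above


def expected_errors(genome_bp: int, one_error_per: int) -> float:
    """Expected miscopied bases if one base in `one_error_per` is wrong."""
    return genome_bp / one_error_per


# Proofreading polymerase alone: about 1 error per 10^6 bases.
proofreading_only = expected_errors(GENOME_BP, 10**6)

# With mismatch repair: about 1 error per 10^9 to 10^10 bases.
with_repair_low = expected_errors(GENOME_BP, 10**10)
with_repair_high = expected_errors(GENOME_BP, 10**9)

print(f"Proofreading only: ~{proofreading_only:,.0f} errors per replication")
print(f"With mismatch repair: ~{with_repair_low:.2f} to {with_repair_high:.1f} errors")
```

The point of the comparison is the three to four orders of magnitude that mismatch repair removes, not the exact counts.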

Q2: How many different ways are there to code for an average human protein? Why don’t all of these work in practice?

An average human protein is ~345 amino acids. Most amino acids have ~3 synonymous codons, giving roughly 3³⁴⁵ (on the order of 10¹⁶⁴) possible DNA sequences for the same protein. In practice most won’t work because of codon usage bias (organisms prefer codons matched to their tRNA abundance), mRNA secondary structure affecting translation, and RNA cleavage rules.
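As a rough check on the size of that number, here is a small Python sketch. It computes the "3 codons per residue" estimate and, for comparison, the exact count for a given protein using the codon degeneracy of the standard genetic code. The function name and the example peptide are illustrative only.

```python
# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODON_DEGENERACY = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2, "L": 6,
    "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2,
}


def count_codings(protein: str) -> int:
    """Exact number of DNA sequences that encode `protein` (stop codon excluded)."""
    total = 1
    for residue in protein.upper():
        total *= CODON_DEGENERACY[residue]
    return total


# The back-of-envelope estimate: ~3 synonymous codons at each of 345 positions.
rough_estimate = 3 ** 345
print(f"3^345 is a {len(str(rough_estimate))}-digit number")

# Exact count for a short, arbitrary example peptide.
print(count_codings("MATKAV"))
```

Python integers are arbitrary precision, so the exact value can be computed directly rather than through logarithms.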

LeProust Questions

Q1: What’s the most commonly used method for oligo synthesis currently?

Phosphoramidite chemistry, developed by Caruthers in 1981. A four-step cycle (coupling, capping, oxidation, deblocking) repeated for each base. Used in both traditional column synthesisers and modern chip-based platforms like Twist’s silicon platform.

Q2: Why is it difficult to make oligos longer than 200nt via direct synthesis?

Coupling efficiency compounds over length. Even at ~99% per step, (0.99)²⁰⁰ ≈ 13% full-length product. Longer oligos are dominated by truncations and errors.

Q3: Why can’t you make a 2000bp gene via direct oligo synthesis?

At 2000 cycles the full-length yield is essentially zero, and with ~1:200 per-base error rate you’d average ~10 errors per molecule. Instead, genes are built by assembling shorter overlapping oligos (60–200nt) using methods like Gibson assembly, then error-corrected.
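The yield and error figures in the last two answers follow from simple compounding, which the sketch below reproduces. It assumes every coupling step succeeds independently with the same efficiency and counts one step per base, as the answers above do. The function names are my own.

```python
def full_length_yield(cycles: int, coupling_efficiency: float) -> float:
    """Fraction of chains that survive every coupling step."""
    return coupling_efficiency ** cycles


def expected_errors(length_nt: int, one_error_per: int) -> float:
    """Average number of base errors in a molecule of the given length."""
    return length_nt / one_error_per


for length in (60, 200, 2000):
    fraction = full_length_yield(length, 0.99)
    print(f"{length:>5} nt at 99% coupling: {fraction:.2e} full-length")

print(f"Errors per 2000 nt molecule at 1:200: {expected_errors(2000, 200):.0f}")
```

Because the yield falls exponentially with length, small gains in coupling efficiency matter much more for a 200-mer than for a 60-mer.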

Church Question

Q1: What are the 10 essential amino acids in all animals, and how does this affect your view of the “Lysine Contingency”?

The 10 essential amino acids are: histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, valine, and arginine. Animals cannot synthesise these and must get them from diet.

The “Lysine Contingency” in Jurassic Park was a biocontainment strategy where dinosaurs were engineered to not synthesise lysine. The problem is that lysine is already essential in all animals. Plus lysine is abundant in normal food sources, making it useless as containment.

Week 2 HW: DNA Read, Write, and Edit

Part 1: Benchling Gel Art


Virtual restriction digest of Lambda DNA (J02459) using EcoRI, HindIII, BamHI, PstI, SalI, and XhoI, visualized in NEBcutter.

Part 3: DNA Design Challenge

3.1 Choose Your Protein

I chose endoglucanase A (CelCCA) from the bacterium Clostridium cellulolyticum. This organism has since been reclassified as Ruminiclostridium cellulolyticum. The crystal structure of its catalytic domain is in the PDB as entry 1EDG. It was solved at 1.6 angstrom resolution.

Why this protein? Cellulases break down cellulose, the most abundant organic polymer on Earth. They are important for biomass conversion, biofuel production, and the paper and textile industries. CelCCA belongs to glycosyl hydrolase family 5. It folds into a classic (alpha/beta)8 TIM barrel, one of the most common enzyme folds in nature. The protein has clear industrial relevance. It also has a high-resolution structure that will be useful for computational analysis in later weeks.

The protein sequence comes from the RCSB PDB (entry 1EDG, Chain A, UniProt: P17901). The catalytic domain is 380 amino acids long:

>pdb|1EDG|A Endoglucanase A catalytic domain, Ruminiclostridium cellulolyticum H10
MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG
IKTTKQMIDAIKQKGFNTVRIPVSWHPHVSGSDYKISDVWMNRVQEVVNYCIDNKMYVIL
NTHHDVDKVKGYFPSSQYMASSKKYITSVWAQIAARFANYDEHLIFEGMNEPRLVGHANE
WWPELTNSDVVDSINCINQLNQDFVNTVRATGGKNASRYLMCPGYVASPDGATNDYFRMP
NDISGNNNKIIVSVHAYCPWNFAGLAMADGGTNAWNINDSKDQSEVTWFMDNIYNKYTSR
GIPVIIGECGAVDKNNLKTRVEYMSYYVAQAKARGILCILWDNNNFSGTGELFGFFDRRS
CQFKFPEIIDGMVKYAFGLIN

3.2 Reverse Translate: Protein to DNA

I converted the 380-amino-acid sequence back into DNA. The genetic code is degenerate, meaning most amino acids can be encoded by multiple codons. There is no single “correct” reverse translation. I used the most common E. coli codons for each amino acid.

The resulting coding sequence is 1,140 bp (380 aa x 3 nt/aa). Adding a TAA stop codon gives 1,143 bp total:

ATGTATGATGCGAGCCTGATTCCGAACCTGCAGATTCCGCAGAAAAACATTCCGAACAAC
GATGGTATGAACTTCGTGAAAGGTCTGCGTCTGGGTTGGAACCTGGGTAACACCTTCGAT
GCGTTCAACGGTACCAACATTACCAACGAACTGGATTATGAAACCAGCTGGAGCGGTATT
AAAACCACCAAACAGATGATTGATGCGATTAAACAGAAAGGTTTCAACACCGTGCGTATT
CCGGTGAGCTGGCATCCGCATGTGAGCGGTAGCGATTATAAAATTAGCGATGTGTGGATG
AACCGTGTGCAGGAAGTGGTGAACTATTGCATTGATAACAAAATGTATGTGATTCTGAAC
ACCCATCATGATGTGGATAAAGTGAAAGGTTATTTTCCGAGCAGCCAGTATATGGCGAGC
AGCAAAAAATATATTACCAGCGTGTGGGCGCAGATTGCGGCGCGTTTCGCGAACTATGAT
GAACATCTGATTTTCGAAGGTATGAACGAACCGCGTCTGGTGGGTCATGCGAACGAATGG
TGGCCGGAACTGACCAACAGCGATGTGGTGGATAGCATTAACTGCATTAACCAGCTGAAC
CAGGATTTCGTGAACACCGTGCGTGCGACCGGTGGTAAAAACGCGAGCCGTTATCTGATG
TGCCCGGGTTATGTGGCGAGCCCGGATGGTGCGACCAACGATTATTTCCGTATGCCGAAC
GATATTAGCGGTAACAACAACAAAATTATTGTGAGCGTGCATGCGTATTGCCCGTGGAAC
TTCGCGGGTCTGGCGATGGCGGATGGTGGTACCAACGCGTGGAACATTAACGATAGCAAA
GATCAGAGCGAAGTGACCTGGTTCATGGATAACATTTATAACAAATATACCAGCCGTGGT
ATTCCGGTGATTATTGGTGAATGCGGTGCGGTGGATAAAAACAACCTGAAAACCCGTGTG
GAATATATGAGCTATTATGTGGCGCAGGCGAAAGCGCGTGGTATTCTGTGCATTCTGTGG
GATAACAACAACTTCAGCGGTACCGGTGAACTGTTCGGTTTCTTCGATCGTCGTAGCTGC
CAGTTCAAATTCCCGGAAATTATTGATGGTATGGTGAAATATGCGTTCGGTCTGATTAAC
TAA
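A minimal Python sketch of this kind of reverse translation is shown below. The codon table is an assumption read off the sequence above (one preferred codon per amino acid), not an authoritative E. coli usage table. The listed sequence uses TTT at one phenylalanine position, so the output of this function can differ from it there.

```python
# One codon per amino acid, taken from the codon choices visible in the sequence above.
PREFERRED_CODON = {
    "A": "GCG", "C": "TGC", "D": "GAT", "E": "GAA", "F": "TTC",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCG", "Q": "CAG", "R": "CGT",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAT",
}


def reverse_translate(protein: str, stop_codon: str = "") -> str:
    """Map each residue to its preferred codon; optionally append a stop codon."""
    codons = [PREFERRED_CODON[residue] for residue in protein.upper()]
    return "".join(codons) + stop_codon


# First ten residues of the CelCCA catalytic domain listed earlier.
print(reverse_translate("MYDASLIPNL"))
print(len(reverse_translate("MYDASLIPNL", stop_codon="TAA")))
```

A one-codon-per-residue table is the simplest possible strategy. Real optimization tools also weigh mRNA structure, GC content, and forbidden restriction sites.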

3.3 Codon Optimization

Why codon optimization is needed: Different organisms prefer different codons for the same amino acid. A codon that is common in C. cellulolyticum might be rare in E. coli. Rare codons cause the ribosome to stall because the matching tRNA is scarce. This slows translation and reduces protein yield. Codon optimization swaps rare codons for ones the host uses frequently. This keeps the ribosome moving and increases output.

Other factors also matter. Stable mRNA hairpins near the start codon can block ribosome binding. Extreme GC content reduces expression. Certain restriction enzyme sites need to be removed for cloning compatibility.

Organism chosen: I optimized for Escherichia coli K-12. E. coli is the standard host for recombinant protein production. It grows fast, is cheap to culture, and has well-characterized genetics. CelCCA has been successfully expressed in E. coli before (Fierobe et al., 1991), so this is a validated choice.

The optimized sequence uses the highest-frequency E. coli codon for each amino acid. GC content is 47.4%, which falls in the ideal range of 40-60% for E. coli. The sequence does not contain recognition sites for common Type IIs restriction enzymes like BsaI, BsmBI, or BbsI. This keeps the sequence compatible with Golden Gate cloning.
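Checks like these are easy to script. The sketch below computes GC content and scans both strands for the BsaI (GGTCTC), BsmBI (CGTCTC), and BbsI (GAAGAC) recognition sites. The function names are my own, and the short test sequences are made up for illustration, not taken from the gene.

```python
TYPE_IIS_SITES = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC", "BbsI": "GAAGAC"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_sites(seq: str, sites: dict = TYPE_IIS_SITES) -> dict:
    """Return {enzyme: [0-based positions]} for hits on either strand."""
    seq = seq.upper()
    hits = {}
    for enzyme, site in sites.items():
        # A site on the reverse strand appears as its reverse complement on the forward strand.
        patterns = {site, reverse_complement(site)}
        positions = sorted(
            i
            for pattern in patterns
            for i in range(len(seq) - len(pattern) + 1)
            if seq.startswith(pattern, i)
        )
        if positions:
            hits[enzyme] = positions
    return hits


print(f"GC content: {gc_content('ATGTATGATGCGAGCCTG'):.1%}")
print(find_sites("AAGGTCTCAA"))
```

An empty dictionary from `find_sites` means none of the listed sites were found on either strand.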

3.4 You Have a Sequence! Now What?

I would produce CelCCA using a cell-based expression system in E. coli. Here is the process:

Step 1: Build an expression construct. The codon-optimized gene goes into an expression cassette. The cassette has a promoter, a ribosome binding site (RBS), a start codon, the coding sequence, a His-tag, a stop codon, and a terminator. This cassette is cloned into a plasmid vector with an antibiotic resistance gene and an origin of replication.

Step 2: Transform into E. coli. The plasmid is introduced into competent E. coli cells through heat shock or electroporation. Cells that take up the plasmid survive on antibiotic selection plates.

Step 3: Transcription. RNA polymerase binds to the promoter. It reads the template DNA strand from 3’ to 5’ and builds a complementary mRNA strand from 5’ to 3’. Thymine (T) in DNA becomes uracil (U) in the mRNA. With a constitutive promoter like BBa_J23106, transcription runs continuously. No inducer is needed.

Step 4: Translation. The ribosome binds the RBS on the mRNA and starts at the AUG start codon. It reads the mRNA in triplets (codons). Each codon is matched by a tRNA carrying the right anticodon and amino acid. The ribosome adds each amino acid to the growing chain. Translation stops when the ribosome reaches the UAA stop codon. The finished protein is then released.

Step 5: Protein folding and purification. The polypeptide folds into its functional 3D structure (the TIM barrel). The C-terminal His-tag enables purification by immobilized metal affinity chromatography (IMAC). Ni-NTA resin binds the histidine residues. The protein is eluted with imidazole.

Alternative: Cell-free expression. The same construct could be used in a cell-free TX-TL system. These systems use cell extracts containing ribosomes, tRNAs, RNA polymerase, and energy sources. Cell-free expression is faster (hours instead of days) and works for toxic proteins. However, yields are lower and costs are higher at scale.
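Steps 3 and 4 can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it takes the coding (non-template) strand as input, starts translation at the first AUG, and ignores the RBS, tRNA availability, and folding. The function names are my own.

```python
from itertools import product

# Standard genetic code, listed in UCAG order for each codon position ("*" = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGTATGATGCGAGCTAA")
print(mrna)
print(translate(mrna))
```

Translating the full coding sequence this way and comparing the result against the original protein is a quick check that a reverse-translated gene contains no frameshifts.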

Part 4: Twist Order


For the Benchling and Twist exercise, here are the components of my expression cassette:

| Component | Part | Length |
|---|---|---|
| Promoter | BBa_J23106 | 35 bp |
| RBS | BBa_B0034 (with spacers) | 22 bp |
| Start Codon | ATG | 3 bp |
| Coding Sequence | CelCCA catalytic domain (codon-optimized) | 1,137 bp |
| His Tag | 7x His | 21 bp |
| Stop Codon | TAA | 3 bp |
| Terminator | BBa_B0015 | 129 bp |
| Total insert | | ~1,350 bp |
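The insert total can be tallied with a few lines of Python, using the part lengths from the table above. The dictionary layout is just one convenient way to hold them.

```python
# Part lengths in bp, in 5' to 3' order, as listed in the cassette table.
CASSETTE_PARTS = {
    "Promoter (BBa_J23106)": 35,
    "RBS (BBa_B0034 with spacers)": 22,
    "Start codon (ATG)": 3,
    "Coding sequence (CelCCA, codon-optimized)": 1137,
    "His tag (7x His)": 21,
    "Stop codon (TAA)": 3,
    "Terminator (BBa_B0015)": 129,
}

total_bp = sum(CASSETTE_PARTS.values())
for part, length in CASSETTE_PARTS.items():
    print(f"{part:<45}{length:>6} bp")
print(f"{'Total insert':<45}{total_bp:>6} bp")
```

The coding sequence is 1,137 bp here rather than 1,140 bp because the start codon is counted as its own part.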

I selected pTwist Amp High Copy as the vector. It carries ampicillin resistance and a high-copy-number origin (pUC ori) for good plasmid yields.

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence and why?

I would sequence environmental metagenomes from extreme environments. Good sources include the rumen of herbivorous animals and hot springs where cellulose is actively broken down. These places harbor thousands of uncultured microorganisms that make novel cellulases. Most of these organisms cannot be grown in the lab, so their enzymes remain hidden. Sequencing the total DNA from these environments reveals new enzyme variants with useful properties. These include higher thermostability, different pH optima, and better activity on crystalline cellulose.

This connects to my protein of interest. CelCCA works under moderate conditions. But industrial biomass conversion often needs enzymes that tolerate 60-80 °C and low pH. Metagenomic sequencing is the fastest way to find such enzymes without culturing each organism one by one.

(ii) What sequencing technology would you use?

I would use two platforms together: Illumina short-read sequencing for accuracy and Oxford Nanopore long-read sequencing for assembly.

Illumina (second-generation sequencing):

Illumina uses sequencing-by-synthesis. The input is fragmented DNA, typically 300-600 bp fragments for metagenomic work. Library preparation adds sequencing adapters to both ends and amplifies the fragments by PCR.

The essential steps are:

  1. Adapter-ligated fragments bind to complementary oligos on a flow cell surface.
  2. Each fragment undergoes bridge amplification. This creates a cluster of about 1,000 identical copies.
  3. Fluorescent nucleotides with reversible terminators are washed over the flow cell. One nucleotide is added per strand per cycle. A camera records which color lights up at each cluster. Then the terminator is removed so the next cycle can proceed.
  4. This repeats for 150-300 cycles. The result is paired-end reads of 150-300 bp from each end of each fragment.

The output is millions to billions of short reads in FASTQ format. Each read has per-base quality scores. Error rates are about 0.1%. This is great for detecting variants and measuring abundance. The main weakness is read length. Short reads make it hard to assemble full genes from complex samples where many organisms share similar sequences.
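The per-base quality scores in a FASTQ file are Phred scores, where Q = -10·log10(p) and p is the estimated error probability. An error rate of about 0.1% therefore corresponds to roughly Q30. The sketch below converts between the two and decodes a quality string. It assumes the Phred+33 ASCII encoding, which is an assumption about the files rather than something stated above, and the example quality string is made up.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


def decode_quality(quality_line: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_line]


scores = decode_quality("IIII5")
mean_error = sum(phred_to_error(q) for q in scores) / len(scores)
print(scores)
print(f"Mean per-base error probability: {mean_error:.4f}")
print(f"0.1% error is about Q{error_to_phred(0.001):.0f}")
```

Averaging error probabilities, rather than averaging the Phred scores themselves, gives the correct expected error rate for a read.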

Oxford Nanopore (third-generation sequencing):

Nanopore sequences single DNA molecules. It passes each strand through a protein pore in an electrically resistant membrane. Each base disrupts the ionic current in a specific way. A neural network translates these current patterns into nucleotide sequences.

Input preparation is simple. Adapters are ligated to native DNA. No PCR is needed. This is a big advantage because it avoids GC-bias and preserves DNA methylation patterns.

Reads can be very long. Typical reads are 10 kb, and records reach 4 Mb. A whole cellulase gene cluster (often 10-30 kb) can fit in one read. This removes assembly ambiguity. The downside is accuracy. Raw single-read accuracy is about 92-98%. Recent chemistry improvements (R10.4 pores) and better base callers now push consensus accuracy above 99%.

The output is FASTQ or FAST5 files with base calls and raw signal data.

Why combine them? Nanopore reads provide scaffolding for complete gene cluster assembly. Illumina reads then “polish” the assembly to fix remaining errors. This hybrid approach gives both the length of long reads and the accuracy of short reads.

5.2 DNA Write

(i) What DNA would you want to synthesize and why?

I would synthesize a library of CelCCA variants with designed mutations. I would target residues near the catalytic glutamates (Glu170 and Glu307) and the aromatic residues lining the substrate-binding cleft. The goal is to engineer variants with better thermostability and broader substrate range for industrial use.

Rather than making one gene at a time, I would design 50-100 variants as a gene fragment library. Each variant would carry 3-10 mutations predicted by computational tools (covered in HTGAA Week 4). Each sequence would be about 1,140 bp, codon-optimized and flanked by standard assembly overlaps.

(ii) What synthesis technology would you use?

I would use chip-based oligonucleotide synthesis followed by enzymatic assembly. Twist Bioscience is one company that offers this service.

Essential steps:

  1. Oligo synthesis: Short oligos (150-300 nt) are made in parallel on a silicon chip using phosphoramidite chemistry. Each cycle adds one nucleotide in four steps. First, the DMT protecting group is removed from the 5’-OH. Second, the next phosphoramidite monomer is coupled. Third, unreacted chains are capped to prevent deletions. Fourth, the new bond is oxidized for stability. This cycle repeats once per base.

  2. Assembly: Overlapping oligos are combined and joined by overlap-extension PCR or Gibson Assembly. Adjacent oligos share complementary overlaps. They anneal together and polymerase extends them to build the full-length gene.

  3. Error correction: Each coupling step is about 99.0-99.5% efficient. Errors accumulate in longer oligos. Enzymatic mismatch cleavage or sequencing-based selection removes bad sequences. The final gene is cloned and verified by sequencing.

Limitations:

Length is the main constraint. Individual oligos work up to about 200-300 nt. Full genes up to about 5 kb can be assembled, but cost rises with length. My 1,140 bp CelCCA gene is well within the standard range.

Accuracy is also a factor. Synthesis error rates are about 1 in 300 bases before correction. After correction and clonal selection, you get essentially perfect sequences. This verification adds time and cost.

Cost has dropped a lot. Twist currently charges about $0.07 per bp for standard genes. My 1,140 bp gene costs about $80 per variant. A library of 100 variants would run about $8,000.
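
The cost estimate above is simple arithmetic; a small sketch makes it easy to rerun for other library sizes. The price per bp is the approximate figure quoted above, not a current quote, and the function name is hypothetical.

```python
PRICE_PER_BP_USD = 0.07  # approximate figure from the text, not a live quote
GENE_LENGTH_BP = 1140


def library_cost(n_variants, length_bp=GENE_LENGTH_BP, price_per_bp=PRICE_PER_BP_USD):
    """Return (cost per gene, cost for the whole library) in USD."""
    per_gene = length_bp * price_per_bp
    return per_gene, per_gene * n_variants


for n in (50, 100):
    per_gene, total = library_cost(n)
    print(f"{n} variants: ${per_gene:,.2f} per gene, ${total:,.2f} total")
```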

Turnaround is typically 2-3 weeks for clonal genes.

5.3 DNA Edit

(i) What DNA would you want to edit and why?

I would edit the genome of Clostridium cellulolyticum to improve its biomass conversion ability. The wild-type strain already makes a cellulosome, a large multi-enzyme complex on its cell surface that degrades cellulose. But its productivity is limited. I would make three types of edits:

  1. Knock out carbon catabolite repression genes. This lets the organism use cellulose and other sugars at the same time instead of preferring one carbon source. It would speed up overall biomass conversion.

  2. Insert a metabolic pathway for ethanol or butanol production. This would turn C. cellulolyticum into a consolidated bioprocessing organism. It could both break down cellulose and ferment the sugars in one step. Current industrial processes need separate organisms for each step.

  3. Modify the cellulosome scaffolding protein (CipC). Adding slots for more enzyme types would let the organism degrade a wider range of plant polymers.

These edits would advance next-generation biofuel production. Going directly from raw plant waste to fuel in one organism would cut the cost of cellulosic biofuels significantly.

(ii) What editing technology would you use?

I would use CRISPR-Cas9 with homology-directed repair (HDR).

How CRISPR-Cas9 works:

  1. Design a guide RNA (sgRNA). The first 20 nucleotides are the spacer. They match the target DNA site. The rest of the RNA forms a scaffold that binds Cas9. The target must sit next to a PAM sequence. For SpCas9, the PAM is NGG (any nucleotide then two guanines).

  2. Prepare the components. The sgRNA and Cas9 protein (or a plasmid encoding both) are delivered into the cell. For C. cellulolyticum, delivery would use electroporation or conjugation. A homology donor template is also provided. This template carries the desired edit flanked by 500-1000 bp homology arms matching the regions around the cut site.

  3. Cas9 cuts the DNA. The Cas9-sgRNA complex scans the genome for matching sequences next to a PAM. When it finds a match, it unwinds the DNA and checks for complementarity. If the match is good, Cas9 makes a double-strand break 3 bp upstream of the PAM.

  4. Repair introduces the edit. The cell repairs the break using one of two pathways. Non-homologous end joining (NHEJ) is error-prone. It creates random insertions or deletions, useful for knockouts. Homology-directed repair (HDR) uses the donor template as a blueprint. This allows precise insertions, replacements, or corrections.

Inputs needed: Cas9 protein or plasmid, the sgRNA (designed computationally and synthesized), the homology donor template, and competent cells. Guide design uses tools like Benchling or CRISPOR. These tools pick sites with high on-target activity and low off-target risk.
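
To make the spacer, PAM, and cut-site geometry concrete, here is a minimal sketch that scans one strand for SpCas9 target sites. It is not how Benchling or CRISPOR work internally: it ignores the reverse strand and does no on-target or off-target scoring, and the function name is hypothetical.

```python
import re


def find_spcas9_sites(seq, spacer_len=20):
    """List candidate SpCas9 sites (spacer followed by an NGG PAM) on the given strand."""
    seq = seq.upper()
    sites = []
    # A lookahead is used so that overlapping PAMs are all reported.
    for match in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = match.start()
        if pam_start < spacer_len:
            continue  # not enough upstream sequence for a full spacer
        sites.append({
            "spacer": seq[pam_start - spacer_len:pam_start],
            "pam": match.group(1),
            "pam_start": pam_start,
            # Cas9 cuts 3 bp upstream of the PAM, i.e. just before this index.
            "cut_index": pam_start - 3,
        })
    return sites


example = "A" * 20 + "TGG" + "CCCC"
for site in find_spcas9_sites(example):
    print(site)
```

A real design would also scan the reverse complement and rank the candidates with a dedicated tool.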

Limitations:

Efficiency varies by organism. CRISPR works well in E. coli and many model organisms. Clostridia are harder to edit. They have low transformation efficiency and restriction-modification systems that destroy foreign DNA. Genetic tools are also limited. Recent work on Clostridial CRISPR systems (using Cas9 or Cas12a on shuttle vectors) has improved results. But editing efficiency is still around 10-50% per target, compared to 50-90% in model organisms.

Off-target cutting is another concern. The Cas9-sgRNA complex can tolerate a few mismatches. It might cut at unintended sites. This is managed by careful guide design, high-fidelity Cas9 variants (like eSpCas9 or HiFi Cas9), and whole-genome sequencing of edited clones.

For the multiplex edits I described, I would edit sequentially: each round targets one edit and selects for success before moving on. In E. coli, MAGE (Multiplex Automated Genome Engineering) could make many edits at once, but MAGE is not yet established in Clostridia, so sequential CRISPR is the practical approach.

Week 3 HW: Lab Automation

Part 1: Opentrons Art

I designed my artwork using the Automation Art GUI at opentrons-art.rcdonovan.com. I uploaded a bat image and the tool pixelated it into dispensing coordinates for three fluorescent proteins: mClover3 (green, 41 points), mRFP1 (red, 393 points), and Azurite (blue, 40 points). The design uses 0.75 µL droplet sizes at 2.2 mm spacing.

Design preview from the GUI:

Screenshot 2026-02-27 at 18.02.52.png

The GUI generated coordinate lists for each color, which were exported as a complete Python script for the Opentrons OT-2 using the 96 Deep-Well Plate download option. The script uses the Opentrons Python API (v2.20) with a P20 single-channel pipette. For each color, the robot picks up a tip, aspirates fluorescent protein from a deep-well source plate, and dispenses 0.75 µL at each coordinate relative to the center of an agar plate. It automatically refills when the pipette runs dry. Tips are changed between colors to avoid cross-contamination.
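
The refill behaviour described above can be sketched without a robot. The 18 µL usable load and the point counts come from this page's script and design; the helper function is a hypothetical illustration, not part of the Opentrons API.

```python
import math

POINT_SIZE_UL = 0.75  # droplet volume used in this design
MAX_LOAD_UL = 18.0    # usable load per aspiration, as in the script's max_aspirate


def plan_refills(n_points, point_size=POINT_SIZE_UL, max_load=MAX_LOAD_UL):
    """Return (points per fill, number of aspirations, total volume in µL)."""
    points_per_fill = int(max_load // point_size)
    aspirations = math.ceil(n_points / points_per_fill)
    return points_per_fill, aspirations, n_points * point_size


for name, n_points in (("mClover3", 41), ("mRFP1", 393), ("Azurite", 40)):
    per_fill, aspirations, total_ul = plan_refills(n_points)
    print(f"{name}: {n_points} points, {aspirations} aspirations "
          f"of up to {per_fill} points, {total_ul:.2f} µL total")
```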

Full Opentrons Python Script:
from opentrons import types
import string

metadata = {
    'protocolName': 'Abhinav Rajendran - Opentrons Art - HTGAA',
    'author': 'HTGAA',
    'source': 'HTGAA 2026',
    'apiLevel': '2.20'
}

Z_VALUE_AGAR = 2.0
POINT_SIZE = 0.75

mclover3_points = [(-7.7,34.1), (7.7,34.1), (-20.9,25.3), (20.9,25.3), (-20.9,23.1), (20.9,23.1), (-25.3,20.9), (-23.1,20.9), (23.1,20.9), (25.3,20.9), (-20.9,12.1), (20.9,12.1), (-20.9,9.9), (20.9,9.9), (-34.1,7.7), (-31.9,7.7), (-20.9,7.7), (-25.3,5.5), (-23.1,5.5), (23.1,5.5), (25.3,5.5), (-25.3,3.3), (-23.1,3.3), (23.1,3.3), (25.3,3.3), (-27.5,-3.3), (-7.7,-3.3), (27.5,-3.3), (-27.5,-5.5), (-7.7,-5.5), (27.5,-5.5), (-34.1,-7.7), (-31.9,-7.7), (20.9,-7.7), (34.1,-7.7), (-27.5,-16.5), (-27.5,-18.7), (-18.7,-20.9), (-16.5,-20.9), (16.5,-20.9), (18.7,-20.9)]
mrfp1_points = [(-5.5,34.1), (-3.3,34.1), (-1.1,34.1), (1.1,34.1), (3.3,34.1), (5.5,34.1), (-14.3,31.9), (-12.1,31.9), (-9.9,31.9), (9.9,31.9), (12.1,31.9), (14.3,31.9), (-14.3,29.7), (-12.1,29.7), (-9.9,29.7), (9.9,29.7), (12.1,29.7), (14.3,29.7), (-20.9,27.5), (-18.7,27.5), (-16.5,27.5), (16.5,27.5), (18.7,27.5), (20.9,27.5), (-25.3,25.3), (-23.1,25.3), (23.1,25.3), (25.3,25.3), (-25.3,23.1), (-23.1,23.1), (23.1,23.1), (25.3,23.1), (-27.5,20.9), (27.5,20.9), (-27.5,18.7), (27.5,18.7), (-27.5,16.5), (27.5,16.5), (-29.7,14.3), (29.7,14.3), (31.9,14.3), (-29.7,12.1), (29.7,12.1), (31.9,12.1), (-29.7,9.9), (29.7,9.9), (31.9,9.9), (-25.3,7.7), (-23.1,7.7), (-18.7,7.7), (-16.5,7.7), (16.5,7.7), (18.7,7.7), (23.1,7.7), (25.3,7.7), (34.1,7.7), (-34.1,5.5), (-31.9,5.5), (-20.9,5.5), (-18.7,5.5), (-16.5,5.5), (-12.1,5.5), (-9.9,5.5), (-7.7,5.5), (-5.5,5.5), (-3.3,5.5), (-1.1,5.5), (1.1,5.5), (3.3,5.5), (5.5,5.5), (7.7,5.5), (9.9,5.5), (12.1,5.5), (16.5,5.5), (18.7,5.5), (20.9,5.5), (34.1,5.5), (-34.1,3.3), (-31.9,3.3), (-20.9,3.3), (-18.7,3.3), (-16.5,3.3), (-12.1,3.3), (-9.9,3.3), (-7.7,3.3), (-5.5,3.3), (-3.3,3.3), (-1.1,3.3), (1.1,3.3), (3.3,3.3), (5.5,3.3), (7.7,3.3), (9.9,3.3), (12.1,3.3), (16.5,3.3), (18.7,3.3), (20.9,3.3), (34.1,3.3), (-34.1,1.1), (-31.9,1.1), (-18.7,1.1), (-16.5,1.1), (-14.3,1.1), (-12.1,1.1), (-9.9,1.1), (-7.7,1.1), (-5.5,1.1), (-3.3,1.1), (-1.1,1.1), (1.1,1.1), (3.3,1.1), (5.5,1.1), (7.7,1.1), (9.9,1.1), (12.1,1.1), (14.3,1.1), (16.5,1.1), (18.7,1.1), (34.1,1.1), (-34.1,-1.1), (-31.9,-1.1), (-14.3,-1.1), (-5.5,-1.1), (-3.3,-1.1), (-1.1,-1.1), (1.1,-1.1), (3.3,-1.1), (5.5,-1.1), (14.3,-1.1), (34.1,-1.1), (-34.1,-3.3), (-31.9,-3.3), (-25.3,-3.3), (-23.1,-3.3), (-20.9,-3.3), (-18.7,-3.3), (-16.5,-3.3), (-14.3,-3.3), (-12.1,-3.3), (-9.9,-3.3), (-5.5,-3.3), (-3.3,-3.3), (-1.1,-3.3), (1.1,-3.3), (3.3,-3.3), (5.5,-3.3), (7.7,-3.3), (9.9,-3.3), (12.1,-3.3), (14.3,-3.3), (16.5,-3.3), (18.7,-3.3), (20.9,-3.3), (23.1,-3.3), (25.3,-3.3), (34.1,-3.3), 
(-34.1,-5.5), (-31.9,-5.5), (-25.3,-5.5), (-23.1,-5.5), (-20.9,-5.5), (-18.7,-5.5), (-16.5,-5.5), (-14.3,-5.5), (-12.1,-5.5), (-9.9,-5.5), (-5.5,-5.5), (-3.3,-5.5), (-1.1,-5.5), (1.1,-5.5), (3.3,-5.5), (5.5,-5.5), (7.7,-5.5), (9.9,-5.5), (12.1,-5.5), (14.3,-5.5), (16.5,-5.5), (18.7,-5.5), (20.9,-5.5), (23.1,-5.5), (25.3,-5.5), (34.1,-5.5), (-14.3,-7.7), (-12.1,-7.7), (-9.9,-7.7), (-7.7,-7.7), (-5.5,-7.7), (-3.3,-7.7), (-1.1,-7.7), (1.1,-7.7), (3.3,-7.7), (5.5,-7.7), (7.7,-7.7), (9.9,-7.7), (12.1,-7.7), (14.3,-7.7), (-29.7,-9.9), (-12.1,-9.9), (-9.9,-9.9), (-7.7,-9.9), (-5.5,-9.9), (-3.3,-9.9), (-1.1,-9.9), (1.1,-9.9), (3.3,-9.9), (5.5,-9.9), (7.7,-9.9), (9.9,-9.9), (12.1,-9.9), (29.7,-9.9), (31.9,-9.9), (-29.7,-12.1), (-12.1,-12.1), (-9.9,-12.1), (-7.7,-12.1), (-5.5,-12.1), (-3.3,-12.1), (-1.1,-12.1), (1.1,-12.1), (3.3,-12.1), (5.5,-12.1), (7.7,-12.1), (9.9,-12.1), (12.1,-12.1), (29.7,-12.1), (31.9,-12.1), (-29.7,-14.3), (-12.1,-14.3), (-9.9,-14.3), (-7.7,-14.3), (-5.5,-14.3), (-3.3,-14.3), (-1.1,-14.3), (1.1,-14.3), (3.3,-14.3), (5.5,-14.3), (7.7,-14.3), (9.9,-14.3), (12.1,-14.3), (27.5,-14.3), (29.7,-14.3), (31.9,-14.3), (-12.1,-16.5), (-9.9,-16.5), (-7.7,-16.5), (-5.5,-16.5), (-3.3,-16.5), (-1.1,-16.5), (1.1,-16.5), (3.3,-16.5), (5.5,-16.5), (7.7,-16.5), (9.9,-16.5), (12.1,-16.5), (23.1,-16.5), (25.3,-16.5), (27.5,-16.5), (-12.1,-18.7), (-9.9,-18.7), (-7.7,-18.7), (-5.5,-18.7), (-3.3,-18.7), (-1.1,-18.7), (1.1,-18.7), (3.3,-18.7), (5.5,-18.7), (7.7,-18.7), (9.9,-18.7), (12.1,-18.7), (23.1,-18.7), (25.3,-18.7), (27.5,-18.7), (-27.5,-20.9), (-12.1,-20.9), (-9.9,-20.9), (-7.7,-20.9), (-5.5,-20.9), (-3.3,-20.9), (-1.1,-20.9), (1.1,-20.9), (3.3,-20.9), (5.5,-20.9), (7.7,-20.9), (9.9,-20.9), (12.1,-20.9), (20.9,-20.9), (23.1,-20.9), (25.3,-20.9), (27.5,-20.9), (-25.3,-23.1), (-23.1,-23.1), (-18.7,-23.1), (-16.5,-23.1), (-14.3,-23.1), (-12.1,-23.1), (-9.9,-23.1), (-7.7,-23.1), (-5.5,-23.1), (-3.3,-23.1), (-1.1,-23.1), (1.1,-23.1), (3.3,-23.1), (5.5,-23.1), (7.7,-23.1), 
(9.9,-23.1), (12.1,-23.1), (14.3,-23.1), (16.5,-23.1), (18.7,-23.1), (20.9,-23.1), (23.1,-23.1), (25.3,-23.1), (-25.3,-25.3), (-23.1,-25.3), (-18.7,-25.3), (-16.5,-25.3), (-14.3,-25.3), (-12.1,-25.3), (-9.9,-25.3), (-7.7,-25.3), (-5.5,-25.3), (-3.3,-25.3), (-1.1,-25.3), (1.1,-25.3), (3.3,-25.3), (5.5,-25.3), (7.7,-25.3), (9.9,-25.3), (12.1,-25.3), (14.3,-25.3), (16.5,-25.3), (18.7,-25.3), (20.9,-25.3), (23.1,-25.3), (25.3,-25.3), (-20.9,-27.5), (-18.7,-27.5), (-16.5,-27.5), (-14.3,-27.5), (-12.1,-27.5), (-9.9,-27.5), (-7.7,-27.5), (-5.5,-27.5), (-3.3,-27.5), (-1.1,-27.5), (1.1,-27.5), (3.3,-27.5), (5.5,-27.5), (7.7,-27.5), (9.9,-27.5), (12.1,-27.5), (14.3,-27.5), (16.5,-27.5), (18.7,-27.5), (20.9,-27.5), (-14.3,-29.7), (-12.1,-29.7), (-9.9,-29.7), (-7.7,-29.7), (-5.5,-29.7), (-3.3,-29.7), (-1.1,-29.7), (1.1,-29.7), (3.3,-29.7), (5.5,-29.7), (7.7,-29.7), (9.9,-29.7), (12.1,-29.7), (14.3,-29.7), (-5.5,-31.9), (-3.3,-31.9), (-1.1,-31.9), (1.1,-31.9), (3.3,-31.9), (5.5,-31.9), (-5.5,-34.1), (-3.3,-34.1), (-1.1,-34.1), (1.1,-34.1), (3.3,-34.1), (5.5,-34.1)]
azurite_points = [(-18.7,14.3), (-16.5,14.3), (-18.7,12.1), (-16.5,12.1), (16.5,12.1), (18.7,12.1), (-18.7,9.9), (-16.5,9.9), (16.5,9.9), (18.7,9.9), (-27.5,7.7), (-1.1,7.7), (1.1,7.7), (20.9,7.7), (27.5,7.7), (-14.3,5.5), (14.3,5.5), (-14.3,3.3), (14.3,3.3), (-20.9,1.1), (20.9,1.1), (-25.3,-1.1), (-23.1,-1.1), (-18.7,-1.1), (-16.5,-1.1), (-7.7,-1.1), (7.7,-1.1), (16.5,-1.1), (18.7,-1.1), (23.1,-1.1), (25.3,-1.1), (-20.9,-7.7), (16.5,-7.7), (18.7,-7.7), (23.1,-14.3), (25.3,-14.3), (-7.7,-31.9), (7.7,-31.9), (-7.7,-34.1), (7.7,-34.1)]

point_name_pairing = [("mclover3", mclover3_points), ("mrfp1", mrfp1_points), ("azurite", azurite_points)]

TIP_RACK_DECK_SLOT = 9
COLORS_DECK_SLOT = 6
AGAR_DECK_SLOT = 5
PIPETTE_STARTING_TIP_WELL = 'A1'

well_colors = {
    'A1': 'sfGFP', 'A2': 'mRFP1', 'A3': 'mKO2', 'A4': 'Venus',
    'A5': 'mKate2_TF', 'A6': 'Azurite', 'A7': 'mCerulean3', 'A8': 'mClover3',
    'A9': 'mJuniper', 'A10': 'mTurquoise2', 'A11': 'mBanana', 'A12': 'mPlum',
    'B1': 'Electra2', 'B2': 'mWasabi', 'B3': 'mScarlet_I', 'B4': 'mPapaya',
    'B5': 'eqFP578', 'B6': 'tdTomato', 'B7': 'DsRed', 'B8': 'mKate2',
    'B9': 'EGFP', 'B10': 'mRuby2', 'B11': 'TagBFP', 'B12': 'mChartreuse_TF',
    'C1': 'mLychee_TF', 'C2': 'mTagBFP2', 'C3': 'mEGFP', 'C4': 'mNeonGreen',
    'C5': 'mAzamiGreen', 'C6': 'mWatermelon', 'C7': 'avGFP', 'C8': 'mCitrine',
    'C9': 'mVenus', 'C10': 'mCherry', 'C11': 'mHoneydew', 'C12': 'TagRFP',
    'D1': 'mTFP1', 'D2': 'Ultramarine', 'D3': 'ZsGreen1', 'D4': 'mMiCy',
    'D5': 'mStayGold2', 'D6': 'PA_GFP'
}

volume_used = {'mclover3': 0, 'mrfp1': 0, 'azurite': 0}

def update_volume_remaining(current_color, quantity_to_aspirate):
    rows = string.ascii_uppercase
    for well, color in list(well_colors.items()):
        if color == current_color:
            if (volume_used[current_color] + quantity_to_aspirate) > 250:
                row = well[0]
                col = well[1:]
                next_row = rows[rows.index(row) + 1]
                next_well = f"{next_row}{col}"
                del well_colors[well]
                well_colors[next_well] = current_color
                volume_used[current_color] = quantity_to_aspirate
            else:
                volume_used[current_color] += quantity_to_aspirate
            break

def run(protocol):
    protocol.home()
    tips_20ul = protocol.load_labware('opentrons_96_tiprack_20ul', TIP_RACK_DECK_SLOT, 'Opentrons 20uL Tips')
    pipette_20ul = protocol.load_instrument("p20_single_gen2", "right", [tips_20ul])
    temperature_plate = protocol.load_labware('nest_96_wellplate_2ml_deep', COLORS_DECK_SLOT)
    agar_plate = protocol.load_labware('htgaa_agar_plate', AGAR_DECK_SLOT, 'Agar Plate')
    agar_plate.set_offset(x=0.00, y=0.00, z=Z_VALUE_AGAR)
    center_location = agar_plate['A1'].top()
    pipette_20ul.starting_tip = tips_20ul.well(PIPETTE_STARTING_TIP_WELL)

    def dispense_and_jog(pipette, volume, location):
        assert(isinstance(volume, (int, float)))
        above_location = location.move(types.Point(z=location.point.z + 2))
        pipette.move_to(above_location)
        pipette.dispense(volume, location)
        pipette.move_to(above_location)

    def location_of_color(color_string):
        for well, color in well_colors.items():
            if color.lower() == color_string.lower():
                return temperature_plate[well]
        raise ValueError(f"No well found with color {color_string}")

    for i, (current_color, point_list) in enumerate(point_name_pairing):
        if not point_list:
            continue
        pipette_20ul.pick_up_tip()
        max_aspirate = int(18 // POINT_SIZE) * POINT_SIZE
        quantity_to_aspirate = min(len(point_list) * POINT_SIZE, max_aspirate)
        update_volume_remaining(current_color, quantity_to_aspirate)
        pipette_20ul.aspirate(quantity_to_aspirate, location_of_color(current_color))

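        for i in range(len(point_list)):
            x, y = point_list[i]
            # Offset each point from the plate center by its (x, y) coordinate in mm
            adjusted_location = center_location.move(types.Point(x, y))
            dispense_and_jog(pipette_20ul, POINT_SIZE, adjusted_location)
            # Refill only when the tip is empty and points remain after this one
            remaining_points = len(point_list) - (i + 1)
            if pipette_20ul.current_volume == 0 and remaining_points > 0:
                # Aspirate just enough for the remaining points, capped at one full load
                quantity_to_aspirate = min(remaining_points * POINT_SIZE, max_aspirate)
                update_volume_remaining(current_color, quantity_to_aspirate)
                pipette_20ul.aspirate(quantity_to_aspirate, location_of_color(current_color))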
        pipette_20ul.drop_tip()

AI use: Claude (Anthropic) was used to help document this assignment. The Python script and coordinates were generated by the opentrons-art GUI tool.

Part 2: Post-Lab Questions

Published Paper Using Automation for Biological Applications

Greenhalgh et al. (2024) published “Enabling high-throughput enzyme discovery and engineering with a low-cost, robot-assisted pipeline” in Scientific Reports. The paper describes a generalizable pipeline for high-throughput protein expression and purification using small-scale E. coli cultures and an affordable liquid-handling robot.

The platform purifies 96 proteins in parallel in deep-well plate format. The robot automates the most tedious and error-prone steps: cell lysis, Ni-NTA magnetic bead binding for His-tagged proteins, wash cycles, and elution. Each step was miniaturized from bench-scale protocols to work within 96-well plates, reducing reagent waste and eliminating manual pipetting of hundreds of small volumes. The authors demonstrated reproducibility across replicate experiments and achieved yields up to 400 µg of purified protein per well, which was sufficient for both thermostability and activity assays.

As a test case, the authors used their platform to express and purify the leading PET hydrolases (plastic-degrading enzymes) from the literature. They generated a standardized benchmark dataset comparing these enzymes under identical conditions, something that had not been done before because each enzyme had originally been characterized in a different lab with different protocols. This highlights a key advantage of automation: it removes lab-to-lab variability and lets you make fair comparisons across a protein library.

What makes this paper relevant to protein binder engineering is the generalizability of the approach. The same pipeline could be applied to screen computationally designed protein binders. AI tools like RFdiffusion and ProteinMPNN can now generate hundreds of candidate binder designs in silico, but the experimental bottleneck is expressing and testing them all. A robot-assisted pipeline like this one turns what would be weeks of manual work into a few days of automated runs, closing the gap between computational design throughput and experimental validation throughput.

The platform is built on open-source code (available on GitHub) and uses equipment accessible to most labs, making it a practical model for anyone looking to scale up protein screening.

Reference: Greenhalgh JC, Fahlberg SA, Pfleger BF, Romero PA. Enabling high-throughput enzyme discovery and engineering with a low-cost, robot-assisted pipeline. Sci Rep. 2024;14:14449. doi:10.1038/s41598-024-64938-0

Automation Plan for Final Project

My final project centers on validating computationally designed protein binders through automated experimental screening. The core idea is to use AI protein design tools to generate candidate binders against a target of interest, then use liquid-handling automation to express, purify, and assay them in high throughput.

Here is the automation workflow I would implement:

Step 1: Construct assembly (Opentrons OT-2). Synthesized binder genes (ordered from Twist as clonal fragments) are cloned into an E. coli expression vector using Golden Gate assembly. The OT-2 sets up all reactions in a 96-well plate: picking the correct gene fragment from a source plate, adding vector backbone, BsaI, T4 ligase, and buffer. This eliminates the most tedious manual step of pipetting dozens of small-volume reactions.

Step 2: Transformation and expression (semi-manual). Assembled constructs are transformed into E. coli BL21(DE3) and plated on selective media. After colony picking, cultures are grown in 96-deep-well plates. The OT-2 handles inoculation and induction (adding IPTG at the right OD), ensuring consistent volumes across all 96 wells.

Step 3: Protein purification (Opentrons OT-2). Following the approach from Greenhalgh et al., cells are lysed and His-tagged binder proteins are purified using Ni-NTA magnetic beads in plate format. The robot performs all bead binding, wash, and elution steps. This is where automation saves the most time: manual bead purification of 96 samples takes a full day, while the robot does it in about two hours with better consistency.

Step 4: Binding assay (Opentrons OT-2 + plate reader). Purified binders are dispensed into an ELISA or bio-layer interferometry plate with immobilized target protein. The OT-2 handles serial dilutions for dose-response curves and dispenses detection reagents. A plate reader measures binding signal. Hits are ranked by apparent affinity.
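
The serial dilution in Step 4 can be sketched as a simple concentration series. The starting concentration, the two-fold dilution factor, and the function name are all assumptions for illustration; the actual values would depend on the expected affinity range.

```python
def serial_dilution(start_conc_nm, dilution_factor, n_points):
    """Concentrations (nM) for an n-point serial dilution, highest first."""
    return [start_conc_nm / dilution_factor ** i for i in range(n_points)]


# Assumed example: 8-point, 2-fold series starting at 1000 nM
for well, conc in enumerate(serial_dilution(1000.0, 2, 8), start=1):
    print(f"Dilution {well}: {conc:g} nM")
```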

Pseudocode:

for each binder in designed_library:
    # Build
    opentrons.golden_gate(
        insert=binder_gene_fragment,
        vector=pET29b_backbone,
        destination=assembly_plate[binder.index]
    )

    # After transformation, colony picking, growth (semi-manual)

    # Purify (fully automated)
    opentrons.lyse_cells(culture_plate[binder.index])
    opentrons.add_magnetic_beads(lysate_plate[binder.index])
    opentrons.wash(lysate_plate[binder.index], wash_buffer, n=3)
    opentrons.elute(lysate_plate[binder.index], elution_buffer)

    # Assay (fully automated)
    opentrons.serial_dilute(purified_binder, assay_plate, n_points=8)
    opentrons.add_target_protein(assay_plate)
    opentrons.add_detection_reagent(assay_plate)
    plate_reader.measure(absorbance_450nm)

# Rank binders by binding signal, feed back to computational model

This pipeline closes the design-build-test-learn loop for AI-designed protein binders. Each round of 96 binders could be screened in about one week, with the computational design of the next round starting immediately from the binding data. For a cloud lab alternative, this workflow maps well onto platforms like Ginkgo Nebula, which could handle all wet lab steps in a fully automated facility at higher throughput but with longer turnaround and higher cost per experiment.

Part 3: Final Project Ideas

Idea 1: De Novo Protein Binder Design Using AI

De Novo Protein Binders Against Snake Venom Three-Finger Toxins: Design synthetic protein binders targeting three-finger toxins (3FTx), the dominant lethal component in elapid snake venoms (cobras, kraits, mambas). 3FTx share a conserved disulfide-rich β-sheet scaffold despite sequence divergence across species, making them a good target for a broad-spectrum binder. The project would use computational protein design strategies and protein language model embeddings to generate and rank candidate binders against the conserved 3FTx fold. Top candidates would be validated in an automated ELISA binding screen. Snakebite envenoming kills more than 100,000 people per year and is a WHO-listed neglected tropical disease; current antivenoms are animal-derived, expensive, and species-specific. Computationally designed binders could enable a synthetic, broad-spectrum, low-cost alternative.

Idea 2: AI-Guided Thermostabilisation of Carbonic Anhydrase

Carbonic anhydrase catalyses the reversible hydration of CO₂ (CO₂ + H₂O ⇌ HCO₃⁻ + H⁺), a reaction with major industrial applications in carbon capture and sequestration. However, the enzyme denatures at the elevated temperatures found in industrial flue gas streams (>50°C), limiting its practical use. This project would use AI thermostability prediction models to computationally identify stabilising mutations. Two candidates are ThermoMPNN (a graph neural network trained on ProteinMPNN embeddings to predict ΔΔG° for point mutations) and TemStaPro (which uses protein language model embeddings to classify thermostability across temperature thresholds). The approach would generate a ranked library of single and combinatorial mutants, predict their stability profiles across a temperature range, then experimentally validate top candidates using automated activity assays at increasing temperatures. The goal is to shift the enzyme’s functional temperature window above 60°C while retaining catalytic efficiency.

Idea 3: Improving Cellulase Catalytic Efficiency via Directed Evolution

Engineer a cellulase with improved catalytic efficiency (kcat/Km) for cellulose hydrolysis using a directed evolution approach. Cellulases break down cellulose into fermentable sugars and are a key bottleneck in biofuel production from lignocellulosic biomass. Starting from a wild-type endoglucanase, the project would use error-prone PCR or combinatorial saturation mutagenesis to generate variant libraries, then screen them in 96-well format using an automated DNS (dinitrosalicylic acid) reducing sugar assay on the Opentrons OT-2. Top-performing variants from each round would be recombined and re-screened over multiple DBTL cycles. Unlike the other two projects which are primarily computational design-driven, this project takes a classical directed evolution approach but uses lab automation to dramatically increase screening throughput.

AI use: Claude (Anthropic) was used to help draft and format this homework documentation. The Opentrons art Python script and coordinates were generated by the opentrons-art GUI tool. Web search was used to identify relevant published papers and AI thermostability tools.

Week 4 HW: Protein Design Part I

Part A: Conceptual Questions (Shuguang Zhang)

Q1: How many molecules of amino acids do you take with a piece of 500 grams of meat?

Meat is roughly 25% protein by weight, so 500g of meat contains ~125g of protein. The average amino acid has a molecular weight of ~110 Da (daltons), i.e. ~110 g/mol. So 125g ÷ 110 g/mol ≈ 1.14 mol of amino acid residues. Multiply by Avogadro’s number (6.022 × 10²³): roughly 6.8 × 10²³ amino acid molecules, just over one mole.
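
The same estimate as a short calculation, so the assumptions (25% protein by weight, ~110 g/mol per residue) are easy to vary. The function name is hypothetical.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acids_in_meat(meat_g, protein_fraction=0.25, residue_mass_g_per_mol=110.0):
    """Return (moles, number of molecules) of amino acid residues in a portion of meat."""
    moles = meat_g * protein_fraction / residue_mass_g_per_mol
    return moles, moles * AVOGADRO


moles, molecules = amino_acids_in_meat(500)
print(f"{moles:.2f} mol, about {molecules:.2e} amino acid residues")
```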

Q2: Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Digestion breaks dietary proteins down into individual amino acids via proteases in the stomach and small intestine. These free amino acids enter the bloodstream as a generic pool; they carry no “cow identity.” Ribosomes then reassemble them into human proteins according to instructions from your own DNA.

Q3: Why are there only 20 natural amino acids?

The 20 amino acids cover the key physicochemical properties needed for protein function: hydrophobic, hydrophilic, charged, aromatic, small, large, flexible (glycine), rigid (proline), with minimal redundancy. The triplet genetic code (61 codons → 20 amino acids) is already near its error-tolerance optimum; adding more amino acids would mean sacrificing the codon redundancy that buffers against point mutations. And once the translation machinery (tRNA synthetases, ribosome) was locked in for 20, adding a 21st would require co-evolving multiple components simultaneously.

Q4: Can you make other non-natural amino acids? Design some new amino acids.

Yes. Non-natural amino acids (nnAAs) are routinely made and incorporated into proteins, either by chemical synthesis (solid-phase peptide synthesis) or by engineering orthogonal tRNA/synthetase pairs to read a stop codon as a nnAA (amber suppression). Examples: p-azidophenylalanine (adds a click-chemistry handle for bioconjugation), α-aminoisobutyric acid (forces helical folding, used in peptide drugs), and trifluoroleucine (fluorinated side chain that resists proteolytic degradation, useful for extending therapeutic half-life).

Q5: Where did amino acids come from before enzymes that make them, and before life started?

Amino acids form spontaneously under prebiotic conditions, no enzymes needed. Key sources: (1) Miller-Urey synthesis: electric discharges through early-Earth atmospheres produce amino acids (the original 1952 experiment yielded glycine, alanine, aspartic acid). (2) Meteorites: the Murchison meteorite contains 70+ amino acids, confirming extraterrestrial abiotic synthesis. (3) Hydrothermal vents: high-temperature reactions between CO₂, H₂, and NH₃. (4) Strecker synthesis: aldehydes + ammonia + HCN → amino acids, all plausible on the early Earth.

Q6: If you make an α-helix using D-amino acids, what handedness would you expect?

Left-handed. Natural L-amino acids form right-handed α-helices because of how the side chain sits relative to the backbone: right-handed dihedrals (φ ≈ −57°, ψ ≈ −47°) avoid steric clashes. D-amino acids are the mirror image at Cα, so the favoured angles flip (φ ≈ +57°, ψ ≈ +47°), producing a left-handed helix. The entire structure is just the mirror image of a normal α-helix.

Q7: Can you discover additional helices in proteins?

Yes. Beyond the common α-helix and 3₁₀-helix, other forms exist: the π-helix (wider, rarer, found as short insertions in ~15% of proteins), the polyproline II helix (left-handed, extended, common in collagen), and the collagen triple helix. New helical geometries could be accessed by using non-natural amino acids that explore regions of the Ramachandran plot forbidden to the 20 canonical ones.

Q8: Why are most molecular helices right-handed?

Because L-amino acids dominate biology. The Cα stereochemistry of L-amino acids means right-handed backbone dihedrals avoid steric clashes between the side chain and the preceding carbonyl oxygen; left-handed conformations are energetically penalised. The dominance of L-amino acids over D-amino acids is itself likely a frozen accident from early life: once one chirality was selected, everything downstream inherited the preference.

Q9: Why do β-sheets tend to aggregate?

β-sheets have “sticky edges”: their edge strands expose unsatisfied backbone NH and C=O groups that can hydrogen-bond with strands from other molecules, extending the sheet. Unlike α-helices (where all backbone H-bonds are satisfied internally), β-sheets are inherently open-ended. On top of that, sheets stack face-to-face via hydrophobic side chains. This combination of edge H-bonding and hydrophobic stacking drives formation of extended multi-layered aggregates, with amyloid fibrils being the extreme case.


Part B: Protein Analysis and Visualization — PDB 1EDG

1. Protein Description and Selection

Protein: Endoglucanase A (CelCCA), a bacterial cellulase enzyme
Organism: Ruminiclostridium cellulolyticum (formerly Clostridium cellulolyticum), strain H10
UniProt: P17901 (GUNA_RUMCH)
EC Number: 3.2.1.4

What it does: Endoglucanase A cleaves internal beta-1,4-glucosidic bonds in cellulose — the most abundant organic polymer on Earth. It’s one of three enzyme types needed to fully break down cellulose into glucose. The catalytic mechanism is a retaining double-displacement (Koshland mechanism), using two conserved glutamic acid residues: Glu170 (proton donor) and Glu307 (nucleophile), separated by 5.5 angstroms.

Why selected: Cellulases are central to biofuel production, textile processing, and the global carbon cycle. Understanding how nature designs enzymes to break down cellulose is directly relevant to protein engineering — we might redesign these enzymes for improved thermostability, altered substrate specificity, or integration into industrial biocatalytic workflows. This connects well to the HTGAA course themes of designing biological systems with practical applications.


2. Amino Acid Sequence

The crystallized catalytic domain is 380 residues (the full-length precursor is 475 residues including a signal peptide and C-terminal dockerin domain):

MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG
IKTTKQMIDAIKQKGFNTVRIPVSWHPHVSGSDYKISDVWMNRVQEVVNYCIDNKMYVIL
NTHHDVDKVKGYFPSSQYMASSKKYITSVWAQIAARFANYDEHLIFEGMNEPRLVGHANEW
WPELTNSDVVDSINCINQLNQDFVNTVRATGGKNASRYLMCPGYVASPDGATNDYFRMPND
ISGNNNKIIVSVHAYCPWNFAGLAMADGGTNAWNINDSKDQSEVTWFMDNIYNKYTSRGIP
VIIGECGAVDKNNLKTRVEYMSYYVAQAKARGILCILWDNNNFSGTGELFGFFDRRSCQFK
FPEIIDGMVKYAFGLIN

3. Sequence Length and Amino Acid Frequency

Length: 380 residues (catalytic domain)
Molecular weight: ~43.1 kDa

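| Amino Acid | Count | Frequency |
|---|---|---|
| N (Asn) | 41 | 10.8% |
| I (Ile) | 29 | 7.6% |
| G (Gly) | 28 | 7.4% |
| V (Val) | 26 | 6.8% |
| D (Asp) | 25 | 6.6% |
| S (Ser) | 24 | 6.3% |
| A (Ala) | 23 | 6.1% |
| K (Lys) | 21 | 5.5% |
| Y (Tyr) | 19 | 5.0% |
| L (Leu) | 18 | 4.7% |
| F (Phe) | 18 | 4.7% |
| T (Thr) | 18 | 4.7% |
| P (Pro) | 14 | 3.7% |
| M (Met) | 13 | 3.4% |
| R (Arg) | 13 | 3.4% |
| E (Glu) | 13 | 3.4% |
| Q (Gln) | 12 | 3.2% |
| W (Trp) | 11 | 2.9% |
| H (His) | 7 | 1.8% |
| C (Cys) | 7 | 1.8% |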

Most frequent: Asparagine (N), 41 residues (10.8%)
Least frequent: Cysteine (C) and Histidine (H), 7 each (1.8%)

Composition summary:

  • Hydrophobic residues (A,I,L,M,F,W,V,P): 152 (40.0%)
  • Hydrophilic residues (D,E,K,R,H,N,Q,S,T): 174 (45.8%)
  • Aromatic residues (F,W,Y): 48 (12.6%)

The unusually high asparagine content (10.8% vs ~4% average) likely reflects the enzyme’s need for hydrogen-bonding residues in the active site cleft to interact with cellulose substrate hydroxyl groups.
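
A composition table like the one above can be produced with a few lines of Python. This is a minimal sketch of one way to do it, not necessarily how the numbers on this page were generated; the residue groupings match the summary above, and the short demo sequence is made up.

```python
from collections import Counter

HYDROPHOBIC = set("AILMFWVP")
HYDROPHILIC = set("DEKRHNQST")
AROMATIC = set("FWY")


def composition(seq):
    """Return per-residue counts, percentage frequencies, and group totals."""
    seq = "".join(seq.split()).upper()  # drop line breaks and spaces
    counts = Counter(seq)
    freqs = {aa: 100 * n / len(seq) for aa, n in counts.items()}
    groups = {
        "hydrophobic": sum(counts[aa] for aa in HYDROPHOBIC),
        "hydrophilic": sum(counts[aa] for aa in HYDROPHILIC),
        "aromatic": sum(counts[aa] for aa in AROMATIC),
    }
    return counts, freqs, groups


# Made-up demo sequence; paste the real sequence here to reproduce the table.
counts, freqs, groups = composition("NNFW\nAD")
for aa, n in counts.most_common():
    print(f"{aa}: {n} ({freqs[aa]:.1f}%)")
print(groups)
```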


4. Sequence Homologs

Using UniProt reference clusters:

  • UniRef90: 7 members (>=90% identity) — close homologs within order Eubacteriales
  • UniRef50: 11 members (>=50% identity) — broader Eubacteriales homologs

Using RCSB PDB sequence similarity (>=30% identity, E-value < 0.001):

  • 50 polymer entity hits in the PDB with sequence similarity

The broader Glycosyl Hydrolase Family 5 contains over 57,000 sequences in UniProt (Swiss-Prot + TrEMBL), showing this is an extremely widespread enzyme family across bacteria, archaea, fungi, and plants.


5. Protein Family

CAZy family: Glycoside Hydrolase Family 5 (GH5), Clan GH-A
Pfam: PF00150 (Cellulase, glycosyl hydrolase family 5)
InterPro: IPR001547 (Glycoside hydrolase, family 5)

GH5 is one of the largest glycoside hydrolase families, containing enzymes with diverse activities including endoglucanase, beta-mannanase, exo-1,3-glucanase, xylanase, and endoglycoceramidase. All share the (alpha/beta)8 TIM barrel fold and retaining catalytic mechanism.

The full-length protein also contains a dockerin domain (Pfam PF00404, residues 409-474), which anchors it to the cellulosome — a large multi-enzyme complex on the bacterial cell surface that coordinates cellulose degradation.


6. RCSB Structure Page

PDB ID: 1EDG
Deposition date: July 7, 1995
Release date: August 17, 1996
Current version: 1.3 (last revised February 7, 2024)

Structure Quality

| Metric | Value | Assessment |
| --- | --- | --- |
| Resolution | 1.60 Å | Excellent (< 2.0 Å) |
| R-work | 0.191 | Good |
| R-free | 0.220 | Good |
| Ramachandran outliers | 0.26% | Very good |
| Clash score | 5.39 | Acceptable |
| Data completeness | 98.0% | Very good |

This is a high-quality structure. At 1.60 Å resolution, individual non-hydrogen atoms are well resolved in the electron density, although hydrogen atoms are generally not visible at this resolution. The R-factors and validation metrics are all within acceptable ranges.

Method: X-ray diffraction
Space group: P 2(1) 2(1) 2(1) (orthorhombic)
Data collected: November 1994 at EMBL/DESY Hamburg synchrotron

Primary citation: Ducros V et al. (1995) “Crystal structure of the catalytic domain of a bacterial cellulase belonging to family 5.” Structure 3:939-949. PMID: 8535787


7. Other Molecules in the Structure

Ligands/ions/cofactors: None. The structure was solved in the apo (unbound) form.

Water molecules: 375 solvent atoms resolved in the crystal structure.

Disulfide bonds: None (0)
Cis-peptides: 1

The absence of bound substrate or inhibitor means this structure represents the resting state of the enzyme. The active site cleft is empty and accessible.


8. Structure Classification Family

CATH classification:

  • 3.20.20.80 — Glycosidases
    • Class 3: Alpha-Beta proteins
    • Architecture 20: Alpha-Beta Barrel
    • Topology 20: TIM Barrel
    • Homologous Superfamily 80: Glycosidases

SCOP classification:

  • Fold: c.1 — TIM beta/alpha-barrel
  • Superfamily: c.1.8 — (Trans)glycosidases

The (alpha/beta)8 TIM barrel is the most common enzyme fold in nature, found in ~10% of all known enzyme structures. It consists of eight beta-strands alternating with eight alpha-helices: the strands form the inner barrel and the helices pack around it, with the active site located at the C-terminal ends of the beta-strands.


9. 3D Visualization (PyMOL)

All visualizations were generated using PyMOL (command-line mode). The PDB file was fetched directly from RCSB and solvent was removed before visualization.

9a. Cartoon, Ribbon, and Ball-and-Stick Representations

Panels: Cartoon | Ribbon | Ball and Stick

Rainbow coloring from N-terminus (blue) to C-terminus (red) shows the polypeptide chain threading through the TIM barrel architecture.

9b. Color by Secondary Structure

Color key: Red = alpha-helices | Yellow = beta-sheets | Green = loops/coils

Analysis: The protein has significantly more helical content than sheet content, which is characteristic of the TIM barrel fold:

  • 8 large alpha-helices (red) form the outer ring of the barrel
  • 8 smaller beta-strands (yellow) form the inner barrel core
  • Extensive loop regions (green) connect the secondary structure elements and form the active site cleft at the top of the barrel

The PyMOL coloring confirms: 1,294 atoms in helices vs 279 atoms in sheets — roughly a 4.6:1 ratio of helix to sheet content.

9c. Color by Residue Type

Color key: Orange = hydrophobic (A,V,L,I,M,F,W,P) | Green = polar uncharged (S,T,Y,N,Q,C,G) | Blue = positively charged (K,R,H) | Red = negatively charged (D,E)

Analysis:

  • Hydrophobic residues (orange) are concentrated in the protein interior, packed between the beta-strands and alpha-helices of the TIM barrel — this is the hydrophobic core that stabilizes the fold
  • Polar residues (green) are distributed throughout but are especially abundant on the surface and in the active site cleft
  • Charged residues (blue = positive, red = negative) are predominantly on the protein surface, as expected — they interact with solvent and contribute to protein solubility
  • The active site region shows a mix of polar and charged residues, consistent with the enzyme’s need to bind and hydrolyze the polar cellulose substrate

9d. Surface View — Binding Pockets

Analysis: The surface view clearly reveals a prominent elongated cleft running across one face of the protein. This is the substrate-binding groove where cellulose chains bind for hydrolysis. The cleft is:

  • Deep and elongated — characteristic of endoglucanases, which must accommodate a long polysaccharide chain
  • Lined with polar (green) and charged (red, blue) residues — providing hydrogen bonds and electrostatic interactions to grip the cellulose substrate
  • Flanked by aromatic residues — tryptophan and tyrosine side chains form stacking interactions with the sugar rings (a hallmark of carbohydrate-active enzymes)

The catalytic residues Glu170 and Glu307 sit at the base of this cleft, positioned to cleave the glycosidic bond via the retaining mechanism.

The transparent surface overlay shows how the TIM barrel architecture creates the active site cleft at the C-terminal face of the barrel.


PyMOL Commands Used

# Load and clean
fetch 1EDG, async=0
remove solvent

# Cartoon view (rainbow N→C)
hide all; show cartoon
spectrum count, rainbow
ray 1200, 900; png cartoon_view.png, dpi=150

# Ribbon view
hide all; show ribbon
spectrum count, rainbow
ray 1200, 900; png ribbon_view.png, dpi=150

# Ball and stick
hide all; show sticks; show spheres
set sphere_scale, 0.25; set stick_radius, 0.1
spectrum count, rainbow
ray 1200, 900; png ball_and_stick_view.png, dpi=150

# Secondary structure coloring
hide all; show cartoon
color red, ss h        # helices
color yellow, ss s     # sheets
color green, ss l+''   # loops
ray 1200, 900; png secondary_structure.png, dpi=150

# Residue type coloring
hide all; show cartoon
color orange, resn ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO  # hydrophobic
color green, resn SER+THR+TYR+ASN+GLN+CYS+GLY        # polar
color blue, resn LYS+ARG+HIS                          # positive
color red, resn ASP+GLU                                # negative
ray 1200, 900; png residue_type.png, dpi=150

# Surface view
hide all; show surface
# (same residue type coloring as above)
ray 1200, 900; png surface_view.png, dpi=150

# Surface + cartoon overlay
hide all; show cartoon; color gray80
show surface; set transparency, 0.7
ray 1200, 900; png surface_cartoon_overlay.png, dpi=150

Part C: Using ML-Based Protein Design Tools — PDB 1EDG

All experiments were run using the Amina CLI (v0.2.6), which provides cloud-hosted access to the same models used in the Colab notebook (ESM-1v, ESM2, ESMFold, ProteinMPNN). Our target protein is Endoglucanase A (1EDG) from R. cellulolyticum — a 380-residue cellulase with a TIM barrel fold.


C1. Protein Language Modeling

Deep Mutational Scan (ESM-1v)

Tool: amina run esm1v --mode dms with 5-model ensemble
What it does: For each of the 380 positions, the model masks that residue and scores all 20 possible amino acids, producing 7,600 mutation scores. More negative scores indicate mutations that would be more damaging to protein function.

Score statistics: Mean = -3.30, Std = 3.41, Range = [-15.4, +5.1]
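To make the scoring concrete, here is a minimal sketch of how a masked-position scan turns into a table of mutation scores. It assumes each score is a log-likelihood ratio, log p(mutant) minus log p(wild type), which is a common convention for ESM-1v but is not confirmed for the Amina implementation. The probability tables are invented stand-ins for model output, and `toy_probs` is a hypothetical helper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dms_scores(seq, probs):
    """Score every substitution as log p(mutant) - log p(wild type).

    `probs[i]` maps amino acid -> probability at position i, standing in
    for the masked-position output of a language model (an assumption here).
    """
    scores = {}
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            scores[(i + 1, wt, aa)] = math.log(probs[i][aa]) - math.log(probs[i][wt])
    return scores

def position_means(scores, length):
    """Average the 20 scores at each position (the wild type contributes 0)."""
    return [sum(s for (pos, _, _), s in scores.items() if pos == p) / 20
            for p in range(1, length + 1)]

def toy_probs(wt, p_wt):
    """Hypothetical distribution: p_wt on the wild type, the rest spread evenly."""
    rest = (1 - p_wt) / 19
    return {aa: (p_wt if aa == wt else rest) for aa in AMINO_ACIDS}

seq = "EG"  # toy 2-residue sequence
probs = [toy_probs("E", 0.95), toy_probs("G", 0.10)]
scores = dms_scores(seq, probs)
means = position_means(scores, len(seq))
print(len(scores), [round(m, 2) for m in means])
```

A position where the model is confident about the wild-type residue (the first one here) gets a strongly negative average, which is the pattern the conserved-position ranking relies on.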

Most Conserved Positions (hardest to mutate)

| Rank | Position | Residue | Avg Score | Significance |
| --- | --- | --- | --- | --- |
| 1 | 307 | Glu (E) | -11.91 | Catalytic nucleophile |
| 2 | 79 | Arg (R) | -11.19 | Structural role in barrel |
| 3 | 169 | Asn (N) | -10.83 | Adjacent to catalytic Glu170 |
| 4 | 254 | His (H) | -10.73 | Active site geometry |
| 5 | 170 | Glu (E) | -10.63 | Catalytic proton donor |
| 6 | 156 | Phe (F) | -10.21 | Aromatic stacking in substrate cleft |
| 7 | 149 | Trp (W) | -10.13 | Substrate binding platform |

Key observation: The two catalytic glutamic acids (Glu307 and Glu170) rank #1 and #5 respectively as the most conserved positions — the language model independently identifies the active site residues purely from evolutionary sequence patterns, without any structural information. The surrounding residues (Asn169, Phe156, Trp149) are also highly conserved, reflecting their roles in positioning the catalytic machinery and binding the cellulose substrate through aromatic stacking interactions.

Most Tolerant Positions (easiest to mutate)

| Position | Residue | Avg Score | Interpretation |
| --- | --- | --- | --- |
| 376 | Phe (F) | +1.94 | C-terminal, surface-exposed |
| 287 | Trp (W) | +1.60 | Solvent-exposed, non-functional |
| 266 | Met (M) | +1.40 | Surface loop region |
| 86 | Pro (P) | +1.31 | Flexible loop |

These tolerant positions are on the protein surface, far from the active site — consistent with the expectation that surface residues under less evolutionary constraint can accept substitutions more readily.

Latent Space Analysis (ESM2 Embeddings)

Tool: amina run esm2-embedding with ESM2-8M model
Dataset: 500 random sequences from the SCOP structural domain database (astral-scopedom-seqres-gd-sel-gs-bib-40-2.08) plus our 1EDG sequence (501 total)

Clustering results: HDBSCAN identified 7 clusters, with 1EDG assigned to Cluster 5 (the largest cluster, 262 members, probability = 0.60).

Analysis: The t-SNE plot shows that protein sequences form neighborhoods based on structural/functional similarity in the ESM2 embedding space. Our 1EDG (labeled “1EDG_CelCCA”) sits within the large central cluster, which likely groups alpha/beta barrel proteins — consistent with the TIM barrel being the most common enzyme fold in nature. The moderate cluster probability (0.60) suggests 1EDG sits near the boundary of this neighborhood, which makes sense given that GH5 cellulases, while sharing the TIM barrel fold, have distinct sequence features compared to other TIM barrel enzymes.


C2. Protein Folding (ESMFold)

Folding the Native Sequence

Tool: amina run esmfold on the native 1EDG sequence (380 residues)

Results:

  • Mean pLDDT: 0.93 (very high confidence)
  • Backbone RMSD vs crystal structure: 1.27 A

The pLDDT plot shows near-perfect confidence (>0.9) across most of the protein, with brief dips around:

  • Residues ~245-250: A flexible loop region connecting beta-strands in the barrel
  • Residues ~350-360: A surface-exposed turn
  • Residues ~375-380: The C-terminus (expected lower confidence at chain termini)

ESMFold vs Crystal Structure Overlay

Gray = crystal structure (1EDG, 1.60 A X-ray), Blue = ESMFold prediction

The predicted structure matches the crystal structure remarkably well (RMSD = 1.27 A). The overall TIM barrel topology, helix positions, and loop conformations are all correctly predicted. The largest deviations occur in the flexible loop regions identified by the lower pLDDT scores — these are genuinely flexible in the real protein and are often poorly resolved even in crystal structures.
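For reference, the RMSD quoted here is the root-mean-square distance between paired atoms. The sketch below assumes the two coordinate sets are already superposed and paired one-to-one; real tools first compute an optimal superposition (for example with the Kabsch algorithm), and I have not checked what the `simple-rmsd` tool does internally. The coordinates are made up.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z).

    Assumes the structures are already superposed and the atoms are paired
    one-to-one (for example, matching backbone atoms in the same order).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates in angstroms: the second set is shifted by 1 along x.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
mobile = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(rmsd(reference, mobile))
```

Because deviations are squared before averaging, a few badly placed loop residues can dominate the value even when the barrel core overlays closely.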

Is the structure resilient to mutations? The TIM barrel is one of the most robust protein folds in nature. Moderate mutations (especially in surface loops and non-core positions) would be expected to preserve the overall fold, while mutations to the hydrophobic core or conserved structural positions (identified by the DMS scan) would likely destabilize it.


C3. Protein Generation (ProteinMPNN + ESMFold)

Inverse Folding with ProteinMPNN

Tool: amina run proteinmpnn on the cleaned 1EDG crystal structure
Parameters: Chain A, 8 sequences, temperature = 0.1 (near-greedy sampling), vanilla model

Results:

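| Rank | Score | Sequence Recovery | Key Observation |
| --- | --- | --- | --- |
| 1 (best) | 0.677 | 55.3% | Best scoring design |
| 2 | 0.695 | 55.3% | Similar to rank 1 |
| 3 | 0.695 | 53.2% | |
| 4 | 0.704 | 53.2% | |
| 5 | 0.706 | 54.5% | |
| 6 | 0.708 | 51.3% | |
| 7 | 0.709 | 53.2% | |
| 8 (worst) | 0.721 | 48.9% | Most divergent |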

Mean sequence recovery: 53.1% — meaning ProteinMPNN independently recovered ~53% of the native amino acids purely from the backbone geometry. This is typical for a well-folded globular protein.

Predicted vs Original Sequence Comparison

Aligning the best ProteinMPNN design (rank 1) against the native sequence:

Native:  MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG...
Design:  AYDPSLIPDLNIPQKPIPDNEAMRFVKSLRLGWNLGNTFDANSGSNIKNKLDYETAKAGV...
Match:   .YD.SLIP.L.IPQK.IP.N..M.FVK.LRLGWNLGNTFDA..G.NI.N.LDYET...G
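A match line like this can be generated directly from the two sequences. The sketch below compares position by position with no gaps, which assumes the design has the same length and numbering as the native (true for fixed-backbone inverse folding). It uses only the first 59 residues shown above, and `match_line` is an illustrative name.

```python
def match_line(native, design):
    """Return the match string and the fraction of identical positions.

    Compares position by position with no gaps, so both sequences must
    share the same numbering.
    """
    n = min(len(native), len(design))
    line = "".join(a if a == b else "." for a, b in zip(native[:n], design[:n]))
    identical = sum(1 for c in line if c != ".")
    return line, identical / n

# The first 59 residues shown in the comparison above.
native = "MYDASLIPNLQIPQKNIPNNDGMNFVKGLRLGWNLGNTFDAFNGTNITNELDYETSWSG"
design = "AYDPSLIPDLNIPQKPIPDNEAMRFVKSLRLGWNLGNTFDANSGSNIKNKLDYETAKAG"
line, recovery = match_line(native, design)
print(line)
print(f"recovery over this window: {recovery:.1%}")
```

The window is short, so its recovery need not equal the 55.3% reported for the full 380-residue design.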

Patterns in the designed sequence:

  • Core residues are highly conserved: Hydrophobic core positions (Val, Leu, Ile, Phe) are almost always recovered, reflecting their structural importance
  • Active site residues recovered: The catalytic Glu residues and surrounding aromatic platforms are retained
  • Surface residues diverge most: Charged and polar surface residues (Lys, Asp, Glu, Asn) show the most substitutions — consistent with the DMS tolerance analysis
  • Glycines conserved: Gly residues in tight turns and barrel connections are strongly recovered, as they occupy positions requiring the unique conformational flexibility of glycine

Refolding the Designed Sequence with ESMFold

Tool: amina run esmfold on the best ProteinMPNN sequence (rank 1)

Results:

  • Mean pLDDT: 0.92 (very high — the designed sequence is predicted to fold confidently)
  • Backbone RMSD vs original crystal: 0.79 A

Gray = crystal structure (1EDG), Green = ESMFold prediction of ProteinMPNN-designed sequence

The ProteinMPNN-designed sequence (with only ~55% identity to the native) folds into essentially the same structure as the original protein (RMSD = 0.79 A). Remarkably, this RMSD is even lower than the ESMFold prediction of the native sequence (1.27 A), suggesting that ProteinMPNN designs may be optimized for a more “ideal” version of the backbone geometry.

Summary Table

| Comparison | RMSD (Å) | pLDDT |
| --- | --- | --- |
| ESMFold (native seq) vs Crystal | 1.27 | 0.93 |
| ESMFold (ProteinMPNN seq) vs Crystal | 0.79 | 0.92 |

Tools Used (Amina CLI)

All computations ran on AminoAnalytica’s cloud infrastructure via the Amina CLI:

# C1: Deep Mutational Scan (5-model ensemble, 7600 mutations scored)
amina run esm1v -m dms -s "<sequence>" -n 5 -j 1edg_dms -o ./C1_dms/

# C1: Latent Space Embeddings (501 sequences, ESM2-8M + t-SNE + HDBSCAN)
amina run esm2-embedding -f embedding_input.fasta -m 8M -j 1edg_embed -o ./C1_embeddings/

# C2: Structure Prediction (ESMFold)
amina run esmfold -s "<sequence>" -j 1edg_fold -o ./C2_esmfold/

# C2: RMSD Comparison
amina run simple-rmsd --reference crystal.pdb --mobile esmfold.pdb -o ./C2_esmfold/

# C3: Inverse Folding (ProteinMPNN, 8 sequences, T=0.1)
amina run proteinmpnn --pdb 1edg_cleaned.pdb --chains A -n 8 --temperature 0.1 -o ./C3_proteinmpnn/

# C3: Refold designed sequence
amina run esmfold -s "<mpnn_sequence>" -j mpnn_refold -o ./C3_refold/
amina run simple-rmsd --reference crystal.pdb --mobile mpnn_refold.pdb -o ./C3_refold/

Structural overlays were visualized in PyMOL.


Part D: Group Brainstorm on Bacteriophage Engineering

Primary: Enhance the lytic activity (toxicity) of the MS2 L protein.
Secondary: Improve protein stability of engineered variants.

Approach

We target the C-terminal functional domain of L — specifically residues in and around the conserved LS dipeptide motif identified by Chamakura et al. (2017) — while preserving membrane insertion and oligomerisation capacity, which Mezhyrova et al. (2023) showed is essential for the pore-forming lysis mechanism.

A key insight from the literature is that truncating L’s N-terminal domain removes its dependency on the host chaperone DnaJ and accelerates lysis by ~20 minutes. Rather than crude truncation, we use computational redesign to engineer a shorter, less basic Domain 1 that bypasses the DnaJ regulatory brake while retaining a full-length, well-folded protein.

Proposed Pipeline

L sequence
  │
  ▼
ESMFold ──────────────► Baseline monomer structure
  │
  ▼
ESM2 (log-likelihood) ► Score all single-point mutants
  │                      (avoid LS motif & TM core)
  ▼
ProteinMPNN ──────────► Redesign N-terminal domain
  │                      (shorter, less basic, DnaJ-independent)
  ▼
AlphaFold-Multimer ───► Validate oligomeric assembly
  │                      + check L–DnaJ complex disruption
  ▼
FoldSeek ─────────────► Confirm no convergence on
  │                      known problematic folds
  ▼
Ranked candidate list

Tools & Justification

| Tool | Role | Why It Helps |
| --- | --- | --- |
| ESMFold | Predict baseline L monomer structure | Fast single-chain prediction; no experimental structure exists for L |
| ESM2 | Score mutant fitness via log-likelihood ratios | Captures evolutionary fitness signals even for short, poorly characterised proteins |
| ProteinMPNN | Redesign the dispensable N-terminal domain | Generates sequences respecting backbone geometry of the functional C-terminal half |
| AlphaFold-Multimer | Predict L homo-oligomers and L–DnaJ complex | Validates that redesigned variants still form oligomeric assemblies needed for membrane disruption |
| FoldSeek | Structure-based similarity search on final designs | Sanity check that designs don’t converge on known toxic or off-target folds |
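The "score all single-point mutants" step starts from an enumeration that skips protected positions. Here is a minimal sketch of that enumeration. The peptide and the protected positions are toy placeholders, not the real LS motif or transmembrane core coordinates, and `single_point_mutants` is a hypothetical helper rather than part of any tool in the table.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_point_mutants(seq, protected):
    """Yield (label, mutant_sequence) for every substitution outside `protected`.

    `protected` is a set of 1-based positions to leave untouched.
    """
    for i, wt in enumerate(seq):
        pos = i + 1
        if pos in protected:
            continue
        for aa in AMINO_ACIDS:
            if aa != wt:
                yield f"{wt}{pos}{aa}", seq[:i] + aa + seq[i + 1:]

# Toy 6-residue peptide; positions 3-4 stand in for a motif to avoid.
# (Hypothetical positions, not the real LS motif or TM core coordinates.)
toy = "MKLSAV"
mutants = dict(single_point_mutants(toy, protected={3, 4}))
print(len(mutants), list(mutants)[:3])
```

Each mutant sequence would then be passed to the scoring model, so the protected set directly controls how many evaluations are needed: 19 per unprotected position.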

Potential Pitfalls

  1. No solved structure for L. It is a small (75 aa) membrane-associated peptide — all structural predictions are low-confidence and may not capture the membrane-inserted conformation accurately. Nanodisc or membrane-mimetic modelling is beyond the scope of these tools.

  2. Unknown lytic target. The actual molecular target of L in the bacterial cell envelope has never been identified. We are optimising against proxies (oligomerisation propensity, DnaJ interaction, stability) rather than the true mechanism of action. A design that scores well computationally may not translate to improved lysis in vivo.

Week 5 HW: Protein Design Part II

Part 1: Generate Binders with PepMLM

Sequence Preparation

Retrieved WT SOD1 from UniProt P00441 and introduced A4V (position 5 in full sequence, position 4 in mature protein after Met cleavage):

WT:  M A T K A V C V L K ...
A4V: M A T K V V C V L K ...
               ^
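The numbering offset is easy to get wrong, so it helps to apply the mutation in code and check the wild-type residue at the same time. This is a minimal sketch; `apply_mutation` and its `offset` parameter are illustrative assumptions, not part of PepMLM or UniProt tooling.

```python
def apply_mutation(seq, mutation, offset=0):
    """Apply a point mutation such as "A4V" to `seq`.

    `offset` is added to the mutation's position to reach the 1-based index
    in `seq`; use offset=1 when the mutation uses mature-protein numbering
    and `seq` still carries the initiator Met.
    """
    wt, pos, new = mutation[0], int(mutation[1:-1]) + offset, mutation[-1]
    if seq[pos - 1] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + new + seq[pos:]

wt_start = "MATKAVCVLK"  # first 10 residues, as shown above
print(apply_mutation(wt_start, "A4V", offset=1))
```

Calling it without the offset raises an error, because position 4 of the full sequence is Lys rather than Ala.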

PepMLM Results

Model: PepMLM-650M (ESM-2 based, Colab, T4 GPU) | Parameters: length = 12, num_binders = 4, top_k = 3

| Rank | Peptide | Pseudo-Perplexity |
| --- | --- | --- |
| 1 | WRYPAVAAEHWK | 12.41 |
| 2 | WRYPAVAAEHWE | 13.54 |
| 3 | WRYYPAGVAWKE | 14.19 |
| 4 | WLYPVAVLRWKX | 17.70 |
| ref | FLYRWLPSRRGG | n/a (known SOD1 binder) |

3/4 designs share a WRY N-terminal motif (aromatic + cationic). The best peptide (PPL=12.41) and the known binder both feature aromatic-rich N-termini and positively charged residues, suggesting electrostatic complementarity with SOD1’s surface.
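Pseudo-perplexity is the exponential of the negative mean log-probability the model assigns to each true residue when that residue is masked. The sketch below shows the arithmetic only; the per-residue probabilities are invented, and in practice they would come from the masked language model.

```python
import math

def pseudo_perplexity(log_probs):
    """exp of the negative mean log-probability of the true residue at each
    position, where each position is scored with that residue masked."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-residue probabilities for a 4-residue peptide.
probs = [0.10, 0.05, 0.20, 0.08]
ppl = pseudo_perplexity([math.log(p) for p in probs])
print(round(ppl, 2))
```

A model that guessed uniformly over 20 amino acids would score exactly 20, so the values of 12 to 18 in the table sit only modestly below that baseline.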


Part 2: Evaluate Binders with Boltz-2

Method

Each peptide was modeled as a complex with SOD1-A4V using Boltz-2 (via Amina CLI) — a diffusion-based structure prediction model that produces ipTM scores comparable to AlphaFold-Multimer. SOD1-A4V was chain A, peptide was chain B.

Note: Peptide 4 contained an ambiguous X residue; this was replaced with Ala for structure prediction.

Results

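| Peptide | Pseudo-Perplexity | ipTM | pLDDT | Confidence | pDockQ2 |
| --- | --- | --- | --- | --- | --- |
| WRYPAVAAEHWK | 12.41 | 0.894 | 93.6 | 0.928 | 0.458 |
| WLYPVAVLRWKA | 17.70 | 0.769 | 93.7 | 0.904 | 0.308 |
| WRYYPAGVAWKE | 14.19 | 0.753 | 91.3 | 0.881 | 0.195 |
| FLYRWLPSRRGG (known) | n/a | 0.675 | 92.8 | 0.877 | 0.095 |
| WRYPAVAAEHWE | 13.54 | 0.530 | 93.1 | 0.851 | 0.042 |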

Analysis

The top PepMLM design (WRYPAVAAEHWK) substantially outperforms the known binder, with ipTM = 0.894 vs 0.675 — well above the 0.8 threshold generally considered confident for protein-protein interactions. Three of four PepMLM peptides exceed the known binder’s ipTM.

The best-scoring peptide also had the lowest PepMLM perplexity (12.41), so the language model and the structure predictor agree on the top candidate. Across the full set the two rankings do not track each other, though: peptide 4 has the highest perplexity (17.70) but the second-highest ipTM (0.769). The sharpest disagreement is peptide 2 (WRYPAVAAEHWE), which differs from peptide 1 by a single residue (K→E at position 12) but drops dramatically in ipTM (0.894 → 0.530), suggesting this C-terminal charge is critical for the binding interaction.
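How closely the two metrics track each other can be quantified with a rank correlation. This is a minimal sketch using the four PepMLM designs from the results table; the Spearman formula used here assumes no tied values, and with only four points the result is indicative at best.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Values from the results table, in PepMLM rank order (peptides 1 to 4).
perplexity = [12.41, 13.54, 14.19, 17.70]
iptm = [0.894, 0.530, 0.753, 0.769]
print(round(spearman(perplexity, iptm), 2))
```

A strongly negative value would mean that lower perplexity goes with higher ipTM; a value near zero means the two rankings are close to unrelated for this set.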

Predicted structures are available as PDB files in C2_boltz2/ for visualization.


Part 3: Evaluate Therapeutic Properties with PeptiVerse

Method

All peptides (4 PepMLM designs + known binder) were submitted to PeptiVerse with SOD1-A4V as the target. Properties evaluated: binding affinity (pKd), solubility, hemolysis, molecular weight, net charge (pH 7).

Results

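| Peptide | ipTM (Boltz-2) | Binding Affinity (pKd) | Solubility | Hemolysis (prob) | Net Charge (pH 7) | MW (Da) |
| --- | --- | --- | --- | --- | --- | --- |
| WRYPAVAAEHWK | 0.894 | 5.44 | Soluble | 0.023 | +0.85 | 1513.7 |
| WRYPAVAAEHWE | 0.530 | 5.58 | Soluble | 0.035 | −1.14 | 1514.6 |
| WRYYPAGVAWKE | 0.753 | 5.95 | Soluble | 0.022 | +0.77 | 1525.7 |
| WLYPVAVLRWKA | 0.769 | 6.48 | Soluble | 0.043 | +1.76 | 1501.8 |
| FLYRWLPSRRGG (known) | 0.675 | 5.97 | Soluble | 0.047 | +2.76 | 1507.7 |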

Analysis

All peptides are predicted soluble and non-hemolytic (all < 5% probability), which is encouraging for therapeutic development. Binding affinities are all in the “weak binding” range (pKd 5.4–6.5), which is typical for short linear peptides.
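To put the pKd range in more familiar units, pKd is the negative base-10 logarithm of the dissociation constant in mol/L. The short sketch below converts the two extremes of the table; `pkd_to_kd_molar` is an illustrative helper name.

```python
def pkd_to_kd_molar(pkd):
    """Kd in mol/L from pKd = -log10(Kd)."""
    return 10 ** (-pkd)

for pkd in (5.44, 6.48):
    kd_micromolar = pkd_to_kd_molar(pkd) * 1e6
    print(f"pKd {pkd} -> Kd ~ {kd_micromolar:.2f} uM")
```

The predicted affinities therefore span roughly one order of magnitude, from low micromolar to sub-micromolar.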

Interestingly, ipTM and predicted affinity do not fully agree. The best structural binder (WRYPAVAAEHWK, ipTM = 0.894) has the lowest predicted affinity (pKd = 5.44), while peptide 4 (WLYPVAVLRWKA) has the highest affinity (pKd = 6.48) but only moderate ipTM (0.769). This reflects the fact that PeptiVerse predicts affinity from sequence alone, while Boltz-2 evaluates structural complementarity — the two metrics capture different aspects of binding.

Recommended peptide to advance: WRYPAVAAEHWK. It has the strongest structural evidence for binding (ipTM = 0.894, well above the 0.8 confidence threshold), excellent safety profile (non-hemolytic, soluble, near-neutral charge), and the lowest PepMLM perplexity (12.41). While its sequence-predicted affinity is modest, the high ipTM and pDockQ2 (0.458) suggest a well-defined binding interface that may translate better to experimental validation than affinity predictions alone.


Part 4: Generate Optimized Peptides with moPPIt

Method

moPPIt-v3 uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific target residues while optimizing multiple therapeutic properties simultaneously. Unlike PepMLM (which samples plausible binders from sequence alone), moPPIt lets you specify where to bind and what properties to optimize.

Parameters: Target = SOD1-A4V, length = 12, motif positions = 1–10 (N-terminal region near A4V), objectives = Hemolysis, Solubility, Motif (weight = 1.0 each), GPU = L4

Results

| Peptide | Hemolysis Score | Solubility Score | Motif Score |
| --- | --- | --- | --- |
| DTKVKCGGNTQW | 0.968 | 0.833 | 0.803 |
| GCFEKTTGKTQD | 0.971 | 0.917 | 0.769 |
| KTGGKTQKITWH | 0.962 | 0.833 | 0.757 |
| TDTIRYKRQADE | 0.974 | 0.833 | 0.664 |

(Scores closer to 1.0 = better for hemolysis/solubility; motif score reflects binding to target residues 1–10.)
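With three objectives there is no single best peptide, so one way to read the table is to ask which candidates are not beaten on every score by another candidate (the Pareto front). This is a minimal sketch of that check, not the selection procedure moPPIt uses internally; it treats higher as better for all three scores, as stated in the note above.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return names of candidates not dominated by any other candidate."""
    return [name for name, s in candidates.items()
            if not any(dominates(t, s)
                       for other, t in candidates.items() if other != name)]

# (hemolysis score, solubility score, motif score) from the table above.
peptides = {
    "DTKVKCGGNTQW": (0.968, 0.833, 0.803),
    "GCFEKTTGKTQD": (0.971, 0.917, 0.769),
    "KTGGKTQKITWH": (0.962, 0.833, 0.757),
    "TDTIRYKRQADE": (0.974, 0.833, 0.664),
}
print(pareto_front(peptides))
```

Any peptide missing from the printed list is matched or beaten on all three scores by at least one other design, so it can be dropped without losing a trade-off option.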

PepMLM vs moPPIt Comparison

| Property | PepMLM Peptides | moPPIt Peptides |
| --- | --- | --- |
| Design strategy | Sequence-conditioned sampling | Multi-objective guided flow matching |
| Composition | Aromatic-rich (WRY motifs), hydrophobic | Charged/polar (K, T, D, E), hydrophilic |
| Target awareness | Whole protein (implicit) | Specific residues 1–10 (explicit) |
| Therapeutic optimization | None (binding only) | Hemolysis, solubility optimized jointly |

The moPPIt peptides are strikingly different in character — dominated by polar and charged residues (Lys, Thr, Asp, Glu, Gly) rather than the aromatic-heavy PepMLM designs. This likely reflects the multi-objective optimization steering away from hydrophobic residues (which can cause hemolysis and poor solubility) toward safer, more soluble compositions.

How to Evaluate Before Clinical Advancement

Before advancing any peptide toward clinical studies, one would need to:

  • Binding validation: Surface plasmon resonance (SPR) or isothermal titration calorimetry (ITC) to measure actual binding affinity
  • Structural confirmation: Cryo-EM or X-ray crystallography of the peptide-SOD1 complex
  • Selectivity testing: Ensure binding to mutant A4V-SOD1 over wild-type (to avoid disrupting normal SOD1 function)
  • Cell-based assays: Test whether peptide-E3 ligase fusions can degrade mutant SOD1 in cellular models
  • Stability and pharmacokinetics: Serum stability, half-life, and cell permeability measurements
  • In vivo efficacy: Animal models of SOD1-ALS (e.g., SOD1-G93A transgenic mice)


Part C: Phage Lysis Protein Design Challenge

Background

The MS2 phage L-protein (75 residues) lyses E. coli by forming membrane pores. E. coli can resist by mutating its DnaJ chaperone, preventing L-protein folding. Our goal: design L-protein mutants that reduce DnaJ dependence while maintaining lysis activity.

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT
|_____________ Soluble domain (1-39) ____________|____ Transmembrane (40-75) ____|
         (interacts with DnaJ)                        (forms membrane pores)

Approach: Option 2 — Boltz-2 Co-folding + DMS Analysis

Step 1: Deep Mutational Scan (ESM-1v)

Tool: amina run esm1v -m dms with 5-model ensemble (1,500 mutations scored across 75 positions)

DMS vs Experimental Data Correlation

Cross-referencing the DMS scores with published experimental mutation data (Chamakura et al., 2017):

Experimentally validated lysis-positive mutations in the soluble domain:

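| Position | Change | Lysis | Protein | DMS Score | DMS Assessment |
| --- | --- | --- | --- | --- | --- |
| 13 | P→L | 1 | 1 | −0.37 | Tolerated |
| 15 | S→A | 1 | 1 | +0.01 | Favorable |
| 18 | R→G | 1 | 1 | −0.92 | Tolerated |
| 18 | R→I | 1 | 1 | −1.17 | Tolerated |
| 23 | K→E | 1 | 0 | +0.58 | Favorable |
| 25 | E→G | 1 | 0 | +0.23 | Favorable |
| 25 | E→V | 1 | 0 | +0.04 | Favorable |
| 25 | E→D | 1 | 0 | +0.15 | Favorable |
| 26 | D→G | 1 | 0 | +0.14 | Favorable |
| 30 | R→Q | 1 | 1 | −0.46 | Tolerated |
| 30 | R→L | 1 | 1 | −0.35 | Tolerated |
| 31 | R→I | 1 | 1 | −1.23 | Tolerated |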

Critical discrepancy at position 29 (Cys): The DMS model scores C29 as the most mutable position (avg = +2.45), yet experimentally C29R kills both lysis and protein expression. The language model misses this. Because Cys29 is the only cysteine in L, an intramolecular disulfide is not possible; the residue may instead take part in an intermolecular disulfide or act as a folding nucleation site. This highlights the importance of cross-validating computational predictions with experimental data.

Most tolerant positions (DMS): 29 (but experimentally essential!), 27, 24, 5, 23, 22, 39, 25
Most conserved positions (DMS): 1 (Met, start codon), 38, 31, 19, 30, 18, 20, 11

Step 2: Boltz-2 Co-fold (L-protein + DnaJ)

Tool: amina run boltz2 with L-protein (chain A, 75 aa) + DnaJ (chain B, 376 aa)

Results: ipTM = 0.165, pLDDT = 71.5, pDockQ2 = 0.009

The very low ipTM confirms what the assignment warned — folding models struggle with this system. The L-protein has a disordered soluble domain and a transmembrane region, neither of which fold well in isolation. However, the low-confidence prediction still places the soluble domain (residues 1–39) in proximity to DnaJ’s N-terminal J-domain, which is consistent with the known chaperone-substrate interaction mode.

Step 3: Designed Mutations

Based on combining: (1) DMS-favorable scores, (2) experimentally validated lysis-positive mutations, and (3) avoidance of conserved/essential positions, we propose 5 L-protein variants:

| Variant | Mutations | Region | Rationale |
| --- | --- | --- | --- |
| V1 | K23E + E25G + D26G | Soluble | All three individually maintain lysis. Remodels the charge landscape of positions 23–26: the Lys23 positive charge becomes negative and the two acidic residues at 25–26 become flexible glycines, so the net charge is unchanged but redistributed. All DMS-favorable. |
| V2 | R18G + R20L + K23E | Soluble | Removes three positive charges from the Arg/Lys-rich stretch (18–23), which is the predicted DnaJ interaction surface. R18G and K23E are lysis-positive in the table above; R20L does not appear in that table. May reduce DnaJ dependence by altering the chaperone recognition motif. |
| V3 | S15A + E25G + R30Q | Soluble | Combines the most conservative experimentally validated mutations across different subregions. S15A and R30Q both maintain lysis AND protein levels, minimizing risk. |
| V4 | P13L + R18I + E25D | Soluble | P13L disrupts a potential turn structure; R18I replaces a charged Arg with hydrophobic Ile; E25D is a conservative acidic→acidic swap. All lysis-positive experimentally. |
| V5 | R19S + K23E + D26G | Soluble | Targets the cationic cluster (R19, K23) that likely mediates DnaJ binding. Replacing Arg/Lys with neutral/acidic residues may enable DnaJ-independent folding while maintaining the downstream lysis machinery. R19S does not appear in the table above. |

Design principles:

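  • Individual mutations are taken from the experimentally lysis-positive set in the table above wherever possible; the exceptions are R20L (V2) and R19S (V5), which are not listed there, so the data shown here do not cover their effect on lysis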
  • Mutations target the Arg/Lys-rich region (positions 18–26) that likely mediates DnaJ recognition
  • No mutations at position 29 (Cys — essential despite DMS scores) or position 1 (Met — start codon)
  • Each variant has ≥3 mutations for meaningful charge/surface remodeling
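Since the designs are about charge remodeling, it is worth tallying the net charge of the soluble domain for each variant. The sketch below uses a deliberately crude model (+1 per Lys/Arg, −1 per Asp/Glu, with His and the termini ignored), which is an assumption rather than a pKa-based calculation. The sequence and mutations are taken from the sections above.

```python
# MS2 L-protein sequence as given in the Background section.
L_PROTEIN = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
SOLUBLE_END = 39  # soluble domain = residues 1-39

VARIANTS = {
    "V1": ["K23E", "E25G", "D26G"],
    "V2": ["R18G", "R20L", "K23E"],
    "V3": ["S15A", "E25G", "R30Q"],
    "V4": ["P13L", "R18I", "E25D"],
    "V5": ["R19S", "K23E", "D26G"],
}

def mutate(seq, mutations):
    """Apply mutations like "K23E" (1-based), checking each wild-type residue."""
    residues = list(seq)
    for m in mutations:
        wt, pos, new = m[0], int(m[1:-1]), m[-1]
        if residues[pos - 1] != wt:
            raise ValueError(f"{m}: found {residues[pos - 1]} at position {pos}")
        residues[pos - 1] = new
    return "".join(residues)

def net_charge(seq):
    """Crude net charge: +1 per Lys/Arg, -1 per Asp/Glu; His and termini ignored."""
    return sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")

charges = {"WT": net_charge(L_PROTEIN[:SOLUBLE_END])}
for name, muts in VARIANTS.items():
    charges[name] = net_charge(mutate(L_PROTEIN, muts)[:SOLUBLE_END])
print(charges)
```

The wild-type check inside `mutate` also guards against numbering mistakes when transcribing mutations from the literature.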

Tools Used

| Step | Tool | Details |
| --- | --- | --- |
| Deep mutational scan | amina run esm1v -m dms | 5-model ensemble, 1500 mutations |
| Co-fold L-protein + DnaJ | amina run boltz2 | 2-chain complex, diffusion model |
| Experimental cross-validation | Published data | Chamakura et al., 2017 |

Week 6 HW: Genetic Circuits Part I

Assignment: DNA Assembly

1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?

  • Phusion DNA Polymerase: high-fidelity polymerase with 3’→5’ proofreading activity, so it corrects errors during extension. Much lower error rate than standard Taq.
  • dNTPs (dATP, dTTP, dCTP, dGTP): the nucleotide building blocks the polymerase adds to the growing strand.
  • MgCl₂: magnesium ions are an essential cofactor for polymerase function. Concentration affects stringency.
  • Reaction buffer: maintains optimal pH and salt conditions across the three PCR temperature steps.

The user adds template DNA and two primers (forward and reverse) to complete the reaction.

2. What are some factors that determine primer annealing temperature during PCR?

  • Primer length: longer primers form more hydrogen bonds with the template, raising Tm. Standard range is 18-22 bases. Too short risks nonspecific binding.
  • GC content: G-C pairs have 3 hydrogen bonds vs A-T’s 2, so GC-rich primers bind tighter and have higher Tm. Aim for 40-60%.
  • Mismatches: if the primer doesn’t perfectly match the template (e.g. when introducing a deliberate mutation), binding is weaker and you may need to lower annealing temperature or lengthen the primer.
  • Salt concentration: cations stabilize the primer-template duplex. Higher salt raises effective annealing temperature.
  • Primer pair matching: both primers should have similar Tm values (52-58°C range). Mismatched Tm between primers makes the reaction inefficient.
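The length and GC-content effects can be seen with a back-of-the-envelope estimate. The sketch below uses the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a rough guide for short oligos and ignores salt and nearest-neighbour effects; a vendor Tm calculator should be used for real primer design. The primer is made up.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Wallace rule of thumb: Tm = 2*(A+T) + 4*(G+C) in degrees C.

    A rough estimate for short oligos; it ignores salt, primer
    concentration and nearest-neighbour effects.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

primer = "ATGCATGCATGCATGCATGC"  # made-up 20-mer
print(f"GC = {gc_content(primer):.0%}, Tm ~ {wallace_tm(primer)} C")
```

Swapping an A/T for a G/C adds 2 °C in this model, and each extra base adds 2 to 4 °C, which mirrors the first two bullets above.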

3. Compare and contrast PCR and restriction enzyme digests for creating linear DNA fragments.

PCR uses synthetic primers to define fragment boundaries, then amplifies the region between them. Full control over where you cut. Can also introduce mutations via primer mismatches and add overlaps for Gibson Assembly. Requires a thermocycler and careful primer design.

Restriction enzyme digests use enzymes that recognize and cut specific short sequences (4-8 bp). Simpler, cheaper, faster, but you’re limited to wherever those recognition sites naturally occur. If a site appears in an unwanted location, you get extra fragments.

Use restriction digests when convenient cut sites already exist at the right positions. Simple and cheap for basic two-part cloning.

Use PCR when you need custom fragment boundaries, want to introduce mutations, or need overlapping fragments for Gibson Assembly.

Key output difference: restriction digests give sticky or blunt ends defined by the enzyme. PCR with a proofreading polymerase such as Phusion gives blunt ends by default (Taq instead leaves single 3’ A overhangs), but primers can be designed with 5’ tails to add any desired overlap.
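The "limited to wherever the sites occur" point is easy to see in a small virtual digest. This sketch cuts a linear top strand at every EcoRI site (G^AATTC) and reports fragment lengths; it ignores the bottom strand and overhang structure, and the 30 bp sequence is made up.

```python
def digest(seq, site, cut_offset):
    """Cut a linear top strand at every occurrence of `site`.

    `cut_offset` is the number of bases of the site left on the upstream
    fragment (1 for EcoRI, which cuts G^AATTC). Returns fragment lengths.
    """
    seq = seq.upper()
    cuts, start = [], 0
    while (i := seq.find(site, start)) != -1:
        cuts.append(i + cut_offset)
        start = i + 1
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# Made-up 30 bp sequence with two EcoRI sites.
toy = "AAAA" + "GAATTC" + "C" * 8 + "GAATTC" + "T" * 6
print(digest(toy, "GAATTC", 1))
```

The number and sizes of the fragments are fixed entirely by where the sites fall, whereas with PCR the primers set the boundaries.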

4. How can you ensure that the DNA sequences you have digested and PCR’d will be appropriate for Gibson cloning?

Gibson Assembly requires adjacent fragments to share 20-40 bp overlapping sequences so the exonuclease can expose complementary single-stranded regions that anneal.

  • Design overlaps into PCR primers: each primer has a 3’ region (~20 bp) binding the template, plus a 5’ tail (~20+ bp) matching the adjacent fragment.
  • Verify fragment sizes on a gel before assembling. Wrong band size = failed reaction.
  • Purify fragments: PCR reactions contain enzymes, buffers, and dNTPs that interfere with Gibson. Use column purification to get clean DNA first.
  • Check for unintended internal homology that could cause fragments to misassemble at the wrong junctions.
  • Note: restriction digest fragments don’t inherently have overlaps, so they’d need overlaps added via PCR before Gibson, or should be assembled using ligation instead.
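The primer-design rule in the first bullet can be sketched as follows. This is a minimal illustration, assuming 20 nt for both the annealing region and the overlap tail. The function names and sequences are hypothetical, and a real design would also check Tm, secondary structure, and internal homology.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gibson_primers(upstream: str, insert: str, downstream: str,
                   anneal_len: int = 20, overlap_len: int = 20):
    """Primers that amplify `insert` and add overlaps to both neighbours.

    Each primer is a 5' tail (copied from the neighbouring fragment)
    followed by a 3' region that anneals to the insert.
    """
    fwd = upstream[-overlap_len:] + insert[:anneal_len]
    # Written 5'->3' on the bottom strand: tail first, annealing region last.
    rev = revcomp(insert[-anneal_len:] + downstream[:overlap_len])
    return fwd.upper(), rev


def tailed_amplicon(fwd: str, rev: str, insert: str, anneal_len: int = 20) -> str:
    """Top strand of the PCR product, including both added tails."""
    return fwd[:-anneal_len] + insert.upper() + revcomp(rev)[anneal_len:]


def shares_overlap(left: str, right: str, min_len: int = 20) -> bool:
    """True if the last min_len bases of `left` equal the first of `right`."""
    return left[-min_len:].upper() == right[:min_len].upper()


# Hypothetical fragments, for illustration only.
upstream = "GATTACA" * 5
insert = "ATGGCTAGCAAGGAGAAGAACTGTTCACCGGTTAA"
downstream = "CCGGTTAACC" * 4

fwd, rev = gibson_primers(upstream, insert, downstream)
product = tailed_amplicon(fwd, rev, insert)
print(shares_overlap(upstream, product), shares_overlap(product, downstream))
```

Running `shares_overlap` on every planned junction before ordering primers is a cheap way to confirm each pair of neighbours really has the homology Gibson needs.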

5. How does the plasmid DNA enter the E. coli cells during transformation?

Heat shock: Cells are made competent with CaCl₂ on ice (Ca²⁺ neutralizes negative charges on DNA and membrane). Then a sudden 42°C heat shock for ~30 seconds is thought to create transient pores in the membrane, allowing DNA to enter. Cells go back on ice to reseal, then recover in rich media (no antibiotics) for ~1 hour so they can start expressing the plasmid’s resistance gene.

Electroporation: An electrical pulse creates temporary membrane pores. More efficient than heat shock but needs specialized equipment.

Both methods are inefficient (most cells die or don’t take up the plasmid), but antibiotic selection solves this. The plasmid carries a resistance gene, and cells are plated on agar with that antibiotic. Only cells with the plasmid survive and form colonies.

6. Describe another assembly method in detail: BioBrick Assembly

BioBrick is the iGEM standard. Its power is universal part compatibility rather than chemical cleverness.

The standard: every BioBrick part has the same flanking restriction sites. Left end (prefix): EcoRI + NotI + XbaI. Right end (suffix): SpeI + NotI + PstI. Every part in the iGEM registry uses this same structure.

Joining two parts: Cut Part A with EcoRI + SpeI. Cut Part B with XbaI + PstI (XbaI and SpeI produce compatible sticky ends). Ligate. The SpeI-XbaI junction forms an 8 bp “scar” that can’t be re-cut by either enzyme. The combined AB part still has the same prefix/suffix on its outer ends, so it can be joined to Part C using the exact same process. Infinitely chainable.
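The joining step can be modelled as string operations on the top strand. The sketch below uses what I understand to be the RFC 10 prefix and suffix sequences, so check them against the registry before relying on them. The function names and toy parts are hypothetical, and the backbone and bottom strand are ignored.

```python
# RFC 10 prefix and suffix (top strand, 5'->3'), as assumed for this sketch.
PREFIX = "GAATTCGCGGCCGCTTCTAGAG"  # EcoRI, NotI, XbaI
SUFFIX = "TACTAGTAGCGGCCGCTGCAG"   # SpeI, NotI, PstI

SPEI = "ACTAGT"  # cuts A^CTAGT
XBAI = "TCTAGA"  # cuts T^CTAGA
FORBIDDEN = {"EcoRI": "GAATTC", "XbaI": XBAI, "SpeI": SPEI,
             "PstI": "CTGCAG", "NotI": "GCGGCCGC"}


def as_biobrick(part: str) -> str:
    """Wrap a part in prefix and suffix, rejecting internal standard sites."""
    part = part.upper()
    clashes = [name for name, site in FORBIDDEN.items() if site in part]
    if clashes:
        raise ValueError(f"part contains forbidden site(s): {clashes}")
    return PREFIX + part + SUFFIX


def join(upstream: str, downstream: str) -> str:
    """Ligate the SpeI-cut end of `upstream` to the XbaI-cut end of `downstream`."""
    left = upstream[: upstream.rfind(SPEI) + 1]        # keep up to A of A^CTAGT
    right = downstream[downstream.find(XBAI) + 1:]     # keep from C of T^CTAGA
    return left + right


part_a = as_biobrick("ATGAAACCC")  # toy parts, not real registry entries
part_b = as_biobrick("GGGTTTTAA")
ab = join(part_a, part_b)

print(ab[len(PREFIX) + 9 : len(PREFIX) + 17])    # TACTAGAG (the 8 bp scar)
print(ab.count(SPEI), ab.count(XBAI))            # 1 1
```

The scar contains the hybrid site ACTAGA, which matches neither SpeI (ACTAGT) nor XbaI (TCTAGA). The composite still carries exactly one of each site in its outer prefix and suffix, so `join(ab, part_c)` works the same way.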

Advantages: any part works with any other part, thousands available in the iGEM registry, simple protocol, easy to teach.

Disadvantages: one part at a time (each addition = cut, ligate, transform, select, verify), the scar can disrupt reading frames, restriction sites can’t appear inside your parts. Gibson and Golden Gate are faster for multi-fragment work because they join several fragments in a single reaction.

See hand-drawn diagram below showing BioBrick prefix/suffix structure and two-part joining.

Week 7 HW: Genetic Circuits Part II

Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?

Boolean circuits force biological signals into binary (high/low), but real biomarkers exist at continuous concentrations. IANNs operate on analog values, performing weighted summation and nonlinear activation (ReLU), so they can compute complex continuous functions like bandpass filters and diagonal decision boundaries. These are the kinds of input-output shapes actually needed for problems like cancer classification, where you care about relative concentration levels, not just on/off.

IANNs are also continuously tunable. You can shift the decision boundary by adjusting translation rates, rather than being locked into fixed thresholds. And because they’re universal function approximators, a small number of neurons (even ~10) can approximate essentially any biologically relevant function.
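As a purely numerical illustration of how a few ReLU units can produce a bandpass response, here is a small sketch. The thresholds, weights, and function names are assumptions chosen for readability. It is not a model of any specific genetic circuit, and constant offsets like `lo` would need their own biological implementation.

```python
def relu(x: float) -> float:
    """Rectified linear unit: output cannot go below zero."""
    return max(0.0, x)


def bandpass(x: float, lo: float = 1.0, mid: float = 2.0) -> float:
    """Triangle-shaped bandpass built from three ReLU units.

    rising  = relu(x - lo)   switches on above `lo`
    falling = relu(x - mid)  switches on above `mid`
    The output neuron subtracts twice the falling signal from the rising
    one, so the response climbs until `mid` and then drops back to zero.
    """
    rising = relu(x - lo)
    falling = relu(x - mid)
    return relu(rising - 2.0 * falling)


for level in [0.0, 1.5, 2.0, 2.5, 3.0, 4.0]:
    print(level, bandpass(level))
# 0.0 0.0
# 1.5 0.5
# 2.0 1.0
# 2.5 0.5
# 3.0 0.0
# 4.0 0.0
```

A Boolean threshold circuit would answer only "above or below" for this input. The bandpass responds to an intermediate range, and changing `lo`, `mid`, or the weight of 2.0 moves or reshapes that range.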

2. Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations.

Application: targeted cancer therapy using miRNA classification.

Cancer cells have distinct miRNA profiles compared to healthy cells, but no single miRNA is unique to cancer. An IANN could take 2-3 miRNA concentrations as analog inputs, with neurons computing weighted sums where some miRNAs have positive weights (via promoter-driven expression) and others have negative weights (via endoribonuclease-mediated mRNA degradation). By composing a few neurons into a network, you can create a bandpass function that only activates a therapeutic output (e.g. a cytokine like IL-12) when the miRNA profile matches the cancer signature. Healthy cells with different profiles produce zero output.

Limitations:

  • Current circuits support ~3 neurons max, limiting classifier complexity
  • Biological noise blurs decision boundaries, risking false positives
  • Systemic delivery of circuit DNA to cells remains challenging
  • Weights are fixed at design time and can’t adapt to evolving tumour mutations

3. Multilayer perceptron diagram

See hand-drawn diagram.

In the single-layer perceptron, X1 encodes Csy4 endoribonuclease and X2 encodes a fluorescent protein with a Csy4 target site on its mRNA. Csy4 cleaves the mRNA at that site, so less fluorescent protein is made (subtraction), and biology’s floor at zero gives ReLU for free.

For the multilayer version: Layer 1’s output is an endoribonuclease instead of a fluorescent protein. This feeds into Layer 2, where it targets the mRNA of the final fluorescent output. Two chained subtraction + ReLU operations, linked by using an endoribonuclease as the intermediate signal.
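The two architectures described above reduce to one and two chained "subtract, then floor at zero" steps. The sketch below assumes unit weights and introduces X3 as a hypothetical name for the input encoding the final fluorescent output. Real weights would depend on expression and cleavage rates.

```python
def relu(x: float) -> float:
    """Concentrations cannot be negative, so the output floors at zero."""
    return max(0.0, x)


def single_layer(x1: float, x2: float) -> float:
    """X2 drives the fluorescent mRNA, X1 drives Csy4, which removes it."""
    return relu(x2 - x1)


def two_layer(x1: float, x2: float, x3: float) -> float:
    """Layer 1 makes an endoribonuclease, which layer 2 subtracts from X3.

    x3 is a hypothetical third input encoding the final fluorescent output,
    whose mRNA carries the target site of the layer-1 endoribonuclease.
    """
    hidden_nuclease = relu(x2 - x1)       # layer 1 output
    return relu(x3 - hidden_nuclease)     # layer 2 output


print(single_layer(1.0, 3.0))      # 2.0
print(two_layer(0.0, 2.0, 3.0))    # 1.0
print(two_layer(2.0, 2.0, 3.0))    # 3.0
print(two_layer(0.0, 5.0, 3.0))    # 0.0
```

In the two-layer case, raising X1 increases the final output: Csy4 removes the intermediate endoribonuclease, which would otherwise remove the reporter. A single subtract-and-ReLU unit cannot make X1 activating in this way.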


Assignment Part 2: Fungal Materials

1. Examples of existing fungal materials, uses, advantages and disadvantages?

Examples:

  • Mycelium packaging/insulation: mycelium grown on agricultural waste (wood chips, hay), packed into molds. Replaces styrofoam. Extremely lightweight and thermally/acoustically insulating.
  • Mycelium leather: processed mycelium sheets as leather alternatives in fashion and biocouture.
  • Mycelium construction: bricks grown from mycelium for architectural structures. The Hy-Fi pavilion at MoMA PS1 was built from mycelium bricks. NASA’s myco-architecture project is exploring mycelium habitats for the Moon/Mars.
  • Biocement (bacterial rather than fungal): bacteria that precipitate calcium carbonate, typically by hydrolysing urea, which solidifies sand/gravel into cement. Companies like Biomason do this at industrial scale.
  • Bacterial cellulose: SCOBY-grown sheets used as fabric alternatives.

Advantages: renewable (grown from agricultural waste), biodegradable, excellent insulation, lightweight, low energy to produce (room temperature growth), mouldable into custom shapes.

Disadvantages: slow growth (days to weeks), weaker than conventional materials, contamination risk during growth, hard to achieve consistent quality at scale, shrinks during dehydration.

2. What might you want to genetically engineer fungi to do? Advantages of synbio in fungi vs bacteria?

Engineering goals:

  • Better material properties (stronger, more flexible cell walls)
  • Biosensing (e.g. colour change in response to environmental signals, like the Aspergillus niger melanin/xylose system)
  • Stress resistance for extreme environments (radiation, low carbon — relevant to NASA’s space habitat work)
  • Programmable growth patterns without external molds

Why fungi over bacteria:

  • Fungi are eukaryotes, so they handle complex protein folding and post-translational modifications better than bacteria
  • Fungi naturally form robust materials (mycelium networks, fruiting bodies) that bacteria can’t
  • They grow on cheap unprocessed substrates (wood, agricultural waste)
  • Food-safe and familiar to consumers, lowering regulatory barriers

Key challenge: most genetic engineering tools exist for model fungi (yeast, Aspergillus) in Ascomycota. The material-relevant species (oyster mushroom, reishi) are in Basidiomycota, a distant phylum. Tools don’t transfer between them, and basidiomycetes are harder to transform because their spores take months to produce. New methods like Agrobacterium-mediated transformation are being developed to bridge this gap.

Subsections of Labs

Week 1 Lab: Pipetting

Subsections of Projects

Individual Final Project

Group Final Project
