Class Assignment — Week 1 1) Biological Engineering Application I aim to develop a computational and experimental platform for engineering metabolically constrained microbial systems designed for responsible real-world use. Inspired by clinical exposure to preventable infectious disease and my research at the intersection of microbiology and computational biology, the platform integrates genomic design rules, programmed auxotrophies, and environmental sensing circuits that couple microbial survival to defined ecological contexts.
Class Assignment — Week 2 Part 1 — Sequence Retrieval and Design Workflow 1) Sequence Retrieval and Benchling Initialization The process began with obtaining a Lambda GenBank file from New England Biolabs. After confirming the correct format, I imported the file into Benchling as a DNA sequence. Care was taken to ensure that the file was not mistakenly uploaded as RNA and that annotations displayed properly within the platform.
Class Assignment — Week 3 1) Opentrons Artwork 2) Published Papers Utilizing Automation LabscriptAI — Autonomous Liquid-Handling Robotics Scripting Gao et al., 2025 introduce LabscriptAI, a multi-agent framework that translates natural language experimental descriptions into validated Python scripts for heterogeneous liquid-handling robots, including Opentrons platforms.
Class Assignment — Week 4 Part A. Conceptual Questions 1) How many molecules of amino acids do you take with a piece of 500 grams of meat? Assumptions: lean meat is ~20% protein by mass, average amino acid residue ~100 Da (≈100 g/mol).
Class Assignment — Week 5 Part A. SOD1 Binder Peptide Design Background ALS remains one of the more intractable neurodegenerative diseases partly because its genetic architecture is well-defined but hard to drug. The A4V mutation in SOD1 - a single alanine-to-valine substitution at residue 4 - is one of the most aggressive familial variants, accelerating disease progression significantly compared to other SOD1 mutations. The aggregation-prone nature of the A4V protein makes it an interesting peptide-binding target: if you can design a peptide that engages the misfolded or oligomerizing form, you potentially disrupt a key early step in motor neuron toxicity.
Class Assignment — Week 6 Part A. DNA Assembly 1. Components of Phusion High-Fidelity PCR Master Mix A) Phusion DNA Polymerase A DNA-binding protein subunit that ensures higher template processivity, speed, and accuracy/fidelity alongside 5´→3´ polymerase activity and 3´→5´ exonuclease activity for proofreading.
Class Assignment — Week 7 Part A. Intracellular Artificial Neural Networks (IANNs) 1. Advantages of IANNs over Boolean Genetic Circuits Boolean genetic circuits are fundamentally limited by their design logic: every input gets collapsed into a binary state, and the circuit operates on those discrete values. That works for simple switch-like decisions, but most physiologically relevant signals (metabolite concentrations, osmotic gradients, and quorum-sensing molecule titres) exist on a continuum, and forcing them through a hard threshold discards information. IANNs avoid this by processing analog inputs directly, generating graded outputs that reflect the actual magnitude of the input rather than just which side of a threshold it fell on.
Class Assignment — Week 9 Part A. General and Lecturer-Specific Questions 1. General homework questions 1. Advantages of Cell-Free Protein Synthesis Over In Vivo Methods Cell-free systems decouple protein production from cell viability, giving you direct control over reaction composition, temperature, redox state, and cofactor concentrations, none of which are easily tunable in living cells.
Class Assignment — Week 10 Homework: Final Project ÌṢỌ is currently computational, so the “measurements” in scope are model outputs rather than physical assays. The key quantities I track are: steady-state pathogen kill rate as a function of MccH47 production, growth rate as a function of expression burden δ, biosensor activation ratio across tetrathionate concentrations, and containment escape probability over generational time. These are computed from ODE integration and Moran process simulation rather than physical instruments, but they map directly onto measurable biological quantities that would need experimental validation in a future phase of the project.
Class Assignment — Week 11 Part A. Community Bioart Reflections | The 1,536 Pixel Artwork Canvas I contributed to the “Love” apple-shaped yellow sign at the mid-bottom of the artwork, working on the DNA assembly for that section of the plate.
Fitness-aware design of engineered probiotics under ecological and evolutionary constraints.
This project is a model-first, constraint-aware approach to engineering E. coli Nissle 1917 (EcN) as a gut sentinel: sensing context, responding with targeted antimicrobials, and remaining governable through built-in containment.
Inspiration
Where this came from
During my medical training in Osogbo, diarrheal admissions became a rhythm I could not ignore. Children arrived dehydrated, eyes sunken, mothers anxious yet composed in that uniquely Nigerian way, strong because they had to be. We gave ORS, zinc, fluids. Sometimes antibiotics “just in case.” Sometimes it worked. Sometimes the silence afterward stayed with me longer than the ward round.
In microbiology, I encountered E. coli again, this time not only as culprit but as chassis. That shift lingered. What if the organism we blamed could be redesigned as a responder—quiet in health and active only when toxin or inflammatory signals rise—constrained and context bound, unable to persist beyond intention?
The idea was not dramatic. It was patterned. Repetition in the pediatric ward met ecological thinking in the lab. If microbes shape disease landscapes, perhaps they can also stabilize them—precisely, intelligently, and safely—within the same environments where I first learned to treat the consequences.
Why this matters
Childhood diarrhoeal disease remains high-burden with persistent treatment gaps, despite well-known interventions. The ambition here is not spectacle—it is reliable behavior under pressure: a responder that stays quiet in health, activates only under risk signals, and remains bounded by design.
Core design stance
Optimize for stability, not just performance. I’m not chasing one “best construct.” I’m mapping design regimes: what works, what breaks, and what stays governable as conditions shift—fitness cost vs efficacy, signal vs noise, activation vs survivability.
System overview
ÌṢỌ is designed as a three-layer system:
Detection: a biosensor tuned to a pathogen-associated signal or inflammation-linked marker
Response: context-dependent expression of targeted antimicrobials (microcins)
Containment: survival becomes conditional via metabolic dependency (“metabolic contract”)
Modeling assumptions & constraints
Burden matters: expression cost is a first-class design variable, not a footnote
Selection is always running: anything that reduces fitness will be negotiated by evolution
The gut isn’t a flask: competition and variability are the setting, not edge-cases
Outputs are design guidance: models inform what to build next, not clinical claims
Containment is a system property: not only “does it exist,” but “does it hold under pressure?”
Out of scope (Spring 2026)
Wet-lab validation
Full microbiome ecosystem simulation
Inventing novel antimicrobials
Clinical deployment trials
Regulatory implementation
Pipeline
Model → explore → optimize → stress-test.
The goal is to produce:
reproducible computational models
tradeoff plots (fitness vs efficacy)
robustness/sensitivity analyses
design regimes rather than a single “optimal” construct
Circuit modules
Module 1 — Biosensor: reads a context signal and gates activation to reduce unnecessary burden
Module 2 — Regulator: thresholded activation to limit leaky expression and improve stability under selection
Module 3 — Effector (microcin): narrow-spectrum antimicrobial peptides aiming to pressure pathogens while minimizing broader disruption
Module 4 — Containment: metabolic dependency to embed governance in biology
Governance & biosafety
Metabolic Dependency: if the engineered organism is made dependent on an externally supplied essential metabolite, it becomes non-viable without deliberate human-provided support.
Ecological Firewall: escapees cannot persist in nature, reducing ecological risk.
Human-Controlled Survival (“metabolic contract”): survival is coupled to oversight and supply chains, embedding accountability into the organism’s survival logic.
References
Ba, F., Zhang, Y., Ji, X., Liu, W.-Q., Ling, S., & Li, J. (2023). Expanding the toolbox of probiotic Escherichia coli Nissle 1917 for synthetic biology. bioRxiv. https://doi.org/10.1101/2023.06.05.543671
Egbewale, B. E., Karlsson, O., & Sudfeld, C. R. (2022). Childhood Diarrhea Prevalence and Uptake of Oral Rehydration Solution and Zinc Treatment in Nigeria. Children, 9(11), 1722. https://doi.org/10.3390/children9111722
Gayawan, E., Cameron, E., Okitika, T., Egbon, O. A., & Gething, P. (2024). A situational assessment of treatments received for childhood diarrhea in the Federal Republic of Nigeria. PLOS ONE, 19(5), e0303963. https://doi.org/10.1371/journal.pone.0303963
Lynch, J. P., Goers, L., & Lesser, C. F. (2022). Emerging strategies for engineering Escherichia coli Nissle 1917-based therapeutics. Trends in Pharmacological Sciences, 43(9). https://doi.org/10.1016/j.tips.2022.02.002
Palmer, J. D., Piattelli, E., McCormick, B. A., Silby, M. W., Brigham, C. J., & Bucci, V. (2017). Engineered Probiotic for the Inhibition of Salmonella via Tetrathionate-Induced Production of Microcin H47. ACS Infectious Diseases, 4(1), 39–45. https://doi.org/10.1021/acsinfecdis.7b00114
Weibel, N., Curcio, M., Schreiber, A., et al. (2024). Engineering a Novel Probiotic Toolkit in Escherichia coli Nissle 1917 for Sensing and Mitigating Gut Inflammatory Diseases. ACS Synthetic Biology, 13(8), 2376–2390. https://doi.org/10.1021/acssynbio.4c00036
HTGAA Group Project: Engineering the MS2 Bacteriophage L Protein
Subsections of Projects
Individual Final Project
ÌṢỌ — Yoruba: to be well; to recover.
A fitness-aware engineered probiotic designed to sense gut context, respond with targeted antimicrobials, and remain governable by design.
The problem
Childhood diarrhoeal disease kills roughly half a million children under five every year, and the majority of those deaths happen in sub-Saharan Africa. During clinical training in Osogbo, the treatment options were ORS, zinc, and empirical antibiotics. Effective, but blunt. The gap is not a shortage of therapeutics. It is a precision problem.
Core design question
How do we design microbial circuits that remain governable under the evolutionary and ecological pressures of a real gut environment?
The existing engineered probiotic literature optimises for peak performance under ideal conditions. ÌṢỌ maps design regimes: what works, what breaks, and what stays stable as conditions shift.
System architecture
ÌṢỌ is a four-module sense-respond-contain system built on E. coli Nissle 1917 (EcN):
| Module | Component | Role |
| --- | --- | --- |
| Biosensor | TtrS/TtrR two-component system | Detects tetrathionate, a pathogen-associated signal produced during gut inflammation by Salmonella and E. coli O157:H7 |
| Regulator | Thresholded Hill-function promoter | Gates activation; suppresses leaky expression and reduces fitness cost at homeostatic baseline |
| Effector | Microcin H47 (MccH47) | Narrow-spectrum antimicrobial; ATP synthase inhibition; active against Salmonella, Shigella, pathogenic E. coli; endogenous immunity protein MchI in EcN chassis |
| Containment | deltaDAPA auxotrophy | DAP absent from mammalian gut; deletion is lethal without exogenous supply; escape frequency ~10^-8 per generation |
Design stance
Optimise for stability, not just performance. The output is not a single optimal construct. It is a map of design regimes: parameter regions where the circuit functions, where it fails under burden, and where containment holds under selection pressure.
Scope — Spring 2026
Computational modelling only. Wet-lab validation, full microbiome simulation, and clinical deployment are explicitly out of scope for this phase.
Aim 1 (Experimental)
The first aim of my final project is to build and simulate a genome-scale metabolic and circuit-level ODE model of the four-module ÌṢỌ architecture. It uses Tellurium/libroadrunner for time-course simulation, SALib for global sensitivity analysis, and NumPy-based Moran process modelling for evolutionary containment stability. The primary computational outputs are a Pareto-resolved fitness-efficacy landscape and a ranked parameter influence analysis.
Aim 2 (Developmental)
Following a successful Aim 1, the top-ranked parameter regimes from the Pareto landscape will guide assembly and transformation of the sense-respond-contain circuit into EcN. The engineered sentinel will be tested in co-culture assays against Salmonella Typhimurium and E. coli O157:H7, validating both the tetrathionate-sensing threshold and MccH47-mediated kill kinetics experimentally. Discrepancies between model predictions and wet-lab data will feed back into model refinement.
Aim 3 (Visionary)
The long-term goal is a rugged, orally delivered live biotherapeutic that operates autonomously in the gut, activates only in the presence of pathogen-associated tetrathionate, kills with a narrow spectrum and without collateral microbiome disruption, and cannot persist outside the host. If the fitness-governability framework holds, ÌṢỌ becomes a design methodology applicable beyond this specific pathogen set, with direct relevance for AMR management, inflammatory bowel disease, and cancer immunotherapy in low-resource clinical settings.
Literature context
Palmer et al. (2017, ACS Infectious Diseases) demonstrated that EcN can be engineered to sense gut-luminal tetrathionate via the TtrS/TtrR two-component system and produce Microcin H47 in response, achieving measurable Salmonella inhibition in a mouse colonisation model. Critically for ÌṢỌ, this paper provides experimentally validated, ODE-parameterisable values for sensor activation kinetics, MccH47 production rates, and pathogen kill constants, making it the direct quantitative predecessor to this project rather than simply a conceptual reference.
Stritzker et al. (2007, International Journal of Medical Microbiology) characterised deltaDAPA auxotrophy in EcN in detail, reporting an escape frequency of approximately 10^-8 per generation under DAP-free conditions. That specific number is what makes the containment module computationally tractable: escape probability can be directly parameterised in the Moran process model rather than estimated from first principles.
What is novel
What ÌṢỌ does that neither paper does is treat fitness cost as a first-class design variable rather than a post-hoc observation. Every published EcN engineering study acknowledges metabolic burden; none model it explicitly as a design input alongside efficacy. ÌṢỌ builds a Pareto frontier that makes the tradeoff navigable rather than anecdotal. The containment module also moves from binary characterisation (auxotrophy is present or absent) to a dynamic system property, asking how quickly a loss-of-function mutant fixes in a finite population over evolutionary time.
Why this matters
Diarrhoeal disease causes approximately 1.6 million deaths per year globally, with the under-five burden concentrated in West and East Africa. Nigeria alone accounts for a disproportionate share of this mortality. Existing interventions reduce severity but do not prevent recurrence in high-transmission settings, and empirical antibiotic use is accelerating resistance emergence in the pathogens most responsible for paediatric deaths: Salmonella, enterotoxigenic E. coli, and Shigella. A sentinel probiotic that activates conditionally, kills with a narrow spectrum, and cannot persist outside the host addresses this without adding to AMR pressure.
Beyond the immediate clinical problem, the fitness-governability framework ÌṢỌ develops has broader implications. Any engineered living therapeutic faces the same core question: will the circuit hold under the evolutionary pressure of a real biological environment? Current regulatory frameworks for live biotherapeutics have no standardised computational tool for answering this before a clinical trial. ÌṢỌ begins building one. Nigerian and broader West African epidemiological data (Egbewale 2022; Gayawan 2024) are used to parameterise disease burden and clinical context from the start, not as a framing afterthought.
Ethical implications
Two principles are directly engaged here: beneficence and justice. A precision antimicrobial that spares the commensal microbiome and cannot persist outside the host is strictly better than empirical broad-spectrum antibiotics for the patient, for the microbiome, and for the resistance landscape. Research that addresses paediatric mortality in West Africa while remaining computationally grounded in West African epidemiology represents a genuine departure from the default of developing interventions for high-income contexts and adapting them downstream.
The risks require honesty. A single deltaDAPA deletion is probably not sufficient for any real-world deployment. The current model assumes a closed population and does not account for horizontal gene transfer of the dapA gene from environmental bacteria. The Moran process also excludes commensal competition dynamics, so estimates of circuit persistence are optimistic. These are known limitations, explicitly scope-bounded to this computational phase. Non-maleficence requires that these caveats travel with any communication of the results. Open-source model release via GitHub (MIT licensed) is a deliberate act toward equitable access to the methodology.
Chosen: tetrathionate via TtrS/TtrR two-component system.
Pathogen-associated: tetrathionate forms during gut inflammation when reactive oxygen species oxidise luminal thiosulfate, as occurs in Salmonella and E. coli O157:H7 infection. Experimentally validated in EcN (Palmer et al. 2017). Signal is absent under homeostatic conditions, directly minimising leaky expression burden at baseline.
Microcin effector
Chosen: Microcin H47 (MccH47).
Naturally produced by EcN. Narrow-spectrum: E. coli, Salmonella, Shigella. Mechanism is ATP synthase inhibition, a well-characterised mode of action enabling direct ODE kill-kinetics parameterisation. Immunity protein MchI is endogenous to the EcN chassis. Palmer 2017 provides benchmarked production and kill-rate values for exactly this design.
Containment
Chosen: deltaDAPA auxotrophy (diaminopimelic acid / DAP).
DapA is essential for lysine and peptidoglycan synthesis. DAP is absent from the mammalian gut: no dietary source, no commensal production. Deletion is lethal without exogenous supply. Validated in EcN (Stritzker et al. 2007). Published escape frequency ~10^-8 per generation is directly parameterisable for the containment escape model.
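The published per-generation escape frequency can be turned into a population-level number with a few lines of arithmetic. The sketch below assumes every cell in every generation is an independent trial and that population size is constant; the population sizes, the ten-generation horizon, and the squared frequency for a second independent lesion are illustrative assumptions, not values from the cited study.

```python
import math

def escape_probability(p_escape: float, population: float, generations: float) -> float:
    """Probability that at least one escaper arises.

    Assumes independent trials per cell per generation and a constant
    population size (both simplifications).
    """
    trials = population * generations
    # 1 - (1 - p)^trials, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(trials * math.log1p(-p_escape))

P_SINGLE = 1e-8          # escape frequency quoted in the text
P_DUAL = P_SINGLE ** 2   # assumption: second independent lesion of equal frequency

for n_cells in (1e6, 1e8, 1e10):
    single = escape_probability(P_SINGLE, n_cells, generations=10)
    dual = escape_probability(P_DUAL, n_cells, generations=10)
    print(f"N={n_cells:.0e}  single={single:.3e}  dual={dual:.3e}")
```

Under these assumptions a single lesion stops being a rare-event barrier once the product of population size and generations approaches 10^8, which is the regime a gut population would plausibly reach.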
ODE framework
Chosen: Tellurium + libroadrunner (SBML/Antimony).
Purpose-built for systems biology ODE modelling. Antimony syntax maps directly onto circuit topology (promoter to mRNA to protein). libroadrunner’s stiff CVODE solver handles fast mRNA turnover and slow protein accumulation dynamics without manual configuration. SBML export makes every model citable and reproducible. SciPy's solve_ivp with method="LSODA" is used alongside it for parameter sweeps and Pareto grid computation.
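As a rough illustration of the promoter-to-mRNA-to-protein topology, here is a minimal two-ODE sensor sketch written for SciPy rather than Antimony. K_D = 50 and n = 2 are the initial estimates stated in the validation protocol; every other rate constant, the signal values, and the function names are placeholders chosen only so that mRNA and protein time scales differ.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sensor_rhs(t, y, signal, k_tx, K_D, n, d_m, k_tl, d_p):
    """mRNA (m) and reporter protein (p) driven by a Hill-type promoter."""
    m, p = y
    activation = signal**n / (K_D**n + signal**n)
    dm = k_tx * activation - d_m * m   # transcription minus mRNA decay
    dp = k_tl * m - d_p * p            # translation minus dilution/decay
    return [dm, dp]

# Placeholder rates (hypothetical); K_D and n follow the stated initial estimates.
PARAMS = dict(k_tx=2.0, K_D=50.0, n=2.0, d_m=10.0, k_tl=5.0, d_p=0.5)

def steady_protein(signal, t_end=50.0):
    """Integrate from zero initial state and return the final protein level."""
    sol = solve_ivp(lambda t, y: sensor_rhs(t, y, signal, **PARAMS),
                    (0.0, t_end), [0.0, 0.0], method="LSODA",
                    rtol=1e-8, atol=1e-10)
    return sol.y[1, -1]

for s in (0.0, 10.0, 50.0, 200.0, 1000.0):  # hypothetical signal levels
    print(f"signal={s:7.1f}  protein={steady_protein(s):.4f}")
```

LSODA is chosen here because it switches between stiff and non-stiff integration on its own, which suits a model where mRNA decays much faster than protein.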
Sensitivity analysis
Primary: PRCC via SALib (Marino et al. 2008).
Designed for nonlinear, monotonic systems, exactly what Hill-function gene circuits produce. 500 to 2000 Latin hypercube samples sufficient for 6 to 8 parameters.
Supplementary: Sobol total-order indices.
Captures interaction effects (Hill coefficient n and KD interact in the sensor module). 5000 to 10000 samples, tractable on a laptop in minutes.
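A minimal sketch of the PRCC step, using a toy output (Hill activation times k_M at a fixed signal of 100) in place of the full ODE model. Sampling and PRCC are written directly in NumPy so the example is self-contained; the parameter bounds, the dummy parameter, and the helper names are illustrative assumptions, and which of these steps SALib covers out of the box should be checked against its documentation.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, rng):
    """One stratified draw per interval and dimension, shuffled per column."""
    bounds = np.asarray(bounds, dtype=float)
    d = len(bounds)
    u = (rng.random((n_samples, d)) + np.arange(n_samples)[:, None]) / n_samples
    for j in range(d):
        u[:, j] = rng.permutation(u[:, j])
    return bounds[:, 0] + u * (bounds[:, 1] - bounds[:, 0])

def rank(a):
    """Ordinal ranks along axis 0 (ties are not expected for continuous samples)."""
    return np.argsort(np.argsort(a, axis=0), axis=0).astype(float)

def prcc(X, y):
    """Partial rank correlation of each column of X with y."""
    Xr, yr = rank(X), rank(y)
    out = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(yr)), np.delete(Xr, j, axis=1)])
        # Residuals after regressing out the other (ranked) parameters.
        res_x = Xr[:, j] - others @ np.linalg.lstsq(others, Xr[:, j], rcond=None)[0]
        res_y = yr - others @ np.linalg.lstsq(others, yr, rcond=None)[0]
        out[j] = np.corrcoef(res_x, res_y)[0, 1]
    return out

rng = np.random.default_rng(0)
names = ["n", "K_D", "k_M", "dummy"]
bounds = [(1.0, 4.0), (10.0, 200.0), (0.1, 10.0), (0.0, 1.0)]  # hypothetical ranges
X = latin_hypercube(1000, bounds, rng)
S = 100.0  # fixed signal level (assumption)
y = X[:, 2] * S**X[:, 0] / (X[:, 1]**X[:, 0] + S**X[:, 0])
scores = prcc(X, y)
for name, s in sorted(zip(names, scores), key=lambda pair: -abs(pair[1])):
    print(f"{name:6s} {s:+.2f}")
```

The dummy parameter does not enter the output, so its score gives a feel for the noise floor of the ranking. Because the effect of n changes sign depending on whether the signal is above or below K_D, its PRCC is a case where the Sobol indices are the more informative measure.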
Evolutionary stability
Chosen: Moran process with fitness-weighted selection.
Two competing types: functional circuit (fitness 1 minus delta) and loss-of-function mutant (fitness 1). Fixation probability computed analytically (Nowak 2006), then 1000 stochastic trajectories via numpy.random.choice() with fitness-weighted birth-death events. Directly answers: how long does the circuit remain functional under selection pressure?
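For reference, the analytical fixation probability for this two-type setup can be derived in a few lines. The derivation below is the standard Moran-process result specialised to the fitness values stated above; it assumes a single initial mutant, a fixed population size N, and no further mutation, and the symbols T_i^{+}, T_i^{-} and \rho are notation introduced here.

```latex
% i loss-of-function mutants (fitness 1) among N cells;
% circuit carriers have fitness 1 - \delta.
\begin{align}
T_i^{+} &= \frac{i}{i + (N-i)(1-\delta)} \cdot \frac{N-i}{N}, &
T_i^{-} &= \frac{(N-i)(1-\delta)}{i + (N-i)(1-\delta)} \cdot \frac{i}{N}, \\
\frac{T_i^{-}}{T_i^{+}} &= 1-\delta \quad \text{for all } i, \\
\rho &= \frac{1}{1 + \sum_{k=1}^{N-1} \prod_{i=1}^{k} \frac{T_i^{-}}{T_i^{+}}}
      = \frac{1}{\sum_{k=0}^{N-1} (1-\delta)^{k}}
      = \frac{\delta}{1-(1-\delta)^{N}} .
\end{align}
% Limits: \rho \to 1/N as \delta \to 0 (neutral drift);
% \rho \to \delta as N \to \infty for fixed \delta > 0.
```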
| Layer | Tool | Use |
| --- | --- | --- |
| ODE engine | Tellurium + libroadrunner | All four-module ODE construction and time-course simulation written in Antimony syntax. SBML export for reproducibility and citability. |
| Numerical / sweeps | SciPy solve_ivp (LSODA) | Parameter sweeps and Pareto grid computation. LSODA auto-switches between stiff and non-stiff regimes. |
| Sensitivity analysis | SALib | PRCC for main figures, Sobol as supplementary. Canonical citation: Marino et al. 2008, J. Theor. Biol. |
| Evolutionary simulation | NumPy random.choice() | Moran process. Fitness-weighted birth-death events across 1000 independent trajectories. No additional dependencies. |
Task 1: Environment setup and baseline biosensor model (week 1)
Install and configure the full modelling stack: Tellurium, libroadrunner, SALib, NumPy, Matplotlib, Seaborn, SciPy, all pinned in a venv with requirements.txt. Build the biosensor module as a two-ODE Hill-function model encoding the TtrS/TtrR tetrathionate-to-promoter activation pathway. Fit activation threshold KD and Hill coefficient n against Palmer 2017 time-course data.
Expected result: Simulated sensor activation curve matches digitised Palmer 2017 experimental data within 20% across the measured tetrathionate concentration range.
Task 2: Full four-module ODE construction (week 2)
Extend the biosensor ODE to include the regulator module (thresholded Hill-function promoter gating effector expression), the effector module (MccH47 production and pathogen kill kinetics), and the containment module (deltaDAPA escape probability). Write all models in Antimony syntax within Tellurium. Export validated models to SBML and commit to GitHub.
Expected result: Stable steady-state solutions for all four modules under both homeostatic and pathogen-present conditions. Leaky expression at baseline should approach zero.
Task 3: Pareto landscape and parameter sweep (week 3)
Use SciPy solve_ivp with method="LSODA" to sweep burden parameter delta and effector output rate k_M across a 50 x 50 parameter grid. Record steady-state growth rate and pathogen suppression ratio for each grid point. Plot Pareto frontier, colour-coded by regulator variant (linear vs. thresholded).
Expected result: A visible Pareto frontier separating viable design space from over-burdened and under-effective regions. The thresholded regulator variant should dominate the frontier.
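Extracting the frontier from the sweep is a non-dominated filter over the two recorded outputs. The sketch below shows that step on a 50 x 50 grid; the closed-form growth and suppression expressions are a placeholder surrogate standing in for the ODE steady states, and the grid ranges and the name pareto_mask are assumptions made for illustration.

```python
import numpy as np

def pareto_mask(growth, suppression):
    """True where no other point is at least as good on both axes and better on one."""
    pts = np.column_stack([growth, suppression])
    mask = np.ones(len(pts), dtype=bool)
    for i, p in enumerate(pts):
        dominates = np.all(pts >= p, axis=1) & np.any(pts > p, axis=1)
        mask[i] = not dominates.any()
    return mask

delta = np.linspace(0.01, 0.5, 50)   # burden values (hypothetical range)
k_M = np.linspace(0.1, 10.0, 50)     # effector output rates (hypothetical range)
D, K = np.meshgrid(delta, k_M, indexing="ij")

# Placeholder surrogate, NOT the project model: burden lowers growth,
# and suppression saturates with the product of burden and output rate.
growth = (1.0 - D).ravel()
suppression = (K * D / (1.0 + K * D)).ravel()

frontier = pareto_mask(growth, suppression)
print(f"{frontier.sum()} of {frontier.size} grid points are non-dominated")
```

The pairwise check is quadratic in the number of grid points, which is fine for 2,500 points; a sort-based sweep would be the natural replacement for much larger grids.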
Task 4: Global sensitivity analysis (week 4)
Run PRCC analysis via SALib using Latin hypercube sampling across 6 to 8 parameters: Hill coefficient n, signal threshold KD, burden delta, MccH47 production rate k_M, pathogen kill rate k_kill, mRNA degradation rate, and protein dilution rate. Generate ranked tornado chart. Run supplementary Sobol total-order index analysis.
Expected result: n and KD rank as the top two PRCC drivers of sensor module output. Sobol indices confirm a significant interaction effect between the two.
Task 5: Evolutionary stability via Moran process (week 5)
Implement the Moran process in NumPy. Define two competing cell types (functional circuit, fitness 1 minus delta; loss-of-function mutant, fitness 1). Compute analytical fixation probability from Nowak 2006. Run 1000 stochastic trajectories. Vary delta across the Pareto-viable range; plot fixation probability across three population sizes with analytical solution overlaid.
Expected result: Fixation probability of the loss-of-function mutant increases sharply above delta = 0.1. This is the quantitative argument for why the thresholded regulator module is not optional.
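A minimal sketch of the simulation described in this task, assuming a single initial loss-of-function mutant, a fixed population size, and no further mutation. It draws birth and death events with Generator.random rather than numpy.random.choice purely for speed; the population size of 20, the delta values, and the function names are illustrative choices.

```python
import numpy as np

def fixation_analytical(delta, N):
    """Fixation probability of one mutant (fitness 1) among
    N - 1 circuit carriers (fitness 1 - delta)."""
    if delta == 0:
        return 1.0 / N
    return delta / (1.0 - (1.0 - delta) ** N)

def simulate_fixation(delta, N, n_traj, rng):
    """Fraction of trajectories in which the mutant takes over."""
    fixed = 0
    for _ in range(n_traj):
        i = 1  # current number of mutants
        while 0 < i < N:
            w_mut, w_res = i * 1.0, (N - i) * (1.0 - delta)
            # Birth: reproducing type chosen in proportion to total fitness.
            birth_mut = rng.random() < w_mut / (w_mut + w_res)
            # Death: a uniformly random individual is replaced.
            death_mut = rng.random() < i / N
            i += int(birth_mut) - int(death_mut)
        fixed += (i == N)
    return fixed / n_traj

rng = np.random.default_rng(0)
for delta in (0.02, 0.1, 0.2):
    exact = fixation_analytical(delta, 20)
    sim = simulate_fixation(delta, 20, 1000, rng)
    print(f"delta={delta:.2f}  analytical={exact:.3f}  simulated={sim:.3f}")
```

Overlaying the two columns across several population sizes gives the comparison the task describes; larger populations need more steps per trajectory, so trajectory counts may have to be traded against run time.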
Industry Council companies
| Company | Role |
| --- | --- |
| Asimov (Kernel) | Validate Pareto landscape and containment circuit architecture; independent cross-check against Tellurium ODE results |
| SecureDNA | Screen all DNA sequences (mccH47, deltaDAPA cassette) before synthesis |
| Cultivarium | EcN-specific transformation protocols and characterised parts for Aim 2 |
| Twist Biosciences | Codon-optimised construct synthesis for Aim 2 |
| Opentrons | Co-culture assay automation for Aim 2 parallel screening |
1 — Fitness-efficacy Pareto frontier
Burden parameter delta and effector output k_M swept across a 50 x 50 parameter grid. Each point represents steady-state growth rate and pathogen suppression ratio. Pareto frontier overlaid. Colour-coded by circuit variant (linear vs. thresholded regulator). This figure makes the design regime concept concrete: the viable parameter space, the over-burdened region, and the under-effective region visible in a single plot. No equivalent exists in the published EcN engineering literature.
2 — Sensitivity analysis (PRCC tornado)
PRCC bar chart ranked by absolute influence on steady-state pathogen suppression. Parameters: Hill coefficient n, signal threshold KD, burden delta, microcin production rate k_M, pathogen kill rate k_kill. Sobol indices shown as supplementary to capture n-KD interaction effects. Identifies which parameters drive circuit behaviour most strongly, informing which design variables to constrain first in any future experimental build.
3 — Containment escape probability
Semi-log plot of escape frequency vs. generations. Single deltaDAPA vs. dual deltaDAPA + deltaThyA auxotrophy compared. Analytical curve overlaid on stochastic simulation trajectories. Anchored to published escape frequency ~10^-8 per generation (Stritzker 2007).
4 — Evolutionary stability (Moran fixation)
Fixation probability of loss-of-function mutant as a function of burden delta, across three population sizes. 1000-trajectory stochastic fan with analytical Nowak 2006 solution overlaid. Demonstrates that the thresholded regulator (Module 2) extends functional circuit half-life under selection relative to constitutive expression: the quantitative argument for why the regulator module is not optional.
What is being validated
The primary validation for this phase is the biosensor module ODE fitted to Palmer 2017 experimental data, and the resulting Pareto landscape generated by the four-module parameter sweep. Together, these demonstrate that the computational framework is grounded in real parameterisation, not arbitrary simulation.
Validation protocol
1. Extract tetrathionate-responsive GFP expression data from Palmer et al. 2017 Figure 3 (digitised using WebPlotDigitizer).
2. Build the TtrS/TtrR two-component ODE in Antimony syntax within Tellurium. Initial parameter estimates: Hill coefficient n = 2.0, activation threshold KD = 50 micromolar.
3. Run time-course simulations across five tetrathionate concentrations matching Palmer 2017 experimental conditions.
4. Fit KD and n using SciPy curve_fit with residual sum of squares minimisation. Report fitted values with 95% confidence intervals.
5. Validate that simulated activation curve matches digitised data within 20% across the concentration range.
6. Extend validated sensor ODE to the full four-module system. Run 50 x 50 parameter sweep. Record steady-state outputs for each grid point.
7. Identify and plot the Pareto frontier. Export 300 dpi PNG and SVG to GitHub.
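The fitting and tolerance-check steps of the protocol can be sketched as follows. The data here are synthetic, generated from a normalised Hill curve with 2% multiplicative noise, and are not values from Palmer et al. 2017; the five concentrations, the normalisation to a maximum of 1, and the Student-t interval are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import t as student_t

def hill(signal, K_D, n):
    """Normalised steady-state reporter output as a Hill function."""
    return signal**n / (K_D**n + signal**n)

# Synthetic stand-in for the digitised data (hypothetical concentrations, micromolar).
conc = np.array([10.0, 25.0, 50.0, 100.0, 250.0])
rng = np.random.default_rng(1)
obs = hill(conc, 50.0, 2.0) * (1.0 + 0.02 * rng.standard_normal(conc.size))

# Least-squares fit starting from the stated initial estimates.
popt, pcov = curve_fit(hill, conc, obs, p0=(50.0, 2.0), bounds=(0.0, np.inf))

# 95% intervals from the covariance matrix, using a t quantile for the small sample.
dof = conc.size - len(popt)
half_width = student_t.ppf(0.975, dof) * np.sqrt(np.diag(pcov))

# Pointwise relative error for the 20% acceptance check.
rel_err = np.abs(hill(conc, *popt) - obs) / obs
for name, est, hw in zip(("K_D", "n"), popt, half_width):
    print(f"{name}: {est:.3g} +/- {hw:.2g} (95% CI)")
print("all points within 20%:", bool(np.all(rel_err <= 0.20)))
```

An unweighted fit gives low-concentration points little influence, so a relative tolerance is hardest to meet at the bottom of the curve; passing sigma to curve_fit is one way to rebalance this if the digitised data call for it.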
The main challenge so far is parameter identifiability in the effector module. The MccH47 kill-rate constant k_kill is not directly reported in Palmer 2017; it is inferred from pathogen viable-count time-courses. The current approach uses a range of literature-sourced bacteriocin kill-rate values (10^6 to 10^8 cells per micromolar per hour) and propagates this uncertainty explicitly through the sensitivity analysis rather than selecting a single point estimate. This produces confidence intervals on the Pareto frontier that reflect real parametric uncertainty rather than false precision.
A second known limitation is the absence of commensal competition in the current Moran process model. Current fixation probability estimates are therefore optimistic by design. This is flagged explicitly in every figure caption involving evolutionary stability output. The Lotka-Volterra competition extension planned for post-course development is the right fix; it is out of scope here and is stated as such.
Distinction from existing work
The engineered probiotic field asks: can we build a circuit that works?
ÌṢỌ asks: across what design regimes does a circuit remain both functional and governable under the pressures that will actually be present?
Fitness cost as a design variable — no published EcN paper produces a fitness-efficacy Pareto frontier. All existing work acknowledges burden; none model it as a first-class input.
Containment as a dynamic system property — the field treats auxotrophy as binary. ÌṢỌ models escape probability over evolutionary time via the Moran process.
Disease context — the dominant literature targets IBD and colorectal cancer. ÌṢỌ is framed around acute paediatric diarrhoeal disease in West African clinical settings; this changes which signals, effectors, and ecological assumptions are relevant.
Model-first methodology — existing EcN engineering papers build constructs first and measure them. ÌṢỌ maps the computational design landscape before any construct is built.
Geographic grounding — clinical inspiration and epidemiological parameters drawn from Nigerian data (Egbewale 2022; Gayawan 2024). African-origin disease burden as a scientific foundation, not a framing afterthought.
What ÌṢỌ builds on
Palmer et al. 2017 — direct experimental predecessor; tetrathionate/MccH47/EcN parameter source
Weibel et al. 2024 — modular architecture precedent; ÌṢỌ extends with containment module and ODE-level analysis
Ba et al. 2023 — EcN toolbox that any future wet-lab build would draw on
The models, figures, and write-up constitute the core of a bioRxiv preprint. Adding abstract, introduction, and discussion sections would turn it into a citable first-author computational biology paper.
Microbiome competition layer
Add a simplified Lotka-Volterra competition term for commensal species. Explores how microbiome density affects EcN colonisation stability and circuit persistence under realistic ecological conditions.
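One common way to write such a competition layer is the two-species competitive Lotka-Volterra system below, with E the engineered EcN density and C a lumped commensal density. All symbols (growth rates r, carrying capacities K, competition coefficients \alpha) are notation assumed here for illustration; the planned extension may be structured differently.

```latex
\begin{align}
\frac{dE}{dt} &= r_E\,E\left(1 - \frac{E + \alpha_{EC}\,C}{K_E}\right), \\
\frac{dC}{dt} &= r_C\,C\left(1 - \frac{C + \alpha_{CE}\,E}{K_C}\right).
\end{align}
% Stable coexistence requires each type to limit itself more than its competitor:
%   \alpha_{EC}\,K_C < K_E  \quad\text{and}\quad  \alpha_{CE}\,K_E < K_C .
% If only the second inequality fails, C excludes E in this model.
```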
Week 5 synthesis — microcin analog design
Apply the PepMLM/moPPIt peptide generation pipeline (HTGAA Week 5) to propose microcin-analog sequences with improved target specificity. AlphaFold3 structural prediction of microcin-pathogen outer membrane protein complexes bridges the computational peptide design and engineered probiotic work.
West Africa AMR data integration
Parameterise the pathogen kill model with AMR prevalence data from Nigerian clinical isolates (WHONET/GLASS). Grounds the model in Sub-Saharan African epidemiology and connects to the planned AMR West Africa genomic data paper.
All model code, SBML files, and figures. MIT licensed. CITATION.cff included.
References
Palmer, J. D., Piattelli, E., McCormick, B. A., Silby, M. W., Brigham, C. J., and Bucci, V. (2017). Engineered probiotic for the inhibition of Salmonella via tetrathionate-induced production of microcin H47. ACS Infectious Diseases, 4(1), 39-45. https://doi.org/10.1021/acsinfecdis.7b00114
Weibel, N., Curcio, M., Schreiber, A., et al. (2024). Engineering a novel probiotic toolkit in Escherichia coli Nissle 1917 for sensing and mitigating gut inflammatory diseases. ACS Synthetic Biology, 13(8), 2376-2390. https://doi.org/10.1021/acssynbio.4c00036
Lynch, J. P., Goers, L., and Lesser, C. F. (2022). Emerging strategies for engineering Escherichia coli Nissle 1917-based therapeutics. Trends in Pharmacological Sciences, 43(9). https://doi.org/10.1016/j.tips.2022.02.002
Ba, F., Zhang, Y., Ji, X., Liu, W.-Q., Ling, S., and Li, J. (2023). Expanding the toolbox of probiotic Escherichia coli Nissle 1917 for synthetic biology. bioRxiv. https://doi.org/10.1101/2023.06.05.543671
Stritzker, J., Weibel, S., Hill, P. J., Oelschlaeger, T. A., Goebel, W., and Szalay, A. A. (2007). Tumor-specific colonisation, tissue distribution, and gene induction by probiotic E. coli Nissle 1917 in live mice. International Journal of Medical Microbiology, 297(3), 151-162.
Marino, S., Hogue, I. B., Ray, C. J., and Kirschner, D. E. (2008). A methodology for performing global uncertainty and sensitivity analysis in systems biology. Journal of Theoretical Biology, 254(1), 178-196.
Nowak, M. A. (2006). Evolutionary Dynamics: Exploring the Equations of Life. Harvard University Press.
Moran, P. A. P. (1958). Random processes in genetics. Mathematical Proceedings of the Cambridge Philosophical Society, 54(1), 60-71.
Egbewale, B. E., Karlsson, O., and Sudfeld, C. R. (2022). Childhood diarrhea prevalence and uptake of oral rehydration solution and zinc treatment in Nigeria. Children, 9(11), 1722.
Gayawan, E., Cameron, E., Okitika, T., Egbon, O. A., and Gething, P. (2024). A situational assessment of treatments received for childhood diarrhoea in the Federal Republic of Nigeria. PLOS ONE, 19(5), e0303963.
This document captures the full scope of our group work within the Genspace node focused on engineering the MS2 bacteriophage L protein. Group 2 formed around a shared interest in improving the toxicity, stability, and tunability of the L protein through computational design.
Our early brainstorming sessions centered on three broad goals:
Increased stability
Higher titers
Higher toxicity of the lysis protein
After several meetings and independent exploration, the group converged on two main computational directions. The first centered on systematic truncation and mutagenesis of the N-terminal regulatory domain. The second focused on point mutations within conserved regions that could alter electrostatic interactions while preserving structure.
Two major pipelines emerged from that work. John’s pipeline explored N-terminal truncations, DnaJ disruption, sequence redesign, codon optimization, and sequencing validation. Eric’s pipeline focused on charge-based mutations, conservation mapping, structural modeling, ORF overlap analysis, and cross-referencing with experimental lysis data.
Both approaches identified strong but distinct candidates for improving L protein function.
John’s Analysis and Pipeline
Summary
The MS2 lysis protein L is a 75 amino acid single-pass transmembrane protein whose N-terminal region acts as a regulatory brake on lysis. Rather than directly participating in membrane disruption, this region delays insertion and oligomerization of the transmembrane domain.
My pipeline focused on systematically removing portions of that inhibitory region while preserving the membrane-spanning lytic core. The central hypothesis was simple: if the N-terminal domain slows lysis, then partial removal should release that inhibition and produce earlier, stronger lytic activity.
The strongest candidate to emerge from the analysis was L_trunc30, which removes the first 30 amino acids while preserving the entire transmembrane domain.
Hypotheses
Partial truncations of the N-terminal region should reduce inhibition and increase lysis efficiency.
The regulatory function is probably localized to a smaller sub-region rather than spread evenly across the entire N-terminus.
There is likely an optimal truncation point where toxicity increases without destabilizing the membrane-spanning domain.
Pipeline Overview
| Stage | Tool | Purpose |
| --- | --- | --- |
| 1 | ESM2 | Mutational scanning across all 75 residues |
| 2 | ESMFold | Structural prediction of truncation variants |
| 3 | AlphaFold-Multimer | Modeling interaction with DnaJ |
| 4 | GROMACS | Molecular dynamics and RMSF analysis |
| 5 | ProteinMPNN | Junction redesign and charge reduction |
| 6 | Codon optimization | Prepare E. coli expression constructs |
| 7 | Synthetic construct design | Assemble expression cassette |
| 8 | Bowtie2 + BCFtools | Variant calling and sequencing validation |
| 9 | IGV | Manual inspection of called variants |
Major Findings
ESM2 Mutagenesis Scan
The ESM2 scan identified position C29 as the dominant mutational hotspot in the N-terminal domain.
| Mutation | LLR | Notes |
| --- | --- | --- |
| C29R | 3.64 | Top-ranked substitution |
| C29P | 3.17 | Strong helix-disrupting mutation |
| C29Q | 3.06 | Conservative but highly favored |
| F22R | 1.86 | Introduces basic charge |
| S9Q | 1.69 | Recovered independently in prior work |
C29 accounted for 12 of the top 20 substitutions. That concentration strongly suggested that the wild-type residue at this site is not ideal for maximizing toxicity outside the native viral context.
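The hotspot pattern can be made concrete by grouping the listed substitutions by position. This is a minimal sketch: the helper names are illustrative, the values are copied from the five rows shown above, and the full top-20 list from the scan is not reproduced here.

```python
import re
from collections import defaultdict

# (mutation, LLR) pairs copied from the five rows shown in the table above.
TOP_HITS = [("C29R", 3.64), ("C29P", 3.17), ("C29Q", 3.06), ("F22R", 1.86), ("S9Q", 1.69)]


def parse_mutation(label):
    """Split a label such as 'C29R' into (wild-type residue, position, mutant residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label}")
    wt, pos, mut = match.groups()
    return wt, int(pos), mut


def hits_by_position(hits):
    """Group substitutions by residue position, keeping input order within each position."""
    grouped = defaultdict(list)
    for label, llr in hits:
        _, pos, _ = parse_mutation(label)
        grouped[pos].append((label, llr))
    return dict(grouped)


grouped = hits_by_position(TOP_HITS)
# The position with the most listed substitutions is the candidate hotspot.
hotspot = max(grouped, key=lambda pos: len(grouped[pos]))
print(hotspot, [label for label, _ in grouped[hotspot]])
```

Applied to the complete ranked list, the same grouping would show how many of the top 20 substitutions fall at each position.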
Structural Findings
ESMFold predictions for all truncation variants suggested that the N-terminal domain is highly disordered in solution. Interdomain contact analysis returned essentially zero contacts across all variants, which fits with the known biology of the L protein.
The more useful signal came from molecular dynamics.
For L_trunc30:
Remaining N-terminal stub RMSF: ~1.87 nm
Transmembrane domain RMSF: ~0.27 nm
That sharp drop in flexibility confirmed that the transmembrane region remains stable even after removing 30 amino acids from the N-terminus.
Charge Analysis
The wild-type N-terminal region is strongly basic due to motifs like RRRPFK and RRQQR.
L_trunc30 reverses the overall charge profile:
| Variant | Net charge | Interpretation |
| --- | --- | --- |
| Wild-type L | Approximately +8 | Strong DnaJ interaction expected |
| L_trunc30 | -2 | Reduced DnaJ binding and earlier lysis expected |
This was important mechanistically because DnaJ binding depends heavily on electrostatic interactions with the positively charged N-terminal region.
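A rough way to see why those motifs make the N-terminus basic is to count charged residues. The sketch below assumes a simple rule (+1 for Lys/Arg, -1 for Asp/Glu, histidine and the termini ignored); that rule is an assumption and may differ from whatever method produced the net charges in the table, so it is applied only to the two motifs named above.

```python
def net_charge(sequence):
    """Crude net charge: +1 per Lys/Arg, -1 per Asp/Glu; His and termini are ignored."""
    seq = sequence.upper()
    positive = sum(seq.count(residue) for residue in "KR")
    negative = sum(seq.count(residue) for residue in "DE")
    return positive - negative


# Motifs quoted in the text above.
for motif in ("RRRPFK", "RRQQR"):
    print(motif, net_charge(motif))
```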
Codon Optimization and Construct Design
All major truncation variants were codon-optimized for E. coli K-12.
The lead construct, L_trunc30, preserved the essential LS motif and was assembled into a complete 230 bp expression cassette with:
Ptrc promoter
Optimized RBS
Lambda t0 terminator
rrnB T1 terminator
Gibson overhangs compatible with the mUAV backbone
Lead Candidate
| Candidate | Key Feature | Reason |
| --- | --- | --- |
| L_trunc30 | Removes aa 1-30 | Strongest balance of toxicity, structural stability, and DnaJ disruption |
Secondary Candidates
| Candidate | Reason for Inclusion |
| --- | --- |
| C29R | Highest ESM2 score overall |
| F22R | Adds positive charge in N-terminal region |
| S9Q | Recovered independently in previous scans |
| L_trunc40 | Most aggressive truncation, likely strongest toxicity |
Eric’s Analysis and Pipeline
Eric approached the same problem from a different angle. Instead of removing large sections of the N-terminus, he focused on identifying individual amino acid substitutions that could improve toxicity while preserving the overall structure of the protein.
His strongest candidate was P13L, a single amino acid change in the N-terminal region.
Pipeline Overview
| Stage | Tool | Purpose |
| --- | --- | --- |
| 1 | UniProt + BLAST | Sequence retrieval and homolog identification |
| 2 | Clustal Omega | Conservation mapping |
| 3 | AlphaFold-Multimer | Oligomer modeling |
| 4 | ESM2 | Mutation scoring |
| 5 | ESMFold | Structural confidence and pTM analysis |
| 6 | ChimeraX | Electrostatic visualization |
| 7 | Benchling | ORF overlap analysis |
Major Findings
Conservation Analysis
Eric identified a relatively unconstrained region between amino acids 16 and 28 that could tolerate mutation without damaging essential structure.
| Position | Wild-type residue | Interpretation |
| --- | --- | --- |
| 18 | R | Fully conserved, avoid |
| 21 | P | Fully conserved, avoid |
| 23 | K | Fully conserved, avoid |
| 26 | D | Variable, strong candidate |
| 13 | P | Weakly conserved, potentially safe |
Structural Modeling
P13L produced the strongest ESMFold result among all variants tested.
| Variant | pTM | Change vs WT |
| --- | --- | --- |
| Wild-type | 0.273 | Reference |
| D26R | 0.267 | Slight decrease |
| P13L | 0.420 | Strong increase |
The jump from 0.273 to 0.420 made P13L the most structurally favorable point mutation in Eric’s pipeline.
Experimental Cross-Reference
Unlike my pipeline, Eric cross-referenced computational candidates with available lysis data.
| Mutation | Replicate A | Replicate B | Result |
| --- | --- | --- | --- |
| P13L | 1 | 1 | Confirmed lytic |
| D26G | 1 | 0 | Mixed |
| K23E | 1 | 0 | Mixed |
| E25G | 1 | 0 | Mixed |
P13L was the only candidate to remain consistently positive across both replicates.
ORF Overlap Analysis
One of the more interesting parts of Eric’s work was the DNA-level overlap analysis.
P13L falls within the overlap region between the coat protein and the L protein, which initially made it look risky. After codon-level analysis, though, the mutation turned out to be safe.
| Gene | WT codon | Mutant codon | Result |
| --- | --- | --- | --- |
| L protein | CCG | CTG | Pro → Leu |
| Coat protein | TCC | TCT | Ser → Ser |
That synonymous change in the coat protein meant the mutation could proceed without disrupting the overlapping reading frame.
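The codon-level check can be reproduced in a few lines. This is a sketch under one stated assumption: the four-base window `TCCG` is inferred from the two wild-type codons in the table (coat `TCC`, L `CCG`) overlapping by two bases. The surrounding genomic sequence is not shown in the source, and the function name and frame offsets are illustrative.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def point_mutation_effects(window, index, new_base, frame_offsets):
    """Translate each gene's codon before and after a single-base change.

    window        -- short DNA string covering the overlapping codons
    index         -- 0-based position of the mutated base within the window
    frame_offsets -- {gene name: start of that gene's codon within the window}
    """
    mutated = window[:index] + new_base + window[index + 1:]
    effects = {}
    for gene, start in frame_offsets.items():
        wt_codon = window[start:start + 3]
        mut_codon = mutated[start:start + 3]
        effects[gene] = (wt_codon, mut_codon, CODON_TABLE[wt_codon], CODON_TABLE[mut_codon])
    return effects


# Assumed window: coat codon at offset 0, L codon at offset 1, C>T change at index 2.
effects = point_mutation_effects("TCCG", index=2, new_base="T",
                                 frame_offsets={"coat": 0, "L": 1})
for gene, (wt, mut, aa_wt, aa_mut) in effects.items():
    kind = "synonymous" if aa_wt == aa_mut else "missense"
    print(f"{gene}: {wt}>{mut} {aa_wt}>{aa_mut} ({kind})")
```

The same function can screen any other candidate in the overlap region by changing the window, index, and base.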
Lead Candidate
| Candidate | Key Feature | Reason |
| --- | --- | --- |
| P13L | Single amino acid substitution | Best structural score and strongest experimental support |
Secondary Candidates
| Candidate | Status |
| --- | --- |
| D26R | Untested but promising |
| D26G | Mixed experimental results |
| N17R | Open candidate |
| H24R | Open candidate |
Albert’s Notes
Albert focused primarily on structural stability.
His workflow emphasized:
Sequence retrieval from UniProt
BLAST and Clustal Omega for conservation mapping
ESM2 mutational scanning
ESMFold structure prediction
AlphaFold-Multimer confirmation of DnaJ interactions
Wet lab validation of top-ranked variants
His key concern was preserving structure while introducing beneficial mutations.
He also pointed out an important limitation that kept showing up across the project: membrane proteins are underrepresented in both structural databases and protein language model training sets. That means even high-scoring mutations should still be interpreted cautiously.
Tehseen’s Notes
Tehseen’s approach aligned closely with my truncation-based strategy but focused more on identifying the smallest regulatory segment required for precise control over lysis timing.
The central idea was not simply to remove the N-terminal region, but to identify exactly which residues are responsible for slowing lysis.
That led to three closely related hypotheses:
Partial truncations can increase lysis gradually rather than all at once.
Regulatory effects are probably localized to a smaller sub-region.
There is likely an optimal balance point between stronger toxicity and preserved protein stability.
Comparative Summary
| Aspect | John’s Pipeline | Eric’s Pipeline |
| --- | --- | --- |
| Main strategy | Progressive N-terminal truncation | Point mutation design |
| Lead candidate | L_trunc30 | P13L |
| Core hypothesis | Remove inhibitory domain | Increase local electrostatic effects |
| ESM2 scope | Full 1,425-substitution scan | Single-site targeted analysis |
| Structural analysis | ESMFold + GROMACS RMSF | ESMFold + ChimeraX |
| DnaJ interaction | Central to model | Considered indirectly |
| Experimental validation | Not yet completed | P13L confirmed experimentally |
| Construct design | Fully assembled | Still planned |
| Sequencing workflow | Fully designed with Bowtie2, BCFtools, IGV | Listed as future step |
Final Interpretation
The project ended up producing two very different but complementary engineering directions.
L_trunc30 represents the stronger systems-level redesign. It removes the inhibitory N-terminal region, reduces DnaJ engagement, preserves the transmembrane core, and provides a fully buildable expression construct ready for synthesis and sequencing validation.
P13L represents the cleaner minimal-change strategy. It preserves the full-length protein, improves structural confidence, survives ORF overlap analysis, and already has positive experimental support.
If the goal is maximum disruption of the native regulatory system, L_trunc30 is the stronger candidate.
If the goal is a simpler mutation with lower engineering risk and existing wet lab support, P13L is the better starting point.
The most practical next step would be to synthesize and compare both side by side.
Weeks
Week 1
Class Assignment — Week 1
1) Biological Engineering Application
I aim to develop a computational and experimental platform for engineering metabolically constrained microbial systems designed for responsible real-world use. Inspired by clinical exposure to preventable infectious disease and my research at the intersection of microbiology and computational biology, the platform integrates genomic design rules, programmed auxotrophies, and environmental sensing circuits that couple microbial survival to defined ecological contexts.
The central principle is ecological boundedness. Survival and function are conditional, not assumed. Outside intended environments, persistence becomes biologically untenable. This approach supports applications ranging from gut-targeted probiotics to agricultural symbionts and environmental remediation strains.
Rather than optimizing microbes solely for performance, I want to encode responsibility at the level of metabolism. The goal is to expand synthetic biology into high-need contexts while ensuring that safety, containment, and contextual awareness are intrinsic design features, not external corrections imposed after deployment.
2) Governance and Policy Goals
My overarching governance goal is to embed non-malfeasance directly into biological architecture rather than relying exclusively on downstream regulation.
First, intrinsic containment standards should become normative. This includes requiring conditional survival mechanisms such as auxotrophies or environmental dependency circuits prior to field deployment, alongside independent validation of escape potential and evolutionary stability.
Second, dual-use mitigation must be integrated into design pipelines. Sequence screening, risk-tiered access controls, and transparent but bounded documentation standards can reduce misuse without stifling legitimate research.
Third, equity should shape access and deployment. Safety-audited open frameworks should remain available to researchers in low-resource settings, and deployment priorities should align with public health and ecological need rather than purely commercial incentives.
Together, these goals move governance upstream. Ethical alignment becomes encoded in design logic, enabling innovation that is both socially responsive and technically responsible.
3) Governance Actions
Option 1 — Conditional Deployment Requirement
Purpose: Shift from voluntary containment to mandatory intrinsic safeguards for field-deployable microbes. Design: Regulators require documented metabolic constraints and third-party validation before approval. Academic labs and companies must comply. Assumptions: Safeguards remain evolutionarily stable and measurable. Risks: Overregulation may slow beneficial innovation; success may create complacency about residual risk.
Option 2
Purpose: Embed sequence screening and risk assessment into computational design tools. Design: Tool developers, funders, and journals require automated biosecurity checks as part of research workflows. Assumptions: Screening algorithms remain adaptive to emerging threats. Risks: False positives could burden researchers; sophisticated actors might bypass systems.
Option 3 — Incentivized Safety Certification
Purpose: Encourage responsible innovation through market and funding incentives. Design: Grant agencies and industry consortia prioritize projects meeting certified intrinsic-containment standards. Assumptions: Financial incentives shape behavior effectively. Risks: Certification may become symbolic rather than substantive if poorly enforced.
4) Scoring Governance Actions
| Criteria | Option 1 | Option 2 | Option 3 |
| --- | --- | --- | --- |
| Enhance Biosecurity (prevent incidents) | 1 | 1 | 2 |
| Enhance Biosecurity (respond) | 2 | 2 | 2 |
| Foster Lab Safety (prevent) | 1 | 2 | 2 |
| Protect Environment (prevent) | 1 | 2 | 2 |
| Minimize Burden | 3 | 2 | 1 |
| Feasibility | 2 | 1 | 1 |
| Not Impede Research | 3 | 1 | 1 |
| Promote Constructive Applications | 1 | 1 | 1 |
1 indicates strongest alignment.
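One simple way to read the scores is to total them per option, with lower totals indicating stronger overall alignment. The sketch below assumes equal weighting across criteria, which is an assumption rather than part of the scoring scheme; the scores are copied from the table above.

```python
# Scores per criterion for (Option 1, Option 2, Option 3), copied from the table above.
SCORES = {
    "Enhance Biosecurity (prevent incidents)": (1, 1, 2),
    "Enhance Biosecurity (respond)": (2, 2, 2),
    "Foster Lab Safety (prevent)": (1, 2, 2),
    "Protect Environment (prevent)": (1, 2, 2),
    "Minimize Burden": (3, 2, 1),
    "Feasibility": (2, 1, 1),
    "Not Impede Research": (3, 1, 1),
    "Promote Constructive Applications": (1, 1, 1),
}


def option_totals(scores):
    """Sum the unweighted scores for each option (lower total = stronger alignment)."""
    totals = [0, 0, 0]
    for row in scores.values():
        for index, value in enumerate(row):
            totals[index] += value
    return {f"Option {index + 1}": total for index, total in enumerate(totals)}


totals = option_totals(SCORES)
print(totals)
```

Under equal weighting, Options 2 and 3 tie at 12 and Option 1 totals 14. Option 1 leads on the prevention criteria but scores worst on burden and research impact.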
5) Prioritization and Trade-offs
I would prioritize a combination of Option 2 and Option 3. Embedding screening directly into computational design tools makes safety habitual rather than exceptional, while incentive structures reinforce responsible norms without heavy-handed regulation.
Option 1 is powerful but risks slowing innovation in resource-constrained contexts where deployment urgency is high. My recommendation would target national research funders and international synthetic biology consortia, encouraging coordinated standards that scale globally.
Trade-offs include balancing speed with precaution and avoiding regulatory inequities that disadvantage researchers in low-income settings. Uncertainties remain regarding evolutionary stability of safeguards and adaptability of screening systems.
The central ethical concern that emerged for me is the illusion of control. Engineering containment does not eliminate uncertainty. Governance must remain adaptive, transparent, and humble, recognizing that biological systems are dynamic. Embedding responsibility into design is necessary, but continuous oversight and global dialogue remain essential.
Class Assignment — Week 2 Preparation
1) Essential Amino Acids and the Lysine Contingency
The ten essential amino acids in animals are histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, valine, and arginine (essential in growing animals). Animals cannot synthesize these; survival depends on dietary supply.
This reframes the Lysine Contingency for me. It is not merely a clever containment device. Engineering microbes that require lysine creates a metabolic dependency aligned with a biological universal. Because animals cannot produce lysine, ecological persistence becomes tightly coupled to controlled supplementation. Survival becomes conditional, not autonomous.
I now see it less as a biosafety patch and more as a governance-embedded metabolic contract. The dependency encodes authority into biochemistry. Control is not enforced externally; it is written into the organism’s survival logic. That shift moves containment from policy language into molecular architecture.
2) Suggested Code for AA:AA Interactions
From the genetic code logic shown, base pairs have symmetry rules. Amino acids need something analogous. I would propose a layered interaction code:
First layer: chemical class (polar, nonpolar, charged, aromatic). Second layer: interaction type (hydrophobic packing, hydrogen bonding, ionic pairing, pi stacking). Third layer: geometry constraint (distance and orientation tolerance).
For example, NP-HYD-G1 could denote nonpolar hydrophobic packing within a defined geometric band. CH-ION-G2 could represent oppositely charged ionic interaction with specific spacing tolerance.
Such a code treats protein structure not as artistic folding but as readable and writable interaction grammar. If we can read polymers, we should also encode their interaction rules explicitly. That shift makes protein design less descriptive and more programmable.
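To show what a writable interaction grammar could look like, here is a minimal parser for codes of the form class-interaction-geometry. The abbreviations `NP`, `CH`, `HYD`, and `ION` come from the examples above; `PO`, `AR`, `HB`, and `PI` are hypothetical placeholders for the remaining classes and interaction types, and the geometry band is kept as a bare integer because its tolerances are not defined here.

```python
from dataclasses import dataclass

# NP and CH appear in the examples above; PO and AR are hypothetical abbreviations.
CHEMICAL_CLASSES = {"NP": "nonpolar", "CH": "charged", "PO": "polar", "AR": "aromatic"}
# HYD and ION appear in the examples above; HB and PI are hypothetical abbreviations.
INTERACTIONS = {
    "HYD": "hydrophobic packing",
    "ION": "ionic pairing",
    "HB": "hydrogen bonding",
    "PI": "pi stacking",
}


@dataclass(frozen=True)
class InteractionCode:
    chemical_class: str
    interaction: str
    geometry_band: int


def parse_code(code):
    """Parse a three-layer code such as 'NP-HYD-G1' into its named layers."""
    parts = code.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three layers, got: {code}")
    cls, interaction, geometry = parts
    if cls not in CHEMICAL_CLASSES or interaction not in INTERACTIONS:
        raise ValueError(f"unknown class or interaction in: {code}")
    if not (geometry.startswith("G") and geometry[1:].isdigit()):
        raise ValueError(f"geometry layer must look like G1, G2, ...: {code}")
    return InteractionCode(CHEMICAL_CLASSES[cls], INTERACTIONS[interaction], int(geometry[1:]))


print(parse_code("NP-HYD-G1"))
print(parse_code("CH-ION-G2"))
```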
3) Ethical Reflections
Biological systems do not respect borders. Political, institutional, even disciplinary lines dissolve in ecology. Framing safety as compliance feels incomplete because evolution does not comply. Good intentions are structurally irrelevant to selection pressures.
Governance must therefore treat evolution as a first-class design constraint. Safeguards must assume mutation, drift, and ecological leakage. Ethical assumptions should be embedded in design architectures, not appended through oversight committees.
I am increasingly drawn to resilience-based governance. Instead of trusting actors, we engineer systems that remain bounded even under failure. The goal is not perfect control but constrained adaptability. In living systems, humility is ethical. Governance must anticipate dynamics, not merely regulate behavior.
Key Takeaways
Evolution is not theoretical. Population genetics, mutation rates, and selection coefficients are active in every gut. Any safeguard must assume adaptation under pressure.
Biology is programmable matter. DNA is a chemically precise information system. If we can write sequence, responsibility must be encoded at that same molecular layer.
Genetic recoding reshapes constraints. Codon reassignment and translational control can structurally limit horizontal gene transfer.
Design capacity is accelerating. Sequencing and synthesis technologies now scale faster than the institutions meant to guide them.
Design obeys physics. Protein folding, metabolic flux, and regulatory circuits follow thermodynamics and kinetics. Only systems stable under stress earn trust.
AI Prompts Employed
Help me design a scientific but warm homepage visual, iterate fast, and fix what breaks
Help me turn this from a messy course site into a coherent research story
Help me debug under deadline without losing momentum
Help me sound credible, grounded, and original — not speculative or sloppy
Make contact details easy to find me without making it cringe
Week 2
Class Assignment — Week 2
Part 1 — Sequence Retrieval and Design Workflow
1) Sequence Retrieval and Benchling Initialization
The process began with obtaining a Lambda GenBank file from New England Biolabs. After confirming the correct format, I imported the file into Benchling as a DNA sequence. Care was taken to ensure that the file was not mistakenly uploaded as RNA and that annotations displayed properly within the platform.
This step established a stable working environment before any design modifications were introduced. Confirming correct topology and annotation structure prevented downstream formatting or visualization issues.
2) Genomic Exploration and Annotation Familiarization
Once imported, I explored the annotated regions of the Lambda genome within Benchling. This involved confirming gene orientation, identifying labeled regions, and understanding the graphical interface for both linear and circular visualization.
Although exploratory, this step reinforced familiarity with the design environment. It ensured that I could distinguish between expected gene clusters and annotation artifacts, and that I could confidently navigate the interface for subsequent editing.
3) Protein Selection and Sequence Acquisition
I selected Microcin M as the protein of interest. The choice aligned with my project, ÌṢỌ, which focuses on context-sensitive antimicrobial response within the gut ecosystem.
The selection criteria included:
Narrow-spectrum antimicrobial activity
Relevance to microbial competition
Compatibility with a governed probiotic chassis
The amino acid sequence was retrieved in FASTA format from a reliable database (NCBI GenBank: CAE55705.1). I verified the header structure and ensured that the sequence corresponded exactly to the intended protein.
4) Reverse Translation
Using Benchling’s reverse translation functionality, I converted the amino acid sequence into a nucleotide sequence suitable for expression in Escherichia coli.
Key considerations included:
Maintaining correct reading frame
Ensuring inclusion of a start codon
Confirming appropriate stop codon placement
Selecting E. coli codon usage
The output DNA sequence was checked to ensure it translated back to the original protein sequence without truncation or frame shift.
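The back-translation check can also be scripted outside Benchling. The sketch below is illustrative only: it picks the first codon in table order for each amino acid rather than E. coli-preferred codons, the peptide `MKTL` is a toy example and not the Microcin M sequence, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Placeholder codon choice: first codon in table order, NOT a host-optimised table.
FIRST_CODON = {}
for codon, amino_acid in CODON_TABLE.items():
    FIRST_CODON.setdefault(amino_acid, codon)


def reverse_translate(protein):
    """Protein -> DNA using the placeholder codons, with one stop codon appended."""
    return "".join(FIRST_CODON[aa] for aa in protein) + FIRST_CODON["*"]


def translate(dna):
    """Translate complete codons from the first base; trailing partial codons are dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def round_trip_ok(protein, dna):
    """Check frame, start codon, a single terminal stop, and an exact protein match."""
    if len(dna) % 3 != 0 or not dna.startswith("ATG"):
        return False
    translated = translate(dna)
    return translated.endswith("*") and translated[:-1] == protein


toy_protein = "MKTL"  # toy peptide for illustration
toy_dna = reverse_translate(toy_protein)
print(toy_dna, round_trip_ok(toy_protein, toy_dna))
```

Deleting a single base from `toy_dna` makes `round_trip_ok` return `False`, which is the frame-shift case the manual check guards against.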
5) Codon Optimization
Following reverse translation, codon optimization was performed for expression in E. coli. This step aimed to improve translational efficiency while minimizing expression burden and avoiding rare codons.
Optimization included:
Aligning codon usage with host bias
Avoiding problematic restriction sites
Preserving protein sequence integrity
This stage reinforced that codon choice influences not only protein yield but also metabolic load and evolutionary stability.
Part 2 — Construct Assembly and Validation
6) Expression Cassette Assembly
The optimized coding sequence was integrated into a complete expression cassette using the assignment’s structural framework:
Promoter → Ribosome Binding Site → Start Codon → Codon-Optimized CDS → Optional His Tag → Stop Codon → Terminator
Each component was manually inserted and annotated within Benchling. Particular care was taken to ensure that the coding region replaced the example scaffold sequence rather than being appended to it.
Linear and circular map views were used to confirm structural continuity, annotation accuracy, and absence of unintended sequence artifacts.
7) Virtual Digest and Gel Simulation
To validate construct integrity, I performed a virtual digest within Benchling and obtained predicted fragment sizes. These fragment sizes were then visualized using an external gel simulation tool.
This step confirmed that the construct behaved as expected under restriction enzyme analysis and reinforced my understanding of plasmid verification workflows.
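The fragment sizes from a virtual digest can be cross-checked with a short script. This is a sketch under stated assumptions: EcoRI and HindIII are used only because they are the enzymes named for cloning elsewhere in this write-up (the enzymes used in the digest itself are not stated), the toy sequence is hypothetical, and sites that span the origin of a circular sequence are not detected.

```python
# Recognition site and cut offset on the top strand: EcoRI G^AATTC, HindIII A^AGCTT.
ENZYMES = {"EcoRI": ("GAATTC", 1), "HindIII": ("AAGCTT", 1)}


def cut_positions(seq, enzymes):
    """Return sorted top-strand cut positions for the named enzymes."""
    seq = seq.upper()
    cuts = set()
    for name in enzymes:
        site, offset = ENZYMES[name]
        start = seq.find(site)
        while start != -1:
            cuts.add(start + offset)
            start = seq.find(site, start + 1)
    return sorted(cuts)


def fragment_sizes(seq, enzymes, circular=False):
    """Predicted fragment lengths, largest first (origin-spanning sites are ignored)."""
    cuts = cut_positions(seq, enzymes)
    length = len(seq)
    if not cuts:
        return [length]
    if circular:
        sizes = [b - a for a, b in zip(cuts, cuts[1:])]
        sizes.append(length - cuts[-1] + cuts[0])  # fragment that wraps past the origin
    else:
        bounds = [0] + cuts + [length]
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted(sizes, reverse=True)


# Hypothetical 50 bp toy sequence with one EcoRI and one HindIII site.
toy = "A" * 10 + "GAATTC" + "C" * 20 + "AAGCTT" + "G" * 8
print(fragment_sizes(toy, ["EcoRI", "HindIII"]))
print(fragment_sizes(toy, ["EcoRI", "HindIII"], circular=True))
```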
8) FASTA Export and Synthesis Preparation
The completed expression cassette was exported in FASTA format for potential synthesis ordering. Care was taken to ensure:
Correct header formatting beginning with the greater-than symbol
No extraneous spaces or formatting characters
Proper file extension
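The first two checks in this list are easy to automate before uploading a file. The following is a minimal sketch, assuming a DNA alphabet of A, C, G, T, and N; the function name is illustrative and the file-extension check is left out.

```python
def fasta_problems(text, alphabet="ACGTN"):
    """Return a list of formatting problems; an empty list means the checks passed."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        problems.append("line 1: expected a header beginning with '>'")
    elif not lines[0][1:].strip():
        problems.append("line 1: header has no identifier")
    allowed = set(alphabet)
    for number, line in enumerate(lines[1:], start=2):
        if line.startswith(">"):
            continue  # header of a further record
        if not line:
            problems.append(f"line {number}: blank line")
        elif not set(line.upper()) <= allowed:
            # Spaces, tabs, and any other stray characters land here.
            problems.append(f"line {number}: characters outside {alphabet}")
    return problems


print(fasta_problems(">cassette\nATGAAATAA\n"))
print(fasta_problems(">cassette\nATG AAA\n"))
```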
Although synthesis ordering through Twist was initiated, access limitations prevented full completion. Instead of halting progress, I pivoted toward generating a complete plasmid visualization within Benchling.
9) Plasmid Map Generation
To simulate a complete plasmid construct, the sequence topology was converted to circular within Benchling. Circular map visualization confirmed clear annotation of promoter, ribosome binding site, coding sequence, and terminator.
This produced a plasmid map without requiring external synthesis confirmation. The visualization ensured structural coherence and clear representation of the engineered construct.
Technical Milestones Achieved
Successful import and annotation of GenBank files
Accurate reverse translation from protein to DNA
Codon optimization aligned with host expression
Proper construction of an annotated expression cassette
Verified FASTA export formatting
Simulated plasmid visualization in circular topology
Integration of molecular workflow with ecological design philosophy
Backbone Vector Documentation
The Microcin M expression cassette was designed for cloning into pUC19, a high-copy ColE1-origin plasmid carrying ampicillin resistance. pUC19 was selected primarily for its well-characterised cloning sites and broad compatibility with standard E. coli transformation protocols, both practical considerations given that the immediate goal is sequence verification rather than stable expression. The insert is flanked by EcoRI and HindIII sites for directional cloning into the multiple cloning site. The complete annotated construct is deposited in the class Benchling folder as MccH47_pUC19_EcN_construct.
For downstream ÌṢỌ deployment, the cassette would need migration to a lower-copy backbone — pSC101 or a chromosomal integration vector — to reduce metabolic burden on the EcN chassis and improve evolutionary stability under selection.
Throughout the experience, I maintained alignment with the core principles of ÌṢỌ:
Fitness cost is a primary design variable
Selection operates continuously
Expression burden affects evolutionary stability
Containment must be intrinsic to architecture
Models inform design boundaries
This reframed the work for me, from a cloning exercise into a constraint-aware engineering process.
Process Reflections
The workflow required iterative verification at each stage. Formatting, reading frame integrity, codon usage, annotation accuracy, and topology conversion each presented potential points of error, and addressing them incrementally reduced compounding mistakes.
More importantly, it reinforced that biological engineering is not simply about inserting genes. It requires contextual awareness, ecological humility, and structural foresight.
Sequence design is only the beginning. Stability under pressure determines whether a system is viable outside controlled conditions.
This process strengthened both my technical fluency and design discipline, linking molecular implementation to ecological responsibility.
Week 3
Class Assignment — Week 3
1) Opentrons Artwork
2) Published Papers Utilizing Automation
LabscriptAI — Autonomous Liquid-Handling Robotics Scripting
Gao et al., 2025 introduce LabscriptAI, a multi-agent framework that translates natural language experimental descriptions into validated Python scripts for heterogeneous liquid-handling robots, including Opentrons platforms.
The system integrates:
Hierarchical task planning
Platform-specific simulation validation
A precise refactoring engine for targeted debugging
Domain-specific knowledge retrieval
Human-in-the-loop safety checkpoints
Experimental validation included:
Cross-platform fluorescence calibration
Automated cell-free expression and screening of 298 GFP variants
The central contribution is not pipetting precision alone. It is structured experimental execution with embedded validation and safety logic. Automation becomes reproducible, cross-platform, and governable.
Active Learning Directed Evolution (ALDE)
Yang et al., 2025 introduce Active Learning Directed Evolution (ALDE), which integrates machine learning uncertainty estimation with iterative experimental screening to guide protein engineering efficiently.
ALDE automates experimental decision-making by:
Training predictive sequence–function models
Quantifying uncertainty across unexplored sequence space
Selecting optimal next-round variants
Iteratively refining search trajectories
Rather than brute-force screening, ALDE navigates design space intelligently, minimizing experimental waste while maximizing functional discovery.
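The selection step can be illustrated with a toy acquisition rule. The sketch below uses an upper-confidence-bound score (ensemble mean plus a weighted standard deviation), which is one common choice and is assumed here for illustration; it is not claimed to be the acquisition function used in the ALDE paper, and the variant names and predictions are made up.

```python
from statistics import mean, pstdev


def select_batch(ensemble_predictions, batch_size, exploration_weight=1.0):
    """Rank variants by mean + weight * std across ensemble members; return the top names."""
    scores = {}
    for variant, predictions in ensemble_predictions.items():
        scores[variant] = mean(predictions) + exploration_weight * pstdev(predictions)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:batch_size]


# Hypothetical predictions from a three-model ensemble.
predictions = {
    "variant_A": [0.80, 0.82, 0.81],  # high predicted fitness, models agree
    "variant_B": [0.20, 0.90, 0.55],  # lower mean, models disagree strongly
    "variant_C": [0.30, 0.31, 0.29],  # low predicted fitness, models agree
}

print(select_batch(predictions, batch_size=2))
print(select_batch(predictions, batch_size=1, exploration_weight=0.0))
```

With the exploration weight at 1.0 the uncertain variant is ranked first; setting it to 0.0 reduces the rule to picking the highest predicted mean.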
Together, these systems represent complementary layers:
ÌṢỌ is a fitness-aware engineered probiotic system designed to sense gut context, produce targeted antimicrobial responses, and remain bounded through intrinsic containment.
Automation enables a structured Design–Build–Test–Learn loop.
A) Combinatorial Genetic Circuit Screening (requires automation)
Objective: Evaluate sensor–effector variants under growth constraints.
Automated workflow:
Dispense transformation master mix into 96-well plate
Add plasmid constructs into defined coordinates
Perform serial dilution plating
Inoculate colonies into induction gradient
Measure OD600 for growth
Measure fluorescence for reporter output
Normalize fluorescence by growth to assess fitness-aware performance
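The final normalization step can be written as a small helper. This is a sketch under assumptions: blank subtraction and the minimum-OD cutoff are common practice but are not specified in the workflow above, and the parameter names and example readings are hypothetical.

```python
def normalized_fluorescence(fluorescence, od600, blank_fluorescence=0.0,
                            blank_od600=0.0, min_od=0.05):
    """Blank-subtract both readings and return fluorescence per unit OD600.

    Returns None when the corrected OD600 is below min_od, since dividing by a
    near-zero density would inflate the value for wells that barely grew.
    """
    corrected_od = od600 - blank_od600
    if corrected_od < min_od:
        return None
    return (fluorescence - blank_fluorescence) / corrected_od


# Hypothetical plate-reader values for two wells.
print(normalized_fluorescence(1200.0, 0.50, blank_fluorescence=200.0, blank_od600=0.10))
print(normalized_fluorescence(300.0, 0.12, blank_fluorescence=200.0, blank_od600=0.10))
```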
Together, they form a closed-loop, governable engineering system that prioritizes stability under ecological pressure rather than maximal output under ideal conditions.
Works Cited
Yang, J., Lal, R. G., Bowden, J. C., et al. (2025). Active learning-assisted directed evolution. Nature Communications, 16, 714. https://doi.org/10.1038/s41467-025-55987-8
Gao, Y., Luo, Y., Li, W., Lan, Y., Jiang, H., Chen, Y., Yi, X., Li, B., Alinejad-Rokny, H., Wang, T., Fu, L., Yang, M., & Si, T. (2025). Autonomous liquid-handling robotics scripting for accessible and responsible protein engineering. bioRxiv. https://doi.org/10.1101/2025.09.30.679666
Proposed Final Project Ideas
Process Reflections
This week shifted my understanding of automation from technical convenience to systems architecture.
Initially, I approached the assignment by identifying a strong automation framework in LabscriptAI. However, as I explored complementary tools such as ALDE, it became clear that robotic precision alone is insufficient. Scalable biological engineering requires structured exploration, specifically uncertainty-aware active learning to navigate sequence and design space intelligently.
The key insight was recognizing that automation operates on two layers:
Cognitive layer deciding what experiment to run next
Execution layer safely and reproducibly running it
By combining both, my thinking moved beyond pipetting workflows toward a closed-loop, governable Design–Build–Test–Learn system. This reframing aligns directly with ÌṢỌ, which requires ecological realism, fitness awareness, and safety constraints.
Another important shift was recognizing the role of governance. Automation increases capability, but without structured safety checkpoints, biosecurity screening, and human oversight, it becomes fragile or irresponsible. Designing the automation architecture required explicit consideration of containment, ecological competition, and reproducibility.
This process strengthened three core skills:
Systems-level integration rather than tool-level selection
Designing for constraint rather than brute-force optimization
Framing automation as a platform rather than a procedure
Ultimately, I realized that my final project is not only an engineered probiotic. It is a structured, uncertainty-aware engineering pipeline for responsible biological deployment.
AI Prompts Employed
Compare ALDE and LabscriptAI to see if they work well together as a system
Design a closed-loop setup where AI chooses experiments and robots run them
List what I would automate for ÌṢỌ (Sentinel EcN)
Draft simple Opentrons-style pseudocode for running reactions
Integrate 3D printed tools, cloud labs, and governance into the automation workflow
Week 4
Class Assignment — Week 4
Part A. Conceptual Questions
1) How many molecules of amino acids do you take with a piece of 500 grams of meat?
Assumptions: lean meat is ~20% protein by mass, average amino acid residue ~100 Da (≈100 g/mol).
Step 1: Protein mass in 500 g meat 500 g × 0.20 = 100 g protein
Step 2: Convert to moles of amino acid residues 100 g ÷ (100 g/mol) = 1 mole
Answer: approximately 6.0 × 10²³ amino acid molecules (about 600 sextillion). This is Avogadro’s number, that is, one mole of amino acid residues.
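The arithmetic above can be written as a short sketch. The function name and default arguments are my own; the 20% protein fraction and 100 g/mol residue mass are the same rough assumptions stated in the answer.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI definition)


def amino_acid_count(meat_g, protein_fraction=0.20, residue_g_per_mol=100.0):
    """Estimate the number of amino acid residues in a portion of meat.

    protein_fraction and residue_g_per_mol are rough assumptions, not measurements.
    """
    protein_g = meat_g * protein_fraction   # Step 1: protein mass
    moles = protein_g / residue_g_per_mol   # Step 2: moles of residues
    return moles * AVOGADRO                 # Step 3: residues as a count


print(f"{amino_acid_count(500):.2e}")  # 6.02e+23
```

Changing either assumption scales the answer linearly, so a fattier cut at 15% protein would give about three quarters of this number.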
2) Why do humans eat beef but do not become a cow, eat fish but do not become fish?
Because eating provides raw materials, not biological identity. Digestion breaks proteins, fats, and nucleic acids into small molecules such as amino acids and fatty acids. By the time nutrients enter the bloodstream, they are no longer “cow” or “fish,” they are shared chemical building blocks used by all life.
What determines what we become is our genome and regulatory systems. Human cells assemble human proteins because human DNA encodes the instructions. Food is like construction material. The same bricks can build different structures depending on the blueprint.
3) Why are there only 20 natural amino acids?
The “20” is an evolutionary, chemical, and informational compromise. The standard amino acids provide enough chemical diversity for folding, catalysis, and signaling while keeping translation machinery stable and error-tolerant. Expanding beyond this set would require major coordinated changes to tRNAs, aminoacyl-tRNA synthetases, and ribosomes, which could be evolutionarily costly.
Also, the genetic code has 64 codons, which comfortably encodes 20 amino acids plus stop signals. The system stabilized around a set that is chemically sufficient and operationally efficient.
Notably, the set is not absolutely fixed. Biology also uses selenocysteine and pyrrolysine via specialized mechanisms, and synthetic biology can incorporate many noncanonical amino acids in engineered systems.
4) Can you make other non-natural amino acids? Design some new amino acids.
Yes. Chemists and synthetic biologists have created many noncanonical amino acids. Conceptually, you keep the standard amino acid backbone and alter the side chain to introduce new properties. Below are conceptual designs (structural ideas, not synthesis instructions):
Fluoro-leucine variant Replace a leucine side-chain hydrogen with fluorine to increase stability and hydrophobicity.
Photo-switch amino acid Add a light-responsive group (azobenzene-like) that changes shape under light, enabling reversible control of protein behavior.
Metal-binding amino acid Design a side chain with a strong chelating motif to coordinate metals more tightly than histidine, enabling engineered metalloenzymes.
Redox-active amino acid A side chain designed for reversible electron transfer beyond cysteine/tyrosine chemistry, expanding redox options.
Bulky steric-block amino acid A large aromatic side chain that can restrict folding paths or block active sites to tune structure and function.
Synthetic polar-gradient amino acid A side chain with donor/acceptor geometry not present in the canonical set to enable new hydrogen-bonding patterns.
Practical considerations for synthetic possibility include recognition by synthetases, ribosomal fit, folding effects, toxicity, and translational fidelity.
5) Where did amino acids come from before enzymes and before life started?
Amino acids can arise through prebiotic chemistry. Three common sources are:
Atmospheric chemistry: Early Earth gases plus energy (lightning, UV, heat) can generate amino acids (supported by classic Miller–Urey-type results).
Hydrothermal vents: Mineral surfaces, heat, and gradients can promote organic synthesis and concentration of building blocks.
Extraterrestrial delivery: Meteorites such as Murchison contain amino acids, showing formation can occur beyond Earth and be delivered.
Life later evolved enzymes to produce amino acids more efficiently and selectively.
6) If you make an α-helix using D-amino acids, what handedness would you expect?
A polypeptide made of D-amino acids would form a left-handed α-helix. Natural α-helices are right-handed because proteins use L-amino acids; mirroring chirality mirrors the preferred helix.
7) Can you discover additional helices in proteins?
Within natural peptide chemistry, backbone geometry is constrained by peptide bond planarity, allowed φ/ψ angles, and hydrogen bonding rules. However, we can still expand what we call “helical forms” in practice by:
identifying less common helical geometries in known proteins
designing novel helices computationally
engineering sequences that stabilize alternative helix types under specific conditions
So “new helices” are often new realizations within physical constraints rather than completely new backbone physics.
8) Why are most molecular helices right-handed?
Because biological polymers are built from chiral monomers that life selected early. L-amino acids favor right-handed α-helices; D-sugars in DNA favor right-handed B-DNA. Once one chirality dominated, evolution locked in downstream structural preferences across biology.
9) Why do β-sheets tend to aggregate? What is the driving force?
β-sheets aggregate because their edges expose backbone hydrogen bond donors and acceptors that can be satisfied by forming intermolecular hydrogen bonds. Aggregation is further stabilized by:
Backbone hydrogen bonding networks across molecules
Hydrophobic packing as β-strands often present with alternating polar/hydrophobic patterns
Planar stacking geometry enabling tight van der Waals packing
These same stabilizing forces underlie amyloid formation when misregulated.
Part B. Protein Analysis and Visualization
1) My Selected Protein And Why
I initially selected Microcin M (MccM) because it aligns directly with my project ÌṢỌ (Sentinel EcN), which focuses on context-sensitive antimicrobial response within the gut ecosystem. My selection criteria were:
narrow-spectrum antimicrobial activity
relevance to microbial competition in the gut
compatibility with a governable probiotic chassis
The sequence was retrieved in FASTA format from NCBI GenBank (CAM8152351.1) and checked to ensure the header and sequence matched the intended protein.
Most frequent amino acids: Serine (S) and Glycine (G), both occurring 12 times
Homologs (UniProt BLAST): ~100 protein sequence homologs
Protein family: Microcin (Class II) antimicrobial peptide family
Amino acid frequencies
Amino acid | Count | Percent
S | 12 | 16.44%
G | 12 | 16.44%
I | 8 | 10.96%
A | 7 | 9.59%
L | 4 | 5.48%
T | 4 | 5.48%
K | 3 | 4.11%
E | 3 | 4.11%
D | 3 | 4.11%
P | 3 | 4.11%
M | 2 | 2.74%
N | 2 | 2.74%
Q | 2 | 2.74%
F | 2 | 2.74%
V | 2 | 2.74%
R | 1 | 1.37%
Y | 1 | 1.37%
H | 1 | 1.37%
W | 1 | 1.37%
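A frequency table like this one can be produced with a few lines of Python. This is a minimal sketch: the example sequence is a made-up placeholder, not the MccM sequence, and the function name is my own.

```python
from collections import Counter


def residue_frequencies(sequence):
    """Return (residue, count, percent) tuples, most frequent first."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    counts = Counter(seq)
    total = len(seq)
    return [(aa, n, 100.0 * n / total) for aa, n in counts.most_common()]


example = "MSGSGIAKSG"  # hypothetical placeholder sequence
for aa, n, pct in residue_frequencies(example):
    print(f"{aa} | {n} | {pct:.2f}%")
```

As a consistency check on the table, the counts sum to 73 residues, and 12 / 73 rounds to 16.44%.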
3) Structure Page of My Choice Microcin Protein (RCSB)
Microcin systems, including my initial choice, Microcin A, were not available as standalone resolved structures that would support the full visualization expected for this section. To meet the requirement for a high-quality structure with clear visualization features, I used TolC as the structural anchor because it is directly relevant to microcin export and is well characterized in the literature.
Protein: TolC (E. coli outer membrane export channel)
Trimer architecture, surface envelope with internal helical core
Axial top view highlighting symmetry and central channel
Surface electrochemical landscape showing charge distribution
Lateral chemical view emphasizing membrane-facing hydrophobics
Ribbon colored by residue chemistry to show lumen and interfaces
Ribbon-only structural architecture for fold clarity
Color Representation of Selected Images
Image | Title | Representation | Color | Meaning
1 | Surface envelope with helical core overlay | Transparent surface + ribbon | Light grey | Outer surface
1 |  |  | Yellow | Hydrophobic surface regions
1 |  |  | Blue | Helical channel core
2 | Central channel, axial top view | Ribbon | Yellow | Chain A
2 |  |  | Blue | Chain B
2 |  |  | Light grey | Chain C
3 | Surface electrochemical landscape | Surface | Red | Acidic residues
3 |  |  | Blue | Basic residues
3 |  |  | Yellow | Hydrophobic residues
3 |  |  | Light grey | Neutral/other
4 | Outer membrane barrel, lateral chemical view | Surface | Red/Blue/Yellow/Grey | Same chemistry scheme
5 | Ribbon colored by residue type | Ribbon | Red/Blue/Yellow/Grey | Residue chemistry
6 | Secondary structure architecture | Ribbon | Light cyan | Backbone only
Microcin A processing pathway (my initial microcin protein choice)
Step | Protein | Function | Role in pathway | Stage
1 | MccA | Precursor peptide | Scaffold for toxin | Precursor
2 | MccB | Adenyltransferase | Adds AMP to C-terminus | Modification
3 | MccD | Aminopropyltransferase | Adds aminopropyl group | Modification
4 | MccC | Efflux pump | Exports mature microcin | Export / Resistance
5 | MccE | Acetyltransferase | Detoxifies microcin in producer | Immunity
6 | MccF | Serine peptidase | Cleaves toxic moiety | Immunity
Microcin M processing pathway (my current choice after further exploring the literature)
Step | Gene / protein | Function | Role in pathway
1 | mcmA | MccM precursor peptide | Ribosomal scaffold
2 | mcmI | Immunity protein | Producer self-protection
3 | mcmL | Glycosyltransferase-like | Supports siderophore moiety preparation
4 | mcmK | Esterase-like | Supports siderophore processing
5 | mchC / mchD | Linker proteins | Attachment steps (biochemistry not fully resolved)
6 | mchF | ABC transporter | Exports mature microcin
7 | mchE | Membrane fusion protein | Works with export machinery
8 | tolC | Outer membrane channel | Final export conduit
Part C. Using ML-Based Protein Design Tools
1A) Deep Mutational Scan (ESM2)
Using ESM2, I generated an unsupervised deep mutational scan across the TolC sequence. The heatmap showed multiple constrained regions, visible as vertical bands, suggesting positions that are broadly intolerant to mutation.
A clear example was residue 178. The wild-type residue is tryptophan (W). The mutation W178D produced a relative log-likelihood score of −2.38, indicating a strong model penalty. Structural inspection supports this: W178 is buried within the TolC trimeric structure. Replacing a bulky hydrophobic aromatic residue with a negatively charged aspartate is expected to disrupt local hydrophobic packing and weaken the inter-chain interface.
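To make the score concrete, here is a minimal sketch of how a relative log-likelihood can be computed from a model's per-position amino acid probabilities. The probabilities below are hypothetical values I chose for illustration, not ESM2 output, and the notebook may use a different scoring variant (for example masked marginals or a different log base).

```python
import math


def relative_log_likelihood(position_probs, wild_type, mutant):
    """log p(mutant) - log p(wild type) at one position; negative means penalized."""
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Hypothetical probabilities at a buried aromatic position (illustration only).
position_probs = {"W": 0.54, "F": 0.20, "Y": 0.15, "L": 0.06, "D": 0.05}

score = relative_log_likelihood(position_probs, "W", "D")
print(f"W->D relative log-likelihood: {score:.2f}")
```

Under this definition the wild-type residue always scores 0, so every column of the heatmap is read relative to the native sequence.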
Supporting snapshots:
ESMFold inference (TolC chain)
Using the notebook workflow:
Sequence length: 428
Mode: mono
Device: CUDA
Prediction: pTM 0.858, mean pLDDT 90.2 (min 41.4, max 96.3)
Outputs saved: PDB, PAE, pLDDT, contacts
TolC_ChainA_ESMFold_ptm0.858_r3.pdb
TolC_ChainA_ESMFold_ptm0.858_r3.pae.txt
TolC_ChainA_ESMFold_ptm0.858_r3.plddt.txt
TolC_ChainA_ESMFold_ptm0.858_r3.contacts.txt
This combination of language-model scoring and structural context gave a consistent interpretation of constraint and stability.
Additional outputs:
1B) Latent Space Analysis (ESM2 Embeddings)
Using ESM2 embeddings, protein sequences were projected into reduced-dimensional space using t-SNE. Each sequence was represented by the mean of its final hidden state embeddings, generating a fixed-length vector per protein. Dimensionality reduction to three components revealed structured clustering rather than random dispersion.
Proteins grouped into coherent neighborhoods, suggesting the embedding captures functional and structural similarity. When placing the TolC sequence into this latent map, it localized within a neighborhood consistent with outer membrane efflux proteins. Its nearest neighbors showed similar length profiles and domain architecture, supporting the idea that sequence-only embeddings can recover meaningful structural proximity.
Overall, the clustering behavior was consistent with the embedding reflecting shared fold-level or domain-level properties, rather than superficial sequence identity alone.
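The pooling and nearest-neighbor steps described above can be sketched in plain Python. The function names are mine, and the tiny 2-dimensional vectors stand in for real ESM2 hidden states, which have hundreds of dimensions; t-SNE itself is not reproduced here.

```python
import math


def mean_pool(token_embeddings):
    """Average per-residue vectors into one fixed-length vector per protein."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy per-residue embeddings for two hypothetical proteins of different length.
protein_a = mean_pool([[1.0, 2.0], [3.0, 4.0]])
protein_b = mean_pool([[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]])
print(cosine_similarity(protein_a, protein_b))
```

Mean pooling is what makes proteins of different lengths comparable, at the cost of discarding positional information.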
2A) Folding the Protein with ESMFold
The TolC sequence (length 428 residues) was folded using ESMFold with three recycles.
Predicted pTM: 0.858
Mean pLDDT: 90.2 (min 41.4, max 96.3)
The predicted structure displayed a clear alpha-helical barrel architecture consistent with known TolC topology. Confidence was highest across the helical core and reduced mainly in flexible loop regions and termini, which is typical for long membrane-associated channels.
A structural check against experimental PDB 1EK9 showed strong global agreement in fold topology. The helical bundle organization was preserved, supporting the reliability of the prediction for this fold class.
2B) Structural Resilience to Mutation
Single mutation: W178D
Residue W178, identified as buried within the trimeric core, was mutated to aspartate (W178D). This substitution replaces a large hydrophobic aromatic residue with a charged polar residue.
ESMFold outputs:
TolC_W178D_ESMFold pTM: 0.859, mean pLDDT: 90.3 (min 41.3, max 96.4)
TolC_W178D_ESMFold_ptm0.859_r3.pdb
TolC_W178D_ESMFold_ptm0.859_r3.plddt.txt
Interpretation: the mutant maintained high overall confidence and preserved the global helical barrel architecture. The expected effect is primarily local disruption around the buried site, consistent with the ESM2 penalty, rather than a full fold collapse.
Segment mutation: alanine window (173–182)
A short segment around position 178 was mutated to alanine residues to test fold robustness under broader perturbation.
TolC_AlaWindow_173_182_ESMFold pTM: 0.845, mean pLDDT: 89.8 (min 42.7, max 96.4)
Interpretation: compared to the single-site mutation, the alanine window produced a slightly lower confidence score and broader local destabilization, but the overall topology remained recognizable. This supports that TolC’s fold stability is distributed across the structure rather than being dominated by one residue.
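The comparison can be tabulated with a short sketch using the values reported above. The dictionary layout and function name are my own choices.

```python
# pTM and mean pLDDT as reported in the ESMFold runs above.
runs = {
    "wild type": {"ptm": 0.858, "mean_plddt": 90.2},
    "W178D": {"ptm": 0.859, "mean_plddt": 90.3},
    "Ala window 173-182": {"ptm": 0.845, "mean_plddt": 89.8},
}


def deltas_vs_reference(runs, reference="wild type"):
    """Difference of each metric from the reference run, rounded to 3 decimals."""
    ref = runs[reference]
    return {
        name: {key: round(metrics[key] - ref[key], 3) for key in ref}
        for name, metrics in runs.items()
        if name != reference
    }


for name, delta in deltas_vs_reference(runs).items():
    print(name, delta)
```

The W178D shift (+0.001 pTM) is small enough that it may not be distinguishable from ordinary prediction variation, whereas the alanine window moves both metrics in the same downward direction. Global confidence scores are also a coarse readout, so local effects around residue 178 would need per-residue pLDDT to assess.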
3A) Inverse Folding with ProteinMPNN
Using the backbone coordinates of PDB 1EK9, ProteinMPNN generated alternative sequences compatible with the fixed TolC structure.
Run details captured in output:
Model: v_48_020
Edges: 48
Noise: 0.2 Å
Designed chains: A, B, C
Sampling temperature: 0.1
Native score (lower is better): 1.6983
Best design score reported: 0.8601 (sample=2)
High-level pattern: the designed sequences remained strongly alpha-helix compatible, with many alanine, leucine, and lysine residues, consistent with maintaining a stable helical barrel scaffold.
FASTA output (ProteinMPNN_designs.fasta) was generated and evaluated for structural compatibility.
3B) Folding Designed Sequences with ESMFold
The top ProteinMPNN-designed sequence was refolded using ESMFold to assess structural compatibility. The predicted fold preserved the alpha-helical barrel topology. Differences were mainly confined to loop regions, while the core architecture remained consistent with the TolC backbone. This supports that ProteinMPNN successfully proposed sequences structurally compatible with the TolC fold.
Notebook note: the 3-chain complex folding run saved a PDB file:
TolC_3chain_ESMFold_len69_r0.pdb
3C) Structural Alignment Interpretation (I had computed this earlier but overlooked it until now)
Metric | Value | Meaning
Aligned residues | 22 | Only a small fragment of the full TolC structure was compared
RMSD | 2.49 Å | Shows reasonable backbone structural similarity within the fragment
Sequence identity | 4.5% | Very low sequence similarity
TM-score (normalized by reference structure) | 0.047 | Low because fragment is tiny relative to the full protein
Why the TM-score is Low but RMSD is Informative
The TM-score appears low (0.047) because it is normalized by the length of the full TolC protein (423 residues). The designed model represents only 22 residues, so TM penalizes the short fragment. In contrast, RMSD is calculated over the aligned residues only, reflecting how well the fragment overlaps structurally with the native region. An RMSD of 2.49 Å indicates that the backbone conformation of the designed fragment reasonably resembles the native TolC fold.
Structural alignment between the designed TolC fragment and the native TolC structure (PDB: 1EK9) yielded an RMSD of 2.49 Å across 22 aligned residues, demonstrating moderate backbone similarity. The TM-score (0.047) is artificially low due to normalization against the full TolC protein (423 residues). Despite very low sequence identity (4.5%), the RMSD indicates that the designed fragment adopts a backbone conformation consistent with the corresponding native region.
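The normalization effect can be checked numerically with the standard TM-score formula, TM = (1/L_ref) Σ 1/(1 + (d_i/d0)²) with d0 = 1.24·(L_ref − 15)^(1/3) − 1.8. This is a sketch, not a re-run of the alignment: I do not have the per-residue distances, so as a simplifying assumption every aligned pair is placed at the reported RMSD of 2.49 Å.

```python
def tm_d0(l_ref):
    """Length-dependent distance scale used by TM-score, in angstroms."""
    return 1.24 * (l_ref - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, l_ref):
    """TM-score for aligned-pair distances, normalized by the reference length."""
    d0 = tm_d0(l_ref)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_ref


L_REF = 423      # reference length used for normalization (from the text)
N_ALIGNED = 22   # aligned residues (from the text)

upper_bound = N_ALIGNED / L_REF                # every pair at distance 0
approx = tm_score([2.49] * N_ALIGNED, L_REF)   # assumption: all pairs at the RMSD

print(f"upper bound: {upper_bound:.3f}")
print(f"approximate TM-score: {approx:.3f}")
```

With only 22 of 423 residues aligned, the score cannot exceed about 0.052 even for a perfect superposition, so the reported 0.047 is close to the ceiling for a fragment of this size.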
Overall Conclusion
Across embedding analysis, forward folding, mutational perturbation, and inverse design, TolC shows:
strong structural determinism captured by sequence models
robustness of the global fold to a single-site perturbation (W178D)
broader but still localized destabilization under a short alanine-window mutation
backbone-constrained sequence flexibility under inverse folding, with high compatibility upon refolding
Overall, the results support that protein language models encode structural priors that transfer across mutation scanning, folding, and inverse design tasks.
Process Reflections
This assignment forced me to move beyond simply “running models” into understanding how each computational layer interacts with biological structure. I began with deep mutational scanning using ESM2, where selecting W178D and confirming its buried structural context in Chimera made the relationship between sequence, structure, and stability concrete rather than abstract. That step shifted my thinking from score interpretation to spatial reasoning.
In latent space analysis, I learned the importance of runtime management and reproducibility, especially when Colab resets interrupted long embedding jobs. Rebuilding Step 2 to function independently reinforced modular workflow design. ProteinMPNN inverse folding introduced another layer: generating sequences under structural constraints while interpreting native scores and recovery metrics carefully.
The most instructive challenge was ESMFold memory failure when attempting to fold the trimer as a single concatenated chain. Debugging GPU out-of-memory errors clarified how sequence length scales computational complexity. Representing the trimer properly and adjusting chunk size, precision, and recycles emphasized computational discipline.
Overall, this process strengthened my systems thinking: model outputs are not endpoints but components within an engineered pipeline requiring structural awareness, resource management, and iterative refinement.
AI Prompts Employed
Why is ESMFold running out of GPU memory, and what does sequence length do to memory
How do I represent a 3-chain complex properly in ESMFold without concatenating chains
Rewrite the inverse folding protein process to minimize memory usage (half precision, chunking, fewer recycles)
Add a safe CPU fallback that still saves the PDB cleanly
Week 5
Class Assignment — Week 5
Part A. SOD1 Binder Peptide Design
Background
ALS remains one of the more intractable neurodegenerative diseases partly because its genetic architecture is well-defined but hard to drug. The A4V mutation in SOD1 - a single alanine-to-valine substitution at residue 4 - is one of the most aggressive familial variants, accelerating disease progression significantly compared to other SOD1 mutations. The aggregation-prone nature of the A4V protein makes it an interesting peptide-binding target: if you can design a peptide that engages the misfolded or oligomerizing form, you potentially disrupt a key early step in motor neuron toxicity.
This part of the assignment asked us to design binders using PepMLM, evaluate them structurally in AlphaFold3, assess therapeutic properties in PeptiVerse, and then generate an optimized candidate using moPPIt. The known binder FLYRWLPSRRGG served as our experimental baseline throughout.
1) Generating Candidates with PepMLM
The SOD1 A4V sequence was generated by introducing the A→V substitution at position 4 of the canonical human SOD1 sequence (UniProt P00441). This mutant sequence served as the target for PepMLM-based peptide generation.
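A small sketch of the substitution step, with a wild-type check that catches numbering mistakes. The function name is mine. The 10-residue prefix below is my recollection of the start of the P00441 record and should be verified against UniProt; I am also assuming the record includes the initiator methionine, in which case the traditional "A4V" name counts from the residue after Met.

```python
def substitute(sequence, position, expected_wt, mutant):
    """Apply a point substitution at a 1-based position after checking the wild type."""
    idx = position - 1
    found = sequence[idx]
    if found != expected_wt:
        raise ValueError(
            f"expected {expected_wt} at position {position}, found {found}; "
            "check the numbering convention"
        )
    return sequence[:idx] + mutant + sequence[idx + 1:]


# Illustrative N-terminal prefix only (assumption: verify against the UniProt record).
prefix_with_met = "MATKAVCVLK"
prefix_mature = prefix_with_met[1:]  # numbering without the initiator Met

print(substitute(prefix_mature, 4, "A", "V"))    # ATKVVCVLK
print(substitute(prefix_with_met, 5, "A", "V"))  # MATKVVCVLK
```

Calling `substitute(prefix_with_met, 4, "A", "V")` raises a `ValueError`, because position 4 in Met-inclusive numbering is a lysine under this assumption. The check is cheap insurance before the mutant sequence goes into any downstream tool.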
PepMLM produced four novel candidates alongside the known binder:
Peptide | Pseudo Perplexity
WRYYVAAAAHKE | 13.27
WRYPAVAAELK | 6.83
WRSPAAALALGK | 6.78
WLYPVAAAEWKK | 18.43
FLYRWLPSRRGG (known) | 20.64
One notable observation: PepMLM generated an X at position 12 of one candidate, indicating low model confidence at that residue. The peptide was trimmed to 11 residues before structural evaluation - a practical decision that reflects an important general principle: generative model outputs require post-processing judgment, not just automated acceptance.
Lower perplexity scores indicate higher model confidence in sequence-target compatibility. WRSPAAALALGK (6.78) and WRYPAVAAELK (6.83) were the two most confidently generated peptides, which becomes an interesting data point when their structural and affinity results diverge later.
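For reference, pseudo-perplexity is generally defined as the exponential of the negative mean log-probability the model assigns to each residue when that residue is masked. The sketch below uses that general definition with hypothetical log-probabilities; PepMLM's exact implementation (for example how it conditions on the target sequence) may differ.

```python
import math


def pseudo_perplexity(token_log_probs):
    """exp(-mean log p) over per-residue masked log-probabilities (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical values: a model that gives every residue probability 0.5
# has pseudo-perplexity 2, i.e. it is "choosing between two options" on average.
confident = [math.log(0.5)] * 12
uncertain = [math.log(0.05)] * 12

print(pseudo_perplexity(confident))
print(pseudo_perplexity(uncertain))
```

Read this way, a score near 7 means the model is, on average, about as uncertain as a uniform choice among seven residues at each position.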
2) Structural Evaluation with AlphaFold3
How I interpret AF3 results
Three outputs guided my reading of every job. The ipTM score is the most critical — it specifically measures interface confidence, how certain AF3 is that the two chains actually interact. I use the following scale: above 0.80 indicates high confidence; 0.60–0.80 is moderate; 0.40–0.60 is uncertain; below 0.40 is poor. The pTM score is secondary — it measures overall complex fold confidence rather than interface quality specifically. A high pTM with low ipTM means AF3 predicted the protein structure well but is not sure where the peptide goes. The PAE matrix is visual confirmation: dark green signals low positional error and high confidence, while pale green or white signals uncertainty. I divided every matrix into the large SOD1 block (residues 1–153), the peptide strip at the edge, and the corner where they intersect — that corner is where interface confidence is read.
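The ipTM scale above can be written as a small helper so every job is labeled the same way. The band names follow my scale; how exact boundary values (0.40, 0.60, 0.80) are assigned is an assumption, since the scale does not specify it.

```python
def iptm_band(iptm):
    """Map an ipTM score to the confidence bands used in this write-up."""
    if iptm > 0.80:
        return "high"
    if iptm >= 0.60:      # assumption: 0.60 itself counts as moderate
        return "moderate"
    if iptm >= 0.40:      # assumption: 0.40 itself counts as uncertain
        return "uncertain"
    return "poor"


for peptide, iptm in [("FLYRWLPSRRGG", 0.37), ("WRSPAAALALGK", 0.61), ("GTCGTSTQYYGT", 0.47)]:
    print(peptide, iptm, iptm_band(iptm))
```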
Baseline - FLYRWLPSRRGG (ipTM = 0.37, pTM = 0.69)
The known SOD1-binding peptide received an ipTM of 0.37 in AlphaFold3, falling below the 0.4 threshold for confident interface prediction. Structurally, the peptide appeared largely unstructured and surface-associated, making only minimal contact with the peripheral edge of the SOD1 β-barrel rather than engaging the N-terminal region where the A4V mutation sits or the dimer interface. This is not surprising - AF3 is known to struggle with short, intrinsically disordered peptides that lack a stable pre-binding conformation. Rather than treating this as evidence that FLYRWLPSRRGG doesn’t bind, I treated it as a calibration point: any generated peptide scoring above 0.37 would represent an improvement in predicted structural placement confidence.
PepMLM Candidates
Peptide | ipTM | pTM | Confidence
WRYYVAAAAHKE | 0.37 | 0.71 | ❌ Poor
WRYPAVAAELK | 0.25 | 0.71 | ❌ Poor
WRSPAAALALGK | 0.61 | 0.87 | ⚠️ Moderate
WLYPVAAAEWKK | 0.33 | 0.77 | ❌ Poor
FLYRWLPSRRGG | 0.37 | 0.69 | ❌ Poor (baseline)
The standout result here is WRSPAAALALGK (ipTM = 0.61). Its PAE matrix showed a noticeably darker interface region compared to all other PepMLM peptides - meaning AF3 had reasonable confidence not just in the SOD1 structure itself but in where the peptide sits relative to it. The peptide visibly engaged the outer face of the β-barrel with more consistent surface contact. It was the only PepMLM peptide to cross the 0.6 threshold.
What makes this particularly interesting is that WRSPAAALALGK had the weakest PeptiVerse-predicted affinity of the entire PepMLM set (pKd/pKi = 5.147). The discrepancy between structural placement confidence and predicted binding affinity is not a contradiction - it reflects the fact that these tools are measuring different things. AF3 is asking: “Does this peptide have a defined geometric relationship with this protein?” PeptiVerse is asking: “Based on sequence properties, how tightly might this peptide bind?” Those are genuinely different questions, and this dataset illustrates why using only one metric is insufficient.
WRYPAVAAELK (ipTM = 0.25) showed the reverse pattern - highest PeptiVerse affinity (6.037) but lowest structural confidence of any peptide in the dataset. The PAE interface region was essentially pale throughout.
Job 1 — WRYYVAAAAHKE (ipTM = 0.37, pTM = 0.71)
The peptide adopted two clear alpha helices in the 3D viewer — a notable finding, since most PepMLM candidates appeared as unstructured coils. Despite the secondary structure adoption, the peptide sat above and separate from the SOD1 β-barrel with only a small contact point visible. The PAE matrix showed a confident dark-green diagonal for SOD1 (residues 1–153) and a small dark spot in the bottom-right corner confirming internal peptide confidence — but the interface strip between them was pale, meaning AF3 is uncertain about the peptide’s position relative to SOD1. The ipTM of 0.37 matches the baseline exactly, providing no structural improvement over the known binder.
Job 2 — WRYPAVAAELK (ipTM = 0.25, pTM = 0.71)
The peptide appears as an orange/red segment on the right lateral face of the SOD1 structure. The protein itself is rendered in light blue/cyan with many visible loops, suggesting lower overall confidence. The PAE matrix shows moderate internal confidence for the SOD1 block but a very light band at the peptide region — meaning AF3 is highly uncertain about where the peptide sits relative to SOD1. Binding is essentially surface-associated on the lateral β-barrel face, not near residue 4 and not at the dimer interface. Despite being our top PeptiVerse candidate (pKd/pKi = 6.037), WRYPAVAAELK scores the lowest ipTM of all peptides at 0.25. This is the clearest illustration in the dataset that PeptiVerse affinity predictions and AF3 structural confidence are not interchangeable metrics.
Job 3 — WRSPAAALALGK (ipTM = 0.61, pTM = 0.87) ⭐ Best PepMLM Result
This result is strikingly different from the others. The SOD1 structure is rendered in deep blue throughout, indicating high confidence. The peptide (yellow/gold segment) is visible at the lower right periphery, appearing to make contact with the edge of the β-barrel. Critically, the PAE matrix interface region shows a moderately green signal rather than a pale one: this is the only PepMLM peptide where the corner where SOD1 and peptide intersect is meaningfully darker. AF3 has reasonable confidence in where this peptide sits relative to the protein. The binding location contacts the outer face of the β-barrel near the C-terminal region of SOD1, not directly at residue 4, but engaging a defined surface patch rather than dangling loosely. Its alanine/leucine-rich hydrophobic core may facilitate surface contact through hydrophobic complementarity, a property ESM captures but pKd/pKi does not fully weight.
Job 4 — WLYPVAAAEWKK (ipTM = 0.33, pTM = 0.77)
The protein shows moderate structural confidence. The peptide appears as an orange segment at the bottom left, extended and loosely dangling away from the SOD1 core — a classic sign of uncertain placement. The PAE matrix interface strip is lighter than Job 3, with no clear dark signal at the intersection region. Binding is peripheral surface contact at the lower face of SOD1 with minimal burial. The double-K at the C-terminus and the mixed hydrophobic/charged composition may prevent stable interface formation despite reasonable solubility.
Job 5 — GTCGTSTQYYGT (ipTM = 0.47, pTM = 0.90) ⭐ Best moPPIt Result
The SOD1 structure is deep blue and well-ordered; pTM 0.90 is the highest of all individual submissions. The peptide (yellow/orange/red gradient) makes contact near the upper surface of the β-barrel as an extended coil. The PAE matrix shows a very dark green SOD1 block with a noticeably lighter pale-green peptide strip: AF3 is confident in the SOD1 structure but uncertain about precise interface geometry. Importantly, the upper β-barrel face is in the general vicinity of the N-terminal region where A4V sits. Combined with the highest PeptiVerse affinity (6.47) of all peptides evaluated, this remains the strongest overall candidate.
Job 6 — YRKSVTKEEFQI (ipTM = 0.47, pTM = 0.89)
SOD1 is deep blue and well-structured. The peptide appears as a small structured element forming what looks like a short beta-turn or loop — it has some intrinsic structural propensity. The PAE matrix is very similar to Job 5: dark green SOD1 block with a pale strip at the peptide interface region. Binding is at the lower peripheral face of SOD1, away from the N-terminus. Despite a strong motif score from moPPIt (0.84) suggesting N-terminal engagement, AF3 does not confirm this structurally — another illustration that moPPIt motif scores and AF3 placement confidence are measuring different aspects of the same design problem.
moPPIt Candidates
Binder | Hemolysis | Solubility | Affinity | Motif
YRKSVTKEEFQI | 0.95 | 0.75 | 5.84 | 0.84
GTCGTSTQYYGT | 0.96 | 1.00 | 6.47 | 0.75
ETYNLTCEQKKD | 0.98 | 0.92 | 6.35 | 0.87
ETEKKTCQYNCG | 0.98 | 1.00 | 6.01 | 0.84
3) Therapeutic Property Evaluation with PeptiVerse
Peptide | Perplexity | Soluble | Hemolytic | pKd/pKi | Net Charge | MW (Da) | GRAVY
WRYYVAAAAHKE | 13.27 | ✅ 1.000 | ✅ 0.018 | 5.678 | +0.85 | 1464.6 | -0.60
WRYPAVAAELK | 6.83 | ✅ 1.000 | ✅ 0.034 | 6.037 | +0.76 | 1303.5 | -0.21
WRSPAAALALGK | 6.78 | ✅ 1.000 | ✅ 0.020 | 5.147 | +1.76 | 1240.5 | +0.22
WLYPVAAAEWKK | 18.43 | ✅ 1.000 | ✅ 0.037 | 5.484 | +0.76 | 1461.7 | -0.22
FLYRWLPSRRGG | 20.64 | ✅ 1.000 | ✅ 0.047 | 5.968 | +2.76 | 1507.7 | -0.71
PeptiVerse predictions revealed that all five peptides — including the known binder FLYRWLPSRRGG — were classified as soluble and non-hemolytic, indicating a broadly favorable therapeutic profile across the generated library. The hemolysis probabilities ranged from 0.018 to 0.047, with WRYYVAAAAHKE being the safest (0.018) and FLYRWLPSRRGG carrying the highest risk at 0.047 — though still well within the safe range. Net charges ranged from +0.76 to +2.76, all consistent with therapeutically viable short peptides, and molecular weights were well under 1600 Da throughout.
Binding affinities were uniformly classified as “weak binding,” though meaningful differences emerged in pKd/pKi values. Notably, WRYPAVAAELK achieved the highest predicted affinity (6.037), marginally exceeding the known binder FLYRWLPSRRGG (5.968), and it also had the second-lowest perplexity score (6.83), suggesting reasonable alignment between PepMLM’s generative confidence and PeptiVerse’s affinity prediction for this peptide. This correlation did not hold universally: WRSPAAALALGK had the lowest perplexity (6.78), meaning PepMLM was most confident generating it, yet showed the weakest predicted affinity (5.147).
This tells me that perplexity captures sequence-level compatibility with the target but does not independently predict binding quality. Low perplexity is a useful first filter rather than a sufficient criterion: the known binder has the highest perplexity in the set (20.64), so the score needs to be read alongside independent multi-property evaluation.
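The perplexity and affinity columns can be compared directly with a Pearson correlation, sketched here in plain Python from the PeptiVerse table values. With only five peptides this is a descriptive check, not a statistical test.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Order: WRYYVAAAAHKE, WRYPAVAAELK, WRSPAAALALGK, WLYPVAAAEWKK, FLYRWLPSRRGG
perplexity = [13.27, 6.83, 6.78, 18.43, 20.64]
affinity = [5.678, 6.037, 5.147, 5.484, 5.968]

r = pearson(perplexity, affinity)
print(f"Pearson r (perplexity vs pKd/pKi): {r:.2f}")
```

If lower perplexity reliably meant tighter predicted binding, r would be strongly negative; a value near zero is consistent with the two scores carrying largely independent information in this small set.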
4) moPPIt Optimization
moPPIt’s multi-objective guided discrete flow matching generated four peptides directed toward residues 1–8 of the A4V SOD1 mutant:
Peptide | Solubility | Affinity | Motif Score | Hemolysis
YRKSVTKEEFQI | 0.75 | 5.84 | 0.84 | 0.95 ✅
GTCGTSTQYYGT | 1.00 ✅ | 6.47 | 0.75 | 0.96 ✅
ETYNLTCEQKKD | 0.92 | 6.35 | 0.87 | 0.98 ✅
ETEKKTCQYNCG | 1.00 ✅ | 6.01 | 0.84 | 0.98 ✅
The contrast between PepMLM and moPPIt outputs is compositionally striking. PepMLM outputs were tryptophan-heavy and hydrophobic (WRYY-, WRYP-, WRSP-, WLYP-). moPPIt generated more compositionally diverse sequences incorporating charged and polar residues (E, K, T, N, C, Y), which reflects what multi-objective optimization actually does: it doesn’t just optimize for target compatibility, it simultaneously balances affinity, solubility, safety, and motif score.
GTCGTSTQYYGT achieved the highest affinity score of all peptides evaluated (6.47) alongside perfect solubility and strong non-hemolytic confidence. ETYNLTCEQKKD followed with a high motif engagement score (0.87), suggesting effective N-terminal targeting, which matters here because the A4V mutation sits at residue 4.
Integrated Candidate Ranking and Final Selection
| Peptide | Source | ipTM | PeptiVerse Affinity | Overall Assessment |
|---|---|---|---|---|
| WRSPAAALALGK | PepMLM | 0.61 | 5.147 | Best structural placement |
| GTCGTSTQYYGT | moPPIt | 0.47 | 6.47 | Best affinity, highest pTM |
| WRYPAVAAELK | PepMLM | 0.25 | 6.037 | Affinity strong, structure weak |
| ETYNLTCEQKKD | moPPIt | 0.47 | 6.35 | Strong balanced candidate |
| FLYRWLPSRRGG | Known | 0.37 | 5.968 | Baseline |
Peptide to advance: GTCGTSTQYYGT
Alternative candidate: ETYNLTCEQKKD. On a strictly mechanistic basis, ETYNLTCEQKKD presents a strong case for advancement. Its motif score (0.87) is the highest in the entire dataset, meaning moPPIt judged it as most effectively engaging residues 1–8, the region that contains the A4V substitution at residue 4. Its affinity (6.35) is close to that of GTCGTSTQYYGT (6.47), its solubility is 0.92, and its hemolysis safety is 0.98. Like GTCGTSTQYYGT, it contains a single cysteine (position 7), so the two candidates carry a comparable redox stability liability. If the selection criterion were weighted toward N-terminal targeting specificity over raw affinity rank, ETYNLTCEQKKD would be the primary candidate.
Of all ten peptides evaluated, GTCGTSTQYYGT presents the strongest integrated profile. It achieved the highest predicted binding affinity (pKd/pKi = 6.47) of any candidate across both generation methods, perfect solubility (1.000), strong hemolysis safety (0.96), and the highest pTM score in the dataset (0.90) - indicating AF3 predicted a well-ordered SOD1 structure in its complex. Its moderate ipTM (0.47) is consistent with the general pattern seen across all peptides and does not distinguish it negatively from the field. The AF3 structural viewer showed the peptide as an extended coil making surface contact near the upper β-barrel face, in the general vicinity of the N-terminal A4V region.
Before advancing further, validation steps would include: AlphaFold3 or RoseTTAFold structural confirmation of binding near residue 4; molecular dynamics simulation for binding stability; surface plasmon resonance or isothermal titration calorimetry for experimental affinity confirmation; cell-based cytotoxicity assays in motor neuron models; and proteolytic stability assays for physiological half-life. One additional consideration specific to GTCGTSTQYYGT: the sequence contains a single cysteine residue (position 3) that may form intermolecular disulfide bonds (peptide dimers) or undergo oxidation under physiological redox conditions. A redox stability assessment and, if necessary, a Cys→Ser or Cys→Ala analogue should be evaluated before committing to this scaffold.
Part B. BRD4 Drug Discovery Platform Tutorial
1) Structural Predictions in the Sandbox
| Compound | Binding Confidence | Optimization Score | Structure Confidence |
|---|---|---|---|
| Hit | 0.45 | 0.22 | 0.97 |
| Lead | 0.74 | 0.25 | 0.98 |
| JQ1 | 0.96 | 0.45 | 0.98 |
Q1: Does Binding Confidence increase as you move from hit to clinical candidate?
Yes. Binding Confidence increases monotonically across the series: Hit (0.45) → Lead (0.74) → JQ1 (0.96). This is the expected pattern. Each stage represents deliberate structural elaboration optimising target complementarity, so the model’s confidence in productive binding should rise accordingly.
Deviations can occur for several reasons. A lead compound may outscore a candidate if the candidate carries solubility-improving modifications (e.g. tert-butyl ester in JQ1) that reduce direct contact with the pocket. Stereochemical complexity added during optimisation can also confuse pose prediction. Additionally, Boltz scores binding pose plausibility, not biological potency — a metabolically stable but conformationally flexible candidate may score lower than a rigid, tighter-fitting lead.
Q2: Key binding interactions in the predicted JQ1 pose
JQ1 occupies the BRD4 acetyl-lysine recognition pocket. From the predicted pose, key interactions include:
Triazolo-diazepine core — engages the conserved asparagine (Asn140) via hydrogen bonding, mimicking the acetyl-lysine carbonyl
Chlorophenyl group — sits in the WPF shelf hydrophobic subpocket (Trp81, Pro82, Phe83), contributing van der Waals contacts
Thieno ring methyl groups — pack against the ZA channel hydrophobic residues (Leu92, Val87)
tert-Butyl ester — projects toward solvent, consistent with its role as a solubilising group rather than a binding contributor
Q3: Optimization Score — JQ1 vs Lead
JQ1 (0.45) scores 80% higher than the Lead (0.25). The Optimization Score reflects how well a compound’s predicted binding geometry satisfies the probe-defined pocket relative to the reference structure. JQ1’s score places it firmly in the high-confidence binder category (>0.40); the Lead sits at the lower boundary of moderate confidence.
The gap reflects the structural additions made during lead-to-candidate optimisation, particularly the triazole elaboration and stereochemical fixing of the diazepine ring, which improve shape complementarity with the BRD4 pocket. The Lead’s core is present but insufficiently decorated to achieve equivalent pocket filling.
Q1: How does JQ1 score alongside the library? Does it score as the top compound?
Not definitively. The best generated compound reaches a Binding Confidence of ~0.88 (Image 3, green line), which falls short of JQ1’s sandbox score of 0.96 but is competitive in this design project context. Of 1,048 candidates processed, roughly 125 exceed the 0.5 threshold, ~37 exceed 0.6, and only a handful exceed 0.8 (Image 1). The generative screen therefore produced a small but meaningful set of high-confidence binders. Whether any of them outscores JQ1 depends on where JQ1 lands after Quick Add, but the best generated compound at ~0.88 is a genuine challenger, not noise.
This is expected. The AI is optimising directly against the BRD4 pocket, so it will frequently find molecules that score at or above known inhibitors on Boltz metrics. That does not mean they are better drugs. JQ1 has decades of experimental validation behind it that no computational score can replicate.
Q2: How do top-scoring binders compare in binding pose to JQ1?
From Image 2, the parallel coordinates plot shows the top candidates cluster tightly at high Structure Confidence (0.982 range) and Binding Confidence (0.95–0.96 range), with consistent trajectories suggesting similar binding geometries. The convergence of lines across axes indicates the top hits share a common pharmacophoric profile rather than representing diverse chemotypes.
This is consistent with what you would expect from Enamine REAL space generative sampling anchored to the JQ1 probe. The model gravitates toward JQ1-like poses that satisfy the acetyl-lysine pocket geometry, particularly the Asn140 hydrogen bond and WPF shelf hydrophobic contacts. Divergent trajectories in the lower-scoring compounds (orange lines) likely represent alternative poses or partial pocket occupancy. The top hits should be inspected for conservation of the key triazole/diazepine equivalent scaffold in the 3D viewer.
2b) Generative Design Campaign (BRD4 vs BRD2 cross-selectivity)
Part C. L-Protein ESM Mutagenesis
Background
The MS2 L-protein is a 75-residue lysis protein encoded by the bacteriophage MS2. It acts by forming oligomeric pores in the inner membrane of E. coli, leading to rapid bacterial lysis. What makes it therapeutically relevant is its dependence on the host chaperone DnaJ for proper folding and function - mutations that confer DnaJ independence would expand the functional host range of MS2-derived lysis proteins, a key engineering goal in phage therapy where host chaperone availability varies across bacterial strains and resistance contexts.
The protein is divided into a soluble N-terminal domain (residues 1–40) that interacts with DnaJ, and a C-terminal transmembrane domain (residues 41–75) responsible for membrane insertion and pore assembly. Designing effective mutants requires balancing these two functional regions.
Step 1: Sequence Input and Model Setup
The wildtype MS2 L-protein sequence was submitted to the ESM2 mutational scanning notebook using the facebook/esm2_t6_8M_UR50D model. The sequence was verified against the known MS2 L-protein entry and loaded into the notebook environment running on GPU. Two scan modes were used: a full-sequence scan across all 75 positions, and a targeted scan restricted to positions 38–60 to focus resolution on the soluble/TM boundary and transmembrane domain. Both scans computed Log Likelihood Ratio (LLR) scores for every possible single amino acid substitution at every scanned position, producing a complete mutational landscape.
Step 2: ESM Mutational Scanning
Both scans were run as described in Step 1. The resulting LLR scores cover every possible single amino acid substitution at each scanned position and are summarized as a heatmap.
The heatmap revealed clear patterns. Leucine substitutions were broadly favored across the TM region (bright yellow L-row). Methionine and tryptophan substitutions were consistently penalized throughout (dark purple M and W rows). The N-terminus (residues 1–3) and the conserved RRR region (~11–13) showed strong sensitivity to substitution.
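The LLR score behind the heatmap can be sketched as follows. This is a minimal illustration, not the notebook’s actual code: it assumes the language model has already produced a log-probability for each amino acid at each position, and the toy numbers (which are not normalized probabilities) and the function name `llr_scan` are hypothetical.

```python
def llr_scan(wt_seq, log_probs):
    """Compute LLR = log p(mutant) - log p(wildtype) for every substitution.

    wt_seq: wildtype sequence string.
    log_probs: one dict per position mapping amino acid -> log-probability
               (in practice these would come from the language model).
    Returns (position, wt, mutant, llr) tuples sorted by descending LLR.
    """
    results = []
    for i, wt in enumerate(wt_seq):
        for aa, lp in log_probs[i].items():
            if aa == wt:
                continue  # the wildtype residue has LLR 0 by definition
            results.append((i + 1, wt, aa, lp - log_probs[i][wt]))
    return sorted(results, key=lambda r: r[3], reverse=True)


# Toy example: a two-residue "protein" and a three-letter alphabet.
toy_seq = "KA"
toy_log_probs = [
    {"K": -2.0, "L": -0.5, "A": -3.0},
    {"K": -4.0, "L": -1.0, "A": -0.8},
]
hits = llr_scan(toy_seq, toy_log_probs)
print(hits[0])
```

A positive LLR means the model assigns the mutant residue a higher probability than the wildtype residue at that position; it says nothing by itself about function.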
Top Mutations - Full Sequence Scan (positions 1–75)
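| Position | WT | Mutant | LLR | Region |
|---|---|---|---|---|
| 50 | K | L | +2.561 | TM |
| 29 | C | R | +2.395 | Soluble |
| 39 | Y | L | +2.242 | Soluble/TM boundary |
| 29 | C | S | +2.043 | Soluble |
| 9 | S | Q | +2.014 | Soluble |
| 50 | K | I | +1.929 | TM |
| 53 | N | L | +1.865 | TM |
| 52 | T | L | +1.814 | TM |
| 45 | A | L | +1.539 | TM |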
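The targeted scan (positions 38–60) returned the same top two hits, K50L (+2.561) and Y39L (+2.242). Because both scans use the same model and the same input sequence, this is a consistency check on the workflow rather than independent evidence, but it confirms that ESM scores these positions as the most substitution-tolerant.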
Step 3: BLAST Alignment Analysis
Prior to selecting mutations, a BLAST alignment was performed against related phage L-protein sequences to identify positions that vary naturally across evolutionary homologs. Positions conserved across all aligned sequences were excluded from consideration, as conservation is a strong signal of functional essentiality that ESM LLR alone cannot capture. Positions selected for mutation — 9, 30, 45, 46, and 63 — were all confirmed as variable across the BLAST alignment, meaning natural sequence diversity at these sites exists in the phage sequence space. This provides an independent structural tolerance signal orthogonal to ESM scoring.
The sequence coverage image above shows the MSA depth available for structure prediction across L-protein positions. Coverage was critically limited to only 14 sequences, far below the ~100 sequences per position typically required for confident covariation-based prediction. This shallow MSA is one of the three major factors explaining the low confidence scores observed in the AF2-Multimer octamer prediction in Step 6. It also contextualizes the ESM2 predictions: ESM2 scores a single sequence rather than an MSA, but so few known homologs means the model has likely seen sparse evolutionary signal for this protein, which is why cross-referencing with experimental lysis data is essential rather than optional.
Step 4: ESM vs. Experimental Cross-Reference
This is where things get genuinely interesting - and where the limitation of language model-based fitness prediction becomes concrete.
| Position | ESM Top Hit | LLR | Experimental Lysis | Protein Level | Agreement |
|---|---|---|---|---|---|
| 9 (S) | S→Q | +2.014 | Not tested | - | Unconfirmed |
| 29 (C) | C→R | +2.395 | Lysis=0 | 0 | ❌ Disagree |
| 39 (Y) | Y→L | +2.242 | Y→H: Lysis=0 | 0 | ❌ Disagree |
| 45 (A) | A→L | +1.539 | A→P: Lysis=1 | 1 | ✅ Agree |
| 50 (K) | K→L | +2.561 | K→E,I,N: Lysis=0 | 1 | ❌ Disagree |
| 53 (N) | N→L | +1.865 | N→S,D,H: Lysis=0 | 1 | ❌ Disagree |
| 30 (R) | - | - | R→Q,L: Lysis=1 | 1 | ✅ Experimental support |
| 46 (I) | - | - | I→F: Lysis=1 | 1 | ✅ Experimental support |
| 63 (V) | - | - | V→E: Lysis=1 | 1 | ✅ Experimental support |
The pattern is striking. K50 - the highest-scoring position in the entire dataset - does not tolerate substitution experimentally. Every tested K50 substitution abolished lysis, including K50I, which ESM scored at +1.929. The same holds for C29 and N53. ESM scores well above zero at all three positions, predicting broad substitution tolerance. Experimentally, they are functionally non-negotiable.
ESM2 learns from evolutionary sequence statistics across millions of proteins. What it cannot learn is that K50 in the L-protein appears functionally essential - possibly for oligomerization geometry, membrane topology orientation, or interaction with a specific bacterial target. C29 mutations abolish both lysis and protein expression, suggesting a role in co-translational folding or ribosomal interaction that no language model trained on amino acid co-occurrence patterns could detect. N53 mutations preserve protein expression but abolish lysis, suggesting this residue is specifically critical to the lysis mechanism - pore formation geometry perhaps - rather than to folding per se.
This is not a failure of ESM so much as a clarification of what it is actually measuring. It identifies structurally tolerant positions in the evolutionary sense. It cannot identify which positions are biochemically essential for a specific mechanism. The two are different questions, and this dataset makes that distinction concrete.
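The cross-reference in the table above reduces to comparing the sign of the LLR with the observed lysis outcome. The sketch below uses the six scored positions from the Step 4 table; the `classify` helper is hypothetical, and the comparison is made at the position level, since several of the tested substitutions differ from the ESM top hit.

```python
# Rows copied from the Step 4 table: (position, top ESM LLR, experimental lysis).
# lysis is None where no experimental data were available.
ROWS = [
    (9, 2.014, None),
    (29, 2.395, 0),
    (39, 2.242, 0),
    (45, 1.539, 1),
    (50, 2.561, 0),
    (53, 1.865, 0),
]


def classify(llr, lysis):
    """Compare the sign of the LLR with the observed lysis outcome."""
    if lysis is None:
        return "unconfirmed"
    predicted_tolerant = llr > 0
    return "agree" if predicted_tolerant == (lysis == 1) else "disagree"


calls = {pos: classify(llr, lysis) for pos, llr, lysis in ROWS}
disagreements = [pos for pos, call in calls.items() if call == "disagree"]
print(disagreements)
```

Positions flagged as disagreements are the ones where a positive LLR coincides with loss of lysis, which is the filter applied before selecting mutations.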
Step 5: Five Selected Mutations
Mutations were selected by integrating ESM LLR scores with experimental lysis data. Any position where the two sources of evidence disagreed was excluded.
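| # | Position | WT→Mutant | LLR | Region | Experimental Lysis | Protein Level |
|---|---|---|---|---|---|---|
| 1 | 9 | S→Q | +2.014 | Soluble | Not tested | - |
| 2 | 30 | R→Q | ~+0.5 | Soluble | ✅ Lysis=1 | 1 |
| 3 | 45 | A→L | +1.539 | TM | ✅ Lysis=1 (A→P) | 1 |
| 4 | 46 | I→F | ~+0.9 | TM | ✅ Lysis=1 | 1 |
| 5 | 63 | V→E | ~+0.3 | TM | ✅ Lysis=1 | 1 |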
Rationale:
S9Q was selected based on the highest ESM score among soluble domain positions not previously tested. S9 sits within the N-terminal DnaJ interaction region. Substitution to glutamine introduces a larger polar residue that may reduce DnaJ binding affinity - potentially conferring partial chaperone independence - while the conservative polar-to-polar change makes catastrophic folding disruption unlikely.
R30Q was selected on experimental confirmation (Lysis=1, Protein=1). R30 is part of the positively charged soluble domain, and neutralizing it to glutamine directly reduces the electrostatic surface that likely mediates DnaJ interaction, without disrupting expression or lysis competence.
A45L was selected on both ESM support (LLR = +1.539) and experimental confirmation that A45 tolerates substitution - A45P shows Lysis=1. Leucine replaces a small residue with a bulkier hydrophobic one, potentially improving hydrophobic packing in the TM helix and enhancing membrane insertion efficiency.
I46F was selected on experimental confirmation (Lysis=1, Protein=1). Phenylalanine at position 46 adds an aromatic residue to the hydrophobic TM core, which may strengthen helix-helix packing in the oligomeric pore assembly.
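V63E was selected on experimental confirmation (Lysis=1, Protein=1). Glutamate at the C-terminal TM boundary introduces a negative charge near the membrane interface. The positive-inside rule predicts that positively charged residues are favored on the cytoplasmic side, so a negative charge is easier to accommodate if this end of the helix faces away from the cytoplasm. The experimental result shows the substitution is tolerated, and the added charge may influence the oligomeric pore assembly required for lysis.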
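All five mutations were selected at positions confirmed as non-conserved by BLAST alignment analysis. Three of the five (R30Q, I46F, V63E) have direct experimental support for lysis competence; A45L is supported at the position level by A45P (Lysis=1), and S9Q is untested.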
Step 6: AF2-Multimer Octamer Prediction
ColabFold AlphaFold2-multimer v3 was used to model a hypothesized octameric pore assembly by submitting eight identical copies of the wildtype L-protein sequence as a homo-octamer. All five predicted models returned uniformly low confidence scores: pLDDT ranged from 26.6 to 36.9, pTM from 0.149 to 0.193, and ipTM from 0.114 to 0.143. The top-ranked model (model_1, ipTM = 0.143) displayed a starburst-like arrangement in which all eight chains radiated outward from a central core, with TM domains converging centrally and N-terminal soluble domains extending as disordered tails.
This radial topology is superficially consistent with a pore-forming architecture - TM helices converging into a central bundle is what you’d expect for a membrane-spanning oligomeric pore. But the confidence scores preclude any definitive structural interpretation. Three compounding factors explain the poor prediction quality: AF2-Multimer lacks membrane context, so the hydrophobic TM domain appears disordered in aqueous modeling conditions; MSA coverage was critically limited to only 14 sequences, far below the ~100 per position required for confident covariation-based prediction; and the L-protein may be genuinely intrinsically disordered until membrane insertion occurs, which AF2 cannot model.
Individual model outputs:
The consistent central TM clustering across multiple independent models does provide weak computational support for the pore-forming hypothesis - it’s something, even if it isn’t confident. This kind of result is also practically instructive: it tells you clearly where experimental validation has to carry the weight that computation cannot.
Open-Ended Question: Defining an Effective L-Protein Mutant
An effective L-protein mutant needs to satisfy five integrated criteria. First, lysis efficiency - measured via plaque assay as plaque size and clarity relative to wildtype MS2, where larger clearer plaques indicate faster or more complete bacterial killing. Second, DnaJ independence - assessed by testing infectivity in E. coli strains carrying the DnaJ chaperone resistance mutation, since this directly addresses the resistance mechanism the whole design exercise is oriented toward. Third, structural integrity - evaluated via AF2-Multimer prediction of oligomeric pore assembly, where effective mutants should maintain transmembrane topology and oligomerization capacity required for membrane perforation. Fourth, expression level - confirmed via Western blot or mass spectrometry, since a structurally competent mutant that is poorly expressed will fail in vivo regardless of intrinsic lysis activity. Fifth, evolutionary plausibility - mutations at positions that vary across a BLAST alignment of related phage L-proteins are more likely to be structurally tolerated, and this alignment serves as an independent check on ESM predictions.
Computationally, positive ESM LLR scores provide an initial structural tolerance filter. But as the K50 data demonstrate clearly, high ESM scores do not guarantee functional lysis activity. Experimental plaque assay validation remains the definitive standard. The most useful role for ESM in this workflow is not to replace experimental data but to prioritize which untested positions are worth testing next - it reduces the search space rather than eliminating the need to search.
Process Reflections
What this week reinforced most clearly is that computational tools are filters, not answers. PeptiVerse, ESM, and AlphaFold3 each measure something real and useful. None of them measures the same thing. The disagreements between them - WRSPAAALALGK’s high ipTM paired with low affinity, K50’s high LLR paired with zero experimental lysis, GTCGTSTQYYGT’s high pTM paired with moderate ipTM - are not failures of the pipeline. They are the information.
The skill is knowing what each tool is actually asking, and assembling a picture from genuinely independent lines of evidence rather than defaulting to whichever metric gives the cleanest answer. The K50 case in Part C crystallized this most sharply: a language model trained on evolutionary statistics correctly identified K50 as broadly sequence-tolerant, while experimental data showed it is biochemically non-negotiable for lysis. Both observations are true but neither alone is sufficient.
Works Cited
Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A. J., Bambrick, J., Bodenstein, S. W., Evans, D. A., Hung, C.-C., O’Neill, M., Reiman, D., Tunyasuvunakool, K., Wu, Z., Žemgulytė, A., Arany, Z., … Jumper, J. M. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), 493–500. https://doi.org/10.1038/s41586-024-07487-w
Bateman, A., Martin, M.-J., Orchard, S., Magrane, M., Ahmad, S., Alpi, E., Bowler-Barnett, E. H., Britto, R., Bye-A-Jee, H., Cukura, A., Denny, P., Dogan, T., Ebenezer, T., Fan, J., Garmiri, P., da Costa Gonzales, L. J., Hatton-Ellis, E., Hussein, A., Ignatchenko, A., … Wu, C. H. (2023). UniProt: The Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51(D1), D523–D531. https://doi.org/10.1093/nar/gkac1052
Chen, L. T., Quinn, Z., Dumas, M., Peng, C., Hong, L., Lopez-Gonzalez, M., Mestre, A., Watson, R., Vincoff, S., Zhao, L., Wu, J., Stavrand, A., Schaepers-Cheu, M., Wang, T. Z., Srijay, D., Monticello, C., Vure, P., Pulugurta, R., Pertsemlidis, S., … Chatterjee, P. (2025). Target sequence-conditioned design of peptide binders using masked language modeling. Nature Biotechnology. https://doi.org/10.1038/s41587-025-02761-2
Chen, T., Quinn, Z., Mishra, K., O’Connor, E. C., Silver, S. E., Zhang, Y., Valencia, M. J., Mei, Y., Behmoaras, J., Ferreira, L. M. R., & Chatterjee, P. (2026). moPPIt: De novo generation of motif-specific and functionally active peptide binders via discrete flow matching [Preprint]. bioRxiv. https://doi.org/10.1101/2024.07.31.606098
Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A., Green, T., Žídek, A., Bates, R., Blackwell, S., Yim, J., Ronneberger, O., Bodenstein, S., Zielinski, M., Bridgland, A., Potapenko, A., Cowie, A., Tunyasuvunakool, K., Jain, R., Clancy, E., … Jumper, J. (2022). Protein complex prediction with AlphaFold-Multimer [Preprint]. bioRxiv. https://doi.org/10.1101/2021.10.04.463034
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2
Kaplan, M., Narasimhan, S., de Heus, C., Zhao, J., Bharat, T. A. M., Young, R., & Bharat, T. A. M. (2022). Cryo-EM structure of the MS2 bacteriophage lysis protein L in complex with the DnaJ chaperone. Nature Communications, 13(1), 4102. https://doi.org/10.1038/s41467-022-31874-2
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with ESMFold. Science, 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574
Mirdita, M., Schütze, K., Moriwaki, Y., Heo, L., Ovchinnikov, S., & Steinegger, M. (2022). ColabFold: Making protein folding accessible to all. Nature Methods, 19(6), 679–682. https://doi.org/10.1038/s41592-022-01488-1
Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., & Fergus, R. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15), e2016239118. https://doi.org/10.1073/pnas.2016239118
Shi, Y., Iyer, A., Liu, F., & Bhattacharya, S. (2023). PeptiVerse: An integrated platform for multi-property therapeutic peptide prediction [Preprint]. bioRxiv. https://doi.org/10.1101/2023.10.11.561829
Wang, G., Heberle, F. A., Chen, R., & Sun, F. (2022). Phage lysis proteins as targeted antibacterials. Pharmaceuticals, 15(9), 1062. https://doi.org/10.3390/ph15091062
Cross-reference ESM LLR scores against experimental lysis data and identify where they agree vs. disagree
Identify the best peptide to advance using integrated AF3, PeptiVerse, and moPPIt data
Explain why ESM would score K50 highly despite experimental evidence that K50 is functionally essential
Draft rationale for each of five selected L-protein mutations that integrates ESM scores with experimental confirmation
Class Assignment — Week 6
Part A. DNA Assembly
1. Components of Phusion High-Fidelity PCR Master Mix
A) Phusion DNA Polymerase
A proofreading polymerase fused to a DNA-binding protein domain that increases processivity, speed, and fidelity. It combines 5´→3´ polymerase activity with 3´→5´ exonuclease activity for proofreading.
B) Phusion Reaction Buffer (HF or GC)
An optimized buffer that provides high salt concentrations used to stabilize primer-template hybridization. HF Buffer is the default for high fidelity, while GC Buffer helps with GC-rich or difficult templates.
C) MgCl₂
Provides the necessary magnesium ions for Phusion DNA polymerase activity.
D) dNTPs
Deoxynucleoside triphosphates (dATP, dTTP, dGTP, and dCTP). They act as the building blocks for synthesizing the new DNA strand.
E) DMSO
Dimethyl sulfoxide acts alongside the Phusion reaction buffer as a PCR additive to aid the denaturation of templates with high GC content or complex secondary structures.
F) Stabilizers
Components that maintain the integrity and activity of the enzyme during storage and cycling, often including bovine serum albumin (BSA).
2. Factors Determining Primer Annealing Temperature During PCR
Primer annealing temperature in PCR is primarily determined by the melting temperature of the primer-template duplex, which represents the temperature at which 50% of the primers are bound to the template.
A) Primer Melting Temperature
Directly related to primer annealing temperature.
B) Primer Length
Directly related to primer annealing temperature; optimally 18–24 bp.
C) GC Content
Total percentage of GC content is directly related to primer annealing temperature; usually optimal at 40–60%.
D) Ionic Strength
Mg²⁺ concentration is directly related to primer annealing temperature.
E) Primer Concentration
Directly related to binding probability and therefore to primer annealing temperature.
F) Presence of Additives
DMSO, glycerol, or formamide presence is inversely related to primer annealing temperature.
G) Target DNA
When the target contains GC-rich templates, a higher primer annealing temperature is often required — i.e. directly related.
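The dependence on length and GC content can be illustrated with the Wallace rule, a rough estimate intended for short oligos. The sketch below is only a first approximation: it ignores ionic strength, primer concentration, and additives listed above, the example primer is hypothetical, and real primer design would normally rely on a nearest-neighbor calculator matched to the polymerase.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * sum(primer.count(base) for base in "GC") / len(primer)


def wallace_tm(primer):
    """Wallace rule: about 2 degrees C per A/T and 4 degrees C per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


primer = "ATGCGTACCTGAAGCTAGGC"  # hypothetical 20-nt primer
print(len(primer), gc_content(primer), wallace_tm(primer))
```

Adding a G or C raises the estimate twice as much as adding an A or T, which is why both length and GC content appear in the list as directly related to annealing temperature.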
3. PCR vs. Restriction Enzyme Digests: Comparison of Two Methods for Creating Linear DNA Fragments
Mechanism
PCR uses a thermostable polymerase to exponentially amplify a target region using designed primers, starting from a tiny amount of template. It generates millions of identical copies through cycles of denaturation, annealing, and extension. A restriction enzyme (RE) digest, on the other hand, uses sequence-specific endonucleases that recognize short palindromic sequences (typically 4–8 bp) and cleave both strands at or near that site, producing non-identical fragments defined entirely by where those sites happen to fall in the existing DNA.
Ends Produced
PCR with a proofreading polymerase such as Phusion and standard primers produces blunt-ended fragments. With Gibson-specific primers, the overlap is built into the 5’ tail of the primer itself, so the linear product ends in the exact 20–22 bp overlap sequence that was designed. REs typically leave either sticky ends (commonly 4 bp 5’ or 3’ overhangs) or blunt ends, depending on the enzyme. Sticky ends can be ligated directly, but the approach is constrained by the availability of RE recognition sites in the template.
When Each Is Preferred
PCR is the clear choice when there is a need to introduce mutations, when no convenient RE site flanks the insert, or when customized overhangs are needed especially for Gibson assembly. RE digests are preferred when working with a well-characterized vector/insert system that already has compatible sites, when high fidelity without PCR-introduced errors is required, or when performing directional cloning into a backbone pre-cut with two different enzymes.
Error Profile
PCR can introduce point mutations at a rate that depends on polymerase fidelity. Phusion HF, used in this lab protocol, has an error rate approximately 50× lower than Taq, making it appropriate for mutagenesis work where only the intended changes should be introduced. RE digests introduce no sequence errors.
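To put the fidelity difference in perspective, the fraction of product molecules carrying at least one error can be approximated as error rate × amplicon length × number of doublings. The sketch below is a back-of-the-envelope estimate: the Phusion error rate is an assumed illustrative value, the Taq rate is simply taken as 50× higher as stated above, and 30 doublings for a 300 bp amplicon is a hypothetical scenario.

```python
def fraction_with_errors(error_rate, length_bp, doublings):
    """Approximate fraction of product molecules with at least one error.

    Linear approximation (reasonable while the result is well below 1):
    errors per base per doubling x amplicon length x number of doublings.
    """
    return error_rate * length_bp * doublings


PHUSION_RATE = 4.4e-7          # assumed illustrative value, errors/base/doubling
TAQ_RATE = 50 * PHUSION_RATE   # "approximately 50x" from the text above

phusion = fraction_with_errors(PHUSION_RATE, 300, 30)
taq = fraction_with_errors(TAQ_RATE, 300, 30)
print(f"Phusion: {phusion:.2%}  Taq: {taq:.2%}")
```

Under these assumptions roughly 0.4% of Phusion products would carry an error, against roughly 20% for Taq, which is the practical reason a high-fidelity enzyme matters when every clone is expected to contain only the intended changes.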
4. Ensuring DNA Sequences Are Appropriate for Gibson Cloning
A) Overlapping sequences must be present and correct
The Gibson exonuclease chews back 5’ ends to expose single-stranded 3’ tails that then anneal to complementary tails on the adjacent fragment. If PCR primers were designed with the correct 20–22 bp overlap matching the adjoining fragment, the overlap is automatically built in. For RE-digested fragments, such as a linearized backbone, the sticky ends themselves do not direct Gibson assembly: the adjoining PCR fragment must carry 20–22 bp of homology to the sequence at the cut site. Compatible sticky ends (e.g., BamHI and BglII both produce GATC overhangs) matter for conventional ligation cloning rather than for Gibson.
B) Fragment orientation must be correct (5’→3’)
Each primer and fragment sequence should be verified in Benchling or SnapGene to confirm that directionality is preserved. A reversed insert is the most common and often the most costly error.
C) Fragment length and concentration must be within working range
After gel electrophoresis, bands must appear at the expected sizes — backbone at approximately 3 kb and insert at approximately 300 bp as expected from the mUAV plasmid. Nanodrop concentration should exceed approximately 30 ng/µL.
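The overlap check in (A) can be automated before ordering primers. The sketch below is a minimal version under simplifying assumptions: both fragments are given 5’→3’ on the same strand, the sequences are hypothetical, and `shared_overlap` is an illustrative helper rather than a Benchling or SnapGene function.

```python
def shared_overlap(upstream, downstream, min_len=20, max_len=40):
    """Longest suffix of `upstream` that is also a prefix of `downstream`.

    Only overlaps between min_len and max_len bases are considered;
    returns an empty string if none is found.
    """
    upstream, downstream = upstream.upper(), downstream.upper()
    longest = min(max_len, len(upstream), len(downstream))
    for n in range(longest, min_len - 1, -1):
        if upstream[-n:] == downstream[:n]:
            return upstream[-n:]
    return ""


overlap = "GATTACAGGCTTAACCGTAGCT"          # hypothetical 22-nt junction
backbone_end = "CCCCAAAATTTT" + overlap      # 3' end of the upstream fragment
insert_start = overlap + "GGGGTTTTAAAA"      # 5' end of the downstream fragment

found = shared_overlap(backbone_end, insert_start)
print(len(found), found)
```

An empty result means the two fragments share no usable homology at that junction and the primers need to be redesigned.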
5. How Plasmid DNA Enters E. coli Cells During Transformation
The process involves heat-shock transformation with chemically competent DH5α cells. Competent cells are pre-treated with divalent cations (typically CaCl₂), which partially neutralize the negative charges on the lipopolysaccharide of the outer membrane and on the DNA backbone, reducing electrostatic repulsion. The 42°C heat shock, applied for exactly 45 seconds, creates a transient thermal imbalance that is thought to temporarily disrupt the membrane, opening pores or channels through which the plasmid can enter. The cells are immediately transferred back to ice to reseal the membrane. Recovery in SOC medium (Super Optimal broth with Catabolite repression) for 60 minutes at 37°C allows cells to repair the membrane, express the chloramphenicol resistance gene from the newly acquired plasmid, and begin dividing, so that only transformants survive when plated on selective media. Alternatively, electroporation uses a brief high-voltage pulse to open transient pores in the membrane and generally yields higher efficiency than heat shock.
6. Alternative Assembly Method: Golden Gate Assembly
Overview
Golden Gate Assembly is a DNA assembly method that leverages Type IIS restriction enzymes, most commonly BsaI or Esp3I, which cut outside their recognition sequence at a defined offset, generating customizable 4 bp overhangs. Unlike conventional REs, which leave their recognition site in the product, the Type IIS enzyme cuts away from itself so that the recognition site is excised along with the surrounding primer sequence, leaving a scar-free junction. Each fragment is PCR-amplified with primers that place the BsaI site at the outer end of the fragment, oriented so that the enzyme cuts inward, followed by the desired 4 bp overhang unique to that junction. The enzyme cuts all fragments simultaneously, exposing these complementary 4 bp tails, which then direct fragment annealing in the correct order, because only perfectly complementary overhangs will anneal stably. T4 DNA ligase seals the nicks in the same reaction tube. The reaction cycles repeatedly between the cutting temperature (~37°C) and the ligation temperature (~16°C), driving the equilibrium toward a fully assembled, circularized product. Golden Gate can assemble up to approximately 10 fragments simultaneously with high efficiency and directional fidelity, making it especially powerful for large combinatorial pathway assembly such as building multi-part biosynthetic operons, where Gibson’s exonuclease-dependent overlap system becomes less efficient.
Golden Gate vs. Gibson Assembly
Gibson uses a 5’ exonuclease to chew back fragments and generate long (20–40 bp) single-stranded overhangs for annealing, which then require a polymerase to fill gaps and a ligase to seal them. Golden Gate uses short 4 bp Type IIS-generated overhangs and no exonuclease — simpler biochemistry, but the overhangs are shorter and specificity depends entirely on the 4 bp sequence design. Ligation of wrong-order fragments can occur if overhang sets are not carefully designed to be unique. Gibson is more forgiving for large fragments; Golden Gate is faster and more multiplexable for modular, repetitive assemblies.
| Feature | Gibson Assembly | Golden Gate Assembly |
|---|---|---|
| Enzyme type | 5’ exonuclease + polymerase + ligase | Type IIS RE + T4 ligase |
| Overlap length | 20–40 bp | 4 bp |
| Scars left | None | None (RE site excised) |
| Max fragments | 5–6 efficiently | Up to 10+ |
| Best for | Large fragments, flexible design | Modular, combinatorial assemblies |
| Error risk | PCR errors at junctions | Wrong-order ligation if overhangs not unique |
Benchling Model
Part B. Asimov Kernel
Under the folder name: John_Adeyemo_Adedeji_Genspace
Week 7
Class Assignment — Week 7
Part A. Intracellular Artificial Neural Networks (IANNs)
1. Advantages of IANNs over Boolean Genetic Circuits
Boolean genetic circuits are fundamentally limited by their design logic: every input gets collapsed into a binary state, and the circuit operates on those discrete values. That works for simple switch-like decisions, but most physiologically relevant signals (metabolite concentrations, osmotic gradients, and quorum sensing molecule titres) exist on a continuum, and forcing them through a hard threshold discards information. IANNs avoid this by processing analog inputs directly, generating graded outputs that reflect the actual magnitude of the input rather than just which side of a threshold it fell on.
The deeper advantage is function approximation capacity. A sufficiently wide or deep network of gene-regulatory elements functioning as weighted summing nodes can approximate arbitrary continuous input-output relationships, which means you can in principle encode complex multi-factor decisions (for example, responding strongly when signal A is high, signal B is moderate, and signal C is low, but not when all three are high) without the combinatorial explosion of logic gates that an equivalent Boolean circuit would require. Practically, this also reduces the parameterisation burden: you train the network on data rather than manually calibrating each gate’s individual threshold and transfer function, which for complex Boolean circuits is a significant experimental cost.
Noise robustness is the third real advantage. Biological systems are stochastic, and Boolean circuits that depend on clean thresholding behave poorly when input signals are noisy or when component expression varies between cells. Analog processing distributes the computation across multiple nodes, so no single component’s noise dominates the output.
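The contrast between thresholded and graded processing can be made concrete with a small numerical sketch. The code below compares a Boolean AND gate with a single weighted-sum node that has a sigmoid response. The weights, bias, threshold, and input values are hypothetical numbers chosen for illustration; they are not parameters of any implemented genetic circuit.

```python
import math


def boolean_and(a, b, threshold=0.5):
    """Boolean gate: each input is collapsed to 0/1 before the decision."""
    return int(a > threshold and b > threshold)


def analog_node(inputs, weights, bias):
    """Weighted-sum node with a sigmoid response; output lies in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and bias (assumed for this example).
weights, bias = [4.0, 4.0], -4.0

# Inputs are normalised signal levels between 0 and 1.
for a, b in [(0.55, 0.55), (0.95, 0.95), (0.45, 0.95)]:
    gate = boolean_and(a, b)
    node = analog_node([a, b], weights, bias)
    print(f"a={a:.2f} b={b:.2f}  boolean={gate}  analog={node:.3f}")
```

The Boolean gate returns the same output for the barely-above-threshold pair and the strongly induced pair, while the analog node distinguishes them. The third case shows the gate switching fully off because one input sits just under the threshold, whereas the analog node still reports a substantial graded response.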
2. IANN Application — ÌṢỌ / Gut Sentinel Context
The continuous modelling capacity of an IANN is directly relevant to the gut sentinel problem. The challenge with engineering E. coli Nissle 1917 as a therapeutic probiotic is that its fitness and output behaviour depend on a genuinely continuous environmental landscape — luminal pH, competing commensal species densities, pathogen metabolite concentrations, mucus layer thickness, transit rate. A Boolean circuit could in principle be designed to activate effector expression above some threshold concentration of a target metabolite, but that assumes a single clean input drives the decision. Real gut ecology doesn’t work that way.
An IANN implemented in EcN could integrate multiple continuous environmental inputs simultaneously (tetrathionate concentration, competing species quorum signals, local oxygen tension) and produce a graded effector output proportional to the true threat level rather than a binary kill switch. This is particularly relevant to the evolutionary stability question in the ÌṢỌ framework: a cell population making graded decisions about resource allocation to effector production versus growth will, under selection, behave more like an evolutionarily stable strategy than one operating a hard switch that either maximally expresses a costly effector or doesn’t express it at all.
The limitations are substantial though. Implementing an IANN in a living cell requires physical instantiation of weighted connections as actual molecular interactions (protein-protein binding affinities, RNA regulatory elements, transcription factor binding strengths), all of which drift under evolutionary pressure, are sensitive to cellular metabolic state, and cannot be reconfigured in situ once the cell is deployed. Training the network computationally is achievable; translating the learned weights into specific DNA sequences encoding the required regulatory strengths is not straightforward, and verifying that the implemented network actually computes what you intended in a complex in vivo environment like the gut is a significant experimental challenge. There is also a metabolic cost argument: implementing even a shallow network requires expressing multiple non-native regulatory proteins simultaneously, which imposes a fitness burden that selection will work against over time.
3. Intracellular Multilayer Perceptron
Part B. Fungal Materials
1. Examples of Existing Fungal Materials and Their Applications
The most commercially visible fungal materials are mycelium-based composites — mycelial networks grown through agricultural waste substrates like hemp hurds or corn stalks, then heat-treated to halt growth and pressed into rigid forms. Companies like Ecovative have used this to produce packaging, acoustic panels, and leather-like textiles. In construction contexts, mycelium composites offer comparable compressive strength to expanded polystyrene at a fraction of the carbon cost, with full biodegradability at end of life.
In the medical context specifically, fungal-derived materials have a longer history than the mycelium-composite trend might suggest. Chitin and its deacetylated derivative chitosan (both obtainable from fungal cell walls) have been extensively evaluated as wound dressings, drug delivery scaffolds, and haemostatic agents. Chitosan’s cationic character at physiological pH allows it to interact electrostatically with bacterial membranes and negatively-charged wound exudate, giving it both antimicrobial and pro-coagulant properties without the immunogenicity concerns associated with animal-derived alternatives like collagen. For biosecurity and field-medicine applications, chitosan-based haemostatic dressings are already in clinical and military deployment; HemCon dressings were among the first to translate this directly into combat casualty care.
The disadvantages are real though. Batch-to-batch consistency in fungal-derived biomaterials is harder to control than synthetic polymer manufacturing: chitin extraction yields vary with growth conditions, and residual endotoxin or beta-glucan contamination from fungal cell wall debris poses immunogenicity risks in any implantable or injectable application. Regulatory classification is also still unsettled in many jurisdictions: a mycelium-derived scaffold sits awkwardly between a device and a biological, which complicates approval pathways considerably.
For biofabrication purposes, the more interesting frontier is using fungal hyphal networks as living scaffolds for tissue engineering — mycelial architecture naturally produces interconnected porous networks at scales relevant to vascularisation, something genuinely difficult to replicate by synthetic additive manufacturing. The limitation here is that you are working with a eukaryotic organism that has its own growth agenda, and getting predictable pore geometry without precise genetic intervention remains challenging.
2. Genetic Engineering in Fungi for Biopharmaceuticals and Protein Therapeutics
The application I find most compelling is using engineered Pichia pastoris (now reclassified as Komagataella phaffii) or Saccharomyces cerevisiae as chassis for producing complex glycosylated therapeutic proteins, biologics that bacteria fundamentally cannot make correctly.
This is where the core advantage of fungal synthetic biology over bacterial systems becomes concrete: post-translational modification. Bacteria lack the endoplasmic reticulum machinery for N-linked glycosylation, disulfide bond formation in a controlled oxidising environment, and proper signal peptide processing for secretion. A therapeutic antibody fragment, a vaccine antigen, or a receptor-binding protein domain that depends on correct glycosylation for receptor recognition, serum half-life, or effector function simply cannot be produced functionally in E. coli without extensive refolding steps that introduce batch variability and reduce yield. Yeast do all of this co-translationally in a compartmentalised secretory pathway that is genuinely homologous to mammalian cells.
For vaccinology specifically, yeast-expressed virus-like particles are already an established platform: the hepatitis B surface antigen in Engerix-B is produced in S. cerevisiae, and the HPV L1 capsid proteins in Gardasil are produced in the same host. The self-assembly capacity of these proteins into immunogenic particles in a yeast secretory environment is something a bacterial chassis would struggle with. Engineering Pichia further, by humanising its N-glycosylation pathway to reduce the hypermannose patterns that drive immunogenicity in native yeast glycoproteins, moves the output closer to what a mammalian CHO cell would produce, but at fermentation costs that are orders of magnitude lower.
The limitations worth being honest about: yeast genetic toolkits are less mature than bacterial ones. CRISPR-based genome editing in S. cerevisiae is well-established, but in non-model yeasts the efficiency drops sharply. Promoter libraries, ribosome binding site tuning, and the kind of fine transcriptional control you take for granted in E. coli require considerably more development effort in a fungal host. Secretion titres for complex proteins also remain lower than in CHO cells for the most demanding biologics, and yeast glycosylation, even with humanisation efforts, is still not identical to human-type glycans, which matters for Fc-mediated effector functions in therapeutic antibody applications.
Part C. First DNA Twist Order
The microcin H47 (MccH47) expression cassette was designed for cloning into pUC19, a high-copy ColE1-origin plasmid carrying ampicillin resistance. pUC19 was selected primarily for its well-characterised cloning sites and broad compatibility with standard E. coli transformation protocols. These are practical considerations, given that the immediate goal is sequence verification rather than stable expression. The MccH47 insert is flanked by EcoRI and HindIII sites for directional cloning into the multiple cloning site. The complete annotated construct is deposited in the class Benchling folder as MccH47_pUC19_EcN_construct.
For downstream ÌṢỌ deployment, the cassette would need migration to a lower-copy backbone — pSC101 or a chromosomal integration vector — to reduce metabolic burden on the EcN chassis and improve evolutionary stability under selection.
1. Advantages of Cell-Free Protein Synthesis Over In Vivo Methods
Cell-free systems decouple protein production from cell viability, giving you direct control over reaction composition, temperature, redox state, and cofactor concentrations, none of which are easily tunable in living cells.
Two cases where CFPS outperforms cell-based production:
Viral biosensors / NTDs: Rapid, open-system format allows same-day prototyping of diagnostic reagents without biosafety constraints of live pathogen handling.
Accessible diagnostic biomarkers (e.g., creatinine sensors for CKD): Low-cost E. coli extracts enable point-of-care biosensor manufacturing without fermentation infrastructure.
2. Main Components of a Cell-Free Expression System
| Component | Role |
|---|---|
| A. Cell Extract | Supplies ribosomes, chaperones, tRNA, and transcription/translation machinery. |
| B. DNA/mRNA Template | Carries the gene of interest; linear PCR products or circular plasmids both work. |
| C. Energy Sources (ATP/GTP) | Drive ribosome translocation, aminoacyl-tRNA charging, and mRNA capping. |
| D. Amino Acids | Provide the building blocks; must be supplied exogenously since there is no cellular biosynthesis. |
| E. Reaction Buffers | Maintain pH, ionic strength, and Mg²⁺ concentration critical for ribosome activity. |
3. Why Energy Regeneration Is Critical in Cell-Free Systems
Without regeneration, ATP is exhausted within minutes and translation stalls before any useful yield accumulates.
Method — Phosphoenolpyruvate (PEP) Regeneration:
PEP donates a phosphate group to ADP via pyruvate kinase, regenerating ATP continuously throughout the reaction.
It is the most widely used system in E. coli-based CFPS; simple to implement and well-characterised.
Alternatives:
Glucose-6-phosphate / glycolysis: Cost-effective; couples to endogenous glycolytic enzymes in the extract.
Creatine phosphate / creatine kinase: Common in eukaryotic systems; mimics the muscle energy buffering mechanism.
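The effect of a phosphate-donor pool on reaction lifetime can be illustrated with a deliberately simple toy model. The sketch below is not a validated kinetic model of CFPS: the first-order ATP consumption term, the mass-action regeneration term, and every rate constant and concentration are assumptions chosen only to show the qualitative behaviour (fast ATP exhaustion without a donor, extended lifetime with one).

```python
def simulate_atp(pep0, atp0=1.5, k_use=0.5, k_regen=0.2, dt=0.01, t_end=120.0):
    """Euler integration of a toy ATP/ADP/PEP model.

    Units are nominally mM and minutes; all parameter values are hypothetical.
    ATP is consumed at rate k_use * ATP, and PEP re-phosphorylates ADP at
    rate k_regen * PEP * ADP. Returns the first time at which ATP drops
    below 10% of its starting value, or None if that never happens
    before t_end.
    """
    atp, adp, pep = atp0, 0.0, pep0
    t = 0.0
    while t < t_end:
        use = k_use * atp
        regen = k_regen * pep * adp
        atp += (regen - use) * dt
        adp += (use - regen) * dt
        pep -= regen * dt
        t += dt
        if atp < 0.1 * atp0:
            return round(t, 2)
    return None


without_regen = simulate_atp(pep0=0.0)
with_regen = simulate_atp(pep0=30.0)
print("ATP below 10% without PEP (min):", without_regen)
print("ATP below 10% with 30 mM PEP (min):", with_regen)
```

In this toy setting the no-donor reaction runs out of ATP within a few minutes, while the PEP-supplemented reaction holds ATP up until the donor pool itself is spent. A more realistic treatment would also track inorganic phosphate accumulation, which this sketch ignores.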
4. Prokaryotic vs. Eukaryotic Cell-Free Expression Systems
| Feature | Prokaryotic (E. coli) | Eukaryotic (Wheat Germ / Mammalian) |
|---|---|---|
| Yield | High (>1 mg/mL typical) | Moderate–High (system-dependent) |
| Cost | Low | High |
| Speed | 2–4 hours | Longer incubation often needed |
| PTMs (Glycosylation) | Absent natively | Endogenous microsomes enable PTMs |
| Folding | Inclusion bodies common | Excellent, specialised chaperones |
| Best Use | High-throughput, simple soluble proteins | Complex, transmembrane, or therapeutic proteins |
Protein choice — Prokaryotic: GFP
GFP is small, soluble, and folds spontaneously without PTMs — perfect for E. coli CFPS.
Fluorescence output doubles as a real-time yield reporter; ideal for rapid system validation.
High-throughput expression kits for GFP are cheap, reproducible, and produce results in under 4 hours.
Protein choice — Eukaryotic (CHO/HeLa): IgG Monoclonal Antibody
IgG requires N-glycosylation, disulfide bond formation, and ER-assisted folding for activity.
CHO/HeLa lysates contain ER-derived microsomes with glycosylation enzymes and PDI — E. coli cannot replicate this.
Attempting IgG expression in prokaryotic CFPS typically yields insoluble, non-functional aggregates.
5. Designing a Cell-Free Experiment for Membrane Protein Expression
Membrane proteins (MPs) are notoriously difficult — aggregation, low yield, and incorrect insertion are the default failure modes. My approach centres on a Continuous Exchange Cell-Free (CECF) setup with deliberate hydrophobic stabilisation from the moment of synthesis.
Experimental Design:
Template: PCR-derived linear DNA with T7 promoter; codon-optimised for the chosen lysate; RBS positioned ~11 nt upstream of ATG.
Chassis: E. coli extract for yield; insect or HeLa lysate if the MP needs native PTMs or microsomal insertion.
Hydrophobic additives: Supplement with detergents (Brij-35, LMNG) or nanodiscs directly in the reaction to catch the MP co-translationally.
CECF mode: Use a feeding solution at 10× the reaction volume to replenish ATP and amino acids and to dilute inhibitory byproducts over 4–16 hours.
Temperature: Start at 25–30 °C to slow translation and reduce aggregation kinetics.
Challenges and Solutions:
Aggregation: Add nanodiscs or lipid vesicles to provide a bilayer scaffold immediately upon synthesis.
mRNA/DNA degradation: Use GamS protein to block RecBCD exonuclease activity on linear templates.
Incorrect folding: Introduce pre-formed inverted membrane vesicles or switch to insect lysate with native microsomes.
Codon bias (eukaryotic MP in E. coli): Codon-optimise the sequence or switch to wheat germ / rabbit reticulocyte lysate.
Low-throughput screening: Miniaturise to microfluidic volumes; automate condition matrices varying detergent type and temperature.
6. Troubleshooting Low Yield in a Cell-Free System
Reason 1 — Protein Aggregation / Misfolding:
Misfolded hydrophobic stretches form inclusion bodies, reducing soluble yield.
Fix: Drop incubation temperature to 25 °C to slow translation and buy time for folding.
Fix: Add solubility tags (Mocr, GST) or co-express chaperones (DnaK/DnaJ/GrpE) in the reaction.
Reason 2 — Premature Energy Depletion:
PEP or creatine phosphate runs out before the reaction plateau, stalling ribosomes mid-synthesis.
Fix: Switch to a CECF dialysis setup to continuously feed energy substrates and remove Pi accumulation.
Fix: Supplement with additional glucose as a secondary energy source to extend reaction lifetime.
Fix: Verify T7 RNA polymerase activity separately; use circular plasmid rather than linear DNA if exonuclease degradation is suspected.
2. Homework question from Kate Adamala
Overview
The Synthetic Neuronal Mimic (SNM) is a liposome-based minimal cell designed as an interactive, safe, and visual educational tool for youth STEM leaders to understand the impact of drugs on biological systems.
1. Function Description
a. What does the SNM do? What is the input and output?
Function: The SNM acts as a miniature “biological laboratory” encapsulating a cell-free TX/TL system that produces a fluorescent signal only when a specific drug molecule is present.
Input: A drug molecule (e.g. nicotine analog, stimulant) in the surrounding environment, which diffuses through the synthetic membrane via a pore channel.
Output: sfGFP fluorescence, visible under a portable fluorescence microscope. Signal intensity is a direct visual proxy for drug dose or effect magnitude.
b. Could cell-free TX/TL alone, without encapsulation, realise this function?
No. TX/TL in a tube produces the protein but loses the educational purpose entirely.
Encapsulation creates a compartmentalised entity that behaves like a cell, not a chemical mix.
The drug must cross a synthetic membrane before the circuit responds, directly mirroring how neurons work.
Without encapsulation, you have chemistry. With it, you have a cell.
c. Could a genetically modified natural cell realise this function?
Yes, but it is the wrong tool for this context.
Engineered E. coli or yeast would require biosafety containment, specialised culture media, and are prone to mutation.
The SNM contains no living organism, making it safer to handle in outreach settings.
It is more predictable, easier to explain from first principles, and requires no microbiology infrastructure.
d. Desired outcome of SNM operation
Youth STEM leaders directly observe drug-responsive circuit logic in real time.
Input A (nicotine analog) produces Output B (high-intensity GFP fluorescence).
Participants leave with a concrete, visual understanding of how microscopic chemical signals produce measurable biological responses.
The experience serves as a practical entry point into pharmacology and neuroscience.
2. Component Design
a. Membrane composition
Phospholipid bilayer: POPC (1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine) and cholesterol at an 80:20 molar ratio.
Cholesterol increases membrane rigidity and reduces passive leakage of internal components.
Alpha-hemolysin (alpha-HL, gene: hla) is embedded in the bilayer to create ~2 nm pores that admit small molecules up to ~2 kDa.
b. Internal encapsulation
E. coli S30 or PUREsystem cell-free extract: supplies ribosomes, RNA polymerase, tRNA, and chaperones.
Plasmid encoding sfGFP under a TetR-repressible promoter (pTet).
ATP, GTP, and a full complement of amino acids.
PEP-based ATP regeneration system (phosphoenolpyruvate + pyruvate kinase).
RNase inhibitor (e.g. RiboLock) to protect mRNA from endogenous nuclease activity.
c. TX/TL system origin: bacterial or mammalian?
Bacterial (E. coli) extract is sufficient for this design.
TetR/pTet is fully functional in prokaryotic cell-free systems; no mammalian system is required.
E. coli extract is low-cost, freeze-dryable for outreach kit distribution, and yields high sfGFP concentrations within 2 to 4 hours.
A mammalian system would only be necessary if the circuit required PTMs or mammalian-specific promoter logic, which this design does not.
d. Communication with the environment
The SNM communicates via passive diffusion through alpha-HL pores.
The drug analog (small molecule, up to ~2 kDa) enters through the pore and de-represses the TetR-controlled sfGFP promoter.
No active transport machinery or membrane receptors are required.
Reporter gene: sfGFP (superfolder GFP); faster folding and higher quantum yield than wild-type GFP
Repressor gene: tetR (TetR repressor); released by tetracycline analogs or engineered small-molecule inducers
Promoter: pTet (tetO2 operator); drives sfGFP expression, OFF with TetR present, ON when inducer is present
Energy system: PEP/pyruvate kinase for ATP regeneration; supplemented with creatine phosphate for extended reactions
b. Measuring system function
Primary readout: Fluorescence microscopy using a portable LED scope (470 nm excitation / 510 nm emission); visible GFP signal confirms circuit activation.
Quantification: Plate reader measuring fluorescence intensity (Ex 485 nm / Em 510 nm) as a function of drug concentration to generate a dose-response curve.
Negative control: SNMs incubated without drug input; no fluorescence expected, confirming the circuit is OFF at baseline.
Positive control: SNMs with a constitutive always-on sfGFP construct; calibrates maximum signal and confirms TX/TL machinery is functional.
Validation metric: Signal-to-noise ratio of drug-treated vs. no-drug control; a minimum 5-fold induction threshold confirms adequate circuit sensitivity.
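The dose-response curve and the 5-fold induction criterion described above can be sketched numerically. The code below assumes a Hill-type response, which is a common but here unverified modelling choice for this circuit. The basal signal, maximum signal, half-maximal dose, Hill coefficient, and dose series are all hypothetical placeholders that would be replaced by plate-reader measurements.

```python
def hill_response(dose, basal=100.0, vmax=2000.0, k_half=10.0, n=2.0):
    """Fluorescence (arbitrary units) as a Hill function of inducer dose.

    All default parameter values are hypothetical.
    """
    return basal + (vmax - basal) * dose**n / (k_half**n + dose**n)


def fold_induction(signal, no_drug_control):
    """Ratio of drug-treated signal to the no-drug control."""
    return signal / no_drug_control


def passes_threshold(signal, no_drug_control, min_fold=5.0):
    """True if the induction meets the minimum fold-change criterion."""
    return fold_induction(signal, no_drug_control) >= min_fold


doses = [0, 1, 3, 10, 30, 100]  # hypothetical inducer concentrations
control = hill_response(0)
for dose in doses:
    signal = hill_response(dose)
    fold = fold_induction(signal, control)
    print(f"dose={dose:>3}  signal={signal:8.1f}  fold={fold:5.2f}  "
          f"pass={passes_threshold(signal, control)}")
```

With real data, `hill_response` would be replaced by measured fluorescence values, and the same `fold_induction` check would be applied to each drug concentration against the no-drug control wells.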
3. Homework question from Peter Nguyen
Application Field
Architecture — wellness-focused interior design using nature-based, intelligent building materials.
One-Sentence Pitch
The Neuro-BioWall is a modular interior wall panel system embedding freeze-dried cell-free biosensors within living plant scaffolds to detect indoor air pollutants and respond with enzyme-triggered aromatherapy, bridging passive biophilic design and active biological intelligence.
How It Works
The system consists of 3D-printed cellulose/alginate panels hosting living Pothos plants, with freeze-dried cell-free reactions integrated directly into the plant’s nutrient-delivery interface. When indoor VOCs such as formaldehyde exceed healthy thresholds, a toehold switch genetic circuit embedded in the cell-free system is activated, initiating synthesis of a reporter enzyme. That enzyme acts on a co-encapsulated, latent aromatherapeutic substrate to release a localised calming scent such as lavender or hinoki. Simultaneously, a colorimetric output produces a visible colour change in the biopolymer panel, giving occupants a passive, non-electronic visual cue to ventilate or pause.
Step-by-step workflow:
Pollutant intake: Indoor air flows through the porous biocellulose pot interface where plant roots and cell-free sensors reside.
Sensing: The cell-free toehold switch circuit triggers when VOC concentrations exceed the design threshold.
Wellness output: The activated circuit produces an esterase enzyme that breaks down a sealed aromatherapeutic compound, releasing scent.
Visual signal: Colorimetric reporter causes a visible change in the biopolymer scaffold, prompting occupants to take action.
Societal Challenge and Market Need
Sick building syndrome affects an estimated 30% of office buildings globally, linked to VOC accumulation from furniture, adhesives, and cleaning products.
Existing solutions are either passive (plants, carbon filters) with no active feedback, or electronic (air quality monitors) with no biological or sensory integration.
The Neuro-BioWall closes this gap: it monitors, responds, and communicates without electronics, live microbes, or occupant intervention.
It targets the growing wellness architecture and biophilic design market, where demand for nature-integrated, low-maintenance intelligent building materials is expanding rapidly.
Addressing Cell-Free System Limitations
Activation with water
The cell-free components are freeze-dried directly into the hydrogel of the plant nutrient scaffold.
Activation occurs automatically during the plant’s regular watering cycle, requiring no separate triggering step or electronic control.
Long-term stability
Components are lyophilised in a trehalose-based sugar matrix and encapsulated within a protective polymer mesh.
This configuration maintains activity at room temperature for 3 to 6 months without refrigeration.
The trehalose matrix is a well-established stabilisation strategy for cell-free systems in low-resource and distributed deployment contexts.
One-time use
The sensor is packaged as a replaceable modular bio-cartridge that clips in and out of the living panel.
Spent cartridges are fully biodegradable, consistent with the cellulose/alginate material system.
Routine cartridge replacement is designed as a simple maintenance step, analogous to changing a water filter, rather than a structural intervention.
Integrated Material Summary
| Component | Material / Gene / System |
|---|---|
| Panel scaffold | 3D-printed cellulose / sodium alginate composite |
| Living element | Pothos (Epipremnum aureum), a known VOC-absorbing houseplant |
| Stabilisation matrix | Trehalose-based lyophilisation matrix |
| Sensing circuit | Toehold switch genetic circuit, VOC-responsive |
| Reporter enzyme | Esterase (e.g. estA from Pseudomonas fluorescens) |
| Aromatic substrate | Latent linalyl acetate ester (releases lavender/hinoki scent upon cleavage) |
| Colorimetric reporter | Catechol-responsive chromogenic substrate for visual panel signal |
| TX/TL chassis | E. coli S30 cell-free extract, freeze-dried |
Why This Works as a Platform
No living microbes means no biosafety concerns in occupied buildings.
No electronics means no power dependency, no failure modes from software or connectivity.
The plant’s natural water cycle doubles as the activation mechanism, making the system self-sustaining within normal building maintenance routines.
Modular cartridge design allows iterative sensor upgrades without replacing the structural panel, extending product lifetime and reducing material waste.
4. Homework question from Ally Huang
Overview
MycoLab-1 proposes a minimally functional, university-grade biological sciences laboratory for deep-space environments, built from mycelium-based composite (MBC) infrastructure and powered by freeze-dried cell-free (CFPS) molecular biology systems. The laboratory requires no refrigeration chain, no live microbial culture infrastructure, and no heavy equipment payload — making advanced biological experimentation feasible aboard lunar outposts, Mars transit vehicles, or orbital stations where mass and power budgets are severe constraints.
1. Background: The Space Biology Challenge
Long-duration spaceflight exposes crew to ionising radiation, microgravity-induced immune dysregulation, and chronic oxidative stress — all of which accelerate cellular ageing, impair DNA repair fidelity, and compromise host-pathogen defence. These stressors converge on gene expression and protein homeostasis in ways that are still poorly characterised in real microgravity. Conducting molecular biology experiments in space currently demands cold-chain infrastructure and complex equipment incompatible with deep-space payload constraints. A lightweight, room-temperature-stable biological laboratory would transform our ability to study and respond to these challenges in real time, on-orbit.
2. Molecular and Genetic Targets
Primary targets:
RAD51 and BRCA2 — homologous recombination DNA repair genes; expression altered under ionising radiation and microgravity.
Broad transcriptomic profiling via cell-free ribosome display and lateral flow readout as a low-mass omics proxy.
3. Target Relevance to the Space Biology Challenge
Radiation-induced double-strand breaks require RAD51-mediated homologous recombination for faithful repair; suppression of this pathway under microgravity increases mutation accumulation rates. NRF2 governs the antioxidant response to reactive oxygen species generated by cosmic radiation. Both pathways are dynamically regulated at the transcript and protein level, making them ideal targets for a cell-free expression-based sensing platform. Monitoring their activity in real time, using on-orbit synthesised reporters, would provide actionable data on crew molecular health without requiring live-cell culture or centrifuge-dependent assays.
4. Hypothesis and Research Goal
Hypothesis: A freeze-dried cell-free biosensor system, stabilised in trehalose matrix and embedded in mycelium-derived structural panels, can perform on-orbit transcriptomic monitoring of radiation-responsive and oxidative stress pathways (RAD51, NRF2) with sensitivity equivalent to bench-grade RT-qPCR, at a fraction of the mass and power budget.
Reasoning: CFPS reactions have been lyophilised and reactivated months later with retained fidelity. Mycelium composites provide structural, thermal, and radioprotective properties that passive aluminium panels cannot. Combining both technologies creates a laboratory architecture where the walls, benchtops, and insulation panels are themselves functional biological substrates, not passive enclosures. If validated, this platform collapses the payload mass requirement for a functional molecular biology laboratory by an order of magnitude.
5. Experimental Plan
Samples and model organisms
Primary sample: Human saliva or fingerprick blood from crew members as minimally invasive nucleic acid sources.
Biological model: Arabidopsis thaliana seedlings grown in mycelium substrate panels as a parallel plant stress model.
Radioprotection model: Cladosporium sphaerospermum melanised fungal cultures integrated into habitat wall panels as a living radioprotective layer.
Core experimental modules
| Module | Function | Cell-Free Component |
|---|---|---|
| RAD51/NRF2 transcript sensor | Toehold switch circuits triggered by target mRNA from crew blood/saliva | E. coli S30 CFPS, lyophilised in trehalose |
| sfGFP / colorimetric reporter | Fluorescence or colour readout of circuit activation | sfGFP (sfgfp) or catechol oxidase reporter |
| Ribosome display panel | Low-mass omics: cell-free translation of stress-responsive transcripts | PUREsystem, freeze-dried |
| Lateral flow readout | Equipment-free protein detection strip for crew-facing results | Anti-GFP or anti-His-tag lateral flow strips |
| Mycelium panel biosensor integration | Structural panels double as stable housing for CFPS cartridges | CFPS cartridge embedded in Ganoderma MBC panel |
Mycelium laboratory infrastructure
Structural panels: Ganoderma lucidum mycelium grown on processed regolith simulant or cellulose waste; compression-moulded into benchtop, wall, and insulation panels.
Radioprotective skin layer: Melanised Cladosporium sphaerospermum integrated into outer wall MBC composite; demonstrated on-orbit aboard the ISS to attenuate ionising radiation by up to 2.42-fold.
Self-repair capacity: Living mycelium panels can re-colonise micro-fractures when rehydrated, reducing structural maintenance payload.
Thermal insulation: MBC panels provide thermal insulation comparable to expanded polystyrene at one-third the density, critical for temperature-sensitive CFPS cartridge stability.
CFPS cartridge design
Each cartridge is a replaceable unit containing lyophilised E. coli S30 extract, toehold switch plasmid, energy regeneration mix (PEP/pyruvate kinase), and amino acids.
Activation: crew adds 15 to 30 microlitres of rehydration buffer (sterile water or saliva directly).
Readout: fluorescence measured with a handheld LED torch and smartphone camera, or colorimetric readout read visually.
Cartridge stability: 12 months at room temperature in sealed foil pouch; trehalose matrix validated for long-duration storage.
Each cartridge is single-use, biodegradable, and compatible with mycelium composting for waste processing closure.
6. Addressing Space-Environment Constraints
| Constraint | Challenge | Solution |
|---|---|---|
| Mass budget | Traditional lab equipment is prohibitively heavy | CFPS replaces PCR machines, gel rigs, centrifuges; mycelium grown in situ from waste feedstock |
| Cold chain | Enzymes, reagents degrade without refrigeration | Lyophilisation in trehalose; stable at room temperature for 6 to 12 months |
| Power budget | Fluorescence readers and thermocyclers draw significant power | Lateral flow strips and colorimetric readouts require zero power; LED torch for fluorescence |
| Radiation | Ionising radiation degrades DNA reagents and structural materials | Lyophilised DNA in trehalose is radiation-hardened; C. sphaerospermum wall layer attenuates dose |
| Waste processing | Chemical and biological waste accumulates | Biodegradable cartridges fed back into mycelium substrate as nutrient source |
| Crew skill ceiling | Not all crew are trained molecular biologists | Toehold switch cartridges operate as simple add-water diagnostics; results are visual and immediate |
7. Significance
MycoLab-1 addresses three converging needs in space exploration. First, it provides a credible molecular health monitoring platform for crew on multi-year missions beyond low Earth orbit where medical evacuation is not an option. Second, it demonstrates in-situ resource utilisation for laboratory infrastructure, growing structural and functional lab components from waste streams rather than Earth-launched payloads. Third, it creates a proof-of-concept for distributed biological laboratories in resource-constrained environments on Earth, including field hospitals, remote clinics, and low-income research institutions. The same system that monitors astronaut DNA repair fidelity on a Mars transit vehicle could monitor antibiotic resistance gene expression in a rural West African clinic.
Key Genes and Components Reference
Gene / Component | Source Organism | Function in MycoLab-1
RAD51 | Homo sapiens | DNA repair; target transcript for radiation damage sensor
NFE2L2 (NRF2) | Homo sapiens | Oxidative stress master regulator; target for ROS sensor circuit
sfgfp | Engineered (jellyfish origin) | Fluorescent reporter for toehold switch activation
Toehold switch RNA | Synthetic | Riboswitch that translates only in presence of target mRNA
dhN-melanin biosynthetic cluster | Cladosporium sphaerospermum | Melanin synthesis; radioprotective wall layer
hla (alpha-hemolysin) | Staphylococcus aureus | Optional pore channel for diffusion-based sample input into CFPS cartridge
Mycelium scaffold | Ganoderma lucidum | Structural panels, benchtops, insulation, and waste-derived growth substrate
Part B. Individual Final Project
Week 10
Class Assignment — Week 10
Homework: Final Project
ÌṢỌ is currently computational, so the “measurements” in scope are model outputs rather than physical assays. The key quantities I track are: steady-state pathogen kill rate as a function of MccH47 production, growth rate as a function of expression burden δ, biosensor activation ratio across tetrathionate concentrations, and containment escape probability over generational time. These are computed from ODE integration and Moran process simulation rather than physical instruments, but they map directly onto measurable biological quantities that would need experimental validation in a future phase of the project.
Priority measurements in the wet-lab phase would be:
Circuit output and reporter quantification: Fluorescence intensity of the sfGFP reporter (co-expressed with MccH47 under the TtrR-activated promoter) measured by plate-reader fluorimetry across a tetrathionate concentration gradient. This gives the dose-response curve the biosensor model predicts and directly benchmarks the Hill coefficient and activation threshold used in the ODE.
MccH47 production and secretion: Liquid chromatography coupled to mass spectrometry (LC-MS) would confirm MccH47 identity and quantify extracellular concentration. Given the focus on intact protein mass measurement, a Waters-type Xevo QTof system running native LC-MS would resolve the microcin’s intact mass (~4.9 kDa) and confirm post-translational processing of the precursor peptide, which is biologically relevant since MccH47 requires leader peptide cleavage for activity.
Pathogen kill kinetics: Colony-forming unit counts on selective media over time, co-incubating engineered EcN with Salmonella Typhimurium at defined tetrathionate concentrations. This parameterizes k_kill directly.
Auxotrophy confirmation and escape frequency: Growth curves in DAP-depleted media confirm the ΔdapA deletion is clean. A fluctuation assay (Luria-Delbrück) on large populations estimates reversion frequency, which feeds directly into the containment escape model.
Growth burden: OD600 time-course comparing wild-type EcN, circuit-off EcN, and circuit-induced EcN. The growth rate differential quantifies δ experimentally.
The computational figures being produced now are designed to be directly comparable to these future measurements: every parameter in the model has a specific assay that would validate or revise it.
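To make the link between the biosensor model and the plate-reader dose-response concrete, here is a minimal sketch of an activating Hill function and the activation ratio it implies. The function names (`hill_activation`, `activation_ratio`) and every numeric value are hypothetical placeholders for illustration; they are not the parameters used in the ÌṢỌ model.

```python
def hill_activation(ligand, k_half, n, basal=0.0, vmax=1.0):
    """Promoter output for an activating Hill function.

    ligand : inducer concentration (e.g. tetrathionate), same units as k_half
    k_half : concentration giving half-maximal induction (activation threshold)
    n      : Hill coefficient (steepness of the response)
    basal  : leaky output with no inducer
    vmax   : output at saturating inducer
    """
    if ligand < 0:
        raise ValueError("ligand concentration must be non-negative")
    x = ligand ** n
    return basal + (vmax - basal) * x / (k_half ** n + x)


def activation_ratio(ligand, k_half, n, basal, vmax=1.0):
    """Fold change of induced output over the uninduced (basal) output."""
    return hill_activation(ligand, k_half, n, basal, vmax) / basal


if __name__ == "__main__":
    # Placeholder parameters, chosen only to show the shape of the curve.
    for conc in (0.0, 0.1, 0.5, 1.0, 2.0, 10.0):
        out = hill_activation(conc, k_half=1.0, n=2.0, basal=0.1)
        print(f"ligand={conc:5.1f}  output={out:.3f}")
```

Fitting `k_half` and `n` to the measured fluorescence gradient is what would validate or revise the values assumed in the ODE.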
Part A. Waters Part I — Molecular Weight
1. Theoretical pI/Mw: 5.90 / 28006.60
2.1 Determination of z for adjacent pair of peaks using the given formula
From the spectrum, a good clean pair is:
• (m/z)_n ≈ 933
• (m/z)_{n+1} ≈ 903
These peaks belong to the same charge-state envelope and differ by one charge, and the ~30 m/z spacing is realistic for a ~28 kDa protein at z ≈ 30.
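The adjacent-peak calculation can be sketched as follows. This assumes the standard relationship for neighbouring charge states, z = ((m/z)_{n+1} − m_H) / ((m/z)_n − (m/z)_{n+1}), where (m/z)_n is the higher-m/z peak carrying z charges and m_H ≈ 1.00728 Da is the proton mass. The function names are illustrative, and the peak positions are the approximate values read from the spectrum above, so the result is only as good as those readings.

```python
PROTON = 1.00728  # mass of a proton in Da


def charge_from_adjacent(mz_n, mz_n_plus_1):
    """Charge z of the peak at mz_n, given its neighbour at z + 1 (lower m/z)."""
    return (mz_n_plus_1 - PROTON) / (mz_n - mz_n_plus_1)


def neutral_mass(mz, z):
    """Neutral mass in Da for an ion observed at m/z with z protons attached."""
    return z * (mz - PROTON)


if __name__ == "__main__":
    z = round(charge_from_adjacent(933.0, 903.0))
    mass = neutral_mass(933.0, z)
    theoretical = 28006.60
    print(f"z = {z}, MW = {mass:.1f} Da")
    print(f"relative error = {(mass - theoretical) / theoretical * 100:.2f} %")
```

With these two approximate peaks the estimate comes out at z = 30 and roughly 27.96 kDa, within about 0.2% of the theoretical 28,006.60 Da.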
2.2 MW of the protein using the scientific relationship
2.3 Accuracy of the measurement between both methods
Compared with theoretical MW
Typical values:
• eGFP alone ≈ 26.9–27.0 kDa
• With Histidine tag + linker → ≈ 27.5–28.5 kDa
2.4 Charged state for the zoomed-in peak in the mass spectrum picture
No, the charge state cannot be determined from the zoomed-in peak. This is because there are no clearly resolved adjacent charge-state peaks in that region of the spectrum. The signal appears as a single broadened peak without the necessary spacing pattern required to apply the adjacent charge-state method.
Part B. Waters Part II — Secondary/Tertiary structure
1. Native vs Denatured Protein conformations
When a protein is in its native, folded state, the tertiary structure buries most basic residues (lysine, arginine, histidine) inside the hydrophobic core or locks them into salt bridges and hydrogen bonds. In native electrospray ionisation (ESI), these residues are inaccessible to protonation, so the protein acquires relatively few charges, producing ions at high m/z values. This is exactly what the red spectrum shows, with the dominant ion envelope centred around m/z 2545.
When a protein unfolds, the polypeptide chain opens up and all basic residues become solvent-exposed and available for protonation. The same protein now picks up far more protons, producing many charge states compressed into the low m/z region. The green (denatured) spectrum shows this clearly: the charge state envelope spans roughly m/z 600 to 1300, with peaks spaced closely together because many adjacent charge states (z ≈ 20 through z ≈ 40+) are simultaneously represented.
The mass spectrometer determines fold state indirectly: it measures the m/z of each ion. Since molecular weight is unchanged by denaturation, the shift in the m/z envelope directly reflects a change in charge state z. Higher charge means lower m/z for the same mass. The instrument does not detect conformation directly; it detects the charge acquired during ESI, which is a proxy for solvent-accessible surface area and protonatable site exposure, both of which are determined by the protein’s fold state.
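The zoomed inset in the native (red) spectrum supports this interpretation. For a ~28 kDa protein observed at m/z ~2545, the charge state is z ≈ 28,000 / 2545 ≈ 11, which corresponds to an adjacent-isotope spacing of about 1/11 ≈ 0.09 in m/z. The ~0.18 spacing read from the inset is consistent with z = 11 only if it spans two isotope steps (2/0.18 ≈ 11); taken as an adjacent-isotope spacing it would give z = 1/0.18 ≈ 5.6 and a mass near 14 kDa, which does not match the intact monomer. A native folded protein the size of eGFP carrying only 11 charges is consistent with a compact structure where most basic residues are sequestered. The denatured form distributes that same mass across charge states of z = 20 or higher, shifting the entire envelope into the low m/z window seen in the green spectrum.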
2. Charge state of the peak findings
Identifying the charge state from isotope spacing
Looking at the native mass spectrum (Figure 3), the peak cluster around m/z 2799–2800 shows two resolved isotope peaks labeled 2799.4199 and 2799.6365.
The isotope spacing is 2799.6365 − 2799.4199 = 0.2166 Da
Since adjacent isotope peaks within a charge state envelope are separated by 1 Da / z, the charge state is z = 1 / 0.2166 ≈ 4.6, which rounds to +5
The charge state of the peak at ~2800 is +5.
How can you tell?
In ESI-MS, adjacent isotope peaks differ by one neutron (about 1 Da). Distributed across z charges, that 1 Da difference appears as a spacing of 1/z in the m/z spectrum. The ~0.2 Da spacing observed here gives 1/0.2 = 5, confirming a 5+ ion. As a rule of thumb, a singly charged ion shows isotope spacing of 1.0 Da; a doubly charged ion shows 0.5 Da; a 5+ ion shows ~0.2 Da.
What does this ion likely represent?
A z = +5 ion at m/z ~2800 corresponds to a neutral mass of approximately (2800 × 5) − 5 = ~13,995 Da
This is close to half the molecular weight of intact eGFP (~28 kDa), so under the z = +5 assignment the peak would represent a fragment or other half-mass species rather than the intact monomer (a dimer would have about twice the monomer mass, not half). In a native direct-infusion experiment, low-abundance species such as fragments or partial assemblies can appear at unexpected m/z values. This peak is worth noting as a minor species distinct from the main z = 11 native monomer envelope centred at m/z ~2545.
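The isotope-spacing reasoning in this section can be sketched in a few lines. This is a minimal illustration under the assumption that the two labelled peaks are adjacent isotopes (one isotope step apart); the `isotope_steps` argument and the function names are my own additions, included so the effect of that assumption can be checked.

```python
PROTON = 1.00728  # mass of a proton in Da


def charge_from_isotope_spacing(mz_a, mz_b, isotope_steps=1):
    """Estimate z from two isotope peaks of the same ion.

    isotope_steps is the number of ~1 Da isotope steps between the peaks
    (1 if they are adjacent isotopes).
    """
    spacing = abs(mz_b - mz_a)
    return isotope_steps / spacing


def neutral_mass(mz, z):
    """Neutral mass in Da for an ion observed at m/z with z protons attached."""
    return z * (mz - PROTON)


if __name__ == "__main__":
    z_est = charge_from_isotope_spacing(2799.4199, 2799.6365)
    z = round(z_est)
    print(f"estimated z = {z_est:.2f} -> z = +{z}")
    print(f"neutral mass = {neutral_mass(2799.4199, z):.1f} Da")
```

Using the exact proton mass instead of 1 Da shifts the neutral mass by only a few hundredths of a dalton per charge, so the ~14 kDa conclusion is unchanged.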
Part C. Waters Part III — Peptide Mapping - primary structure
1. Lysines (K) and Arginines (R) in eGFP from Benchling
Arginines: 6
Lysines: 20
2. Peptide mapping for tryptic digestion of eGFP using PeptideMass
Trypsin cleaves after lysine (K) and arginine (R) residues. Running the eGFP sequence through ExPASy PeptideMass with trypsin, 0 missed cleavages, reduced cysteines, and a 500 Da mass cutoff returns 19 peptides, covering 90.7% of the sequence.
Mass [M+H]⁺ | Position | Peptide sequence
4472.1752 | 170–210 | HNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSK
2566.2931 | 217–239 | DHMVLLEFVTAAGITLGMDELYK
2437.2608 | 5–27 | GEELFTGVVPILVELDGDVNGHK
2378.2577 | 54–74 | LPVPWPTLVTTLTYGVQCFSR
1973.9062 | 142–157 | LEYNYNSHNVYIMADK
1503.6597 | 28–42 | FSVSGEGEGDATYGK
1266.5783 | 87–97 | SAMPEGYVQER
1083.4979 | 240–247 | LEHHHHHH
1050.5214 | 115–123 | FEGDTLVNR
982.4952 | 133–141 | EDGNILGHK
821.3940 | 81–86 | QHDFFK
790.3552 | 75–80 | YPDHMK
769.3913 | 47–53 | FICTTGK
711.2944 | 103–108 | DDGNYK
655.3813 | 98–102 | TIFFK
602.2780 | 211–215 | DPNEK
579.3137 | 128–132 | GIDFK
507.2925 | 164–167 | VNFK
502.3235 | 124–127 | IELK
Parameters: trypsin, 0 missed cleavages, cysteines reduced, methionines unoxidised, masses > 500 Da, monoisotopic [M+H]⁺. Theoretical pI: 5.90, average MW: 28,006.60 Da, monoisotopic MW: 27,988.96 Da.
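As a cross-check on the PeptideMass output, the cleavage rule and mass calculation can be sketched directly. This is a simplified illustration, not the PeptideMass algorithm itself: it assumes cleavage after K or R except before P, no missed cleavages, and standard monoisotopic residue masses. It is run only on the contiguous stretch at positions 115–132 (FEGDTLVNR, IELK, GIDFK joined in order), because the full eGFP sequence is not reproduced in this write-up.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the free N- and C-termini
PROTON = 1.00728   # charge carrier for [M+H]+


def tryptic_digest(seq):
    """Split seq after every K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        is_last = i == len(seq) - 1
        if aa in "KR" and not is_last and seq[i + 1] != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    peptides.append(seq[start:])
    return peptides


def mh_plus(peptide):
    """Monoisotopic [M+H]+ of a peptide with reduced, unmodified residues."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


if __name__ == "__main__":
    segment = "FEGDTLVNRIELKGIDFK"  # positions 115-132 from the table above
    for pep in tryptic_digest(segment):
        print(f"{pep:<12} [M+H]+ = {mh_plus(pep):.4f}")
```

The three masses agree with the table values (1050.5214, 502.3235, 579.3137) to within rounding.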
Chromatographic peaks in the TIC (0.5 to 6 min)
Counting all peaks above 10% relative abundance in Figure 5a between 0.5 and 6 minutes, there are approximately 19 chromatographic peaks visible.
Does the peak count match the predicted peptide count?
The PeptideMass prediction returned 19 peptides above 500 Da, and the chromatogram shows a comparable number of peaks (approximately 19), although some regions appear to contain additional minor peaks. A one-to-one match is not expected: a single peptide can give more than one chromatographic peak if oxidised or otherwise modified variants are present or if missed-cleavage products form at low levels, while peptides that co-elute merge into a single peak. Additionally, some peaks may represent non-peptide matrix components or buffer adducts.
Identifying the charge state and mass of the peptide at 2.78 min (Figure 5b)
The most abundant ion in Figure 5b appears at m/z = 525.76712, with a second charge state visible at m/z = 1050.52438.
Using the isotope spacing in the inset zoom of the 525.76 peak:
The two isotope peaks are at 525.76712 and 526.25918, giving a spacing of:
526.25918 - 525.76712 = 0.4921 Da
Since isotope spacing = 1/z:
z = 1 / 0.4921 ≈ 2.03, which rounds to 2, confirming the most abundant charge state is z = +2.
The singly charged mass [M+H]⁺ is calculated as:
[M+H]⁺ = (m/z × z) - (z - 1) = (525.76712 × 2) - 1 = 1050.53424 Da
This is consistent with the observed singly charged ion at m/z 1050.52438.
Peptide identification and mass accuracy
From the PeptideMass results, the peptide with theoretical [M+H]⁺ = 1050.5214 Da at position 115-123 is FEGDTLVNR.
Mass error = (1050.52438 − 1050.5214) / 1050.5214 × 10⁶ ≈ 2.84 ppm.
This is well within the typical <5 ppm accuracy expected from a Waters Xevo G3 QTof instrument.
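The mass-accuracy arithmetic for this peptide can be sketched as below. The function names are illustrative. Note one deliberate difference from the hand calculation above: the conversion from the doubly charged ion uses the proton mass (1.00728 Da) rather than 1 Da, so its result differs slightly in the second decimal place.

```python
PROTON = 1.00728  # mass of a proton in Da


def mh_from_multiply_charged(mz, z):
    """Convert an [M+zH]z+ ion at m/z to the equivalent singly charged [M+H]+."""
    return mz * z - (z - 1) * PROTON


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


if __name__ == "__main__":
    theoretical = 1050.5214           # FEGDTLVNR, from PeptideMass
    observed_singly = 1050.52438      # singly charged ion in Figure 5b
    print(f"ppm error = {ppm_error(observed_singly, theoretical):+.2f}")
    print(f"[M+H]+ from 2+ ion = {mh_from_multiply_charged(525.76712, 2):.5f}")
```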
Sequence coverage confirmed by peptide mapping
As shown in Figure 6, the BioAccord LC-MS peptide identification data confirm 88% sequence coverage of eGFP, with the unconfirmed regions corresponding primarily to small peptides below the 500 Da detection threshold and the short N-terminal peptide (MVS) that falls outside the tryptic detection window.
Bonus Peptide Map Questions
Peptide identification from Figure 5c
The peptide eluting at 2.78 min with [M+H]⁺ = 1050.52438 Da matches FEGDTLVNR (positions 115–123, predicted [M+H]⁺ = 1050.5214 Da, 2.84 ppm error).
The predicted fragment ion series confirms the match:
Position | Residue | B ion (m/z) | Y ion (m/z)
1 | F | 148.07574 | 1050.52149
2 | E | 277.11833 | 903.45308
3 | G | 334.13979 | 774.41049
4 | D | 449.16673 | 717.38902
5 | T | 550.21441 | 602.36208
6 | L | 663.29848 | 501.31440
7 | V | 762.36689 | 388.23034
8 | N | 876.40982 | 289.16192
9 | R | 1032.51093 | 175.11900
The observed ions in Figure 5c at m/z 774.41334, 903.44365, and 602.34777 correspond directly to Y7 (774.41049), Y8 (903.45308), and Y5 (602.36208) ions respectively, confirming the sequence read-out from the C-terminus. The B/Y ion ladder is internally consistent and the fragmentation pattern is unambiguous.
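The predicted ladder can be reproduced with a short calculation. This is a sketch assuming singly charged b and y ions and standard monoisotopic residue masses; only the residues present in FEGDTLVNR are included in the lookup, and the function names are my own. In the table, the Y value on row i is the y ion that starts at residue i, so row 2 is y8, row 3 is y7, and row 5 is y5.

```python
# Monoisotopic residue masses (Da) for the residues in FEGDTLVNR only.
MONO = {
    "F": 147.06841, "E": 129.04259, "G": 57.02146, "D": 115.02694,
    "T": 101.04768, "L": 113.08406, "V": 99.06841, "N": 114.04293,
    "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b ions b1..bn (N-terminal fragments)."""
    ions, total = [], 0.0
    for aa in peptide:
        total += MONO[aa]
        ions.append(total + PROTON)
    return ions


def y_ions(peptide):
    """Singly charged y ions y1..yn (C-terminal fragments)."""
    ions, total = [], 0.0
    for aa in reversed(peptide):
        total += MONO[aa]
        ions.append(total + WATER + PROTON)
    return ions


if __name__ == "__main__":
    pep = "FEGDTLVNR"
    b, y = b_ions(pep), y_ions(pep)
    for i in range(len(pep)):
        print(f"b{i + 1} = {b[i]:10.4f}    y{i + 1} = {y[i]:10.4f}")
```

The computed y5, y7, and y8 values agree with the table to within about 0.1 mDa, which comes from small differences in the constants used.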
Does the peptide map confirm eGFP identity?
Yes. The data are consistent with the eGFP standard for several converging reasons. The identified peptide FEGDTLVNR matches the eGFP sequence at positions 115–123 and is not a common contaminant sequence. The measured mass matches the theoretical monoisotopic mass within 2.84 ppm, well within the instrument’s expected accuracy. The fragmentation spectrum produces a coherent B and Y ion series with no unexplained major peaks. Figure 6 shows 88% sequence coverage across the full eGFP chain, with the identified peptides distributed across nearly the entire length of the protein rather than clustering in one region, as would be expected if the signal were from a contaminant or partial degradation product. The small uncovered regions (approximately 12% of sequence) correspond to short peptides below the 500 Da detection threshold and the N-terminal MVS tripeptide, both of which are expected gaps given the experimental parameters rather than evidence against eGFP identity.
Part D. Waters Part IV — Oligomers
Using the subunit masses from Table 1 (7FU = 340 kDa, 8FU = 400 kDa), the observed CDMS peaks map to the following oligomeric species:
Peak (MDa) | Calculated mass | Assignment
3.4 | 340 kDa × 10 = 3.40 MDa | 7FU Decamer
8.33 | 400 kDa × 20 = 8.00 MDa | 8FU Didecamer
12.67 | 400 kDa × 30 = 12.00 MDa | 8FU 3-Decamer
~16–17 (low, broad) | 400 kDa × 40 = 16.00 MDa | 8FU 4-Decamer
The dominant species in solution is the 8FU didecamer at ~8.33 MDa, which is the canonical functional assembly of KLH. The 7FU decamer at ~3.4 MDa appears as a lower-abundance species representing the half-molecule form. The 3-decamer at ~12.67 MDa is present at reduced intensity, and the 4-decamer is visible only as a broad low-intensity feature near 16 MDa, consistent with published observations of KLH assembly heterogeneity in solution.
The small offsets between calculated and observed masses (e.g. 8.00 MDa calculated vs. 8.33 MDa observed for the didecamer) reflect glycosylation and other post-translational modifications on KLH subunits, which are not accounted for in the bare polypeptide masses in Table 1.
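The size of these offsets can be quantified with a small calculation. This sketch uses only the Table 1 subunit masses and the peak positions listed above; the broad ~16–17 MDa feature is left out because it has no single well-defined peak value. The function and variable names are illustrative.

```python
SUBUNIT_KDA = {"7FU": 340.0, "8FU": 400.0}  # subunit masses from Table 1

# (observed peak in MDa, subunit type, number of subunit copies)
ASSIGNMENTS = [
    (3.40, "7FU", 10),
    (8.33, "8FU", 20),
    (12.67, "8FU", 30),
]


def calculated_mass_mda(subunit, copies):
    """Bare-polypeptide oligomer mass in MDa."""
    return SUBUNIT_KDA[subunit] * copies / 1000.0


def percent_offset(observed_mda, subunit, copies):
    """Observed minus calculated mass, as a percentage of the calculated mass."""
    calc = calculated_mass_mda(subunit, copies)
    return (observed_mda - calc) / calc * 100.0


if __name__ == "__main__":
    for observed, subunit, copies in ASSIGNMENTS:
        calc = calculated_mass_mda(subunit, copies)
        off = percent_offset(observed, subunit, copies)
        print(f"{subunit} x{copies}: calc {calc:.2f} MDa, "
              f"obs {observed:.2f} MDa, offset {off:+.1f} %")
```

The 8FU assemblies sit roughly 4 to 6% above the bare polypeptide mass, which is the magnitude that the glycosylation explanation needs to account for.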
Part E. Waters Part V — Did I make GFP?
Measurement | Theoretical | Observed (Intact LC-MS) | PPM Mass Error
Molecular weight (kDa) | 27.9890 | 27.9896 | +21.4 ppm
Week 11
Class Assignment — Week 11
Part A. Community Bioart Reflections | The 1,536 Pixel Artwork Canvas
I contributed to the “Love” apple-shaped yellow sign at the mid-bottom of the artwork, working on the DNA assembly for that section of the plate.
What I liked most is the premise itself: that biology can be a medium for public communication, not just a laboratory tool. There is something genuinely powerful about a piece of art that is also a functional scientific artefact — 1,536 colonies, four colours, four quadrants, one coherent image, built by 154 people across 7,946 individual contributions. Projects like this do more for science outreach than most formal presentations ever will, because they meet people where curiosity lives. The collaborative structure reinforced that too. No single person could have produced this at scale. Every contribution, however small, was load-bearing. That is a lesson worth carrying into research.
For next year, a few things could sharpen the experience. The process deserves better documentation — annotated diagrams of who contributed what quadrant and colour, and a short write-up of the biological design logic mapping colony colour to fluorescent protein or pigment pathway. That record becomes an outreach asset in its own right, and for participants from under-resourced contexts it also serves as tangible evidence of having done real science. I would also push for a clearer throughline between the artistic concept and the biology: why this sequence, why this organism, why this visual. That conceptual anchoring is what separates bioart that educates from bioart that merely looks interesting from a distance.
Part B. Cell-Free Protein Synthesis | Cell-Free Reagents
BL21 (DE3) Star Lysate (includes T7 RNA Polymerase): The lysate is the reaction engine. It supplies the ribosomes, translation factors, chaperones, and metabolic enzymes needed to carry out transcription and protein synthesis. The DE3 strain harbours a chromosomal T7 RNA Polymerase gene, so the lysate comes pre-loaded with the polymerase needed to drive T7 promoter-based expression.
Salts/Buffer
Potassium Glutamate: The primary monovalent salt. It maintains ionic strength and stabilises ribosome conformation while also serving as a mild crowding agent that mimics the intracellular environment.
HEPES-KOH pH 7.5: The buffering system. It holds the reaction at a physiologically permissive pH, which matters because both ribosome activity and enzyme kinetics are sensitive to even modest pH drift over a 20-hour incubation.
Magnesium Glutamate: Magnesium is indispensable for ribosome assembly and catalytic activity. It also stabilises nucleotide triphosphates and is a cofactor for many of the enzymes active in the lysate.
Potassium Phosphate (monobasic and dibasic, 1.6:1 ratio): The phosphate pair serves dual duty: secondary pH buffering and phosphate donor pool. The specific dibasic:monobasic ratio fine-tunes the buffering capacity at pH 7.5 and feeds into nucleotide regeneration pathways.
Energy / Nucleotide System
Ribose: The carbon backbone for nucleotide biosynthesis. Cellular enzymes in the lysate phosphorylate and elaborate ribose into the nucleotide monophosphates needed for RNA synthesis, making it the upstream feedstock for the whole energy system.
Glucose: A supplementary carbon and energy source. It feeds into glycolysis within the lysate to regenerate ATP and sustain metabolic activity over the extended 20-hour window.
AMP, CMP, UMP: Nucleotide monophosphate precursors. The lysate enzymes phosphorylate these to their di- and triphosphate forms, supplying the NTPs required for transcription without the instability problems associated with adding NTPs directly.
GMP: Absent from this mix (0.00 uM in the image). Guanine is supplied instead and salvaged into GMP by the lysate’s purine salvage pathway, making direct GMP supplementation unnecessary.
Guanine: The free base precursor for guanosine nucleotides. Lysate hypoxanthine-guanine phosphoribosyltransferase (HGPRT) converts it to GMP via the purine salvage pathway, which is then phosphorylated to GDP and GTP for use in transcription.
Translation Mix (Amino Acids)
17 Amino Acid Mix: The bulk substrate pool for translation. Seventeen of the twenty standard amino acids are supplied together; tyrosine and cysteine are handled separately because of their solubility and stability constraints.
Tyrosine: Supplied at elevated pH (pH 12 stock) because tyrosine has very low aqueous solubility at neutral pH. It is added separately to avoid precipitation in the master mix.
Cysteine: Also added separately due to its tendency to oxidise in bulk amino acid stocks, which would render it unusable for translation. Keeping it isolated until reaction assembly preserves its reduced form.
Additives
Nicotinamide: An NAD+ precursor and sirtuin inhibitor. It helps maintain the NAD+/NADH redox balance needed to sustain metabolic enzyme activity across the long incubation, and may also reduce non-specific protein degradation by inhibiting NAD+-dependent deacylases in the lysate.
Backfill
Nuclease-Free Water: Brings the reaction to final volume without introducing RNases that would degrade the mRNA template and collapse expression.
Question 1: Key Differences Between the 1-Hour PEP-NTP and 20-Hour NMP-Ribose Master Mixes
The 1-hour PEP-NTP system supplies energy and nucleotides directly: preformed NTPs (ATP, GTP, CTP, UTP) plus phosphoenolpyruvate (PEP-Mono) as the immediate phosphate donor for ATP regeneration, with maltodextrin as a secondary carbon source. This makes it fast but metabolically shallow since the NTP pool is fixed at the start and depletes without robust regeneration. The 20-hour NMP-Ribose system takes the opposite approach: it supplies nucleotide monophosphates and simple sugars (ribose, glucose) as upstream precursors, letting the lysate’s own enzymes synthesise and continuously regenerate NTPs throughout the reaction, which sustains expression over a far longer window. The additives also diverge sharply: the 1-hour mix includes spermidine, DMSO, cAMP, NAD, and folinic acid to boost immediate transcription/translation efficiency, while the 20-hour mix strips these down to nicotinamide alone, reflecting a design philosophy of metabolic sustainability over peak output.
Bonus: How Does Transcription Occur If GMP Is 0.00 uM?
GMP is listed at 0.00 uM because it is not supplied directly. Guanine is present instead, and the lysate’s purine salvage machinery, specifically HGPRT, converts free guanine to GMP using PRPP (phosphoribosyl pyrophosphate) as the ribose-phosphate donor. That GMP is then phosphorylated to GDP and GTP by nucleoside monophosphate kinases and pyruvate kinase respectively. The system effectively outsources GTP synthesis to the lysate’s own enzymes rather than paying the cost of supplying pre-formed GMP that could be unstable or inhibitory at high concentrations.
Part C. Planning the Global Experiment | Cell-Free Master Mix Design
Fluorescent Protein Biophysical Properties (20-Hour NMP-Ribose Master Mix)
1. sfGFP
sfGFP was specifically engineered for robust folding under conditions where normal GFP would misfold or aggregate. It showed a 3.5-fold faster initial refolding rate than its parent frGFP and tolerated higher denaturant concentrations, which directly translates to better performance in the crowded, chaperone-limited environment of a cell-free lysate. In a 36-hour reaction, that folding robustness means a higher fraction of translated protein reaches a fluorescent state rather than being lost to misfolding.
2. mRFP1
The most relevant property here is incomplete chromophore maturation. mRFP1 shows two absorption peaks at 503 nm and 584 nm; the 503 nm peak corresponds to a green fraction that never fully matures beyond the green intermediate, with a quantum yield of only 0.27. In a cell-free system, there is no cellular quality control or folding assistance to rescue this incomplete maturation fraction, so a meaningful portion of expressed mRFP1 will likely remain dim or spectrally contaminated, reducing effective red fluorescence yield over the 36-hour incubation.
3. mKO2
mKO2 is a fast-folding variant of mKO1, engineered with 8 additional mutations for rapid maturation, though it has moderate acid sensitivity. The acid sensitivity is the property most relevant to cell-free. As the NMP-Ribose reaction runs over 36 hours, metabolic byproducts can acidify the reaction environment, and even modest pH drift below 7.0 could reduce mKO2 fluorescence output. Buffering capacity of the HEPES-KOH system is critical here specifically for mKO2.
4. mTurquoise2
mTurquoise2 has a maturation half-time of approximately 36.5 minutes, which is slow relative to other cyan variants. In a short reaction this would be a problem, but over 36 hours it is unlikely to be the bottleneck. The more relevant consideration is that its maturation kinetics are complex and require more than one kinetic step, meaning the protein accumulates through intermediate states before reaching peak fluorescence. For a 36-hour readout, this matters less than it would for a 1-hour endpoint assay.
5. mScarlet-I
mScarlet-I is one of the brightest monomeric red fluorescent proteins currently available, but it carries a known photostability limitation. The photostability of mScarlet-I is lower than mCherry under FRET imaging conditions, though under typical dynamic experiment conditions it barely loses intensity. More relevant to cell-free is that all GFP-like chromophores, including mScarlet-I’s, require molecular oxygen for maturation. In a sealed 20 uL reaction running for 36 hours, dissolved oxygen will be consumed early, meaning late-translated mScarlet-I molecules may not fully mature. This is probably the single biggest performance limiter for the red channel over long incubations.
6. Electra2
Electra2 is a blue fluorescent protein derived from mRuby3, engineered through hierarchical screening in bacterial and mammalian cells, with excitation at 403 nm and emission at 456 nm. Quantification of intracellular brightness showed Electra2 was approximately 2.1 times brighter than mTagBFP2, which is impressive for the blue channel. The key biophysical caveat is that, like all GFP-derived beta-barrel FPs, Electra2 still requires molecular oxygen for chromophore maturation. This makes oxygen depletion over 36 hours a shared limitation with mScarlet-I, and potentially more acute for Electra2 because blue-channel chromophore formation is generally less efficient than green or red.
Hypothesis: Improving mScarlet-I Fluorescence Over 36-Hour Incubation
Protein: mScarlet-I
Problem: Oxygen-dependent chromophore maturation means late-translated mScarlet-I molecules cannot mature in a sealed, metabolically active reaction where dissolved O2 is consumed within the first few hours.
Hypothesis: Extending the oxygen availability window would increase the proportion of mScarlet-I that reaches full chromophore maturation. Two simple options are to use the 2 uL custom reagent slot for a dilute perfluorocarbon oxygen carrier, or to raise dissolved O2 before the reaction by briefly aerating the master mix before sealing. Alternatively, within the reaction composition (6 uL lysate + 10 uL master mix + 2 uL DNA + 2 uL supplements), the 2 uL supplement volume could carry a small amount of hydrogen peroxide at sub-millimolar concentration as a slow O2 donor, with catalase from the lysate itself releasing O2 gradually throughout the incubation. Expected effect: higher peak fluorescence and a later-onset fluorescence plateau, reflecting maturation of protein translated in the middle and later phases of the 36-hour window rather than only the early burst.