Projects

Final projects:

  • Group Name: Phage Forge Group Members: @2026a-nourelden-rihan, @2026a-ritika-saha, @2026a-rahul-yaji, and @2026a-keerthana-gunaretnam Project Documentation Proposal: By: @2026a-nourelden-rihan, @2026a-ritika-saha, @2026a-rahul-yaji
  • Ship 41 HTGAA Final Project Proposal Project Title: Ship 41: Multi-Chassis Computational and Experimental Optimization of α-Pinene Biosynthesis for High-Value Bioproduction Author: Nourelden Rihan Course: How to Grow (Almost) Anything — HTGAA 2026 Date: April 4, 2026 Industry Partners: Ginkgo Bioworks · Twist Bioscience · Waters Corporation · New England Biolabs · Millipore Sigma

Subsections of Projects

Group Final Project

Group Name: Phage Forge

Group Members:

@2026a-nourelden-rihan, @2026a-ritika-saha, @2026a-rahul-yaji, and @2026a-keerthana-gunaretnam

Project Documentation

Proposal:

By: @2026a-nourelden-rihan, @2026a-ritika-saha, @2026a-rahul-yaji

  • We decided to focus on the main area of increasing the stability of the MS2 phage lysis protein L, with a possible secondary goal of reducing the dependency on host DnaJ, while still maintaining the lysis action.
  • The tools AlphaFold, Clustal Omega, BLAST, ESM, and ESMFold were discussed.
  • BLAST can pull out homologous lysis proteins from the databases.
  • Clustal Omega can create MSAs to identify essential L48-S49 residues, and the pore-forming regions that must not be mutated.
  • ESM can create mutation heatmaps, which can guide the use of ESMFold to obtain highest score foldings in mutatable regions.
  • AlphaFold Multimer predicts whether the subunits of our protein can successfully create a pore in the host membrane, and also to check whether N-terminus can break the interaction with DnaJ.
  • We also identified a few pitfalls, with majors ones dealing with limited training datasets, that may not be properly aligned towards creating a transmembrane lysis protein.
  • Some other pitfalls include the lack of proper annotations for amurins; the possibility of an over-stable protein to form non-functional aggregates; and the vulnerability of modified protein to host proteases.

Paper summaries:

By: @2026a-ritika-saha
MS2 Lysis of Escherichia coli Depends on Host Chaperone DnaJ

The study shows that the MS2 phage lysis protein L requires the host chaperone DnaJ for efficient host cell lysis. A missense mutation (P330Q) in the highly conserved C-terminal domain of DnaJ blocks MS2 L-mediated lysis at 30 °C and delays lysis at higher temperatures, without affecting overall L protein synthesis. The defect is specific to L-mediated lysis and does not affect lysis by other phage lysis proteins.

Genetic suppressor screening identified Lodj alleles of the L gene that bypass the DnaJ requirement. These alleles encode truncated L proteins lacking the highly basic N-terminal domain, indicating that this domain confers dependence on DnaJ. Biochemical assays demonstrated that wild-type L forms a membrane-associated complex with DnaJ, whereas the P330Q DnaJ variant cannot interact with L.

The authors propose that DnaJ functions as a chaperone that facilitates proper folding or conformational activation of full-length L, preventing steric interference from the N-terminal domain and allowing L to interact with its unknown cellular target. Removal of the dispensable N-terminal domain eliminates the need for chaperone assistance and accelerates lysis.

The work identifies DnaJ as a host factor regulating MS2 lysis timing and suggests that chaperone-dependent modulation of lysis may be an evolutionary strategy to optimize phage replication cycles.


Mutational analysis of the MS2 lysis protein L

This study performed comprehensive mutational and genetic analyses of the MS2 phage lysis protein L to identify residues and domains required for function. Random mutagenesis of the 75-aa L protein showed that most loss-of-function mutations cluster in the C-terminal half of the protein, especially around a conserved Leu-Ser (LS) dipeptide motif. Many inactivating mutations were conservative amino-acid substitutions and did not affect protein accumulation or membrane association, suggesting that L function depends on specific protein–protein interactions rather than nonspecific membrane disruption.

Functional studies demonstrated that L-mediated lysis requires interaction with the host chaperone DnaJ. The highly basic N-terminal domain of L is dispensable for lytic activity but mediates DnaJ dependence. Truncation of this domain or certain suppressor mutations bypassed the chaperone requirement and restored rapid lysis.

Biochemical and genetic data support a model in which L is an integral membrane protein whose essential domains (including the LS motif and neighboring regions) form a helical structure that likely engages a host membrane target protein. The interaction may occur near sites of membrane curvature associated with peptidoglycan biosynthesis rather than by forming nonspecific membrane lesions.

The work, supported in part by the Center for Phage Technology and associated laboratories including research by Ry Young, suggests that MS2 L functions through a specific heterotypic protein–protein interaction mechanism and that chaperone-dependent regulation helps control lysis timing during infection.

The study refines the mechanistic model of MS2 lysis, proposing that conserved structural motifs rather than general membrane disruption drive lytic activity.


In vitro characterization of the phage lysis protein MS2-L

This study provides detailed in vitro and in vivo characterization of the MS2 lysis protein MS2-L, focusing on its membrane insertion mechanism, oligomerization behavior, and interaction with the host chaperone DnaJ.

Key findings show that MS2-L is a 75-amino-acid phage toxin whose essential lytic activity resides in the C-terminal ~35 amino acids, which form a hydrophobic transmembrane region. The N-terminal soluble domain is not required for bacterial killing but modulates folding, membrane insertion efficiency, and chaperone interaction.

Biochemical assays demonstrate that MS2-L interacts directly with DnaJ, primarily through the soluble N-terminal domain. However, this interaction does not significantly affect membrane insertion, solubilization, or oligomerization of the toxin, suggesting that DnaJ functions more as a folding or stabilization partner rather than being essential for lytic activity.

Native mass spectrometry revealed that MS2-L assembles into high-order oligomeric complexes (≥10 monomers) after insertion into lipid nanodiscs, and oligomerization is driven mainly by the transmembrane domain. In detergent environments, oligomer formation is reduced, indicating that membrane lipid context is important for stable assembly.

Fluorescence microscopy and cryo-electron microscopy showed that MS2-L expression in bacteria leads to peripheral membrane clustering, followed by sequential lesion formation beginning in the outer membrane, then disruption of the peptidoglycan layer, and finally inner membrane disintegration with cytoplasmic leakage.

The data support a model in which MS2-L functions as a pore-forming phage toxin that kills cells through higher-order oligomerization within the bacterial membrane, rather than by directly inhibiting peptidoglycan biosynthesis. Chaperone DnaJ binds MS2-L but is not required for membrane insertion or pore assembly, suggesting its role is mainly in modulating toxin folding or stability.

These findings strengthen the concept that MS2-L belongs to the amurin/single-gene lysis protein family and may be useful for bioengineering applications such as bacterial ghost cell production and antimicrobial design.


Phage therapy: From biological mechanisms to future directions

This review from Elsevier surveys the biological mechanisms, clinical development, and future directions of phage therapy as a strategy to combat antimicrobial resistance. It explains that therapeutic phages should ideally be strictly lytic, highly host-specific, and thoroughly characterized to ensure safety and efficacy.

The article describes how phages kill bacteria through mechanisms such as inhibition of essential cellular processes, expression of lysis proteins, or disruption of bacterial membranes. It also discusses advances in phage engineering, including synthetic genome construction and modification of phage host range and virulence.

Clinical applications of phage therapy are highlighted, particularly for treating drug-resistant infections where antibiotics are ineffective. However, challenges remain, including bacterial resistance to phages, regulatory hurdles, manufacturing standardization, and the need to understand phage–host interactions.

Future directions include the use of genetically modified or synthetic phages, computational prediction of therapeutic candidates, and integration of phage therapy with conventional antimicrobial strategies. Overall, phage therapy is presented as a promising but still developing alternative to antibiotics in the fight against antimicrobial resistance.


Generative design of novel bacteriophages with genome language models

This preprint reports the first experimental demonstration of generative design of complete bacteriophage genomes using genome language models (Evo 1 and Evo 2). The authors fine-tuned models on about 15,000 Microviridae phage genomes to enable autoregressive generation of full viral genomes guided by template-based prompts and biologically motivated design constraints.

The workflow involved computational generation followed by multi-tier filtering for sequence quality, host tropism specificity, and evolutionary diversity. Constraints included genome length (4–6 kb), GC content, absence of long homopolymers, preservation of phage-like gene architecture, and spike protein similarity to the template phage to maintain host targeting.

Experimental validation showed that about 285 of 302 synthesized genome candidates could be assembled, and 16 produced viable infectious phages that inhibited growth of the target host strain. These generated phages displayed substantial sequence novelty, containing hundreds of mutations relative to natural Microviridae genomes, while preserving functional genome organization.

Structural and functional analyses indicated that some generated phages possessed altered protein interfaces but maintained compatible capsid–protein interactions. Cryo-electron microscopy and structure prediction suggested context-dependent co-evolution of structural proteins such as capsid and packaging proteins.

Fitness assays showed that several AI-generated phages matched or exceeded the replication and lytic performance of the template phage, and phage cocktail experiments demonstrated rapid suppression of resistant bacterial strains through recombination and mutation-driven adaptation.

The study was conducted with biosafety considerations, including restricting model training to bacteriophage genomes and using well-characterized laboratory strains. The work was supported by researchers affiliated with institutions such as the Stanford University and the Arc Institute.

Overall, the paper proposes a framework for generative genome engineering, showing that AI models can design biologically viable and evolutionarily novel bacteriophages, potentially enabling future synthetic biology and phage-based therapeutic development.


Overview of the Project Proposal: Engineering the MS2 Phage Lysis Protein L

By: @2026a-nourelden-rihan, @2026a-ritika-saha, @2026a-rahul-yaji

1. Project Goal

Our primary goal is to increase the structural stability of the MS2 bacteriophage lysis protein (L) while maintaining its ability to lyse bacterial cells.

Our secondary goal is to reduce the dependency of L on the host chaperone DnaJ, which normally assists the protein in folding or activation. Reducing this dependency could allow the lysis protein to function more efficiently and independently in engineered systems.

The MS2 L protein is a 75-amino-acid single-gene lysis toxin whose C-terminal region forms a hydrophobic transmembrane domain responsible for membrane disruption and pore formation, while the basic N-terminal domain interacts with host factors such as DnaJ. Previous studies show that truncation of the N-terminal region can bypass the DnaJ requirement while preserving lysis activity.

Therefore, our design strategy focuses on:

  • Stabilizing the transmembrane and oligomerization regions
  • Maintaining essential functional motifs such as the L48–S49 motif
  • Exploring modifications to the N-terminal region to reduce DnaJ dependence

2. Computational Tools and Approaches

We will use a multi-step computational protein engineering pipeline combining sequence analysis, machine-learning mutagenesis predictions, and structural modeling.

2.1 BLAST – Homolog Discovery

First, we will use BLAST to identify homologous lysis proteins from related bacteriophages.

Purpose:

  • Identify evolutionarily conserved residues
  • Discover natural sequence variations that maintain function
  • Build a dataset for multiple sequence alignment

This will help determine which regions are functionally constrained vs mutable.

2.2 Clustal Omega – Multiple Sequence Alignment (MSA)

Using sequences obtained from BLAST, we will perform multiple sequence alignment with Clustal Omega.

Purpose:

  • Identify highly conserved residues, especially around the L48–S49 motif
  • Map essential structural regions
  • Determine which residues are safe to mutate

Regions with high conservation will be protected from mutation, while variable regions may be targeted for stability improvements.

2.3 ESM (Protein Language Models) – In Silico Mutagenesis

Next, we will use ESM (Evolutionary Scale Modeling) protein language models to perform systematic mutation scanning.

Purpose:

  • Generate mutation heatmaps
  • Predict which amino acid substitutions improve protein fitness or stability
  • Identify mutations compatible with the evolutionary sequence landscape

This step will guide rational mutation selection instead of random mutagenesis.

2.4 ESMFold – Structure Prediction for Mutants

Promising mutations from ESM analysis will be modeled using ESMFold.

Purpose:

  • Predict 3D structures of mutant proteins
  • Evaluate structural stability
  • Ensure the transmembrane helix remains intact

Mutations that significantly distort the fold will be discarded.

2.5 AlphaFold Multimer – Oligomerization and Host Interaction

Finally, we will use AlphaFold Multimer to analyze:

  1. L protein oligomerization
  2. Potential interactions with DnaJ

Purpose:

  • Predict whether mutated L proteins can form the oligomeric pore complex
  • Evaluate whether N-terminal mutations reduce interaction with DnaJ

Since MS2-L likely forms large oligomeric pores (>10 subunits) in the membrane, maintaining correct protein in1.Phage L protein sequence

Computational Workflow:

  1. Phage L protein sequence
  2. BLAST Search (find homologous lysis proteins)
  3. Multiple Sequence Alignment (Clustal Omega)
    • identify conserved vs mutable residues
  4. ESM Mutation Scanning (generate mutation heatmaps)
  5. Select Candidate Mutations (stability or N-terminal modifications)
  6. Structure Prediction (ESMFold)
  7. Complex/Oligomer Prediction (AlphaFold Multimer)
  8. Final Mutant Candidates (stable + functional lysis protein)

3. Proposed Engineering Pipeline

Computational workflow we will follow.

4. Expected Outcomes

Our pipeline aims to produce engineered variants of the MS2 L protein with:

  • Increased structural stability
  • Reduced aggregation risk
  • Maintained transmembrane insertion
  • Potentially reduced dependency on host DnaJ

These optimized proteins could be useful in applications such as:

  • Synthetic phage engineering
  • Bacterial ghost cell production
  • Antimicrobial protein development

5. Potential Pitfalls

5.1 Limited Training Data

Most protein language models and structural predictors are trained primarily on globular proteins, not small transmembrane phage toxins.

This may reduce prediction accuracy for MS2 L.

5.2 Risk of Over-Stabilization

Mutations designed to increase stability may cause:

  • Protein aggregation
  • Improper membrane insertion
  • Loss of functional oligomerization

Thus stability must be balanced with function.

5.3 Poor Annotation of Amurin Proteins

Single-gene lysis proteins (also called amurins) are poorly annotated in sequence databases.

This may limit the quality of homologous sequences used for alignment and training.

5.4 Host Protease Sensitivity

Mutations may unintentionally expose protease cleavage sites, making the engineered protein less stable inside bacterial cells.

6. Future Work

If promising mutants are identified computationally, the next steps would include:

  • Experimental expression in E. coli
  • Measuring lysis timing
  • Measuring protein stability
  • Testing DnaJ independence

This would validate whether computational predictions translate into improved biological function.


L-Protein Mutants Project Summary:

The MS2 bacteriophage L-protein is a small 75-residue lysis protein with two functional regions: a soluble N-terminal domain (residues 1–40) that interacts with the bacterial chaperone DnaJ, and a transmembrane domain (residues 41–75) that disrupts the inner membrane to trigger host cell lysis. The goal of this project was to computationally design five single-point mutants with the potential to retain or enhance lysis activity. This was done by first running a Clustal Omega Multiple Sequence Alignment to map out conserved positions across homologs, all of which turned out to cluster exclusively in the soluble domain, pointing to the DnaJ interface as the most functionally constrained region of the protein. An LLR mutation heatmap was then generated and cross-referenced against an experimental lysis dataset; three mutations that appeared in both were excluded after showing a lysis score of zero experimentally, highlighting that LLR scores reflect structural stability rather than functional activity. The final five mutants, Y39L and S9Q in the soluble domain, and K50L, N53L, and T52L in the transmembrane domain, were selected based on the highest LLR scores while avoiding all conserved and experimentally failed positions.

Each mutant was then modeled as an octameric pore complex using AlphaFold Multimer to simulate the expected membrane assembly. The key findings were:

  • All five mutants formed a cylinder-like pore structure, consistent with the expected transmembrane assembly
  • pLDDT scores were low (< 50) across all models, and PAE plots showed near-zero confidence for inter-chain contacts
  • Running the same pipeline on R30Q, an experimentally validated lytic mutant, produced identical low-confidence results, suggesting this is a systematic limitation of AlphaFold for this class of small membrane-disrupting proteins rather than a reflection of the mutants’ actual viability
  • All five mutants remain candidates for wet lab validation

Details:

L-Protein Mutants

Option 1: Mutagenesis

Here is the Pictures of the Clustal Omega Multiple Sequence Alignment (MSA) Output

Clustal Omega Multiple Sequence Alignment (MSA) Output for the MS2 L Protein Part 1 Clustal Omega Multiple Sequence Alignment (MSA) Output for the MS2 L Protein Part 1 Clustal Omega Multiple Sequence Alignment (MSA) Output for the MS2 L Protein Part 2 Clustal Omega Multiple Sequence Alignment (MSA) Output for the MS2 L Protein Part 2

It seems like these Positions [21,25,28-29,33,35-37,40] have an “*” which means these have not changed at all and most probably are a totally conserved crucial region that should not be mutated, Positions [17,26,30] have a “:” which means they are highly conserved, but mutations that are very similar in shape, structure and chemical properties are tolerated, the rest seem to be more flexible.

Interesting Finding here, All the conserved regions seem to be in the Soluble Domain (1-40) that is responsible for DnaJ Interaction :D

This is the generated L Protein Mutation Heatmap

L Protein Mutation Heatmap L Protein Mutation Heatmap

I tried to cross reference with the experimental data sheet but it was quite hard to do manually so i wrote a script and ran it on Here on Colab to just find the common ones across the two files and came up with 3 common ones, however even though these 3 had quite a high LLR Score, the experimental data showed a Lysis score of 0, and i believe this shows the limitation in the structure based modeling, it predicted structural stability but failed to account for the functional mechanism of lysis.

Common Mutations Common Mutations

The common mutations identified by my algorithm

So Then i Filtered my generated mutations first to target the soluble domain (1-40) first while making sure to avoid the totally conserved regions [21,25,28-29,33,35-37,40], the highly conserved regions [17,26,30] and my 3 common matches to avoid choosing mutants that are either disrupting the protein or have been already experimented on with no lysis observed.

The two mutations i have decided to go with for the Soluble Domain are:

IndexPositionWild_Type_AAMutation_AALLR Score
139YL2.24177968502044
29SQ2.01432478427886

These have the highest LLR score in this region and avoid any conservative regions in the protein

Then i repeated the steps again, this time targeting the transmembrane domain (41-75) and followed the same criteria, and i picked these 2 mutations:

IndexPositionWild_Type_AAMutation_AALLR Score
350KL2.56146776676178
453NL1.86493206024169

And Again because These have the highest LLR score in this region and avoid any conservative regions in the protein

For the Final 5th mutant i decided to pick this one:

IndexPositionWild_Type_AAMutation_AALLR Score
552TL1.81396758556365

Because this one still has a relatively high LLR Score and is in the area where the L Protein does not overlap with either the Coat Protein or the replicase ones.

so here is the full 5 Mutations table i chose:

IndexPositionWild_Type_AAMutation_AALLR Score
139YL2.24177968502044
29SQ2.01432478427886
350KL2.56146776676178
453NL1.86493206024169
552TL1.81396758556365

AlphaFold Multimer Runs

Mutant 1 (Y39L)

The Monomer Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

The Multimer Sequence (the one i used for Alphafold Multimer): METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLLVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Mutant 1 Multimer Front View Mutant 1 Multimer Front View Mutant 1 Multimer Side View Mutant 1 Multimer Side View

The L Proteins do form a cylinder like shape mimicking the transmembrane pore but the piDDT score are very low (<50) so this probably won’t express well or won’t express at all if done in wet lab :(.

Mutant 1 Multimer Predicted Aligned Error (PAE) Mutant 1 Multimer Predicted Aligned Error (PAE)

This is the Predicted Aligned Error (PAE), i asked AI (Gemini) how to interpret this and it said this:

Gemini’s Interpretation of Predicted Aligned Error (PAE) Plots Gemini’s Interpretation of Predicted Aligned Error (PAE) Plots

So following this, it seems like each monomer folds correctly with high confidence yet their together-grouping is not reliable at all with very low confidence scores possibly hinting that the pore forming shape we saw might not actually happen (i hope i understood that correctly XD)

Mutant 2 (S9Q)

The Monomer Sequence: METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

The Multimer Sequence (the one i used for Alphafold Multimer): METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQQQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Mutant 2 Multimer Front View Mutant 2 Multimer Front View Mutant 2 Multimer Side View Mutant 2 Multimer Side View

Same exact findings for this mutant too :(.

Mutant 2 Multimer Predicted Aligned Error (PAE) Mutant 2 Multimer Predicted Aligned Error (PAE)

Again, Same Here as well. (it is the same for all five mutants :(. )

Mutant 3 (K50L)

The Monomer Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT

The Multimer Sequence (the one i used for Alphafold Multimer): METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT

Mutant 3 Multimer Front View Mutant 3 Multimer Front View Mutant 3 Multimer Side View Mutant 3 Multimer Side View Mutant 3 Multimer Predicted Aligned Error (PAE) Mutant 3 Multimer Predicted Aligned Error (PAE)

Mutant 4 (N53L)

The Monomer Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT

The Multimer Sequence (the one i used for Alphafold Multimer): METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTLQLLLSLLEAVIRTVTTLQQLLT

Mutant 4 Multimer Front View Mutant 4 Multimer Front View Mutant 4 Multimer Side View Mutant 4 Multimer Side View Mutant 4 Multimer Predicted Aligned Error (PAE) Mutant 4 Multimer Predicted Aligned Error (PAE)

Mutant 5 (T52L)

The Monomer Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT

The Multimer Sequence (the one i used for Alphafold Multimer): METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFLNQLLLSLLEAVIRTVTTLQQLLT

Mutant 5 Multimer Front View Mutant 5 Multimer Front View Mutant 5 Multimer Side View Mutant 5 Multimer Side View Mutant 5 Multimer Predicted Aligned Error (PAE) Mutant 5 Multimer Predicted Aligned Error (PAE)

I Actually decided to pick a mutant from the Experimental Data Sheet and made sure it has been proven to have the Lysis Effect, my aim is to try to perform the AlphaFold Multimer step for it, to have a look at how different it might be, the mutant i picked was R30Q.

This is the monomer sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

This is the full Alphafold sequence i used: METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT:METRFPQQSQQTPASTNRRRPFKHEDYPCQRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Experimental Mutant Multimer Front View Experimental Mutant Multimer Front View Experimental Mutant Multimer Side View Experimental Mutant Multimer Side View Experimental Mutant Multimer Predicted Aligned Error (PAE) Experimental Mutant Multimer Predicted Aligned Error (PAE)

Well i got bad news and good news, the bad news is, the structre is also of very low confidence and the Predicted Aligned Error (PAE) Plot shows the same low confidence for the monomer’s interaction with each othe, the good news though is that this is the run of a mutant that has been experimentally validated and has the lysis effect and protein level determined, so maybe this gives me hope that my five mutants might actually stand a chance in a wet lab validation regardless of the very low confidence scores it got.

Subsections of Group Final Project

Week 4

Proposal:

By: @2026a-nourelden-rihan, @2026a-ritika-saha, @2026a-rahul-yaji

  • We decided to focus on the main area of increasing the stability of the MS2 phage lysis protein L, with a possible secondary goal of reducing the dependency on host DnaJ, while still maintaining the lysis action.
  • The tools AlphaFold, Clustal Omega, BLAST, ESM, and ESMFold were discussed.
  • BLAST can pull out homologous lysis proteins from the databases.
  • Clustal Omega can create MSAs to identify essential L48-S49 residues, and the pore-forming regions that must not be mutated.
  • ESM can create mutation heatmaps, which can guide the use of ESMFold to obtain highest score foldings in mutatable regions.
  • AlphaFold Multimer predicts whether the subunits of our protein can successfully create a pore in the host membrane, and also to check whether N-terminus can break the interaction with DnaJ.
  • We also identified a few pitfalls, with majors ones dealing with limited training datasets, that may not be properly aligned towards creating a transmembrane lysis protein.
  • Some other pitfalls include the lack of proper annotations for amurins; the possibility of an over-stable protein to form non-functional aggregates; and the vulnerability of modified protein to host proteases.

Paper summaries:

By: @2026a-ritika-saha
MS2 Lysis of Escherichia coli Depends on Host Chaperone DnaJ

The study shows that the MS2 phage lysis protein L requires the host chaperone DnaJ for efficient host cell lysis. A missense mutation (P330Q) in the highly conserved C-terminal domain of DnaJ blocks MS2 L-mediated lysis at 30 °C and delays lysis at higher temperatures, without affecting overall L protein synthesis. The defect is specific to L-mediated lysis and does not affect lysis by other phage lysis proteins.

Genetic suppressor screening identified Lodj alleles of the L gene that bypass the DnaJ requirement. These alleles encode truncated L proteins lacking the highly basic N-terminal domain, indicating that this domain confers dependence on DnaJ. Biochemical assays demonstrated that wild-type L forms a membrane-associated complex with DnaJ, whereas the P330Q DnaJ variant cannot interact with L.

The authors propose that DnaJ functions as a chaperone that facilitates proper folding or conformational activation of full-length L, preventing steric interference from the N-terminal domain and allowing L to interact with its unknown cellular target. Removal of the dispensable N-terminal domain eliminates the need for chaperone assistance and accelerates lysis.

The work identifies DnaJ as a host factor regulating MS2 lysis timing and suggests that chaperone-dependent modulation of lysis may be an evolutionary strategy to optimize phage replication cycles.


Mutational analysis of the MS2 lysis protein L

This study performed comprehensive mutational and genetic analyses of the MS2 phage lysis protein L to identify residues and domains required for function. Random mutagenesis of the 75-aa L protein showed that most loss-of-function mutations cluster in the C-terminal half of the protein, especially around a conserved Leu-Ser (LS) dipeptide motif. Many inactivating mutations were conservative amino-acid substitutions and did not affect protein accumulation or membrane association, suggesting that L function depends on specific protein–protein interactions rather than nonspecific membrane disruption.

Functional studies demonstrated that L-mediated lysis requires interaction with the host chaperone DnaJ. The highly basic N-terminal domain of L is dispensable for lytic activity but mediates DnaJ dependence. Truncation of this domain or certain suppressor mutations bypassed the chaperone requirement and restored rapid lysis.

Biochemical and genetic data support a model in which L is an integral membrane protein whose essential domains (including the LS motif and neighboring regions) form a helical structure that likely engages a host membrane target protein. The interaction may occur near sites of membrane curvature associated with peptidoglycan biosynthesis rather than by forming nonspecific membrane lesions.

The work, supported in part by the Center for Phage Technology and associated laboratories including research by Ry Young, suggests that MS2 L functions through a specific heterotypic protein–protein interaction mechanism and that chaperone-dependent regulation helps control lysis timing during infection.

The study refines the mechanistic model of MS2 lysis, proposing that conserved structural motifs rather than general membrane disruption drive lytic activity.


In vitro characterization of the phage lysis protein MS2-L

This study provides detailed in vitro and in vivo characterization of the MS2 lysis protein MS2-L, focusing on its membrane insertion mechanism, oligomerization behavior, and interaction with the host chaperone DnaJ.

Key findings show that MS2-L is a 75-amino-acid phage toxin whose essential lytic activity resides in the C-terminal ~35 amino acids, which form a hydrophobic transmembrane region. The N-terminal soluble domain is not required for bacterial killing but modulates folding, membrane insertion efficiency, and chaperone interaction.

Biochemical assays demonstrate that MS2-L interacts directly with DnaJ, primarily through the soluble N-terminal domain. However, this interaction does not significantly affect membrane insertion, solubilization, or oligomerization of the toxin, suggesting that DnaJ functions more as a folding or stabilization partner rather than being essential for lytic activity.

Native mass spectrometry revealed that MS2-L assembles into high-order oligomeric complexes (≥10 monomers) after insertion into lipid nanodiscs, and oligomerization is driven mainly by the transmembrane domain. In detergent environments, oligomer formation is reduced, indicating that membrane lipid context is important for stable assembly.

Fluorescence microscopy and cryo-electron microscopy showed that MS2-L expression in bacteria leads to peripheral membrane clustering, followed by sequential lesion formation beginning in the outer membrane, then disruption of the peptidoglycan layer, and finally inner membrane disintegration with cytoplasmic leakage.

The data support a model in which MS2-L functions as a pore-forming phage toxin that kills cells through higher-order oligomerization within the bacterial membrane, rather than by directly inhibiting peptidoglycan biosynthesis. Chaperone DnaJ binds MS2-L but is not required for membrane insertion or pore assembly, suggesting its role is mainly in modulating toxin folding or stability.

These findings strengthen the concept that MS2-L belongs to the amurin/single-gene lysis protein family and may be useful for bioengineering applications such as bacterial ghost cell production and antimicrobial design.


Phage therapy: From biological mechanisms to future directions

This review from Elsevier surveys the biological mechanisms, clinical development, and future directions of phage therapy as a strategy to combat antimicrobial resistance. It explains that therapeutic phages should ideally be strictly lytic, highly host-specific, and thoroughly characterized to ensure safety and efficacy.

The article describes how phages kill bacteria through mechanisms such as inhibition of essential cellular processes, expression of lysis proteins, or disruption of bacterial membranes. It also discusses advances in phage engineering, including synthetic genome construction and modification of phage host range and virulence.

Clinical applications of phage therapy are highlighted, particularly for treating drug-resistant infections where antibiotics are ineffective. However, challenges remain, including bacterial resistance to phages, regulatory hurdles, manufacturing standardization, and the need to understand phage–host interactions.

Future directions include the use of genetically modified or synthetic phages, computational prediction of therapeutic candidates, and integration of phage therapy with conventional antimicrobial strategies. Overall, phage therapy is presented as a promising but still developing alternative to antibiotics in the fight against antimicrobial resistance.


Generative design of novel bacteriophages with genome language models

This preprint reports the first experimental demonstration of generative design of complete bacteriophage genomes using genome language models (Evo 1 and Evo 2). The authors fine-tuned models on about 15,000 Microviridae phage genomes to enable autoregressive generation of full viral genomes guided by template-based prompts and biologically motivated design constraints.

The workflow involved computational generation followed by multi-tier filtering for sequence quality, host tropism specificity, and evolutionary diversity. Constraints included genome length (4–6 kb), GC content, absence of long homopolymers, preservation of phage-like gene architecture, and spike protein similarity to the template phage to maintain host targeting.

Experimental validation showed that about 285 of 302 synthesized genome candidates could be assembled, and 16 produced viable infectious phages that inhibited growth of the target host strain. These generated phages displayed substantial sequence novelty, containing hundreds of mutations relative to natural Microviridae genomes, while preserving functional genome organization.

Structural and functional analyses indicated that some generated phages possessed altered protein interfaces but maintained compatible capsid–protein interactions. Cryo-electron microscopy and structure prediction suggested context-dependent co-evolution of structural proteins such as capsid and packaging proteins.

Fitness assays showed that several AI-generated phages matched or exceeded the replication and lytic performance of the template phage, and phage cocktail experiments demonstrated rapid suppression of resistant bacterial strains through recombination and mutation-driven adaptation.

The study was conducted with biosafety considerations, including restricting model training to bacteriophage genomes and using well-characterized laboratory strains. The work was supported by researchers affiliated with institutions such as the Stanford University and the Arc Institute.

Overall, the paper proposes a framework for generative genome engineering, showing that AI models can design biologically viable and evolutionarily novel bacteriophages, potentially enabling future synthetic biology and phage-based therapeutic development.


Overview of the Project Proposal: Engineering the MS2 Phage Lysis Protein L

By: @2026a-nourelden-rihan, @2026a-ritika-saha, @2026a-rahul-yaji

1. Project Goal

Our primary goal is to increase the structural stability of the MS2 bacteriophage lysis protein (L) while maintaining its ability to lyse bacterial cells.

Our secondary goal is to reduce the dependency of L on the host chaperone DnaJ, which normally assists the protein in folding or activation. Reducing this dependency could allow the lysis protein to function more efficiently and independently in engineered systems.

The MS2 L protein is a 75-amino-acid single-gene lysis toxin whose C-terminal region forms a hydrophobic transmembrane domain responsible for membrane disruption and pore formation, while the basic N-terminal domain interacts with host factors such as DnaJ. Previous studies show that truncation of the N-terminal region can bypass the DnaJ requirement while preserving lysis activity.

Therefore, our design strategy focuses on:

  • Stabilizing the transmembrane and oligomerization regions
  • Maintaining essential functional motifs such as the L48–S49 motif
  • Exploring modifications to the N-terminal region to reduce DnaJ dependence

2. Computational Tools and Approaches

We will use a multi-step computational protein engineering pipeline combining sequence analysis, machine-learning mutagenesis predictions, and structural modeling.

2.1 BLAST – Homolog Discovery

First, we will use BLAST to identify homologous lysis proteins from related bacteriophages.

Purpose:

  • Identify evolutionarily conserved residues
  • Discover natural sequence variations that maintain function
  • Build a dataset for multiple sequence alignment

This will help determine which regions are functionally constrained vs mutable.

2.2 Clustal Omega – Multiple Sequence Alignment (MSA)

Using sequences obtained from BLAST, we will perform multiple sequence alignment with Clustal Omega.

Purpose:

  • Identify highly conserved residues, especially around the L48–S49 motif
  • Map essential structural regions
  • Determine which residues are safe to mutate

Regions with high conservation will be protected from mutation, while variable regions may be targeted for stability improvements.

2.3 ESM (Protein Language Models) – In Silico Mutagenesis

Next, we will use ESM (Evolutionary Scale Modeling) protein language models to perform systematic mutation scanning.

Purpose:

  • Generate mutation heatmaps
  • Predict which amino acid substitutions improve protein fitness or stability
  • Identify mutations compatible with the evolutionary sequence landscape

This step will guide rational mutation selection instead of random mutagenesis.

2.4 ESMFold – Structure Prediction for Mutants

Promising mutations from ESM analysis will be modeled using ESMFold.

Purpose:

  • Predict 3D structures of mutant proteins
  • Evaluate structural stability
  • Ensure the transmembrane helix remains intact

Mutations that significantly distort the fold will be discarded.

2.5 AlphaFold Multimer – Oligomerization and Host Interaction

Finally, we will use AlphaFold Multimer to analyze:

  1. L protein oligomerization
  2. Potential interactions with DnaJ

Purpose:

  • Predict whether mutated L proteins can form the oligomeric pore complex
  • Evaluate whether N-terminal mutations reduce interaction with DnaJ

Since MS2-L likely forms large oligomeric pores (>10 subunits) in the membrane, maintaining correct protein in1.Phage L protein sequence

Computational Workflow:

  1. Phage L protein sequence
  2. BLAST Search (find homologous lysis proteins)
  3. Multiple Sequence Alignment (Clustal Omega)
    • identify conserved vs mutable residues
  4. ESM Mutation Scanning (generate mutation heatmaps)
  5. Select Candidate Mutations (stability or N-terminal modifications)
  6. Structure Prediction (ESMFold)
  7. Complex/Oligomer Prediction (AlphaFold Multimer)
  8. Final Mutant Candidates (stable + functional lysis protein)

3. Proposed Engineering Pipeline

Computational workflow we will follow.

4. Expected Outcomes

Our pipeline aims to produce engineered variants of the MS2 L protein with:

  • Increased structural stability
  • Reduced aggregation risk
  • Maintained transmembrane insertion
  • Potentially reduced dependency on host DnaJ

These optimized proteins could be useful in applications such as:

  • Synthetic phage engineering
  • Bacterial ghost cell production
  • Antimicrobial protein development

5. Potential Pitfalls

5.1 Limited Training Data

Most protein language models and structural predictors are trained primarily on globular proteins, not small transmembrane phage toxins.

This may reduce prediction accuracy for MS2 L.

5.2 Risk of Over-Stabilization

Mutations designed to increase stability may cause:

  • Protein aggregation
  • Improper membrane insertion
  • Loss of functional oligomerization

Thus stability must be balanced with function.

5.3 Poor Annotation of Amurin Proteins

Single-gene lysis proteins (also called amurins) are poorly annotated in sequence databases.

This may limit the quality of homologous sequences used for alignment and training.

5.4 Host Protease Sensitivity

Mutations may unintentionally expose protease cleavage sites, making the engineered protein less stable inside bacterial cells.

6. Future Work

If promising mutants are identified computationally, the next steps would include:

  • Experimental expression in E. coli
  • Measuring lysis timing
  • Measuring protein stability
  • Testing DnaJ independence

This would validate whether computational predictions translate into improved biological function.

Week 5

L-Protein Mutants Project Summary:

@2026a-nourelden-rihan

The MS2 bacteriophage L-protein is a small 75-residue lysis protein with two functional regions: a soluble N-terminal domain (residues 1–40) that interacts with the bacterial chaperone DnaJ, and a transmembrane domain (residues 41–75) that disrupts the inner membrane to trigger host cell lysis. The goal of this project was to computationally design five single-point mutants with the potential to retain or enhance lysis activity. This was done by first running a Clustal Omega Multiple Sequence Alignment to map out conserved positions across homologs, all of which turned out to cluster exclusively in the soluble domain, pointing to the DnaJ interface as the most functionally constrained region of the protein. An LLR mutation heatmap was then generated and cross-referenced against an experimental lysis dataset; three mutations that appeared in both were excluded after showing a lysis score of zero experimentally, highlighting that LLR scores reflect structural stability rather than functional activity. The final five mutants, Y39L and S9Q in the soluble domain, and K50L, N53L, and T52L in the transmembrane domain, were selected based on the highest LLR scores while avoiding all conserved and experimentally failed positions.

Each mutant was then modeled as an octameric pore complex using AlphaFold Multimer to simulate the expected membrane assembly. The key findings were:

  • All five mutants formed a cylinder-like pore structure, consistent with the expected transmembrane assembly
  • pLDDT scores were low (< 50) across all models, and PAE plots showed near-zero confidence for inter-chain contacts
  • Running the same pipeline on R30Q, an experimentally validated lytic mutant, produced identical low-confidence results, suggesting this is a systematic limitation of AlphaFold for this class of small membrane-disrupting proteins rather than a reflection of the mutants’ actual viability
  • All five mutants remain candidates for wet lab validation

Full documentation, sequences, and AlphaFold outputs are available in the week 5 homework.

Ship 41 - Individual Final Project

cover image cover image

Ship 41

HTGAA Final Project Proposal

Project Title: Ship 41: Multi-Chassis Computational and Experimental Optimization of α-Pinene Biosynthesis for High-Value Bioproduction

Author: Nourelden Rihan Course: How to Grow (Almost) Anything — HTGAA 2026 Date: April 4, 2026 Industry Partners: Ginkgo Bioworks · Twist Bioscience · Waters Corporation · New England Biolabs · Millipore Sigma

For forty ships, Egypt sailed to Byblos. Ship 41 engineers what Byblos grew.


SECTION 1: ABSTRACT

Around 2600 BCE, Pharaoh Sneferu dispatched 40 ships across the Mediterranean to the port of Byblos, Lebanon — the only place in the ancient world where cedar could be sourced in abundance. The Palermo Stone records the expedition plainly: “bringing 40 ships laden with cedar wood.” Egypt’s soil grew no conifers. For over two millennia, one of history’s greatest civilizations remained dependent on a foreign tree for its ships, its temples, and its monuments. Ship 41 does not sail to Byblos. It engineers what Byblos grew.

α-Pinene is the principal volatile monoterpene of coniferous wood — the molecule that defined the cedar of Lebanon and made it irreplaceable to the ancient world. Today, it is recognized as one of the most commercially versatile natural products in existence, with applications spanning sustainable aviation fuel (via catalytic dimerization to JP-10), fragrance and flavor industries (as a precursor to synthetic camphor, linalool, and geraniol), pharmaceuticals (anti-inflammatory, antimicrobial, and bronchodilatory properties), and polymer chemistry (as a feedstock for terpene resins). While prior work has demonstrated α-pinene biosynthesis in Escherichia coli using the heterologously expressed Abies grandis geranyl diphosphate synthase (AgGPPS2) and α-pinene synthase (AgPS), no systematic multi-chassis, multi-carbon-source benchmarking study has been conducted to identify the optimal production host and feedstock for high-yield industrial bioproduction. This project — Ship 41 — addresses that gap by integrating computational flux balance analysis (FBA) using COBRApy genome-scale metabolic models across three industrially relevant chassis — E. coli BL21(DE3), Saccharomyces cerevisiae BY4741, and Yarrowia lipolytica CLIB89 — with a high-throughput automated experimental screening pipeline executed on Ginkgo Bioworks’ robotics infrastructure. Constructs encoding bicistronic AgGPPS2-AgPS-sfGFP-His6 cassettes will be synthesized as whole plasmids by Twist Bioscience and validated in cell-free expression systems prior to in vivo transformation. α-Pinene titers will be quantified by GC-MS using dodecane overlay extraction from 96-well deep-well plates, with growth monitored by OD₆₀₀ on the PHERAstar FSX. The best-performing chassis and carbon source combination identified in Aim 1 will inform strain optimization and pathway intensification strategies in subsequent aims, ultimately establishing a rational, scalable biological platform for α-pinene bioproduction across the full spectrum of its industrial applications.


SECTION 2: PROJECT AIMS

Aim 1 — Experimental Aim

The first aim of this project is to computationally predict and experimentally validate the optimal microbial chassis and carbon source for α-pinene biosynthesis by utilizing COBRApy flux balance analysis across three genome-scale metabolic models, followed by high-throughput automated screening of engineered E. coli, S. cerevisiae, and Y. lipolytica strains expressing Twist Bioscience-synthesized AgGPPS2-AgPS constructs, with α-pinene titers quantified by GC-MS — establishing a rational production baseline applicable across all high-value α-pinene markets.

This aim encompasses:

  • COBRApy FBA, FVA, and gene knockout analysis across iJO1366 (E. coli), Yeast9 (S. cerevisiae), and the latest Y. lipolytica GEM
  • Whole plasmid DNA synthesis ordered from Twist Bioscience (2 constructs)
  • Cell-free construct validation using Ginkgo Bioworks BL21 DE3 lysate + GFP fluorescence on the Spark Plate Reader
  • Automated multi-chassis, multi-carbon-source screening (glucose, glycerol, acetate, fatty acids) using Ginkgo Bioworks robotics
  • α-Pinene quantification by GC-MS (Waters Corporation) from dodecane overlay extractions

Aim 2 — Medium-Term Aim

Having identified the best-performing chassis and carbon source combination in Aim 1, the second aim is to intensify α-pinene production through chassis-specific pathway engineering and promoter optimization. This will involve redesigning the genetic construct with chassis-matched regulatory elements (e.g., T7 for E. coli, TEF1-intron for Y. lipolytica) and overexpressing MEP or MVA pathway bottleneck genes identified by FVA in Aim 1. Additionally, computationally identified beneficial gene knockouts (e.g., disruption of competing farnesyl diphosphate branches) will be introduced using CRISPR-Cas9 or homologous recombination, and the resulting strains will be subjected to fed-batch fermentation optimization to achieve titers relevant for downstream industrial application — whether in fragrance, pharmaceuticals, polymer chemistry, or fuel. This aim will leverage the Asimov Kernel Platform for genetic part design and the Cultivarium’s expertise in non-model organism strain engineering, particularly for Y. lipolytica metabolic rewiring.


Aim 3 — Visionary Aim

Engineer a living biofoundry-on-a-chip: a fully autonomous, closed-loop biosynthetic system where AI-guided metabolic models continuously retrain on fermentation data to evolve self-optimizing microbial strains capable of converting atmospheric CO₂ and waste feedstocks directly into α-pinene — the ancient molecule of cedar — and routing it on demand toward any of its industrial destinies: fuel, fragrance, medicine, or material.

In the long term, Ship 41 envisions a carbon-negative, AI-driven α-pinene biorefinery where genome-scale metabolic modeling, automated strain construction, and real-time fermentation analytics are deeply integrated in a Design-Build-Test-Learn (DBTL) cycle operating without human intervention. Because α-pinene sits at the entry point of an extraordinarily diverse downstream chemical space — it can be dimerized into JP-10 jet fuel, isomerized into camphor, oxidized into verbenone, rearranged into limonene or linalool precursors, or polymerized into terpene resins — a high-yield microbial production platform for this single molecule effectively unlocks a portfolio of sustainable industrial chemicals simultaneously. The broader impact extends to establishing a generalizable computational-experimental framework for rapid monoterpene pathway optimization in any chassis, democratizing access to high-value terpene bioproduction across the developing world using locally available feedstocks — and finally, permanently, solving the problem that sent Pharaoh Sneferu’s 40 ships north across the sea.


SECTION 3: BACKGROUND

The Lore of Ship 41

Around 2600 BCE, the Palermo Stone records one of the most consequential supply chain decisions in ancient history: Pharaoh Sneferu, founder of the Fourth Dynasty and builder of the first true pyramids, dispatched a fleet of 40 ships northward across the Mediterranean to the Phoenician port of Byblos — present-day Lebanon — to obtain cedar wood. The inscription reads plainly: “bringing 40 ships laden with cedar wood.” The following year, he had three further 52-meter ships built from the same imported timber, and cedar doors installed in his palace at Dahshur. Remnants of that very wood survive inside the Bent Pyramid to this day.

Egypt was timber-poor by geography. The Nile Valley produced date palms and sycamore figs — serviceable for furniture, inadequate for ships, insufficient for the monumental ambitions of a civilization that was simultaneously inventing the state. Cedar of Lebanon (Cedrus libani) was in a different category entirely: dense, resinous, aromatic, resistant to rot and insects, straight-grained enough for ship planks and long enough for obelisk sledges. It was the structural material of the ancient world’s greatest projects. And Egypt could not grow it.

For over two thousand years — through the Old Kingdom, the Middle Kingdom, the New Kingdom, through Sneferu and Khufu and Thutmose and Ramesses — Egypt continued to sail for cedar. The Wenamun Papyrus, written around 1075 BCE, documents an Egyptian official’s fraught journey to Byblos to negotiate cedar logs for a ceremonial barge, some 1,500 years after Sneferu’s fleet. The dependency never ended. The ships kept sailing.

α-Pinene is the primary monoterpene constituent of cedar heartwood resin — the molecule responsible for the wood’s legendary resistance to decay, its distinctive scent, and much of its structural character. It is, in a molecular sense, what Egypt was sailing for. Today, α-pinene is understood as one of the most industrially versatile natural products known: a precursor to sustainable aviation fuel (JP-10), a feedstock for fragrance and flavor chemistry, a pharmaceutical candidate with documented anti-inflammatory and antimicrobial properties, and a building block for terpene-based polymers.

Ship 41 does not sail to Byblos. We are the 41st ship — except we carry no cargo hold and need no harbor. We carry a genome-scale metabolic model, a codon-optimized gene construct, and a 96-well deep plate. We are not imitators of that civilization. We are Egyptians, and we are its continuation. The cedar Egypt could never grow on its own soil, we now produce in a flask — on glucose, on glycerol, on fatty acids, from microbes of our design. The loop that opened with Sneferu’s fleet closes here.


Literature Context

Sarria et al. (2014) demonstrated the first microbial synthesis of pinene in E. coli by co-expressing a codon-optimized A. grandis GPPS and pinene synthase alongside an enhanced MEP pathway, achieving titers of approximately 32 mg/L of total pinene — establishing proof-of-concept for monoterpene biofuel production in a bacterial chassis. Subsequent work by Cao et al. (2016) expanded this effort by systematically engineering the E. coli MEP pathway through overexpression of rate-limiting enzymes (DXS, IspG, IspH) and achieved improved α-pinene titers, demonstrating that carbon flux toward the isoprenoid precursor pool is a critical bottleneck in monoterpene production. In parallel, the oleaginous yeast Yarrowia lipolytica has emerged as a compelling alternative chassis for terpenoid production due to its naturally high acetyl-CoA flux, tolerance to hydrophobic compounds, and compatibility with lipid-rich feedstocks, with several studies documenting its superior performance for sesquiterpene and diterpene accumulation relative to S. cerevisiae. Together, these bodies of work reveal a critical knowledge gap: no systematic, quantitative comparison of α-pinene production across these three industrially relevant chassis under matched genetic designs and multiple carbon sources has been performed, leaving the field without a rational basis for chassis selection during scale-up.


Innovation

This project is novel in three key dimensions. First, it is the first study to apply genome-scale metabolic modeling via COBRApy — including FBA, FVA, and systematic gene knockout screening — across all three chassis (E. coli, S. cerevisiae, Y. lipolytica) simultaneously, providing a computational-first rational framework for chassis selection that goes beyond empirical trial-and-error. Second, by using a single standardized genetic insert (AgGPPS2-AgPS-sfGFP-His6, synthesized by Twist Bioscience) expressed from host-compatible promoters, this work controls for genetic design variables and isolates the contribution of chassis metabolism and carbon source to α-pinene yield — a controlled comparison that has not previously been reported. Third, the integration of cell-free expression-based construct validation prior to in vivo transformation, combined with fully automated GC-MS-coupled high-throughput screening on Ginkgo Bioworks’ liquid handling robotics, represents a methodological advance in how monoterpene pathway screening is conducted at the course and early research stage.


Significance

α-Pinene is not a single-market molecule — it is a gateway compound whose downstream chemical space spans at least four major industrial sectors simultaneously. In sustainable fuels, it can be catalytically dimerized and hydrogenated into JP-10 (volumetric energy density ~40 MJ/L), a high-density jet fuel of particular value for military and advanced air mobility applications where energy density per volume is critical. In fragrance and flavor, α-pinene is a direct precursor to synthetic camphor, α-terpineol, linalool, and geraniol — globally traded aroma chemicals with a combined market exceeding $6 billion annually. In pharmaceuticals, α-pinene and its oxidized derivatives exhibit documented anti-inflammatory, bronchodilatory, and antimicrobial activities, with active research programs in respiratory and oncology indications. In polymer chemistry, terpene resins derived from pinene are used as tackifiers in pressure-sensitive adhesives, a market valued at over $3 billion. Building a high-yield microbial production platform for α-pinene therefore does not serve a single industry — it serves all of them. The significance of this project lies precisely in that breadth: by identifying the optimal chassis and carbon source through a rigorous computational-experimental framework, this work creates a foundational production baseline from which any downstream application can be pursued. Furthermore, the multi-chassis, multi-carbon-source comparative approach developed here is immediately generalizable to other high-value monoterpenes — limonene, linalool, geraniol — making Ship 41 not just a project about α-pinene, but a template for rational monoterpene bioproduction at large.


Bioethical Considerations

Paragraph 1 — Ethical Considerations: The engineering of microorganisms for high-value chemical production raises important biosafety and environmental ethics questions that must be considered from the earliest stages of project design. The three chassis used in this project (E. coli BL21(DE3), S. cerevisiae BY4741, and Y. lipolytica CLIB89) are all well-characterized biosafety level 1 organisms with long histories of safe use in industrial biotechnology, and none possess pathogenic traits. The synthetic constructs encoding AgGPPS2 and AgPS do not confer selective advantages in natural environments, as monoterpene overproduction is metabolically costly and α-pinene itself is a volatile compound that does not accumulate in soil or water in toxic concentrations under normal conditions. Nevertheless, the intentional design of organisms with enhanced metabolic capabilities — even benign ones — raises broader questions about genetic sovereignty, access to engineered production strains, and the economic impact on existing natural α-pinene supply chains, including the global turpentine and pine resin industries that employ thousands of workers across developing economies. Any project advancing toward commercial bioproduction must engage with these displacement effects transparently and consider benefit-sharing frameworks with communities whose livelihoods depend on natural terpene harvesting.

Paragraph 2 — Responsible Implementation and Risk Mitigation: All experimental work in this project will be conducted under BSL-1 containment at Ginkgo Bioworks’ certified laboratory facility, with all engineered strains handled, stored, and disposed of in accordance with institutional biosafety committee (IBC) protocols. All DNA constructs will be screened through SecureDNA prior to synthesis to ensure that no sequences of concern are inadvertently introduced during construct design. Engineered strains will be maintained with auxotrophic markers or antibiotic selection dependencies that prevent survival outside controlled laboratory conditions, and no environmental release is contemplated at any stage of this project. Longer-term, should this work advance toward pilot-scale fermentation, engagement with regulatory agencies (EPA, FDA, USDA) under the Coordinated Framework for Regulation of Biotechnology will be initiated proactively, and open publication of all metabolic modeling code and construct sequences on platforms such as Addgene and GitHub will ensure that the scientific community can build on, scrutinize, and improve this work without proprietary barriers.


SECTION 4: EXPERIMENTAL DESIGN

Overview

The experimental workflow proceeds in three logical phases: (1) Computational prediction, (2) Construct design and cell-free validation, and (3) Automated in vivo screening. All liquid handling automation is performed at Ginkgo Bioworks unless otherwise noted. Opentrons OT-2 may be used for preparatory steps at the bench.


Detailed Workflow (17 Steps)


Step 1 — COBRApy Flux Balance Analysis (FBA)

  • Method: Download the latest genome-scale metabolic models — iJO1366 (E. coli), Yeast9 (S. cerevisiae), and iYLI647 or the latest curated model for Y. lipolytica — from BiGG or the respective repositories. Add a heterologous α-pinene exchange reaction and set the objective function to maximize α-pinene flux. Simulate FBA under four carbon sources (glucose, glycerol, acetate, fatty acids/oleate) at matched carbon equivalents.
  • Automation: Python/Jupyter notebook (local compute)
  • Plate: N/A (computational)
  • Expected Result: Predicted maximum theoretical α-pinene yield (mmol/g DCW/h) per chassis per carbon source; ranking of chassis–feedstock combinations
  • Timeline: Days 1–4

Step 2 — COBRApy Flux Variability Analysis (FVA)

  • Method: Run FVA on the top two chassis–feedstock combinations identified in Step 1. Identify reactions with high flux variability intersecting the MEP/MVA pathway and GPP branch point (IDI, GPPS). These are metabolic bottlenecks for experimental engineering targets.
  • Automation: Python/Jupyter notebook
  • Plate: N/A
  • Expected Result: List of bottleneck reactions; identification of gene overexpression targets (e.g., dxs, idi, ispG)
  • Timeline: Days 3–5

Step 3 — COBRApy Gene Knockout Screening

  • Method: Perform single and double gene knockout simulations (using cobra.flux_analysis.single_gene_deletion) targeting competing isoprenoid branches (e.g., ispA farnesyl diphosphate synthase competition, erg9 squalene synthase in yeast). Identify knockouts that improve GPP/α-pinene flux without abolishing growth.
  • Automation: Python/Jupyter notebook
  • Plate: N/A
  • Expected Result: Ranked list of gene knockouts predicted to increase α-pinene yield ≥ 1.5× relative to wild-type flux; shortlist for Aim 2 CRISPR targeting
  • Timeline: Days 4–6

Step 4 — DNA Construct Design

  • Method: Design two whole plasmid constructs using Benchling or SnapGene. Construct 1 (E. coli): pET28a backbone with bicistronic T7→RBS→AgGPPS2→RBS→AgPS-sfGFP-His6→T7term. Construct 2 (S. cerevisiae / Y. lipolytica shuttle): p426TEF backbone with TEF1→AgGPPS2→CYC1term; TEF2→AgPS-His6→ADH1term; URA3; 2μ ori. All coding sequences are E. coli or yeast codon-optimized using the Integrated DNA Technologies (IDT) codon optimization tool.
  • Automation: Benchling (design); IDT Codon Optimization Tool (sequence)
  • Plate: N/A
  • Expected Result: Two complete annotated plasmid maps ready for Twist submission; GenBank files exported (see Appendix A)
  • Timeline: Days 5–7

Step 5 — Twist Bioscience DNA Order

  • Method: Submit both plasmid designs (GenBank format) to Twist Bioscience as whole plasmid synthesis orders via the Twist online portal. Select clonal gene backbone option. Specify kanamycin resistance for Construct 1 and uracil prototrophy for Construct 2. Expected delivery: 10–14 business days.
  • Partner: Twist Bioscience
  • Plate: N/A (shipping)
  • Expected Result: Two lyophilized whole plasmids, sequence-verified by Twist, received at Ginkgo Bioworks
  • Timeline: Days 7–21 (concurrent with Steps 6–7)

Step 6 — Cell-Free System Preparation

  • Method: Prepare cell-free expression (CFE) reactions using Ginkgo Bioworks BL21 DE3 lysate + master mix (pre-prepared). Rehydrate Construct 1 plasmid (Twist-supplied) to 50 ng/μL in nuclease-free water. Design a CFE plate with titrated plasmid concentrations (0, 1, 5, 10, 25, 50 ng/μL) in triplicate to determine optimal DNA input for GFP expression.
  • Automation (Liquid Transfer): Echo525 acoustic liquid handler — transfers plasmid DNA and master mix into 384 Greiner black-well clear-bottom plate (384-well black-well clear-bottom)
  • Plate: 384 Greiner black-well clear-bottom
  • Expected Result: CFE reaction volumes of 10 μL per well, zero dead volume, accurate nanoliter-scale transfers
  • Timeline: Day 22

Step 7 — Cell-Free Construct Validation (sfGFP Readout)

  • Method: Seal the 384-well CFE plate with breathable A4s seal (A4S seal applied by Plateloc). Incubate at 29°C for 3 hours in the Inheco Plate Incubator. Read GFP fluorescence (Ex: 488 nm / Em: 520 nm) on the Spark Plate Reader at t=0, t=1h, t=2h, t=3h.
  • Machines: Plateloc (A4s seal) → Inheco Plate Incubator → Spark Plate Reader
  • Plate: 384 Greiner black-well clear-bottom
  • Expected Result: GFP fluorescence signal ≥ 5× background in wells containing Construct 1, confirming successful transcription/translation of AgPS-sfGFP fusion; dose-response curve identifies optimal DNA input concentration
  • Timeline: Day 22–23

Step 8 — E. coli Transformation

  • Method: Transform Construct 1 (pET28a-AgGPPS2-AgPS-sfGFP-His6) into chemically competent E. coli BL21(DE3) cells (NEB C2527I) by heat shock (42°C, 30 sec). Recover in SOC medium for 1 hour at 37°C. Plate on LB + kanamycin (50 μg/mL). Pick 6 colonies; verify by colony PCR using T7 promoter and T7 terminator primers on the ATC Thermal Cycler.
  • Machines: ATC Thermal Cycler (colony PCR)
  • Plate: 96-Armadillo-PCR-AB2396X (colony PCR reactions)
  • Expected Result: ≥4/6 colonies confirmed correct insert by PCR band size (~3.4 kb); one verified colony inoculated into overnight LB + Kan culture
  • Timeline: Days 22–24

Step 9 — S. cerevisiae and Y. lipolytica Transformation

  • Method: Transform Construct 2 (p426TEF-AgGPPS2-AgPS-His6) into S. cerevisiae BY4741 (URA3Δ) by lithium acetate/PEG/heat shock protocol. Separately, adapt the construct backbone and transform into Y. lipolytica CLIB89 using electroporation (Gene Pulser, external step). Select on SC-Ura dropout plates (yeast) and YPD + hygromycin plates (Yarrowia). Verify transformants by colony PCR.
  • Machines: ATC Thermal Cycler (colony PCR)
  • Plate: 96-Armadillo-PCR-AB2396X
  • Expected Result: Verified transformants for both yeast species; starter cultures established in YPD or SC-Ura medium
  • Timeline: Days 23–26

Step 10 — Overnight Seed Culture Preparation

  • Method: Inoculate verified E. coli (LB + Kan), S. cerevisiae (SC-Ura), and Y. lipolytica (YPD) transformants into 5 mL liquid cultures. Grow overnight at 37°C (E. coli), 30°C (S. cerevisiae), 28°C (Y. lipolytica) with 200 rpm shaking. Measure OD₆₀₀ to normalize inoculation density.
  • Machines: Cytomat (30°C shaking incubator for yeast/Yarrowia); standard bench shaker for E. coli overnight
  • Plate: N/A (tube culture)
  • Expected Result: OD₆₀₀ of 1.5–4.0 for all three seed cultures; normalize to OD₆₀₀ = 0.05 for plate inoculation
  • Timeline: Days 26–27

Step 11 — Carbon Source Screening Plate Setup (96-Well Deep)

  • Method: Prepare the multi-chassis × multi-carbon-source screening plate using a 96-well deep plate (96-v-eppendorf-951033502-deep). Add 500 μL of appropriate defined media + carbon source per well (see plate layout below). Dispense media using the Multiflo Automated Microplate Dispenser. Add dodecane overlay (100 μL/well) to trap volatile α-pinene. Inoculate each well with normalized seed culture using the Bravo-96 plate stamp from a source plate.
  • Machines: Multiflo (media + carbon source dispense) → Bravo-96 (inoculation stamp) → Plateloc (A4s breathable seal)
  • Plate: 96-v-eppendorf-951033502-deep
  • Expected Result: 96-well plate fully inoculated with 12 chassis-carbon source combinations in triplicate, plus negative controls and blanks; dodecane overlay in place
  • Timeline: Day 27

Step 12 — Induction and Incubation

  • Method: For E. coli wells, induce with IPTG (0.5 mM final, dispensed by Echo525 from 100 mM IPTG stock in DMSO) at OD₆₀₀ ≈ 0.4–0.6. Transfer sealed plate to Cytomat shaking incubator (30°C, 300 rpm) for 48 hours. For yeast and Yarrowia wells (no induction required; constitutive TEF1 promoter), incubate directly in Cytomat at 28–30°C, 300 rpm, 48 hours.
  • Machines: Echo525 (IPTG addition, nanoliter precision) → Cytomat (48h shaking incubation)
  • Plate: 96-v-eppendorf-951033502-deep
  • Expected Result: All strains growing under respective conditions; dodecane overlay capturing emitted α-pinene over 48 hours
  • Timeline: Days 27–29

Step 13 — OD₆₀₀ Growth Monitoring

  • Method: At t=0, t=12h, t=24h, t=48h, briefly remove plate from Cytomat, peel A4s seal (XPeel plate peeler), read OD₆₀₀ on the PHERAstar FSX (absorbance module, 600 nm). Reseal with A4s (Plateloc) and return to Cytomat.
  • Machines: XPeel → PHERAstar FSX → Plateloc → Cytomat
  • Plate: 96-v-eppendorf-951033502-deep
  • Expected Result: Growth curves for all 12 conditions; identification of chassis-carbon source pairs with optimal growth without product toxicity; expected OD₆₀₀ range: 1.5–6.0 at 48h
  • Timeline: Days 27–29 (concurrent with Step 12)

Step 14 — Dodecane Layer Extraction and GC-MS Sample Preparation

  • Method: After 48h incubation, transfer 80 μL of the dodecane overlay from each well into a fresh 384-well Echo PP plate using the Bravo-384 plate stamp or liquid handler. Dilute samples 1:5 in dodecane. Prepare an α-pinene standard curve in dodecane (0, 1, 5, 10, 25, 50, 100, 200 mg/L) in the remaining wells of the Echo PP plate. Submit plate to GC-MS analysis (Waters GC-MS system, external analytical service via Waters Corporation collaboration).
  • Machines: Bravo-384 (dodecane transfer) → Echo PP plate (sample consolidation)
  • Plate: 384-well Plate Echo PP
  • Expected Result: Collected dodecane samples for 36 experimental wells + 12 control wells + 8-point standard curve; samples ready for GC-MS injection
  • Timeline: Day 29–30

Step 15 — GC-MS α-Pinene Quantification

  • Method: Inject dodecane samples onto GC-MS (Waters Xevo TQ-S or equivalent; column: HP-5MS, 30m × 0.25mm × 0.25μm; oven program: 40°C → 200°C at 10°C/min; MS detection: SIM mode m/z 136 for α-pinene, retention time ~5.8 min). Quantify α-pinene titers from standard curve. Calculate volumetric titer (mg/L culture) and specific yield (mg/g DCW).
  • Partner: Waters Corporation (GC-MS instrumentation and method)
  • Plate: N/A (GC-MS vials)
  • Expected Result: α-Pinene titers for all 12 chassis-carbon source combinations; identification of best-performing condition
  • Timeline: Days 30–32

Step 16 — qPCR Expression Validation

  • Method: Collect 1 mL cell pellets from each well at t=48h. Extract RNA using RNeasy Mini Kit (Qiagen). Perform RT-qPCR on the CFX Opus qPCR machine using primers targeting AgGPPS2, AgPS, and reference genes (16S rRNA for E. coli; ACT1 for yeasts). Confirm that mRNA expression levels of both transgenes correlate with α-pinene titer.
  • Machines: HiG Centrifuge (pellet) → CFX Opus (qPCR)
  • Plate: 96-Armadillo-PCR-AB2396X (qPCR reactions)
  • Expected Result: ΔΔCt values showing 10–100-fold overexpression of AgGPPS2 and AgPS relative to empty vector controls; correlation between transcript level and titer confirms pathway expression is functional
  • Timeline: Days 30–33

Step 17 — IPTG Dose-Response Optimization (E. coli)

  • Method: For the best-performing E. coli condition identified in Step 15, perform a fine-grained IPTG dose-response experiment (0, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0 mM) in a 384-well format using the Echo525 for IPTG titration. Monitor GFP fluorescence (sfGFP fusion as proxy for AgPS expression) on the Spark Plate Reader and collect dodecane samples for GC-MS at 48h.
  • Machines: Echo525 (IPTG titration) → Cytomat (incubation) → Spark (GFP readout) → GC-MS (titer)
  • Plate: 384 Greiner black-well clear-bottom (fluorescence) + 96-v-eppendorf deep (culture)
  • Expected Result: Optimal IPTG concentration identified (expected: 0.1–0.5 mM); GFP signal correlates with α-pinene titer; induction condition carried forward to Aim 2
  • Timeline: Days 33–37

Assay Plate Layout — 96-Well Deep Plate (Step 11–13)

The following layout illustrates the experimental design for the multi-chassis carbon source screen. All volumes: 500 μL defined media + carbon source + 100 μL dodecane overlay.

         Col 1       Col 2       Col 3       Col 4       Col 5       Col 6       Col 7       Col 8       Col 9       Col 10      Col 11      Col 12
       ┌───────────┬───────────┬───────────┬───────────┬───────────┬───────────┬───────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
Row A  │ Ec-Glc(1) │ Ec-Glc(2) │ Ec-Glc(3) │ Ec-Gly(1) │ Ec-Gly(2) │ Ec-Gly(3) │ Ec-Ac (1) │ Ec-Ac (2) │ Ec-Ac (3) │ Ec-FA (1) │ Ec-FA (2) │ Ec-FA (3) │
       ├───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
Row B  │ Sc-Glc(1) │ Sc-Glc(2) │ Sc-Glc(3) │ Sc-Gly(1) │ Sc-Gly(2) │ Sc-Gly(3) │ Sc-Ac (1) │ Sc-Ac (2) │ Sc-Ac (3) │ Sc-FA (1) │ Sc-FA (2) │ Sc-FA (3) │
       ├───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
Row C  │ Yl-Glc(1) │ Yl-Glc(2) │ Yl-Glc(3) │ Yl-Gly(1) │ Yl-Gly(2) │ Yl-Gly(3) │ Yl-Ac (1) │ Yl-Ac (2) │ Yl-Ac (3) │ Yl-FA (1) │ Yl-FA (2) │ Yl-FA (3) │
       ├───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
Row D  │ NC-Glc(1) │ NC-Glc(2) │ NC-Glc(3) │ NC-Gly(1) │ NC-Gly(2) │ NC-Gly(3) │ NC-Ac (1) │ NC-Ac (2) │ NC-Ac (3) │ NC-FA (1) │ NC-FA (2) │ NC-FA (3) │
       ├───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
Row E  │  BLK (1)  │  BLK (2)  │  BLK (3)  │  BLK (4)  │  BLK (5)  │  BLK (6)  │  BLK (7)  │  BLK (8)  │  BLK (9)  │  BLK(10)  │  BLK(11)  │  BLK(12)  │
       ├───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
Row F  │ STD 0     │ STD 0     │ STD 1     │ STD 1     │ STD 5     │ STD 5     │ STD 10    │ STD 10    │ STD 25    │ STD 25    │ STD 50    │ STD 50    │
       ├───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
Row G  │ STD 100   │ STD 100   │ STD 200   │ STD 200   │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │
       ├───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
Row H  │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │  SPARE    │
       └───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

Legend:

  • Ec = E. coli BL21(DE3) + pET28a-AgGPPS2-AgPS-sfGFP-His6
  • Sc = S. cerevisiae BY4741 + p426TEF-AgGPPS2-AgPS-His6
  • Yl = Y. lipolytica CLIB89 + p426TEF-AgGPPS2-AgPS-His6 (Yarrowia-optimized)
  • NC = Negative control — chassis without plasmid, matched carbon source
  • BLK = Media blank — no cells, no plasmid
  • STD = α-Pinene standard curve in dodecane (0, 1, 5, 10, 25, 50, 100, 200 mg/L) for GC-MS calibration
  • Glc = Glucose 20 g/L | Gly = Glycerol 20 g/L | Ac = Sodium acetate 20 mM | FA = Oleic acid 5 g/L
  • Numbers in parentheses = biological replicates (1), (2), (3)

SECTION 5: TECHNIQUES, TOOLS, AND TECHNOLOGY

HTGAA Technique Checklist

Select all techniques relevant to this project:

  • DNA synthesis / gene synthesis (Twist Bioscience whole plasmid synthesis)
  • Codon optimization (IDT codon optimization tool for E. coli and yeast)
  • Restriction cloning / Gibson Assembly (backup cloning strategy if whole plasmid synthesis fails)
  • Transformation (heat shock for E. coli; lithium acetate/PEG for yeast; electroporation for Y. lipolytica)
  • Colony PCR (construct verification using ATC Thermal Cycler)
  • Cell-free expression (Ginkgo Bioworks BL21 DE3 lysate + master mix)
  • Fluorescence reporter assay (sfGFP, Spark Plate Reader)
  • Liquid handling automation (Echo525, Multiflo, Bravo-96, Bravo-384)
  • Microplate incubation (Inheco, Cytomat)
  • Absorbance / growth monitoring (OD₆₀₀, PHERAstar FSX)
  • qPCR / RT-qPCR (CFX Opus, gene expression validation)
  • GC-MS (α-pinene quantification, Waters Corporation)
  • Genome-scale metabolic modeling (COBRApy FBA, FVA, gene knockout)
  • Protein purification / affinity chromatography (His6 tag, Ni-NTA resin, Aim 2)
  • Biosafety / SecureDNA screening (construct safety verification)

Technique Expansion

Technique 1: Cell-Free Protein Expression (CFPE)

Cell-free protein expression (CFPE) is an in vitro transcription-translation system in which a cell lysate — containing ribosomes, RNA polymerases, tRNAs, and associated cofactors — is combined with an energy regeneration system and a DNA or RNA template to produce functional proteins outside of a living cell. In this project, CFPE using Ginkgo Bioworks’ pre-prepared BL21 DE3 lysate serves as a rapid validation gateway for Construct 1, allowing confirmation that the AgPS-sfGFP fusion protein is correctly transcribed and translated before committing to the more time-intensive in vivo transformation, selection, and growth experiments. The sfGFP reporter fused to AgPS provides a direct, plate-reader-detectable proxy for construct functionality: if GFP fluorescence is observed in the CFE reaction above background, it confirms that the T7 promoter, RBS, and coding sequence architecture are intact and functional, significantly reducing the risk of propagating non-functional constructs into downstream experiments. CFPE is also advantageous because it eliminates concerns about cell viability, metabolic burden, or plasmid instability that can confound in vivo expression, making it an ideal screening tool for rapidly evaluating construct variants before committing to in vivo work.


Technique 2: Genome-Scale Metabolic Modeling with COBRApy

Genome-scale metabolic modeling (GSMM) is a constraint-based computational approach that uses a mathematical reconstruction of all known metabolic reactions in an organism — encoded as a stoichiometric matrix — to predict metabolic fluxes under defined growth conditions using linear programming. COBRApy (COnstraint-Based Reconstruction and Analysis for Python) is the leading open-source software package for GSMM, supporting FBA, FVA, gene knockout simulation, and phenotype phase plane analysis on genome-scale metabolic models (GEMs) containing thousands of reactions and metabolites. In this project, COBRApy is used in three distinct modes: first, FBA maximizes a synthetic α-pinene exchange flux under each chassis–carbon source combination to predict theoretical production yields; second, FVA identifies reactions whose flux ranges overlap with MEP/MVA pathway intermediates, pinpointing bottleneck steps that constrain GPP availability for pinene synthesis; and third, single and double gene knockout simulations identify genetic modifications that redirect carbon flux away from competing terpenoid branches (e.g., farnesyl diphosphate and squalene synthesis) toward the desired α-pinene product without lethality. The computational predictions generated by COBRApy directly inform both the experimental design of this project — guiding which chassis and carbon source to prioritize in the automated screen — and the strain engineering targets for Aim 2, creating a tightly integrated computational-experimental DBTL cycle.


SECTION 6: PROJECT VALIDATION

10a — Validation Choice

The primary validation experiment for Aim 1 is cell-free expression of Construct 1 (pET28a-AgGPPS2-AgPS-sfGFP-His6) with GFP fluorescence readout on the Spark Plate Reader, chosen because it directly confirms that the DNA construct synthesized by Twist Bioscience produces a functional, detectable protein product before any in vivo transformation is attempted, and because sfGFP fluorescence serves as a quantitative, plate-reader-compatible proxy for AgPS expression that can be measured in a single automated 3-hour experiment with no specialized equipment beyond the Ginkgo Bioworks automation stack.


10b — Validation Protocol (Step-by-Step)

Materials:

  • Ginkgo Bioworks BL21 DE3 cell-free lysate + master mix (thaw on ice)
  • Construct 1 plasmid DNA (Twist Bioscience, resuspended to 50 ng/μL in nuclease-free water)
  • 384 Greiner black-well clear-bottom plate
  • A4s breathable seal + Plateloc sealer
  • Echo525 acoustic liquid handler
  • Inheco Plate Incubator (pre-equilibrated to 29°C)
  • Spark Plate Reader

Protocol:

  1. Thaw BL21 DE3 lysate and master mix on ice (15 min). Gently mix by inversion; do not vortex.
  2. Prepare source plate: In a 384-well Echo PP plate, prepare Construct 1 DNA at 6 concentrations: 0, 1, 5, 10, 25, 50 ng/μL (20 μL each) in columns 1–6. Prepare water-only negative control in column 7.
  3. Using Echo525, transfer 20 nL of each DNA concentration into the destination 384 Greiner black-well plate (4 replicates per concentration = 24 wells total).
  4. Using Multiflo, dispense 10 μL of cell-free master mix + lysate (pre-mixed 7:3 v/v on ice) into each well of the 384 destination plate.
  5. Apply A4s breathable seal using Plateloc to prevent evaporation while allowing gas exchange.
  6. Transfer sealed plate to Inheco Plate Incubator (29°C). Begin incubation timer (3 hours total).
  7. At t=0, t=60 min, t=120 min, t=180 min: transfer plate to Spark Plate Reader for GFP fluorescence measurement (Ex 488 nm / Em 520 nm / gain 60 / 25 flashes per well). Return to Inheco after each read.
  8. Export raw fluorescence data to CSV. Subtract background (water-only control well average from each timepoint).
  9. Plot background-corrected fluorescence vs. DNA concentration vs. time. Identify optimal DNA input (expected: 5–25 ng/μL) for maximum GFP signal.
  10. Confirm: ≥ 5-fold fluorescence increase over background at optimal DNA concentration indicates successful construct expression. Flag for in vivo transformation.

10c — Techniques Used

This validation protocol employs four distinct techniques in an integrated pipeline. First, cell-free protein expression leverages the BL21 DE3 transcription-translation machinery to produce the AgPS-sfGFP fusion protein directly from plasmid DNA without any cellular growth, transformation, or selection steps, making it the fastest possible route to functional protein validation. Second, acoustic liquid handling via the Echo525 enables nanoliter-precision transfer of DNA samples across a concentration gradient into the 384-well plate without tip contamination or carryover, which is critical for generating a clean dose-response curve with minimal reagent consumption. Third, fluorescence plate reader detection on the Spark provides a quantitative, kinetic readout of sfGFP accumulation over the 3-hour incubation period, allowing observation of the expression time course and identification of the plateau phase that indicates maximal translation of the AgPS fusion. Fourth, automated plate sealing and incubation management via the Plateloc (A4s seal), Inheco incubator, and programmed Spark time-course reading ensures that all replicates experience identical environmental conditions throughout the experiment, which is essential for reliable, reproducible kinetic fluorescence data that can be compared across DNA concentrations.


10d — Hypothetical Results

Hypothetical GFP Fluorescence — Cell-Free Validation (Aim 1, Step 7)

The following table represents hypothetical cell-free expression results for Construct 1 at t=180 min, showing GFP fluorescence (relative fluorescence units, RFU) as a function of plasmid DNA input concentration:

DNA Input (ng/μL)Rep 1 (RFU)Rep 2 (RFU)Rep 3 (RFU)Rep 4 (RFU)Mean RFUStd Dev
0 (water control)312298321305309±9.7
11,8471,9231,7881,9021,865±59.1
57,4127,0887,5217,2037,306±191
1014,83215,20114,67715,08814,950±237
2521,40322,10721,88921,64421,761±295
5021,98222,39121,67722,10322,038±289

Interpretation: GFP signal saturates between 25–50 ng/μL DNA input, indicating the cell-free system is limiting (not DNA), consistent with published CFE saturation kinetics. The 47-fold signal-to-background at 10 ng/μL confirms robust expression of the AgPS-sfGFP fusion. Optimal DNA concentration for downstream cell-free assays: 10–25 ng/μL.


Hypothetical α-Pinene Titers — In Vivo Multi-Chassis Screen (Aim 1, Step 15)

The following table shows hypothetical GC-MS α-pinene titers (mg/L) at t=48h across all chassis and carbon source combinations:

ChassisCarbon Sourceα-Pinene Titer (mg/L)Std DevOD₆₀₀ at 48h
E. coli BL21(DE3)Glucose85.3±8.23.2
E. coli BL21(DE3)Glycerol112.7±11.42.8
E. coli BL21(DE3)Acetate43.2±6.81.9
E. coli BL21(DE3)Fatty acids28.4±4.11.4
S. cerevisiae BY4741Glucose67.8±9.34.1
S. cerevisiae BY4741Glycerol78.2±8.73.6
S. cerevisiae BY4741Acetate31.5±5.22.3
S. cerevisiae BY4741Fatty acids45.3±7.13.8
Y. lipolytica CLIB89Glucose58.4±7.85.2
Y. lipolytica CLIB89Glycerol89.1±10.24.8
Y. lipolytica CLIB89Acetate22.7±4.32.1
Y. lipolytica CLIB89Fatty acids134.6±15.85.9
Negative controls (all)All<2.1±0.4varied

Interpretation: Two conditions emerge as clear leaders: E. coli on glycerol (112.7 mg/L) and Y. lipolytica on fatty acids (134.6 mg/L). This is consistent with COBRApy predictions — glycerol feeds the MEP pathway efficiently in E. coli, while Y. lipolytica’s native high-flux acetyl-CoA metabolism is amplified by fatty acid substrates. Acetate performs poorly across all chassis, suggesting it imposes pH stress or insufficient flux. These two winning conditions advance to Aim 2 strain optimization.

α-Pinene Titer by Chassis and Carbon Source (Hypothetical, t=48h)

mg/L
150 |                                                    ████
    |                                                    ████
    |              ████                                  ████
120 |              ████                    ████          ████
    |              ████                    ████          ████
    |       ████   ████                   ████          ████
 90 |       ████   ████   ████            ████   ████   ████
    |       ████   ████   ████     ████   ████   ████   ████
    |       ████   ████   ████     ████   ████   ████   ████
 60 |       ████   ████   ████     ████   ████   ████   ████
    |       ████   ████   ████     ████   ████   ████   ████
    |       ████   ████   ████     ████   ████   ████   ████
 30 |       ████   ████   ████     ████   ████   ████   ████   ████
    |       ████   ████   ████     ████   ████   ████   ████   ████
    |       ████   ████   ████     ████   ████   ████   ████   ████
  0 +───────────────────────────────────────────────────────────────
       Ec-Glc  Ec-Gly  Ec-Ac   Ec-FA   Sc-Glc  Sc-Gly  Yl-Gly  Yl-FA

Legend: Ec=E. coli, Sc=S. cerevisiae, Yl=Y. lipolytica
        Glc=Glucose, Gly=Glycerol, Ac=Acetate, FA=Fatty acids

Troubleshooting

Challenge 1 — Low or absent GFP signal in cell-free validation. If GFP fluorescence fails to rise above background in the CFPE experiment, the most likely causes are plasmid damage during shipping/resuspension, an incorrect construct architecture (e.g., frame shift in the sfGFP fusion junction), or degraded lysate; troubleshooting steps include running the Twist-supplied plasmid on an agarose gel to confirm supercoiled band integrity, re-sequencing the construct junction region with Sanger sequencing, and testing a positive control plasmid (e.g., a known GFP-expressing plasmid) in parallel with the same lysate batch.

Challenge 2 — No detectable α-pinene in GC-MS despite confirmed GFP expression. α-Pinene is a highly volatile monoterpene (boiling point: 155°C) that can escape from culture wells if the dodecane overlay evaporates or if the plate seal is compromised; if GC-MS returns below-detection titers from experimental wells that showed GFP expression, the dodecane overlay volume should be increased to 200 μL, the seal integrity should be verified, and a second extraction with fresh dodecane should be performed to confirm the volatile product is being captured.

Challenge 3 — Poor yeast transformation efficiency. If S. cerevisiae or Y. lipolytica transformants are not recovered at sufficient frequency, an alternative strategy is to clone the insert into a linearized integrating vector (e.g., pRS306 for yeast) for genomic integration at a defined locus (e.g., HIS3), which typically gives more stable expression and avoids the metabolic burden of maintaining a high-copy episomal plasmid.

Challenge 4 — Growth inhibition by α-pinene accumulation. Monoterpenes including α-pinene can be toxic to microbial membranes at concentrations above ~200 mg/L; if growth curves (OD₆₀₀) plateau abnormally early in the highest-producing wells, a two-phase fermentation strategy using an increased dodecane overlay ratio (1:1 v/v culture:dodecane) can be implemented to continuously extract α-pinene from the aqueous phase, relieving product toxicity and potentially improving total titer by 2–5× as demonstrated in the literature for similar volatile terpenoids.


SECTION 7: ADDITIONAL INFORMATION

References

  • Sarria, S., Wong, B., Martín, H. G., Keasling, J. D., & Peralta-Yahya, P. (2014). Microbial synthesis of pinene. ACS Synthetic Biology, 3(7), 466–475.
  • Cao, X., Wei, L.-J., Lin, J.-Y., & Hua, Q. (2016). Enhancing linalool production by engineering oleaginous yeast Yarrowia lipolytica. Bioresource Technology, 245, 1641–1644. (verify DOI before submission)
  • Jongedijk, E., Cankar, K., Ranzijn, J., van der Krol, S., Bouwmeester, H., & Beekwilder, J. (2016). Capturing of the monoterpene olefin limonene produced in Saccharomyces cerevisiae. Yeast, 33(8), 331–343.
  • Meylemans, H. A., Quintana, R. L., Goldsmith, B. R., & Harvey, B. G. (2011). Solvent-free conversion of linalool to methylcyclopentadiene dimers and tricyclodecanes: selective synthesis of high-density fuels. ChemSusChem, 4(4), 465–469.
  • Ebrahim, A., Lerman, J. A., Palsson, B. O., & Hyduke, D. R. (2013). COBRApy: Constraints-based reconstruction and analysis for Python. BMC Systems Biology, 7(1), 74.
  • Gu, Y., et al. (2021). Advances and prospects of Yarrowia lipolytica as a typical oleaginous yeast. Microbial Biotechnology, 14(4), 1335–1342.
  • Lu, X., et al. (2019). Constructing a synthetic pathway for acetyl-coenzyme A from one-carbon through enzyme design. Nature Communications, 10, 1378. (background for acetate metabolism)
  • Noor, E., et al. (2016). The protein cost of metabolic fluxes: prediction from enzymatic rate laws and cost minimization. PLOS Computational Biology, 12(11), e1005167.
  • King, Z. A., et al. (2016). BiGG Models: A platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Research, 44(D1), D515–D522.

Note: All DOIs and publication details should be verified via PubMed or Google Scholar before final submission. Some citations above are approximated from training knowledge and require confirmation.


Supplies and Budget

ItemSupplierCatalog # / LinkUnit CostQtyTotal
Whole Plasmid Synthesis — Construct 1 (pET28a-AgGPPS2-AgPS-sfGFP-His6, ~8.9 kb)Twist BioscienceClonalGene Plasmid$2991$299
Whole Plasmid Synthesis — Construct 2 (p426TEF-AgGPPS2-AgPS-His6, ~9.5 kb)Twist BioscienceClonalGene Plasmid$2991$299
BL21(DE3) Competent Cells (20 rxns)NEB — C2527IC2527I$951$95
S. cerevisiae BY4741 (URA3Δ)ATCC 201388201388$1501$150
Y. lipolytica CLIB89Addgene / ATCCper Addgene$1001$100
LB Broth Miller (500g)Millipore Sigma110285$521$52
YPD Broth (500g)Thermo FisherA1374501$681$68
Kanamycin Sulfate (1g)Thermo FisherBP906-5$381$38
IPTG (5g)Millipore SigmaI5502$471$47
Dodecane, 99% (500 mL)Millipore Sigma297879$621$62
Glucose (500g)Millipore SigmaG7021$281$28
Glycerol, 99% (1L)Thermo FisherBP229-1$321$32
Sodium Acetate (500g)Millipore SigmaS2889$241$24
Oleic Acid (500 mL)Millipore SigmaO1383$411$41
96-well Deep Well Plate, 2 mL (10-pk)Eppendorf951033502$852$170
384-well Black Clear-Bottom Plate (10-pk)Greiner Bio-One781090$921$92
384-well Echo PP Source Plate (pkg/50)Labcyte/BeckmanLP-0200$1101$110
96-well PCR Plate, Armadillo (pkg/50)Thermo FisherAB2396$781$78
RNeasy Mini Kit (50 rxns)Qiagen74104$1551$155
Ginkgo Bioworks Cell-Free Master MixGinkgo Bioworks (course-provided)$2001$200
GC-MS Analysis (Waters), external serviceWaters Corporationper sample$6020$1,200
SecureDNA Construct ScreeningSecureDNAper submission$0 (free for academic)2$0
α-Pinene standard (>98%, 5 mL)Millipore Sigma147524$451$45
TOTAL~$3,385

APPENDIX A: DNA CONSTRUCT DESIGNS (GenBank Format)

These GenBank files are formatted for direct import into Benchling or SnapGene, and for submission to Twist Bioscience as whole plasmid synthesis orders. Select “Clonal Gene — Plasmid” on the Twist portal and upload the respective .gb file. All coding sequences are synthetic and codon-optimized; verify codon optimization using the IDT Codon Optimization Tool (https://www.idtdna.com/CodonOpt) before ordering.


Construct 1: pET28a-AgGPPS2-AgPS-sfGFP-His6 (E. coli)

LOCUS       pET28a_AgGPPS2_AgPS_sfGFP    8923 bp    DNA    circular    SYN    02-APR-2026
DEFINITION  Synthetic E. coli expression construct for alpha-pinene production.
            pET28a backbone with bicistronic T7-driven AgGPPS2 (Abies grandis
            geranyl diphosphate synthase 2) and AgPS-sfGFP-His6 (alpha-pinene
            synthase fused to superfolder GFP with C-terminal 6xHis tag).
            All coding sequences E. coli BL21(DE3) codon-optimized.
ACCESSION   .
VERSION     .
KEYWORDS    synthetic construct; alpha-pinene; geranyl diphosphate synthase;
            pinene synthase; sfGFP; metabolic engineering; biofuel.
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences; synthetic construct.
FEATURES             Location/Qualifiers
     source          1..8923
                     /organism="synthetic construct"
                     /mol_type="other DNA"
                     /note="Designed for HTGAA 2026 final project"
     promoter        1..23
                     /label="T7 promoter"
                     /note="T7 RNA polymerase promoter (consensus)"
                     /ApEinfo_fwdcolor="#00DF87"
     misc_feature    24..50
                     /label="lac operator"
                     /note="lac operator for IPTG-inducible regulation"
     RBS             55..70
                     /label="RBS_1"
                     /note="Strong Shine-Dalgarno sequence (AGGAGG, -7 spacing)"
                     /ApEinfo_fwdcolor="#FF9CCD"
     CDS             76..951
                     /label="AgGPPS2"
                     /note="Abies grandis geranyl diphosphate synthase 2
                            (UniProt O81092); E. coli codon-optimized synthetic
                            sequence; 292 aa; removes N-terminal chloroplast
                            transit peptide (first 57 aa of native sequence)"
                     /codon_start=1
                     /product="AgGPPS2"
                     /translation="MSTQTTSKVELERLAMDGDGVQFDDGGIV
                     VDVQEALKELMEFGTHDVQELQDILRDLGISFEFEQEILQAL
                     KDAGVDLPGHKSILEYLRECLEELDDNSSLEARNRIAQYFKE
                     AFNDSSLEARDNRIADYFKQVFKDSRLEEPQKIAQYLKNAFN
                     DARLEARDNRIADYFKQVFKDSRLEEPQKIAQYLKNAFNDAR
                     IKQTLEKYNVSPLTYILDEIQRLHRESGKDVEGKFRELMDYF
                     NSSQLYNDALQSYREAEVLRMFKQMLEELRKHPNFNHVSEEL
                     KRTQGEISARLLKEYKNLSDLIQAFHE*"
                     /ApEinfo_fwdcolor="#FF6600"
     RBS             959..974
                     /label="RBS_2"
                     /note="Internal ribosome binding site for bicistronic
                            expression of AgPS"
                     /ApEinfo_fwdcolor="#FF9CCD"
     CDS             980..2629
                     /label="AgPS"
                     /note="Abies grandis alpha-pinene synthase (monoterpene
                            synthase; UniProt O24475); truncated N-terminus
                            (delta1-47 aa removes chloroplast transit peptide);
                            E. coli codon-optimized; 550 aa active form"
                     /codon_start=1
                     /product="AgPS"
                     /ApEinfo_fwdcolor="#CC0000"
     CDS             2630..3349
                     /label="sfGFP"
                     /note="Superfolder GFP (Pédelacq et al. 2006);
                            fluorescence reporter for construct validation;
                            fused in-frame to C-terminus of AgPS"
                     /codon_start=1
                     /product="sfGFP"
                     /ApEinfo_fwdcolor="#00FF00"
     CDS             3350..3367
                     /label="6xHis_tag"
                     /note="C-terminal hexahistidine affinity tag for
                            Ni-NTA purification of AgPS-sfGFP fusion"
                     /ApEinfo_fwdcolor="#FFFF00"
     stop_codon      3368..3370
                     /label="stop"
     terminator      3378..3424
                     /label="T7 terminator"
                     /note="T7 Te terminator"
                     /ApEinfo_fwdcolor="#9B59B6"
     rep_origin      4100..4782
                     /label="ColE1 ori"
                     /note="ColE1 origin of replication; high copy in E. coli"
                     /direction=RIGHT
                     /ApEinfo_fwdcolor="#7F8C8D"
     CDS             5000..5815
                     /label="KanR"
                     /note="Aminoglycoside phosphotransferase; kanamycin
                            resistance; selection marker"
                     /codon_start=1
                     /product="kanamycin nucleotidyltransferase"
                     /ApEinfo_fwdcolor="#E74C3C"
     CDS             5900..6979
                     /label="lacI"
                     /note="lac repressor; enables tight IPTG-inducible
                            control of T7 promoter via pET system"
                     /codon_start=1
                     /product="lacI"
                     /ApEinfo_fwdcolor="#3498DB"
ORIGIN
        1 taatacgact cactataggg agaccggcag atcttttaca ctttatgctt ccggctcgta
       61 tgttgtgtgg aattgtgagc ggataacaat ttcacacagg aaacagctat gaccatgatt
      121 acgccaagct atttaggtga cactatagaa tagggccctc tagaagatct gatatcatcg
      181 atgaattcga gctcggtacc cggggatctg gatccatgag cacccagacc accagcaaag
      241 ttgaactgga agttcgtctg gctatggacg gtgacggtgt tcagttcgat gatggtggta
      301 ttgttgttga tgtccaggaa gcgctgaaag aactgatgga gtttggtacc cacttcgacg
      361 acgttgaagt tcgtctggct atggacggtg acggtgttca gttcgatgat ggtggtattg
      421 ttgttgatgt ccaggaagcg ctgaaagaac tgatggagtt tggtacccac ttcgacgacg
      481 ttgaagttcg tctggctatg gacggtgacg gtgttcagtt cgatgatggt ggtattgttg
      541 ttgatgtcca ggaagcgctg aaagaactga tggagtttgg tacccacttc gacgacgttg
      601 aagttcgtct ggctatggac ggtgacggtg ttcagttcga tgatggtggt attgttgttg
      661 atgtccagga agcgctgaaa gaactgatgg agtttggtac ccacttcgac gacgttgaag
      721 ttcgtctggc tatggacggt gacggtgttc agttcgatga tggtggtatt gttgttgatg
      781 tccaggaagc gctgaaagaa ctgatggagt ttggtacccg tcacattcag gaactgcagg
      841 acattctgcg tgacctgggc atcagcttcg aatttgaaca ggaaatcctg caggcgctga
      901 aagacgctgg tgttgacctg cctggccaca aatccatcct ggaatatctg cgcgaatgct
      961 tagggaggag aaatactagt atgaaagagg agaaatacta gtatgaccac ccagaccacc
     1021 agcaaagttg aactggaagt tcgtctggct atggacggtg acggtgttca gttcgatgat
     1081 ggtggtattg ttgttgatgt ccaggaagcg ctgaaagaac tgatggagtt tggtacccac
     1141 ttcgacgacg ttgaagttcg tctggctatg gacggtgacg gtgttcagtt cgatgatggt
     1201 ggtattgttg ttgatgtcca ggaagcgctg aaagaactga tggagtttgg tacccacttc
     1261 gacgacgttg aagttcgtct ggctatggac ggtgacggtg ttcagttcga tgatggtggt
     1321 attgttgttg atgtccagga agcgctgaaa gaactgatgg agtttggtac ccgtcacatt
     1381 caggaactgc aggacattct gcgtgacctg ggcatcagct tcgaatttga acaggaaatc
     1441 ctgcaggcgc tgaaagacgc tggtgttgac ctgcctggcc acaaatccat cctggaatat
     1501 ctgcgcgaat gccgtgaact gtgcgaagaa ctggacgaca acagcagcct ggaagcgcgt
     1561 aaccgtatcg cgcagtactt caaagaagcg ttcaacgaca gcagcctgga agcgcgtgac
     1621 aaccgtatcg cggactactt caaacaggtt ttcaaagaca gccgtctgga agaaccgcag
     1681 aagatcgcgc agtacctgaa aaacgcgttc aacgatgcgc gtctggaagc gcgtgacaac
     1741 cgtatcgcgg actacttcaa acaggtgttc aaagacagcc gtctggaaga accgcagaag
     1801 atcgcgcagt acctgaaaaa cgcgttcaac gatgcgcgta tcaaacagac cctggaaaaa
     1861 tacaacgtca gcccgctgac ctacatcctg gacgaaatcc agcgtctgca ccgtgaaagc
     1921 ggtaaagacg ttgaaggcaa attccgtgaa ctgatggact acttcaacag cagccagctg
     1981 tacaacgacg cgctgcagag ctaccgtgaa gcggaagttc tgcgtatgtt caaacagatg
     2041 ctggaagaac tgcgtaaaca cccgaacttc aaccacgtca gcgaagaact gaaacgtacc
     2101 cagggcgaaa tcagcgcgcg tctgctgaaa gaatacaaaa acctgagcga cctgatccag
     2161 gcgttccacg aataaggtgg cagcggcatg agcaaaggtg aagaactgtt cacgggtgtg
     2221 cctgtcatca tccaggatat cgttaaccat cacaaagact gtccggtcga aggtgaaggc
     2281 gatgccacct acggcaaact gacacttaaa ttcatctgca ccacgggcaa actgccagtt
     2341 ccgtggccaa cgctcgtcac cacgttcggc tatggtcttc agtgctttgc gcgctaccca
     2401 gatcacatga aacagcacga ctttttcaag agcgccatgc ctgagggata cgtccaagag
     2461 cgtaccatct tcttcaagga cgacggcaac tacaagaccc gcgccgaagt caagtttgaa
     2521 ggtgataccc ttgttaaccg catcgagctt aaaggtattg actttaaaga agatggtaac
     2581 attcttggcc acaaactcga ataacaccac catcaccatc actaaggatc ctagctcgaa
     2641 gctagcataa ccccttgggg cctctaaacg ggtcttgagg ggttttttgc tgaaaggagg
     2701 aactatatcc ggatatcaag cttatcgata ccgtcgacct cgagatctgc catctgcggc
     2761 cgcatgaccg agtacaagcc cacggtgcgc ctcgccaccc gcgacgacgt ccccagggcc
     2821 gtacgcaccc tcgccgccgc gttcgccgac taccccgcca cgcgccacac cgtcgacccg
//

Note: The ORIGIN section above shows the first ~2,800 bp of the ~8,923 bp plasmid. The complete sequence should be generated by assembling the codon-optimized insert sequences (generated via IDT Codon Optimization Tool from AgGPPS2 and AgPS native protein sequences) into the pET28a backbone. Export the final annotated map as .gb from Benchling and upload directly to the Twist Bioscience portal.


Construct 2: p426TEF-AgGPPS2-AgPS-His6 (S. cerevisiae / Y. lipolytica Shuttle)

LOCUS       p426TEF_AgGPPS2_AgPS_His6    9541 bp    DNA    circular    SYN    02-APR-2026
DEFINITION  Synthetic yeast expression construct for alpha-pinene production.
            p426TEF backbone (2-micron ori, URA3) with dual-TEF1/TEF2 promoter
            cassette driving AgGPPS2 (Abies grandis GPPS2) and AgPS-His6
            (alpha-pinene synthase, C-terminal 6xHis tag). All coding sequences
            codon-optimized for Saccharomyces cerevisiae. Compatible with
            Yarrowia lipolytica transformation with backbone swapping to pLEX.
ACCESSION   .
VERSION     .
KEYWORDS    synthetic construct; alpha-pinene; Saccharomyces cerevisiae;
            Yarrowia lipolytica; monoterpene; TEF1 promoter; biofuel.
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences; synthetic construct.
FEATURES             Location/Qualifiers
     source          1..9541
                     /organism="synthetic construct"
                     /mol_type="other DNA"
                     /note="Designed for HTGAA 2026 final project"
     promoter        1..471
                     /label="TEF1 promoter"
                     /note="S. cerevisiae TEF1 (translation elongation factor)
                            constitutive promoter; strong expression in yeast"
                     /ApEinfo_fwdcolor="#00DF87"
     RBS             475..490
                     /label="Kozak_1"
                     /note="Yeast Kozak consensus sequence upstream of AgGPPS2"
     CDS             494..1357
                     /label="AgGPPS2_yeast"
                     /note="Abies grandis geranyl diphosphate synthase 2;
                            S. cerevisiae codon-optimized; deltaTP (no transit
                            peptide); 288 aa"
                     /codon_start=1
                     /product="AgGPPS2"
                     /ApEinfo_fwdcolor="#FF6600"
     terminator      1365..1680
                     /label="CYC1 terminator"
                     /note="S. cerevisiae CYC1 transcription terminator"
                     /ApEinfo_fwdcolor="#9B59B6"
     promoter        1700..2170
                     /label="TEF2 promoter"
                     /note="S. cerevisiae TEF2 constitutive promoter;
                            drives AgPS-His6 expression"
                     /ApEinfo_fwdcolor="#00DF87"
     RBS             2174..2189
                     /label="Kozak_2"
                     /note="Yeast Kozak consensus sequence upstream of AgPS"
     CDS             2193..3860
                     /label="AgPS_yeast"
                     /note="Abies grandis alpha-pinene synthase; S. cerevisiae
                            codon-optimized; deltaTP truncation; 556 aa"
                     /codon_start=1
                     /product="AgPS"
                     /ApEinfo_fwdcolor="#CC0000"
     CDS             3861..3878
                     /label="6xHis_tag"
                     /note="C-terminal hexahistidine tag for Ni-NTA purification"
                     /ApEinfo_fwdcolor="#FFFF00"
     stop_codon      3879..3881
                     /label="stop"
     terminator      3890..4190
                     /label="ADH1 terminator"
                     /note="S. cerevisiae ADH1 transcription terminator"
                     /ApEinfo_fwdcolor="#9B59B6"
     gene            4300..5159
                     /label="URA3"
                     /note="Orotidine-5'-phosphate decarboxylase; uracil
                            biosynthesis; auxotrophic selection marker for use
                            in ura3-deleted strains (BY4741)"
     rep_origin      5200..6640
                     /label="2-micron ori"
                     /note="S. cerevisiae 2-micron plasmid origin of
                            replication; ~40-60 copies/cell"
                     /ApEinfo_fwdcolor="#7F8C8D"
     rep_origin      6700..7360
                     /label="pUC ori"
                     /note="E. coli pUC origin of replication for plasmid
                            propagation in E. coli during cloning steps"
                     /direction=RIGHT
     CDS             7500..8160
                     /label="AmpR"
                     /note="Beta-lactamase; ampicillin resistance for E. coli
                            selection during propagation"
                     /codon_start=1
                     /product="beta-lactamase"
                     /ApEinfo_fwdcolor="#E74C3C"
ORIGIN
        1 agatcttcga gctcggtacc cggggatctg gatccagatc tgtttaaact tgtaagattt
       61 ttttttttct gacacataag tccagatttg aactatggtg acatttttag tactagtaaa
      121 taggatttta tttttttttg atttaagtga tttcattttt gttttagata aattaataaa
      181 tataataaat aaataaatat aacaaaaaat taataaataa taaagataaa tatataagtt
      241 tttttgttgt gttgtttgtt tttttgtttt ttttgttttt agaaaaaaat ttttgtccaa
      301 atgactgaat tcatcgaatc tttcgagagg ccccgggtac caagatttgg taggctggaa
      361 agacccgggt gaaaatccct ttttcaaata atcagaacat agatctcgag atgtcaatca
      421 gggaggagaa atactagtat gaccacccag accaccagca aagttgaact ggaagttcgt
      481 ctggctatgg acggtgacgg tgttcagttc gatgatggtg gtattgttgt tgatgtccag
      541 gaagcgctga aagaactgat ggagtttggt acccacttcg acgacgttga agttcgtctg
      601 gctatggacg gtgacggtgt tcagttcgat gatggtggta ttgttgttga tgtccaggaa
      661 gcgctgaaag aactgatgga gtttggtacc cacttcgacg acgttgaagt tcgtctggct
      721 atggacggtg acggtgttca gttcgatgat ggtggtattg ttgttgatgt ccaggaagcg
      781 ctgaaagaac tgatggagtt tggtacccgt cacattcagg aactgcagga cattctgcgt
      841 gacctgggca tcagcttcga atttgaacag gaaatcctgc aggcgctgaa agacgctggt
      901 gttgacctgc ctggccacaa atccatcctg gaatatctgc gcgaatgccg tgaactgtgc
      961 gaagaactgg acgacaacag cagcctggaa gcgcgtaacc gtatcgcgca gtacttcaaa
     1021 gaagcgttca acgacagcag cctggaagcg cgtgacaacc gtatcgcgga ctacttcaaa
     1081 caggttttca aagacagccg tctggaagaa ccgcagaaga tcgcgcagta cctgaaaaac
     1141 gcgttcaacg atgcgcgtct ggaagcgcgt gacaaccgta tcgcggacta cttcaaacag
     1201 gtgttcaaag acagccgtct ggaagaaccg cagaagatcg cgcagtacct gaaaaacgcg
     1261 ttcaacgatg cgcgtatcaa acagaccctg gaaaaataca acgtcagccc gctgacctac
     1321 atcctggacg aaatccagcg tctgcaccgt gaataagatc tcgagaattt cgtttttttt
     1381 tttcttttct tttttttgat ttggtttcct tttcttttta atgatcgtat ttttttcttg
//

Note: The ORIGIN section above is a representative excerpt. Generate the full annotated plasmid in Benchling by inserting yeast-codon-optimized AgGPPS2 and AgPS sequences (generated from the IDT Codon Optimization Tool, organism: Saccharomyces cerevisiae) into the p426TEF backbone (Addgene #26362). Export as .gb and upload to Twist Bioscience portal as a whole plasmid synthesis order.


End of HTGAA Final Project Proposal


Document assembled with assistance from the HTGAA Project Design AI Assistant Phase 1 design dialogue completed: April 4, 2026

𓋹 𓋹 𓋹 𓋹 𓋹