Individual Final Project

Enzyme engineering for (cyanobacterial) bioplastic production

Imagine if plastic was an environmental solution, rather than an environmental problem.

Carbon capture, utilization, and storage (CCUS) is an umbrella term for any sort of technology that pulls carbon out of the atmosphere and repurposes it into a useful product and/or moves it to long-term storage as a climate change mitigation strategy. One example is bioplastics made with photosynthesis. Plastics are polymers made up of primarily carbon, and when produced by photosynthetic organisms (such as cyanobacteria), that carbon can come directly out of the atmosphere. ![CCUS figure]

Some strains of cyanobacteria produce biopolymers called polyhydroxyalkanoates (PHAs) that can be used as drop-in replacements for conventional petroleum-derived plastics. The most common PHA is poly-3-hydroxybutyrate or PHB. ![PHA granules in cyano] ![PHB]

For my project, I wanted to use AI reasoning combined with traditional machine learning to engineer a more productive PHB synthase enzyme for cost-competitive plastic biomanufacturing.

Background

2024 was the hottest year on record, with the average global temperature over the 1.5C increase limit that environmentalists have been advocating for the last several decades (United Nations, 1.5C). Although many countries are pledging to decrease their greenhouse gas outputs, carbon will continue to be added to the atmosphere for the foreseeable future, even if at decreasing rates. Mitigation strategies, such as CCUS, will be an essential component to minimize, or even reverse course, on climate change.

In theory, cyanobacteria are a financially useful production chassis because they take advantage of ambient or waste feedstocks, such as sunlight, atmospheric carbon dioxide, non-potable water such as seawater or greywater (Schubert et al, 2024; Wlodarczyk et al, 2020). Unfortunately, the relatively low production and energy-intensive processing have prevented commercially-viable production from being realized. The visionary aim of my project is to develop a commercial-scale bioprocessing operation for a bioplastic PHB-production cyanobacteria strain, with the ultimate goal to provide a drop-in raw plastic for use in consumer goods and packaging.

Because some strains of cyanobacteria are natural PHA producers, they are a good potential production chassis for bioplastic. The ideal strain would grow well off atmospheric carbon, have high PHA production, and have other phenotypic traits that are beneficial for cheaper or easier bioproduction, such as salinity or pH tolerance for contamination prevention, nitrogen fixation for cheaper growth media, and fast settling for lower energy collection and dewatering. One possible good host strain is Cyanobacterium aponium sp. UTEX 3222 because it has planktonic growth, salinity tolerance, rapid settling, fast growth rate, and native PHA production (Schubert et al, 2024).

PHB synthase enzyme, PhaC

PHAs are produced through a biosynthetic pathway as an offshoot from sugar metabolism, believed to be for long-term energy storage. The key enzyme is the PHB synthase enzyme, PhaC (also called PHA synthase, PHB polymerase, etc.). There are four classes of PhaC enzyme, which preferentially utilize different monomers. The best studied PhaC is from Cupriavidus necator H16 (Neoh et al, 2022). PhaC has two domains: the catalytic C-domain that includes the catalytic triad His-Asp-Cys, and the N-domain. The N-domain has been demonstrated to be essential for enzymatic activity, although its exact role is still unknown - though it has been suggested to potentially play a role in substrate selectivity (Neo et al, 2022).

There has been a lot of research into increasing the substrate promiscuity of PhaC because variety in monomer composition results in different thermochemical properties of the resulting plastic (Antonio et al, 2000; Harada et al, 2021; Kane, 2021; Sivashankari et al, 2023; Timm et al, 1990; Tsuge et al, 2004; Ye et al, 2008). However, I hypothesized that increasing substrate specificity towards the 3HB monomer that results in PHB would increase the amount of PHB produced. I hoped that this also might have the secondary effects of normalizing the molecular weights of the polymer molecules produced and decreasing downstream purification costs because there would be fewer minor products to separate out.

ML/AI in protein science

My initial thought was to use machine learning to design new protein mutants; however, my early literature searches made me seriously consider the practical constraints I was operating under: namely, time (before the final project was due) and processing capability (of my personal laptop and home internet). Therefore, I expanded my original intention to include free-for-public-use large language model (LLM) agents, such as ChatGPT and Claude.ai.

Hypothesis

I hypothesized that PhaC mutants engineered for increased substrate specificity by AI and ML would have higher PHB productivity.

Project aims

Aim 1: Experimental

Generate PHB synthase mutants using LLM reasoning and traditional machine learning (ML).

Literature search
Train a free-use LLM for PHB synthase mutations.
Write prompts to minimize hallucinations by requesting suggestions be based on the data provided, describe what each suggestion is based on (including literature references to verify), and specifically identifying more speculative suggestions.

Aim 1.5

Multi-sequence alignment (MSA) of diverse bacterial PhaC of various classes.
Use Python scikit-learn library to train ML classifier for PHB synthase substrate preference based on the MSA.
Feed ML output back into LLM to generate additional mutations and hypothesized mechanisms.

Aim 2: Developmental

Test the highest-scored mutants for PHB production.

Design cell-free expression vector.
Make a library of phaC mutant sequences.
Test variants for PHB production.
DBTL cycle(s) with LLM, ML, and in vitro testing; iterate as resources allow.

Aim 2.5

Test best-performing mutants in cyanobacterial chassis.

Literature search to identify 1-3 cyanobacterial strains to use as production chassis.
Re-design expression cassette and selected phaC mutants for cyanobacterial expression.
Test for PHB production in cyanobacterial strains.

Aim 3: Visionary

Develop a scalable bioprocessing operation that can ultimately be used for commercial plastics production.

Develop a basis for each operation step at the bench scale.
Design scale-up for a pilot plant.

Methods

Enzyme selection

I’d originally planned to use the PhaC from my intended host strain C. aponium UTEX 3222, but I was unable to find a sequence for it. I searched for the UTEX 3222 PhaC sequence in the supplemental files of the paper that said it identified PHA biosynthetic genes, but it wasn’t included (Schubert et al, 2024). It also wasn’t listed on the full annotated genome assembly (ASM3863077v1) published to the NCBI. I BLASTed the C. necator PhaC against the UTEX 3222 genome assembly, as well as a few different PhaC’s from other cyanobacterial species, but none of these produced any results.

Therefore, I decided to move forward using C. necator PhaC, hereafter called PhaC_Cn.

PhaC_Cn mutagenesis data compilation

I’d originally planned to do a large MSA with diverse bacterial PhaC from all four classes. I did the following search on UniProt: (taxonomy_id:2) AND (protein_name:“Poly(3-hydroxyalkanoate) polymerase subunit PhaC”). This yielded 1,445 results. Using the internal UniProt tool, I tried to do an MSA (max 50 sequences), but this crashed my browser. I then tried to download a subset of these sequences for an MSA through Benchling or with Python code written with the help of Claude and a Python for Dummies book, but downloading hundreds of sequences froze my computer, and I gave up on the possibility of doing this on my personal laptop and home internet network. Below is the Identity Matrix of a 7 sequence MSA from UniProt that I was able to visualize, but 7 sequences is not enough to generate features or suggestions from, especially when they are so different.

So I decided to use PhaC_Cn mutagenesis data instead. I used Google Search to find peer-reviewed articles on machine learning and LLMs for protein design and PhaC_Cn, selecting the most relevant ones for further reading based ont the abstracts. From the papers I read, I additionally checked out their references for additional papers that could be helpful. Because I could only find limited mutagenesis data, I also included two other bacterial PhaC because they were listed in papers comparing the structures and sequences: PhaC from Aeromonas caviae (Harada et al, 2021) and PhaC from Chromobacterium sp. USM2 (Chuah et al, 2013). I imported these two PhaC amino acid sequences into Benchling, and aligned them with PhaC_Cn.

I compiled the identified sequences into a google spreadsheet. Mutagenesis compilation

LLM context document

I started my conversation with Claude.ai by asking it about enzyme engineering with LLMs generally, to see what it knew.

i’m interested in enzyme engineering using LLMs. what can you tell me about that?

This produced a list of protein language models and tools (ESM-2, AlphaFold, ProteinMPNN, RFdiffusion, Progen2, and EVOLVEpro); practical workflows; active research topics; and limitations.

From there, I decided to ask what it thought of my hypothesis, hoping that would lead to a discussion of what the LLM needed to achieve this goal.

i’m specifically interested in mutations for substrate specificity in bacterial PhaC (polyhydroxyalkanoate synthase) enzymes. i have this idea that with the aligned protein sequences and some information on substrate preference for each one, an LLM might be able to output suggested sequences for a given substrate. Does an LLM seem like the correct tool for this?

Claude suggested that traditional ML would be a better option, but that combining LLM reasoning with ML could be a worthwhile approach. It included a warning that the biggest limitation was likely data availability. It gave four discrete options and a hybrid approach.

Option 1: Fine-tuned pLM (e.g., ESM-2) with substrate labels
Option 2: Supervised ML on alignment features
Option 3: EVE-style or EVOLVEpro unsupervised variant scoring
Option 4: Use an LLM as a reasoning layer over structured data
A hybrid pipeline:
1. Curate your MSA + substrate labels carefully (scl/mcl/mixed, specific monomers if known)
2. Identify key specificity-determining residues using mutual information or DCA (direct coupling analysis) between alignment columns and substrate labels — this gives you a reduced feature space
3. Use ESM-2 embeddings as features for a supervised model predicting substrate class
4. Score candidate mutations zero-shot using ESM-2 log-likelihood to filter for “natural-looking” sequences
5. Use structural context (the PhaC crystal structures / AlphaFold models) to sanity-check that suggested mutations are geometrically plausible in the binding pocket The hybrid pipeline was had some similarities to the Protein Design II (week 5) homework, which I thought was a good sign.

I decided to move forward with options 2 and 4 since those were the approaches I’d had in mind earlier.

i’m most interested in Option 2 and Option 4. Could you please give me a very specific step-by-step instructions for both?

This prompt was a goldmine. It resulted in actually quite a lot of useful Python code for the ML approach, as well as the description of a context document that could be used for continuity between LLM reasoning sessions. I figured that between the context document and carefully worded prompts, I could also use it to minimize the chance of undetected hallucinations. Claude also suggested a pipeline for combining the two, from which I ultimately built my approach (though I didn’t exactly copy it because I thought my order worked better).

Literature (Option 4) → curated mutation DB ↓ MSA + MI analysis (Option 2) → top specificity positions ↓ LLM reasoning (Option 4) → mechanistic hypotheses about those positions ↓ ML scoring (Option 2) → rank candidate mutants quantitatively ↓ LLM (Option 4) → design wet-lab validation experiment ↓ Experimental results → feed back into both pipelines

I asked Claude for what specifically it wanted out of the context document so I could build it appropriately.

Please give me more information about the context document you would need if i wanted to go with LLM reasoning

This resulted in a list of sections, and it also offered to generate a template for me to start from. I asked it to do this.

please create a template markdown file for my context document

original context document template from Claude While this template did contain some information on PhaC, I did not leave anything in that I did not have a reference to verify it because I wasn’t confident that it wasn’t hallucinations - and indeed, it did have some information that I suspect was hallucinations (although it may have simply been from a source I hadn’t personally read). However, going through the document, it relied heavily on an MSA that I simply wasn’t able to do, so I requested that Claude update the document to reflect the mutagenesis dataset.

i’m going to start with a dataset only containing PhaC from C. necator and known mutagenesis studies, not diverse PhaC from many different bacteria. Please update the context document template to reflect this, and let me know any particular warnings or considerations for this approach.

This resulted in the following template. updated context document template from Claude

I filled in this context document with information that I could find in my references, and the mutagenesis data I had compiled. When I reuploaded it to Claude, it identified a few typos and questioned one of my references. I made the edits, but clarified that my reference said what I had asserted and Claude should continue to use it within my document and reasoning sessions, despite external data that Claude may have thought it had that disagreed with it. This may have been a hallucination, but I think Claude was actually just generalizing from other PhaC mutagenesis papers (not from C. necator).

Final context document for my first reasoning session

LLM reasoning session

I started with the warning from my context document.

Before we start, please keep in mind: My dataset consists only of C. necator PhaC1 (wild-type) and published point mutants of this single enzyme. I do not have a multi-species alignment. All positional reasoning should be grounded in (a) the experimental mutation database in Section 4, and (b) structural analysis of PDB 5T6O / AlphaFold model P23608. Do not infer specificity determinants from phylogenetic patterns — that data is not available. Now let’s do the first reasoning session.

Claude asked for more direction, so I specificied that I wanted mechanistic reasoning.

Let’s start with mechanistic hypothesis building please.

Results

Mutagenesis data compilation

PhaC_Cn, PhaC_Ac, PhaC_Cs alignment

Benchling for PhaC_Cn with mutagenesis positions annotated

LLM Reasoning

Claude’s output for my first reasoning session started with a summary of information that I had included in my context document. While not particularly useful to me, it did confirm for me that it was drawing primarily from my context document and not as much external sources.

I refined my request with my next prompts for explicit suggestions:

That looks good - can you suggest a few single mutations within the N-terminus that would get us some data on how variation there affects substrate specificity? Yes, and let’s include potential combinations if possible - combinations can exclude the N-terminus positions at this point since we lack data there.

This resulted in a suggestion of a series of 11 variants for a first wet lab experimental run: the base wild-type as a control, 7 single mutations, and 3 double mutations. It was fairly conservative in its recommendations, but I suspect that’s due to the limited data.

Finally, I asked Claude to update my context document with the log of our first reasoning session.

Yes, please log both this table and a summary of our reasoning session in the session log section.

Resulting in the final context document.

Next steps

For starters, my next steps are to continue with the ML in Aim 1.5. I hope that by using the public wifi at a library or a college campus, I’ll be able to download the sequences for the MSA. Then I can hopefully let my computer run the MSA over a long period of time. The detailed steps for aims 1.5 and onward can be found in my final report below.

Additionally, the feedback from TAs and fellow students after my presentation was to compare outputs from multiple LLMs, using my same context document and prompts. So I’d like to do that as well.

Documents

final report

final presentation slides

lab notebook

mutagenesis compilation

context document

References

Antonio, RV; Steinbuchel, A; Rehm, BHA. Analysis of in vivo substrate specificity of the PHA synthase from Ralstonia eutropha: formation of novel copolyesters in recombinant Escherichia coli. FEMS Microbiology Letters 2000, 182(1): 111-117. https://doi.org/10.1111/j.1574-6968.2000.tb08883.x
Chek, MF; Hiroe, A; Hakoshima, T; et al. PHA synthase (PhaC): interpreting the functions of bioplastic-producing enzyme from a structural perspective. Applied Microbiology and Biotechnology 2018, 103: 1131-1141. https://doi.org/10.1007/s00253-018-9538-8
Chuah, J-A; Tomizawa, S; Yamada, M; et al. Characterization of site-specific mutations in a short-chain-length/medium-chain-length polyhydroxyalkanoate synthase: In vivo and in vitro studies of enzymatic activity and substrate specificity. Applied and Environmental Microbiology 2013, 79. https://doi.org/10.1128/AEM.00564-13
Dong, H; Yang, X; Shi, J; et al. Exploring the feasibility of cell-free synthesis as a platform for polyhydroxyalkanoate (PHA) production: Opportunities and challenges. Polymers 2023, 15(10). https://doi.org/10.3390/polym15102333
Harada, K; Kobayashi, S; Oshima, K; et al. Engineering of Aeromonas caviae polyhydroxyalkanoate synthase through site-directed mutagenesis for enhanced polymerization of the 3-hydroxyhexanoate unit. Frontiers of Bioengineering and Biotechnology 2021, 9. https://doi.org/10.3389/fbioe.2021.627082
Jossek, R; Steinbuchel, A. In vitro synthesis of poly(3-hydroxybutyric acid) by using an enzymatic coenzyme A recycling system. FEMS Microbiology Letters 1998, 168: 319-324. https://doi.org/10.1111/j.1574-6968.1998.tb13290.x
Kane, A. Toward engineering the substrate specificity of a PHA synthase (PhaC). Victoria University of Wellington, Masters thesis, 2021. https://doi.org/10.26686/wgtn.17152079
Neoh, SZ; Check, MF; Tan, HT; et al. Polyhydroxyalkanoate synthase (PhaC): The key enzyme for biopolyester synthesis. Current Research in Biotechnology 2022, 4: 87-101. https://doi.org/10.1016/j.crbiot.2022.01.002
Satoh, Y; Tajima, K; Tannai, H; et al. Enzyme-catalyzed poly(3-hydroxybutyrate) synthesis from acetate with CoA recycling and NADPH regeneration in Vitro. Journal of Bioscience and Bioengineering 2002, 95(4): 335-341. https://doi.org/10.1016/S1389-1723(03)80064-6
Schubert, MG; Tang, T-C; Goodchild-Michelman, IM; et al. Cyanobacteria newly isolated from marine volcanic seeps display rapid sinking and robust, high-density growth. Applied Environmental Microbiology 2024, 90(11). https://doi.org/10.1128/aem.00841-24
Sivashankari, RM; Mierzati, M; Miyahara, Y; et al. Exploring Class I polyhydroxyalkanoate synthases with broad substrate specificity for polymerization of structurally diverse monomer units. Frontiers in Bioengineering and Biotechnology 2023, 11. https://doi.org/10.3389/fbioe.2023.1114946
Sudesh, K; Taguchi, K; Doi, Y. Effect of increased PHA synthase activity on polyhydroxyalkanoates biosynthesis in Synechocystis sp. PCC 6803. International Journal of Macromolecules 2002, 30: 97-104. https://doi.org/10.1016/S0141-8130(02)00010-7
Taguchi, S; Nakamura, H; Hiraishi, T; et al. In vitro evolution of a polyhydroxybutyrate synthase by intragenic suppression-type mutagenesis. Journal of Biochemistry 2002, 131(6): 801-806. https://doi.org/10.1093/oxfordjournals.jbchem.a003168
Timm, A; Byrom, D; Steinbuchel, A. Formation of blends of various poly(3-hydroxyalkanoic acids) by a recombinant strain of Pseudomonas oleovorans. Applied Microbiology and Biotechnology 1990, 33: 296-301. https://doi.org/10.1007/BF00164525
Tsuge, T; Saito, Y; Narike, M; et al. Mutation effects of a conserved alanine (Ala510) in Type I polyhydroxyalkanoate synthase from Ralstonia eutropha on polyester biosynthesis. Macromolecular Bioscience 2004, 4(10): 963-970. https://doi.org/10.1002/mabi.200400075
United Nations. 1.5C: What it means and why it matters. UN Climate Action. Accessed 2026.05.25. https://www.un.org/en/climatechange/science/climate-issues/degrees-matter
Valentini, G; Malchiodi, D; Gliozzo, J; et al. The promises of large language models for protein design and modeling. Frontiers in Bioinformatics 2023, 3: 1304099. https://doi.org/10.3389/fbinf.2023.1304099
Wittenborn, EC; Jost, M; Wei, Y; et al. Structure of the catalytic domain of the class I polyhydroxybutyrate synthase from Cupriavidus necator. Journal of Biological Chemistry 2016, 291(48): 25264-25277. https://doi.org/10.1074/jbc.M116.756833
Wlodarczyk, A; Selao, TT; Norling, B; et al. Newly discovered Synechococcus sp. PCC 11901 is a robust cyanobacterial strain for high biomass production. Nature Communications Biology 2020, 3: 215. https://doi.org/10.1038/s42003-020-0910-8
Ye, Z; Song, G; Chen, G; et al. Location of functional region at N-terminus of polyhydroxyalkanoate (PHA) synthase by N-terminal mutation and its effects on PHA synthesis. Biochemical Engineering Journal 2008, 41(1): 67-73. https://doi.org/10.1016/j.bej.2008.03.006