PhaC Enzyme Engineering — LLM Context Document

Version: v2.0
Date: [DATE]
Engineer: [YOUR NAME]
Project goal: [ONE SENTENCE SUMMARY, e.g. “Engineer C. necator PhaC1 to incorporate 3HHx at >15 mol%”]

Dataset scope note: This document is built around a single reference enzyme (C. necator PhaC1) and published mutagenesis studies on that enzyme and its close variants. It does NOT use diverse multi-species sequence alignments. See Section 3 for implications and compensating strategies.

1. Enzyme Family Background

1.1 Classification

Class	Subunit structure	Size	Native substrate preference	Example organism
I	Single subunit	~65 kDa	scl (C3–C5): 3HB, 3HV, 3HP	Cupriavidus necator H16
II	Single subunit	~60 kDa	mcl (C6–C14): 3HHx, 3HO, 3HD	Pseudomonas aeruginosa
III	Heterodimer (PhaC + PhaE)	~40+40 kDa	scl	Allochromatium vinosum
IV	Heterodimer (PhaC + PhaR)	~40+20 kDa	scl	Bacillus megaterium

Class I and II share ~50% sequence identity; Class III/IV are more distantly related
This project focuses exclusively on Class I, using Cn PhaC1 as the sole reference

1.2 Reaction chemistry

Catalyzes polymerization of (R)-3-hydroxyacyl-CoA thioesters into PHA
Ping-pong (double displacement) mechanism:
1. Acylation: acyl group transferred to catalytic Cys, CoA released
2. Transacylation: acyl group transferred to growing polymer chain
Lipase-like α/β hydrolase fold
Catalytic triad: Cys – His – Asp
- C. necator PhaC1 (Cn) reference numbering: C319, D480, H508
- All residue positions in this document use Cn PhaC1 numbering unless noted

1.3 Substrate scope terminology

Term	Chain length	Key monomers	Notes
scl	C3–C5	3HP, 3HB, 3HV	Native Cn PhaC1 preference
mcl	C6–C14	3HHx, 3HO, 3HD, 3HDD	Engineering target
Broad/mixed	C3–C14	scl + mcl	Ideal outcome — rare
Specialty	varies	3H4MV, 3H2MB, aromatic	Out of scope for this project

1.4 Why substrate specificity is structurally interesting

Substrate-binding tunnel geometry determines acyl chain length tolerance
Residues within ~5–10 Å of the catalytic Cys are the most likely primary selectivity determinants
mcl selectivity often results from removal of steric clash (smaller residues), not addition of new contacts — counterintuitive but well-supported
Electrostatic environment affects CoA-thioester positioning
Dimerization interface indirectly influences active site geometry

2. Structural Information

Note: With a single-enzyme dataset, structural information becomes more important, not less. It is the primary source of positional reasoning in the absence of multi-species alignment signal. Invest time in this section.

2.1 Available experimental structures for Cn PhaC1

PDB ID	Details	Resolution	Notes
5T6O	C. necator PhaC1, Class I	[X] Å	Primary reference — use this
[ID]	[Any other Cn PhaC1 structures]	[Res]

2.2 AlphaFold model for Cn PhaC1

UniProt accession: P23608 (Cn PhaC1)
Overall pLDDT: [score]
Confidence notes: [e.g. high confidence in core domain, low in N-terminal region residues 1–30 and surface loops]
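Use AF model for: loop conformations, surface regions, and any residues absent from the crystal structure (5T6O was solved for the catalytic domain only, so confirm that N-terminal positions such as 149 and 171 are actually resolved before measuring distances from it)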
Prefer crystal structure (5T6O) for: active site geometry, tunnel dimensions

2.3 Key structural regions (Cn PhaC1 numbering)

Region	Residues	Function	Notes
N-terminal domain	1–170	Regulatory, dimerization	Less conserved, lower structure confidence
Core catalytic domain	171–400	Contains Cys319	High confidence, primary engineering target
C-terminal domain	401–589	Contains Asp480, His508	High confidence
Substrate-binding tunnel	[list residues]	Selectivity determinant	Fill from structural analysis
Dimer interface	[list residues]	Stability	Avoid mutations here

2.4 Substrate-binding tunnel residues

(Fill this table carefully — it is the core of your structural reasoning)

Position	WT residue	Distance to C319 (Å)	Role in tunnel	Notes
149	Ala	[X]	Entrance region	Key specificity determinant
171	[AA]	[X]	[role]
325	Ser	[X]	Tunnel lining
392	[AA]	[X]	[role]
[pos]	[AA]	[X]	[role]

How to fill this table: Open 5T6O in PyMOL or ChimeraX. In PyMOL, run: select tunnel_res, byres (all within 10 of (resi 319 and name SG)). This selects every residue with at least one atom within 10 Å of the C319 side-chain sulfur. List those residues here with their distances. This is worth spending 1–2 hours on; it will substantially improve LLM reasoning quality.
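
If you prefer a scriptable route, the same neighbor search can be sketched in plain Python. This is a minimal sketch, not a replacement for PyMOL or ChimeraX: the function names (parse_atoms, residues_near) are hypothetical, it reads only ATOM records from a PDB-format file, does not handle alternate conformations, assumes chain A, and measures from the C319 SG atom to the closest atom of each neighboring residue.

```python
import math

def parse_atoms(pdb_text, chain=None):
    """Yield (resseq, resname, atom_name, (x, y, z)) from fixed-column ATOM records."""
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if chain is not None and line[21] != chain:
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        yield int(line[22:26]), line[17:20].strip(), line[12:16].strip(), xyz

def residues_near(pdb_text, center_resi=319, center_atom="SG", cutoff=10.0, chain="A"):
    """Return [(resseq, resname, min_distance)] sorted by distance to the center atom."""
    atoms = list(parse_atoms(pdb_text, chain))
    centers = [xyz for resi, _, name, xyz in atoms
               if resi == center_resi and name == center_atom]
    if not centers:
        raise ValueError("center atom not found; check chain ID and numbering")
    center = centers[0]
    nearest = {}
    for resi, resn, _, xyz in atoms:
        if resi == center_resi:
            continue  # skip the catalytic residue itself
        d = math.dist(center, xyz)
        if d <= cutoff:
            key = (resi, resn)
            nearest[key] = min(d, nearest.get(key, d))
    rows = [(resi, resn, round(d, 2)) for (resi, resn), d in nearest.items()]
    return sorted(rows, key=lambda row: row[2])
```

Typical use would be residues_near(open("5t6o.pdb").read()), assuming you saved the PDB-format file under that (hypothetical) local name. Check the chain ID in the file first, and spot-check a few distances against your viewer before copying them into the table.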

2.5 Tunnel geometry notes

Estimated tunnel constriction in WT Cn PhaC1: ~[X] Å (from structural analysis)
Residues that form the constriction point: [list]
Estimated minimum cavity volume for 3HHx-CoA accommodation: [X ų if known]
[Add MD simulation or docking results here as they become available]

3. Dataset Scope, Limitations, and Compensating Strategies

This section is critical. Read before every LLM session.

3.1 What your dataset contains

Reference enzyme: C. necator H16 PhaC1 (wild-type)
Variants: Published point mutants, double mutants, and combinatorial variants of Cn PhaC1 from the mutagenesis literature
Labels: Substrate incorporation data (mol% monomer) from those studies
What it does NOT contain: Homologous PhaC sequences from other species, Class II sequences, or unlabeled natural variants

3.2 Implications and honest limitations

Issue	Explanation	Impact
No evolutionary signal	Without a multi-species alignment, you cannot use co-evolutionary analysis (MI, DCA) to identify specificity-determining positions	Cannot compute MI scores; the co-evolution analysis section of the original multi-species template does not apply here
Narrow sequence space	All data points are close variants of one sequence (~1–5 mutations from WT)	Model cannot extrapolate to distant sequence space; suggestions far from WT are unreliable
Sparse coverage	Published mutagenesis studies cover only a small fraction of all possible positions	Many positions have no experimental data; reasoning about them is purely structural/hypothetical
Publication bias	Literature overwhelmingly reports positive results (mutations that did something interesting)	Negative results (mutations with no effect) are underrepresented; hard to learn what doesn’t matter
Lab-to-lab variability	Different studies use different assay conditions, hosts, carbon sources	Quantitative comparisons across studies are unreliable; treat mol% values as approximate
Limited combinatorial data	Few studies systematically explore epistatic interactions	Combining individually beneficial mutations may not be additive

3.3 What this dataset IS good for

The LLM can reason very effectively about:
- Mechanistic hypotheses — why does mutation X change specificity, based on structure and chemistry?
- Interpreting your experimental results — what does an unexpected outcome tell you about the mechanism?
- Experimental design — which mutations to test next given what is known?
- Identifying gaps — which positions have never been mutated but are structurally important?
- Literature synthesis — connecting observations across papers into a coherent mechanistic model

3.4 Compensating strategies

To partially offset the lack of multi-species alignment data:

Lean heavily on structural reasoning (Section 2) — fill in the tunnel residue table as completely as possible; this replaces alignment signal as your primary source of positional hypotheses
Include Class II reference data explicitly: even though Class II enzymes are outside your dataset, you can add a “comparative note” section describing which Cn PhaC1 positions correspond to Class II residues (from manual alignment of just Cn PhaC1 vs. Pa PhaC1/2). This gives the LLM evolutionary context without requiring a full MSA.
Weight negative results equally to positive — if you can find papers reporting mutations that failed to shift specificity, record them in Section 4.3. They are highly informative and rare in the literature.
Be explicit about data gaps in prompts — tell the LLM “position X has never been mutated in the literature” so it flags its reasoning as structural/hypothetical rather than evidence-based.
Use the LLM to propose positions to structurally analyze — ask it which tunnel residues it would prioritize examining in the crystal structure, then verify those manually before including them in subsequent prompts.

3.5 Class II reference comparison

(Manual alignment of just Cn PhaC1 vs. one or two Class II enzymes — fills in some evolutionary context without a full MSA)

Cn PhaC1 residue	Cn AA	Pa PhaC1 equivalent residue	Pa AA	Significance
149	Ala	[pos]	[AA]	Tunnel entrance
325	Ser	[pos]	[AA]	Tunnel lining
[pos]	[AA]	[pos]	[AA]

How to fill this: Use a pairwise alignment tool (e.g. EMBOSS Needle at https://www.ebi.ac.uk/Tools/psa/emboss_needle/) with Cn PhaC1 (UniProt P23608) and Pa PhaC1 (UniProt Q9HWK2; confirm this accession on UniProt before aligning). This takes ~10 minutes and is worth doing.
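
Once you have the gapped alignment from the pairwise tool, the position lookup for the table above can be automated. The sketch below assumes you paste the two aligned rows as equal-length strings with "-" for gaps; map_positions is a hypothetical helper, not part of EMBOSS.

```python
def map_positions(aligned_ref, aligned_other, ref_start=1, other_start=1):
    """Map reference positions to (ref_aa, other_position, other_aa).

    aligned_ref / aligned_other: the two gapped rows of a pairwise alignment.
    A reference residue aligned to a gap maps to (ref_aa, None, None).
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned rows must have the same length")
    mapping = {}
    ref_pos = ref_start - 1
    other_pos = other_start - 1
    for ref_aa, other_aa in zip(aligned_ref, aligned_other):
        if other_aa != "-":
            other_pos += 1
        if ref_aa == "-":
            continue  # insertion in the other sequence; no reference position
        ref_pos += 1
        if other_aa == "-":
            mapping[ref_pos] = (ref_aa, None, None)
        else:
            mapping[ref_pos] = (ref_aa, other_pos, other_aa)
    return mapping
```

For example, mapping = map_positions(cn_row, pa_row) followed by mapping[149] would give the Pa residue aligned to Cn position 149, where cn_row and pa_row are the (hypothetically named) strings you copied from the alignment output. Positions in poorly aligned stretches deserve a manual look before they go into the table.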

4. Experimental Mutation Database

(This is the heart of your dataset — populate as completely as possible)

4.1 Key literature to mine for Cn PhaC1 mutations

Tsuge et al. (2003) Macromolecules — F420 region, systematic Class I
Amara et al. (2002) — systematic Class I mutagenesis panel
Rehm et al. — early mechanistic mutagenesis
Nomura et al. — broad-specificity engineering attempts
Insomphun et al. — 3HHx incorporation focus
Hiroe et al. — combinatorial mutagenesis
[Add others as you find them — search PubMed: “PhaC mutagenesis” OR “polyhydroxyalkanoate synthase substrate specificity”]

Mining tip: For each paper, extract: (1) every mutation tested, including ones with no effect — these are just as valuable, (2) exact assay conditions, (3) quantitative data where reported. Even a table footnote saying “A300G showed no change in specificity” belongs here.

4.2 Gain-of-function mutations (toward mcl / broader specificity)

Mutation	Effect	3HB mol%	3HV mol%	3HHx mol%	Activity vs WT	Assay conditions	Reference
F420S	Gains 3HHx	[X]	[X]	~8%	[X]%	[conditions]	Tsuge 2003
A510S	Increased mcl	[X]	[X]	[X]	[X]%	[conditions]	Amara 2002
[mut]	[effect]						[ref]

4.3 Neutral mutations (no significant effect on specificity)

(Underrepresented in literature but critically important — record every instance you can find)

Mutation	Region	Why tested	Outcome	Reference
[mut]	[region]	[rationale in original paper]	No change in specificity	[ref]

4.4 Deleterious mutations (loss of activity or expression)

Mutation	Effect	Suspected reason	Reference
[mut]	[e.g. insoluble, inactive]	[e.g. disrupts fold]	[ref]

4.5 Combinatorial / double mutants

Mutations	Effect vs. singles	3HHx mol%	Epistasis observed	Reference
F420S + A510S	[effect]	[X]%	Additive / synergistic / antagonistic	[ref]
[muts]	[effect]			[ref]
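
One simple convention for filling the “Epistasis observed” column is an additive null model on the mol% scale: the expected double-mutant value is single A + single B - WT, and the deviation from that expectation is classed against a tolerance. This is an assumption chosen for illustration (other scales are equally defensible), the function name is hypothetical, and the tolerance should reflect your own assay noise.

```python
def epistasis(wt, single_a, single_b, double, tol=1.0):
    """Classify a double mutant against an additive expectation.

    All inputs are 3HHx mol% values measured under the same conditions.
    Returns (expected, deviation, label). Labels assume higher is better.
    """
    expected = single_a + single_b - wt
    deviation = double - expected
    if abs(deviation) <= tol:
        label = "additive"
    elif deviation > 0:
        label = "synergistic"
    else:
        label = "antagonistic"
    return expected, deviation, label
```

Only compare values from the same study and assay conditions; mixing mol% values across papers would make the deviation meaningless.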

4.6 Thermostability mutations

(Relevant when stacking specificity mutations)

Mutation	ΔTm	Effect on activity	Effect on specificity	Reference
[mut]	[+/- X°C]	[effect]	[effect]	[ref]

4.7 Positions that have NOT been mutated in literature

(Fill as you read — these are candidate positions for novel exploration)

Position	WT AA	Distance to C319	Structural role	Why interesting
[pos]	[AA]	[X] Å	[role]	[rationale]

4.8 Data quality notes

Quantitative mol% values are sensitive to: carbon source ratios, growth phase, expression level, host strain — treat cross-study comparisons as qualitative only
In vitro CoA-release assays (DTNB) give intrinsic kinetic data but don’t fully reflect in vivo selectivity under substrate competition
Some older studies used racemic substrates — stereospecificity may confound apparent chain-length specificity
[Add specific notes about inconsistencies you notice across papers]

5. Your Starting Enzyme (Wild-Type Cn PhaC1)

5.1 Identity

Organism: Cupriavidus necator H16
UniProt accession: P23608
PhaC class: I
Gene: phaC1 (in pha operon: phaCAB)
Full sequence length: 589 aa

5.2 Known properties of WT Cn PhaC1

Native substrate preference: scl — 3HB (major), 3HV (minor), negligible 3HHx
Specific activity: [X nmol/min/mg — fill from literature or your own data]
Thermostability: [Tm or optimal temperature]
Expression in E. coli: [your experience — yield, solubility]
Dimerization: active as dimer; monomer is inactive
Known issues: [e.g. requires careful lysis conditions, prone to aggregation at high concentration]

5.3 Full WT sequence

[PASTE FULL 589 aa SEQUENCE HERE; available from UniProt P23608]

5.4 Substrate-binding pocket region

(Residues 300–340, centered on C319 — paste into prompts as needed)

Residues 300–340:
[PASTE SEQUENCE]
Position:  300                319                340
                               ^C319 (catalytic)
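
The window and its marker line can be generated from the full sequence rather than counted by hand. A minimal sketch, assuming 1-based Cn PhaC1 numbering and a hypothetical helper name (pocket_window):

```python
def pocket_window(seq, start=300, end=340, mark=319):
    """Return (window, caret_line) for residues start..end (1-based, inclusive).

    caret_line has a "^" under the residue at position `mark`.
    """
    if not (1 <= start <= mark <= end <= len(seq)):
        raise ValueError("need 1 <= start <= mark <= end <= len(seq)")
    window = seq[start - 1:end]
    caret_line = " " * (mark - start) + "^"
    return window, caret_line
```

Printing the two returned strings one above the other reproduces the layout above. As a sanity check, the character above the caret should be C when the full wild-type sequence is supplied.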

5.5 Catalytic and key residue positions (for quick reference)

Residue	AA	Role
C319	Cys	Catalytic — nucleophile; DO NOT MUTATE
D480	Asp	Catalytic triad; DO NOT MUTATE
H508	His	Catalytic triad; DO NOT MUTATE
A149	Ala	Tunnel entrance; primary specificity target
S325	Ser	Tunnel lining; specificity target
[others from Section 2.4]

6. Engineering Target

6.1 Primary goal

[State precisely, e.g.:]

Incorporate 3HHx (C6) at >15 mol% in copolymer produced in E. coli BL21 on mixed carbon source (sodium butyrate + sodium hexanoate), while retaining 3HB incorporation >50 mol%

6.2 Secondary goals

[e.g. Maintain thermostability — Tm drop <5°C acceptable]
[e.g. Retain soluble expression yield >X mg/L]
[e.g. Avoid total loss of scl activity]

6.3 Acceptable tradeoffs

[e.g. Up to 30% reduction in overall polymerization rate]
[e.g. Reduced in vivo PHA titer acceptable if specificity goal met]

6.4 Hard constraints — DO NOT VIOLATE

Do NOT mutate catalytic triad: C319, D480, H508
Maximum simultaneous mutations in any single variant: [N]
Must retain soluble expression in E. coli
Avoid dimer interface residues: [list positions]
[Any others specific to your project]
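
These constraints are easy to check mechanically before a suggested variant reaches the bench. The sketch below is one possible checker under stated assumptions: variants are written as single-letter substitutions joined by "+", the maximum mutation count and dimer-interface positions are supplied by you (the template leaves them as placeholders), and solubility cannot be checked in code. The names parse_variant and check_constraints are hypothetical.

```python
import re

CATALYTIC_TRIAD = {319, 480, 508}  # C319, D480, H508 (Cn PhaC1 numbering)
AA = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"^([{AA}])(\d+)([{AA}])$")

def parse_variant(variant):
    """Parse e.g. 'F420S + A510S' into [('F', 420, 'S'), ('A', 510, 'S')]."""
    mutations = []
    for token in variant.replace(" ", "").split("+"):
        match = MUTATION_RE.match(token)
        if match is None:
            raise ValueError(f"not a substitution in standard notation: {token!r}")
        mutations.append((match.group(1), int(match.group(2)), match.group(3)))
    return mutations

def check_constraints(variant, max_mutations, interface_positions=frozenset(), wt_seq=None):
    """Return a list of constraint violations (an empty list means none found)."""
    problems = []
    mutations = parse_variant(variant)
    if len(mutations) > max_mutations:
        problems.append(f"{len(mutations)} mutations exceeds the limit of {max_mutations}")
    for wt_aa, pos, _new_aa in mutations:
        if pos in CATALYTIC_TRIAD:
            problems.append(f"position {pos} is in the catalytic triad")
        if pos in interface_positions:
            problems.append(f"position {pos} is listed as dimer interface")
        if wt_seq is not None:
            if pos < 1 or pos > len(wt_seq) or wt_seq[pos - 1] != wt_aa:
                problems.append(f"wild-type residue at {pos} is not {wt_aa}")
    return problems
```

Passing the wild-type sequence as wt_seq also catches numbering slips, for example a suggestion that names the wrong wild-type residue at a position.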

6.5 What has already been tested

(Update after every experiment round — prevents redundant suggestions)

Mutation(s)	3HHx result	Other notable effects	Date	Notes
WT control	~0 mol%	Baseline	[date]
F420S	~3 mol%	—	[date]	Insufficient
[mut]	[result]	[notes]	[date]
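
To keep redundant suggestions out of the next round, new proposals can be screened against the first column of this table. A minimal sketch with hypothetical helper names, assuming variants are written as substitutions joined by "+" and that mutation order within a variant does not matter:

```python
def canonical(variant):
    """Order-independent key for a variant, e.g. 'A510S + F420S' == 'F420S+A510S'."""
    return frozenset(tok.strip().upper() for tok in variant.split("+") if tok.strip())

def untested(suggestions, tested):
    """Return the suggestions whose mutation set is not already in `tested`."""
    seen = {canonical(v) for v in tested}
    return [s for s in suggestions if canonical(s) not in seen]
```

A combination counts as tested only if exactly that set of mutations was tested, so a new double mutant built from two tested singles still passes the filter.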

7. Production and Assay Context

7.1 Expression system

Host: [e.g. E. coli BL21(DE3)]
Vector: [e.g. pET-28a, N-terminal His6-tag]
Expression conditions: [e.g. 25°C, 16h post-induction, 0.5 mM IPTG]
Typical soluble yield: [X mg/L culture]
Purification: [e.g. Ni-NTA IMAC, single step]

7.2 In vivo PHA production conditions

Host: [e.g. E. coli BL21(DE3) ΔfadB or WT]
Co-expressed pathway genes: [e.g. phaA, phaB for 3HB-CoA; phaJ for 3HHx-CoA]
Carbon source(s): [e.g. 10 mM sodium butyrate + 5 mM sodium hexanoate]
Growth conditions: [e.g. M9 minimal, 30°C, 48h]
PHA content range (WT): [X–Y wt%]

7.3 In vitro activity assay (if used)

Assay type: [e.g. DTNB colorimetric assay — monitors CoA release at 412 nm]
Substrate(s): [e.g. 3HB-CoA, 3HHx-CoA at X µM]
Buffer conditions: [e.g. 50 mM Tris pH 7.5, 150 mM NaCl]
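
If you use the DTNB assay, the conversion from absorbance slope to specific activity follows the Beer-Lambert law. The sketch below assumes one TNB anion is formed per CoA released and uses a default extinction coefficient of 14,150 M^-1 cm^-1 at 412 nm, a commonly cited value that you should confirm for your buffer and instrument. The function name and parameter names are hypothetical.

```python
def dtnb_specific_activity(delta_a412_per_min, enzyme_mg, volume_ml=1.0,
                           path_cm=1.0, epsilon_m_cm=14150.0):
    """Specific activity in nmol CoA released per min per mg enzyme.

    delta_a412_per_min: blank-corrected initial slope of A412 (per minute).
    epsilon_m_cm: TNB extinction coefficient (assumed default; verify).
    """
    if enzyme_mg <= 0 or path_cm <= 0 or epsilon_m_cm <= 0:
        raise ValueError("enzyme_mg, path_cm and epsilon_m_cm must be positive")
    # Beer-Lambert: concentration change (M/min) = dA / (epsilon * path); scale to uM/min
    rate_um_per_min = delta_a412_per_min / (epsilon_m_cm * path_cm) * 1e6
    # uM equals nmol/mL, so multiplying by the assay volume gives nmol/min
    nmol_per_min = rate_um_per_min * volume_ml
    return nmol_per_min / enzyme_mg
```

Subtract the no-enzyme background slope before calling the function, since CoA thioesters hydrolyze slowly on their own.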

7.4 PHA analysis

Extraction method: [e.g. chloroform extraction, sodium hypochlorite digestion]
Monomer analysis: [e.g. GC-FID after acidic methanolysis; GC-MS for identification]
Quantification standard: [e.g. PHB standard curve]
Throughput: [e.g. 24 variants per experiment]
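
The mol% values used throughout this document are mole fractions of each monomer in the polymer. As a minimal sketch, assuming your standard curves report the mass of each monomer as its polymer repeat unit, the conversion looks like this. The helper name is hypothetical, and the repeat-unit molar masses are computed from the formulas C4H6O2, C5H8O2, and C6H10O2.

```python
# Molar mass of each repeat unit in g/mol (monomer minus water)
REPEAT_UNIT_MASS = {"3HB": 86.09, "3HV": 100.12, "3HHx": 114.14}

def mol_percent(masses_mg):
    """Convert {monomer: mass in mg} to {monomer: mol%}."""
    moles = {}
    for monomer, mass in masses_mg.items():
        if monomer not in REPEAT_UNIT_MASS:
            raise KeyError(f"no molar mass defined for {monomer}")
        if mass < 0:
            raise ValueError("masses must be non-negative")
        moles[monomer] = mass / REPEAT_UNIT_MASS[monomer]
    total = sum(moles.values())
    if total == 0:
        raise ValueError("no polymer detected")
    return {monomer: 100.0 * n / total for monomer, n in moles.items()}
```

If your calibration instead reports methyl ester masses or molar amounts directly, adjust or skip the mass-to-mole step accordingly.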

8. Reasoning Guidelines for LLM

8.1 Dataset context — tell the LLM explicitly at session start

Always include this statement at the top of each session prompt:

“My dataset consists only of C. necator PhaC1 (wild-type) and published point, double, and combinatorial mutants of this single enzyme. I do not have a multi-species alignment. All positional reasoning should be grounded in (a) the experimental mutation database in Section 4, and (b) structural analysis of PDB 5T6O / AlphaFold model P23608. Do not infer specificity determinants from phylogenetic patterns; that data is not available.”

8.2 Prioritization criteria (in order, adjusted for this dataset)

Direct experimental evidence — mutations in Section 4 with measured outcomes
Structural/mechanistic reasoning — based on 5T6O crystal structure and tunnel geometry (Section 2)
Analogy to Class II — using the pairwise comparison in Section 3.5, noting explicitly when this is being used
Chemical intuition — physicochemical rationale for a substitution, flagged as [SPECULATIVE] if no structural or experimental support

8.3 Required output format for mutation suggestions

For every suggested mutation, provide:

(a) Mutation in standard notation (e.g. A149F, Cn PhaC1 numbering)
(b) Primary evidence basis: Experimental / Structural / Class II analogy / Chemical intuition [SPECULATIVE]
(c) Mechanistic rationale — specific, not generic
(d) Consistency with existing data in Section 4 — does it contradict anything?
(e) Confidence: High (direct experimental support) / Medium (structural + analogy) / Low (chemical intuition only)
(f) Predicted risk: stability, expression, activity loss

8.4 Reasoning I do NOT want

Statements like “this position is conserved in mcl enzymes” — you do not have alignment data to support this; use only the pairwise comparison in 3.5
Overconfident quantitative predictions of mol% outcomes
Suggestions violating hard constraints in Section 6.4
Suggestions already in Section 6.5 “already tested” table
Filling data gaps with plausible-sounding inventions — flag uncertainty explicitly

8.5 Especially useful prompts for this dataset type

Given the single-enzyme focus, these prompt types will be most productive:

Gap analysis: “Which tunnel-lining residues (Section 2.4) have never been mutated in the literature (Section 4.7)? For each, give a structural rationale for whether they are likely to affect specificity.”
Mechanistic interpretation: “Mutation X gave unexpected result Y. Given the structural context of position X (distance to C319, neighboring residues, tunnel role), propose 2–3 mechanistic explanations.”
Epistasis prediction: “Given that [mutation A] and [mutation B] are individually beneficial (cite the rows in Section 4.2 or 6.5), reason about whether their combination is likely to be additive, synergistic, or antagonistic, based on their structural relationship.”
Experimental prioritization: “I can test 12 variants. Given the mutation database and structural data, design a 12-variant panel that maximizes information gained about specificity determinants.”

8.6 My background

[e.g.:]

PhD in microbiology/biochemistry. Comfortable with protein biochemistry, enzyme kinetics, and microbial fermentation. Less experienced with structural biology — please explain structural reasoning clearly but do not oversimplify the biochemistry.

9. Session Log

(Prepend full context document + append this log to every session)

Session [DATE]

Prompt used: [Paste exact prompt]

Key outputs / hypotheses: [Summarize or paste key suggestions]

Your assessment: [Which suggestions seem credible? Which seem poorly supported?]

Flagged uncertainties from LLM: [Note any [SPECULATIVE] tags or explicit uncertainties the LLM raised]

Action items:

[e.g. Verify position 171 distance to C319 in PyMOL]
[e.g. Test A149F single mutant]
[e.g. Search for papers on position 392 mutagenesis]

Session [DATE]

(repeat block)

10. Experimental Results Log

Experiment [DATE / ID]

Variants tested:

Variant	3HB mol%	3HV mol%	3HHx mol%	Total PHA wt%	Soluble expression?	Notes
WT	[X]	[X]	~0	[X]	Yes	Control
[mut]	[X]	[X]	[X]	[X]	[Y/N]

Interpretation: [What do these results mean for your mechanistic model?]

Surprises / inconsistencies with predictions: [Critical to record — unexpected results are often the most informative]

Updated hypotheses: [How do results revise your model of specificity determinants?]

Add to mutation database: [Y/N — copy rows to Section 4 as appropriate]

End of context document, v2.0 (single-enzyme / mutagenesis dataset scope)
Keep this file updated and prepend it in full to every new LLM session.