PhaC Enzyme Engineering — LLM Context Document

Version: v1.0
Date: [DATE]
Engineer: [YOUR NAME]
Project goal: [ONE SENTENCE SUMMARY, e.g. “Engineer Class I PhaC to incorporate 3HHx at >15 mol%”]

1. Enzyme Family Background

1.1 Classification

Class	Subunit structure	Size	Native substrate preference	Example organism
I	Single subunit	~65 kDa	scl (C3–C5): 3HB, 3HV, 3HP	Cupriavidus necator H16
II	Single subunit	~60 kDa	mcl (C6–C14): 3HHx, 3HO, 3HD	Pseudomonas aeruginosa
III	Heterodimer (PhaC + PhaE)	~40+40 kDa	scl	Allochromatium vinosum
IV	Heterodimer (PhaC + PhaR)	~40+20 kDa	scl	Bacillus megaterium

Class I and II share ~50% sequence identity; Class III/IV are more distantly related
Class I/II are the primary engineering targets for substrate specificity work

1.2 Reaction chemistry

Catalyzes polymerization of (R)-3-hydroxyacyl-CoA thioesters into PHA
Ping-pong (double displacement) mechanism:
1. Acylation: acyl group transferred to catalytic Cys, CoA released
2. Transacylation: acyl group transferred to growing polymer chain
Lipase-like α/β hydrolase fold
Catalytic triad: Cys – His – Asp
- C. necator PhaC1 (Cn) reference numbering: C319, D480, H508

1.3 Substrate scope terminology

Term	Chain length	Key monomers	Notes
scl	C3–C5	3HP, 3HB, 3HV	Most Class I enzymes
mcl	C6–C14	3HHx, 3HO, 3HD, 3HDD	Most Class II enzymes
lcl	>C14	3HHxD+	Very rare
Broad/mixed	C3–C14	scl + mcl	Rare, high engineering value
Specialty	varies	3H4MV, 3H2MB, aromatic	Non-standard monomers

1.4 Why substrate specificity is structurally interesting

Substrate-binding tunnel geometry determines acyl chain length tolerance
Residues within ~5–10 Å of catalytic Cys are primary selectivity determinants
mcl selectivity often results from removing a steric clash (substitution with smaller residues) rather than adding new contacts; counterintuitive, but supported by mutagenesis data (see Section 4.2)
Electrostatic environment affects CoA-thioester positioning
Dimerization interface indirectly influences active site geometry (Class I/II)
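The distance criterion above can be turned into a quick residue shortlist once a structure file is in hand. The following is a minimal sketch, assuming standard fixed-column PDB ATOM records and a single chain labeled "A". The helper names (atom_line, parse_atoms, residues_near) and the toy coordinates are hypothetical and for illustration only; they say nothing about the real geometry of any PhaC structure.

```python
import math

def atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a fixed-column PDB ATOM record (used here only for toy data)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def parse_atoms(pdb_text):
    """Parse ATOM records into dicts using the standard PDB column positions."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atoms.append({
                "name": line[12:16].strip(),
                "resname": line[17:20].strip(),
                "chain": line[21],
                "resseq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms

def residues_near(atoms, center_resseq, center_atom="SG", cutoff=8.0, chain="A"):
    """Return [((resseq, resname), min_distance)] for residues with any atom
    within `cutoff` angstroms of the chosen atom of the center residue."""
    centers = [a["xyz"] for a in atoms
               if a["chain"] == chain and a["resseq"] == center_resseq
               and a["name"] == center_atom]
    if not centers:
        raise ValueError("center atom not found")
    center = centers[0]
    hits = {}
    for a in atoms:
        if a["chain"] != chain or a["resseq"] == center_resseq:
            continue
        d = math.dist(center, a["xyz"])
        if d <= cutoff:
            key = (a["resseq"], a["resname"])
            hits[key] = min(d, hits.get(key, d))
    return sorted(hits.items())

# Toy coordinates (hypothetical, NOT taken from a real structure).
toy_pdb = "\n".join([
    atom_line(1, "SG", "CYS", "A", 319, 0.0, 0.0, 0.0),
    atom_line(2, "OG", "SER", "A", 325, 3.0, 4.0, 0.0),
    atom_line(3, "CB", "ALA", "A", 149, 20.0, 0.0, 0.0),
])
near = residues_near(parse_atoms(toy_pdb), center_resseq=319, cutoff=8.0)
print(near)  # [((325, 'SER'), 5.0)]
```

For real work, a structure library would handle alternate locations, insertion codes, and multiple chains, which this sketch ignores. Running it at both 5 Å and 10 Å gives an inner and an outer shell to paste into Section 2.4.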

2. Structural Information

2.1 Available experimental structures

PDB ID	Enzyme	Class	Resolution	Notes
5T6O	C. necator PhaC1	I	[X] Å	Primary Class I reference
5XAV	Chromobacterium sp. USM2 PhaC	I	[X] Å	Catalytic domain only
[ID]	[Enzyme]	[Class]	[Res]	[Notes]

2.2 AlphaFold models

UniProt accession	Organism	Class	pLDDT (overall)	Confidence notes
[ACCESSION]	[ORG]	[I/II]	[score]	[e.g. low in N-term, residues 1–40]

2.3 Key structural regions

(Using C. necator PhaC1 residue numbering as reference)

Region	Residues (Cn)	Function	Conservation
N-terminal domain	1–170	Regulatory, dimerization	Low
Core catalytic domain	171–400	Contains Cys319	High
C-terminal domain	401–589	Contains Asp480, His508	High
Substrate-binding tunnel	[list residues]	Selectivity determinant	Moderate
Dimer interface	[list residues]	Stability	Moderate

2.4 Known substrate-contacting / selectivity residues

(From mutagenesis studies and structural analyses — update as you find more)

Position (Cn)	WT residue	Role	scl consensus	mcl consensus	Evidence
149	Ala	Tunnel entrance	A/V (89%)	F/W (74%)	Mutagenesis
171	[AA]	Structural hinge
325	Ser	Tunnel lining	S/A	A/G
392	[AA]	Near active site
480	Asp	Catalytic triad	D	D	Catalytic
508	His	Catalytic triad	H	H	Catalytic
[pos]	[AA]	[role]

2.5 Tunnel geometry notes

scl enzymes: narrower tunnel, estimated constriction ~4–6 Å
mcl enzymes: wider tunnel — bulky residues at key positions replaced by smaller ones (Ala, Gly) to accommodate longer acyl chains
[Add any MD simulation or docking notes here as available]

3. Sequence Dataset Summary

3.1 Dataset composition

Total sequences collected: [N]
After dereplication at 95% sequence identity (CD-HIT): [N]
Labeled with substrate preference data: [N]
- scl only: [N]
- mcl only: [N]
- broad/mixed: [N]
- specialty monomer: [N]
Unlabeled (phylogenetic diversity only): [N]
Data sources: UniProt/SwissProt, NCBI RefSeq, literature

3.2 Taxonomic distribution

Taxon	N sequences	Dominant class	Notes
Betaproteobacteria	[N]	Class I	C. necator relatives
Gammaproteobacteria	[N]	Class II	Pseudomonas relatives
Alphaproteobacteria	[N]	I/III
Firmicutes	[N]	Class IV
Other	[N]

3.3 Alignment properties

Alignment method: [MUSCLE / Clustal Omega / MAFFT]
Raw aligned length: [N] columns
After gap trimming (columns with >80% gaps removed): [N] columns
Mean pairwise identity, full dataset: [X]%
Mean pairwise identity, scl group: [X]%
Mean pairwise identity, mcl group: [X]%
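The trimming and identity figures above can be computed directly from the aligned sequences. This is a minimal sketch under two assumptions that should be checked against your actual pipeline: a column is dropped when more than 80% of sequences have a gap there, and pairwise identity is counted only over columns where both sequences are ungapped (other definitions exist and give different numbers). The function names and the toy alignment are hypothetical.

```python
from itertools import combinations

def trim_gappy_columns(seqs, max_gap_fraction=0.8):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    n = len(seqs)
    keep = [i for i in range(len(seqs[0]))
            if sum(s[i] == "-" for s in seqs) / n <= max_gap_fraction]
    return ["".join(s[i] for i in keep) for s in seqs]

def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def mean_pairwise_identity(seqs):
    """Mean pairwise identity in percent over all sequence pairs."""
    vals = [pairwise_identity(a, b) for a, b in combinations(seqs, 2)]
    return 100.0 * sum(vals) / len(vals)

# Toy alignment (hypothetical): the last column is all gaps and is removed.
toy = ["ACD-", "ACE-", "AGD-"]
trimmed = trim_gappy_columns(toy)
print(trimmed)                                    # ['ACD', 'ACE', 'AGD']
print(round(mean_pairwise_identity(trimmed), 2))  # 55.56
```

Calling mean_pairwise_identity separately on the scl-labeled and mcl-labeled subsets fills the two group-level fields.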

3.4 Top mutual information positions

(Fill in after running the mutual information (MI) analysis on the labeled scl vs. mcl alignment)

Alignment col	Approx. residue (Cn)	scl consensus	mcl consensus	MI score
[col]	~[res]	[AA (%)]	[AA (%)]	[score]
[col]	~[res]	[AA (%)]	[AA (%)]	[score]
[col]	~[res]	[AA (%)]	[AA (%)]	[score]
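One way to produce the MI scores for this table is to compute, for each alignment column, the mutual information between residue identity and the substrate label. The sketch below assumes that reading (column vs. scl/mcl label, in bits) and treats a gap as just another symbol. The function names and toy sequences are hypothetical, and with small labeled sets raw MI is biased upward, so scores are best used for ranking rather than as absolute values.

```python
import math
from collections import Counter

def column_mi(column, labels):
    """Mutual information (bits) between residues in one column and labels."""
    n = len(column)
    res_counts = Counter(column)
    lab_counts = Counter(labels)
    joint_counts = Counter(zip(column, labels))
    mi = 0.0
    for (res, lab), c in joint_counts.items():
        # p(res,lab) * log2( p(res,lab) / (p(res) * p(lab)) )
        mi += (c / n) * math.log2(c * n / (res_counts[res] * lab_counts[lab]))
    return mi

def rank_columns(seqs, labels):
    """Return [(1-based alignment column, MI)] sorted by descending MI."""
    scores = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        scores.append((i + 1, column_mi(column, labels)))
    return sorted(scores, key=lambda t: -t[1])

# Toy data (hypothetical): column 2 separates the labels, column 1 does not.
toy_seqs = ["AF", "AF", "AS", "AS"]
toy_labels = ["scl", "scl", "mcl", "mcl"]
ranked = rank_columns(toy_seqs, toy_labels)
print(ranked)  # [(2, 1.0), (1, 0.0)]
```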

4. Experimental Mutation Database

(This section should grow over time as you mine the literature and generate your own data)

4.1 Key literature to mine

Tsuge et al. (2003) Macromolecules — F420 region, Class I
Amara et al. (2002) — systematic Class I mutagenesis
Rehm lab series — Class II specificity determinants
Nomura et al. — broad-specificity engineered variants
Insomphun et al. — 3HHx incorporation engineering
[Add others as you find them]

4.2 Gain-of-function mutations (toward mcl/broader specificity)

Mutation	Background	Substrate effect	Quantitative data	Assay type	Reference
F420S	Cn PhaC1	Gains 3HHx incorporation	3HHx: 0 → 8 mol%	In vivo GC	Tsuge 2003
A510S	Cn PhaC1	Increased mcl acceptance	—		Amara 2002
[mut]	[bg]	[effect]	[data]	[assay]	[ref]

4.3 Loss-of-function / specificity-narrowing mutations

Mutation	Background	Substrate effect	Quantitative data	Assay type	Reference
[mut]	[bg]	[effect]	[data]	[assay]	[ref]

4.4 Combinatorial / double mutants

Mutations	Background	Effect vs. singles	Epistasis	Reference
F420S + A510S	Cn PhaC1	[effect]	Additive / synergistic / antagonistic	[ref]
[muts]	[bg]	[effect]		[ref]
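The epistasis column can be filled consistently by comparing the double mutant with the sum of the single-mutant effects. This is a minimal sketch assuming an additive null model on the measured scale (for example 3HHx mol%) and an arbitrary tolerance for calling a result additive. Both are assumptions: a multiplicative (log-scale) model may suit activity data better, and the tolerance should reflect your replicate variability. The numbers in the example are hypothetical.

```python
def classify_epistasis(wt, single_a, single_b, double, tol=1.0):
    """Compare a double mutant with the additive expectation from two singles.

    Returns (expected, epsilon, label), where epsilon = double - expected.
    """
    expected = wt + (single_a - wt) + (single_b - wt)
    epsilon = double - expected
    if abs(epsilon) <= tol:
        label = "additive"
    elif epsilon > 0:
        label = "synergistic"
    else:
        label = "antagonistic"
    return expected, epsilon, label

# Hypothetical values in mol% (not measured data).
print(classify_epistasis(wt=0.0, single_a=6.0, single_b=2.0, double=12.0))
# (8.0, 4.0, 'synergistic')
print(classify_epistasis(wt=0.0, single_a=6.0, single_b=2.0, double=5.0))
# (8.0, -3.0, 'antagonistic')
```

The labels here assume that higher values are better (gain of function); for a property where lower is better, the sign convention should be flipped.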

4.5 Thermostability mutations

(Relevant if stacking specificity mutations — need to preserve stability)

Mutation	Background	ΔTm	Effect on activity	Reference
[mut]	[bg]	[+/- X°C]	[effect]	[ref]

4.6 Notes on data quality and comparability

Monomer incorporation % varies heavily with fermentation conditions (carbon source, growth phase, host strain) — cross-lab comparisons are unreliable
In vitro assays (purified enzyme + CoA thioesters) report intrinsic specificity more directly than in vivo monomer composition, which also depends on precursor supply
[Add any other caveats specific to your dataset]

5. Your Starting Enzyme (Wild-Type)

5.1 Identity

UniProt accession: [ID]
Organism: [NAME]
PhaC class: [I / II / III / IV]
Gene name: [phaC / phaC1 / phaC2]
Full sequence length: [N] aa

5.2 Known properties

Native substrate preference: [e.g. scl — 3HB/3HV, negligible 3HHx]
Specific activity: [X nmol/min/mg if known]
Thermostability: [Tm or optimal temperature]
Expression: [e.g. soluble in E. coli BL21 at 25°C, typical yield X mg/L]
Any known issues: [e.g. prone to aggregation, requires CoA for stability]

5.3 Sequence — full

[PASTE FULL AMINO ACID SEQUENCE HERE]

5.4 Sequence — substrate-binding pocket region

(~30 residues centered on catalytic Cys; easier to include in prompts)

[PASTE POCKET REGION SEQUENCE HERE — label residue numbers]
e.g. residues 304–334: XXXXXXXXXXXXXXXCXXXXXXXXXXXXXXX
                                      ^C319
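The pocket window can be cut from the full sequence programmatically so that the residue labels stay correct. This is a minimal sketch assuming 1-based residue numbering that starts at the first residue of the pasted sequence; pocket_window and the placeholder sequence are hypothetical. A flank of 15 on each side gives 31 residues, so a Cys at 319 yields residues 304–334.

```python
def pocket_window(seq, center, flank=15, first_residue=1):
    """Return (start_resnum, end_resnum, subsequence) centered on `center`.

    The window is clipped at the sequence ends rather than padded.
    """
    idx = center - first_residue
    if not 0 <= idx < len(seq):
        raise ValueError("center residue lies outside the sequence")
    start = max(0, idx - flank)
    end = min(len(seq), idx + flank + 1)
    return start + first_residue, end - 1 + first_residue, seq[start:end]

# Placeholder 589-residue sequence with Cys at position 319 (not a real sequence).
toy_seq = "G" * 318 + "C" + "G" * 270
start, end, window = pocket_window(toy_seq, center=319)
print(f"residues {start}-{end}: {window}")
print(" " * (len(f"residues {start}-{end}: ") + 319 - start) + "^C319")
```

The second print places the caret under the catalytic Cys, which makes a quick visual check that the numbering offset is right.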

5.5 Alignment position mapping

(Map your WT residue numbers to the C. necator reference numbering and to your alignment column numbers — critical for interpreting suggestions)

Your residue	Your AA	Cn equivalent residue	Alignment column
[N]	[AA]	[N]	[col]
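The mapping table can be generated from a pairwise alignment of your sequence against the C. necator reference (or from the two corresponding rows of the full alignment). This is a minimal sketch assuming both rows are equal-length aligned strings with "-" as the gap character; map_positions and the toy rows are hypothetical.

```python
def map_positions(aligned_query, aligned_ref, query_start=1, ref_start=1):
    """Return rows of (query_resnum, query_aa, ref_resnum or None, alignment_col).

    ref_resnum is None where the query residue aligns to a gap in the reference.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned rows must have equal length")
    rows = []
    q = query_start - 1
    r = ref_start - 1
    for col, (qa, ra) in enumerate(zip(aligned_query, aligned_ref), start=1):
        if ra != "-":
            r += 1
        if qa != "-":
            q += 1
            rows.append((q, qa, r if ra != "-" else None, col))
    return rows

# Toy alignment rows (hypothetical, not real PhaC sequence).
toy_query = "MK-CAH"
toy_ref = "M-QCGH"
mapping = map_positions(toy_query, toy_ref)
for row in mapping:
    print("\t".join(str(x) for x in row))
```

In the toy case the query Cys is residue 3 in both numberings but sits in alignment column 4, which is exactly the kind of offset this table exists to track.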

6. Engineering Target

6.1 Primary goal

[State precisely, e.g.:]

Incorporate 3HHx (C6) at >15 mol% in scl-mcl copolymer produced in E. coli BL21 on mixed carbon source (sodium butyrate + sodium hexanoate)

6.2 Secondary goals

[e.g. Retain 3HB incorporation >50 mol%]
[e.g. Maintain thermostability — Tm drop <5°C acceptable]

6.3 Acceptable tradeoffs

[e.g. Up to 30% reduction in overall polymerization activity]
[e.g. Reduced expression yield acceptable if specificity goal is met]

6.4 Hard constraints — DO NOT VIOLATE

Maximum simultaneous mutations: [N] (practical screening limit)
Must retain soluble expression in E. coli
Do not mutate catalytic triad residues (C319, D480, H508)
Avoid dimer interface mutations (stability risk)
[Add any others]

6.5 What has already been tested

(Critical — prevents the LLM from repeatedly suggesting things you’ve tried)

Mutation(s)	Result	Date tested	Notes
F420S	3HHx only 3 mol% (insufficient)	[date]	Tested in BL21, 30°C
[mut]	[result]	[date]
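Suggestions coming back from an LLM session can be screened automatically against the hard constraints in Section 6.4 and the already-tested table above before anyone reads the rationale. This is a minimal sketch assuming single-letter substitution notation (e.g. A149F) and Cn numbering throughout; check_suggestion, its parameters, and the limit of 3 mutations used in the example are hypothetical choices, not part of any existing tool.

```python
import re

MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_mutation(text):
    """Split 'A149F' into ('A', 149, 'F'); raise ValueError on other formats."""
    match = MUTATION_RE.match(text.strip())
    if not match:
        raise ValueError(f"not in standard notation: {text!r}")
    wt, pos, new = match.groups()
    return wt, int(pos), new

def check_suggestion(mutations, *, max_mutations, protected_positions, tested):
    """Return a list of problems; an empty list means the variant passes."""
    problems = []
    parsed = [parse_mutation(m) for m in mutations]
    if len(parsed) > max_mutations:
        problems.append(f"{len(parsed)} mutations exceeds limit of {max_mutations}")
    for wt, pos, new in parsed:
        if pos in protected_positions:
            problems.append(f"position {pos} is protected (hard constraint)")
        if wt == new:
            problems.append(f"{wt}{pos}{new} does not change the residue")
    if frozenset(m.strip() for m in mutations) in tested:
        problems.append("this combination has already been tested")
    return problems

CATALYTIC_TRIAD = {319, 480, 508}          # from Section 6.4
ALREADY_TESTED = {frozenset({"F420S"})}    # from the table above

for variant in (["C319A"], ["F420S"], ["A149F"]):
    print(variant, check_suggestion(variant, max_mutations=3,
                                    protected_positions=CATALYTIC_TRIAD,
                                    tested=ALREADY_TESTED))
```

Dimer-interface residues could be added to the protected set once the list in Section 2.3 is filled in.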

7. Production and Assay Context

7.1 Expression system

Host: [e.g. E. coli BL21(DE3)]
Vector: [e.g. pET-28a, His-tag]
Expression conditions: [e.g. 25°C, 16h, 0.5 mM IPTG]
Typical yield: [X mg/L culture]

7.2 PHA production conditions

Carbon source(s): [e.g. 10 mM sodium butyrate + 5 mM sodium hexanoate]
Co-pathway: [e.g. PhaA/PhaB co-expressed for 3HB-CoA supply; PhaJ for 3HHx-CoA]
Growth phase at harvest: [e.g. 48h, stationary]
PHA content typically: [X wt%]

7.3 Analytical method

PHA extraction: [e.g. chloroform extraction, sodium hypochlorite method]
Monomer analysis: [e.g. GC-FID after methanolysis, GC-MS for identification]
Activity assay (if used): [e.g. DTNB assay monitoring CoA release]
Throughput: [e.g. 24 variants per experiment]

8. Reasoning Guidelines for LLM

8.1 Prioritization criteria (in order)

1. Mechanistic/structural plausibility: a rationale is required
2. Consistency with experimental mutation database (Section 4)
3. Conservation pattern in target-substrate homologs (Section 3.4)
4. Novelty relative to literature

8.2 Required output format for mutation suggestions

For every suggested mutation, provide:

(a) Mutation in standard notation (e.g. A149F)
(b) Mechanistic rationale — why this residue, why this substitution
(c) Supporting evidence — literature, alignment, structural
(d) Confidence level: High / Medium / Low
(e) Potential risks — stability, expression, off-target effects
(f) Tag [SPECULATIVE] if based on analogy with no direct evidence
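If suggestions are logged in structured form (for Section 9), the six required fields can be enforced with a small record type. This is a minimal sketch; the MutationSuggestion class and its field names are hypothetical, and the example values are placeholders rather than real scientific content.

```python
from dataclasses import dataclass

CONFIDENCE_LEVELS = ("High", "Medium", "Low")

@dataclass
class MutationSuggestion:
    mutation: str            # (a) standard notation, e.g. A149F
    rationale: str           # (b) why this residue, why this substitution
    evidence: list[str]      # (c) literature, alignment, structural
    confidence: str          # (d) High / Medium / Low
    risks: list[str]         # (e) stability, expression, off-target effects
    speculative: bool = False  # (f) analogy only, no direct evidence

    def __post_init__(self):
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {CONFIDENCE_LEVELS}")
        if not self.rationale.strip():
            raise ValueError("a mechanistic rationale is required")

    def render(self):
        """Format the record in the (a) to (f) layout."""
        lines = [
            f"(a) Mutation: {self.mutation}",
            f"(b) Rationale: {self.rationale}",
            f"(c) Evidence: {'; '.join(self.evidence) or 'none given'}",
            f"(d) Confidence: {self.confidence}",
            f"(e) Risks: {'; '.join(self.risks) or 'none given'}",
        ]
        if self.speculative:
            lines.append("(f) [SPECULATIVE]")
        return "\n".join(lines)

example = MutationSuggestion(
    mutation="A149F",
    rationale="[rationale text]",
    evidence=["[alignment note]"],
    confidence="Low",
    risks=["[stability risk]"],
    speculative=True,
)
print(example.render())
```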

8.3 Reasoning I do NOT want

Suggestions based solely on “this residue differs between scl and mcl sequences” without structural or mechanistic reasoning
Overconfident quantitative predictions (e.g. “this will give 20% 3HHx”)
Suggestions that violate hard constraints in Section 6.4
Ignoring the “already tested” table in Section 6.5

8.4 When hypotheses conflict

Explicitly state the conflict and explain both sides
Do not silently choose one; flag for experimental resolution

8.5 My background

[Describe your expertise so the LLM calibrates explanation depth, e.g.:]

PhD in microbiology/biochemistry. Comfortable with protein biochemistry and microbial fermentation. Less experienced with structural biology and computational methods — please explain structural reasoning in accessible terms but do not oversimplify the biochemistry.

9. Session Log

(Append after each LLM session — builds institutional memory)

Session [DATE]

Question asked: [Paste your prompt]

Key LLM output / hypotheses generated: [Summarize or paste]

Your assessment: [Which suggestions seem worth pursuing, which to discard and why]

Action items:

[e.g. Test A149F single mutant]
[e.g. Check position 171 in AlphaFold model]

Session [DATE]

(repeat block)

10. Experimental Results Log

(Append as data comes in — feeds back into Section 4 and future sessions)

Experiment [DATE / ID]

Variants tested:

Variant	3HB mol%	3HV mol%	3HHx mol%	Total PHA wt%	Notes
WT	[X]	[X]	[X]	[X]	Control
[mut]	[X]	[X]	[X]	[X]

Interpretation: [What do these results mean for your hypotheses?]

Updated hypotheses: [How do results change your model of specificity determinants?]
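Because the results table is tab-separated, the change in a monomer fraction relative to the WT control can be computed straight from the pasted text. This is a minimal sketch assuming the column headers used in the table above and a control row named "WT"; delta_vs_wt, the variant name, and all numbers are hypothetical placeholders, not measured data.

```python
import csv
import io

def delta_vs_wt(tsv_text, column="3HHx mol%", control="WT"):
    """Return {variant: value - control value} for one numeric column."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    wt_value = next(float(r[column]) for r in rows if r["Variant"] == control)
    return {r["Variant"]: float(r[column]) - wt_value
            for r in rows if r["Variant"] != control}

# Hypothetical table contents (placeholder numbers only).
header = ["Variant", "3HB mol%", "3HV mol%", "3HHx mol%", "Total PHA wt%", "Notes"]
data = [
    ["WT", "95.0", "5.0", "0.0", "40.0", "Control"],
    ["variant_1", "90.0", "5.0", "5.0", "35.0", ""],
]
toy_tsv = "\n".join("\t".join(row) for row in [header] + data)

deltas = delta_vs_wt(toy_tsv)
print(deltas)  # {'variant_1': 5.0}
```

Given the caveat in Section 4.6, these differences are only meaningful between variants run in the same experiment under the same conditions.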

End of context document — keep this file updated and prepend it to every new LLM session