PhaC Enzyme Engineering — LLM Context Document

Version: v2.0
Date: [DATE]
Engineer: [YOUR NAME]
Project goal: [ONE SENTENCE SUMMARY, e.g. “Engineer C. necator PhaC1 to incorporate 3HHx at >15 mol%”]

Dataset scope note: This document is built around a single reference enzyme (C. necator PhaC1) and published mutagenesis studies on that enzyme and its close variants. It does NOT use diverse multi-species sequence alignments. See Section 3 for implications and compensating strategies.


1. Enzyme Family Background

1.1 Classification

Class | Subunit structure | Size | Native substrate preference | Example organism
I | Single subunit | ~65 kDa | scl (C3–C5): 3HB, 3HV, 3HP | Cupriavidus necator H16
II | Single subunit | ~60 kDa | mcl (C6–C14): 3HHx, 3HO, 3HD | Pseudomonas aeruginosa
III | Heterodimer (PhaC + PhaE) | ~40+40 kDa | scl | Allochromatium vinosum
IV | Heterodimer (PhaC + PhaR) | ~40+20 kDa | scl | Bacillus megaterium
  • Class I and II share ~50% sequence identity; Class III/IV are more distantly related
  • This project focuses exclusively on Class I, using Cn PhaC1 as the sole reference

1.2 Reaction chemistry

  • Catalyzes polymerization of (R)-3-hydroxyacyl-CoA thioesters into PHA
  • Ping-pong (double displacement) mechanism:
    1. Acylation: acyl group transferred to catalytic Cys, CoA released
    2. Transacylation: acyl group transferred to growing polymer chain
  • Lipase-like α/β hydrolase fold
  • Catalytic triad: Cys – His – Asp
    • C. necator PhaC1 (Cn) reference numbering: C319, D480, H508
    • All residue positions in this document use Cn PhaC1 numbering unless noted

1.3 Substrate scope terminology

Term | Chain length | Key monomers | Notes
scl | C3–C5 | 3HP, 3HB, 3HV | Native Cn PhaC1 preference
mcl | C6–C14 | 3HHx, 3HO, 3HD, 3HDD | Engineering target
Broad/mixed | C3–C14 | scl + mcl | Ideal outcome; rare
Specialty | varies | 3H4MV, 3H2MB, aromatic | Out of scope for this project

1.4 Why substrate specificity is structurally interesting

  • Substrate-binding tunnel geometry determines acyl chain length tolerance
  • Residues within ~5–10 Å of catalytic Cys are primary selectivity determinants
  • mcl selectivity often results from removal of steric clash (smaller residues), not addition of new contacts — counterintuitive but well-supported
  • Electrostatic environment affects CoA-thioester positioning
  • Dimerization interface indirectly influences active site geometry

2. Structural Information

Note: With a single-enzyme dataset, structural information becomes more important, not less. It is the primary source of positional reasoning in the absence of multi-species alignment signal. Invest time in this section.

2.1 Available experimental structures for Cn PhaC1

PDB ID | Details | Resolution | Notes
5T6O | C. necator PhaC1, Class I | [X] Å | Primary reference; use this
[ID] | [Any other Cn PhaC1 structures] | [Res] | [notes]

2.2 AlphaFold model for Cn PhaC1

  • UniProt accession: Q05HB5 (Cn PhaC1)
  • Overall pLDDT: [score]
  • Confidence notes: [e.g. high confidence in core domain, low in N-terminal region residues 1–30 and surface loops]
  • Use AF model for: loop conformations, surface regions not in crystal structure
  • Prefer crystal structure (5T6O) for: active site geometry, tunnel dimensions

2.3 Key structural regions (Cn PhaC1 numbering)

Region | Residues | Function | Notes
N-terminal domain | 1–170 | Regulatory, dimerization | Less conserved, lower structure confidence
Core catalytic domain | 171–400 | Contains Cys319 | High confidence, primary engineering target
C-terminal domain | 401–589 | Contains Asp480, His508 | High confidence
Substrate-binding tunnel | [list residues] | Selectivity determinant | Fill from structural analysis
Dimer interface | [list residues] | Stability | Avoid mutations here

2.4 Substrate-binding tunnel residues

(Fill this table carefully — it is the core of your structural reasoning)

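Position | WT residue | Distance to C319 (Å) | Role in tunnel | Notes
149 | Ala | [X] | Entrance region | Key specificity determinant
171 | [AA] | [X] | [role] | [notes]
325 | Ser | [X] | Tunnel lining | [notes]
392 | [AA] | [X] | [role] | [notes]
[pos] | [AA] | [X] | [role] | [notes]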

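How to fill this table: Open 5T6O in PyMOL (ChimeraX works too, but the command below is PyMOL syntax). In PyMOL, run: select tunnel_res, byres (all within 10 of resi 319). This selects every residue that has at least one atom within 10 Å of residue 319. List those residues here with their distances to C319. This is worth spending 1–2 hours on; it will substantially improve LLM reasoning quality.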
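If you want the distances as numbers rather than reading them off the viewer, the same neighborhood query can be sketched in plain Python. This is a minimal sketch, not a replacement for inspecting the structure: the function name residues_near and the file name in the comment are hypothetical, it reads only fixed-column ATOM records, and unlike the PyMOL selection it leaves the center residue out of the result.

```python
import math


def residues_near(pdb_text, center_resi, cutoff=10.0, chain=None):
    """Return {(chain, resi, resname): min distance in Å} for residues with at
    least one atom within `cutoff` of any atom of residue `center_resi`.

    Assumes standard fixed-column PDB ATOM records. If `chain` is None, atoms
    from all chains are pooled, so pass a chain ID for multi-chain files.
    """
    atoms = []  # (chain, resi, resname, (x, y, z))
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        ch = line[21]
        if chain is not None and ch != chain:
            continue
        resname = line[17:20].strip()
        resi = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((ch, resi, resname, xyz))

    center = [xyz for (_, resi, _, xyz) in atoms if resi == center_resi]
    if not center:
        raise ValueError(f"residue {center_resi} not found in structure")

    nearest = {}
    for ch, resi, resname, xyz in atoms:
        if resi == center_resi:
            continue  # the center residue itself is not reported
        d = min(math.dist(xyz, c) for c in center)
        key = (ch, resi, resname)
        if d <= cutoff and d < nearest.get(key, float("inf")):
            nearest[key] = d
    # Closest residues first, ready to paste into the table
    return dict(sorted(nearest.items(), key=lambda kv: kv[1]))


# Example call (hypothetical local file name):
# with open("5t6o.pdb") as fh:
#     for (ch, resi, name), d in residues_near(fh.read(), 319, chain="A").items():
#         print(resi, name, round(d, 1))
```

The reported value is the minimum atom-to-atom distance, which matches the "any atom within 10 Å" logic of the PyMOL selection. Check what residue 319 actually is in the deposited structure before interpreting distances to its side chain.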

2.5 Tunnel geometry notes

  • Estimated tunnel constriction in WT Cn PhaC1: ~[X] Å (from structural analysis)
  • Residues that form the constriction point: [list]
  • Estimated minimum cavity volume for 3HHx-CoA accommodation: [X ų if known]
  • [Add MD simulation or docking results here as they become available]

3. Dataset Scope, Limitations, and Compensating Strategies

This section is critical. Read before every LLM session.

3.1 What your dataset contains

  • Reference enzyme: C. necator H16 PhaC1 (wild-type)
  • Variants: Published point mutants, double mutants, and combinatorial variants of Cn PhaC1 from the mutagenesis literature
  • Labels: Substrate incorporation data (mol% monomer) from those studies
  • What it does NOT contain: Homologous PhaC sequences from other species, Class II sequences, or unlabeled natural variants

3.2 Implications and honest limitations

Issue | Explanation | Impact
No evolutionary signal | Without a multi-species alignment, you cannot use co-evolutionary analysis (MI, DCA) to identify specificity-determining positions | Cannot compute MI scores; Section 3.4 of the original template (not Section 3.4 of this document) is not applicable
Narrow sequence space | All data points are close variants of one sequence (~1–5 mutations from WT) | Model cannot extrapolate to distant sequence space; suggestions far from WT are unreliable
Sparse coverage | Published mutagenesis studies cover only a small fraction of all possible positions | Many positions have no experimental data; reasoning about them is purely structural/hypothetical
Publication bias | Literature overwhelmingly reports positive results (mutations that did something interesting) | Negative results (mutations with no effect) are underrepresented; hard to learn what doesn’t matter
Lab-to-lab variability | Different studies use different assay conditions, hosts, carbon sources | Quantitative comparisons across studies are unreliable; treat mol% values as approximate
Limited combinatorial data | Few studies systematically explore epistatic interactions | Combining individually beneficial mutations may not be additive

3.3 What this dataset IS good for

  • The LLM can reason very effectively about:
    • Mechanistic hypotheses — why does mutation X change specificity, based on structure and chemistry?
    • Interpreting your experimental results — what does an unexpected outcome tell you about the mechanism?
    • Experimental design — which mutations to test next given what is known?
    • Identifying gaps — which positions have never been mutated but are structurally important?
    • Literature synthesis — connecting observations across papers into a coherent mechanistic model

3.4 Compensating strategies

To partially offset the lack of multi-species alignment data:

  1. Lean heavily on structural reasoning (Section 2) — fill in the tunnel residue table as completely as possible; this replaces alignment signal as your primary source of positional hypotheses

  2. Include Class II reference data explicitly — even if not in your training set, you can add a “comparative note” section describing which Cn PhaC1 positions correspond to Class II residues (from manual alignment of just Cn PhaC1 vs. Pa PhaC1/2). This gives the LLM evolutionary context without requiring a full MSA.

  3. Weight negative results as heavily as positive ones: if you can find papers reporting mutations that failed to shift specificity, record them in Section 4.3. They are highly informative and rare in the literature.

  4. Be explicit about data gaps in prompts — tell the LLM “position X has never been mutated in the literature” so it flags its reasoning as structural/hypothetical rather than evidence-based.

  5. Use the LLM to propose positions to structurally analyze — ask it which tunnel residues it would prioritize examining in the crystal structure, then verify those manually before including them in subsequent prompts.

3.5 Class II reference comparison

(Manual alignment of just Cn PhaC1 vs. one or two Class II enzymes — fills in some evolutionary context without a full MSA)

Cn PhaC1 residue | Cn AA | Pa PhaC1 equivalent residue | Pa AA | Significance
149 | Ala | [pos] | [AA] | Tunnel entrance
325 | Ser | [pos] | [AA] | Tunnel lining
[pos] | [AA] | [pos] | [AA] | [significance]

How to fill this: Use a pairwise alignment tool (e.g. EMBOSS Needle at https://www.ebi.ac.uk/Tools/psa/emboss_needle/) with Cn PhaC1 (UniProt Q05HB5) and Pa PhaC1 (UniProt Q9HWK2). This takes ~10 minutes and is worth doing.
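Once the pairwise alignment is done, reading equivalent positions off the gapped output by eye is error-prone. The sketch below shows one way to turn the two gapped alignment strings into a position map. It assumes you have pasted the two aligned sequences (with "-" for gaps) out of the alignment tool's output; the function name map_positions and the toy sequences in the comment are hypothetical.

```python
def map_positions(aligned_ref, aligned_other, ref_start=1, other_start=1):
    """Map reference positions to equivalent positions in the other sequence.

    Both inputs are gapped strings of equal length taken from a pairwise
    alignment ("-" marks a gap). Returns a dict:
        ref_position -> (ref_aa, other_position or None, other_aa or "-")
    Positions are 1-based unless the alignment starts elsewhere
    (set ref_start / other_start accordingly for local alignments).
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")

    mapping = {}
    i = ref_start - 1    # last consumed position in the reference
    j = other_start - 1  # last consumed position in the other sequence
    for a, b in zip(aligned_ref, aligned_other):
        if a != "-":
            i += 1
        if b != "-":
            j += 1
        if a != "-":
            # A reference residue opposite a gap has no equivalent
            mapping[i] = (a, j, b) if b != "-" else (a, None, "-")
    return mapping


# Toy example (not real PhaC sequence):
# m = map_positions("MA-TGK", "M-STGK")
# m[3] is ("T", 3, "T"); m[2] is ("A", None, "-")
```

With the real alignment, look up mapping[149] and mapping[325] to fill the first two rows of the table, and treat any position that maps into or next to a gap as unreliable.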


4. Experimental Mutation Database

(This is the heart of your dataset — populate as completely as possible)

4.1 Key literature to mine for Cn PhaC1 mutations

  • Tsuge et al. (2003) Macromolecules — F420 region, systematic Class I
  • Amara et al. (2002) — systematic Class I mutagenesis panel
  • Rehm et al. — early mechanistic mutagenesis
  • Nomura et al. — broad-specificity engineering attempts
  • Insomphun et al. — 3HHx incorporation focus
  • Hiroe et al. — combinatorial mutagenesis
  • [Add others as you find them — search PubMed: “PhaC mutagenesis” OR “polyhydroxyalkanoate synthase substrate specificity”]

Mining tip: For each paper, extract: (1) every mutation tested, including ones with no effect — these are just as valuable, (2) exact assay conditions, (3) quantitative data where reported. Even a table footnote saying “A300G showed no change in specificity” belongs here.
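If you keep the mined data in a script or notebook alongside this document, one record per mutation per paper keeps the three items from the mining tip together. The sketch below is one possible layout, not a required schema: the class name MutationRecord, its field names, and the outcome vocabulary are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Single-substitution notation such as F420S (Cn PhaC1 numbering)
_MUTATION = re.compile(r"^([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])$")


def parse_mutation(text):
    """Split 'F420S' into ('F', 420, 'S'); raise ValueError if malformed."""
    match = _MUTATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a single-substitution mutation: {text!r}")
    wt, pos, mut = match.groups()
    return wt, int(pos), mut


@dataclass(frozen=True)
class MutationRecord:
    mutation: str                        # e.g. "A300G"
    outcome: str                         # "gain", "neutral", or "deleterious"
    conditions: str                      # assay conditions exactly as reported
    reference: str                       # citation key for the source paper
    hhx_mol_pct: Optional[float] = None  # None when no number was reported

    def __post_init__(self):
        parse_mutation(self.mutation)  # reject malformed notation early
        if self.outcome not in {"gain", "neutral", "deleterious"}:
            raise ValueError(f"unknown outcome: {self.outcome!r}")

    @property
    def position(self):
        return parse_mutation(self.mutation)[1]


# The footnote example from the mining tip, with placeholder metadata:
# MutationRecord("A300G", "neutral", "[conditions]", "[ref]")
```

Keeping hhx_mol_pct optional means a qualitative footnote can still be recorded as a neutral result, which is the point of the mining tip.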

4.2 Gain-of-function mutations (toward mcl / broader specificity)

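Mutation | Effect | 3HB mol% | 3HV mol% | 3HHx mol% | Activity vs WT | Assay conditions | Reference
F420S | Gains 3HHx | [X] | [X] | ~8% | [X]% | [conditions] | Tsuge 2003
A510S | Increased mcl | [X] | [X] | [X] | [X]% | [conditions] | Amara 2002
[mut] | [effect] | [X] | [X] | [X] | [X]% | [conditions] | [ref]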

4.3 Neutral mutations (no significant effect on specificity)

(Underrepresented in literature but critically important — record every instance you can find)

Mutation | Region | Why tested | Outcome | Reference
[mut] | [region] | [rationale in original paper] | No change in specificity | [ref]

4.4 Deleterious mutations (loss of activity or expression)

Mutation | Effect | Suspected reason | Reference
[mut] | [e.g. insoluble, inactive] | [e.g. disrupts fold] | [ref]

4.5 Combinatorial / double mutants

Mutations | Effect vs. singles | 3HHx mol% | Epistasis observed | Reference
F420S + A510S | [effect] | [X]% | Additive / synergistic / antagonistic | [ref]
[muts] | [effect] | [X]% | [epistasis] | [ref]
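To fill the "Epistasis observed" column consistently, it helps to fix one definition and apply it to every double mutant. The sketch below uses the simplest one, an additive model on raw 3HHx mol%, where the expected double-mutant value is the sum of the two single-mutant gains over WT. That model, the function name epistasis, and the 1 mol% tolerance are assumptions for illustration; a different scale (for example log-transformed values) may suit your data better.

```python
def epistasis(wt, single_a, single_b, double, tolerance=1.0):
    """Classify a double mutant against an additive expectation.

    All inputs are 3HHx mol% values measured under the same conditions.
    expected = single_a + single_b - wt   (sum of the two gains over WT)
    epsilon  = double - expected
    `tolerance` (mol%) is an arbitrary band treated as "additive"; set it
    from your own replicate-to-replicate variation.
    """
    expected = single_a + single_b - wt
    epsilon = double - expected
    if abs(epsilon) <= tolerance:
        label = "additive"
    elif epsilon > 0:
        label = "synergistic"
    else:
        label = "antagonistic"
    return epsilon, label


# Illustrative numbers only (not measured data):
# epistasis(wt=0.0, single_a=3.0, single_b=2.0, double=8.0)
# gives epsilon = 3.0 and the label "synergistic"
```

Because cross-study mol% values are not comparable (Section 4.8), apply this only to variants measured side by side in one experiment.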

4.6 Thermostability mutations

(Relevant when stacking specificity mutations)

Mutation | ΔTm | Effect on activity | Effect on specificity | Reference
[mut] | [+/- X°C] | [effect] | [effect] | [ref]

4.7 Positions that have NOT been mutated in literature

(Fill as you read — these are candidate positions for novel exploration)

Position | WT AA | Distance to C319 | Structural role | Why interesting
[pos] | [AA] | [X] Å | [role] | [rationale]

4.8 Data quality notes

  • Quantitative mol% values are sensitive to: carbon source ratios, growth phase, expression level, host strain — treat cross-study comparisons as qualitative only
  • In vitro CoA-release assays (DTNB) give intrinsic kinetic data but don’t fully reflect in vivo selectivity under substrate competition
  • Some older studies used racemic substrates — stereospecificity may confound apparent chain-length specificity
  • [Add specific notes about inconsistencies you notice across papers]

5. Your Starting Enzyme (Wild-Type Cn PhaC1)

5.1 Identity

  • Organism: Cupriavidus necator H16
  • UniProt accession: Q05HB5
  • PhaC class: I
  • Gene: phaC1 (in pha operon: phaCAB)
  • Full sequence length: 589 aa

5.2 Known properties of WT Cn PhaC1

  • Native substrate preference: scl — 3HB (major), 3HV (minor), negligible 3HHx
  • Specific activity: [X nmol/min/mg — fill from literature or your own data]
  • Thermostability: [Tm or optimal temperature]
  • Expression in E. coli: [your experience — yield, solubility]
  • Dimerization: active as dimer; monomer is inactive
  • Known issues: [e.g. requires careful lysis conditions, prone to aggregation at high concentration]

5.3 Full WT sequence

[PASTE FULL 589 aa SEQUENCE HERE — available from UniProt Q05HB5]

5.4 Substrate-binding pocket region

(Residues 300–340, centered on C319 — paste into prompts as needed)

Residues 300–340:
[PASTE SEQUENCE]
Position:  300                319                340
                               ^C319 (catalytic)
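Pasting the 300–340 window by hand invites off-by-one errors, since residue numbering is 1-based while string indexing in most tools is 0-based. The sketch below cuts the window from the full sequence and refuses to proceed if position 319 is not a cysteine. The function name pocket_window is hypothetical, and the sequence must be the one you pasted into Section 5.3.

```python
def pocket_window(seq, start=300, end=340, catalytic_pos=319, catalytic_aa="C"):
    """Return (window, marker) for residues start..end (1-based, inclusive).

    `marker` is a line that puts '^' under the catalytic residue when printed
    directly below `window` in a fixed-width font.
    Raises ValueError if the numbering does not match the sequence.
    """
    if not (1 <= start <= catalytic_pos <= end <= len(seq)):
        raise ValueError("window does not fit the sequence length")
    if seq[catalytic_pos - 1] != catalytic_aa:
        raise ValueError(
            f"position {catalytic_pos} is {seq[catalytic_pos - 1]!r}, "
            f"expected {catalytic_aa!r}; check numbering or sequence version"
        )
    window = seq[start - 1:end]  # convert 1-based inclusive to a Python slice
    marker = " " * (catalytic_pos - start) + "^" + f"{catalytic_aa}{catalytic_pos}"
    return window, marker


# Usage, with wt_seq holding the full sequence from Section 5.3:
# window, marker = pocket_window(wt_seq)
# print(window)
# print(marker)
```

The cysteine check doubles as a sanity test that the pasted sequence uses the Cn PhaC1 numbering this document assumes, for example that no tag residues were left at the N-terminus.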

5.5 Catalytic and key residue positions (for quick reference)

Residue | AA | Role
C319 | Cys | Catalytic nucleophile; DO NOT MUTATE
D480 | Asp | Catalytic triad; DO NOT MUTATE
H508 | His | Catalytic triad; DO NOT MUTATE
A149 | Ala | Tunnel entrance; primary specificity target
S325 | Ser | Tunnel lining; specificity target
[others from Section 2.4]

6. Engineering Target

6.1 Primary goal

[State precisely, e.g.:]

Incorporate 3HHx (C6) at >15 mol% in copolymer produced in E. coli BL21 on mixed carbon source (sodium butyrate + sodium hexanoate), while retaining 3HB incorporation >50 mol%

6.2 Secondary goals

  • [e.g. Maintain thermostability — Tm drop <5°C acceptable]
  • [e.g. Retain soluble expression yield >X mg/L]
  • [e.g. Avoid total loss of scl activity]

6.3 Acceptable tradeoffs

  • [e.g. Up to 30% reduction in overall polymerization rate]
  • [e.g. Reduced in vivo PHA titer acceptable if specificity goal met]

6.4 Hard constraints — DO NOT VIOLATE

  • Do NOT mutate catalytic triad: C319, D480, H508
  • Maximum simultaneous mutations in any single variant: [N]
  • Must retain soluble expression in E. coli
  • Avoid dimer interface residues: [list positions]
  • [Any others specific to your project]
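The constraints above that concern positions and counts can be checked mechanically before a suggested variant goes anywhere near the bench. The sketch below is one way to do it. The function name check_variant and its parameters are hypothetical, the interface positions and mutation limit are whatever you fill in above, and soluble expression cannot be checked this way at all.

```python
import re

CATALYTIC_TRIAD = frozenset({319, 480, 508})  # C319, D480, H508 (Section 6.4)
_MUTATION = re.compile(r"^[A-Z](\d+)[A-Z]$")


def check_variant(mutations, max_mutations, interface_positions=(), already_tested=()):
    """Return a list of constraint violations for one variant (empty = passes).

    mutations:           list of strings such as ["F420S", "A510S"]
    max_mutations:       the [N] limit from the hard constraints
    interface_positions: dimer-interface residue numbers to avoid
    already_tested:      iterable of mutation lists from the Section 6.5 table
    """
    violations = []
    positions = []
    for m in mutations:
        match = _MUTATION.match(m)
        if match is None:
            violations.append(f"unparseable mutation: {m}")
            continue
        positions.append(int(match.group(1)))

    triad_hits = sorted(CATALYTIC_TRIAD.intersection(positions))
    if triad_hits:
        violations.append(f"touches catalytic triad: {triad_hits}")

    interface_hits = sorted(set(interface_positions).intersection(positions))
    if interface_hits:
        violations.append(f"touches dimer interface: {interface_hits}")

    if len(mutations) > max_mutations:
        violations.append(f"too many mutations: {len(mutations)} > {max_mutations}")

    # Order-independent comparison against previously tested variants
    if frozenset(mutations) in {frozenset(t) for t in already_tested}:
        violations.append("already tested (Section 6.5)")

    return violations
```

Running every LLM suggestion through a check like this catches the two most common failures, a triad residue slipping into a combinatorial variant and a repeat of something already tested, without relying on the model to police itself.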

6.5 What has already been tested

(Update after every experiment round — prevents redundant suggestions)

Mutation(s) | 3HHx result | Other notable effects | Date | Notes
WT control | ~0 mol% | Baseline | [date] | [notes]
F420S | ~3 mol% | [effects] | [date] | Insufficient
[mut] | [result] | [notes] | [date] | [notes]

7. Production and Assay Context

7.1 Expression system

  • Host: [e.g. E. coli BL21(DE3)]
  • Vector: [e.g. pET-28a, N-terminal His6-tag]
  • Expression conditions: [e.g. 25°C, 16h post-induction, 0.5 mM IPTG]
  • Typical soluble yield: [X mg/L culture]
  • Purification: [e.g. Ni-NTA IMAC, single step]

7.2 In vivo PHA production conditions

  • Host: [e.g. E. coli BL21(DE3) ΔfadB or WT]
  • Co-expressed pathway genes: [e.g. phaA, phaB for 3HB-CoA; phaJ for 3HHx-CoA]
  • Carbon source(s): [e.g. 10 mM sodium butyrate + 5 mM sodium hexanoate]
  • Growth conditions: [e.g. M9 minimal, 30°C, 48h]
  • PHA content range (WT): [X–Y wt%]

7.3 In vitro activity assay (if used)

  • Assay type: [e.g. DTNB colorimetric assay — monitors CoA release at 412 nm]
  • Substrate(s): [e.g. 3HB-CoA, 3HHx-CoA at X µM]
  • Buffer conditions: [e.g. 50 mM Tris pH 7.5, 150 mM NaCl]

7.4 PHA analysis

  • Extraction method: [e.g. chloroform extraction, sodium hypochlorite digestion]
  • Monomer analysis: [e.g. GC-FID after acidic methanolysis; GC-MS for identification]
  • Quantification standard: [e.g. PHB standard curve]
  • Throughput: [e.g. 24 variants per experiment]
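Because every target in this document is stated in mol%, it is worth being explicit about how the GC output becomes that number. The sketch below assumes the methanolysis route given as an example above, so each monomer is quantified as its methyl ester, and that a standard curve has already converted peak areas into masses. The function name mol_percent is hypothetical; the molecular weights are those of the methyl 3-hydroxyalkanoates, and you should replace them if your derivatization differs.

```python
# Molecular weights (g/mol) of the methyl esters formed by methanolysis:
# methyl 3-hydroxybutyrate, methyl 3-hydroxyvalerate, methyl 3-hydroxyhexanoate
METHYL_ESTER_MW = {"3HB": 118.13, "3HV": 132.16, "3HHx": 146.19}


def mol_percent(masses_ug, mw=METHYL_ESTER_MW):
    """Convert methyl-ester masses (µg, from the standard curve) to mol%.

    masses_ug: dict such as {"3HB": 950.0, "3HHx": 120.0}
    Returns a dict with the same keys whose values sum to 100.
    """
    moles = {monomer: mass / mw[monomer] for monomer, mass in masses_ug.items()}
    total = sum(moles.values())
    if total <= 0:
        raise ValueError("no polymer detected; cannot compute composition")
    return {monomer: 100.0 * n / total for monomer, n in moles.items()}


# Illustrative call (made-up masses, not measured data):
# mol_percent({"3HB": 950.0, "3HHx": 120.0})
```

Dividing by molecular weight matters: equal masses of the 3HB and 3HHx esters do not correspond to equal molar amounts, so a wt% read directly off the chromatogram overstates the 3HHx fraction relative to mol%.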

8. Reasoning Guidelines for LLM

8.1 Dataset context — tell the LLM explicitly at session start

Always include this statement at the top of each session prompt:

“My dataset consists only of C. necator PhaC1 (wild-type) and published point, double, and combinatorial mutants of this single enzyme. I do not have a multi-species alignment. All positional reasoning should be grounded in (a) the experimental mutation database in Section 4, and (b) structural analysis of PDB 5T6O / AlphaFold model Q05HB5. Do not infer specificity determinants from phylogenetic patterns; that data is not available.”
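If you assemble session prompts with a script rather than by copy and paste, the ordering can be fixed once so the statement is never forgotten. The sketch below is one possible assembly, assuming the statement, this context document, the Section 9 log, and the question are each held as plain strings; the function name build_session_prompt and the section labels it inserts are hypothetical.

```python
def build_session_prompt(dataset_statement, context_doc, session_log, question):
    """Assemble one session prompt in a fixed order:
    dataset statement, full context document, session log, current question.
    An empty session log (first session) is simply skipped.
    """
    if not dataset_statement.strip():
        raise ValueError("dataset statement is required at the top of every prompt")
    parts = [dataset_statement.strip(), context_doc.strip()]
    if session_log.strip():
        parts.append("Session log so far:\n" + session_log.strip())
    parts.append("Current question:\n" + question.strip())
    return "\n\n".join(parts)


# Usage with hypothetical file names:
# statement = open("dataset_statement.txt").read()
# context = open("phac_context.txt").read()
# log = open("session_log.txt").read()
# prompt = build_session_prompt(statement, context, log, "Which tunnel residues ...?")
```

Putting the question last keeps it closest to where the model starts generating, while the statement at the top frames how everything after it should be read.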

8.2 Prioritization criteria (in order, adjusted for this dataset)

  1. Direct experimental evidence — mutations in Section 4 with measured outcomes
  2. Structural/mechanistic reasoning — based on 5T6O crystal structure and tunnel geometry (Section 2)
  3. Analogy to Class II — using the pairwise comparison in Section 3.5, noting explicitly when this is being used
  4. Chemical intuition — physicochemical rationale for a substitution, flagged as [SPECULATIVE] if no structural or experimental support

8.3 Required output format for mutation suggestions

For every suggested mutation, provide:

  • (a) Mutation in standard notation (e.g. A149F, Cn PhaC1 numbering)
  • (b) Primary evidence basis: Experimental / Structural / Class II analogy / Chemical intuition [SPECULATIVE]
  • (c) Mechanistic rationale — specific, not generic
  • (d) Consistency with existing data in Section 4 — does it contradict anything?
  • (e) Confidence: High (direct experimental support) / Medium (structural + analogy) / Low (chemical intuition only)
  • (f) Predicted risk: stability, expression, activity loss

8.4 Reasoning I do NOT want

  • Statements like “this position is conserved in mcl enzymes” — you do not have alignment data to support this; use only the pairwise comparison in 3.5
  • Overconfident quantitative predictions of mol% outcomes
  • Suggestions violating hard constraints in Section 6.4
  • Suggestions already in Section 6.5 “already tested” table
  • Filling data gaps with plausible-sounding inventions — flag uncertainty explicitly

8.5 Especially useful prompts for this dataset type

Given the single-enzyme focus, these prompt types will be most productive:

  • Gap analysis: “Which tunnel-lining residues (Section 2.4) have never been mutated in the literature (Section 4.7)? For each, give a structural rationale for whether they are likely to affect specificity.”

  • Mechanistic interpretation: “Mutation X gave unexpected result Y. Given the structural context of position X (distance to C319, neighboring residues, tunnel role), propose 2–3 mechanistic explanations.”

  • Epistasis prediction: “Given that A149F and S325A are individually beneficial, reason about whether their combination is likely to be additive, synergistic, or antagonistic, based on their structural relationship.”

  • Experimental prioritization: “I can test 12 variants. Given the mutation database and structural data, design a 12-variant panel that maximizes information gained about specificity determinants.”

8.6 My background

[e.g.:]

PhD in microbiology/biochemistry. Comfortable with protein biochemistry, enzyme kinetics, and microbial fermentation. Less experienced with structural biology — please explain structural reasoning clearly but do not oversimplify the biochemistry.


9. Session Log

(Prepend full context document + append this log to every session)

Session [DATE]

Prompt used: [Paste exact prompt]

Key outputs / hypotheses: [Summarize or paste key suggestions]

Your assessment: [Which suggestions seem credible? Which seem poorly supported?]

Flagged uncertainties from LLM: [Note any [SPECULATIVE] tags or explicit uncertainties the LLM raised]

Action items:

  • [e.g. Verify position 171 distance to C319 in PyMOL]
  • [e.g. Test A149F single mutant]
  • [e.g. Search for papers on position 392 mutagenesis]

Session [DATE]

(repeat block)


10. Experimental Results Log

Experiment [DATE / ID]

Variants tested:

Variant | 3HB mol% | 3HV mol% | 3HHx mol% | Total PHA wt% | Soluble expression? | Notes
WT | [X] | [X] | ~0 | [X] | Yes | Control
[mut] | [X] | [X] | [X] | [X] | [Y/N] | [notes]

Interpretation: [What do these results mean for your mechanistic model?]

Surprises / inconsistencies with predictions: [Critical to record — unexpected results are often the most informative]

Updated hypotheses: [How do results revise your model of specificity determinants?]

Add to mutation database: [Y/N — copy rows to Section 4 as appropriate]


End of context document, v2.0 (single-enzyme / mutagenesis dataset scope).
Keep this file updated and prepend it in full to every new LLM session.