Canary Circuit
Presentation:
Canary Circuit Slides
Section 1: Abstract
Huntington’s disease (HD) is a fatal inherited neurodegenerative disorder caused by a CAG trinucleotide repeat expansion in the HTT gene.
Although individuals are born with this mutation, neurons remain functionally stable for most of life. Rapid degeneration occurs only after the repeat crosses a somatic threshold of approximately 150 CAGs. This process is described by the ELongATE model.
This project proposes Canary Circuit: a two-part computational and conceptual framework that:
- Uses physics-informed neural networks (PINNs) constrained by the Handsaker et al. continuous-time Markov chain (CTMC) master equation to recover expansion-rate functions from cross-sectional single-cell data.
- Translates the inferred minimal regulatory framework into a conceptual genetic circuit capable of detecting early disease phases before irreversible neuronal loss.
Like a canary in a coal mine, the circuit is intended to signal danger before the critical threshold is crossed.
Section 2: Background and Motivation
Huntington’s disease is an autosomal dominant disorder. Global prevalence has increased from 2.71 to 4.88 per 100,000 individuals, with founder-effect populations reaching dramatically higher frequencies.
Although the mutation is inherited at birth, symptoms emerge only after decades, followed by rapid neurological decline. This prolonged pre-symptomatic phase historically lacked a mechanistic explanation.
Handsaker et al. (Cell, 2025) demonstrated that the delay arises from somatic repeat expansion. Using single-cell measurements of CAG repeat length and genome-wide gene expression in human striatal neurons, the study showed:
- neurons remain stable during slow expansion
- rapid transcriptional collapse occurs after ~150 CAGs
- neuronal identity genes are lost
- developmental programs become derepressed
This progression is formalized as the ELongATE model, which divides disease progression into five sequential phases (A–E).
The study established that somatic expansion itself, rather than inherited repeat length alone, is the proximal driver of neuronal degeneration.
Subsequent Validation
Independent studies further support this framework.
- MSH3 and PMS1 drive expansion through mismatch-repair-associated stabilization of DNA hairpins.
- Expansion can be pharmacologically suppressed in human neurons.
- Transcriptional repression of the HTT locus reduces somatic instability.
- Blood-based expansion correlates with early neurodegeneration in living patients.
The Gap
The ELongATE model remains descriptive rather than mechanistic.
It identifies:
- phases
- thresholds
- outcomes
but does not resolve the minimal regulatory system connecting:
- repeat expansion
- DNA repair activity
- transcriptional collapse
Without this mechanistic layer, predicting individual disease trajectories or designing targeted interventions remains difficult.
Neurons spend more than 95% of their lifespan in a pre-symptomatic but unstable state, which creates a potentially large window for therapeutic intervention.
The Canary Circuit specifically targets this silent transition period.
Section 3: Project Aims
Aim 1: Validate the PINN Against Established ELongATE Biology
The first aim validates whether the PINN can recover the biology observed experimentally by Handsaker et al.
Using:
- per-cell CAG repeat length
- gene expression data
- donor-specific distributions
the network is trained by minimizing both:
- CTMC master-equation residuals
- divergence from observed repeat-length distributions
Success Criteria
- Minimized KL divergence between predicted and observed distributions
- Recovery of the two-phase expansion pattern
- Recovery of the ~150 CAG transcriptional threshold without hard-coding
- Successful leave-one-out donor validation
This aim establishes the computational foundation for subsequent aims.
Aim 2: Infer the Minimal Regulatory Structure Governing Phase A → B Acceleration
This aim moves from validation into mechanistic inference.
The objective is to determine why expansion accelerates during the Phase A → B transition.
Approach
Modifier genes identified through GWAS are incorporated as PINN covariates:
- MSH3
- FAN1
- MLH1
- PMS1
- PMS2
- LIG1
Additional analyses include:
- modifier-expression estimation
- transcriptional regulation of HTT
- cross-regional transcriptomic comparisons
Output
A minimal regulatory graph identifying the components most predictive of transition timing.
This is the primary novel contribution of the project.
Aim 3: Design a Conceptual Genetic Circuit for Early Phase Detection
The third aim translates the inferred regulatory graph into the Canary Circuit.
The circuit detects signatures of Phase B acceleration before the ~150 CAG toxicity threshold is crossed.
Circuit Structure
- Input: proxy for MSH3 activity or HTT transcription rate
- Output: measurable reporter signal during the Phase B acceleration window
The circuit is simulated using ordinary differential equations (ODEs).
Evaluation Criteria
- Detectable output specifically during Phase B
- Low false-positive rate during Phase A
- Low false-negative rate at Phase C onset
The hypothesis is that the circuit can discriminate:
- Phase A
- Phase B
- Phase C
using inferred regulatory dynamics.
Section 4: Methodology
4.1 Data Sources and Roles
| Dataset | Source | Role in Project |
|---|---|---|
| Single-cell CAG + expression data | Handsaker et al. (2025), NeMO | PINN observational constraint |
| GWAS modifier genotypes | GeM-HD Consortium | Covariate selection |
| Multi-region transcriptomics | Mätlik et al. (2024) | Cross-regional comparison |
Data Access Note
Only open-access components of the datasets are used.
Controlled-access sequencing data requiring dbGaP approval is not included in this project.
4.2 PINN Architecture and Master Equation Constraint
The biological dynamics are modeled using the Handsaker CTMC master equation.
Preprocessing
Per-cell repeat lengths are binned into donor-level probability distributions.
The PINN is trained on distributions rather than individual cells.
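The binning step can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the function name `donor_distribution`, the grid limits `n_min` and `n_max`, the choice to clip out-of-range cells into the edge bins, and the toy donor data are all assumptions.

```python
import numpy as np

def donor_distribution(repeat_lengths, n_min=35, n_max=200):
    """Bin per-cell CAG lengths into a probability vector over integer lengths."""
    lengths = np.asarray(repeat_lengths, dtype=int)
    # Assumption: cells outside the modeled range are clipped into the edge bins
    # so that no cell is silently dropped.
    lengths = np.clip(lengths, n_min, n_max)
    counts = np.bincount(lengths - n_min, minlength=n_max - n_min + 1)
    return counts / counts.sum()

# Toy per-cell measurements for two hypothetical donors.
cells_by_donor = {
    "donor_1": [42, 43, 45, 60, 151],
    "donor_2": [40, 41, 41, 44],
}
distributions = {d: donor_distribution(c) for d, c in cells_by_donor.items()}
```

Each donor then contributes one probability vector on a shared repeat-length grid, which is the object the loss terms compare against.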
PINN Design
A feedforward neural network learns the expansion-rate functions:
- \( \alpha(n) \)
- \( \beta(n) \)
where \( n \) is repeat length.
Loss Components
- Physics loss: CTMC residual
- Data loss: predicted vs observed distributions
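The two loss components can be sketched in NumPy as below. Several things here are assumptions for illustration: the CTMC is taken to be a one-step birth-death chain with gain rate `alpha` and loss rate `beta`, the time derivative uses finite differences (in the PyTorch implementation it would come from automatic differentiation), and the observed distribution is compared with the prediction at the final time point only.

```python
import numpy as np

def master_equation_rhs(p, alpha, beta):
    """dP/dt for an assumed one-step birth-death chain on a repeat-length grid."""
    rhs = -(alpha + beta) * p        # probability leaving each length
    rhs[1:] += alpha[:-1] * p[:-1]   # gain from length n-1 expanding
    rhs[:-1] += beta[1:] * p[1:]     # gain from length n+1 contracting
    return rhs

def physics_loss(p_traj, times, alpha, beta):
    """Mean squared master-equation residual along a predicted trajectory."""
    dp_dt = np.gradient(p_traj, times, axis=0)  # finite-difference stand-in for autograd
    rhs = np.array([master_equation_rhs(p, alpha, beta) for p in p_traj])
    return float(np.mean((dp_dt - rhs) ** 2))

def data_loss(p_pred, p_obs, eps=1e-12):
    """KL(observed || predicted), with a small floor on the prediction."""
    p_pred = np.clip(p_pred, eps, None)
    mask = p_obs > 0
    return float(np.sum(p_obs[mask] * np.log(p_obs[mask] / p_pred[mask])))

def total_loss(p_traj, times, alpha, beta, p_obs, weight=1.0):
    """Physics residual plus weighted data mismatch at the final time."""
    return physics_loss(p_traj, times, alpha, beta) + weight * data_loss(p_traj[-1], p_obs)
```

The `weight` parameter is hypothetical; how the two terms are balanced is a tuning decision the text does not specify.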
Implementation
- PyTorch
- custom PDE residual loss
- donor-wise and joint training
Validation
Leave-one-out cross-validation across six deeply sampled donors.
Primary metric:
- KL divergence
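A leave-one-donor-out loop scored by KL divergence might look like the sketch below. The `fit` callback stands in for retraining the PINN on the remaining donors; the `pooled_mean` placeholder, the smoothing constant `eps`, and the toy distributions are assumptions used only to make the example runnable. `scipy.stats.entropy(pk, qk)` returns the KL divergence of `pk` from `qk`.

```python
import numpy as np
from scipy.stats import entropy

def leave_one_donor_out(distributions, fit, eps=1e-9):
    """Return KL(observed || predicted) for each held-out donor."""
    scores = {}
    for held_out, observed in distributions.items():
        training = {d: p for d, p in distributions.items() if d != held_out}
        predicted = fit(training)
        # eps keeps the divergence finite where the prediction is exactly zero.
        scores[held_out] = float(entropy(observed, predicted + eps))
    return scores

def pooled_mean(training):
    """Placeholder model: average the training donors' distributions."""
    return np.mean(list(training.values()), axis=0)

toy = {
    "donor_a": np.array([0.5, 0.5, 0.0]),
    "donor_b": np.array([0.5, 0.5, 0.0]),
    "donor_c": np.array([0.2, 0.3, 0.5]),
}
scores = leave_one_donor_out(toy, pooled_mean)
```

In this toy case `donor_c` scores worst, because the other two donors place no probability mass in its largest bin.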
4.2.1 Continuous Fokker–Planck Formulation
The discrete CTMC is expressed in its continuous Fokker–Planck limit:
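\[ \frac{\partial P(n,t)}{\partial t} = -\frac{\partial}{\partial n}\left[ \mu(n,\alpha)\, P(n,t) \right] + \frac{\partial^2}{\partial n^2}\left[ D(n)\, P(n,t) \right] \]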
where:
- \( P(n,t) \) is the probability density of repeat length \( n \)
- \( \mu(n,\alpha) \) is the drift term
- \( D(n) \) is the diffusion coefficient
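As a worked sketch of where drift and diffusion could come from, assume the CTMC is a one-step birth-death process in which a tract of length \( n \) gains one CAG at rate \( \alpha(n) \) and loses one at rate \( \beta(n) \). This one-step form, and reading \( \beta \) as a contraction rate, are assumptions made for illustration; the Handsaker et al. parameterization may differ. Expanding the master equation to second order in the step size (a Kramers–Moyal expansion) gives:

```latex
\begin{aligned}
\frac{dP_n}{dt} &= \alpha(n-1)\,P_{n-1} + \beta(n+1)\,P_{n+1}
                   - \bigl[\alpha(n) + \beta(n)\bigr]\,P_n \\[4pt]
\frac{\partial P}{\partial t} &\approx
   -\frac{\partial}{\partial n}\Bigl[\bigl(\alpha(n) - \beta(n)\bigr)\,P\Bigr]
   + \frac{\partial^2}{\partial n^2}\Bigl[\tfrac{1}{2}\bigl(\alpha(n) + \beta(n)\bigr)\,P\Bigr]
\end{aligned}
```

Under these assumptions the drift is the net expansion rate, \( \alpha(n) - \beta(n) \), and the diffusion coefficient is \( \tfrac{1}{2}\bigl[\alpha(n) + \beta(n)\bigr] \), so every stochastic step contributes to spread regardless of its direction.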
Distribution Interpretation
The model tracks the full population distribution of neurons.
Initially, neurons cluster near inherited repeat lengths (~42 CAGs). Over time, the distribution:
- drifts toward larger repeat lengths
- broadens due to stochastic variability
Drift Term
The drift component:
\[ -\frac{\partial}{\partial n}\left[ \mu(n,\alpha)\, P(n,t) \right] \]
controls directional expansion.
Within the Canary Circuit framework:
- expansion is slow below ~80 CAGs
- accelerates during the A → B transition
- becomes fastest above ~150 CAGs
High MSH3 activity increases drift magnitude, while FAN1-associated stabilization reduces it.
Diffusion Term
The diffusion component:
\[ \frac{\partial^2}{\partial n^2}\left[ D(n)\, P(n,t) \right] \]
captures stochastic variability between neurons.
Even identical starting repeat lengths diverge over time due to:
- mismatch repair variability
- slipped-strand formation
- repair timing differences
Without diffusion, the model predicts a narrow deterministic wave rather than the experimentally observed broad distribution.
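The divergence of initially identical neurons can be illustrated with a small stochastic simulation. This is a toy sketch: the quadratic `expansion_rate`, its constants, the time step, and the omission of contractions are all assumptions, not fitted values.

```python
import numpy as np

def expansion_rate(n):
    """Hypothetical length-dependent expansion rate (events per year)."""
    return 0.3 * (n / 42.0) ** 2

def simulate_neurons(n_cells=2000, start=42, years=30.0, dt=0.05, seed=0):
    """Simulate independent neurons that all start at the same repeat length."""
    rng = np.random.default_rng(seed)
    lengths = np.full(n_cells, start, dtype=int)
    for _ in range(int(round(years / dt))):
        # In a short step dt, each neuron gains one repeat with probability rate * dt.
        expands = rng.random(n_cells) < expansion_rate(lengths) * dt
        lengths += expands
    return lengths

final = simulate_neurons()
print("mean length:", final.mean(), "spread (std):", final.std())
```

Every simulated neuron follows the same rate function, yet the final lengths spread out because the timing of each expansion event is random. A deterministic model with the same drift would leave all cells at a single length.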
Why Use the Fokker–Planck Formulation
The original CTMC is discrete and difficult to differentiate directly within neural-network optimization.
The Fokker–Planck approximation is continuous and differentiable, enabling automatic differentiation during PINN training.
Role of the PINN
Rather than repeatedly solving simulations numerically, the PINN learns a continuous approximation:
\[ \hat{P}_\theta(n, t; \alpha) \approx P(n, t; \alpha) \]
Once trained, the model can estimate distributions for arbitrary modifier combinations using a single forward pass.
4.3 GWAS Modifier Integration (Aim 2)
GWAS modifiers identify which DNA repair genes alter HD onset timing.
Selected modifiers include:
- MSH3
- FAN1
- MLH1
- PMS1
- PMS2
- LIG1
Expression values from the Handsaker dataset are incorporated as continuous covariates affecting expansion-rate parameters.
Published knockout measurements initialize directional effects before fitting to human single-cell data.
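One way covariates could enter the rate parameters is a log-linear scaling of a baseline rate, sketched below. The functional form, the function name `modulated_rate`, and the weight magnitudes are assumptions. Only the signs follow the text: MSH3 and PMS1 are described as driving expansion and FAN1 as reducing it, while the remaining modifiers start at zero because no direction is stated for them here.

```python
import numpy as np

MODIFIERS = ["MSH3", "FAN1", "MLH1", "PMS1", "PMS2", "LIG1"]

# Hypothetical starting weights: sign encodes the assumed direction of effect.
initial_weights = np.array([0.5, -0.5, 0.0, 0.5, 0.0, 0.0])

def modulated_rate(base_rate, expression_z, weights):
    """Scale a baseline expansion rate by exp(w . x) for standardized expression x."""
    return base_rate * np.exp(expression_z @ weights)

base_rate = np.array([0.1, 0.2, 0.4])        # toy baseline rate at three lengths
neutral = np.zeros(len(MODIFIERS))           # all modifiers at mean expression
high_msh3 = np.array([1.0, 0, 0, 0, 0, 0])   # MSH3 one standard deviation above mean
high_fan1 = np.array([0, 1.0, 0, 0, 0, 0])   # FAN1 one standard deviation above mean

rate_neutral = modulated_rate(base_rate, neutral, initial_weights)
rate_msh3 = modulated_rate(base_rate, high_msh3, initial_weights)
rate_fan1 = modulated_rate(base_rate, high_fan1, initial_weights)
```

The exponential keeps the rate positive for any covariate value, and the weights would be updated during fitting rather than held at these starting values.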
4.4 Mechanistic Inference Procedure (Aim 2)
The minimal regulatory graph is inferred by:
- fitting the PINN with individual modifiers
- measuring variance explained
- performing ablation analysis
- retaining only informative nodes
Cross-regional transcriptomic datasets are analyzed to determine whether transition dynamics are striatum-specific.
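The ablation and pruning steps can be sketched as follows. A least-squares fit stands in for refitting the PINN, and the covariate names, synthetic data, and retention cutoff `min_increase` are all hypothetical.

```python
import numpy as np

def fit_loss(X, y):
    """Mean squared error of a least-squares fit (stand-in for a PINN refit)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ coef - y) ** 2))

def ablation_scores(X, y, names):
    """Increase in loss when each covariate is removed from the full model."""
    full = fit_loss(X, y)
    return {
        name: fit_loss(np.delete(X, j, axis=1), y) - full
        for j, name in enumerate(names)
    }

def minimal_nodes(scores, min_increase):
    """Keep only covariates whose removal degrades the fit by more than the cutoff."""
    return [name for name, s in scores.items() if s > min_increase]

# Synthetic example: the outcome depends on gene_a and gene_b but not gene_c.
rng = np.random.default_rng(0)
names = ["gene_a", "gene_b", "gene_c"]
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

scores = ablation_scores(X, y, names)
retained = minimal_nodes(scores, min_increase=0.1)
```

Here `retained` contains only the two covariates that actually drive the synthetic outcome, which mirrors how uninformative nodes would be dropped from the regulatory graph.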
4.5 ODE Circuit Simulation (Aim 3)
The Canary Circuit is simulated as a two-node ODE system.
Sensor Node
Represents:
- MSH3 activity
- or HTT transcription rate
Output Node
Represents a threshold-activated reporter.
The output crosses a detection threshold during the Phase B acceleration window.
Simulation
- kinetic parameters initialized from published MMR kinetics
- parameter sweeps across biologically plausible ranges
- sensitivity analysis with ±50% perturbations
Implementation
SciPy odeint in Python.
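A two-node simulation with a ±50% sensitivity sweep could be sketched as below. The model form (linear sensor, Hill-activated reporter), all parameter values, the input ramp, and the detection threshold are placeholders in arbitrary units, not published MMR kinetics.

```python
import numpy as np
from scipy.integrate import odeint

BASELINE_INPUT = 0.2  # hypothetical Phase A sensor input
T_ACCEL = 60.0        # hypothetical onset of Phase B acceleration

def sensor_input(t):
    """Flat input during Phase A, then a linear rise once acceleration begins."""
    return BASELINE_INPUT + 0.05 * max(0.0, t - T_ACCEL)

def hill(s, K, h):
    """Hill activation; s is floored at zero to guard against numerical undershoot."""
    s = max(s, 0.0)
    return s**h / (K**h + s**h)

def circuit(state, t, k_s, d_s, k_r, d_r, K, h):
    """Sensor follows the input; reporter is switched on by the sensor."""
    s, r = state
    return [k_s * sensor_input(t) - d_s * s, k_r * hill(s, K, h) - d_r * r]

def detection_time(p, threshold=0.5, t_end=100.0):
    """First time the reporter reaches the threshold, or None if it never does."""
    t = np.linspace(0.0, t_end, 2001)
    # Start at the Phase A steady state so there is no start-up transient.
    s0 = p["k_s"] * BASELINE_INPUT / p["d_s"]
    r0 = p["k_r"] * hill(s0, p["K"], p["h"]) / p["d_r"]
    args = (p["k_s"], p["d_s"], p["k_r"], p["d_r"], p["K"], p["h"])
    traj = odeint(circuit, [s0, r0], t, args=args)
    crossed = np.nonzero(traj[:, 1] >= threshold)[0]
    return float(t[crossed[0]]) if crossed.size else None

base = {"k_s": 1.0, "d_s": 1.0, "k_r": 1.0, "d_r": 1.0, "K": 1.0, "h": 4.0}
baseline_time = detection_time(base)

# Sensitivity analysis: perturb one parameter at a time by -50% and +50%.
sweep = {
    (name, factor): detection_time({**base, name: base[name] * factor})
    for name in base
    for factor in (0.5, 1.5)
}
```

A detection time earlier than `T_ACCEL` would count as a Phase A false positive, and `None` would count as a missed detection. In this toy parameterization, halving `k_r` caps the reporter below the threshold, so that perturbation never triggers detection.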
4.6 Circularity Acknowledgment
The ~150 CAG threshold used during PINN training originates from the same Handsaker dataset.
This creates unavoidable circularity within a single-cohort cross-sectional dataset.
The PINN therefore does not independently discover the threshold; it incorporates it as prior biological knowledge.
Section 5: Limitations and Bioethics
Limitations
Temporal Structure Comes From Physics
The PINN recovers dynamics by embedding the CTMC master equation directly into the optimization process.
Recovered parameters therefore depend strongly on the assumptions built into the governing equations.
Small and Genetically Narrow Cohort
The model is trained on six deeply sampled donors with inherited repeat lengths of 40–43 CAGs.
Generalization to:
- juvenile-onset HD
- diverse ancestries
- broader repeat-length ranges
remains uncertain.
Fixed Threshold Assumption
The ~150 CAG threshold is treated as invariant.
If thresholds differ across:
- brain regions
- cell types
- modifier backgrounds
then recovered parameters may become biased.
Bioethical Considerations
All datasets are:
- pre-collected
- anonymized
- accessed under published agreements
No new patient data is collected.
The project is entirely computational.
Translational Considerations
If implemented experimentally in the future:
- the Canary Circuit would function as a neuronal biosensor
- biosafety review and IRB oversight would be required
Additionally, because HD disproportionately affects founder-effect populations such as the Lake Maracaibo community, equitable access and community engagement must remain central to any future translational development.
Section 6: References
- Handsaker RE, Kashin S, Reed NM, et al. Cell. 2025;188(3):623–639.e19.
- GeM-HD Consortium. Nature Genetics. 2025;57(6):1426–1436.
- Wang N, et al. Cell. 2025;188:1524–1544.e22.
- Mathews EW, Coffey SR, Gärtner A, et al. Nature Communications. 2025;16:10009.
- Richard G-F. Cells. 2021;10:1019.
- Monckton DG, Jones L, Pearson CE, Wheeler V. Journal of Huntington’s Disease. 2021;10(1):7–33.
- Bunting EL, Donaldson J, Cumming SA, et al. Science Translational Medicine. 2025.
- Scahill RI, et al. Nature Medicine. 2025.
- Mätlik K, Baffuto M, Kus L, et al. Nature Genetics. 2024;56:383–394.