Canary Circuit

Section 1: Abstract

Huntington’s disease (HD) is a fatal inherited neurodegenerative disorder caused by a CAG trinucleotide repeat expansion in the HTT gene.

Although individuals carry this mutation from birth, neurons remain functionally stable for most of life. Rapid degeneration begins only after somatic expansion carries the repeat past a threshold of approximately 150 CAGs. This process is described by the ELongATE model.

This project proposes Canary Circuit: a two-part computational and conceptual framework that:

Uses physics-informed neural networks (PINNs) constrained by the Handsaker et al. continuous-time Markov chain (CTMC) master equation to recover expansion-rate functions from cross-sectional single-cell data.
Translates the inferred minimal regulatory framework into a conceptual genetic circuit capable of detecting early disease phases before irreversible neuronal loss.

Like a canary in a coal mine, the circuit is intended to signal danger before the critical threshold is crossed.

Section 2: Background and Motivation

Huntington’s disease is an autosomal dominant disorder. Reported estimates of global prevalence have risen from 2.71 to 4.88 per 100,000 individuals, and founder-effect populations reach substantially higher frequencies.

Although the mutation is inherited at birth, symptoms emerge only after decades, followed by rapid neurological decline. This prolonged pre-symptomatic phase historically lacked a mechanistic explanation.

Handsaker et al. (Cell, 2025) demonstrated that the delay arises from somatic repeat expansion. Using single-cell measurements of CAG repeat length and genome-wide gene expression in human striatal neurons, the study showed:

neurons remain stable during slow expansion
rapid transcriptional collapse occurs after ~150 CAGs
neuronal identity genes are lost
developmental programs become derepressed

This progression is formalized as the ELongATE model, which divides disease progression into five sequential phases (A–E).

The study established that somatic expansion itself, rather than inherited repeat length alone, is the proximal driver of neuronal degeneration.

Subsequent Validation

Independent studies further support this framework.

MSH3 and PMS1 drive expansion through mismatch-repair-associated stabilization of DNA hairpins.
Expansion can be pharmacologically suppressed in human neurons.
Transcriptional repression of the HTT locus reduces somatic instability.
Blood-based expansion correlates with early neurodegeneration in living patients.

The Gap

The ELongATE model remains descriptive rather than mechanistic.

It identifies:

phases
thresholds
outcomes

but does not resolve the minimal regulatory system connecting:

repeat expansion
DNA repair activity
transcriptional collapse

Without this mechanistic layer, predicting individual disease trajectories or designing targeted interventions remains difficult.

Neurons spend more than 95% of their lifespan in a pre-symptomatic but unstable state, which creates a potentially large window for therapeutic intervention.

The Canary Circuit specifically targets this silent transition period.

Section 3: Project Aims

Aim 1: Validate the PINN Against Established ELongATE Biology

The first aim validates whether the PINN can recover the biology observed experimentally by Handsaker et al.

Using:

per-cell CAG repeat length
gene expression data
donor-specific distributions

the network is trained by minimizing both:

CTMC master-equation residuals
divergence from observed repeat-length distributions

Success Criteria

Minimized KL divergence between predicted and observed distributions
Recovery of the two-phase expansion pattern
Recovery of the ~150 CAG transcriptional threshold without hard-coding
Successful leave-one-out donor validation

This aim establishes the computational foundation for subsequent aims.

Aim 2: Infer the Minimal Regulatory Structure Governing Phase A → B Acceleration

This aim moves from validation into mechanistic inference.

The objective is to determine why expansion accelerates during the Phase A → B transition.

Approach

Modifier genes identified through GWAS are incorporated as PINN covariates:

MSH3
FAN1
MLH1
PMS1
PMS2
LIG1

Additional analyses include:

modifier-expression estimation
transcriptional regulation of HTT
cross-regional transcriptomic comparisons

Output

A minimal regulatory graph identifying the components most predictive of transition timing.

This is the primary novel contribution of the project.

Aim 3: Design a Conceptual Genetic Circuit for Early Phase Detection

The third aim translates the inferred regulatory graph into the Canary Circuit.

The circuit detects signatures of Phase B acceleration before the ~150 CAG toxicity threshold is crossed.

Circuit Structure

Input: proxy for MSH3 activity or HTT transcription rate
Output: measurable reporter signal during the Phase B acceleration window

The circuit is simulated using ordinary differential equations (ODEs).

Evaluation Criteria

Detectable output specifically during Phase B
Low false-positive rate during Phase A
Low false-negative rate at Phase C onset
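One way these criteria could be scored in simulation is to record the CAG length at which the reporter first fires in each simulated trajectory. The sketch below is illustrative only: the phase boundaries (~80 and ~150 CAGs) are the approximate values quoted elsewhere in this proposal, and the names `classify_detection` and `detection_rates`, along with the example values, are hypothetical.

```python
# Approximate phase boundaries quoted in this proposal (assumed, not fitted).
PHASE_B_START = 80
PHASE_C_START = 150

def classify_detection(first_detection_cag):
    """Label one simulated trajectory by where the reporter first fired.

    first_detection_cag is the repeat length at the first threshold
    crossing, or None if the reporter never fired.
    """
    if first_detection_cag is None or first_detection_cag >= PHASE_C_START:
        return "false_negative"   # nothing detected before Phase C onset
    if first_detection_cag < PHASE_B_START:
        return "false_positive"   # fired during Phase A
    return "on_target"            # fired inside the Phase B window

def detection_rates(first_detections):
    """Fraction of trajectories falling in each category."""
    labels = [classify_detection(x) for x in first_detections]
    return {k: labels.count(k) / len(labels)
            for k in ("on_target", "false_positive", "false_negative")}

# Hypothetical first-detection lengths from eight simulated trajectories.
rates = detection_rates([95, 120, 60, None, 155, 100, 88, 140])
```

Scoring by first-detection length keeps the three criteria on a single axis, so one parameter sweep can report all three rates at once.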

The hypothesis is that the circuit can discriminate:

Phase A
Phase B
Phase C

using inferred regulatory dynamics.

Section 4: Methodology

4.1 Data Sources and Roles

Dataset	Source	Role in Project
Single-cell CAG + expression data	Handsaker et al. (2025), NeMO	PINN observational constraint
GWAS modifier genotypes	GeM-HD Consortium	Covariate selection
Multi-region transcriptomics	Mätlik et al. (2024)	Cross-regional comparison

Data Access Note

Only open-access components of the datasets are used.

Controlled-access sequencing data requiring dbGaP approval is not included in this project.

4.2 PINN Architecture and Master Equation Constraint

The biological dynamics are modeled using the Handsaker CTMC master equation.

Preprocessing

Per-cell repeat lengths are binned into donor-level probability distributions.

The PINN is trained on distributions rather than individual cells.
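The binning step might look like the following minimal sketch. It assumes one bin per repeat length and clips values outside a chosen range into the edge bins; the range (35 to 200), the function name `donor_distribution`, and the toy cell values are all illustrative choices, not settings taken from the dataset.

```python
from collections import Counter

def donor_distribution(repeat_lengths, n_min=35, n_max=200):
    """Convert per-cell CAG lengths into a normalized histogram.

    Returns a list with one probability per repeat length from n_min
    to n_max; lengths outside the range are counted in the edge bins.
    """
    clipped = [min(max(int(n), n_min), n_max) for n in repeat_lengths]
    counts = Counter(clipped)
    total = len(clipped)
    return [counts.get(n, 0) / total for n in range(n_min, n_max + 1)]

# Toy per-cell lengths for one hypothetical donor.
cells = [42, 42, 43, 45, 60, 42, 150, 300]
p = donor_distribution(cells)
```

Clipping into the last bin keeps the distribution normalized, at the cost of hiding the shape of the extreme tail; a wider grid would be needed if that tail matters for the fit.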

PINN Design

A feedforward neural network learns the expansion-rate functions:

\( \alpha(n) \)
\( \beta(n) \)

where \( n \) is repeat length.

Loss Components

Physics loss: CTMC residual
Data loss: predicted vs observed distributions
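The two loss terms can be sketched in NumPy as follows. This is a simplified stand-in, not the project's implementation: it assumes the master equation has a one-step birth-death form, treats α(n) as an expansion rate and β(n) as a contraction rate (an assumption about β), uses KL divergence for the data term, and combines the terms with a hypothetical `weight`. In a PINN the time derivative `dP_dt` would come from automatic differentiation of the network output.

```python
import numpy as np

def ctmc_rhs(P, alpha, beta):
    """dP/dt for an assumed one-step birth-death master equation.

    Index i stands for one repeat-length bin; alpha[i] is the rate of
    i -> i+1 and beta[i] the rate of i -> i-1.
    """
    gain_below = np.zeros_like(P)
    gain_below[1:] = alpha[:-1] * P[:-1]      # arrivals from n-1
    gain_above = np.zeros_like(P)
    gain_above[:-1] = beta[1:] * P[1:]        # arrivals from n+1
    return gain_below + gain_above - (alpha + beta) * P

def physics_loss(P, dP_dt, alpha, beta):
    """Mean squared master-equation residual."""
    return float(np.mean((dP_dt - ctmc_rhs(P, alpha, beta)) ** 2))

def data_loss(P_pred, P_obs, eps=1e-12):
    """KL(P_obs || P_pred) on a shared repeat-length grid."""
    p = (P_obs + eps) / np.sum(P_obs + eps)
    q = (P_pred + eps) / np.sum(P_pred + eps)
    return float(np.sum(p * np.log(p / q)))

def total_loss(P_pred, dP_dt, P_obs, alpha, beta, weight=1.0):
    """Physics term plus weighted data term (weight is a free choice)."""
    return physics_loss(P_pred, dP_dt, alpha, beta) + weight * data_loss(P_pred, P_obs)

# Toy check on a 40..200 CAG grid with constant, hypothetical rates.
n = np.arange(40, 201)
alpha = np.full(n.size, 0.01)
alpha[-1] = 0.0                  # no expansion out of the last bin
beta = np.full(n.size, 0.002)
beta[0] = 0.0                    # no contraction out of the first bin
P = np.zeros(n.size)
P[2] = 1.0                       # every cell starts at 42 CAGs
rhs = ctmc_rhs(P, alpha, beta)
```

Zeroing the outward rates at the grid edges makes the right-hand side sum to zero, so total probability is conserved on the truncated grid.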

Implementation

PyTorch
custom PDE residual loss
donor-wise and joint training

Validation

Leave-one-out cross-validation across six deeply sampled donors.

Primary metric:

KL divergence
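A leave-one-donor-out loop scored by KL divergence could be organized as in the sketch below. The predictor here is a simple pooled-average baseline standing in for the trained PINN; the direction of the divergence (observed relative to predicted), the function names, and the three toy donors are assumptions made for illustration, whereas the project itself uses six donors.

```python
import math

def kl_divergence(p_obs, p_pred, eps=1e-12):
    """KL(p_obs || p_pred) for two distributions on the same bins."""
    p = [x + eps for x in p_obs]
    q = [x + eps for x in p_pred]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def leave_one_donor_out(distributions, fit_and_predict):
    """Hold out each donor once and score the prediction for that donor.

    distributions maps donor id -> list of bin probabilities;
    fit_and_predict takes the training dict and returns a prediction.
    """
    scores = {}
    for held_out, observed in distributions.items():
        training = {d: p for d, p in distributions.items() if d != held_out}
        scores[held_out] = kl_divergence(observed, fit_and_predict(training))
    return scores

def pooled_average(training):
    """Baseline predictor: bin-wise mean of the training distributions."""
    rows = list(training.values())
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy three-bin distributions for three hypothetical donors.
toy = {
    "donor_1": [0.7, 0.2, 0.1],
    "donor_2": [0.6, 0.3, 0.1],
    "donor_3": [0.5, 0.3, 0.2],
}
scores = leave_one_donor_out(toy, pooled_average)
```

A baseline of this kind gives the held-out KL scores a reference point: a physics-informed model that cannot beat the pooled average on unseen donors is not adding predictive value.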

4.2.1 Continuous Fokker–Planck Formulation

The discrete CTMC is expressed in its continuous Fokker–Planck limit:

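\[ \frac{\partial P(n,t)}{\partial t} = -\frac{\partial}{\partial n}\big[\mu(n,\alpha)\,P(n,t)\big] + \frac{\partial^2}{\partial n^2}\big[D(n)\,P(n,t)\big] \]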

where:

\( P(n,t) \) is the probability density of repeat length \( n \)
\( \mu(n,\alpha) \) is the drift term
\( D(n) \) is the diffusion coefficient

Distribution Interpretation

The model tracks the full population distribution of neurons.

Initially, neurons cluster near inherited repeat lengths (~42 CAGs). Over time, the distribution:

drifts toward larger repeat lengths
broadens due to stochastic variability

Drift Term

The drift component:

\[ -\frac{\partial}{\partial n}\big[\mu(n,\alpha)\,P(n,t)\big] \]

controls directional expansion.

Within the Canary Circuit framework:

expansion is slow below ~80 CAGs
accelerates during the A → B transition
becomes fastest above ~150 CAGs

High MSH3 activity increases drift magnitude, while FAN1-associated stabilization reduces it.
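A drift function with this qualitative shape could be written as below. Every number is a placeholder: the rate levels, the transition width, and the multiplicative way MSH3 and FAN1 enter are assumptions chosen only to reproduce the ordering described above, with `msh3` and `fan1` read as activity relative to a reference value of 1.0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def drift(n, msh3=1.0, fan1=1.0, slow=0.1, mid=1.0, fast=5.0, width=5.0):
    """Hypothetical drift (CAGs per unit time) at repeat length n.

    The rate rises smoothly from `slow` to `mid` around ~80 CAGs and
    from `mid` to `fast` around ~150 CAGs; MSH3 scales it up and FAN1
    scales it down.
    """
    base = (slow
            + (mid - slow) * sigmoid((n - 80.0) / width)
            + (fast - mid) * sigmoid((n - 150.0) / width))
    return base * msh3 / fan1

phase_a, phase_b, phase_c = drift(50), drift(110), drift(180)
```

Smooth transitions are used instead of hard steps so that the function stays differentiable in n, which matters if it is embedded in a gradient-based fit.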

Diffusion Term

The diffusion component:

\[ \frac{\partial^2}{\partial n^2}\big[D(n)\,P(n,t)\big] \]

captures stochastic variability between neurons.

Even identical starting repeat lengths diverge over time due to:

mismatch repair variability
slipped-strand formation
repair timing differences

Without diffusion, the model predicts a narrow deterministic wave rather than the experimentally observed broad distribution.
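The effect of the diffusion term can be illustrated with a small stochastic simulation of individual cells. The sketch assumes the Fokker–Planck equation is written as ∂P/∂t = −∂(μP)/∂n + ∂²(DP)/∂n², whose single-cell counterpart is the update n ← n + μ(n)·dt + sqrt(2·D·dt)·ξ with ξ standard normal; the drift function, the value of D, and the time units are hypothetical.

```python
import numpy as np

def simulate_cells(drift, D, n0=42.0, n_cells=2000, t_end=50.0, dt=0.1, seed=0):
    """Euler-Maruyama simulation of repeat length in independent cells."""
    rng = np.random.default_rng(seed)
    n = np.full(n_cells, n0)              # all cells start at the same length
    for _ in range(int(round(t_end / dt))):
        noise = rng.standard_normal(n_cells)
        n = n + drift(n) * dt + np.sqrt(2.0 * D * dt) * noise
    return n

def toy_drift(n):
    """Hypothetical drift that grows slowly with repeat length."""
    return 0.1 * n / 42.0

no_diffusion = simulate_cells(toy_drift, D=0.0)
with_diffusion = simulate_cells(toy_drift, D=0.5)

spread_without = float(np.std(no_diffusion))   # cells stay identical
spread_with = float(np.std(with_diffusion))    # cells diverge
```

With D set to zero every cell follows the same trajectory, so the population remains a single sharp peak; any positive D produces a spread that grows with time.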

Why Use the Fokker–Planck Formulation

The original CTMC is defined on discrete repeat lengths, which makes it awkward to use directly as a differentiable residual in neural-network optimization.

The Fokker–Planck approximation is continuous and differentiable, enabling automatic differentiation during PINN training.
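As a sketch of how the continuous limit could arise, suppose the CTMC is a one-step process in which a repeat of length \( n \) gains one unit at rate \( a(n) \) and loses one unit at rate \( b(n) \). The symbols \( a \) and \( b \) are introduced here for illustration only, and the source model may differ. A second-order Taylor expansion in \( n \) then gives:

```latex
\begin{aligned}
\frac{\partial P(n,t)}{\partial t}
  &= a(n-1)\,P(n-1,t) + b(n+1)\,P(n+1,t) - \big[a(n)+b(n)\big]\,P(n,t) \\
  &\approx -\frac{\partial}{\partial n}\Big[\big(a(n)-b(n)\big)\,P(n,t)\Big]
     + \frac{1}{2}\,\frac{\partial^2}{\partial n^2}\Big[\big(a(n)+b(n)\big)\,P(n,t)\Big]
\end{aligned}
```

Under this assumption the drift corresponds to the net rate \( a(n) - b(n) \) and the diffusion coefficient to \( \tfrac{1}{2}\big(a(n) + b(n)\big) \). The approximation is most reasonable when the rates and the distribution vary slowly over a single repeat unit.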

Role of the PINN

Rather than repeatedly solving simulations numerically, the PINN learns a continuous approximation:

\[ P_\theta(n, t) \approx P(n, t) \]

where \( \theta \) denotes the trainable network weights.

Once trained, the model can estimate distributions for arbitrary modifier combinations using a single forward pass.
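During training, a candidate solution would be judged by how closely it satisfies the Fokker–Planck equation. The sketch below evaluates that residual for the constant-coefficient case ∂P/∂t = −μ ∂P/∂n + D ∂²P/∂n², where a closed-form Gaussian solution is known. Central finite differences are used here as a dependency-free stand-in for automatic differentiation, and the values of μ, D, and the starting length are hypothetical.

```python
import math

def gaussian_solution(n, t, mu=0.5, D=0.8, n0=42.0):
    """Exact solution for constant mu and D, starting as a point mass at n0."""
    var = 2.0 * D * t
    return math.exp(-(n - n0 - mu * t) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fp_residual(P, n, t, mu, D, h=1e-3):
    """Residual dP/dt + mu*dP/dn - D*d2P/dn2 at one point (zero for a true solution)."""
    dP_dt = (P(n, t + h) - P(n, t - h)) / (2.0 * h)
    dP_dn = (P(n + h, t) - P(n - h, t)) / (2.0 * h)
    d2P_dn2 = (P(n + h, t) - 2.0 * P(n, t) + P(n - h, t)) / h ** 2
    return dP_dt + mu * dP_dn - D * d2P_dn2

# The exact solution leaves an essentially zero residual ...
good = fp_residual(gaussian_solution, n=49.0, t=10.0, mu=0.5, D=0.8)

# ... whereas a candidate built with the wrong drift does not.
def wrong_candidate(n, t):
    return gaussian_solution(n, t, mu=0.3)

bad = fp_residual(wrong_candidate, n=49.0, t=10.0, mu=0.5, D=0.8)
```

A check against a known solution like this is a useful unit test for any residual implementation before it is attached to a network.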

4.3 GWAS Modifier Integration (Aim 2)

GWAS-identified modifiers indicate which DNA repair genes alter HD onset timing.

Selected modifiers include:

MSH3
FAN1
MLH1
PMS1
PMS2
LIG1

Expression values from the Handsaker dataset are incorporated as continuous covariates affecting expansion-rate parameters.

Published knockout measurements initialize directional effects before fitting to human single-cell data.
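One simple way to let expression covariates act on an expansion rate is a log-linear scaling, sketched below. The functional form is an assumption, as are the weight magnitudes and the flat base rate. Only the signs for MSH3 (increasing expansion) and FAN1 (reducing it) follow the directions stated in this proposal; the remaining weights start at zero because no direction is given for them here. Expression values are assumed to be standardized so that 0.0 is the reference level.

```python
import math

MODIFIERS = ("MSH3", "FAN1", "MLH1", "PMS1", "PMS2", "LIG1")

# Hypothetical starting weights; only the two signs are taken from the text.
initial_weights = {
    "MSH3": 0.5, "FAN1": -0.5,
    "MLH1": 0.0, "PMS1": 0.0, "PMS2": 0.0, "LIG1": 0.0,
}

def expansion_rate(n, expression, weights, base_rate):
    """Scale a base rate by exp(sum of weight * standardized expression)."""
    log_scale = sum(weights[g] * expression.get(g, 0.0) for g in MODIFIERS)
    return base_rate(n) * math.exp(log_scale)

def flat_base_rate(n):
    """Placeholder base rate, independent of repeat length."""
    return 0.01

reference = expansion_rate(100, {}, initial_weights, flat_base_rate)
high_msh3 = expansion_rate(100, {"MSH3": 1.0}, initial_weights, flat_base_rate)
high_fan1 = expansion_rate(100, {"FAN1": 1.0}, initial_weights, flat_base_rate)
```

The exponential keeps the rate positive for any combination of covariates, and the weights can then be refined during fitting from their initial directional values.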

4.4 Mechanistic Inference Procedure (Aim 2)

The minimal regulatory graph is inferred by:

fitting the PINN with individual modifiers
measuring variance explained
performing ablation analysis
retaining only informative nodes
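The ablation logic can be illustrated with a linear model standing in for the PINN. In the sketch below, variance explained is the ordinary R² of a least-squares fit, each modifier is dropped in turn, and a modifier is retained only if removing it lowers R² by more than a chosen margin. The data are synthetic, the effects assigned to the first two columns are arbitrary and are not a result, and the `min_drop` margin is a hypothetical setting.

```python
import numpy as np

def r_squared(X, y):
    """Variance explained by a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid.var() / y.var()

def ablation(X, y, names, min_drop=0.05):
    """Return the full-model R^2 and the modifiers whose removal hurts the fit."""
    full = r_squared(X, y)
    kept = []
    for j, name in enumerate(names):
        reduced = np.delete(X, j, axis=1)       # refit without modifier j
        if full - r_squared(reduced, y) > min_drop:
            kept.append(name)
    return full, kept

names = ["MSH3", "FAN1", "MLH1", "PMS1", "PMS2", "LIG1"]

# Synthetic example: only the first two columns influence the outcome.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, len(names)))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + 0.1 * rng.standard_normal(200)

full_r2, retained = ablation(X, y, names)
```

Drop-one ablation of this kind can understate the importance of correlated modifiers, so a grouped ablation may be worth adding if expression levels turn out to be strongly correlated.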

Cross-regional transcriptomic datasets are analyzed to determine whether transition dynamics are striatum-specific.

4.5 ODE Circuit Simulation (Aim 3)

The Canary Circuit is simulated as a two-node ODE system.

Sensor Node

Represents:

MSH3 activity
or HTT transcription rate

Output Node

Represents a threshold-activated reporter.

The output crosses a detection threshold during the Phase B acceleration window.

Simulation

kinetic parameters initialized from published MMR kinetics
parameter sweeps across biologically plausible ranges
sensitivity analysis with ±50% perturbations

Implementation

SciPy odeint in Python.
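A two-node system of this kind could be prototyped with `odeint` as in the sketch below. The equations are an assumed form, not a fitted model: the sensor tracks a sigmoidal input standing in for rising MSH3 activity, and the reporter is driven by a Hill function of the sensor. All parameter values, the input shape, the detection threshold, and the time units are hypothetical.

```python
import numpy as np
from scipy.integrate import odeint

# Hypothetical baseline parameters (arbitrary units).
BASELINE = {"k_s": 1.0, "d_s": 1.0, "k_r": 1.0, "d_r": 1.0, "K": 0.5, "hill": 4.0}
ORDER = ("k_s", "d_s", "k_r", "d_r", "K", "hill")

def input_signal(t, t_mid=40.0, width=5.0):
    """Assumed proxy input that rises smoothly around t_mid."""
    return 1.0 / (1.0 + np.exp(-(t - t_mid) / width))

def canary_rhs(y, t, k_s, d_s, k_r, d_r, K, hill):
    """Sensor S follows the input; reporter R is Hill-activated by S."""
    S, R = y
    S_pos = max(S, 0.0)                      # guard against tiny negative values
    dS = k_s * input_signal(t) - d_s * S
    dR = k_r * S_pos ** hill / (K ** hill + S_pos ** hill) - d_r * R
    return [dS, dR]

def first_crossing(params, threshold=0.5, t_end=80.0):
    """Time at which the reporter first reaches the threshold, or None."""
    t = np.linspace(0.0, t_end, 801)
    args = tuple(params[k] for k in ORDER)
    sol = odeint(canary_rhs, [0.0, 0.0], t, args=args)
    above = np.nonzero(sol[:, 1] >= threshold)[0]
    return float(t[above[0]]) if above.size else None

def sensitivity_sweep(params, factors=(0.5, 1.5)):
    """Perturb each parameter by -50% and +50% and record the crossing time."""
    results = {}
    for name in params:
        for f in factors:
            perturbed = dict(params, **{name: params[name] * f})
            results[(name, f)] = first_crossing(perturbed)
    return results

baseline_time = first_crossing(BASELINE)
sweep = sensitivity_sweep(BASELINE)
```

In this toy setting, halving the reporter production rate `k_r` caps the reporter below the threshold, so the sweep returns `None` for that entry; perturbations that silence the circuit are exactly the cases a sensitivity analysis should surface.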

4.6 Circularity Acknowledgment

The ~150 CAG threshold against which the PINN is validated originates from the same Handsaker dataset used for training.

This creates unavoidable circularity within a single-cohort cross-sectional dataset.

Even if the threshold is not hard-coded, its recovery is therefore not an independent discovery; it reflects prior biological knowledge already present in the data.

Section 5: Limitations and Bioethics

Limitations

Temporal Structure Comes From Physics

The PINN recovers dynamics by embedding the CTMC master equation directly into the optimization process.

Recovered parameters therefore depend strongly on the assumptions built into the governing equations.

Small and Genetically Narrow Cohort

The model is trained on six deeply sampled donors with inherited repeat lengths between 40 and 43 CAGs.

Generalization to:

juvenile-onset HD
diverse ancestries
broader repeat-length ranges

remains uncertain.

Fixed Threshold Assumption

The ~150 CAG threshold is treated as invariant.

If thresholds differ across:

brain regions
cell types
modifier backgrounds

then recovered parameters may become biased.

Bioethical Considerations

All datasets are:

pre-collected
anonymized
accessed under published agreements

No new patient data is collected.

The project is entirely computational.

Translational Considerations

If implemented experimentally in the future:

the Canary Circuit would function as a neuronal biosensor
biosafety review and IRB oversight would be required

Additionally, because HD disproportionately affects founder-effect populations such as the Lake Maracaibo community, equitable access and community engagement must remain central to any future translational development.

Section 6: References

Handsaker RE, Kashin S, Reed NM, et al. Cell. 2025;188(3):623–639.e19.
GeM-HD Consortium. Nature Genetics. 2025;57(6):1426–1436.
Wang N, et al. Cell. 2025;188:1524–1544.e22.
Mathews EW, Coffey SR, Gärtner A, et al. Nature Communications. 2025;16:10009.
Richard G-F. Cells. 2021;10:1019.
Monckton DG, Jones L, Pearson CE, Wheeler V. Journal of Huntington’s Disease. 2021;10(1):7–33.
Bunting EL, Donaldson J, Cumming SA, et al. Science Translational Medicine. 2025.
Scahill RI, et al. Nature Medicine. 2025.
Mätlik K, Baffuto M, Kus L, et al. Nature Genetics. 2024;56:383–394.