Presentation:
Canary Circuit Slides
Section 1: Abstract
Huntington’s disease (HD) is a fatal inherited neurodegenerative disorder caused by a CAG trinucleotide repeat expansion in the HTT gene.
Although individuals are born with this mutation, neurons remain functionally stable for most of life. Rapid degeneration occurs only after the repeat crosses a somatic threshold of approximately 150 CAGs. This process is described by the ELongATE model.
This project proposes Canary Circuit: a two-part computational and conceptual framework that:
Uses physics-informed neural networks (PINNs) constrained by the Handsaker et al. continuous-time Markov chain (CTMC) master equation to recover expansion-rate functions from cross-sectional single-cell data.
Translates the inferred minimal regulatory framework into a conceptual genetic circuit capable of detecting early disease phases before irreversible neuronal loss.
Like a canary in a coal mine, the circuit is intended to signal danger before the critical threshold is crossed.
Section 2: Background and Motivation
Huntington’s disease is an autosomal dominant disorder. Reported global prevalence has risen from 2.71 to 4.88 per 100,000 individuals, and founder-effect populations reach far higher frequencies.
Although the mutation is inherited at birth, symptoms emerge only after decades, followed by rapid neurological decline. This prolonged pre-symptomatic phase historically lacked a mechanistic explanation.
Handsaker et al. (Cell, 2025) demonstrated that the delay arises from somatic repeat expansion. Using single-cell measurements of CAG repeat length and genome-wide gene expression in human striatal neurons, the study showed:
neurons remain stable during slow expansion
rapid transcriptional collapse occurs after ~150 CAGs
neuronal identity genes are lost
developmental programs become derepressed
This progression is formalized as the ELongATE model, which divides disease progression into five sequential phases (A–E).
The study established that somatic expansion itself, rather than inherited repeat length alone, is the proximal driver of neuronal degeneration.
Subsequent Validation
Independent studies further support this framework.
MSH3 and PMS1 drive expansion through mismatch-repair-associated stabilization of DNA hairpins.
Expansion can be pharmacologically suppressed in human neurons.
Transcriptional repression of the HTT locus reduces somatic instability.
Blood-based expansion correlates with early neurodegeneration in living patients.
The Gap
The ELongATE model remains descriptive rather than mechanistic.
It identifies:
phases
thresholds
outcomes
but does not resolve the minimal regulatory system connecting:
repeat expansion
DNA repair activity
transcriptional collapse
Without this mechanistic layer, predicting individual disease trajectories or designing targeted interventions remains difficult.
Neurons spend more than 95% of their lifespan in a functionally stable but somatically unstable state, which creates a potentially large window for therapeutic intervention.
The Canary Circuit specifically targets this silent transition period.
Section 3: Project Aims
Aim 1: Validate the PINN Against Established ELongATE Biology
The first aim validates whether the PINN can recover the biology observed experimentally by Handsaker et al.
Using:
per-cell CAG repeat length
gene expression data
donor-specific distributions
the network is trained by minimizing both:
CTMC master-equation residuals
divergence from observed repeat-length distributions
Success Criteria
Minimized KL divergence between predicted and observed distributions
Recovery of the two-phase expansion pattern
Recovery of the ~150 CAG transcriptional threshold without hard-coding it as a model parameter (subject to the circularity noted in Section 4.6)
Successful leave-one-out donor validation
This aim establishes the computational foundation for subsequent aims.
Aim 2: Infer the Minimal Regulatory Structure Governing Phase A → B Acceleration
This aim moves from validation into mechanistic inference.
The objective is to determine why expansion accelerates during the Phase A → B transition.
Approach
Modifier genes identified through GWAS are incorporated as PINN covariates:
MSH3
FAN1
MLH1
PMS1
PMS2
LIG1
Additional analyses include:
modifier-expression estimation
transcriptional regulation of HTT
cross-regional transcriptomic comparisons
Output
A minimal regulatory graph identifying the components most predictive of transition timing.
This is the primary novel contribution of the project.
Aim 3: Design a Conceptual Genetic Circuit for Early Phase Detection
The third aim translates the inferred regulatory graph into the Canary Circuit.
The circuit detects signatures of Phase B acceleration before the ~150 CAG toxicity threshold is crossed.
Circuit Structure
Input: proxy for MSH3 activity or HTT transcription rate
Output: measurable reporter signal during the Phase B acceleration window
The circuit is simulated using ordinary differential equations (ODEs).
Evaluation Criteria
Detectable output specifically during Phase B
Low false-positive rate during Phase A
Low false-negative rate at Phase C onset
The hypothesis is that the circuit can discriminate:
Phase A
Phase B
Phase C
using inferred regulatory dynamics.
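The evaluation criteria above can be scored with a small helper. This is a minimal sketch under stated assumptions: phase labels per time point are given, the reporter counts as "on" when it meets a detection threshold, a false positive is an "on" reading in Phase A, and a false negative is an "off" reading in Phase C. The function name and the threshold value are hypothetical.

```python
def phase_error_rates(phases, signals, threshold):
    """Return (false-positive rate in Phase A, false-negative rate in Phase C).

    phases  : sequence of labels "A", "B" or "C", one per sample
    signals : reporter values aligned with phases
    Assumption: the reporter is "on" when signal >= threshold.
    """
    in_a = [s for ph, s in zip(phases, signals) if ph == "A"]
    in_c = [s for ph, s in zip(phases, signals) if ph == "C"]
    # Fraction of Phase A samples that wrongly trigger the reporter.
    fp = sum(s >= threshold for s in in_a) / len(in_a) if in_a else float("nan")
    # Fraction of Phase C samples where the reporter has failed to trigger.
    fn = sum(s < threshold for s in in_c) / len(in_c) if in_c else float("nan")
    return fp, fn


# Toy labels and reporter values, for illustration only.
example_phases = ["A", "A", "A", "A", "B", "C", "C"]
example_signals = [0.1, 0.2, 0.6, 0.1, 0.7, 0.9, 0.3]
fp_rate, fn_rate = phase_error_rates(example_phases, example_signals, threshold=0.5)
```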
Section 4: Methodology
4.1 Data Sources and Roles
Dataset | Source | Role in Project
Single-cell CAG + expression data | Handsaker et al. (2025), NeMO | PINN observational constraint
GWAS modifier genotypes | GeM-HD Consortium | Covariate selection
Multi-region transcriptomics | Mätlik et al. (2024) | Cross-regional comparison
Data Access Note
Only open-access components of the datasets are used.
Controlled-access sequencing data requiring dbGaP approval is not included in this project.
4.2 PINN Architecture and Master Equation Constraint
The biological dynamics are modeled using the Handsaker CTMC master equation.
Preprocessing
Per-cell repeat lengths are binned into donor-level probability distributions.
The PINN is trained on distributions rather than individual cells.
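The preprocessing step can be sketched as follows. The bin width, the repeat-length range, and the (donor_id, cag_length) input layout are illustrative assumptions, not values taken from the dataset.

```python
def bin_repeat_lengths(lengths, bin_width=5, n_min=35, n_max=200):
    """Bin per-cell CAG lengths into a normalized histogram over [n_min, n_max)."""
    n_bins = (n_max - n_min + bin_width - 1) // bin_width
    counts = [0] * n_bins
    for n in lengths:
        # Clamp out-of-range cells into the edge bins so no cell is dropped.
        idx = min(max((n - n_min) // bin_width, 0), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no cells supplied")
    return [c / total for c in counts]


def donor_distributions(cells, **kwargs):
    """cells: iterable of (donor_id, cag_length) -> {donor_id: bin probabilities}."""
    by_donor = {}
    for donor_id, n in cells:
        by_donor.setdefault(donor_id, []).append(n)
    return {d: bin_repeat_lengths(ls, **kwargs) for d, ls in by_donor.items()}


# Toy input with hypothetical donor identifiers.
toy_cells = [("d1", 42), ("d1", 44), ("d1", 90), ("d2", 41)]
toy_dists = donor_distributions(toy_cells, bin_width=5, n_min=40, n_max=100)
```

Each donor's list sums to 1, so it can be compared directly with a predicted distribution on the same bins.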
PINN Design
A feedforward neural network learns the expansion-rate functions:
\( \alpha(n) \)
\( \beta(n) \)
where \( n \) is repeat length.
Loss Components
Physics loss: CTMC residual
Data loss: predicted vs observed distributions
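A minimal sketch of the physics loss, assuming a one-step master equation in which alpha(n) is the rate of gaining one repeat and beta(n) the rate of losing one. The source names both only as expansion-rate functions, so this reading is an assumption. The time derivative is taken by finite difference here; in a PINN it would come from automatic differentiation.

```python
def master_equation_rhs(p, alpha, beta, n0):
    """dP/dt for a one-step CTMC on states n0, n0+1, ..., n0+len(p)-1.

    Assumed form: alpha(n) is the rate n -> n+1, beta(n) the rate n -> n-1,
    with no probability flow out of either end of the state range.
    """
    k = len(p)
    rhs = [0.0] * k
    for i in range(k):
        n = n0 + i
        out_up = alpha(n) * p[i] if i < k - 1 else 0.0
        out_down = beta(n) * p[i] if i > 0 else 0.0
        rhs[i] -= out_up + out_down
        if i < k - 1:
            rhs[i + 1] += out_up   # probability arriving from below
        if i > 0:
            rhs[i - 1] += out_down  # probability arriving from above
    return rhs


def physics_loss(p_now, p_next, dt, alpha, beta, n0):
    """Mean squared residual of (p_next - p_now)/dt - RHS(p_now)."""
    rhs = master_equation_rhs(p_now, alpha, beta, n0)
    residuals = [(b - a) / dt - r for a, b, r in zip(p_now, p_next, rhs)]
    return sum(r * r for r in residuals) / len(residuals)
```

Because every outflow from one state is an inflow to a neighbor, the right-hand side sums to zero and total probability is conserved.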
Implementation
PyTorch
custom PDE residual loss
donor-wise and joint training
Validation
Leave-one-out cross-validation across six deeply sampled donors.
Primary metric:
KL divergence
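The validation loop and its metric can be sketched as below. The direction KL(observed || predicted) and the epsilon floor on predicted probabilities are assumptions; the source states only that KL divergence is the primary metric.

```python
import math


def kl_divergence(observed, predicted, eps=1e-12):
    """KL(observed || predicted) for two discrete distributions on the same bins."""
    if len(observed) != len(predicted):
        raise ValueError("distributions must share the same bins")
    total = 0.0
    for p, q in zip(observed, predicted):
        if p > 0.0:
            # Floor q so an empty predicted bin gives a large finite penalty.
            total += p * math.log(p / max(q, eps))
    return total


def leave_one_out(donor_ids):
    """Yield (training_donors, held_out_donor) once for every donor."""
    donor_ids = list(donor_ids)
    for i, held_out in enumerate(donor_ids):
        yield donor_ids[:i] + donor_ids[i + 1:], held_out
```

With six donors this yields six folds, each training on five donors and scoring the held-out donor's distribution.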
4.2.1 Continuous Fokker–Planck Formulation
The discrete CTMC is expressed in its continuous Fokker–Planck limit:
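\[ \frac{\partial P(n,t)}{\partial t} = -\frac{\partial}{\partial n}\left[ \mu(n,\alpha)\, P(n,t) \right] + \frac{\partial^2}{\partial n^2}\left[ D(n)\, P(n,t) \right] \]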
where:
\( P(n,t) \) is the probability density of repeat length \( n \)
\( \mu(n,\alpha) \) is the drift term
\( D(n) \) is the diffusion coefficient
Distribution Interpretation
The model tracks the full population distribution of neurons.
Initially, neurons cluster near inherited repeat lengths (~42 CAGs). Over time, the distribution:
drifts toward larger repeat lengths
broadens due to stochastic variability
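The drift-and-broaden behavior can be reproduced with a simple explicit finite-difference scheme. This is a sketch, not the project's solver: the constant drift and diffusion values are placeholders, the grid and time step are arbitrary, and the upwind flux assumes net expansion (non-negative drift).

```python
def fokker_planck_step(p, mu, diff, dn, dt):
    """One explicit conservative step of dP/dt = -d(mu P)/dn + d^2(D P)/dn^2.

    p, mu, diff are lists on the same grid; zero-flux boundaries.
    The upwind drift flux assumes mu >= 0.
    """
    k = len(p)
    # flux[i] is the probability flux across the interface between cells i-1 and i.
    flux = [0.0] * (k + 1)
    for i in range(1, k):
        drift_flux = mu[i - 1] * p[i - 1]
        diffusion_flux = -(diff[i] * p[i] - diff[i - 1] * p[i - 1]) / dn
        flux[i] = drift_flux + diffusion_flux
    return [p[i] - dt * (flux[i + 1] - flux[i]) / dn for i in range(k)]


def moments(p, grid, dn):
    """Return (total mass, mean, variance) of a gridded density."""
    mass = sum(p) * dn
    mean = sum(n * x for n, x in zip(grid, p)) * dn / mass
    var = sum((n - mean) ** 2 * x for n, x in zip(grid, p)) * dn / mass
    return mass, mean, var


grid = list(range(20, 141))
p = [1.0 if n == 42 else 0.0 for n in grid]  # every neuron starts at 42 CAGs
mu = [0.5] * len(grid)    # placeholder drift, arbitrary units
diff = [0.2] * len(grid)  # placeholder diffusion coefficient
for _ in range(200):      # 200 steps of dt = 0.1
    p = fokker_planck_step(p, mu, diff, dn=1.0, dt=0.1)
mass, mean, var = moments(p, grid, 1.0)
```

The mean moves right while the variance grows from zero, which is the qualitative pattern described above. The first-order upwind scheme adds numerical diffusion, so the variance it reports overstates the broadening due to D alone.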
Drift Term
The drift component:
\[ -\frac{\partial}{\partial n}\left[ \mu(n,\alpha)\, P(n,t) \right] \]
controls directional expansion.
Within the Canary Circuit framework:
expansion is slow below ~80 CAGs
accelerates during the A → B transition
becomes fastest above ~150 CAGs
High MSH3 activity increases drift magnitude, while FAN1-associated stabilization reduces it.
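One hypothetical way to encode these regimes is a piecewise drift scaled by modifier activity. Every number below is an illustrative placeholder rather than a fitted rate, and the multiplicative MSH3/FAN1 form is an assumption made for the sketch; in the project the shape is meant to be learned by the PINN.

```python
def drift(n, msh3=1.0, fan1=1.0, slow=0.05, mid=0.5, fast=2.0):
    """Hypothetical piecewise drift mu(n), in arbitrary units.

    slow, mid, fast are placeholder rates for the three regimes.
    msh3 and fan1 are relative activity levels (1.0 = reference).
    """
    if n < 80:
        base = slow   # slow expansion below ~80 CAGs
    elif n < 150:
        base = mid    # acceleration during the A -> B transition
    else:
        base = fast   # fastest above ~150 CAGs
    # Assumed form: MSH3 scales drift up, FAN1 scales it down.
    return base * msh3 / fan1
```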
Diffusion Term
The diffusion component:
\[ \frac{\partial^2}{\partial n^2}\left[ D(n)\, P(n,t) \right] \]
captures stochastic variability between neurons.
Even identical starting repeat lengths diverge over time due to:
mismatch repair variability
slipped-strand formation
repair timing differences
Without diffusion, the model predicts a narrow deterministic wave rather than the experimentally observed broad distribution.
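The contrast can be illustrated with a toy simulation in which each neuron gains one repeat at random. The rate, duration, and cell count are placeholders, and the per-step Bernoulli draw is an approximation that is only reasonable when rate * dt is small.

```python
import random


def simulate_cells(n_cells, n_start, rate, years, dt=0.1, seed=0):
    """Simulate per-cell repeat length as a +1 jump process with a constant rate."""
    rng = random.Random(seed)
    lengths = [n_start] * n_cells
    n_steps = round(years / dt)
    for _ in range(n_steps):
        for i in range(n_cells):
            # Each cell independently gains one repeat with probability rate * dt.
            if rng.random() < rate * dt:
                lengths[i] += 1
    return lengths


stochastic = simulate_cells(n_cells=500, n_start=42, rate=0.5, years=20.0)
deterministic = 42 + 0.5 * 20.0  # every cell identical when noise is ignored
sample_mean = sum(stochastic) / len(stochastic)
spread = max(stochastic) - min(stochastic)
```

All cells start at the same length, yet the stochastic run ends with a spread of lengths around the deterministic value.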
Why Use the Fokker–Planck Formulation
The original CTMC is discrete and difficult to differentiate directly within neural-network optimization.
The Fokker–Planck approximation is continuous and differentiable, enabling automatic differentiation during PINN training.
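One standard route from the discrete chain to the continuous equation is a second-order (Kramers–Moyal) expansion. The sketch below assumes a one-step master equation in which \( \alpha(n) \) is the rate of gaining one repeat and \( \beta(n) \) the rate of losing one; the source does not state this form explicitly, so the resulting identification of drift and diffusion is an assumption.

```latex
\begin{align}
\frac{dP_n}{dt} &= \alpha(n-1)\,P_{n-1} + \beta(n+1)\,P_{n+1} - \left[\alpha(n)+\beta(n)\right]P_n \\
\alpha(n-1)\,P_{n-1} &\approx \alpha P - \frac{\partial(\alpha P)}{\partial n} + \frac{1}{2}\frac{\partial^2(\alpha P)}{\partial n^2} \\
\beta(n+1)\,P_{n+1} &\approx \beta P + \frac{\partial(\beta P)}{\partial n} + \frac{1}{2}\frac{\partial^2(\beta P)}{\partial n^2} \\
\frac{\partial P}{\partial t} &\approx -\frac{\partial}{\partial n}\left[(\alpha-\beta)\,P\right] + \frac{\partial^2}{\partial n^2}\left[\tfrac{1}{2}(\alpha+\beta)\,P\right]
\end{align}
```

Under this assumption the drift is \( \mu = \alpha - \beta \) and the diffusion coefficient is \( D = \tfrac{1}{2}(\alpha + \beta) \). The approximation drops third- and higher-order derivatives, so it is most accurate where the distribution varies slowly over a single repeat.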
Role of the PINN
Rather than repeatedly solving simulations numerically, the PINN learns a continuous, differentiable approximation of the distribution \( P(n,t) \).
Once trained, the model can estimate distributions for arbitrary modifier combinations using a single forward pass.
4.3 GWAS Modifier Integration (Aim 2)
GWAS modifiers identify which DNA repair genes alter HD onset timing.
Selected modifiers include:
MSH3
FAN1
MLH1
PMS1
PMS2
LIG1
Expression values from the Handsaker dataset are incorporated as continuous covariates affecting expansion-rate parameters.
Published knockout measurements initialize directional effects before fitting to human single-cell data.
4.4 Mechanistic Inference Procedure (Aim 2)
The minimal regulatory graph is inferred by:
fitting the PINN with individual modifiers
measuring variance explained
performing ablation analysis
retaining only informative nodes
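The ablation step can be sketched as a loop over modifiers. The fit_loss callable is hypothetical: it stands for "fit the model with this modifier subset and return a validation loss". The toy weights in the usage lines are invented so the example runs and carry no biological meaning.

```python
def ablation_ranking(modifiers, fit_loss, min_increase=0.0):
    """Rank modifiers by how much the loss rises when each one is removed.

    fit_loss(subset) is a user-supplied callable (hypothetical here) that fits
    the model with the given modifier subset and returns a validation loss.
    Returns (retained modifiers, most informative first; all loss increases).
    """
    full = fit_loss(tuple(modifiers))
    deltas = {}
    for m in modifiers:
        reduced = tuple(x for x in modifiers if x != m)
        deltas[m] = fit_loss(reduced) - full
    kept = [m for m in modifiers if deltas[m] > min_increase]
    return sorted(kept, key=lambda m: deltas[m], reverse=True), deltas


# Toy stand-in for model fitting: invented weights, for illustration only.
TOY_WEIGHTS = {"MSH3": 0.5, "FAN1": 0.3, "MLH1": 0.0, "PMS1": 0.0, "PMS2": 0.0, "LIG1": 0.0}


def toy_fit_loss(subset):
    return 1.0 - sum(TOY_WEIGHTS[m] for m in subset)


retained, loss_increase = ablation_ranking(list(TOY_WEIGHTS), toy_fit_loss, min_increase=0.01)
```

Single-modifier ablation misses interactions between modifiers, so it is a first pass rather than a complete structure search.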
Cross-regional transcriptomic datasets are analyzed to determine whether transition dynamics are striatum-specific.
4.5 ODE Circuit Simulation (Aim 3)
The Canary Circuit is simulated as a two-node ODE system.
Sensor Node
Represents:
MSH3 activity
or HTT transcription rate
Output Node
Represents a threshold-activated reporter.
The output crosses a detection threshold during the Phase B acceleration window.
Simulation
kinetic parameters initialized from published MMR kinetics
parameter sweeps across biologically plausible ranges
sensitivity analysis with ±50% perturbations
Implementation
SciPy's odeint (scipy.integrate.odeint) in Python.
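A minimal sketch of the two-node system and the ±50% sensitivity sweep. To stay dependency-free it uses a hand-written fixed-step Runge-Kutta integrator in place of odeint. The Hill-type reporter, the step-shaped input, the detection threshold, and every parameter value are assumptions for illustration, not published MMR kinetics.

```python
def canary_rhs(t, state, u, p):
    """Sensor s tracks the input u(t); reporter r is Hill-activated by s."""
    s, r = state
    s = max(s, 0.0)  # guard against tiny negative values from the integrator
    ds = p["k_s"] * u(t) - p["d_s"] * s
    hill = s ** p["h"] / (p["K"] ** p["h"] + s ** p["h"])
    dr = p["k_r"] * hill - p["d_r"] * r
    return (ds, dr)


def rk4(rhs, state, t_end, dt, *args):
    """Classical fixed-step Runge-Kutta from t = 0; returns (times, states)."""
    t, y = 0.0, tuple(state)
    ts, ys = [t], [y]
    for _ in range(round(t_end / dt)):
        k1 = rhs(t, y, *args)
        k2 = rhs(t + dt / 2, tuple(a + dt / 2 * b for a, b in zip(y, k1)), *args)
        k3 = rhs(t + dt / 2, tuple(a + dt / 2 * b for a, b in zip(y, k2)), *args)
        k4 = rhs(t + dt, tuple(a + dt * b for a, b in zip(y, k3)), *args)
        y = tuple(a + dt / 6 * (b1 + 2 * b2 + 2 * b3 + b4)
                  for a, b1, b2, b3, b4 in zip(y, k1, k2, k3, k4))
        t += dt
        ts.append(t)
        ys.append(y)
    return ts, ys


def step_input(t):
    """Hypothetical input proxy: low in Phase A (t < 50), high in Phase B."""
    return 0.1 if t < 50.0 else 1.0


# Placeholder parameters in arbitrary units.
BASE = {"k_s": 1.0, "d_s": 1.0, "K": 0.4, "h": 4.0, "k_r": 1.0, "d_r": 1.0}


def detects(params, threshold=0.5):
    """True if the reporter stays below threshold in Phase A and reaches it in Phase B."""
    ts, ys = rk4(canary_rhs, (0.0, 0.0), 100.0, 0.05, step_input, params)
    r_a = max(r for t, (s, r) in zip(ts, ys) if t < 50.0)
    r_b = max(r for t, (s, r) in zip(ts, ys) if t >= 50.0)
    return r_a < threshold <= r_b


def sensitivity_sweep(base, factors=(0.5, 1.5)):
    """Perturb one parameter at a time by -50% and +50% and re-test detection."""
    results = {}
    for name in base:
        for f in factors:
            perturbed = dict(base, **{name: base[name] * f})
            results[(name, f)] = detects(perturbed)
    return results


baseline_ok = detects(BASE)
sweep = sensitivity_sweep(BASE)
```

With these placeholder values, halving the reporter production rate k_r leaves the Phase B plateau below the detection threshold, so the sweep flags k_r as a parameter the design is sensitive to.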
4.6 Circularity Acknowledgment
The ~150 CAG threshold used during PINN training originates from the same Handsaker dataset.
This creates unavoidable circularity within a single-cohort cross-sectional dataset.
The PINN therefore does not independently discover the threshold; it incorporates it as prior biological knowledge.
Section 5: Limitations and Bioethics
Limitations
Temporal Structure Comes From Physics
The PINN recovers dynamics by embedding the CTMC master equation directly into the optimization process.
Recovered parameters therefore depend strongly on the assumptions built into the governing equations.
Small and Genetically Narrow Cohort
The model is trained on six deeply sampled donors with inherited repeat lengths of 40–43 CAGs.
Generalization to:
juvenile-onset HD
diverse ancestries
broader repeat-length ranges
remains uncertain.
Fixed Threshold Assumption
The ~150 CAG threshold is treated as invariant.
If thresholds differ across:
brain regions
cell types
modifier backgrounds
then recovered parameters may become biased.
Bioethical Considerations
All datasets are:
pre-collected
anonymized
accessed under published agreements
No new patient data is collected.
The project is entirely computational.
Translational Considerations
If implemented experimentally in the future:
the Canary Circuit would function as a neuronal biosensor
biosafety review and IRB oversight would be required
Additionally, because HD disproportionately affects founder-effect populations such as the Lake Maracaibo community, equitable access and community engagement must remain central to any future translational development.
Section 6: References
Handsaker RE, Kashin S, Reed NM, et al. Long somatic DNA-repeat expansion drives neurodegeneration in Huntington’s disease. Cell. 2025;188(3):623–639.e19.