Projects


Canary Circuit

Presentation:
Canary Circuit Slides


Section 1: Abstract

Huntington’s disease (HD) is a fatal inherited neurodegenerative disorder caused by a CAG trinucleotide repeat expansion in the HTT gene.

Although individuals are born with this mutation, neurons remain functionally stable for most of life. Rapid degeneration occurs only after the repeat crosses a somatic threshold of approximately 150 CAGs. This process is described by the ELongATE model.

This project proposes Canary Circuit: a two-part computational and conceptual framework that:

  1. Uses physics-informed neural networks (PINNs) constrained by the Handsaker et al. continuous-time Markov chain (CTMC) master equation to recover expansion-rate functions from cross-sectional single-cell data.
  2. Translates the inferred minimal regulatory framework into a conceptual genetic circuit capable of detecting early disease phases before irreversible neuronal loss.

Like a canary in a coal mine, the circuit is intended to signal danger before the critical threshold is crossed.


Section 2: Background and Motivation

Huntington’s disease is an autosomal dominant disorder. Global prevalence has increased from 2.71 to 4.88 per 100,000 individuals, with founder-effect populations reaching dramatically higher frequencies.

Although the mutation is inherited at birth, symptoms emerge only after decades, followed by rapid neurological decline. This prolonged pre-symptomatic phase historically lacked a mechanistic explanation.

Handsaker et al. (Cell, 2025) demonstrated that the delay arises from somatic repeat expansion. Using single-cell measurements of CAG repeat length and genome-wide gene expression in human striatal neurons, the study showed:

  • neurons remain stable during slow expansion
  • rapid transcriptional collapse occurs after ~150 CAGs
  • neuronal identity genes are lost
  • developmental programs become derepressed

This progression is formalized as the ELongATE model, which divides disease progression into five sequential phases (A–E).

The study established that somatic expansion itself — rather than inherited repeat length alone — is the proximal driver of neuronal degeneration.


Subsequent Validation

Independent studies further support this framework.

  • MSH3 and PMS1 drive expansion through mismatch-repair-associated stabilization of DNA hairpins.
  • Expansion can be pharmacologically suppressed in human neurons.
  • Transcriptional repression of the HTT locus reduces somatic instability.
  • Blood-based expansion correlates with early neurodegeneration in living patients.

The Gap

The ELongATE model remains descriptive rather than mechanistic.

It identifies:

  • phases
  • thresholds
  • outcomes

but does not resolve the minimal regulatory system connecting:

  • repeat expansion
  • DNA repair activity
  • transcriptional collapse

Without this mechanistic layer, predicting individual disease trajectories or designing targeted interventions remains difficult.

Neurons spend more than 95% of their lifespan in a pre-symptomatic but unstable state — creating a potentially large therapeutic intervention window.

The Canary Circuit specifically targets this silent transition period.


Section 3: Project Aims

Aim 1: Validate the PINN Against Established ELongATE Biology

The first aim validates whether the PINN can recover the biology observed experimentally by Handsaker et al.

Using:

  • per-cell CAG repeat length
  • gene expression data
  • donor-specific distributions

the network is trained by minimizing both:

  • CTMC master-equation residuals
  • divergence from observed repeat-length distributions

Success Criteria

  • Minimized KL divergence between predicted and observed distributions
  • Recovery of the two-phase expansion pattern
  • Recovery of the ~150 CAG transcriptional threshold without hard-coding
  • Successful leave-one-out donor validation

This aim establishes the computational foundation for subsequent aims.


Aim 2: Infer the Minimal Regulatory Structure Governing Phase A → B Acceleration

This aim moves from validation into mechanistic inference.

The objective is to determine why expansion accelerates during the Phase A → B transition.

Approach

Modifier genes identified through GWAS are incorporated as PINN covariates:

  • MSH3
  • FAN1
  • MLH1
  • PMS1
  • PMS2
  • LIG1

Additional analyses include:

  • modifier-expression estimation
  • transcriptional regulation of HTT
  • cross-regional transcriptomic comparisons

Output

A minimal regulatory graph identifying the components most predictive of transition timing.

This is the primary novel contribution of the project.


Aim 3: Design a Conceptual Genetic Circuit for Early Phase Detection

The third aim translates the inferred regulatory graph into the Canary Circuit.

The circuit detects signatures of Phase B acceleration before the ~150 CAG toxicity threshold is crossed.

Circuit Structure

  • Input: proxy for MSH3 activity or HTT transcription rate
  • Output: measurable reporter signal during the Phase B acceleration window

The circuit is simulated using ordinary differential equations (ODEs).

Evaluation Criteria

  • Detectable output specifically during Phase B
  • Low false-positive rate during Phase A
  • Low false-negative rate at Phase C onset
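The two error criteria above can be made concrete with a small scoring helper. This is a minimal sketch, assuming each simulated cell carries a phase label and a reporter value; the function name `error_rates`, the detection threshold, and the sample values are hypothetical and chosen only for illustration.

```python
def error_rates(samples, threshold):
    """Score a reporter against phase labels.

    samples: list of (phase_label, reporter_value) pairs.
    A false positive is a Phase A cell whose reporter is at or above threshold.
    A false negative is a Phase C cell whose reporter is still below threshold.
    """
    phase_a = [value for phase, value in samples if phase == "A"]
    phase_c = [value for phase, value in samples if phase == "C"]
    false_positive_rate = sum(v >= threshold for v in phase_a) / len(phase_a)
    false_negative_rate = sum(v < threshold for v in phase_c) / len(phase_c)
    return false_positive_rate, false_negative_rate

# Made-up reporter readings, not simulation or experimental results.
toy_samples = [
    ("A", 0.01), ("A", 0.02), ("A", 0.60), ("A", 0.03),
    ("B", 0.90), ("B", 1.10),
    ("C", 1.50), ("C", 0.20),
]
fp_rate, fn_rate = error_rates(toy_samples, threshold=0.5)
```

In this toy set one of four Phase A cells fires early and one of two Phase C cells never fires, so the helper reports 0.25 and 0.5.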

The hypothesis is that the circuit can discriminate:

  • Phase A
  • Phase B
  • Phase C

using inferred regulatory dynamics.


Section 4: Methodology

4.1 Data Sources and Roles

Dataset                           | Source                        | Role in Project
Single-cell CAG + expression data | Handsaker et al. (2025), NeMO | PINN observational constraint
GWAS modifier genotypes           | GeM-HD Consortium             | Covariate selection
Multi-region transcriptomics      | Mätlik et al. (2024)          | Cross-regional comparison

Data Access Note

Only open-access components of the datasets are used.

Controlled-access sequencing data requiring dbGaP approval is not included in this project.


4.2 PINN Architecture and Master Equation Constraint

The biological dynamics are modeled using the Handsaker CTMC master equation.

Preprocessing

Per-cell repeat lengths are binned into donor-level probability distributions.

The PINN is trained on distributions rather than individual cells.
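The binning step might look like the following sketch. The bin range, the bin width, the donor names, and the per-cell values are all assumptions made for illustration; the source does not specify them.

```python
def bin_repeat_lengths(lengths, n_min=35, n_max=200, bin_width=5):
    """Turn per-cell CAG lengths into a normalized histogram over fixed bins."""
    if not lengths:
        raise ValueError("no cells to bin")
    n_bins = (n_max - n_min) // bin_width
    counts = [0] * n_bins
    for n in lengths:
        # Cells outside [n_min, n_max) are clamped into the first or last bin.
        idx = min(max((n - n_min) // bin_width, 0), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical per-cell lengths for two donors (illustrative values only).
cells_by_donor = {
    "donor_A": [42, 43, 45, 51, 60, 42, 44, 155],
    "donor_B": [41, 41, 47, 49, 72, 43],
}
distributions = {
    donor: bin_repeat_lengths(lengths)
    for donor, lengths in cells_by_donor.items()
}
```

Each donor is reduced to one probability vector over the same bins, which is what lets distributions from different donors be compared directly.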

PINN Design

A feedforward neural network learns the expansion-rate functions:

  • \( \alpha(n) \)
  • \( \beta(n) \)

where \( n \) is repeat length.

Loss Components

  • Physics loss: CTMC residual
  • Data loss: predicted vs observed distributions
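To illustrate what a CTMC residual could look like, the sketch below assumes a generic one-step chain in which a repeat of length n expands at rate alpha(n) and contracts at rate beta(n). That structure, the reflecting boundaries, and all numeric values are assumptions; the exact form of the Handsaker master equation is not given here. Plain Python is used so the example runs without dependencies, whereas the project itself names PyTorch.

```python
def master_equation_rhs(p, alpha, beta):
    """dP/dt for a one-step expansion/contraction chain.

    p[i], alpha[i], beta[i] belong to the i-th repeat-length state.
    Boundaries are reflecting: no contraction out of the first state
    and no expansion out of the last one.
    """
    k = len(p)
    rhs = [0.0] * k
    for i in range(k):
        out_up = alpha[i] * p[i] if i < k - 1 else 0.0
        out_down = beta[i] * p[i] if i > 0 else 0.0
        in_from_below = alpha[i - 1] * p[i - 1] if i > 0 else 0.0
        in_from_above = beta[i + 1] * p[i + 1] if i < k - 1 else 0.0
        rhs[i] = in_from_below + in_from_above - out_up - out_down
    return rhs

def physics_loss(p_now, p_next, dt, alpha, beta):
    """Mean squared gap between a finite-difference dP/dt and the RHS."""
    rhs = master_equation_rhs(p_now, alpha, beta)
    residual = [(b - a) / dt - r for a, b, r in zip(p_now, p_next, rhs)]
    return sum(r * r for r in residual) / len(residual)

# Toy four-state example with made-up rates.
p0 = [0.7, 0.2, 0.1, 0.0]
alpha = [0.05, 0.06, 0.08, 0.10]
beta = [0.01, 0.01, 0.01, 0.01]
dt = 0.1
rhs0 = master_equation_rhs(p0, alpha, beta)
p1 = [a + dt * r for a, r in zip(p0, rhs0)]  # one explicit Euler step
loss = physics_loss(p0, p1, dt, alpha, beta)
```

Because `p1` is generated from the same equation, the physics loss is zero up to rounding. A distribution that violates the dynamics would produce a positive loss, which is the signal the training objective penalizes alongside the data loss.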

Implementation

  • PyTorch
  • custom PDE residual loss
  • donor-wise and joint training

Validation

Leave-one-out cross-validation across six deeply sampled donors.

Primary metric:

  • KL divergence
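A minimal sketch of the validation loop, assuming the KL divergence is taken from the observed distribution to the predicted one and that empty bins are smoothed with a small constant. Both choices are assumptions, and the donor identifiers are placeholders.

```python
import math

def kl_divergence(observed, predicted, eps=1e-9):
    """KL(observed || predicted) for two histograms over the same bins.

    eps keeps the logarithm finite for empty bins; both inputs are
    renormalized after smoothing.
    """
    p = [x + eps for x in observed]
    q = [x + eps for x in predicted]
    sum_p, sum_q = sum(p), sum(q)
    return sum(
        (a / sum_p) * math.log((a / sum_p) / (b / sum_q))
        for a, b in zip(p, q)
    )

def leave_one_out(donors):
    """Yield (training donors, held-out donor) pairs, one per donor."""
    for held_out in donors:
        yield [d for d in donors if d != held_out], held_out

donors = ["D1", "D2", "D3", "D4", "D5", "D6"]  # placeholder donor IDs
folds = list(leave_one_out(donors))

same = kl_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
different = kl_divergence([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
```

With six donors this yields six folds of five training donors each. In each fold the model would be fit on the training donors and scored by `kl_divergence` on the held-out donor's histogram.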

4.2.1 Continuous Fokker–Planck Formulation

The discrete CTMC is expressed in its continuous Fokker–Planck limit:

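\[ \frac{\partial P(n,t)}{\partial t} = -\frac{\partial}{\partial n}\bigl[ \mu(n,\alpha)\, P(n,t) \bigr] + \frac{\partial^{2}}{\partial n^{2}}\bigl[ D(n)\, P(n,t) \bigr] \]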

where:

  • \( P(n,t) \) is the probability density of repeat length \( n \)
  • \( \mu(n,\alpha) \) is the drift term
  • \( D(n) \) is the diffusion coefficient
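As a worked step, the drift and diffusion terms can be related to discrete rates under one assumption: that the underlying chain is a one-step process with expansion rate \( \alpha(n) \) and contraction rate \( \beta(n) \), the rate functions named in the PINN design. The source does not spell out the chain, so this is a standard second-order (Kramers–Moyal) expansion offered as a sketch, with any modifier dependence of the drift left implicit.

```latex
\begin{aligned}
\frac{dP_n}{dt}
  &= \alpha(n-1)\,P_{n-1} + \beta(n+1)\,P_{n+1}
     - \bigl[\alpha(n) + \beta(n)\bigr]\,P_n \\
  &\approx -\frac{\partial}{\partial n}\Bigl[\bigl(\alpha(n) - \beta(n)\bigr)\,P\Bigr]
     + \frac{\partial^{2}}{\partial n^{2}}\Bigl[\tfrac{1}{2}\bigl(\alpha(n) + \beta(n)\bigr)\,P\Bigr] \\
\mu(n) &= \alpha(n) - \beta(n), \qquad
D(n) = \tfrac{1}{2}\bigl[\alpha(n) + \beta(n)\bigr]
\end{aligned}
```

Under this reading, net expansion (expansion minus contraction) sets the drift, while total event rate sets the spread.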

Distribution Interpretation

The model tracks the full population distribution of neurons.

Initially, neurons cluster near inherited repeat lengths (~42 CAGs). Over time, the distribution:

  • drifts toward larger repeat lengths
  • broadens due to stochastic variability

Drift Term

The drift component \( \mu(n,\alpha) \) controls directional expansion.

Within the Canary Circuit framework:

  • expansion is slow below ~80 CAGs
  • accelerates during the A → B transition
  • becomes fastest above ~150 CAGs

High MSH3 activity increases drift magnitude, while FAN1-associated stabilization reduces it.
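The qualitative shape described above can be written as a toy drift function. Everything numeric here is hypothetical: the slow and fast rates, the linear ramp between 80 and 150 CAGs, and the multiplicative way MSH3 and FAN1 enter are illustrative assumptions, not fitted or published values.

```python
def drift(n, msh3=1.0, fan1=1.0, slow=0.1, fast=2.0):
    """Hypothetical drift (CAG units per year) at repeat length n.

    msh3 and fan1 are relative activity levels, with 1.0 as baseline.
    """
    if n < 80:
        base = slow
    elif n < 150:
        # Linear ramp across the A -> B transition (an illustrative choice).
        base = slow + (fast - slow) * (n - 80) / 70.0
    else:
        base = fast
    # Higher MSH3 activity scales drift up; higher FAN1 activity scales it down.
    return base * msh3 / fan1

baseline = [drift(n) for n in (60, 110, 160)]
high_msh3 = drift(110, msh3=2.0)
high_fan1 = drift(110, fan1=2.0)
```

In the project itself this shape would be learned by the network, not fixed by hand; the function only shows what the learned drift is expected to resemble.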


Diffusion Term

The diffusion component \( D(n) \) captures stochastic variability between neurons.

Even identical starting repeat lengths diverge over time due to:

  • mismatch repair variability
  • slipped-strand formation
  • repair timing differences

Without diffusion, the model predicts a narrow deterministic wave rather than the experimentally observed broad distribution.
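The contrast between a deterministic wave and a broad distribution can be seen in a small particle simulation. This sketch uses an Euler–Maruyama update with constant drift and diffusion, which is a simplification (the model's terms depend on repeat length), and all parameter values are made up.

```python
import math
import random
import statistics

def simulate(n_cells, years, mu, d, n0=42.0, dt=0.1, seed=0):
    """Evolve repeat lengths with n += mu*dt + sqrt(2*d*dt)*noise.

    All cells start at the same inherited length n0.
    """
    rng = random.Random(seed)
    lengths = [n0] * n_cells
    noise_scale = math.sqrt(2.0 * d * dt)
    for _ in range(int(round(years / dt))):
        lengths = [
            n + mu * dt + noise_scale * rng.gauss(0.0, 1.0)
            for n in lengths
        ]
    return lengths

without_diffusion = simulate(500, 40, mu=0.5, d=0.0)
with_diffusion = simulate(500, 40, mu=0.5, d=0.5)

spread_without = statistics.pstdev(without_diffusion)
spread_with = statistics.pstdev(with_diffusion)
mean_with = statistics.fmean(with_diffusion)
```

With `d=0.0` every cell follows the same trajectory and the spread is exactly zero. With `d=0.5` the mean still moves with the drift, but the population broadens, with variance growing roughly as 2·d·t.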


Why Use the Fokker–Planck Formulation

The original CTMC is discrete and difficult to differentiate directly within neural-network optimization.

The Fokker–Planck approximation is continuous and differentiable, enabling automatic differentiation during PINN training.
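To show what a Fokker–Planck residual measures, the sketch below evaluates it on a density with a known closed form. It assumes constant drift and diffusion, for which a Gaussian with mean n0 + mu·t and variance var0 + 2·d·t solves the equation. Central finite differences stand in for the automatic differentiation a PINN would use, and all parameter values are illustrative.

```python
import math

def gaussian_density(n, t, mu=0.5, d=0.5, n0=42.0, var0=1.0):
    """Closed-form solution for constant drift mu and diffusion d."""
    mean = n0 + mu * t
    var = var0 + 2.0 * d * t
    return math.exp(-((n - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fp_residual(density, n, t, mu=0.5, d=0.5, h=1e-3):
    """dP/dt + mu*dP/dn - d*d2P/dn2, estimated with central differences."""
    dp_dt = (density(n, t + h) - density(n, t - h)) / (2.0 * h)
    dp_dn = (density(n + h, t) - density(n - h, t)) / (2.0 * h)
    d2p_dn2 = (density(n + h, t) - 2.0 * density(n, t) + density(n - h, t)) / h**2
    return dp_dt + mu * dp_dn - d * d2p_dn2

# Residual of the true solution, and of the same density scored
# against a deliberately wrong drift value.
matched = fp_residual(gaussian_density, 50.0, 10.0)
mismatched = fp_residual(gaussian_density, 50.0, 10.0, mu=1.0)
```

The matched residual is close to zero, while the mismatched one is not. A physics loss built from this quantity therefore pushes the network toward densities and rate parameters that are mutually consistent.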


Role of the PINN

Rather than repeatedly solving simulations numerically, the PINN learns a continuous approximation of the probability density \( P(n,t) \).

Once trained, the model can estimate distributions for arbitrary modifier combinations using a single forward pass.


4.3 GWAS Modifier Integration (Aim 2)

GWAS modifiers identify which DNA repair genes alter HD onset timing.

Selected modifiers include:

  • MSH3
  • FAN1
  • MLH1
  • PMS1
  • PMS2
  • LIG1

Expression values from the Handsaker dataset are incorporated as continuous covariates affecting expansion-rate parameters.

Published knockout measurements initialize directional effects before fitting to human single-cell data.
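One way covariates could act on an expansion rate is through a log-linear scaling, sketched below. The functional form, the function name `modulated_rate`, and the weight values are assumptions. Only the signs for MSH3, PMS1, and FAN1 follow statements in this document; the remaining modifiers are left at zero because no direction is assumed for them here.

```python
import math

# Placeholder initial weights, not fitted or published estimates.
# Positive values speed expansion, negative values slow it.
initial_weights = {
    "MSH3": 0.5,
    "PMS1": 0.3,
    "FAN1": -0.5,
    "MLH1": 0.0,
    "PMS2": 0.0,
    "LIG1": 0.0,
}

def modulated_rate(base_rate, expression, weights):
    """Scale a baseline rate by exp(sum of weight * expression z-score).

    expression maps gene name to a standardized expression value;
    missing genes are treated as being at baseline (0.0).
    """
    log_scale = sum(w * expression.get(gene, 0.0) for gene, w in weights.items())
    return base_rate * math.exp(log_scale)

at_baseline = modulated_rate(1.0, {}, initial_weights)
high_msh3 = modulated_rate(1.0, {"MSH3": 1.0}, initial_weights)
high_fan1 = modulated_rate(1.0, {"FAN1": 1.0}, initial_weights)
```

The exponential keeps the rate positive for any covariate values, and the initial weights only set a starting direction that fitting to the single-cell data can revise.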


4.4 Mechanistic Inference Procedure (Aim 2)

The minimal regulatory graph is inferred by:

  1. fitting the PINN with individual modifiers
  2. measuring variance explained
  3. performing ablation analysis
  4. retaining only informative nodes
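The four steps above can be sketched as a drop-one ablation. The `fit_loss` callable is hypothetical and stands in for refitting the PINN on a subset of modifiers and returning its validation loss. The toy loss, its contribution values, and the retention cutoff are made up solely to make the example runnable.

```python
def ablation_ranking(modifiers, fit_loss, min_increase=0.01):
    """Rank modifiers by how much the loss rises when each is removed.

    Returns the per-modifier loss increase and the modifiers whose
    removal costs at least min_increase.
    """
    full_set = frozenset(modifiers)
    full_loss = fit_loss(full_set)
    increases = {m: fit_loss(full_set - {m}) - full_loss for m in modifiers}
    kept = [m for m in modifiers if increases[m] >= min_increase]
    return increases, kept

# Made-up contributions used only to exercise the procedure.
TOY_CONTRIBUTION = {
    "MSH3": 0.30, "FAN1": 0.20, "MLH1": 0.004,
    "PMS1": 0.05, "PMS2": 0.002, "LIG1": 0.001,
}

def toy_fit_loss(subset):
    """Stand-in for a real refit: loss falls as informative nodes are added."""
    return 1.0 - sum(TOY_CONTRIBUTION[m] for m in subset)

modifiers = ["MSH3", "FAN1", "MLH1", "PMS1", "PMS2", "LIG1"]
increases, kept = ablation_ranking(modifiers, toy_fit_loss)
```

With these toy numbers three modifiers survive the cutoff. In practice each `fit_loss` call is a full refit, so the cost of the procedure grows linearly with the number of candidate modifiers.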

Cross-regional transcriptomic datasets are analyzed to determine whether transition dynamics are striatum-specific.


4.5 ODE Circuit Simulation (Aim 3)

The Canary Circuit is simulated as a two-node ODE system.

Sensor Node

Represents:

  • MSH3 activity
  • or HTT transcription rate

Output Node

Represents a threshold-activated reporter.

The output crosses a detection threshold during the Phase B acceleration window.

Simulation

  • kinetic parameters initialized from published MMR kinetics
  • parameter sweeps across biologically plausible ranges
  • sensitivity analysis with ±50% perturbations

Implementation

SciPy odeint in Python.
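A minimal sketch of the two-node system follows. It assumes a first-order sensor that tracks its input and a Hill-type reporter, and uses a plain forward-Euler loop to stay dependency-free; the same right-hand side could be handed to SciPy's odeint. All parameter values, the step input, and the detection threshold are hypothetical.

```python
def hill(s, k_half, n):
    """Hill activation: near 0 for s << k_half, near 1 for s >> k_half."""
    return s**n / (k_half**n + s**n)

def simulate_circuit(input_signal, t_end=60.0, dt=0.01,
                     tau=2.0, k_max=1.0, k_half=0.5, hill_n=4.0, gamma=0.5):
    """Integrate ds/dt = (u - s)/tau and dr/dt = k_max*hill(s) - gamma*r."""
    s, r, t = 0.0, 0.0, 0.0
    trace = []
    for _ in range(int(round(t_end / dt))):
        u = input_signal(t)
        ds = (u - s) / tau
        dr = k_max * hill(s, k_half, hill_n) - gamma * r
        s, r, t = s + dt * ds, r + dt * dr, t + dt
        trace.append((t, s, r))
    return trace

def phase_input(t, onset=30.0, low=0.1, high=1.0):
    """Step input: low proxy signal in Phase A, high after Phase B onset."""
    return low if t < onset else high

def detects_phase_b(trace, threshold=0.5):
    """True if the reporter stays silent early and is on at the end."""
    quiet_early = max(r for t, s, r in trace if t < 29.0) < threshold
    on_late = trace[-1][2] > threshold
    return quiet_early and on_late

baseline_ok = detects_phase_b(simulate_circuit(phase_input))

# Sensitivity analysis: perturb each kinetic parameter by -50% and +50%.
base = {"tau": 2.0, "k_max": 1.0, "k_half": 0.5, "gamma": 0.5}
sensitivity = {}
for name in base:
    for factor in (0.5, 1.5):
        params = dict(base)
        params[name] *= factor
        sensitivity[(name, factor)] = detects_phase_b(
            simulate_circuit(phase_input, **params)
        )
```

Each entry in `sensitivity` records whether the reporter still separates the two phases after one parameter is perturbed. A steep Hill coefficient is what keeps the reporter near zero while the input is low.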


4.6 Circularity Acknowledgment

The ~150 CAG threshold used during PINN training originates from the same Handsaker dataset.

This creates unavoidable circularity within a single-cohort cross-sectional dataset.

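The PINN therefore does not independently discover the threshold; it incorporates it as prior biological knowledge. Recovering the threshold in Aim 1 should be read as a consistency check against the source data, not as independent confirmation.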


Section 5: Limitations and Bioethics

Limitations

Temporal Structure Comes From Physics

The PINN recovers dynamics by embedding the CTMC master equation directly into the optimization process.

Recovered parameters therefore depend strongly on the assumptions built into the governing equations.

Small and Genetically Narrow Cohort

The model is trained on six deeply sampled donors with inherited repeat lengths between 40–43 CAGs.

Generalization to:

  • juvenile-onset HD
  • diverse ancestries
  • broader repeat-length ranges

remains uncertain.

Fixed Threshold Assumption

The ~150 CAG threshold is treated as invariant.

If thresholds differ across:

  • brain regions
  • cell types
  • modifier backgrounds

then recovered parameters may become biased.


Bioethical Considerations

All datasets are:

  • pre-collected
  • anonymized
  • accessed under published agreements

No new patient data is collected.

The project is entirely computational.

Translational Considerations

If implemented experimentally in the future:

  • the Canary Circuit would function as a neuronal biosensor
  • biosafety review and IRB oversight would be required

Additionally, because HD disproportionately affects founder-effect populations such as the Lake Maracaibo community, equitable access and community engagement must remain central to any future translational development.


Section 6: References

  1. Handsaker RE, Kashin S, Reed NM, et al. Cell. 2025;188(3):623–639.e19.
  2. GeM-HD Consortium. Nature Genetics. 2025;57(6):1426–1436.
  3. Wang N, et al. Cell. 2025;188:1524–1544.e22.
  4. Mathews EW, Coffey SR, Gärtner A, et al. Nature Communications. 2025;16:10009.
  5. Richard G-F. Cells. 2021;10:1019.
  6. Monckton DG, Jones L, Pearson CE, Wheeler V. Journal of Huntington’s Disease. 2021;10(1):7–33.
  7. Bunting EL, Donaldson J, Cumming SA, et al. Science Translational Medicine. 2025.
  8. Scahill RI, et al. Nature Medicine. 2025.
  9. Mätlik K, Baffuto M, Kus L, et al. Nature Genetics. 2024;56:383–394.
