Projects
Final projects:
- Computational HCV Vaccine Design Workbench HTGAA Spring 2026 — Final Project What Is This Project? In one line: a reproducible workbench that nominates an annotated HCV immunogen construct for future experimental validation.

In one line: a reproducible workbench that nominates an annotated HCV immunogen construct for future experimental validation.
A fully computational pipeline that collects public hepatitis C virus (HCV) sequence data, identifies the most conserved and immunologically validated regions of the viral envelope, designs a family of candidate immunogens using different strategies, scores each candidate against a transparent multi-criterion rubric, and exports the top-ranked construct as an annotated, codon-optimized DNA and protein sequence ready for experimental testing.
Hepatitis C virus (HCV) has infected roughly 50 million people worldwide and kills around 242,000 per year (WHO 2024). Despite more than 35 years of research, there is no licensed vaccine.
The core obstacle is the E2 envelope protein — the part of the virus that binds host cells and is the primary antibody target. E2 is a difficult target for three reasons:
Extreme genetic diversity. HCV has 8 confirmed genotypes and 93+ subtypes (ICTV 2026). A vaccine designed against a single reference strain trains the immune system on one variant of a constantly shifting target.
The glycan shield. E2 carries 9–11 N-linked glycosylation sites that act as a molecular camouflage, physically blocking antibody access to the conserved surfaces that would actually be worth targeting.
Conformational instability. E2 adopts different shapes depending on context. Linear sequence-level approaches cannot fully capture the neutralizing epitopes that only exist as three-dimensional structures.
Computational approaches can help by aggregating diversity data, identifying what is conserved despite the variability, and systematically comparing design strategies before committing to synthesis or animal studies.
Public HCV E2 protein sequences downloaded from NCBI Virus. A curated panel spanning four genotypes (1a, 1b, 2a, 3a) was assembled, covering an estimated 65–75% of global HCV infections. All sequences are logged in sequence-manifest.tsv with accession numbers and source metadata.
All sequences aligned using MAFFT v7, producing a column-level alignment. Alignment quality was verified manually. Output: aligned FASTA, 434 columns.
Per-position Shannon entropy calculated in Python across all alignment columns. A 15-amino-acid sliding window was used to identify contiguous conserved regions. The most conserved window (positions 297–311, overlapping the CD81 binding loop) reached 99.15% conservation. Mean conservation across all positions: 87.2%.
The Immune Epitope Database (IEDB) was queried for positive T-cell assay results from HCV NS3 and NS5B proteins in human hosts. Results were ranked by a composite score weighting assay count, number of independent publications (PMIDs), HLA type count, and population coverage. Top hits: KLVALGINAV (NS3, 90 assays, 18 PMIDs, HLA-A*02:01), CINGVCWTV (NS3, 57 assays, 17 PMIDs), ALYDVVTKL (NS5B, 4 assays), IMAKNEVFCV (NS5B, 4 assays).
Five candidate immunogens were designed:
| Design | Strategy | Length |
|---|---|---|
| A — Natural Baseline | Single genotype 1a reference (strain H77) | 420 aa |
| B — Consensus E2 | Statistical consensus of all curated sequences | 434 aa |
| C — Hybrid E2 + T-cell | Consensus E2 fused to 4 validated NS3/NS5B T-cell epitopes | 492 aa |
| D — E2-Ferritin Nanoparticle | Consensus E2 displayed on H. pylori ferritin 24-mer scaffold | 640 aa monomer / ~1.69 MDa particle |
| D-2 — Focused Conserved Mosaic | Only the top conserved E2 windows + T-cell cassette | 97 aa |
Each design scored on a 100-point, 6-axis rubric:
| Criterion | Weight | What It Measures |
|---|---|---|
| Genotype Coverage | 25 | Breadth of genotype representation |
| Epitope Conservation | 20 | Per-position amino acid identity at functional sites |
| HLA Coverage | 15 | Population coverage of T-cell epitopes |
| Developability | 15 | Expression feasibility, folding confidence |
| Manufacturability | 15 | Construct length, GC content, synthesis cost |
| Interpretability | 10 | Clarity and communicability of design rationale |
Final scores: A = 45, B = 66, C = 76, D = 68, D-2 = 89.
Top candidates exported as: codon-optimized DNA FASTA, protein FASTA, construct map, annotations TSV. Design C codon optimization: GC content reduced from 68.6% → 47.6% (into the mRNA synthesis sweet spot of 40–60%).
| Tool | Purpose |
|---|---|
| NCBI Virus | Public HCV E2 sequence retrieval |
| MAFFT v7 | Multiple sequence alignment |
| Python (Biopython, NumPy) | Shannon entropy, sliding window conservation, FASTA handling |
| IEDB (Immune Epitope Database) | T-cell epitope mining and ranking |
| ESMFold | Protein structure confidence prediction (pLDDT scoring) |
| Python codon optimization script | Human codon-usage optimization, GC content balancing |
| Custom scoring framework | 6-axis, 100-point scorecard (documented in design-scorecard.draft.tsv) |
All tools are public and free. The pipeline is fully scripted in Python and re-runnable.
Architecture: Consensus E2 (the full 434-residue sequence representing the statistical average across all curated sequences) genetically fused via a flexible (G4S)₃ linker to H. pylori ferritin (UniProt P52093, 167 aa). The monomer is 640 aa and self-assembles into a 24-subunit octahedral nanoparticle of approximately 1.69 MDa.
Why ferritin display matters: Presenting 24 copies of the same antigen on a symmetric scaffold drives B-cell receptor cross-linking — the same signal a real virus surface sends. This mechanism has empirical precedent: Kanekiyo et al. 2013 (Nature 499:102–106) demonstrated approximately a 10-fold increase in neutralizing antibody titers when the influenza HA-stem antigen was displayed on the same H. pylori ferritin scaffold compared to soluble antigen. The scaffold itself is well-characterized and the particle size (~1.69 MDa, ~10–15 nm) is in the range optimal for lymph node trafficking and B-cell engagement.
The scorecard paradox: D-2 scores 89/100 because it is small (97 aa), easy to synthesize, and carries a full T-cell epitope cassette. Design D scores lower because size (640 aa) reduces manufacturability points and the scorecard has no axis for multivalent avidity — that biological multiplier is invisible to sequence-level scoring. The scorecard is a decision-support tool, not an efficacy oracle. It captures what can be computed from sequence. The Kanekiyo precedent captures what biology does with geometry.
Design D-2 wins the raw scorecard at 89/100. It is a 97-amino-acid linear construct made entirely of: the two most conserved E2 windows (99.15% and 97.73% conservation respectively) connected by GGGGS linkers, plus the same four-epitope T-cell cassette.
It scores so high because it maximizes conservation efficiency (only top-1% positions included), retains full HLA coverage from the T-cell cassette, and is dramatically cheaper to synthesize than any of the full-length designs.
It does not lead because linear peptides stripped of their structural context rarely elicit neutralizing antibodies. The surfaces that broadly neutralizing antibodies recognize on E2 are conformational — they require the native fold. D-2 is the right candidate for T-cell peptide assays, for cost-limited settings, or for display ON the ferritin scaffold (a combined D+D-2 construct is the highest-priority future design).
Phase 1 — Expression and Binding (3–6 months)
Phase 2 — Computational Refinement (parallel)
Phase 3 — In Vivo Testing (aspirational)