Week 1: Principles and Practices
graph LR DNA-->RNA-->proteins

Context
This page captures my Week 1 assignment for HTGAA 2025, reformatted for Hugo. It follows the structure of my original notes.
Why I decided not to use ChatGPT
I regularly use ChatGPT and think it’s great. However, per the TAs’ recommendation, I completed this week’s thinking without it to practice independent reasoning about safety and governance. I spent significant time on this reflection and write-up.
Class Assignment
Describe a biological engineering application or tool you want to develop and why. It may relate to your HTGAA project, current research, or a curiosity.
- Status: Done ✅
Idea
Combine multi-omics approaches with a large language model (LLM) to identify and explain key drivers of a biological phenomenon (e.g., mechanisms in health/disease) and to assist design decisions.
Description
An open-source software tool that integrates data from multiple omics layers (genomics, transcriptomics, proteomics, metabolomics), then lets an LLM reason over that context to answer questions, surface hypotheses, and suggest next steps.
- Pull structured knowledge from established databases:
- Provide the LLM with on-demand context (retrieval over curated data) to explain, compare, and discuss pathways, variants, proteins, and metabolites.
- Start with indirect querying (retrieval and cached snapshots); aim later for direct, auditable database interactions for higher autonomy.
Why it matters
- Accelerates discovery & engineering by unifying multi-omics context for rapid Q&A and exploration.
- Lowers the barrier for non-experts to use complex bioinformatics tools and data.
- Moves toward autonomous analysis: the tool can run tasks, try alternatives, and report results against defined evaluation criteria.
Implementation for Proof of Concept
Scope & dataset
- Pick a focused biological question (e.g., a pathway or disease module).
- Assemble a small, curated multi-omics dataset (public entries from the resources above).
Data prep
- Normalize formats; map identifiers across databases (genes ↔ proteins ↔ pathways ↔ metabolites).
- Build lightweight indices (e.g., TSV/Parquet + JSON metadata) for retrieval.
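
The data-prep steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the records, the field names (`gene`, `protein`, `pathways`, `metabolites`), and the helpers `build_index` and `lookup` are all hypothetical, and every identifier is a placeholder rather than a real database accession.

```python
import json
from collections import defaultdict

# Hypothetical, hand-written records standing in for curated database entries.
# All identifiers are placeholders, not real accessions.
RECORDS = [
    {"gene": "GENE_A", "protein": "PROT_A", "pathways": ["PATH_1"], "metabolites": ["MET_X"]},
    {"gene": "GENE_B", "protein": "PROT_B", "pathways": ["PATH_1", "PATH_2"], "metabolites": []},
]

def build_index(records):
    """Map every identifier to the positions of the records that mention it."""
    index = defaultdict(list)
    for pos, rec in enumerate(records):
        ids = [rec["gene"], rec["protein"], *rec["pathways"], *rec["metabolites"]]
        for identifier in ids:
            index[identifier].append(pos)
    return dict(index)

INDEX = build_index(RECORDS)

def lookup(identifier):
    """Return all records linked to an identifier, whatever omics layer it comes from."""
    return [RECORDS[pos] for pos in INDEX.get(identifier, [])]

# JSON metadata describing the snapshot, meant to be stored alongside the data files.
METADATA = json.dumps({"n_records": len(RECORDS),
                       "layers": ["gene", "protein", "pathway", "metabolite"]})
```

With this layout, `lookup("PATH_1")` returns both records, so a pathway ID leads back to its genes and proteins without a separate mapping table per layer. A real dataset would likely need one cross-reference table per database pair instead of a single flat record.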
LLM reasoning loop
- Use retrieval-augmented prompts so the LLM cites source records when answering.
- Structured outputs (tables, entity lists, pathway IDs) for reproducibility.
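
One way the reasoning loop above might look, as a sketch under stated assumptions: the snapshot records, the `entities` field, and the functions `retrieve`, `build_prompt`, and `check_citations` are hypothetical names introduced for illustration. No real LLM API is called here; the model reply is assumed to arrive as a JSON string.

```python
import json

# Hypothetical cached snapshot; IDs, source name, and text are placeholders.
SNAPSHOT = [
    {"id": "REC_1", "source": "example_db", "entities": ["GENE_A", "PROT_A", "PATH_1"],
     "text": "GENE_A encodes PROT_A, part of PATH_1."},
    {"id": "REC_2", "source": "example_db", "entities": ["PROT_B", "PATH_1", "PATH_2"],
     "text": "PROT_B participates in PATH_1 and PATH_2."},
    {"id": "REC_3", "source": "example_db", "entities": ["MET_X", "PATH_3"],
     "text": "MET_X is produced in PATH_3."},
]

def retrieve(question, records):
    """Return records whose entity IDs appear verbatim in the question.

    A naive substring match standing in for real retrieval.
    """
    return [r for r in records if any(e in question for e in r["entities"])]

def build_prompt(question, hits):
    """Assemble a prompt that asks the model to cite record IDs and reply as JSON."""
    context = "\n".join(f"[{r['id']}] ({r['source']}) {r['text']}" for r in hits)
    return (
        "Answer using only the records below. Cite record IDs in square brackets.\n"
        'Reply as JSON with keys "answer" and "citations".\n\n'
        f"Records:\n{context}\n\nQuestion: {question}"
    )

def check_citations(reply_json, hits):
    """Parse the reply and list any cited IDs that were not among the retrieved records."""
    reply = json.loads(reply_json)
    allowed = {r["id"] for r in hits}
    unknown = [c for c in reply["citations"] if c not in allowed]
    return reply["answer"], unknown
```

The citation check is the part that supports reproducibility: an answer that cites a record ID outside the retrieved set can be flagged for review instead of being shown as grounded.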
Evaluation
- Define success checks: correctness vs. known references, coverage, and clarity.
- Human-in-the-loop review; log errors/ambiguities.
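
The correctness and coverage checks above could be approximated by comparing the entities in an answer with a known reference set. The sketch below is an assumption about how that might be scored, not a defined metric of the project; `score_answer` and the example IDs are hypothetical, and clarity is left to the human reviewer.

```python
def score_answer(predicted, reference):
    """Compare predicted entity IDs with a known reference set."""
    predicted, reference = set(predicted), set(reference)
    true_pos = predicted & reference
    # Precision: share of predicted entities that the reference confirms.
    precision = len(true_pos) / len(predicted) if predicted else 0.0
    # Coverage (recall): share of reference entities the answer recovered.
    coverage = len(true_pos) / len(reference) if reference else 0.0
    return {
        "precision": precision,
        "coverage": coverage,
        "missed": sorted(reference - predicted),        # log for human review
        "unsupported": sorted(predicted - reference),   # possible errors or novel claims
    }

# Placeholder IDs: the answer names three proteins, the reference lists four.
result = score_answer(["PROT_A", "PROT_B", "PROT_Z"],
                      ["PROT_A", "PROT_B", "PROT_C", "PROT_D"])
```

The `missed` and `unsupported` lists feed directly into the error log, so a reviewer sees which entities to check, not just a score.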
Governance, safety, & security
- Transparency: show data sources and versioning in every answer.
- Risk awareness: treat clinical claims with caution; avoid unverified wet-lab recommendations.
- Data handling: use only public datasets for the prototype; document any API key requirements and rate limits.
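
The transparency point above could be enforced in code by making sources part of the answer object itself. This is a minimal sketch, assuming hypothetical `SourceRef` and `Answer` types and placeholder database names and versions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SourceRef:
    database: str   # hypothetical database name
    version: str    # release or snapshot label of that database
    record_id: str  # placeholder record identifier

@dataclass
class Answer:
    text: str
    sources: list = field(default_factory=list)

    def render(self):
        """Format the answer with its sources; refuse to render if none are attached."""
        if not self.sources:
            raise ValueError("refusing to render an answer without sources")
        lines = [self.text, "", "Sources:"]
        lines += [f"- {s.database} {s.version}: {s.record_id}" for s in self.sources]
        return "\n".join(lines)

answer = Answer("PROT_A is part of PATH_1.",
                [SourceRef("example_db", "2025-01", "REC_1")])
```

Raising an error on an unsourced answer is a deliberate choice in this sketch: it makes missing provenance a visible failure instead of a silent omission.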