Week 1: Principles and Practices

graph LR
DNA-->RNA-->proteins
Week 1 cover

Context
This page captures my Week 1 assignment for HTGAA 2025, reformatted for Hugo. It follows the structure of my original notes.


Why I decided not to use ChatGPT

I regularly use ChatGPT and think it’s great. However, per the TAs’ recommendation, I completed this week’s thinking without it to practice independent reasoning about safety and governance. I spent significant time on this reflection and write-up.


Class Assignment

Describe a biological engineering application or tool you want to develop and why. It may relate to your HTGAA project, current research, or a curiosity.

  • Status: Done ✅

Idea

Combine multi-omics approaches with a Large Language Model (LLM) to identify and explain key drivers of a biological phenomenon (e.g., mechanisms in health/disease) and to assist design decisions.


Description

An open-source software tool that integrates data from multiple omics layers—genomics, transcriptomics, proteomics, metabolomics—then lets an LLM reason over that context to answer questions, surface hypotheses, and suggest next steps.

  • Pull structured knowledge from established databases:
  • Provide the LLM with on-demand context (retrieval over curated data) to explain, compare, and discuss pathways, variants, proteins, and metabolites.
  • Start with indirect querying (retrieval and cached snapshots); aim later for direct, auditable database interactions for higher autonomy.

Why it matters

  • Accelerates discovery & engineering by unifying multi-omics context for rapid Q&A and exploration.
  • Lowers the barrier for non-experts to use complex bioinformatics tools and data.
  • Moves toward autonomous analysis: the tool can run tasks, try alternatives, and report results against defined evaluation criteria.

Implementation for Proof of Concept

Scope & dataset

  1. Pick a focused biological question (e.g., a pathway or disease module).
  2. Assemble a small, curated multi-omics dataset (public entries from the resources above).

Data prep

  1. Normalize formats; map identifiers across databases (genes ↔ proteins ↔ pathways ↔ metabolites).
  2. Build lightweight indices (e.g., TSV/Parquet + JSON metadata) for retrieval.
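
The data prep steps above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the identifiers (GENE_A, PROT_A, PATH_1), the column names, and the helper functions read_pairs and build_index are all hypothetical placeholders, and a real prototype would read the mapping tables from files exported from the chosen databases.

```python
import csv
import io
import json

# Hypothetical toy mapping tables (placeholder IDs, not real database entries).
GENE_TO_PROTEIN_TSV = "gene_id\tprotein_id\nGENE_A\tPROT_A\nGENE_B\tPROT_B\n"
PROTEIN_TO_PATHWAY_TSV = (
    "protein_id\tpathway_id\nPROT_A\tPATH_1\nPROT_B\tPATH_1\nPROT_B\tPATH_2\n"
)


def read_pairs(tsv_text, key_col, value_col):
    """Read a two-column TSV into a dict: key -> sorted list of values."""
    mapping = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        mapping.setdefault(row[key_col], set()).add(row[value_col])
    return {key: sorted(values) for key, values in mapping.items()}


def build_index(gene_to_protein, protein_to_pathway):
    """Join the two mappings into one gene-centred lookup record per gene."""
    index = {}
    for gene, proteins in gene_to_protein.items():
        pathways = set()
        for protein in proteins:
            pathways.update(protein_to_pathway.get(protein, []))
        index[gene] = {"proteins": proteins, "pathways": sorted(pathways)}
    return index


gene_to_protein = read_pairs(GENE_TO_PROTEIN_TSV, "gene_id", "protein_id")
protein_to_pathway = read_pairs(PROTEIN_TO_PATHWAY_TSV, "protein_id", "pathway_id")
index = build_index(gene_to_protein, protein_to_pathway)

# JSON metadata that a retrieval step could load later.
index_json = json.dumps(index, indent=2, sort_keys=True)
print(index_json)
```

Keeping the index as plain JSON makes it easy to inspect by hand and to version alongside the source snapshots.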

LLM reasoning loop

  1. Use retrieval-augmented prompts so the LLM cites source records when answering.
  2. Request structured outputs (tables, entity lists, pathway IDs) for reproducibility.
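
One possible shape for this loop is sketched below. It assumes a deliberately simple keyword-overlap retriever in place of a real search index, and it does not call any LLM: the reply is a hand-written stand-in. The record contents, the JSON keys, and the function names (retrieve, build_prompt, check_citations) are hypothetical.

```python
import json
import re

# Hypothetical curated records; IDs, source name, and text are placeholders.
RECORDS = [
    {"id": "REC_1", "source": "example_db", "version": "v1",
     "text": "GENE_A encodes PROT_A, an enzyme in PATH_1."},
    {"id": "REC_2", "source": "example_db", "version": "v1",
     "text": "PROT_B participates in PATH_1 and PATH_2."},
    {"id": "REC_3", "source": "example_db", "version": "v1",
     "text": "METAB_X is a product of PATH_2."},
]


def tokenize(text):
    """Lowercase and split into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def retrieve(question, records, top_k=2):
    """Rank records by how many tokens they share with the question."""
    question_tokens = set(tokenize(question))
    scored = []
    for record in records:
        overlap = len(question_tokens & set(tokenize(record["text"])))
        if overlap:
            scored.append((overlap, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]


def build_prompt(question, retrieved):
    """Assemble a prompt that exposes record IDs so the answer can cite them."""
    context = "\n".join(
        f"[{r['id']}] ({r['source']} {r['version']}) {r['text']}" for r in retrieved
    )
    return (
        "Answer using only the records below and cite record IDs.\n"
        'Reply as JSON with keys "answer", "entities", "citations".\n\n'
        f"Records:\n{context}\n\nQuestion: {question}\n"
    )


def check_citations(raw_reply, retrieved):
    """Accept a reply only if it cites at least one record, all of them retrieved."""
    reply = json.loads(raw_reply)
    allowed = {r["id"] for r in retrieved}
    cited = reply.get("citations", [])
    return bool(cited) and set(cited) <= allowed


question = "Which pathways involve PROT_B?"
retrieved = retrieve(question, RECORDS)
prompt = build_prompt(question, retrieved)

# Stand-in for a model reply; a real run would send `prompt` to an LLM.
fake_reply = json.dumps(
    {"answer": "PATH_1 and PATH_2", "entities": ["PATH_1", "PATH_2"],
     "citations": ["REC_2"]}
)
print(prompt)
print("citations valid:", check_citations(fake_reply, retrieved))
```

Checking citations against the retrieved set is a cheap guard: a reply that cites a record the model was never shown gets flagged before anyone reads it.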

Evaluation

  1. Define success checks: correctness vs. known references, coverage, and clarity.
  2. Human-in-the-loop review; log errors/ambiguities.
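
The correctness and coverage checks could be scored along these lines. This is a sketch under the assumption that answers are reduced to sets of entity IDs (for example pathway IDs) and compared with a hand-curated reference set. The function name score_entities and the example IDs are hypothetical, and clarity is left to human review rather than scored here.

```python
def score_entities(predicted, reference):
    """Compare predicted entity IDs against a known reference set."""
    predicted_set, reference_set = set(predicted), set(reference)
    matched = predicted_set & reference_set
    # Correctness: share of predicted entities that appear in the reference.
    precision = len(matched) / len(predicted_set) if predicted_set else 0.0
    # Coverage: share of reference entities that the answer recovered.
    coverage = len(matched) / len(reference_set) if reference_set else 0.0
    return {
        "precision": precision,
        "coverage": coverage,
        "missing": sorted(reference_set - predicted_set),
        "unsupported": sorted(predicted_set - reference_set),
    }


# Placeholder IDs for illustration only.
result = score_entities(["PATH_1", "PATH_3"], ["PATH_1", "PATH_2"])
print(result)
```

The "missing" and "unsupported" lists give the human reviewer a concrete starting point for the error log.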

Governance, safety, & security

  • Transparency: show data sources and versioning in every answer.
  • Risk awareness: treat clinical claims with caution; avoid unverified wet-lab recommendations.
  • Data handling: only public datasets for the prototype; document any API keys or rate limits.
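
The transparency point could be enforced in code by refusing to render any answer that lacks source and version information. The sketch below is one way to do that. The SourceRef fields, the render_answer function, and the example values are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    """Where a piece of evidence came from (all fields are illustrative)."""
    database: str
    version: str
    record_id: str


def render_answer(text, sources):
    """Format an answer with its sources; fail loudly if none are given."""
    if not sources:
        raise ValueError("refusing to render an answer without sources")
    lines = [text, "", "Sources:"]
    for source in sources:
        lines.append(f"  - {source.database} ({source.version}): {source.record_id}")
    return "\n".join(lines)


rendered = render_answer(
    "PROT_B participates in PATH_1 and PATH_2.",
    [SourceRef(database="example_db", version="v1", record_id="REC_2")],
)
print(rendered)
```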

Resources (external)