Week 1: Principles and Practices

Why I decided not to use ChatGPT

I regularly use ChatGPT and think it’s great. However, per the TAs’ recommendation, I completed this week’s thinking without it to practice independent reasoning about safety and governance. I spent significant time on this reflection and write-up.

Class Assignment

Describe a biological engineering application or tool you want to develop and why. It may relate to your HTGAA project, current research, or a curiosity.

Status: Done ✅

Idea

Use multi-omics approaches + a Large Language Model (LLM) to identify and explain key drivers of a biological phenomenon (e.g., mechanisms in health/disease) and assist design decisions.

Description

An open-source software tool that integrates data from multiple omics layers—genomics, transcriptomics, proteomics, metabolomics—then lets an LLM reason over that context to answer questions, surface hypotheses, and suggest next steps.

Pull structured knowledge from established databases:
- NCBI GenBank, UniProt, KEGG, Reactome
Provide the LLM with on-demand context (retrieval over curated data) to explain, compare, and discuss pathways, variants, proteins, and metabolites.
Start with indirect querying (retrieval and cached snapshots); aim later for direct, auditable database interactions for higher autonomy.

Why it matters

Accelerates discovery & engineering by unifying multi-omics context for rapid Q&A and exploration.
Lowers the barrier for non-experts to use complex bioinformatics tools and data.
Moves toward autonomous analysis: the tool can run tasks, try alternatives, and report results against defined evaluation criteria.

Implementation for Proof of Concept

Scope & dataset

Pick a focused biological question (e.g., a pathway or disease module).
Assemble a small, curated multi-omics dataset (public entries from the resources above).

Data prep

Normalize formats; map identifiers across databases (genes ↔ proteins ↔ pathways ↔ metabolites).
Build lightweight indices (e.g., TSV/Parquet + JSON metadata) for retrieval.
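
The identifier mapping and indexing step could be sketched roughly as follows. This is a minimal illustration, not a settled design: the column names, the `build_index` and `to_json` helpers, and every identifier in the snapshot are hypothetical placeholders rather than real database accessions.

```python
import csv
import io
import json
from collections import defaultdict

# Placeholder snapshot: these identifiers are invented for illustration
# and do not correspond to real GenBank, UniProt, KEGG, or Reactome entries.
SNAPSHOT_TSV = (
    "gene_id\tprotein_id\tpathway_id\tmetabolite_id\n"
    "GENE_A\tPROT_A\tPATH_1\tMET_X\n"
    "GENE_B\tPROT_B\tPATH_1\tMET_Y\n"
    "GENE_B\tPROT_B\tPATH_2\tMET_Y\n"
)


def build_index(tsv_text):
    """Map each identifier to the set of identifiers it shares a row with."""
    index = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        ids = [value for value in row.values() if value]
        for identifier in ids:
            index[identifier].update(i for i in ids if i != identifier)
    return index


def to_json(index):
    """Serialize the index with sorted keys and lists so the file is stable."""
    stable = {key: sorted(values) for key, values in sorted(index.items())}
    return json.dumps(stable, indent=2)


index = build_index(SNAPSHOT_TSV)
print(to_json(index))
```

Looking up `index["GENE_B"]` then returns every protein, pathway, and metabolite linked to that gene in the snapshot, which is the kind of cross-layer lookup the retrieval step needs.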

LLM reasoning loop

Use retrieval-augmented prompts so the LLM cites source records when answering.
Structured outputs (tables, entity lists, pathway IDs) for reproducibility.
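
One possible shape for the retrieval-augmented prompt is sketched below, assuming a tiny in-memory record store and naive keyword overlap in place of a real retriever. The record IDs, texts, stopword list, and the `retrieve` and `build_prompt` helpers are all hypothetical, and the actual LLM call is deliberately left out.

```python
import re

# Hypothetical curated records; IDs, sources, and text are placeholders.
RECORDS = [
    {"id": "REC-001", "source": "pathway snapshot",
     "text": "PATH_1 includes GENE_A and GENE_B."},
    {"id": "REC-002", "source": "protein snapshot",
     "text": "PROT_B is encoded by GENE_B."},
    {"id": "REC-003", "source": "metabolite snapshot",
     "text": "MET_X is produced in PATH_1."},
]

# Small illustrative stopword list so common words do not drive the ranking.
STOPWORDS = {"which", "are", "is", "in", "the", "and", "by"}


def tokenize(text):
    """Lowercase the text and return its non-stopword tokens as a set."""
    return set(re.findall(r"[a-z0-9_]+", text.lower())) - STOPWORDS


def retrieve(question, records, top_k=2):
    """Rank records by shared tokens with the question; drop zero-overlap records."""
    query = tokenize(question)
    scored = [(len(query & tokenize(r["text"])), r) for r in records]
    scored = [(score, r) for score, r in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]["id"]))
    return [r for _, r in scored[:top_k]]


def build_prompt(question, records):
    """Assemble a prompt that asks the model to cite record IDs."""
    context = "\n".join(
        f"[{r['id']}] ({r['source']}) {r['text']}" for r in records
    )
    return (
        "Answer using only the records below and cite record IDs in brackets.\n"
        f"Records:\n{context}\n"
        f"Question: {question}\n"
    )


question = "Which genes are in PATH_1?"
hits = retrieve(question, RECORDS)
prompt = build_prompt(question, hits)
print(prompt)
```

Because every record carries its ID into the prompt, a reviewer can check each cited ID in the answer against the snapshot it came from.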

Evaluation

Define success checks: correctness vs. known references, coverage, and clarity.
Human-in-the-loop review; log errors/ambiguities.
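
The correctness and coverage checks could be approximated by comparing the entities in an answer against a known reference list. The sketch below assumes precision as a stand-in for correctness and recall as a stand-in for coverage; clarity is left to human review. The `score_answer` helper and the gene names are hypothetical.

```python
def score_answer(predicted, reference):
    """Compare predicted entities with a reference list.

    Precision approximates correctness; recall approximates coverage.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = predicted & reference
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return {
        "precision": precision,
        "recall": recall,
        # Reference entities the answer failed to mention.
        "missing": sorted(reference - predicted),
        # Entities the answer mentioned that the reference does not support.
        "unsupported": sorted(predicted - reference),
    }


# Placeholder entities for illustration only.
result = score_answer(["GENE_A", "GENE_C"], ["GENE_A", "GENE_B"])
print(result)
```

The `missing` and `unsupported` lists give the human reviewer a concrete starting point for the error and ambiguity log.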

Governance, safety, & security

Transparency: show data sources and versioning in every answer.
Risk awareness: treat clinical claims with caution; avoid unverified wet-lab recommendations.
Data handling: only public datasets for the prototype; document any API key requirements and rate limits, and keep the keys themselves out of the repository.
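
One way to enforce the transparency point is to make sources part of the answer object itself, so an answer cannot be shown without them. This is only a sketch: the `SourceRef` and `Answer` classes, the record ID, and the snapshot version string are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRef:
    """One cited record plus the snapshot version it was read from."""
    database: str
    record_id: str
    snapshot_version: str


@dataclass
class Answer:
    """An answer that refuses to render unless it carries sources."""
    text: str
    sources: list = field(default_factory=list)

    def render(self):
        if not self.sources:
            raise ValueError("refusing to render an answer without sources")
        lines = [self.text, "Sources:"]
        lines += [
            f"- {s.database} {s.record_id} (snapshot {s.snapshot_version})"
            for s in self.sources
        ]
        return "\n".join(lines)


# Placeholder record ID and version, for illustration only.
answer = Answer(
    text="PROT_A is linked to PATH_1 in the curated snapshot.",
    sources=[SourceRef("UniProt", "PROT_A", "2024-01-placeholder")],
)
rendered = answer.render()
print(rendered)
```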