📄Committed Listener MOU
I am an HTGAA Committed Listener. My responsibilities are:
Watching class lectures and recitations
Participating in node reviews
Developing and documenting my homework
Actively communicating with other students and TAs on the forum
Allowing HTGAA and BioClub to share my work (with attribution)
Honestly reporting on my work, and appropriately attributing and citing the work of others (both human and non-human)
Following locally applicable health and safety guidance
Promoting a respectful environment free of harassment and discrimination
Signed by committing this file to my documentation page/repository,
Liam
08 March 2026
3.1. Choose your protein. In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose.
Miraculin - https://rest.uniprot.org/uniprotkb/P13087.fasta https://rest.uniprot.org/uniprotkb/P13087.txt
>sp|P13087|MIRA_SYNDU Miraculin OS=Synsepalum dulcificum OX=3743 PE=1 SV=3
MKELTMLSLSFFFVSALLAAAANPLLSAADSAPNPVLDIDGEKLRTGTNYYIVPVLRDHG
GGLTVSATTPNGTFVCPPRVVQTRKEVDHDRPLAFFPENPKEDVVRVSTDLNINFSAFMP
CRWTSSTVWRLDKYDESTGQYFVTIGGVKGNPGPETISSWFKIEEFCGSGFYKLVFCPTV
CGSCKVKCGDVGIYIDQKGRRRLALSDKPFAFEFNKTVYF
3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.
Using https://www.bioinformatics.org/sms2/rev_trans.html:
Review recitation materials and lab documentation. Design artwork using the GUI at opentrons-art.rcdonovan.com. Write a Python script using coordinates from the GUI via the “HTGAA26 Opentrons Colab”. Sign up for a robot time slot and run the script on the Opentrons robot. Submit Python file via provided form. Artwork Design Python Script
Part A. SOD1 Binder Peptide Design Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.
Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.
What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose? A Phusion HF PCR Master Mix is a pre-combined PCR reaction system optimised for a specific engineered DNA polymerase.
Phusion DNA polymerase - provides the catalytic activity which synthesises DNA, and includes 3’→5’ exonuclease proofreading to reduce errors
Reaction buffer
MgCl₂ - magnesium ions
dNTPs - deoxynucleotide triphosphates: dATP, dCTP, dGTP, and dTTP
Stabilizers/additives
Water
A typical setup only requires after adding:
Subsections of Homework
Week 1 HW.1: Class assignment
1. Describe an application
Identify a biological engineering tool or application you wish to develop and explain your motivation.
I would like to develop a way to make plants grow 100x faster. I find this a very interesting and ambitious question. One route is to reverse-engineer the genome, the morphological development and its constraints, and the proteins/enzymes/catalysts that drive growth. Another is to design a separate organism (perhaps two bacteria?) which produces biomass: a combination of a carbon sequesterer and a cellulose printer. A third is to design a minimal artificial cell, in the spirit of Xenobots or the JCVI minimal cells: using new AI design software, you create a minimal genome, design your own morphological topology through simulation, and compile it down to gene regulatory networks (GRNs), transcription factors/thresholds, and DNA.
Why? Because trees and plants are great. They are calming, they look beautiful, they are functionally useful. Originally I wanted to build my own house, and was wondering - why is wood so expensive? If we could grow wood more quickly and effectively, that would be useful. It would also be fun to rapidly green certain areas of the world to produce arable land - the Australian desert, for example.
2. Establish governance goals
Describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future, like ensuring non-malfeasance (preventing harm). Break big goals down into two or more specific sub-goals.
Enhance biosecurity (prevent misuse and uncontrolled spread)
Prevent incidents
Restrict access to engineered strains, protocols, and enabling tools
Use genetic containment (kill-switches, auxotrophy, sterility)
Avoid traits that increase invasiveness or persistence outside intended settings
Help respond
Establish monitoring and reporting systems for unexpected dissemination
Prohibit open release until long-term impacts are understood
Prefer reversible or self-limiting designs over permanent alterations
Help respond
Post-deployment surveillance and remediation plans
Defined liability and responsibility for environmental harms
Equity, autonomy, and constructive use (ensure benefits are fairly distributed)
Minimizing burdens to stakeholders
Community consultation for land-use and deployment decisions
Avoid shifting risks onto local ecosystems or vulnerable populations
Feasibility without blocking research
Clear regulatory pathways that enable safe experimentation
Transparency and documentation to support responsible scaling
Promote beneficial applications
Prioritize reforestation, sustainable materials, and climate-positive outcomes
Discourage purely extractive or destabilizing commercial deployment
3. Design governance actions
Describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”)
Purpose: What is done now and what changes are you proposing?
Design: What is needed to make it “work”? (including the actor(s) involved - who must opt-in, fund, approve, or implement, etc)
Assumptions: What could you have wrong (incorrect assumptions, uncertainties)?
Risks of Failure & “Success”: How might this fail, including any unintended consequences of the “success” of your proposed actions?
Evaluate each action against objectives including:
Biosecurity enhancement
Lab safety
Environmental protection
Cost/burden minimization
Feasibility and research impact
| Does the option: | Option 1 | Option 2 | Option 3 |
| --- | --- | --- | --- |
| Enhance Biosecurity | 3 | 3 | 2 |
| • By preventing incidents | 3 | 3 | 2 |
| • By helping respond | 2 | 2 | 3 |
| Foster Lab Safety | 3 | 2 | 1 |
| • By preventing incident | 3 | 2 | 1 |
| • By helping respond | 2 | 2 | 2 |
| Protect the environment | 3 | 2 | 3 |
| • By preventing incidents | 3 | 2 | 2 |
| • By helping respond | 2 | 1 | 3 |
| Other considerations | | | |
| • Minimizing costs and burdens to stakeholders | 2 | 2 | 1 |
| • Feasibility? | 2 | 3 | 1 |
| • Not impede research | 1 | 2 | 1 |
| • Promote constructive applications | 3 | 2 | 3 |
5. Prioritize options
Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties.
For this, you can choose one or more relevant audiences for your recommendation, which could range from the very local (e.g. to MIT leadership or Cambridge Mayoral Office) to the national (e.g. to President Biden or the head of a Federal Agency) to the international (e.g. to the United Nations Office of the Secretary-General, or the leadership of a multinational firm or industry consortia). These could also be one of the “actor” groups in your matrix.
I would prioritise Containment-by-design + staged release. Given that there is immense uncertainty in how this project could be achieved, it is a waste of resources to consider other governance actions for now. Rapid iteration to reduce uncertainty is the path towards achievement. As part of this - a scalable safety protocol throughout this process facilitates rapid experimentation without risk of ruin, until the project can achieve milestones necessary for unlocking funding and revenue.
Week 1 HW.2: Lecture prep for W2
Answer prep questions from three faculty members:
Homework Questions from Professor Jacobson:
Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome. How does biology deal with that discrepancy?
Error rate refers to errors per nucleotide added per replication. An error could be a misincorporation (the wrong base inserted opposite the template base), for example.
Error rate of polymerase synthesis is roughly 1 in 10^7 (1:10^7).
The human genome is about 3 × 10^9 bp, so copying it once is expected to introduce about 1e-7 × 3e9 = 300 errors if nothing else corrects them.
Biology deals with the likely error through multiple levels of mitigation:
Proofreading during synthesis corrects errors
Mismatch repair after synthesis repairs errors
Redundancy and selection at multiple levels - DNA is double-stranded, cells exist in huge populations, misfolded proteins get degraded, defective RNAs are destroyed, faulty cells undergo apoptosis
Damage repair system
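The arithmetic behind that comparison can be checked with a short sketch. It only uses the two figures quoted in this answer (an error rate of 1e-7 per base and a genome of about 3e9 bp); the function name is my own.

```python
# Expected number of copying errors per genome replication,
# using the figures quoted in this answer.
ERROR_RATE = 1e-7      # errors per nucleotide added (value used above)
GENOME_LENGTH = 3e9    # approximate human genome length in base pairs


def expected_errors(error_rate: float, length: float) -> float:
    """Expected errors = per-base error rate x number of bases copied."""
    return error_rate * length


errors_per_copy = expected_errors(ERROR_RATE, GENOME_LENGTH)
print(f"Expected errors per genome copy: {errors_per_copy:.0f}")
```

Hundreds of errors per cell division is why the extra layers listed above (proofreading, mismatch repair, damage repair) are needed.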
How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?
Our assumptions:
Average Human Protein: 1036 bp.
~30,000 proteins observed in mammalian genome.
A protein of length L = 3L nucleotides (bases) + a stop codon in the genome
Coding is the process by which DNA is transcribed into mRNA (triplets / codons), and mRNA (codons) is translated into a linear chain of amino acids (polypeptides), which folds into 3D protein structures.
How many different ways are there to code for an average human protein, meaning how many different DNA encodings would compile (transcribe and translate) down to the same protein (chain of amino acids) whose coding sequence is 1036 bp long?
A codon is 3 nucleotides, each of which carries one of four bases (A, C, G, T in DNA; U replaces T in mRNA). There are 4^3 = 64 possible codons. Of these, 61 encode an amino acid and 3 are stop codons. An amino acid can be encoded by multiple codons. For instance, codons GAA and GAG both specify glutamic acid. This redundancy is referred to as degeneracy.
The degeneracy of an amino acid is the number of codons which encode it, e.g. d(Leu)=6, meaning leucine has 6 codons which encode it.
Average codon degeneracy across amino acids is roughly 3 (61 sense codons / 20 amino acids).
So to calculate the number of possible encodings for a protein of length L=5 amino acids, we take the degeneracy of each amino acid and compute their product. E.g. for a protein of L=5 with average degeneracy d(*)=3, num_encodings = d(*) * d(*) * d(*) * d(*) * d(*) = d(*)^L = 3^5 = 243.
An average human protein with a 1036 bp coding sequence has L = 1036 / 3 ≈ 345 codons, so the number of possible encodings is roughly 3^L = 3^345 ≈ 4 × 10^164.
There is an intractable number of possible encodings. However, functional “good” encodings are a tiny subset constrained by expression, folding, RNA processing, regulation, and host biology.
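The degeneracy product described above can be computed exactly for a given peptide instead of using the average of 3. This is a minimal sketch, assuming the standard genetic code; the 5-residue peptide and the function name are hypothetical examples of mine.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Degeneracy d(aa): number of codons encoding each amino acid (stops excluded).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def count_encodings(protein: str) -> int:
    """Number of distinct coding sequences that translate to `protein`."""
    return prod(DEGENERACY[aa] for aa in protein)


peptide = "MKELT"  # hypothetical 5-residue example
print("exact:", count_encodings(peptide))
print("average-degeneracy estimate:", 3 ** len(peptide))
```

The exact count depends on composition: a peptide rich in Leu, Ser, or Arg (6 codons each) has far more encodings than one rich in Met or Trp (1 codon each).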
Homework Questions from Dr. LeProust:
What’s the most commonly used method for oligo synthesis currently?
solid-phase chemical synthesis with phosphoramidite chemistry
Why is it difficult to make oligos longer than 200nt via direct synthesis?
Because each coupling cycle in direct phosphoramidite synthesis has a yield below 1.0, losses compound exponentially with length. With a per-step error e ≈ 0.01, the fraction of full-length, correct product is P(success) = (1-e)^200 ≈ 0.13.
Why can’t you make a 2000bp gene via direct oligo synthesis?
(1-e)^2000 ≈ 2 × 10^-9 for e ≈ 0.01, which is effectively zero, because errors accumulate with every synthetic cycle/step.
The expected number of cleavage events (from depurination) scales ~linearly with cycle count and purine content
Misincorporations accumulate (wrong base addition)
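The compounding of per-step losses can be made concrete with a short sketch. The 1% per-step error is the illustrative value used above, not a measured figure for any particular synthesizer.

```python
def full_length_fraction(per_step_error: float, n_steps: int) -> float:
    """Fraction of strands that survive n_steps cycles with no error."""
    return (1.0 - per_step_error) ** n_steps


# Compare a 200 nt oligo with a hypothetical 2000 nt direct synthesis.
for n in (200, 2000):
    print(f"{n:>5} nt: {full_length_fraction(0.01, n):.2e}")
```

This is why long genes are assembled from many short oligos rather than synthesized in a single run.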
Homework Question from George Church:
Choose ONE of the following three questions to answer; and please cite AI prompts or paper citations used, if any.
[Using Google & Prof. Church’s slide #4] What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?
[Given slides #2 & 4 (AA:NA and NA:NA codes)] What code would you suggest for AA:AA interactions?
[(Advanced students)] Given the one paragraph abstracts for these real 2026 grant programs sketch a response to one of them or devise one of your own:
Out of the 20 amino acids needed, the body synthesizes 11-12, while the remaining 8-9, known as essential amino acids, must be obtained through diet.
This does not seem to hold for all animals. Counterexample: cats, which also require taurine.
The Lysine Contingency was a genetic alteration Henry Wu performed in the dinosaur genome. The modification knocked out the ability of the dinosaurs to produce the amino acid Lysine.
This forced the dinosaurs to depend on lysine supplements provided by the park’s veterinary staff. In this way, dinosaurs could never escape from the park because they would never survive long without the food supplements.
Haha, I have to rewatch this film.
The way I would hack around this would be to give the dinosaurs a gut microbiome like the one cows rely on. Rumen microbes synthesise the essential amino acids, including lysine, from simpler nitrogen sources. The dinosaurs would then no longer need to produce lysine themselves or depend on the supplements, and would instead form a symbiotic relationship with the microbes in their gut.
I don’t know what this question means, but it reminds me also of Liebig’s law - would the restriction of one amino acid necessarily debilitate the dinosaurs so they can’t escape, or is nature more nonlinear and complex than that?
LLM prompts used:
10 essential amino acids in all animals?
across all animals?
cows can synthesise most of their needed amino acids? how many which ones
how long can you survive without just one of the amnio acids ?
Make a free account at benchling.com, Import the Lambda DNA.
Simulate Restriction Enzyme Digestion with the following Enzymes: EcoRI HindIII BamHI KpnI EcoRV SacI SalI
Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks.
Benchling screenshots.
Experimental design for Gel art.
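Before using Benchling, the digest step can be sketched in a few lines to see what the simulation does. This is a rough illustration only: it treats the DNA as linear, assumes a complete digest, and runs on a made-up toy sequence rather than the Lambda genome. The recognition sites and top-strand cut offsets are the standard ones for these seven enzymes; the function names are my own.

```python
import re

# Recognition sequence and top-strand cut offset for each enzyme in the list.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "BamHI": ("GGATCC", 1),
    "KpnI": ("GGTACC", 5),
    "EcoRV": ("GATATC", 3),
    "SacI": ("GAGCTC", 5),
    "SalI": ("GTCGAC", 1),
}


def cut_positions(seq: str, enzyme: str) -> list[int]:
    """Top-strand cut positions of `enzyme` in `seq`."""
    site, offset = ENZYMES[enzyme]
    # A lookahead is used so that overlapping sites would also be found.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq.upper())]


def digest(seq: str, enzyme: str) -> list[int]:
    """Fragment lengths from a complete digest of a linear molecule."""
    edges = [0] + cut_positions(seq, enzyme) + [len(seq)]
    return [end - start for start, end in zip(edges, edges[1:])]


# Toy sequence with two EcoRI sites (not Lambda DNA).
toy = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
print(digest(toy, "EcoRI"))
```

Each lane of a gel-art design corresponds to one such list of fragment lengths: shorter fragments migrate further down the gel.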
Week 2 HW.2: Gel Art - Restriction Digests and Gel Electrophoresis
In the wet-lab perform the lab experiment you designed in Part 1 and outlined in this week’s lab protocol “Gel Art: Restriction Digests and Gel Electrophoresis”.
N/A - no access to BioClub Tokyo Lab.
Week 2 HW.3: DNA Design Challenge
3.1. Choose your protein.
In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose.
Describe why you need to optimize codon usage. Which organism have you chosen to optimize the codon sequence for and why?
Proteins are translated from mRNA by ribosomes with the help of tRNAs. The tRNAs “pair” with codons on the mRNA. A codon is a 3-base sequence that maps onto a single amino acid. As we covered last week, there are 64 different codons (every ordered triplet of the four nucleotide bases), which map down to only 20 amino acids plus stop signals. This degeneracy means we can swap synonymous codons in the DNA/mRNA and still express the same amino acid sequence, i.e. the same protein. Why would we do this? Because mRNA codons are translated into amino acids by the tRNAs available in the organism. Each tRNA matches a codon (or several synonymous codons, see wobble pairing at the 3rd base). tRNA concentrations are not uniform across codons, so codons with abundant matching tRNA are translated more efficiently than others.
To restate:
DNA encodes triplet codons.
mRNA is transcribed from DNA.
Ribosomes read mRNA in triplets.
tRNAs carrying amino acids base-pair with codons (binding with the tRNA’s complementary anticodon)
Translation rate is approximately proportional to local charged tRNA abundance and ribosomal processivity.
Multiple codons encode the same amino acid, yet different organisms use these synonymous codons at different frequencies (codon usage bias). If a gene from organism A is expressed in organism B without modification, the codon distribution may not match the tRNA pool of B.
You need to optimize codon usage in order to achieve (good) yields from your biomanufacturing process.
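The simplest form of codon optimization is "one amino acid, one codon": back-translate the protein using a single preferred codon per residue. This is a minimal sketch under loud assumptions. The `PREFERRED` table is my approximate list of frequently used E. coli codons and should be checked against a real codon usage table. Real optimizers also balance GC content, avoid repeats and restriction sites, and consider mRNA structure.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# ASSUMPTION: illustrative, approximate "preferred" E. coli codons.
PREFERRED = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def back_translate(protein: str) -> str:
    """Pick the preferred codon for every residue."""
    return "".join(PREFERRED[aa] for aa in protein)


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon (no start/stop handling)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


fragment = "MKELTMLSLS"  # first ten residues of miraculin (UniProt P13087)
dna = back_translate(fragment)
print(dna)
print(translate(dna) == fragment)  # the protein is unchanged
```

The round-trip check at the end is the important property: swapping synonymous codons changes the DNA but not the encoded protein.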
I choose Escherichia coli (E. coli) as the target host for optimization:
Takes less time
Cell division is faster
Well established protocols to isolate plasmid
Each cell has single chromosome
Single circular plasmid
Each replicated cell has exact copy of DNA
Easy method
3.4. You have a sequence! Now what?
What technologies could be used to produce this protein from your DNA? Describe in your words how the DNA sequence can be transcribed and translated into your protein. You may describe either cell-dependent or cell-free methods, or both.
Recombinant expression in a host organism like E. coli.
Clone the coding sequence into an expression vector (a plasmid).
Promoter - T7 under lac control: binds the RNA polymerase
Ribosome binding site - Shine–Dalgarno AGGAGG: recruits ribosome
Coding sequence - see Miraculin DNA sequence above.
Antibiotic resistance gene - ampR: for selection of culture
Transform the plasmid into E. coli host cells.
Bacteria are given a heat shock.
Plate on ampicillin → only plasmid-containing cells survive.
Colonies grow.
Pick colonies.
Inoculate the liquid cultures (by introducing single colonies)
Induce expression (e.g., add IPTG if T7/lac system).
T7 RNA polymerase binds promoter
DNA is transcribed into mRNA
Ribosome binds RBS on mRNA.
Ribosome translates the mRNA into protein, stopping at the stop codon (the terminator ends transcription, not translation).
tRNAs decode codons
Amino acids polymerize into polypeptide
Harvest. Cells are lysed. Protein is purified.
Lyse cells (sonication or chemical lysis).
Purify protein (e.g., His-tag + Ni-NTA affinity column).
Apparently E. coli is possible but non-ideal for a cysteine-rich, glycosylated plant secreted protein like miraculin.
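The transcription and translation steps described above can be sketched in code. This is a toy model, assuming the standard genetic code, a coding-strand DNA input, and translation from the first AUG; it ignores promoters, ribosome binding strength, and everything else a real cell does. The `gene` string is a made-up example, not the miraculin construct.

```python
from itertools import product

RNA_BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(RNA_BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """mRNA has the coding-strand sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon (or the end)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the polypeptide
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Toy construct: RBS-like leader, a short ORF, a stop codon, and a tail.
gene = "AGGAGGTTAACC" + "ATGAAAGAACTGACCTAA" + "GC"
mrna = transcribe(gene)
print(mrna)
print(translate(mrna))
```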
3.5. [Optional] How does it work in nature/biological systems?
Describe how a single gene codes for multiple proteins at the transcriptional level. Try aligning the DNA sequence, the transcribed RNA, and also the resulting translated Protein!!! See example below.
Week 2 HW.4: Twist DNA Synthesis Order
Steps to build a plasmid:
Import DNA into Benchling.
Add promoter, RBS, start/stop codons, 7x His Tag, and terminator
Export .fasta and import into Twist.
Order Twist clonal gene, using pTwist Amp High Copy vector.
Export .gb (genbank) file for plasmid.
Import plasmid .gb file into Benchling, open Info>Topology and set Circular.
Week 2 HW.5: DNA Read/Write/Edit
DNA Read
(i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank).
No idea. Possibly my basil plant.
(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?
I would use long-read sequencing (1–100+ kb). Even though it is more expensive per base, long reads span repetitive regions and so give a more accurate genome assembly.
The way that DNA sequencing works currently is by extracting DNA, fragmenting it, reading the fragments, and then reassembling them computationally using probabilistic approaches. The “read length” refers to how large these fragments are in base pairs. A fragment of length 1 bp would be near useless, since there is no way to “place” it within the greater genome. Fragments of 150 bp map well because apparently the human genome is largely non-repetitive at that scale.
Short-read sequencing is a read of 50–600 bp. Long-read sequencing is 1-100 kb.
Technologies:
Polymerase-based sequencing
Enzymatic digest sequencing
Nanopore sequencing
DNA microarrays
DNA Write
(i) What DNA would you want to synthesize (e.g., write) and why?
I have no idea.
(ii) What technology or technologies would you use to perform this DNA synthesis and why?
Recombinant DNA synthesis
Oligonucleotide synthesis followed by assembly - short synthetic oligos can encode complex motifs and are joined into large DNA molecules (1 kbp+)
DNA Edit
(i) What DNA would you want to edit and why?
I have no idea. Potentially plant DNA. I don’t know anything about what DNA plants have. I would like to figure out how to increase the growth speed, change the bark texture. Or even doing experiments on yeast. Perhaps I could figure out the enzymes/proteins and what DNA/genes code for it, and then edit that.
(ii) What technology or technologies would you use to perform these DNA edits and why?
CRISPR-Cas9
Week 3 HW.1: Python Script for Opentrons Artwork
Review recitation materials and lab documentation.
Write a Python script using coordinates from the GUI via the “HTGAA26 Opentrons Colab”.
Sign up for a robot time slot and run the script on the Opentrons robot.
Submit Python file via provided form.
Artwork Design
Python Script
Week 3 HW.2: Post-Lab Reflection
2.1. Find and describe a published paper utilizing Opentrons or similar liquid handling automation tools.
The paper I have found: Slowpoke: An Automated Golden Gate Cloning Workflow for Opentrons OT‑2 and Flex
Slowpoke is a tool which generates Opentrons protocols for DNA assembly. DNA assembly is used to build longer stretches of DNA than can be synthesised in one go, by joining together shorter pieces.
It provides facilities to automate:
Golden Gate Cloning - automates the DNA assembly reaction setup, E. coli transformation, and plating.
Colony PCR - automates colony PCR screening of resulting transformants.
Users provide:
Golden Gate Cloning
Genetic toolkit map - e.g. MoClo YTK, STK plate layout
Custom parts map
Combination file
Colony PCR
Colony template positions
PCR deck maps
Reaction recipes.
The robot protocol automates the full pipeline of assembly, transformation, plating and colony PCR:
DNA and enzyme buffer extraction.
Golden gate reaction.
Transformation.
Plating.
Colony PCR.
It is compatible with multiple MoClo/Golden Gate toolkits (YTK, STK, and extensible to others).
Manual steps still required:
Colony picking - most labour-intensive step.
Sealing PCR plates in OT-2 thermocycler module.
Transferring PCR tubes to benchtop thermocycler.
Incubation, strain storage, and plasmid purification - still accounts for a lot of time.
2.2. Describe your intended automation use for your final project, including pseudocode, scripts, or implementation plans.
I intend to use a cloud lab platform to screen an array of biosensor constructs that I have designed, synthesised, and expressed using cell-free protein synthesis (CFPS).
Week 3 HW.3: Final Project Ideas
Submit 1–3 slides with three individual project concept ideas.
Week 4 HW
Part A. Conceptual Questions
Answer any NINE of the following questions from Shuguang Zhang:
1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
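Meat is roughly 20% protein by weight, so 0.2 * 500 g = 100 g of protein. Protein is essentially the only source of amino acids in meat, as carbs are sugars, and fats are triglycerides (fatty acids + glycerol).
1 g ≈ 6.022 × 10^23 daltons
100 g ≈ 6.022 × 10^25 daltons
At ~100 daltons per amino acid, that is 6.022 × 10^25 / 100 ≈ 6 × 10^23 amino acid molecules, i.e. about one mole.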
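The unit conversion can be written out as a short calculation. The 20% protein fraction and the ~100 Da average are the rough figures used in this answer; the variable names are mine.

```python
AVOGADRO = 6.02214076e23  # daltons per gram (numerically Avogadro's number)

meat_g = 500.0
protein_fraction = 0.20   # rough protein content of meat by weight
amino_acid_da = 100.0     # average amino acid mass given in the question

protein_g = meat_g * protein_fraction
protein_da = protein_g * AVOGADRO
n_amino_acids = protein_da / amino_acid_da

print(f"{protein_g:.0f} g protein = {protein_da:.3e} Da")
print(f"about {n_amino_acids:.3e} amino acids ({n_amino_acids / AVOGADRO:.2f} mol)")
```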
2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?
Humans eat beef but don’t become a cow because digestion breaks complex proteins and cells down to base-level molecules and amino acids, which our own cells then reassemble into human proteins following our own DNA.
3. Why are there only 20 natural amino acids?
For evolutionary reasons probably. Much like how human language has a finite set of phonemes which allow us to express infinitely more higher-level syllables, words, and concepts and sentences - biology has a base grammar of 20 units. This has proven to be enough - 64 codons map onto 20 amino acids plus stop signals. There may have been more but this is evidently evolutionarily optimal as it is now.
4. Can you make other non-natural amino acids? Design some new amino acids.
β-amino acids are interesting: usually the amino group is attached to the α-carbon, but here it is attached to the β-carbon. Because of this, proteases (enzymes that cleave peptide bonds, e.g. during digestion) are largely ineffective against β-peptides.
Others I googled:
Fluoroleucine — leucine with fluorine substituted in; more hydrophobic and metabolically stable
Azidohomoalanine — methionine analog with an azide group, useful for click chemistry bioconjugation
5. Where did amino acids come from before enzymes that make them, and before life started?
Meteorites that naturally carry amino acids
Miller-Urey experiment showed amino acids could form spontaneously from simple molecules + an electric arc
6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
Left-handed. Natural α-helices are right-handed because they’re built from L-amino acids. D-amino acids are the mirror image, so the resulting helix is the mirror image too — left-handed.
7. Can you discover additional helices in proteins?
Yes — beyond the common α-helix, proteins also contain 3₁₀-helices (3 residues per turn, tighter) and π-helices (4.4 residues per turn, rarer and wider). These are already known but underappreciated. Computational analysis of PDB structures keeps surfacing edge cases and unusual conformations that don’t fit neatly into existing categories.
8. Why are most molecular helices right-handed?
Because natural amino acids are L-enantiomers. The geometry of the L-α-carbon makes right-handed coiling energetically favorable — the side chains point outward without steric clashes in a right-handed helix. A left-handed helix built from L-amino acids would force side chains into the backbone, creating strain.
9. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?
β-sheets have “sticky edges” — the backbone NH and C=O groups along the edge strands are unsatisfied hydrogen bond donors/acceptors. These can pair with the edge of another β-sheet. The driving forces are hydrogen bonding along the backbone and hydrophobic stacking between sheet faces. This makes lateral growth into large, ordered aggregates thermodynamically favorable.
10. Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?
When proteins misfold or partially unfold, they expose their backbone, which can then hydrogen-bond with other misfolded proteins into a cross-β structure (β-strands running perpendicular to the fibril axis, with the backbone hydrogen bonds running parallel to it). This structure is extremely stable, often more so than the native fold, so once nucleation starts, it propagates. Many proteins will form amyloid under the right conditions; some just do so more readily due to sequence composition or environmental stress.
Yes, amyloid fibrils can be used as materials — they’re stiff, stable, and self-assembling. Researchers have used them as scaffolds for nanomaterials, hydrogels, and functional coatings.
11. Design a β-sheet motif that forms a well-ordered structure.
Part B. Protein Analysis and Visualization
(1) Briefly describe the protein you selected and why you selected it.
Insulin. A 51-amino acid peptide hormone secreted by pancreatic β-cells that regulates blood glucose by signaling cells to take up glucose. I chose it because it’s small and well-studied, historically significant (first recombinantly produced therapeutic protein), and I’m curious how something so tiny has such a large physiological effect.
(2) Identify the amino acid sequence of your protein. How long is it? What is the most frequent amino acid? How many protein sequence homologs are there? Does your protein belong to any protein family?
51 amino acids total — two chains: A (21 aa) and B (30 aa), linked by two disulfide bonds. Most frequent: leucine (L) and cysteine (C), both at 6 occurrences. Many homologs — the insulin/IGF/relaxin superfamily includes IGF-1, IGF-2, relaxin, and insulin-like peptides across many organisms. Belongs to the insulin family (InterPro: Insulin/IGF/relaxin superfamily).
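The length and composition figures above can be reproduced with a short count. The two chain sequences are an assumption on my part: they are the mature human insulin A and B chains as I have them, and should be verified against UniProt P01308 before relying on the result.

```python
from collections import Counter

# ASSUMPTION: mature human insulin chains; verify against UniProt P01308.
CHAIN_A = "GIVEQCCTSICSLYQLENYCN"            # 21 residues
CHAIN_B = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"   # 30 residues

counts = Counter(CHAIN_A + CHAIN_B)
total = sum(counts.values())
top = max(counts.values())
most_frequent = sorted(aa for aa, n in counts.items() if n == top)

print(f"total residues: {total}")
print(f"most frequent: {most_frequent} ({top} each)")
```

The six cysteines are the ones that form insulin's three disulfide bonds (two between the chains, one within the A chain).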
(3) Identify the structure page of your protein in RCSB. When was the structure solved? Is it a good quality structure? Are there any other molecules in the solved structure apart from protein? Does your protein belong to any structure classification family?
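PDB: 4INS - the 2Zn insulin hexamer, deposited in 1989 at 1.5 Å resolution. Note that the 4INS entry is porcine insulin, which differs from human insulin only at B30 (Ala instead of Thr). Good quality structure. In addition to protein, the hexamer contains two zinc ions (Zn²⁺) coordinated by His B10 residues at the center, plus water molecules.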
Why extra zinc ions? Insulin is stored as a zinc-stabilized hexamer in β-cells; once secreted, the hexamer dissociates into monomers and the zinc stays behind, so zinc is necessary for storage and secretion but not for receptor binding. Zinc deficiency is linked to impaired insulin secretion and increased type 2 diabetes risk.
In structural classification, insulin belongs to the “Insulin-like” fold under the all-α class.
(4) Open the structure in 3D visualization software. Visualize as “cartoon”, “ribbon”, and “ball and stick”. Color by secondary structure — does it have more helices or sheets? Color by residue type — what can you tell about hydrophobic vs hydrophilic distribution? Visualize the surface — does it have any binding pockets?
Color by secondary structure: does it have more helices or sheets?
It has more helices.
Red spirals = α-helices
Yellow flat arrow shapes = β-sheets
Color by residue type: what can you tell about hydrophobic vs hydrophilic distribution?
PyMOL: util.cbag
- green is helices. I don’t see any β-sheets.
Visualize the surface — does it have any binding pockets?
Red is helix, Yellow is sheet, Green is loop.
PyMOL: show surface
The surface doesn’t show a deep binding pocket — the receptor-binding interface is relatively flat.
Use ESM2 to generate an unsupervised deep mutational scan. Explain any particular pattern (choose a residue and mutation that stands out). (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
Latent Space Analysis
Use the provided sequence dataset to embed proteins in reduced dimensionality. Analyze the different formed neighborhoods. Place your protein in the resulting map and explain its position and similarity to its neighbors.
Explain its position and similarity to its neighbors
!!! TODO - I don’t know enough to describe it.
C2. Protein Folding
Fold your protein with ESMFold. Do the predicted coordinates match your original structure? Try changing the sequence — first some mutations, then large segments. Is your protein structure resilient to mutations?
!!! TODO - Cannot see the coordinates. Structure looks interesting.
Use ProteinMPNN to inverse-fold your protein backbone. Analyze the predicted sequence probabilities and compare to the original. Input the predicted sequence into ESMFold and compare the predicted structure to your original.
Part D. Group Brainstorm on Bacteriophage Engineering
NA - Sick
Week 5 HW
Part A. SOD1 Binder Peptide Design
Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.
Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.
Challenge: Design short peptides that bind mutant SOD1, then decide which ones are worth advancing toward therapy.
Models used:
PepMLM: target sequence-conditioned peptide generation via masked language modeling
Retrieve the human SOD1 sequence from UniProt (P00441), introduce the A4V mutation, and use the PepMLM Colab to generate four peptides of length 12 amino acids conditioned on the mutant sequence. Add the known binder FLYRWLPSRRGG for comparison. Record perplexity scores.
A4V mutant SOD1 sequence (deleted M at position 1, changed A→V at position 4):
A note on perplexity: A lower perplexity score means higher model confidence that the peptide satisfies the criteria for binding the target.
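To make the perplexity note concrete: perplexity is the exponential of the mean negative log-probability a model assigns to each token. The sketch below is not the PepMLM implementation; the per-residue probabilities are made-up placeholders used only to show that more confident predictions give a lower score.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-probability) over the residues of a peptide."""
    if not token_probs:
        raise ValueError("need at least one probability")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical probabilities for a 12-residue peptide (not model output).
confident = [0.5] * 12    # model gives each residue probability 0.5
uncertain = [0.05] * 12   # model gives each residue probability 0.05

print(perplexity(confident), perplexity(uncertain))
```

A perplexity of k can be read as the model being about as unsure as if it were choosing uniformly among k residues at each position.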
Part 2: Evaluate Binders with AlphaFold3
Submit each peptide + mutant SOD1 as separate chains to the AlphaFold Server. Record the ipTM score and describe where each peptide appears to bind — does it localize near the N-terminus (A4V site), the β-barrel, or the dimer interface? Is it surface-bound or partially buried? In a short paragraph, describe the ipTM values and whether any PepMLM-generated peptide matches or exceeds the known binder.
Peptide | Binding location | ipTM score
WRSPAVAVAHWE | None | 0.28
WRVGWVGVELKE | None | 0.35
WRSPAAXIEHKX | None | 0.33
WRVYAAXIEWGK | None | 0.34
Part 3: Evaluate Properties in the PeptiVerse
Using PeptiVerse, evaluate the therapeutic properties of each peptide against the A4V mutant SOD1 sequence. Check: predicted binding affinity, solubility, hemolysis probability, net charge (pH 7), and molecular weight.
Peptide | Solubility | Hemolysis | Binding Affinity | MW (Da) | Net Charge (pH 7)
WRSPAVAVAHWE | 1.0 | 0.044 (Non) | 5.361 (Weak) | 1408.6 | -0.14
WRVGWVGVELKE | 1.0 | 0.117 (Non) | 7.089 (Medium) | 1457.7 | -0.23
WRSPAAXIEHKX | 1.0 | 0.011 (Non) | 4.645 (Weak) | 1158.5 | 0.85
WRVYAAXIEWGK | 1.0 | 0.043 (Non) | 6.724 (Weak) | 1360.7 | 0.76
FLYRWLPSRRGG (known) | 1.0 | 0.047 (Non) | 5.962 (Weak) | 1507.7 | 2.76
The best peptide to advance for wet lab validation would be WRVGWVGVELKE due to its relatively high binding affinity (7.089, Medium).
Part 4: Generate Optimized Peptides with moPPIt
Using the moPPIt Colab: paste your A4V mutant SOD1 sequence, choose specific residue indices to target (e.g. near position 4, the dimer interface, or another surface patch), set peptide length to 12 aa, and enable motif + affinity guidance. Briefly describe how the moPPIt peptides differ from your PepMLM peptides. How would you evaluate these before advancing to clinical studies?
Binder | Hemolysis | Solubility | Affinity | Motif
SVKTKCCTTYQS | 0.964 | 0.917 | 6.576 | 0.890
DDTKKCSCIQTH | 0.975 | 0.917 | 6.314 | 0.915
ENGETFQCTKKV | 0.970 | 0.833 | 6.044 | 0.935
KKSKKAFVCCVC | 0.963 | 0.667 | 8.172 | 0.614
Given the long execution time and computational resources moPPIt requires, its main advantage over PepMLM (in this context) is the motif score: PeptiVerse offered no option to check motif specificity. All other properties of the PepMLM-generated sequences were comparable to those of the moPPIt peptides.
Part B. BRD4 Drug Discovery Platform Tutorial
(Optional — skipped)
Part C. Final Project: L-Protein Mutants
High-level summary: The objective of this assignment is to improve the stability and auto-folding of the lysis protein (L protein) of the MS2 phage. Understanding this mechanism is key to understanding how phages could help address antibiotic resistance.
Chose option 3: generating random mutations in the lysis protein while avoiding loss-of-function or nonsense codons. A Python script (Colab) was used to load active mutations from experimental data and apply them randomly to unique positions.
AF2 Multimer was used to co-fold mutant sequence 1 with DnaJ. The pLDDT score indicates low model confidence in the predicted fold of the mutant L protein. Overall, the random mutation approach is a very time-consuming way to obtain leads.
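The random-mutation step described above can be sketched roughly as follows. This is not the original Colab script: the function name, the toy sequence, and the table of tolerated substitutions are hypothetical placeholders standing in for the real L-protein sequence and the experimental data.

```python
import random


def apply_random_mutations(seq, allowed, n_mut, seed=None):
    """Apply n_mut substitutions at distinct positions.

    allowed maps a 0-based position to the residues tolerated there.
    Stop symbols ("*") and the wild-type residue are never chosen.
    """
    rng = random.Random(seed)
    # Keep only positions that offer at least one usable substitution.
    candidates = [
        pos for pos, subs in allowed.items()
        if 0 <= pos < len(seq) and any(s not in (seq[pos], "*") for s in subs)
    ]
    if n_mut > len(candidates):
        raise ValueError("not enough mutable positions")
    residues = list(seq)
    labels = []
    for pos in sorted(rng.sample(candidates, n_mut)):
        choices = sorted(s for s in allowed[pos] if s not in (seq[pos], "*"))
        new = rng.choice(choices)
        labels.append(f"{seq[pos]}{pos + 1}{new}")  # e.g. "A4S", 1-based
        residues[pos] = new
    return "".join(residues), labels


# Toy inputs only (NOT the MS2 L protein or real mutation data).
toy_seq = "MKTAYIAKQR"
toy_allowed = {1: "RQ", 3: "SG", 6: "V", 8: "E*"}
mutant, labels = apply_random_mutations(toy_seq, toy_allowed, 2, seed=0)
print(mutant, labels)
```

Sampling positions without replacement is what keeps every mutation at a unique site; fixing the seed makes a given mutant reproducible.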
Week 6 HW
What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?
A Phusion HF PCR Master Mix is a pre-combined PCR reaction system optimised for a specific engineered DNA polymerase.
Phusion DNA polymerase: provides the catalytic activity that synthesises DNA, and includes 3’→5’ exonuclease proofreading activity to reduce errors
Reaction buffer: maintains the pH and ionic conditions the polymerase needs
MgCl₂: supplies magnesium ions, an essential cofactor for the polymerase
dNTPs: deoxynucleotide triphosphates (dATP, dCTP, dGTP, and dTTP), the building blocks of the new strands
Stabilizers/additives
Water
A typical setup only requires adding:
Forward primer
Reverse primer
Template DNA
Additional water to reach final volume
What are some factors that determine primer annealing temperature during PCR?
Factors:
Primer melting temperature — dominant factor
GC content of primer
Primer length
Sequence features
Salt concentration in the reaction buffer
Template–primer mismatch tolerance
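As a rough illustration of how primer length and GC content feed into melting temperature, the sketch below uses the Wallace rule (Tm ≈ 2 °C per A/T plus 4 °C per G/C), which is only a crude estimate for short primers and ignores salt concentration and mismatches. The 5 °C offset below the lower Tm is an assumed rule of thumb, and the primer sequences are made up; polymerase vendors publish their own calculators, which should take precedence.

```python
def wallace_tm(primer):
    """Rough melting temperature in °C: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    if set(p) - set("ACGT"):
        raise ValueError("primer must contain only A, C, G, T")
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def gc_fraction(primer):
    """Fraction of bases that are G or C."""
    p = primer.upper()
    return (p.count("G") + p.count("C")) / len(p)


def annealing_estimate(fwd, rev, offset=5):
    """Start a few degrees below the lower Tm (offset is an assumption)."""
    return min(wallace_tm(fwd), wallace_tm(rev)) - offset


# Made-up primers for illustration only.
fwd = "ATGCATGCATGCATGCAT"
rev = "GGCCGGCCATATATAT"
print(wallace_tm(fwd), wallace_tm(rev), annealing_estimate(fwd, rev))
```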
There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.
PCR creates DNA fragments by enzymatic replication using primers that define the fragment boundaries.
Protocol: The reaction contains template DNA, forward and reverse primers, dNTPs, buffer, Mg²⁺, and a thermostable DNA polymerase (e.g. Phusion or Taq). The protocol cycles temperature: denaturation (~95 °C) separates strands, annealing (~50–65 °C) allows primers to bind, and extension (~72 °C) synthesizes new DNA. After ~25–35 cycles, the region between the primers is exponentially amplified, producing many linear copies of a precisely defined sequence.
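The exponential amplification can be made concrete with an idealised model in which every cycle multiplies the copy number by (1 + efficiency). This is a sketch under that assumption only; real reactions plateau as primers and dNTPs run out, so the numbers are upper bounds.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealised copy number after a number of cycles.

    efficiency = 1.0 means perfect doubling every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


print(pcr_copies(1, 30))         # perfect doubling from a single template
print(pcr_copies(1, 30, 0.9))    # same run at 90% per-cycle efficiency
```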
Restriction enzyme digestion produces linear fragments by cutting DNA at specific recognition sequences using restriction endonucleases.
Protocol: The protocol involves incubating DNA with one or more enzymes in the appropriate buffer (often ~37 °C) for a set time. The enzyme recognizes a short sequence (typically 4–8 bp) and cleaves the phosphodiester backbone, generating fragments with defined ends (blunt or sticky). The resulting fragment sizes depend entirely on where those recognition sites exist in the DNA.
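How fragment sizes follow from site positions can be sketched with a simple string search. The example assumes a linear molecule and a palindromic site such as EcoRI (GAATTC, cut after the first G on the top strand); a non-palindromic site would also need a search on the reverse complement. The test sequence is made up.

```python
def digest_linear(seq, site, cut_offset):
    """Fragment lengths after cutting a linear sequence at every site.

    cut_offset is how many bases into the site the top-strand cut falls
    (1 for EcoRI, G^AATTC).
    """
    seq, site = seq.upper(), site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    # Fragment boundaries: both ends of the molecule plus every cut.
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Made-up 24 bp sequence with two EcoRI sites.
toy = "AAAA" + "GAATTC" + "CCCCCC" + "GAATTC" + "TT"
fragments = digest_linear(toy, "GAATTC", 1)
print(fragments)
```

The fragment lengths always sum to the length of the input, which is a quick sanity check when comparing against a simulated gel.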
Conceptually, PCR synthesizes a fragment by copying between two designed boundaries, whereas restriction digestion extracts a fragment by cutting an existing molecule at predetermined sequence motifs.
To understand when both are useful, consider an objective: engineer E. coli to produce human insulin, which requires building a plasmid containing the insulin gene under a bacterial promoter.
Three different scenarios for obtaining the insulin gene:
DNA comes from a biological sample (e.g. human genomic DNA). The insulin gene is buried inside billions of unrelated bases, so PCR is used to isolate and amplify only that specific region using primers that define its boundaries. PCR is therefore used when the goal is to retrieve a specific gene from a complex DNA mixture.
DNA already exists in a plasmid (e.g. moving GFP from plasmid A into plasmid B). The fragment is already isolated, so restriction enzymes are used to cut DNA at specific recognition sequences, allowing the gene to be excised and inserted into another vector. Restriction digestion is therefore used when the task is to cut and rearrange existing DNA molecules.
DNA is chemically synthesized because the sequence is already known. The synthesized fragment may still be PCR-amplified if more copies are needed, and restriction enzymes (or similar assembly methods) are used to insert it into plasmids. In practice, PCR isolates or amplifies sequences, while restriction enzymes cut DNA molecules so fragments can be inserted, removed, or reorganized.
How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?
Desiderata: Linear DNA fragments whose terminal 20–40 bp regions are perfectly homologous to the neighboring fragment, unique, structurally stable, and present in a clean preparation so the Gibson enzymes can expose the overlaps, allow annealing, fill gaps, and ligate the final construct.
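The overlap requirement can be checked computationally before ordering primers. The sketch below only tests that the end of each fragment exactly matches the start of the next within a 20–40 bp window; it does not check uniqueness, secondary structure, or purity. Function names and the toy fragments are hypothetical.

```python
def shared_overlap(upstream, downstream, min_len=20, max_len=40):
    """Longest suffix of upstream equal to a prefix of downstream.

    Only lengths within [min_len, max_len] are considered; returns 0 if
    there is no such exact match.
    """
    up, down = upstream.upper(), downstream.upper()
    longest = min(max_len, len(up), len(down))
    for n in range(longest, min_len - 1, -1):
        if up[-n:] == down[:n]:
            return n
    return 0


def junction_overlaps(fragments, circular=True):
    """Overlap length at each junction, in assembly order."""
    pairs = list(zip(fragments, fragments[1:]))
    if circular:
        pairs.append((fragments[-1], fragments[0]))  # close the plasmid
    return [shared_overlap(a, b) for a, b in pairs]


# Toy fragments built around two made-up 25 bp overlaps.
ov1 = "ACGTTGCAAGCTTGCATGCCTGCAG"
ov2 = "GGATCCTCTAGAGTCGACCTGCAGG"
frag_a = ov2 + "A" * 50 + ov1
frag_b = ov1 + "C" * 50 + ov2
overlaps = junction_overlaps([frag_a, frag_b])
print(overlaps)
```

A junction reporting 0 means the two ends need redesigning, typically by adding the neighbor's terminal sequence to a PCR primer tail.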
How does the plasmid DNA enter the E. coli cells during transformation?
A cell is bounded by a lipid membrane, but this is not a solid wall; it is closer to a dense molecular fluid. Each phospholipid is held in place only by weak interactions (hydrophobic and van der Waals forces). Just as an electron has no fixed hard boundary, only the field it creates, a cell has no rigid shell, only this dense fluid layer.
Lipids constantly fluctuate, and small gaps appear and disappear.
During transformation, the culture is subjected to a brief heat shock (~42 °C for ~30–60 s). The rapid temperature change suddenly increases the kinetic energy of the lipids, which is thought to transiently disorder phospholipid packing and open short-lived aqueous pores in the bilayer. Plasmid DNA molecules enter through these pores.
This is paired with calcium ion treatment, which neutralises negative charges on the DNA phosphate backbone and the membrane surface and thus reduces electrostatic repulsion between the DNA and the cell envelope.
Describe another assembly method in detail (such as Golden Gate Assembly)
Explain the other method in 5–7 sentences plus diagrams (either handmade or online).
Design fragments with Type IIS sites and specific 4-bp overhangs. PCR amplify or synthesize fragments with those flanking sites. Mix fragments, plasmid backbone, Type IIS enzyme, ligase, and buffer. Run digestion–ligation thermal cycles. Transform assembled plasmid into bacteria.
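Because Type IIS enzymes cut outside their recognition site, the 4-bp overhangs are free design choices, and the assembly order is set entirely by which overhangs match. A minimal sketch of a design check is shown below, assuming a circular assembly; the part names and overhang sequences are illustrative placeholders, and the function is hypothetical rather than part of any assembly tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def check_overhangs(parts):
    """Return a list of design problems (empty list means none found).

    parts: (name, left_overhang, right_overhang) in assembly order; the
    last part is expected to join back onto the first (circular plasmid).
    """
    problems = []
    junctions = []
    for i, (name, _left, right) in enumerate(parts):
        next_name, next_left, _ = parts[(i + 1) % len(parts)]
        if right.upper() != next_left.upper():
            problems.append(f"{name} -> {next_name}: {right} != {next_left}")
        junctions.append(right.upper())
    seen = set()
    for oh in junctions:
        if len(oh) != 4:
            problems.append(f"{oh}: overhang is not 4 bp")
        if oh == revcomp(oh):
            problems.append(f"{oh}: palindromic, can ligate to itself")
        if oh in seen or revcomp(oh) in seen:
            problems.append(f"{oh}: clashes with another junction")
        seen.add(oh)
    return problems


design = [
    ("backbone", "CGCT", "GGAG"),
    ("promoter", "GGAG", "AATG"),
    ("cds", "AATG", "CGCT"),
]
print(check_overhangs(design))
```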
Model this assembly method with Benchling or Asimov Kernel!
https://2026a.htgaa.org/2026a/course-pages/weeks/week-01/lab/index.html
Week 1 Lab: Pipetting
Pipetting
Units
Moles (mol) measure the absolute amount of a substance
Molarity (M) measures the concentration of that substance in a solution
Moles (mol): A unit representing particles (atoms, molecules, etc.).
Molarity (M): Concentration defined as moles of solute per liter of solution (mol/L).
Conversions
1 L = 1000 mL = 1,000,000 μL
1 M = 1000 mM = 1,000,000 μM
Pipette sizes
P20, P200, P1000: hold up to 20 μL, 200 μL, and 1000 μL (1 mL), respectively
Equipment
Pipette
Eppendorf Tube
PCR tube strip
Reagents:
dH2O - distilled water (purified)
Gel loading dye - adds density and color so that samples sink into the gel wells and can be tracked during electrophoresis
Assays
A procedure to determine whether the target is present or not
The target can be a substance, chemical, entity, bacteria, etc.
Detection can be qualitative or quantitative
Serial dilutions
What is this?
It’s a geometric process that steps a concentration down by a fixed ratio at each stage.
This procedure conveys useful information in multiple areas:
When counting colonies by eye, you cannot reliably count much above 10^2, so a 1 mL broth that might contain 10^7–10^9 cells has to be diluted until a sample falls in that countable range. The underlying assumption is that every dilution step is well mixed, so each aliquot remains a uniform sample of the original broth.
For virology/immunology, you define strength by the last dilution that still works (neutralizes, infects, agglutinates)
Dose–response curves are log-spaced. Serial dilution is a geometric process: each step reduces the concentration by a fixed ratio (e.g. 1:10), so the resulting concentrations are evenly spaced on a logarithmic scale.
Serial dilution is how you map an unknown huge concentration into the measurable window of any detector
How do you dilute?
C1 * V1 = C2 * V2
rearrange: V1 = (C2*V2) / C1
V_water = V2 - V1
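The rearranged formula translates directly into a small helper. This is a minimal sketch; the function name and the example numbers (10 mM stock, 1 mM target, 1000 μL final volume) are illustrative assumptions. Both concentrations must share one unit and both volumes another.

```python
def dilution_volumes(c_stock, c_final, v_final):
    """Return (v_stock, v_water) from C1 * V1 = C2 * V2.

    c_stock and c_final must use the same concentration unit; the returned
    volumes are in the same unit as v_final.
    """
    if c_stock <= 0 or c_final <= 0:
        raise ValueError("concentrations must be positive")
    if c_final > c_stock:
        raise ValueError("cannot dilute to a higher concentration")
    v_stock = c_final * v_final / c_stock   # V1 = (C2 * V2) / C1
    return v_stock, v_final - v_stock       # V_water = V2 - V1


# Example: 10 mM stock diluted to 1 mM in a final volume of 1000 uL.
v_stock, v_water = dilution_volumes(10, 1, 1000)
print(v_stock, v_water)
```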
How do you do serial dilutions?
Scenario: The stock concentration of a mystery substance (MS) is 5 M. Calculate how to dilute to 100 µM (0.1 mM):
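Use two dilution steps: 1:499 (1 part stock + 499 parts diluent, a 500x dilution) followed by 1:99 (a 100x dilution), for 500 × 100 = 50,000x overall: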
Step 1: Dilute 5 M (5,000,000 µM) to 10,000 µM (500x dilution).
Step 2: Dilute 10,000 µM to 100 µM (100x dilution).
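The two-step scenario can be turned into pipetting volumes with a short helper. This is a sketch under one added assumption: each step is made up to a final volume of 1000 μL, which the scenario itself does not specify. The function name is hypothetical.

```python
def serial_dilution_plan(c_stock, factors, v_final):
    """Plan a chain of dilutions.

    Each step transfers v_final / factor from the previous tube and tops up
    with diluent to v_final. Returns (v_transfer, v_diluent, resulting_conc)
    per step, with concentration in the unit of c_stock.
    """
    steps = []
    conc = c_stock
    for factor in factors:
        v_transfer = v_final / factor
        conc = conc / factor
        steps.append((v_transfer, v_final - v_transfer, conc))
    return steps


# 5 M = 5,000,000 uM stock; 500x then 100x; 1000 uL per step (assumed).
plan = serial_dilution_plan(5_000_000, [500, 100], 1000)
for v_transfer, v_diluent, conc in plan:
    print(f"{v_transfer} uL + {v_diluent} uL diluent -> {conc} uM")
```

Splitting the 50,000x dilution into two steps keeps every transfer at 2 μL or more, whereas a single step would require pipetting 0.02 μL into 1000 μL.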