Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

    Biological Engineering Application Proposed application: I want to develop a computer program that helps early-stage biological research by making it easier and more responsible for researchers to analyze biological data. The tool would help organize, check, and understand biological datasets (such as genomic or protein-related data) using bioinformatics and AI-assisted methods. It would also clearly show where there is doubt and where there is a risk of misuse.

  • Week 2 HW: DNA READ, WRITE & EDIT

    Part 1: Benchling & In-silico Gel Art Simulated EcoRI Digestion Simulated HindIII Digestion Simulated BamHI Digestion Simulated KpnI Digestion Simulated EcoRV Digestion

  • Week 3 HW: Lab Automation

    Opentrons Artwork Python Script This week, I explored laboratory automation by writing and simulating a Python script for the Opentrons liquid handling robot using Google Colab. As a Committed Listener, I was not physically running the robot, but I focused on understanding the automation logic and API structure that controls robotic liquid handling.

  • Week 4 HW: Protein Design Part I

    Part A. Conceptual Questions 1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons) Eating a 500-gram piece of meat provides you with approximately 6.62 x 10²³ molecules of amino acids. Since raw meat is roughly 22% protein, you are consuming about 110 grams of actual protein; dividing this by the average mass of 100 Daltons per amino acid tells us you have 1.1 moles of them. By multiplying that by Avogadro’s constant, we find that you are swallowing about 662 sextillion molecules, a number so large it exceeds the number of stars in the observable universe

  • Week 5 HW: Protein Design Part II

    Part A: SOD1 Binder Peptide Design Part 1: PepMLM Peptide Generation sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2 MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ SOD1 A4V mutation sequence: MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ To generate candidate binders, I retrieved the human SOD1 sequence from UniProt (accession P00441) and introduced the A4V mutation at position 4. I then used the PepMLM Colab notebook to generate four 12-amino-acid peptide sequences conditioned on the mutant protein. Since only four peptides were generated, a moderate Top-K(5) ensures enough variation between candidates to evaluate different potential binding interactions with Superoxide Dismutase 1.

Subsections of Homework

Week 1 HW: Principles and Practices


Biological Engineering Application

Proposed application: I want to develop a computer program that helps early-stage biological research by making it easier and more responsible for researchers to analyze biological data. The tool would help organize, check, and understand biological datasets (such as genomic or protein-related data) using bioinformatics and AI-assisted methods. It would also clearly show where there is doubt and where there is a risk of misuse.

Why this interests me: My interest stems from my academic exposure to bioinformatics and my curiosity about how software can meaningfully support biological research without lowering safety standards. As biological technologies become more accessible, I am particularly interested in how computational systems might help guide responsible use rather than accelerate harm or misuse.

Governance/Policy Goals for an Ethical Future

Ensure that AI- and software-assisted biological tools promote constructive scientific progress while minimizing risks related to misuse, safety failures, and inequitable access.

Sub-goals:

  1. Non-maleficence: prevent intentional or accidental misuse of biological data or tools.
  2. Safety: reduce risks related to unsafe experimental design or misinterpretation of results.
  3. Equity and access: ensure tools do not disproportionately benefit only well-resourced institutions or regions.
  4. Transparency: encourage reproducibility and responsible documentation.

Governance Actions

Option 1: Mandatory Safety & Ethics Training for Tool Access

Purpose: Currently, many computational biology tools are accessible with minimal oversight. This action proposes requiring basic ethics and safety training before granting access to advanced biological analysis features.

Design

  • Actor(s): Universities, research institutions, platform developers
  • Short certification modules embedded into the tool onboarding
  • Required before unlocking sensitive or high-risk functionalities

Assumptions

  • Users will engage honestly with the training
  • Training content is kept up to date
  • Institutions agree on baseline standards

Risks of Failure & “Success”

  • Risk: Training becomes a box-checking exercise
  • Unintended success: May exclude independent researchers or under-resourced users if not designed inclusively

Option 2: Built-in Technical Safeguards and Usage Monitoring

Purpose: Introduce technical constraints that limit high-risk outputs and flag potentially dangerous use cases.

Design

  • Actor(s): Software developers, private companies
  • Automated flags, rate limits, and warning prompts
  • Optional audit logs for institutional users

Assumptions

  • Risky behaviors can be meaningfully detected
  • Developers correctly anticipate misuse patterns

Risks of Failure & “Success”

  • Risk: Over-blocking legitimate research
  • Unintended success: Users may try to bypass safeguards using alternative tools

Option 3: Norms and Incentives for Transparent Documentation

Purpose: Encourage researchers to document both successes and failures to promote safer learning and reproducibility, much like chess players recording every move of a match.

Design

  • Actor(s): Journals, funding bodies, academic institutions
  • Incentives for publishing negative or null results
  • Standardized documentation templates

Assumptions

  • Researchers value incentives over speed or prestige
  • Documentation does not expose sensitive information

Risks of Failure & “Success”

  • Risk: Increased administrative burden
  • Unintended success: Over-disclosure of sensitive methods

Scoring Governance Actions

Does the option:                                 | Option 1 | Option 2 | Option 3
Enhance Biosecurity                              |          |          |
• By preventing incidents                        | 2        | 1        | 2
• By helping respond                             | 2        | 2        | 1
Foster Lab Safety                                |          |          |
• By preventing incidents                        | 1        | 2        | 2
• By helping respond                             | 2        | 2        | 1
Protect the environment                          |          |          |
• By preventing incidents                        | 2        | 2        | 2
• By helping respond                             | 2        | 3        | 1
Other considerations                             |          |          |
• Minimizing costs and burdens to stakeholders   | 2        | 3        | 1
• Feasibility?                                   | 1        | 2        | 1
• Not impede research                            | 2        | 3        | 1
• Promote constructive applications              | 1        | 2        | 1

Prioritization & Recommendation

Based on the scoring, I would prioritize a combination of Option 1 (training requirements) and Option 3 (documentation norms). Together, these approaches encourage responsible behavior without heavily restricting legitimate research. While technical safeguards (Option 2) are important, they should be applied cautiously to avoid impeding innovation.

Target audience: Academic institutions and platform developers, with encouragement from funding agencies.

Trade-offs & uncertainties:

  • Balancing accessibility with responsibility
  • Ensuring governance mechanisms evolve alongside technology
  • Risk that voluntary norms are unevenly adopted

Reflection: Ethical Concerns from This Week

This week highlighted how easily powerful biological tools can shift from beneficial to harmful depending on context, intent, and oversight. One ethical concern that stood out to me was the assumption that access alone equates to understanding or responsibility. I was also struck by how governance often lags behind technical capability.

Proposed additional governance action: Introduce interdisciplinary review processes that include not only scientists but also ethicists, policymakers, and community representatives when developing or deploying new biological tools.

Assignment (Week 2 Lecture Prep)

Homework Questions from Professor Jacobson

Question 1: Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?

Answer: Nature’s machinery for copying DNA, called DNA polymerase, has an error rate of approximately 1 error per million base pairs (1:10⁶). The human genome consists of approximately 3 billion base pairs (3 × 10⁹), so at that rate each replication of the genome would introduce on the order of 3,000 errors.

Biology has evolved mechanisms to address this discrepancy and maintain genomic integrity:

  • Proofreading by DNA Polymerase: DNA polymerase has a built-in proofreading ability. If it incorporates an incorrect nucleotide, it can detect the error, remove the incorrect base, and replace it with the correct one. This substantially lowers the number of errors left behind during synthesis.

  • DNA Repair Mechanisms: Cells have additional repair systems, such as mismatch repair, which identify and correct errors that escape the proofreading process. Together with proofreading, these mechanisms bring the overall error rate down to roughly 1 error per billion base pairs (1:10⁹) or lower and help maintain the accuracy of the genome.

These processes ensure that the human genome remains stable and functional despite its vast size and the inherent error rate of DNA replication.
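The scale of the discrepancy can be checked with a few lines of arithmetic. The sketch below multiplies the genome length by an error rate; the genome size and the two rates are the rounded, order-of-magnitude figures quoted in this answer, and the function name is only illustrative.

```python
# Order-of-magnitude sketch: rounded figures from the answer, not measurements.
HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion base pairs


def expected_errors(genome_bp, bases_per_error):
    """Expected number of copying errors in one pass over the genome."""
    return genome_bp / bases_per_error


for label, bases_per_error in [("1:10^6", 10**6), ("1:10^9", 10**9)]:
    errors = expected_errors(HUMAN_GENOME_BP, bases_per_error)
    print(f"error rate {label}: about {errors:g} errors per genome copy")
```

Going from one error per million bases to one per billion changes the expected count per genome copy from thousands to a handful, which is why the repair layers matter.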

Question 2: How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

Answer: To determine how many different ways DNA can code for an average human protein, we need to consider the following:

  • Average Length of a Human Protein: The document states that the average human protein is 1036 base pairs long. Since each amino acid is encoded by a codon (a sequence of three nucleotides), this corresponds to approximately 345 amino acids (1036 ÷ 3 ≈ 345).
  • Codon Redundancy: The genetic code is degenerate, meaning multiple codons can encode the same amino acid. For example, there are 64 possible codons, but only 20 amino acids, so many amino acids are encoded by more than one codon. The number of codons per amino acid varies (e.g., leucine has 6 codons, while methionine has only 1).
  • Number of Possible DNA Codes: If we assume an average of 3 codons per amino acid (a rough estimate based on the genetic code), the number of possible DNA sequences for an average human protein would be approximately 3³⁴⁵, which is an astronomically large number.
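To make the estimate concrete, the sketch below computes the rough 3³⁴⁵ figure and, as an alternative, the exact count for a specific peptide by multiplying the number of synonymous codons for each residue. The per-residue counts follow the standard genetic code; the function name and the example peptide are only illustrative.

```python
import math

# Number of synonymous codons per amino acid in the standard genetic code.
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(protein):
    """Exact number of DNA sequences that encode the given protein."""
    return math.prod(CODONS_PER_AA[aa] for aa in protein)


# Rough estimate from the text: 3 codons per residue, 345 residues.
rough_estimate = 3 ** 345
print("digits in 3^345:", len(str(rough_estimate)))

# Exact count for a short example peptide.
print("encodings of MSKGEELF:", count_encodings("MSKGEELF"))
```

The exact product depends on composition: a protein rich in leucine, serine, or arginine (6 codons each) has more encodings than one rich in methionine or tryptophan (1 codon each).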

Reasons Why All These Codes Don’t Work in Practice:

  • Codon Bias: Different organisms have preferences for certain codons over others, known as codon bias. Codons that are rarely used in the host organism may lead to inefficient translation or reduced protein expression.
  • mRNA Secondary Structures: Some DNA sequences may produce mRNA with secondary structures (e.g., hairpins) that interfere with ribosome binding or translation, reducing the efficiency of protein synthesis.
  • Regulatory Elements: DNA sequences may inadvertently contain regulatory elements (e.g., promoters, enhancers, or silencers) that affect transcription or translation, leading to unintended consequences.
  • Protein Folding and Function: While the amino acid sequence may be correct, the codon choice can influence the speed of translation, which in turn affects protein folding. Improper folding can result in non-functional or misfolded proteins.
  • Post-Translational Modifications: Some DNA sequences may not allow for proper post-translational modifications, which are critical for the protein’s function.
  • Codon Context Effects: The sequence context around codons can influence translation efficiency and accuracy, meaning that not all codon combinations are equally effective.

In practice, researchers often optimize codon usage for the host organism to ensure efficient and accurate protein production.

Homework Questions from Dr. LeProust

Question 1: What’s the most commonly used method for oligo synthesis currently?

Answer: The most commonly used method for oligonucleotide synthesis currently is solid-phase phosphoramidite chemistry.

Question 2: Why is it difficult to make oligos longer than 200nt via direct synthesis?

Answer: Making oligonucleotides (oligos) longer than 200 nucleotides (nt) via direct chemical synthesis is difficult primarily because of the exponential decrease in yield caused by coupling efficiencies being less than 100%, and the cumulative increase in chemical errors over long synthesis cycles.

Question 3: Why can’t you make a 2000bp gene via direct oligo synthesis?

Answer: Direct, single-step chemical synthesis of a 2000 base pair (bp) DNA sequence is currently not possible using standard automated oligonucleotide synthesis, primarily due to the exponential decrease in efficiency, accumulation of errors, and the inability to purify such long single-stranded molecules. While 2000 bp genes are commonly created, they are assembled from smaller oligonucleotides (typically 40-200 bases) rather than synthesized in one direct step.
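The exponential drop in yield mentioned in the last two answers can be sketched numerically. If every coupling step succeeds with the same probability, the fraction of full-length product is that probability raised to the number of couplings. The efficiencies below (99% and 99.5%) are assumed illustrative values, not figures from the lecture, and the model ignores other error sources such as depurination.

```python
def full_length_fraction(length_nt, coupling_efficiency):
    """Fraction of strands that reach full length.

    A strand of `length_nt` bases needs `length_nt - 1` coupling steps,
    because the first base is already attached to the solid support.
    """
    return coupling_efficiency ** (length_nt - 1)


for length in (20, 200, 2000):
    for efficiency in (0.99, 0.995):  # assumed per-step efficiencies
        fraction = full_length_fraction(length, efficiency)
        print(f"{length:>5} nt at {efficiency:.1%} per step: {fraction:.2e}")
```

Under these assumptions a 200 nt oligo still gives a usable fraction of full-length product, while a 2000 nt strand gives essentially none, which is consistent with assembling genes from shorter oligos.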

Homework Question from George Church

Question 1: What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

Answer: The 10 essential amino acids that must be acquired through the diet of most animals, including humans, are Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Threonine, Tryptophan, Valine, and Arginine. These are considered “essential” because animal bodies cannot synthesize these amino acids internally at a rate sufficient to meet metabolic needs and must instead obtain them from food.

How the Science Affects My View of the Contingency: Because lysine is already an essential amino acid for animals, the “Lysine Contingency” cannot work as a practical, reliable safety feature. Vertebrates cannot synthesize lysine and must obtain it from food, so engineering a dinosaur to be “lysine-deficient” is redundant: it was already dependent on dietary lysine. The contingency fails because dinosaurs can easily find lysine in their environment. Herbivores can eat soy, beans, and other lysine-rich plants, while carnivores can obtain it by eating those herbivores. In the franchise, this failure serves as a key plot point demonstrating human hubris and the inability to fully control nature.

AI Use Disclosure: I used ChatGPT (OpenAI) as a brainstorming and structuring aid while working on this assignment. Specifically, I used AI-generated prompts to help clarify the assignment requirements, organize my ideas, and explore example frameworks for governance and ethics analysis. All final interpretations, reflections, and written content were reviewed, adapted, and contextualized by me.

Week 2 HW: DNA READ, WRITE & EDIT


Part 1: Benchling & In-silico Gel Art

Simulated EcoRI Digestion

Simulated HindIII Digestion

Simulated BamHI Digestion

Simulated KpnI Digestion

Simulated EcoRV Digestion

Simulated SacI Digestion

Simulated SalI Digestion

Digital Gel Art. The enzyme used to digest the DNA is listed in each respective lane.
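What a simulated digestion computes can be sketched in a few lines: find every recognition site, cut at the enzyme's position within the site, and report the fragment lengths that would appear as bands. The recognition sequences and top-strand cut offsets below are the standard ones for the seven enzymes listed; the toy sequence and function names are hypothetical, and the sketch treats the DNA as linear and only tracks the top strand.

```python
# Recognition site and top-strand cut offset (bases from the start of the site).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "KpnI": ("GGTACC", 5),     # GGTAC^C
    "EcoRV": ("GATATC", 3),    # GAT^ATC
    "SacI": ("GAGCTC", 5),     # GAGCT^C
    "SalI": ("GTCGAC", 1),     # G^TCGAC
}


def cut_positions(seq, site, offset):
    """Return every top-strand cut position for one recognition site."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def digest_linear(seq, enzyme):
    """Fragment lengths (largest first) from digesting a linear sequence."""
    site, offset = ENZYMES[enzyme]
    edges = [0] + cut_positions(seq, site, offset) + [len(seq)]
    return sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)


toy_sequence = "AAAAGAATTCTTTTTT"  # hypothetical 16 bp sequence with one EcoRI site
print(digest_linear(toy_sequence, "EcoRI"))
print(digest_linear(toy_sequence, "BamHI"))
```

Sorting the fragments largest first mirrors how bands are ordered from the top of a gel lane.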

Part 3: DNA Design Challenge

3.1. Choice of Protein and Protein Sequence

Chosen Protein: Green Fluorescent Protein (GFP)

Why this protein?

I chose Green Fluorescent Protein (GFP) because it is a well-characterized and widely used reporter protein in molecular biology and synthetic biology. GFP emits green fluorescence when exposed to blue or UV light, making it an ideal tool for studying gene expression, protein localization, and cellular processes. Its extensive documentation and availability of sequence data make it suitable for computational analysis and reverse translation exercises.

Protein Sequence Source

The amino acid sequence was obtained from UniProt, a curated protein sequence database (https://www.uniprot.org/uniprotkb/P42212/entry).

Protein Sequence (UniProt format)

sp|P42212|GFP_AEQVI Green fluorescent protein OS=Aequorea victoria OX=6100 GN=GFP PE=1 SV=1

MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK

3.2. Reverse Translation: Protein to DNA

The Central Dogma of molecular biology explains how DNA is transcribed into RNA and translated into protein. Because multiple codons can encode the same amino acid, reverse translation produces a possible nucleotide sequence rather than a single definitive one.

Using a reverse translation tool from https://www.bioinformatics.org, the following nucleotide sequence was generated for the GFP protein.

Reverse-Translated DNA Sequence

atgagcaaaggcgaagaactgtttaccggcgtggtgccgattctggtggaactggatggc gatgtgaacggccataaatttagcgtgagcggcgaaggcgaaggcgatgcgacctatggc aaactgaccctgaaatttatttgcaccaccggcaaactgccggtgccgtggccgaccctg gtgaccacctttagctatggcgtgcagtgctttagccgctatccggatcatatgaaacag catgatttttttaaaagcgcgatgccggaaggctatgtgcaggaacgcaccatttttttt aaagatgatggcaactataaaacccgcgcggaagtgaaatttgaaggcgataccctggtg aaccgcattgaactgaaaggcattgattttaaagaagatggcaacattctgggccataaa ctggaatataactataacagccataacgtgtatattatggcggataaacagaaaaacggc attaaagtgaactttaaaattcgccataacattgaagatggcagcgtgcagctggcggat cattatcagcagaacaccccgattggcgatggcccggtgctgctgccggataaccattat ctgagcacccagagcgcgctgagcaaagatccgaacgaaaaacgcgatcatatggtgctg ctggaatttgtgaccgcggcgggcattacccatggcatggatgaactgtataaa
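Reverse translation with a single codon per amino acid can be sketched as a simple lookup. The codon choices below mirror the ones that appear in the sequence above (for example `agc` for serine and `ctg` for leucine); this is an illustration of the idea, not the tool's actual implementation, and the names are hypothetical.

```python
# One codon per amino acid, matching the codons seen in the sequence above.
PREFERRED_CODON = {
    "A": "gcg", "R": "cgc", "N": "aac", "D": "gat", "C": "tgc",
    "Q": "cag", "E": "gaa", "G": "ggc", "H": "cat", "I": "att",
    "L": "ctg", "K": "aaa", "M": "atg", "F": "ttt", "P": "ccg",
    "S": "agc", "T": "acc", "W": "tgg", "Y": "tat", "V": "gtg",
}


def reverse_translate(protein):
    """Map each residue to one fixed codon, ignoring spaces in the input."""
    return "".join(PREFERRED_CODON[aa] for aa in protein.replace(" ", ""))


# First 20 residues of the GFP sequence from section 3.1.
print(reverse_translate("MSKGEELFTGVVPILVELDG"))
```

Any other table of synonymous codons would give a different but equally valid DNA sequence for the same protein.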

3.3. Codon Optimization

Why Codon Optimization Is Necessary

Although different organisms use the same genetic code, they prefer different codons for the same amino acids. Codon optimization improves protein expression by matching the codon usage of the host organism, increasing translation efficiency, mRNA stability, and overall protein yield.

Chosen Expression Organism and Reason

Escherichia coli

E. coli is commonly used for recombinant protein production due to its fast growth, low cost, and well-understood genetics.

Codon-Optimized DNA Sequence (for E. coli) using GenSmart™ Codon Optimization tool (https://www.genscript.com/gensmart-free-gene-codon-optimization.html)

ATGAGTAAAGGTGAAGAACTGTTTACCGGTGTGGTTCCGATCCTGGTTGAACTGGATGGTGATGTTAACGGTCATAAATTTTCAGTTTCTGGTGAAGGTGAAGGTGATGCTACCTATGGCAAATTGACTCTGAAGTTTATCTGTACCACTGGCAAATTGCCGGTGCCATGGCCAACTCTGGTGACCACTTTCTCTTATGGTGTACAGTGCTTCTCCCGTTATCCTGATCATATGAAACAGCATGATTTTTTCAAATCTGCTATGCCAGAAGGTTATGTTCAGGAAAGGACTATTTTCTTCAAGGATGATGGTAATTATAAAACTAGAGCTGAAGTTAAATTTGAAGGTGATACCTTGGTCAATCGTATTGAACTGAAAGGTATTGATTTTAAAGAAGATGGTAATATTCTGGGTCATAAACTGGAATATAATTATAATTCTCATAATGTTTATATTATGGCTGATCAGAAAAATGGTATTAAAGTTAATTTTAAAATTAGACATAATATTGAAGATAGTGGTTCAGTTCTGGCTGATCATTATCAGCAGAATACCCCAATTGGTGATGGTCCTGTTCTGCTGCCAGATAATCATTATTTGTCTACACAGAGTGCTTTGTCTAAGGATCCTAATGAAGAAAGAGATCATATGGTTCTGTTGGAATTTGTTACCGCTGCTGGTATTACACATGGTATGGATGAACTGTATTAA

3.4. You Have a Sequence! Now What?

Once the codon-optimized DNA sequence is obtained, several technologies can be used to produce the protein.

Cell-Dependent Method

  • The DNA sequence is inserted into a plasmid vector with a promoter.

  • The plasmid is transformed into a host cell (e.g., E. coli).

  • The host cell transcribes the DNA into mRNA.

  • Ribosomes translate the mRNA into the GFP protein.

  • The protein can be purified using affinity chromatography.

Cell-Free Method

  • The DNA is added to a cell-free transcription–translation system.

  • Transcription and translation occur in vitro.

  • The protein is synthesized without living cells.

Both methods rely on the Central Dogma: DNA → RNA → Protein.

Part 4: Prepare a Twist DNA Synthesis Order

4.2. Build Your DNA Insert Sequence

I selected GFP as my protein of interest due to its well-characterized fluorescence properties and robust expression in E. coli. The coding sequence used in this expression cassette was previously codon optimized for E. coli in Part 3 of this assignment.

I constructed a linear GFP expression cassette with annotated regulatory and coding regions.

Benchling link: https://benchling.com/s/seq-0svwetPOR91RRgpKfRjz?m=slm-l7fH5hxSe2vTJEp6n9YU

I uploaded the GFP expression cassette and selected the pTwist Amp High Copy vector for synthesis.

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) I would sequence DNA used for DNA-based digital data storage. Since digital files are encoded into synthetic DNA strands, sequencing is necessary to verify that the information was written correctly and can be accurately recovered. Even small errors could corrupt stored data, so reading the DNA ensures reliability and long-term stability.

(ii) I would use second-generation (short-read) sequencing on an Illumina platform. The input would be the synthetic DNA pool, prepared by adding adapters and amplifying it before sequencing. The system reads DNA by sequencing-by-synthesis: fluorescently labeled bases are incorporated one at a time, and each cycle is imaged to identify the base. The output is a set of short reads with quality scores that can be decoded back into digital information.
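The final decoding step can be illustrated with a toy scheme that stores two bits per base. The mapping below (00→A, 01→C, 10→G, 11→T) is a hypothetical choice made for this sketch; practical DNA storage systems add error correction and avoid problematic patterns such as long homopolymer runs.

```python
# Hypothetical 2-bits-per-base mapping, for illustration only.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data):
    """Turn bytes into a DNA string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna):
    """Turn a DNA string produced by encode() back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna.upper())
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand, decode(strand))
```

With this mapping a single misread base corrupts two bits of one byte, which is why verifying the written DNA by sequencing matters.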

5.2 DNA Write

(i) I would synthesize a simple environmental biosensor circuit that expresses Green Fluorescent Protein when exposed to a specific contaminant. This would allow visible detection of environmental signals and could be used for low-cost monitoring or education.

(ii) I would use standard solid-phase chemical DNA synthesis, such as the method used by Twist Bioscience. DNA is built one nucleotide at a time and shorter pieces can be assembled into larger constructs. The main limitations are synthesis errors in longer sequences and the need for assembly steps for bigger genes.

5.3 DNA Edit

(i) I would edit plant genes related to drought tolerance to improve crop resilience under climate stress. Enhancing stress-response pathways could support food security and more sustainable agriculture.

(ii) I would use CRISPR-Cas9, developed for genome editing by scientists including Jennifer Doudna. A guide RNA directs the Cas9 enzyme to a specific DNA sequence, where it makes a cut that the cell repairs, introducing edits. Limitations include possible off-target effects and variable efficiency.
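Choosing where the guide RNA can direct Cas9 starts with finding candidate target sites. For the commonly used Streptococcus pyogenes Cas9, a target is a 20-nucleotide protospacer followed immediately by an NGG PAM. The sketch below scans only the forward strand of a toy sequence; the function name and the sequence are hypothetical, and it does not score off-target risk.

```python
import re

# 20-nt protospacer followed by an NGG PAM; the lookahead allows overlapping hits.
TARGET_SITE = re.compile(r"(?=([ACGT]{20})[ACGT]GG)")


def find_protospacers(seq):
    """Return (start index, protospacer) for each forward-strand target."""
    return [(m.start(), m.group(1)) for m in TARGET_SITE.finditer(seq.upper())]


toy_sequence = "AAAAA" + "ACGTACGTACGTACGTACGT" + "TGG" + "AAA"  # hypothetical
for start, protospacer in find_protospacers(toy_sequence):
    print(start, protospacer)
```

A fuller search would also scan the reverse complement and then filter candidates for similarity to other genomic sites, since that is where off-target effects come from.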

Week 3 HW: Lab Automation


Opentrons Artwork Python Script

This week, I explored laboratory automation by writing and simulating a Python script for the Opentrons liquid handling robot using Google Colab. As a Committed Listener, I was not physically running the robot, but I focused on understanding the automation logic and API structure that controls robotic liquid handling.

Design: A DNA Double Helix Pattern

I began by generating a coordinate-based design inspired by biological structures, particularly the DNA double helix, using the GUI at https://opentrons-art.rcdonovan.com/

Using mathematical functions in Python, I generated coordinate-based instructions that determine where liquid would be dispensed on the agar plate, expressed as x/y offsets from the plate's center. I then structured the script using the Opentrons API format to simulate how the robot would execute these movements.
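As an illustration of generating such coordinates mathematically, the sketch below places two sine waves in opposite phase to suggest the two strands of a helix. The amplitude, period, and 2.2 mm row spacing are assumed values chosen to resemble the grid used in the script; this is not the exact procedure that produced the point lists in the protocol.

```python
import math


def helix_points(amplitude=14.3, period=33.0, y_max=34.1, step=2.2):
    """Return two mirrored strands as lists of (x, y) offsets in mm."""
    strand_a, strand_b = [], []
    rows = int(round(2 * y_max / step))  # number of steps from +y_max to -y_max
    for k in range(rows + 1):
        y = round(y_max - k * step, 1)
        x = amplitude * math.sin(2 * math.pi * y / period)
        strand_a.append((round(x, 1), y))
        strand_b.append((round(-x, 1), y))  # second strand mirrors the first
    return strand_a, strand_b


strand_a, strand_b = helix_points()
print(len(strand_a), strand_a[:3])
```

Each tuple has the same form as the entries in the protocol's point lists, so a strand could be passed through the same dispensing loop after checking that every point fits on the plate.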

The script below follows the Opentrons OT-2 Python API structure:

from opentrons import types

metadata = {    # see https://docs.opentrons.com/v2/tutorial.html#tutorial-metadata
    'author': 'Pascal Agbley', 
    'protocolName': 'DNA Double Helix Structure', # Give your protocol a name
    'description': 'A custom design of a DNA Double Helix Structure.', 
    'source': 'HTGAA 2026 Opentrons Lab',
    'apiLevel': '2.20'
}

TIP_RACK_DECK_SLOT = 9
COLORS_DECK_SLOT = 6
AGAR_DECK_SLOT = 5
PIPETTE_STARTING_TIP_WELL = 'A1'

well_colors = {
    'A1' : 'Red',
    'B1' : 'Green',
    'C1' : 'Orange'
}


def run(protocol):
  ##############################################################################
  ###   Load labware, modules and pipettes
  ##############################################################################

  # Tips
  tips_20ul = protocol.load_labware('opentrons_96_tiprack_20ul', TIP_RACK_DECK_SLOT, 'Opentrons 20uL Tips')

  # Pipettes
  pipette_20ul = protocol.load_instrument("p20_single_gen2", "right", [tips_20ul])

  # Modules
  temperature_module = protocol.load_module('temperature module gen2', COLORS_DECK_SLOT)

  # Temperature Module Plate
  temperature_plate = temperature_module.load_labware('opentrons_96_aluminumblock_generic_pcr_strip_200ul',
                                                      'Cold Plate')

  color_plate = temperature_plate

  # Agar Plate
  agar_plate = protocol.load_labware('htgaa_agar_plate', AGAR_DECK_SLOT, 'Agar Plate')  ## TA MUST CALIBRATE EACH PLATE!
  # Get the top-center of the plate, make sure the plate was calibrated before running this
  center_location = agar_plate['A1'].top()

  pipette_20ul.starting_tip = tips_20ul.well(PIPETTE_STARTING_TIP_WELL)

  ##############################################################################
  ###   Patterning
  ##############################################################################

  ###
  ### Helper functions for this lab
  ###

  # pass this e.g. 'Red' and get back a Location which can be passed to aspirate()
  def location_of_color(color_string):
    for well,color in well_colors.items():
      if color.lower() == color_string.lower():
        return color_plate[well]
    raise ValueError(f"No well found with color {color_string}")

  # For this lab, instead of calling pipette.dispense(1, loc) use this: dispense_and_detach(pipette, 1, loc)
  def dispense_and_detach(pipette, volume, location):
      """
      Move laterally 5mm above the plate (to avoid smearing a drop); then drop down to the plate,
      dispense, move back up 5mm to detach drop, and stay high to be ready for next lateral move.
      5mm because a 4uL drop is 2mm diameter; and a 2deg tilt in the agar pour is >3mm difference across a plate.
      """
      assert(isinstance(volume, (int, float)))
      above_location = location.move(types.Point(z=5))  # 5mm above; move() adds the given offset to the location
      pipette.move_to(above_location)       # Go to 5mm above the dispensing location
      pipette.dispense(volume, location)    # Go straight downwards and dispense
      pipette.move_to(above_location)       # Go straight up to detach drop and stay high

  ###
  ### Design coordinates: (x, y) offsets in mm from the center of the agar plate
  ###
  Red = [(-14.3, 34.1),(-12.1, 34.1),(12.1, 34.1),(-14.3, 31.9),(-12.1, 31.9),(12.1, 31.9),(14.3, 31.9),(-12.1, 29.7),(-9.9, 29.7),(-7.7, 29.7),(-5.5, 29.7),(-3.3, 29.7),(-1.1, 29.7),(3.3, 29.7),(5.5, 29.7),(7.7, 29.7),(9.9, 29.7),(12.1, 29.7),(14.3, 29.7),(-14.3, 27.5),(-12.1, 27.5),(12.1, 27.5),(-12.1, 25.3),(-9.9, 25.3),(9.9, 25.3),(-9.9, 23.1),(-7.7, 23.1),(-3.3, 23.1),(-1.1, 23.1),(1.1, 23.1),(3.3, 23.1),(5.5, 23.1),(7.7, 23.1),(9.9, 23.1),(-9.9, 20.9),(-5.5, 20.9),(5.5, 20.9),(7.7, 20.9),(9.9, 20.9),(-7.7, 18.7),(-5.5, 18.7),(-3.3, 18.7),(3.3, 18.7),(5.5, 18.7),(7.7, 18.7),(-5.5, 16.5),(-3.3, 16.5),(-1.1, 16.5),(1.1, 16.5),(5.5, 16.5),(-3.3, 14.3),(-1.1, 14.3),(1.1, 14.3),(3.3, 14.3),(-5.5, 12.1),(-3.3, 12.1),(1.1, 12.1),(3.3, 12.1),(5.5, 12.1),(-9.9, 9.9),(-7.7, 9.9),(-5.5, 9.9),(5.5, 9.9),(7.7, 9.9),(9.9, 9.9),(-12.1, 7.7),(-9.9, 7.7),(7.7, 7.7),(9.9, 7.7),(-12.1, 5.5),(-9.9, 5.5),(9.9, 5.5),(12.1, 5.5),(-12.1, 3.3),(-9.9, 3.3),(-7.7, 3.3),(-5.5, 3.3),(-3.3, 3.3),(-1.1, 3.3),(3.3, 3.3),(5.5, 3.3),(7.7, 3.3),(9.9, 3.3),(12.1, 3.3),(14.3, 3.3),(-14.3, 1.1),(-12.1, 1.1),(12.1, 1.1),(14.3, 1.1),(-14.3, -1.1),(-12.1, -1.1),(12.1, -1.1),(14.3, -1.1),(-14.3, -3.3),(-12.1, -3.3),(-9.9, -3.3),(-7.7, -3.3),(-5.5, -3.3),(-3.3, -3.3),(-1.1, -3.3),(1.1, -3.3),(3.3, -3.3),(5.5, -3.3),(7.7, -3.3),(9.9, -3.3),(12.1, -3.3),(14.3, -3.3),(-12.1, -5.5),(12.1, -5.5),(-9.9, -7.7),(7.7, -7.7),(9.9, -7.7),(-9.9, -9.9),(-7.7, -9.9),(5.5, -9.9),(7.7, -9.9),(9.9, -9.9),(-5.5, -12.1),(-3.3, -12.1),(-1.1, -12.1),(3.3, -12.1),(5.5, -12.1),(-1.1, -14.3),(1.1, -14.3),(-5.5, -16.5),(-1.1, -16.5),(1.1, -16.5),(3.3, -16.5),(5.5, -16.5),(-7.7, -18.7),(-5.5, -18.7),(-3.3, -18.7),(3.3, -18.7),(5.5, -18.7),(7.7, -18.7),(-9.9, -20.9),(-7.7, -20.9),(-5.5, -20.9),(5.5, -20.9),(7.7, -20.9),(9.9, -20.9),(-9.9, -23.1),(-7.7, -23.1),(-5.5, -23.1),(-3.3, -23.1),(-1.1, -23.1),(1.1, -23.1),(3.3, -23.1),(5.5, -23.1),(7.7, -23.1),(9.9, -23.1),(-12.1, -25.3),(9.9, -25.3),(-12.1, -27.5),(12.1, -27.5),(-14.3, 
-29.7),(-12.1, -29.7),(-9.9, -29.7),(-7.7, -29.7),(-5.5, -29.7),(-3.3, -29.7),(-1.1, -29.7),(3.3, -29.7),(5.5, -29.7),(7.7, -29.7),(9.9, -29.7),(12.1, -29.7),(-12.1, -31.9),(12.1, -31.9),(14.3, -31.9),(-14.3, -34.1),(-12.1, -34.1),(12.1, -34.1)]
  Green = [(14.3, 34.1),(-14.3, 29.7),(1.1, 29.7),(12.1, 25.3),(-5.5, 23.1),(-7.7, 20.9),(3.3, 16.5),(-14.3, 3.3),(1.1, 3.3),(-9.9, -5.5),(9.9, -5.5),(-5.5, -9.9),(-3.3, -14.3),(3.3, -14.3),(-3.3, -16.5),(-9.9, -25.3),(12.1, -25.3),(1.1, -29.7),(14.3, -29.7),(-14.3, -31.9),(14.3, -34.1)]

  def point_to_location(point):
    x_offset, y_offset = point
    return center_location.move(types.Point(x=x_offset, y=y_offset, z=0))

  pipette_20ul.pick_up_tip()

  # Pipette Red points
  red_color_source = location_of_color('Red')
  for i, point in enumerate(Red):
    if i % 20 == 0: # Refill every 20 points (1 uL each); take only what the remaining points need
      pipette_20ul.aspirate(min(20, len(Red) - i), red_color_source)
    target_location = point_to_location(point)
    dispense_and_detach(pipette_20ul, 1, target_location)

  pipette_20ul.drop_tip() # Drop tip after finishing Red points
  pipette_20ul.pick_up_tip() # Pick up new tip for Green points

  # Pipette Green points
  green_color_source = location_of_color('Green')
  for i, point in enumerate(Green):
    if i % 20 == 0: # Refill every 20 points (1 uL each); take only what the remaining points need
      pipette_20ul.aspirate(min(20, len(Green) - i), green_color_source)
    target_location = point_to_location(point)
    dispense_and_detach(pipette_20ul, 1, target_location)

  pipette_20ul.drop_tip()

https://colab.research.google.com/drive/1DEgEUEtKyrVeh5qUuv6brWGXw2Ol7UkB#scrollTo=pczDLwsq64mk&line=5&uniqifier=1

Post-Lab Questions

Published Paper Using Opentrons

A study published in 2021 demonstrated the use of Opentrons OT-2 robots for automated SARS-CoV-2 diagnostic workflows during the COVID-19 pandemic. Researchers used automated liquid handling to perform RNA extraction and PCR setup at scale, reducing manual error and increasing throughput. The Opentrons platform allowed laboratories to rapidly deploy affordable automation in response to urgent public health needs. This demonstrated how open-source robotics can accelerate diagnostics and improve reproducibility.

Reference: Villanueva-Cañas, J. L., Gonzalez-Roca, E., Gastaminza Unanue, A., Titos, E., Martínez Yoldi, M. J., Vergara Gómez, A., & Puig-Butillé, J. A. (2021). Implementation of an open-source robotic platform for SARS-CoV-2 testing by real-time RT-PCR. PLoS ONE, 16(7), e0252509. https://doi.org/10.1371/journal.pone.0252509

What I Would Automate for My Final Project

Automated Biosensor Screening Platform: I would automate the screening of environmental biosensor constructs using cell-free protein synthesis (CFPS).

What I would automate:

  • Dispensing plasmid DNA into 96-well plates

  • Adding CFPS master mix

  • Adding environmental analytes

  • Incubation timing

  • Fluorescence measurement preparation

Example Pseudocode:

Load 96-well plate
Load tip rack
Load CFPS reagent reservoir

For each well:
    Transfer plasmid DNA
    Add CFPS master mix
    Add test analyte

Seal plate
Incubate at 37°C
Prepare for fluorescence readout 
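The per-well loop in the pseudocode can be sketched in plain Python as a list of planned transfers, which is easy to inspect before turning it into robot commands. The volumes and names below are hypothetical placeholders, and the sketch does not call the Opentrons API.

```python
from itertools import product

ROWS = "ABCDEFGH"
COLUMNS = range(1, 13)


def well_names():
    """Names of all wells on a 96-well plate, A1 through H12."""
    return [f"{row}{column}" for row, column in product(ROWS, COLUMNS)]


def build_plan(wells, dna_ul=1.0, mix_ul=8.0, analyte_ul=1.0):
    """Return (reagent, well, volume in uL) steps; volumes are placeholders."""
    plan = []
    for well in wells:
        plan.append(("plasmid DNA", well, dna_ul))
        plan.append(("CFPS master mix", well, mix_ul))
        plan.append(("test analyte", well, analyte_ul))
    return plan


plan = build_plan(well_names())
print(len(plan), plan[:3])
```

Summing the volumes per reagent from such a plan gives the amount to load into each reservoir before the run.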

Week 4 HW: Protein Design Part I

Part A. Conceptual Questions

1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

A 500-gram piece of meat provides approximately 6.62 x 10²³ amino acid molecules. Raw meat is roughly 22% protein by mass, so it contains about 110 grams of protein. An average amino acid mass of 100 Daltons corresponds to 100 g/mol, so 110 g / 100 g/mol = 1.1 moles of amino acids. Multiplying by Avogadro’s constant (6.022 x 10²³ per mole) gives about 6.62 x 10²³, or 662 sextillion, molecules. That is comparable to, or larger than, common estimates of the number of stars in the observable universe.
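The arithmetic can be checked with a few lines of Python. The function name is just for illustration, and the 22% protein fraction is the same rough assumption used in the answer.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def amino_acid_molecules(meat_g, protein_fraction, residue_mass_da):
    """Estimate the number of amino acid molecules in a piece of meat."""
    protein_g = meat_g * protein_fraction
    moles = protein_g / residue_mass_da  # a mass in Da equals the molar mass in g/mol
    return moles * AVOGADRO


n = amino_acid_molecules(500, 0.22, 100)
print(f"{n:.2e}")  # 6.62e+23
```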

2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Humans do not become the animals they eat because the digestive system breaks down foreign proteins into amino acids and foreign DNA into nucleotides. These building blocks are then reassembled into human proteins and nucleic acids according to the instructions in our own genome. Essentially, digestion destroys the genetic information (DNA) in the food, so it cannot turn the consumer into the consumed.

Also answered in this thread

3. Why are there only 20 natural amino acids?

While over 500 amino acids exist in nature, only 20 are encoded by the standard genetic code. One common explanation is the “frozen accident” hypothesis: the code became fixed early in evolution, with a set of amino acids that provided enough chemical diversity for function while keeping metabolic costs low.

Part B: Protein Analysis and Visualization

1. Briefly describe the protein and why you selected it

The protein I selected is Green Fluorescent Protein (GFP) from the jellyfish Aequorea victoria. GFP naturally emits bright green fluorescence when exposed to ultraviolet or blue light. It is widely used in molecular biology as a reporter protein to visualize gene expression, protein localization, and cellular processes in living organisms. I chose GFP because it is one of the most important tools in modern biotechnology and synthetic biology, and I previously used it in earlier parts of this assignment.

2. Amino acid sequence

The amino acid sequence of GFP was obtained from UniProt (UniProt ID: P42212). The full sequence is:

MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK

Figure: GFP protein on UniProt

  • Protein length and most frequent amino acid: The GFP protein contains 238 amino acids. Using the amino acid frequency analysis tool in the provided Google Colab notebook, I counted the occurrence of each amino acid in the sequence. The most frequent amino acid in GFP is glycine (G), appearing 22 times.
  • Number of protein sequence homologs: Using the UniProt BLAST search tool, I compared the GFP sequence against protein databases to identify homologous proteins. The search returned 205 hits of homologs across different organisms.
  • Protein family: GFP belongs to the Green Fluorescent Protein family, which includes naturally occurring and engineered fluorescent proteins used as biological markers. Members of this family share a conserved beta-barrel structure that protects the internal chromophore responsible for fluorescence.
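The length and residue counts reported above can be reproduced with Python's standard library. This is a minimal sketch, not the Colab notebook's own code; the sequence is the GFP sequence listed in this assignment.

```python
from collections import Counter

# GFP sequence (UniProt P42212) as listed in this assignment.
GFP = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

counts = Counter(GFP)                      # residue letter -> number of occurrences
residue, occurrences = counts.most_common(1)[0]

print(len(GFP))              # 238
print(residue, occurrences)  # G 22
```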

3. Identify the structure page of your protein in RCSB

The 3D structure of GFP can be found in the RCSB Protein Data Bank under PDB ID 1GFL (https://www.rcsb.org/structure/1GFL).

  • When the structure was solved and quality: The crystal structure of GFP was solved in 1996 at a resolution of approximately 1.9 Å. This is considered good quality, because a smaller resolution value in Å means finer structural detail can be resolved. At this resolution, the structure is a reliable model for structural and molecular visualization studies.

  • Other molecules present in the structure: Besides the protein itself, the solved structure contains the internal chromophore, which is formed from amino acids within the protein after folding. Some crystal structures may also include water molecules or ions that help stabilize the structure during crystallography experiments.

4. Open the structure of your protein in any 3D molecule visualization software

Using PyMOL, I visualized the GFP structure in three different representations:

  • Cartoon representation to highlight overall folding
  • Ribbon representation to show secondary structures
  • Ball-and-stick representation

After coloring the protein by secondary structure, I observed that the structure contains significantly more beta sheets than alpha helices. These beta sheets form a characteristic barrel structure that surrounds the chromophore in the center of the protein.

When coloring the protein by residue type, I observed that hydrophobic residues are mostly located in the interior of the protein structure, where they help stabilize folding through hydrophobic interactions. Hydrophilic residues are primarily distributed on the outer surface, where they can interact with water and the surrounding environment.

When visualizing the surface of the protein, the structure forms a compact barrel-like shape with small surface cavities and indentations.

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

Deep Mutational Scan

Using the ESM2 protein language model, I generated a deep mutational scan of my protein, Green Fluorescent Protein. The model predicts how favorable different amino-acid substitutions are at each position in the sequence. I noticed that mutations in residues located near the core of the protein were predicted to be much less favorable than mutations on the surface. For example, substitutions near the chromophore region showed very low likelihood scores, suggesting that these residues are important for maintaining the protein’s structure and fluorescence.
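One common way to turn language-model outputs into substitution scores is the log-odds of the mutant residue relative to the wild-type residue at the same position. The sketch below illustrates that idea only: the probabilities are made up, the function name is hypothetical, and it is not necessarily the exact scoring used in the notebook.

```python
import math


def substitution_scores(position_probs, wild_type):
    """Log-odds of each substitution versus the wild-type residue at one position.

    Negative scores mean the model finds the substitution less likely than wild type.
    """
    wt_logp = math.log(position_probs[wild_type])
    return {aa: math.log(p) - wt_logp
            for aa, p in position_probs.items()
            if aa != wild_type}


# Made-up probabilities for a buried glycine (illustrative only, not ESM2 output).
probs = {"G": 0.90, "A": 0.06, "S": 0.03, "P": 0.01}
scores = substitution_scores(probs, "G")
for aa, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(aa, round(score, 2))
```

Repeating this for every position and every amino acid gives the grid that a deep mutational scan heatmap displays.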

Latent Space Analysis

The notebook also allowed me to embed protein sequences into a reduced dimensional space to compare their similarities. When visualizing these embeddings, proteins with similar sequences and functions tended to cluster together. When I located GFP in the map, it appeared close to other fluorescent or beta-barrel proteins. This suggests the model captures meaningful relationships between proteins based only on their sequence information.

C2. Protein Folding

I used ESMFold to predict the structure of GFP from its amino-acid sequence and compared the prediction with the experimentally solved structure in the RCSB Protein Data Bank. The predicted structure closely resembled the original structure, especially the characteristic beta-barrel fold. When I introduced small mutations into the sequence, the predicted structure stayed mostly similar. Larger sequence changes, however, caused the model to predict structures that deviated more from the original fold.

C3. Protein Generation

Inverse Folding (Protein Generation)

Using ProteinMPNN, I generated possible sequences that could fold into the same backbone structure as GFP. The predicted sequences were similar to the original sequence in several important regions, particularly residues located in the protein core. Some positions varied more, which suggests those areas may tolerate mutations without disrupting the overall structure. When I tested one of the generated sequences with ESMFold, the predicted structure still looked very similar to the original GFP fold.

Original GFP Sequence: MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK

New Sequence: SYPGEELFTGKVPIEVELDGDVNGKKFSVKGEGEGDATKGTITLKLICTTGKLPVPWPTLIDTFSGGLPCFTRYPPHMRQHDFFKSAMPEGYKQERTITFEGDGKYETRSEVKMEGDTLVNRIELKGSGFKEDGNILGHKLEFSFNSYKVNITADAAANGIKETYTLELKLKDGSVQKAKVDRKVTPIGDGPVLLPEPHYIEVEVKLSKDPNEKRDHVVIEQKSTAAGIE
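A rough way to quantify how similar a designed sequence is to the original is the fraction of identical residues at matching positions. The sketch below assumes both sequences start at the same residue and contain no insertions or deletions, which may not hold for the full sequences above; a proper comparison would use a pairwise alignment. It is demonstrated on the first ten residues of each sequence.

```python
def ungapped_identity(seq_a, seq_b):
    """Fraction of identical residues over the shared length, with no gaps allowed."""
    shared = min(len(seq_a), len(seq_b))
    if shared == 0:
        raise ValueError("both sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / shared


# First ten residues of the original and the ProteinMPNN sequence.
print(ungapped_identity("MSKGEELFTG", "SYPGEELFTG"))  # 0.7
```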

Week 5 HW: Protein Design Part II

Part A: SOD1 Binder Peptide Design

Part 1: PepMLM Peptide Generation

sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2 MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

SOD1 A4V mutation sequence:

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

To generate candidate binders, I retrieved the human SOD1 sequence from UniProt (accession P00441) and introduced the A4V mutation. The name A4V uses the numbering of the mature protein, which lacks the initiator methionine, so the substituted alanine is residue 5 of the UniProt sequence shown above. I then used the PepMLM Colab notebook to generate four 12-amino-acid peptide sequences conditioned on the mutant protein. Since only four peptides were generated, I used a moderate top-k value of 5 to ensure enough variation between candidates to evaluate different potential binding interactions with Superoxide Dismutase 1.
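Introducing a point mutation is easy to get wrong by one position, because A4V is numbered without the initiator methionine while the UniProt sequence includes it. The sketch below uses a hypothetical helper name and checks the wild-type residue before substituting; it is shown on the first 13 residues of SOD1.

```python
def apply_substitution(sequence, wild_type, position, mutant, offset=0):
    """Return `sequence` with one residue replaced.

    `position` is 1-based in the mutation's own numbering; `offset` shifts it into
    the coordinates of `sequence` (use 1 when `sequence` starts with the initiator
    Met but the numbering does not count it).
    """
    index = position + offset - 1
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at index {index}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


wt_prefix = "MATKAVCVLKGDG"  # first 13 residues of UniProt P00441
print(apply_substitution(wt_prefix, "A", 4, "V", offset=1))  # MATKVVCVLKGDG
```

Calling the helper without the offset raises an error, because residue 4 of the UniProt sequence is lysine rather than alanine.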

Index | Binder | Pseudo Perplexity | Interpretation
0 | MESILTLLLKRK | 12.187512 | Moderate confidence
1 | MKSLITDQLLVI | 10.999398 | Best candidate (lowest perplexity)
2 | MSLTETDLLIVV | 13.539035 | Acceptable but weaker
3 | MAKILVLLQRKI | 19.123590 | Low confidence
Known Binder | FLYRWLPSRRGG | 21.421776 | Lowest confidence

For comparison, I added the known SOD1-binding peptide FLYRWLPSRRGG, which I also evaluated with PepMLM; it produced a pseudo-perplexity of 21.42. This score is higher than those of all four generated peptides, meaning the model assigns this sequence a lower likelihood than the newly generated candidates. The peptide MKSLITDQLLVI had the lowest pseudo-perplexity (11.00), indicating the highest confidence according to the model. Because the experimentally validated binder scores worst, this comparison shows that language-model likelihood does not always correlate with experimentally validated binding, although it can still help propose plausible new candidates.
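For reference, pseudo-perplexity is generally defined as the exponential of the average negative log-probability that a masked language model assigns to each residue when that residue is masked. The sketch below shows only this final arithmetic; the per-residue probabilities are made up, and in practice they would come from PepMLM.

```python
import math


def pseudo_perplexity(residue_probs):
    """exp(mean negative log-probability) over the residues of a peptide."""
    negative_log_probs = [-math.log(p) for p in residue_probs]
    return math.exp(sum(negative_log_probs) / len(negative_log_probs))


# Made-up probabilities for a 12-residue peptide (illustrative only).
confident = [0.10] * 12   # model gives each true residue 10% probability
uncertain = [0.05] * 12   # 5% each, i.e. a uniform guess over 20 amino acids
print(round(pseudo_perplexity(confident), 2))
print(round(pseudo_perplexity(uncertain), 2))
```

A lower value means the model finds the peptide more probable; a value near 20 corresponds to guessing uniformly among the 20 amino acids.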

Part 2: AlphaFold3 Binding Evaluation

MESILTLLLKRK peptide: ipTM = 0.54, pTM = 0.86; localized near the β-barrel region

MKSLITDQLLVI peptide: ipTM = 0.25, pTM = 0.84; localized near the N-terminus

MSLTETDLLIVV peptide: ipTM = 0.34, pTM = 0.86; localized near the β-barrel region

MAKILVLLQRKI peptide: ipTM = 0.63, pTM = 0.88; localized near the β-barrel region

FLYRWLPSRRGG peptide: ipTM = 0.34, pTM = 0.81; localized near the N-terminus

The peptide–protein complexes with mutant Superoxide Dismutase 1 were predicted using AlphaFold3. MAKILVLLQRKI showed the strongest predicted interaction (ipTM = 0.63) near the β-barrel region, while the known binder FLYRWLPSRRGG had a lower interaction score (ipTM = 0.34) and localized near the N-terminus. Notably, MKSLITDQLLVI, the candidate with the lowest PepMLM pseudo-perplexity, had the lowest ipTM (0.25).
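The two rankings can be compared side by side with a short script. The values are the pseudo-perplexity and ipTM scores reported above; the variable names are only for illustration.

```python
# (peptide, PepMLM pseudo-perplexity, AlphaFold3 ipTM), values from the results above.
candidates = [
    ("MESILTLLLKRK", 12.187512, 0.54),
    ("MKSLITDQLLVI", 10.999398, 0.25),
    ("MSLTETDLLIVV", 13.539035, 0.34),
    ("MAKILVLLQRKI", 19.123590, 0.63),
    ("FLYRWLPSRRGG", 21.421776, 0.34),
]

by_perplexity = sorted(candidates, key=lambda c: c[1])          # lower is better
by_iptm = sorted(candidates, key=lambda c: c[2], reverse=True)  # higher is better

print("Best by pseudo-perplexity:", by_perplexity[0][0])  # MKSLITDQLLVI
print("Best by ipTM:", by_iptm[0][0])                     # MAKILVLLQRKI
```

The two metrics pick different top candidates, which is why both the sequence-based and the structure-based scores are reported.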

Part 3: PeptiVerse Therapeutic Evaluation

Figures: PeptiVerse results for the MESILTLLLKRK, MKSLITDQLLVI, MSLTETDLLIVV, MAKILVLLQRKI, and FLYRWLPSRRGG peptides.

I evaluated the peptides in PeptiVerse to assess therapeutic properties such as predicted binding affinity, solubility, hemolysis probability, net charge, and molecular weight against the A4V mutant of Superoxide Dismutase 1. In general, peptides with higher ipTM scores from AlphaFold tended to show stronger predicted binding affinity, although some candidates showed slightly lower solubility. None of the peptides showed strong hemolytic risk. Among them, MAKILVLLQRKI appeared to provide the best balance of structural binding confidence, predicted affinity, and acceptable physicochemical properties, so I selected it as the peptide to advance for further study.

Part 4: moPPIt Optimized Peptides

Using moPPIt, I generated peptides by targeting residues near the A4V mutation on Superoxide Dismutase 1 while optimizing binding affinity and therapeutic properties. Compared to the peptides produced earlier with PepMLM, the moPPIt designs appeared more directed toward the chosen binding region and were predicted to have improved physicochemical properties, such as higher solubility and lower hemolysis risk. Before advancing any candidate toward clinical studies, the peptides would need further evaluation using structural prediction tools like AlphaFold, followed by experimental validation of binding strength, stability, and toxicity in laboratory assays.

Part C: L-Protein Mutants

The project focuses on improving the stability and folding efficiency of the lysis protein from the MS2 bacteriophage. In this assignment, computational tools are used to generate mutations that may improve the protein’s structural stability while maintaining its biological function. By comparing predicted folding energies and structural models, promising mutants can be identified for further experimental testing. Improving the stability of the lysis protein could help researchers better understand how bacteriophages disrupt bacterial cells, which may contribute to future strategies against antibiotic-resistant bacteria.