<Jenn Leung> — HTGAA Spring 2026


About me

Jenn Leung is a creative technologist and simulation developer building game engine simulations and real-time streaming tools. Currently her research focuses on developing UE interfaces for living neurons and agent behaviour simulation, with two papers recently published in the MIT Antikythera Journal and a paper on UE-API for brain-on-a-chip platforms presented at NeurIPS 2025.

She is a Senior Lecturer in Creative Technology & Design at University of the Arts London, a researcher at LifeFabs Institute, and a Visiting Researcher at The Bartlett School of Architecture, UCL, working on the 100 Minds in Motion project combining EEG, eye-tracking, and movement data in an agent simulation. She was previously a studio researcher at Antikythera’s Cognitive Infrastructures Studio in 2024, supported by the Berggruen Institute. Since 2025 she has been collaborating with the biocomputing start-up Cortical Labs to create human-synthetic biological intelligence visualizations.

She is also part of Off World Live, an engineering and research group for Unreal Engine creators, and previously served as a Programme Head at Architectural Association VS Unit 5 Xalon. Her work has been exhibited at Epic Games Innovation Lab, Ars Electronica, Medialab Matadero, W1 Curates, Tai Kwun Hong Kong, National Communication Museum (Australia), CIVA Festival, DAE Research Festival, PAF Olomouc, ALife Conference Kyoto, Aksioma, and Museum of Art in Public Spaces (Køge) among others, and has been featured in Dazed, TANK Magazine, DIS, SHOWStudio, Art Asia Pacific, COEVAL Magazine, and AQNB.

In collaboration with Daniel Felstead, she has produced a short film series, ‘I’m so Janky’, from DIS that explores the myths, ideologies and realities of the metaverse, AI, and Neuralink. She also collaborates with dmstfctn on simulation projects for Serpentine Arts Technologies and the Leonardo Supercomputer at Bologna’s Tecnopolo.

Contact info

Homework

Labs

Projects

Subsections of <Jenn Leung> — HTGAA Spring 2026

Homework

Weekly homework submissions:

  • Week 1: Principles and Practices

    Question 1 First, describe a biological engineering application or tool you want to develop and why. This could be inspired by an idea for your HTGAA class project and/or something for which you are already doing in your research, or something you are just curious about.

  • Week 3: Lab Automation

    Lab: Opentrons Art

  • Week 4: Protein Design Part I

    Protein Design Part I (Thras Karydis, Jon Kaufman)
    Lab: Protein Design I

  • Week 10: Imaging and Measurement

    Week 10 (Apr 7): Advanced Imaging & Measurement Technology (Evan Daugharthy, Waters Corp.). Lab: Mass Spectrometry. This lecture presents a range of advanced technologies for precision measurement of proteins at atomic scales, characterizing chemical composition, and detecting protein sequence and structure.

  • Week 11: Building Genomes

    Week 11: Building Genomes Homework — DUE BY START OF APR 21 LECTURE (TBD)

  • Week 12: Bioproduction

    Week 12: Bioproduction Homework — DUE BY START OF APR 28 LECTURE (TBD)

  • Week 13: Bio Design Living Materials

    Week 13: Bio Design Living Materials Homework: Work on your Final Project Present it May 12 (MIT/Harvard) or May 13 (Committed Listeners)

  • Week 14: Biofabrication

    Week 14: Biofabrication Homework: Finish your Final Project Present it May 12 (MIT/Harvard) or May 13 (Committed Listeners)

  • Week 2: DNA Read, Write, and Edit

    Week 2: DNA Read, Write, and Edit Part 1: Benchling & In-silico Gel Art 1.1 Import Lambda DNA Simulate Restriction Enzyme Digestion Virtual Gel Part 2: Gel Art I have chosen to create a gel art of a person doing a jumping jack through randomization method.

  • Week 5: Protein Design Part II

    Week 5: Protein Design Part II Homework — DUE BY START OF MAR 10 LECTURE Part A: SOD1 Binder Peptide Design Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.

  • Week 6: Genetic Circuits Part I

    Week 6: Genetic Circuits Part I Homework — DUE BY START OF MAR 17 LECTURE Assignment: DNA Assembly Answer these questions about the protocol in this week’s lab:

  • Week 7: Genetic Circuits Part II

    Week 7: Genetic Circuits Part II Assignment Part 1: Intracellular Artificial Neural Networks (IANNs) What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions? An artificial neuron computes a weighted sum of its inputs and passes it through an activation function to produce an output; connected together, such neurons form an artificial neural network (ANN). Intracellular artificial neural networks keep the weighted summation and the non-linear activation function, but implement them as gene circuits. The main difference from Boolean circuits is that an IANN neuron takes two inputs on which it can perform addition and subtraction. For addition, a promoter drives transcription of a gene and translation then produces protein, so expression from several sources can be summed. For subtraction, input x1 can be an endoribonuclease such as CasE that binds and cleaves the RNA at its target sequence, reducing the output. x1 then carries a negative weight and x2 a positive weight, and the function is max(x2-x1,0). This is also referred to as sequestration: an endoribonuclease acts on the mRNA to produce the non-linearity (a single-turnover enzyme takes its target out of circulation).

  • Week 9: Cell-Free Systems

    Week 9: Cell-Free Systems Homework — DUE BY START OF Apr 7 LECTURE Homework Part A: General and Lecturer-Specific Questions General homework questions Explain the main advantages of cell-free protein synthesis over traditional in vivo methods, specifically in terms of flexibility and control over experimental variables. Name at least two cases where cell-free expression is more beneficial than cell production. Cell-free systems help us understand biology ‘from scratch’ to bioengineer from smaller units. There’s wider flexibility for scaffolding biology from the ground-up and controlling the environments in a complete model. Existing living cells as we know it are already incredibly complex and hence less controlled in experimental settings. Synthetic cell engineering allows flexibility in size of the cell, proteins, and even expanding largely on the chemistry of the cell. So the two scenarios could be if you want to control the size of the cell and want uniform control it might be ideal to use cell-free system. The other scenario might be to engineer a specific chemical environment or want chemical diversity in the experiment that is not naturally common/ compatible with cells. Compared to in-vivo expression where you have to create plasmids, cell-free protein expressions are faster and cheaper to construct and can also help you through quick iterations with linear fragments and without plasmids.
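The rectified subtraction max(x2-x1,0) mentioned in the Week 7 summary above can be sketched in a few lines of Python. This is only an illustration of the input/output behaviour, not a model of the underlying biochemistry; the function names are made up for this example, and the Boolean gate is included purely for contrast.

```python
def iann_neuron(x1: float, x2: float) -> float:
    """Rectified subtraction: x2 has a positive weight, x1 a negative weight."""
    return max(x2 - x1, 0.0)


def boolean_gate(x1: bool, x2: bool) -> bool:
    """Boolean counterpart for comparison: output is ON only for x2 AND NOT x1."""
    return x2 and not x1


if __name__ == "__main__":
    # The analog neuron gives graded outputs where the Boolean gate only gives ON/OFF.
    for x1, x2 in [(0.0, 1.0), (0.5, 1.0), (1.0, 1.0), (2.0, 1.0)]:
        print(f"x1={x1}, x2={x2} -> {iann_neuron(x1, x2)}")
```

The graded output for intermediate inputs is what distinguishes this neuron from a circuit whose behaviour is a Boolean function.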

Subsections of Homework

Week 1: Principles and Practices


Question 1

First, describe a biological engineering application or tool you want to develop and why. This could be inspired by an idea for your HTGAA class project and/or something for which you are already doing in your research, or something you are just curious about.

Answer 1

I would like to expand on my project on Unreal Engine API for brain-on-a-chip platforms that was presented at NeurIPS 2025 (https://openreview.net/forum?id=BroaBkQAGa). The project proposes to build an API between living neurons interfaced with microelectrode arrays and virtual gaming environments, so that researchers and designers can use this environment to visualize spiking behavior across MEA channels, and to use reinforcement learning algorithms within the game environment to train neuronal cultures as game agents.

I’m currently collaborating with Cortical Labs on closed-loop, real-time visualization systems for the National Communication Museum in Melbourne, connecting to the CL1 via UDP. To start the loop, I send blob tracking data to the CL1 for processing. The spikes from the CL1 are then streamed to Unreal Engine so that the neuronal activity can be used to transform agent parameters. https://ncm.org.au/exhibitions/cortical-labs https://jennleung.xyz/corticallabs
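As a minimal sketch of the kind of UDP exchange described above, the following Python uses only the standard library and a loopback socket. The JSON message layout, the field names, and the mapping from spike counts to an agent parameter are all assumptions made for illustration; they are not the CL1's or Unreal Engine's actual formats.

```python
import json
import socket


def encode_message(payload: dict) -> bytes:
    """Serialize a message as UTF-8 JSON (hypothetical wire format)."""
    return json.dumps(payload).encode("utf-8")


def decode_message(packet: bytes) -> dict:
    """Inverse of encode_message."""
    return json.loads(packet.decode("utf-8"))


def spikes_to_agent_speed(spike_counts, base_speed=1.0, gain=0.1):
    """Map per-channel spike counts to one agent parameter (illustrative mapping)."""
    return base_speed + gain * sum(spike_counts)


def loopback_demo() -> dict:
    """Send one blob-tracking datagram to a local receiver and read it back."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    receiver.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        blob = {"x": 0.25, "y": 0.75, "size": 12.0}  # made-up blob tracking sample
        sender.sendto(encode_message(blob), receiver.getsockname())
        packet, _ = receiver.recvfrom(4096)
        return decode_message(packet)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(loopback_demo())
    print(spikes_to_agent_speed([1, 2, 3]))
```

UDP is connectionless and does not guarantee delivery, so a real loop would need to tolerate dropped or out-of-order packets.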

Question 2

Next, describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future, like ensuring non-malfeasance (preventing harm). Break big goals down into two or more specific sub-goals. Below is one example framework (developed in the context of synthetic genomics) you can choose to use or adapt, or you can develop your own. The example was developed to consider policy goals of ensuring safety and security, alongside other goals, like promoting constructive uses, but you could propose other goals for example, those relating to equity or autonomy.

Answer 2

One of the main objectives of this project is to provide an open playground for benchmarking open-source and non-standardized brain-on-a-chip platforms. As we expect these systems to become democratized and decentralized, many different configurations of physical/neural assemblies will emerge with advances in MEA designs, bioprinting technologies, and microfluidic platforms. Therefore, it is important to support: 1) benchmarking integrity and reproducibility: how do we measure spiking activity across different systems? How do we make sure experiments are scientifically meaningful? How do we translate and deliver virtual environments to channels on different MEA geometries? 2) accessibility for independent researchers, for example by writing software environments not only for proprietary technologies such as Cortical Labs’ CL1 or FinalSpark’s Neuroplatform. Governance here means committing to abstraction layers that treat the CL1 as one implementation among many. 3) responsible scalability across new substrates: increasingly complex organoids or assembloids should go through rigorous bioethical frameworks. 4) sustainability and longevity of the substrates: there should be rate limits so that cells aren’t overstimulated and put at risk of early death.

Question 3

Next, describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”). Try to outline a mix of actions (e.g. a new requirement/rule, incentive, or technical strategy) pursued by different “actors” (e.g. academic researchers, companies, federal regulators, law enforcement, etc). Draw upon your existing knowledge and a little additional digging, and feel free to use analogies to other domains (e.g. 3D printing, drones, financial systems, etc.).

For each governance action, address:

  • Purpose: What is done now and what changes are you proposing?
  • Design: What is needed to make it “work”? (including the actor(s) involved - who must opt-in, fund, approve, or implement, etc)
  • Assumptions: What could you have wrong (incorrect assumptions, uncertainties)?
  • Risks of Failure & “Success”: How might this fail, including any unintended consequences of the “success” of your proposed actions?

Answer 3

  1. Benchmarking metadata across different brain-on-a-chip platforms

Purpose: Currently there are multiple commercial/proprietary brain-on-a-chip platforms such as Cortical Labs’ CL1 and FinalSpark’s Neuroplatform, but there is no standardized metadata for comparing these systems. I am proposing to catalogue existing platforms and systems and to develop an open-access metadata standard that documents their MEA geometries, channel counts, and substrates.

Design: Map out a group of academic researchers who have been working on organoid intelligence/synthetic bioengineered intelligence standardization, together with manufacturers such as MaxWell Biosystems and Cortical Labs, and join community labs or open-source groups working on open-source research. In terms of implementation, I will need to consult all these groups to create a UE plugin that responds to their needs. It would be great to apply for AHRC/UKRI grants.

Assumptions: This action assumes that all parties are happy to share their manuals or manufacturing details; however, some of this data might be protected under NDA.

Risks of Failure and Success: There’s a high chance the open-source projects will grow exponentially, making this metadata impossible to manage at scale.
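To make the metadata idea a little more concrete, here is a minimal sketch of what one record in such a standard might look like, using a Python dataclass that round-trips through JSON. The field names are my own assumptions rather than an existing standard, and the example values are placeholders, not specifications of any real platform.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlatformMetadata:
    """One entry in a hypothetical open metadata catalogue."""
    platform_name: str
    mea_geometry: str      # free-text layout description, e.g. "8x8 grid"
    channel_count: int
    substrate: str         # e.g. 2D culture, organoid, assembloid
    notes: str = ""


def to_json(record: PlatformMetadata) -> str:
    """Serialize a record with stable key order so files diff cleanly."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> PlatformMetadata:
    """Rebuild a record from its JSON form."""
    return PlatformMetadata(**json.loads(text))


if __name__ == "__main__":
    # Placeholder values only; not taken from any vendor's documentation.
    example = PlatformMetadata("Example Platform", "8x8 grid", 64, "2D cortical culture")
    print(to_json(example))
```

Keeping each record as a small, flat, versionable file is one way to let community contributors add platforms without a central database.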

  2. Developing stimulation protocols at API layer

Purpose: Since there are many different types of brain-on-a-chip platforms, each company/ lab has different protocols of stimulating and recording these systems. It would be great to propose a stimulation protocol that is initiated by the API/ game environment.

Design: Study the stimulation protocols across different systems and apply appropriate time scales, rate limits, response rates, and stimulation/ discretization patterns so we can formalize communication with living neurons.

Assumptions: The biggest assumption here is that standardization is applicable and scientifically meaningful across different biological systems. Biological variability may undermine this, because systems vary by culture, by physical assembly, and by MEA type.

Risk of Failure: Overstandardization might lead to less meaningful scientific experiments. Fixed rate limits and standards might fail to account for neural plasticity and wrongly assume that the technology will not evolve. Constant review and negotiation are needed to make this option work!
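One way a rate limit could be enforced at the API layer is a sliding-window limiter that refuses stimulation requests once a budget is used up. The sketch below is an assumption-laden illustration: the class name, the idea of counting pulses per time window, and the example limits are all hypothetical and would need to be set from each platform's actual protocols.

```python
from collections import deque


class StimulationRateLimiter:
    """Allow at most max_pulses stimulation requests in any window_s seconds."""

    def __init__(self, max_pulses: int, window_s: float):
        self.max_pulses = max_pulses
        self.window_s = window_s
        self._timestamps = deque()  # times of recently allowed pulses

    def allow(self, now: float) -> bool:
        """Return True and record the pulse if it fits the budget, else False."""
        # Forget pulses that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_s:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_pulses:
            self._timestamps.append(now)
            return True
        return False


if __name__ == "__main__":
    limiter = StimulationRateLimiter(max_pulses=2, window_s=1.0)  # illustrative limits
    for t in [0.0, 0.1, 0.2, 1.05]:
        print(t, limiter.allow(t))
```

Passing the timestamp in explicitly keeps the limiter deterministic and easy to test; a game loop would pass its own clock value.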

  3. Developing a wide range of benchmarking gaming environments/ templates

Purpose: Cortical Labs has compared living neurons against RL algorithms in Pong. I would like to expand on this to develop something adjacent to OpenAI Gym, so that we can create environments for synthetic bioengineered intelligence.

Design: These might include standardized task environments that allow researchers to compare RL agent performance on identical tasks, or have multiplayer/ team battles between two systems for performance evaluations. Standardized environments ensure that experimental results are reproducible and comparable across institutions.

Assumptions: The templates assume that biological variability can be characterized statistically across many runs, but if variability is too high, the benchmarks may not be informative.

Risks of Failure and Success: Templates might restrict certain experiment design, so it would be important to balance standardization/ benchmarking vs openness.
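A standardized task environment in the spirit of OpenAI Gym could expose a small reset/step interface that any agent, silicon or biological, is driven through. The sketch below is a simplified stand-in, not the actual Gym API: the environment, its reward, and the two policies are invented for illustration, and a seeded random generator is used so that runs are reproducible.

```python
import random


class PaddleTrackingEnv:
    """Toy 1D task: move a paddle onto a randomly placed target."""

    def __init__(self, width=10, episode_length=20, seed=0):
        self.width = width
        self.episode_length = episode_length
        self._rng = random.Random(seed)  # seeded for reproducible benchmarks

    def reset(self):
        self.paddle = self.width // 2
        self.target = self._rng.randrange(self.width)
        self.t = 0
        return (self.paddle, self.target)

    def step(self, action):
        """action is -1 (left), 0 (stay) or +1 (right)."""
        self.paddle = min(self.width - 1, max(0, self.paddle + action))
        self.t += 1
        reward = 1.0 if self.paddle == self.target else 0.0
        done = self.t >= self.episode_length
        return (self.paddle, self.target), reward, done


def greedy_policy(obs):
    """Baseline agent: always step towards the target."""
    paddle, target = obs
    return (target > paddle) - (target < paddle)


def idle_policy(obs):
    """Baseline agent: never move."""
    return 0


def run_episode(env, policy):
    """Run one episode and return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


if __name__ == "__main__":
    print(run_episode(PaddleTrackingEnv(seed=1), greedy_policy))
    print(run_episode(PaddleTrackingEnv(seed=1), idle_policy))
```

Because every agent sees the same seeded task through the same interface, scores from different labs can be compared directly.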

Question 4

Next, score (from 1-3 with, 1 as the best, or n/a) each of your governance actions against your rubric of policy goals. The following is one framework but feel free to make your own:

Answer 4

(Fill in the table with your scores for each option.)

Does the option:                                      Option 1   Option 2   Option 3
Enhance Biosecurity
• By preventing incidents                             2          1          3
• By helping respond                                  1          3          2
Foster Lab Safety
• By preventing incident                              3          1          2
• By helping respond                                  1          2          3
Protect the environment
• By preventing incidents                             2          1          3
• By helping respond                                  2          1          3
Other considerations
• Minimizing costs and burdens to stakeholders        2          3          1
• Feasibility?                                        2          1          3
• Not impede research                                 1          2          3
• Promote constructive applications                   3          1          2
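As a quick sanity check on the table above, the scores can be totalled per option. Summing with equal weights is my own simplifying assumption (the rubric does not prescribe a way to combine criteria); since 1 is the best score, a lower total is better.

```python
# Scores copied from the table above, as (Option 1, Option 2, Option 3) per row.
SCORES = {
    "Biosecurity: preventing incidents": (2, 1, 3),
    "Biosecurity: helping respond": (1, 3, 2),
    "Lab safety: preventing incident": (3, 1, 2),
    "Lab safety: helping respond": (1, 2, 3),
    "Environment: preventing incidents": (2, 1, 3),
    "Environment: helping respond": (2, 1, 3),
    "Minimizing costs and burdens": (2, 3, 1),
    "Feasibility": (2, 1, 3),
    "Not impede research": (1, 2, 3),
    "Promote constructive applications": (3, 1, 2),
}


def option_totals(scores):
    """Sum each option's column; equal weighting of criteria is an assumption."""
    return [sum(row[i] for row in scores.values()) for i in range(3)]


def best_option(scores):
    """Return the 1-based index of the option with the lowest (best) total."""
    totals = option_totals(scores)
    return totals.index(min(totals)) + 1


if __name__ == "__main__":
    print(option_totals(SCORES))
    print(best_option(SCORES))
```

A weighted sum would be a natural refinement if some criteria, such as substrate welfare, matter more than others.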

Question 5

Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties. For this, you can choose one or more relevant audiences for your recommendation, which could range from the very local (e.g. to MIT leadership or Cambridge Mayoral Office) to the national (e.g. to President Biden or the head of a Federal Agency) to the international (e.g. to the United Nations Office of the Secretary-General, or the leadership of a multinational firm or industry consortia). These could also be one of the “actor” groups in your matrix.

Answer 5

Option 2 seems to be the most well-considered option because it builds on fundamental knowledge of other research institutions’ practices and existing start-up solutions. It’s the governance action that most directly addresses the biological welfare and safety concerns that are unique to this field. Since we can’t retroactively un-damage a neuronal culture, having safety protocols embedded at the API layer is the most impactful intervention point.

Question 6

Reflecting on what you learned and did in class this week, outline any ethical concerns that arose, especially any that were new to you. Then propose any governance actions you think might be appropriate to address those issues. This should be included on your class page for this week.

Answer 6

I am interested in the concept of pharmakon: the idea that truly successful research also comes at the cost of creating additional problems, such as bioweapons or the deregulation of illegal substances (biosecurity). The governance actions I am interested in are on the cloud/API side of things, around how we might apply trust-based connectivity from software design to bio-design. For example, cloud infrastructure already uses trust models, and I think we could learn from internet architecture when regulating or modeling remote access to living biological systems.

Homework Questions from Professor Jacobson (Lecture 2 slides)

Question 7

Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome. How does biology deal with that discrepancy?

Answer 7

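Error rate: about 1 in 10^6 (1:10^6). Throughput error rate product differential: ~10^8. The human genome is about 3.2 billion letters long, so at this error rate a single copy would contain roughly 3,200 mistakes. Biology reduces the error rate through proofreading: the polymerase removes a mismatched nucleotide and tries again with the correct one, and mismatch repair systems catch errors that remain.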
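The estimate of roughly 3,200 mistakes follows directly from the two numbers above. A small sketch of the arithmetic, using integers to avoid floating-point rounding (the function name is just for illustration):

```python
GENOME_LENGTH_BP = 3_200_000_000   # about 3.2 billion letters
BASES_PER_ERROR = 10**6            # error rate of about 1 in 10^6


def expected_errors(length_bp: int, bases_per_error: int) -> int:
    """Expected number of errors in one copy at one error per bases_per_error bases."""
    return length_bp // bases_per_error


if __name__ == "__main__":
    print(expected_errors(GENOME_LENGTH_BP, BASES_PER_ERROR))
```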

Question 8

How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

Answer 8

There is an astronomical number of ways to code for an average human protein. Each amino acid has about 3 codons available on average, and an average human protein is more than 300 amino acids long. In practice, not all of these codes work equally well: some codons have few matching tRNAs in the host, so ribosomes can stall, fall off, or misread, which leads to less protein being produced.
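To put a number on "astronomical", the count can be estimated two ways: a rough estimate of 3 codons per position, and an exact count for a given sequence using the codon degeneracy of the standard genetic code (61 sense codons shared among 20 amino acids). The function names are illustrative.

```python
import math

# Number of codons per amino acid in the standard genetic code (stop codons excluded).
CODON_COUNTS = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "H": 2, "K": 2, "F": 2, "Y": 2,
    "M": 1, "W": 1,
}


def rough_codings(length_aa: int, codons_per_aa: int = 3) -> int:
    """Rough estimate: the same number of codon choices at every position."""
    return codons_per_aa ** length_aa


def count_codings(protein: str) -> int:
    """Exact number of DNA sequences that encode the given protein sequence."""
    total = 1
    for amino_acid in protein:
        total *= CODON_COUNTS[amino_acid]
    return total


if __name__ == "__main__":
    estimate = rough_codings(300)
    print(f"3^300 is about 10^{math.log10(estimate):.0f}")
    print(count_codings("MKT"))  # short example peptide
```

Even the rough estimate for a 300-residue protein is a number with well over a hundred digits.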

Homework Questions from Dr. LeProust (Lecture 2 slides)

Question 9

What’s the most commonly used method for oligo synthesis currently?

Answer 9

Phosphoramidite method by Caruthers

Question 10

Why is it difficult to make oligos longer than 200nt via direct synthesis?

Answer 10

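Each coupling cycle is slightly less than 100% efficient and side reactions accumulate, so the chemistry causes cumulative damage: the yield of full-length product drops with every added base and hits a practical wall around 200 nucleotides.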
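The cumulative effect can be illustrated with a simple model in which every coupling step succeeds with the same probability, so the fraction of full-length product is that efficiency raised to the number of couplings. The 99% and 99.5% efficiencies below are illustrative assumptions, not figures from the lecture.

```python
def full_length_yield(length_nt: int, coupling_efficiency: float) -> float:
    """Fraction of strands that reach full length, assuming independent couplings.

    An oligo of length_nt nucleotides needs length_nt - 1 coupling steps.
    """
    return coupling_efficiency ** (length_nt - 1)


if __name__ == "__main__":
    for efficiency in (0.99, 0.995):  # assumed per-step efficiencies
        for length in (50, 100, 200):
            share = full_length_yield(length, efficiency)
            print(f"efficiency={efficiency}, length={length} nt: {share:.1%} full length")
```

Under this model the full-length fraction falls exponentially with length, which is why small per-step losses become prohibitive for long oligos.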

Question 11

Why can’t you make a 2000bp gene via direct oligo synthesis?

Answer 11

At an error rate of about 1 in 3,000 bp, too many errors are distributed across the product and it becomes unpurifiable. Making a gene of this length requires good sequencing analysis and fragment analysis, as well as uniform distribution across all oligos.

Homework Question from George Church (Lecture 2 slides)

Choose ONE of the following three questions to answer; and please cite AI prompts or paper citations used, if any.

Option A – Question 12

[Using Google & Prof. Church’s slide #4] What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

Answer 12 (if you choose Option A)

Arginine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Threonine, Tryptophan, Valine.

The lysine contingency from Jurassic Park is irrelevant here, as all animals already cannot synthesize lysine and must obtain it from food.

Option B – Question 13

[Given slides #2 & 4 (AA:NA and NA:NA codes)] What code would you suggest for AA:AA interactions?

Answer 13 (if you choose Option B)

(Write your answer here.)

Option C – Question 14 (Advanced students)

[(Advanced students)] Given the one paragraph abstracts for these real 2026 grant programs sketch a response to one of them or devise one of your own: https://arpa-h.gov/explore-funding/programs/boss https://www.darpa.mil/research/programs/smart-rbc https://www.darpa.mil/research/programs/go

Answer 14 (if you choose Option C)

(Write your answer here.)

Week 3: Lab Automation

Python Script for Opentrons Artwork

Question 1

Generate an artistic design using the GUI at opentrons-art.rcdonovan.com.

I first generated a design using an image input of a voxelated ragdoll. The pixels should help simplify the image so that it can be plotted in the dish similarly.


Because of the lack of contrast and limitations in the range of colors, the image looked different than expected.

Meanwhile at LifeFabs, we only had access to the colors Pink, Blue, and Purple. So I ended up simplifying the number of fluorescent proteins used to three and generated the coordinates appropriately.


Question 2

Using the coordinates from the GUI, follow the instructions in the HTGAA26 Opentrons Colab to write your own Python script which draws your design using the Opentrons.

These were the coordinates generated from the GUI using three fluorescent proteins:

azurite_points = [(-5, 39),(-3, 39),...]
tdtomato_points = [(5, 35),(5, 33),...]
tagrfp_points = [(11, 29),(13, 29),...]

Using the Opentrons Colab document, I successfully integrated the point data into the code:

from opentrons import types

metadata = {
    'author': 'Jenn Leung',
    'protocolName': 'Opentrons Cat',
    'description': 'HTGAA 2026 Opentrons cat drawing',
    'source': 'HTGAA 2026 Opentrons Lab',
    'apiLevel': '2.20'
}

##############################################################################
###   Colour mapping
###   A1 = Blue   → Azurite (cat outline and body)
###   B1 = Purple → mCherry + mPlum (shadow and accent details)
###   C1 = Pink   → tdTomato + tagRFP + mHoneydew (warm fill and face)
##############################################################################

well_colors = {
    'A1': 'Blue',
    'B1': 'Purple',
    'C1': 'Pink',
}

##############################################################################
###   Point data
##############################################################################

azurite_points = [...]
tdtomato_points = [...]
tagrfp_points = [...]

# ... (full code in OpentronsProtocol.py)

# Blue (A1) — Azurite: cat outline and body
paint_layer(azurite_points, 'Blue')

# Pink (C1) — tdTomato + tagRFP + mHoneydew: warm fill and face detail
paint_layer(tagrfp_points, 'Pink')

# Purple (B1) — mCherry + mPlum: shadow accents and deep detail
paint_layer(tdtomato_points, 'Purple')

This is the result of the final preview on the colab document, using the three colors available.
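The paint_layer helper itself lives in OpentronsProtocol.py and is not shown here. As a rough illustration of one piece of logic such a helper plausibly needs, the sketch below groups points into batches so that a single aspirate holds enough liquid for several dots. The function name and the volumes are hypothetical and not taken from the Colab.

```python
def batch_points(points, dot_volume_ul=1.0, max_aspirate_ul=20.0):
    """Split points into batches that one aspirate can cover.

    dot_volume_ul and max_aspirate_ul are assumed values for illustration.
    """
    per_batch = int(max_aspirate_ul // dot_volume_ul)
    if per_batch < 1:
        raise ValueError("one dot needs more liquid than a single aspirate holds")
    return [points[i:i + per_batch] for i in range(0, len(points), per_batch)]


if __name__ == "__main__":
    demo_points = [(x, 0) for x in range(45)]  # made-up coordinates
    for batch in batch_points(demo_points):
        # In a real protocol: aspirate len(batch) * dot_volume_ul, then dispense per point.
        print(len(batch))
```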


Post-Lab Questions

Question 1

Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.

Answer:
‘Fluidic Programmable Gravi-maze Array for High Throughput Multiorgan Drug Testing’ by Wong et al. proposes OrganRX, a multi-organ-on-a-chip system that is compatible with automated liquid dispensing robots such as the Opentrons OT-2.

The programmable part of the microfluidic architecture uses robotic liquid handlers and automated plate readers, which can help researchers program how much media reaches each organ compartment.

There is also a programmable tilting recirculation mechanism that drives flow between the corner wells of the chip, allowing for directional flow.

The authors also built a Bluetooth-enabled iOS app for remote control of the recirculation system, which lets users select from multiple shear flow rates, set programmable waiting times between tilt-direction changes, and reset the system.


Question 2

Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more.

As my research focuses on brains-on-chips and facilitating closed-loop interactions between living substrates and software systems, I’m curious to develop something similar to the OrganRX platform that utilizes Opentrons for chemical I/O with synthetic bioengineered intelligence. The direction is to look into facilitating biochemical feedback loops and designing custom plates for Opentrons via 3D printing.

#pseudo-code for HTGAA final project Assembloid Agency
#design plasmids and custom plate MEA holder > order Twist DNA > 3D print labware > script OT2 protocol > measure spike changes and chemical delivery

prep:
>list the types of materials needed to facilitate chemical i/o for wetware, e.g. placeholder for neurons, Opentrons OT2, custom labware, Twist order, microfluidic design, media

synbio component:
>design a plasmid in Benchling and identify a chemical 'handshake' between Opentrons and neurons
>synthetic gene for measurable and identifiable chemical signals
>research in DREADD hM3Dq (human M3 muscarinic DREADD coupled to)
>benchling design!

physical system:
>design microfluidics system and composition of the custom labware
>print custom labware for Opentrons OT2, holding an MEA chip, including different wells for cultures and reinforcement agents, waste perfusion/filter
>the wells should hold basal media, reinforcement agents, and waste buffer - maybe model after the OrganRX chip to 'tilt' agents into center/substrate.

software:
>develop an API for the OT-2 to detect

assembly:
>test and try to connect the synbio parts, with hardware, and software!
>measure spikes from neurons placeholder after robotic chemical delivery

WIP JSON code for custom labware:

{
  "ordering": [["A1", "A2"]],
  "brand": {"brand": "CorticalLabs-Custom"},
  "metadata": {
    "displayName": "Assembloid Agency Chemical IO Plate",
    "displayCategory": "other",
    "displayVolumeUnits": "µL"
  },
  "dimensions": {
    "xDimension": 127.76,
    "yDimension": 85.48,
    "zDimension": 15.0
  },
  "wells": {
    "A1": {
      "depth": 10.0,
      "diameter": 3.0,
      "shape": "circular",
      "x": 20.0, 
      "y": 40.0,
      "z": 5.0
    },
    "A2": {
      "depth": 10.0,
      "diameter": 3.0,
      "shape": "circular",
      "x": 40.0,
      "y": 40.0,
      "z": 5.0
    }
  }
}
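Before printing the labware, it can help to check that the wells in the draft definition actually fit inside the plate. The sketch below copies the geometry from the WIP JSON above and checks each well against the outer dimensions. It assumes that x and y are well centres and that z is the height of the well bottom, so that z plus depth must not exceed the plate height; that interpretation, and the function name, are my assumptions and the check is not a full schema validation.

```python
# Geometry copied from the WIP labware JSON above (only the fields needed here).
LABWARE = {
    "dimensions": {"xDimension": 127.76, "yDimension": 85.48, "zDimension": 15.0},
    "wells": {
        "A1": {"depth": 10.0, "diameter": 3.0, "x": 20.0, "y": 40.0, "z": 5.0},
        "A2": {"depth": 10.0, "diameter": 3.0, "x": 40.0, "y": 40.0, "z": 5.0},
    },
}


def check_wells(labware: dict) -> list:
    """Return a list of human-readable problems; an empty list means all wells fit."""
    dims = labware["dimensions"]
    problems = []
    for name, well in labware["wells"].items():
        radius = well["diameter"] / 2
        if well["x"] - radius < 0 or well["x"] + radius > dims["xDimension"]:
            problems.append(f"{name}: outside the footprint in x")
        if well["y"] - radius < 0 or well["y"] + radius > dims["yDimension"]:
            problems.append(f"{name}: outside the footprint in y")
        if well["z"] + well["depth"] > dims["zDimension"]:
            problems.append(f"{name}: top of well is above the plate height")
    return problems


if __name__ == "__main__":
    print(check_wells(LABWARE) or "all wells fit")
```

A check like this only catches geometric mistakes; the definition would still need to be validated against the official labware schema before use on the robot.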

Final Project Ideas — DUE BY START OF FEB 24 LECTURE

As explained in this week’s recitation, add 1-3 slides with 3 ideas you have for an Individual Final Project in the appropriate slide deck for MIT/Harvard/Wellesley students or for Committed Listeners. Be sure to put your name on your slide(s); for CLs, also put your city and country on your slide(s) and be sure you’re putting your slide(s) in your Node’s section of the deck.

Assembloid Agency is a bio-digital interface platform designed to facilitate closed-loop biochemical communication between synthetic neural substrates and automated software systems. I will be integrating the Opentrons OT-2 with a multi-electrode array (MEA) to create a chemical I/O bridge between neural substrates and software systems.

I’m looking into using DREADDs to allow software-controlled chemical I/O as well as designing custom 3D-printed labware, housing the biological assembly while providing microfluidic channels for automated media exchange, chemical reinforcement signals, and waste management. The aim is to conduct real-time closed-loop chemical communication with the substrate.

Reading & Resources

Week 4: Protein Design Part I

This week focuses on how sequence, structure, and energetics can be modeled and manipulated to create or optimize proteins with specified functions.

Objective:

  1. Learn basic concepts:
    • amino acid structure
    • 3D protein visualization
    • the variety of ML-based design tools
  2. Brainstorm as a group how to apply these tools to engineer a better bacteriophage (setting the stage for the final project).

Part A. Conceptual Questions

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

  1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

Meat is roughly 20-25% protein by mass, so 500 grams of meat contains roughly 100 grams of protein; fat, fiber, and water make up the rest of the mass. An average amino acid mass of ~100 Da corresponds to a molar mass of ~100 g/mol, so: $$\text{Moles} = \frac{100 \text{ g}}{100 \text{ g/mol}} = 1 \text{ mol}$$ $$\text{Molecules} = 1 \text{ mol} \times 6.022 \times 10^{23} \text{ molecules/mol} \approx 6.022 \times 10^{23}$$ There are roughly 602 sextillion amino acids.

  2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

When humans eat beef, we break it down into smaller units through mastication and digestion. First, the protein is broken down by enzymes (proteases) into shorter chains of amino acids in the stomach. These chains are then further broken down into individual amino acids in the small intestine. Once these amino acids enter the bloodstream, our cells reassemble them according to the instructions in our own DNA. Human DNA encodes different proteins from cow or fish DNA, so the amino acids are used to build human proteins, not a cow or a fish.

  3. Why are there only 20 natural amino acids?

It may be an evolutionary mystery that almost all living things are built from these 20 natural amino acids. The 20 amino acids serve as the building blocks of most proteins. Each one is specified by codons, three-letter groups of bases that the ribosome reads off the mRNA copy of the DNA sequence. Reading 3 bases at a time gives 4^3 = 64 possible codons, which is more than enough to encode 20 amino acids and expansive enough for the making of diverse lifeforms.

  4. Can you make other non-natural amino acids? Design some new amino acids.

Yes, there are a lot of non-natural amino acids. Designing a new amino acid requires us to keep the same chassis (the backbone) but redesign the ‘R-group’, the side chain of the amino acid, to alter its chemistry. One may attach an azide to the side chain to create a reactive handle that forms strong bonds, for stickiness or bio-glue. For experiments, some researchers also use non-natural fluorescent amino acids like acridonylalanine, which glow under microscopy or in photographs.

  5. Where did amino acids come from before enzymes that make them, and before life started?

This might be related to assembly theory. Lee Cronin proposed that life is composed of different assemblies: energy, raw materials, and minerals scaffold complex interactions that produce amino acids and then longer chains. Gases and energy together can create amino acids: the Miller-Urey experiment used water, methane, ammonia, and hydrogen, with electrical sparks as the energy source, to create amino acids.

  6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

Left-handed. D-amino acids create a mirror image of the usual α-helix, because both the building blocks and the resulting structure are mirrored.

  7. Can you discover additional helices in proteins?

Yes. Since 2020, AlphaFold has allowed us to quickly predict how proteins fold and has revealed millions of protein structures, in which new helices can be found.

  8. Why are most molecular helices right-handed?

Because of chirality, most helices are not identical to their mirror image. As most amino acids are in the L-form (left-handed), the most efficient way for them to stack is to twist to the right, where they can form stable bonds with enough room between each other.

  1. Why do β-sheets tend to aggregate?

β-sheets bond together via hydrogen bonds between neighbouring strands. The geometry is a pleated, zigzag, sheet-like structure with side chains protruding from both faces.

  • What is the driving force for β-sheet aggregation?

They tend to aggregate because of their geometry: the hydrophobic faces can sandwich and stick together to hide from water. This hydrophobic effect, the push from the surrounding water, becomes the driving force for clumping.

  1. Why do many amyloid diseases form β-sheets?
    • Can you use amyloid β-sheets as materials?
  2. Design a β-sheet motif that forms a well-ordered structure.

Part B: Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

  1. Briefly describe the protein you selected and why you selected it.

alt text alt text I chose the GPR3 orphan G protein-coupled receptor in complex with dominant-negative Gs (PDB: 8U8F). GPR3 is a class A orphan G protein-coupled receptor (GPCR) with broad expression across brain regions including the hypothalamus, hippocampus, and cortex, as well as in peripheral tissues such as the liver and ovary. It has a potential role in modulating a number of brain functions, including behavioral responses to stress, amyloid-beta peptide generation in neurons, and neurite outgrowth. For my brains-on-chips research, I’m interested in different types of expression in the central nervous system and the brain.

  1. Identify the amino acid sequence of your protein.
    • How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

    There are four protein chains. Chain A: 372, Chain B: 339, Chain C: 58, Chain D: 384. The most frequent amino acid seems to be leucine. It is a sturdy, hydrophobic (water-hating) amino acid.

    >8U8F_4|Chain D[auth R]|G-protein coupled receptor 3|Homo sapiens (9606)
    NSTMKTIIALSYIFCLVFADYKDDDDLEVLFQGPAMWGAGSPLAWLSAGSGNVNVSSVGPAEGPTGPAAPLPSPKAWDVVLCISGTLVSCENALVVAIIVGTPAFRAPMFLLVGSLAVADLLAGLGLVLHFAAVFCIGSAEMSLVLVGVLAMAFTASIGSLLAITVDRYLSLYNALTYYSETTVTRTYVMLALVWGGALGLGLLPVLAWNCLDGLTTCGVVYPLSKNHLVVLAIAFFMVFGIMLQLYAQICRIVCRHAQQIALQRHLLPASHYVATRKGIATLAVVLGAFAACWLPFTVYCLLGDAHSPPLYTYLTLLPATYNSMINPIIYAFRNQDVQKVLWAVCCCCSSSKIPFRSRSPSDVPAGLEVLFQGPHHHHHHHHAAAFESR
    >8U8F_3|Chain C[auth G]|Guanine nucleotide-binding protein G(I)/G(S)/G(O) subunit gamma-2|Homo sapiens (9606)
    NTASIAQARKLVEQLKMEANIDRIKVSKAAADLMAYCEAHAKEDPLLTPVPASENPFR
    >8U8F_2|Chain B|Guanine nucleotide-binding protein G(I)/G(S)/G(T) subunit beta-1|Homo sapiens (9606)
    QSELDQLRQEAEQLKNQIRDARKACADATLSQITNNIDPVGRIQMRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGKLIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNYVACGGLDNICSIYNLKTREGNVRVSRELAGHTGYLSCCRFLDDNQIVTSSGDTTCALWDIETGQQTTTFTGHTGDVMSLSLAPDTRLFVSGACDASAKLWDVREGMCRQTFTGHESDINAICFFPNGNAFATGSDDATCRLFDLRADQELMTYSHDNIICGITSVSFSKSGRLLLAGYDDFNCNVWDALKADRAGVLAGHDNRVSCLGVTDDGMAVATGSWDSFLKIWN
    >8U8F_1|Chain A|Guanine nucleotide-binding protein G(s) subunit alpha isoforms short|Homo sapiens (9606)
    MGCLGNSKTEDQRNEEKAQREANKKIEKQLQKDKQVYRATHRLLLLGAGESGKNTIVKQMRILHVNGFNGEGGEEDPQAARSNSDGEKATKVQDIKNNLKEAIETIVAAMSNLVPPVELANPENQFRVDYILSVMNVPDFDFPPEFYEHAKALWEDEGVRACYERSNEYQLIDCAQYFLDKIDVIKQADYVPSDQDLLRCRVLTSGIFETKFQVDKVNFHMFDVGAQRDERRKWIQCFNDVTAIIFVVASSSYNMVIREDNQTNRLQAALKLFDSIWNNKWLRDTSVILFLNKQDLLAEKVLAGKSKIEDYFPEFARYTTPEDATPEPGEDPRVTRAKYFIRDEFLRISTASGDGRHYCYPHFTCSVDTENIRRVFNDCRDIIQRMHLRQYELL
    • How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

    There are thousands of homologs, including sequences from human, pygmy chimpanzee, olive baboon, cotton-top tamarin, etc. The protein seems highly conserved across these species.

    • Does your protein belong to any protein family?

    G Protein-Coupled Receptor (GPCR) Family
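    For the amino acid frequency question above, the count can also be done locally instead of in the Colab notebook. The sketch below is an assumption about how such a count might look, not the notebook's actual code; it uses the Chain C sequence copied from the FASTA record above, and the helper name `residue_frequencies` is hypothetical.

```python
from collections import Counter

# Chain C (G gamma-2 subunit) of 8U8F, copied from the FASTA record above.
CHAIN_C = "NTASIAQARKLVEQLKMEANIDRIKVSKAAADLMAYCEAHAKEDPLLTPVPASENPFR"


def residue_frequencies(sequence):
    """Count each residue, ignoring whitespace and letter case."""
    cleaned = "".join(sequence.split()).upper()
    return Counter(cleaned)


counts = residue_frequencies(CHAIN_C)
total = sum(counts.values())
print(f"length: {total}")
for residue, n in counts.most_common(3):
    print(f"{residue}: {n} ({n / total:.1%})")
```

    Running the same function on each chain in turn, or on all chains joined together, gives the per-chain and overall counts to compare against the answer above.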

  2. Identify the structure page of your protein in RCSB
    • When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)

    The structure was solved around September 2023 and released in March 2024. The method is electron microscopy, but the resolution is 3.49 Å, which is coarser than the 2.70 Å example in the question.

    • Are there any other molecules in the solved structure apart from protein?

    Yes, I see palmitic acid in the structure apart from protein.

    It is a membrane protein and falls under the 7-transmembrane receptor (GPCR) class.

  3. Open the structure of your protein in any 3D molecule visualization software:
    • PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
    • Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.

    Cartoon alt text alt text Ribbon alt text alt text Ball and stick alt text alt text

    • Color the protein by secondary structure. Does it have more helices or sheets? alt text alt text It has a lot more helices than sheets.
    • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

    alt text alt text I used an additional script to label the hydrophobicity scale. Hydrophobic residues are red and hydrophilic (polar/charged) residues are white. It is slightly more hydrophobic.

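# https://pymolwiki.org/index.php/Color_h
# color_h shades residues from red (most hydrophobic) to near-white (least);
# color_h2 shades them from near-white (most hydrophobic) to green (least).
from pymol import cmd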

def color_h(selection='all'):
        s = str(selection)
        print(s)
        cmd.set_color('color_ile',[0.996,0.062,0.062])
        cmd.set_color('color_phe',[0.996,0.109,0.109])
        cmd.set_color('color_val',[0.992,0.156,0.156])
        cmd.set_color('color_leu',[0.992,0.207,0.207])
        cmd.set_color('color_trp',[0.992,0.254,0.254])
        cmd.set_color('color_met',[0.988,0.301,0.301])
        cmd.set_color('color_ala',[0.988,0.348,0.348])
        cmd.set_color('color_gly',[0.984,0.394,0.394])
        cmd.set_color('color_cys',[0.984,0.445,0.445])
        cmd.set_color('color_tyr',[0.984,0.492,0.492])
        cmd.set_color('color_pro',[0.980,0.539,0.539])
        cmd.set_color('color_thr',[0.980,0.586,0.586])
        cmd.set_color('color_ser',[0.980,0.637,0.637])
        cmd.set_color('color_his',[0.977,0.684,0.684])
        cmd.set_color('color_glu',[0.977,0.730,0.730])
        cmd.set_color('color_asn',[0.973,0.777,0.777])
        cmd.set_color('color_gln',[0.973,0.824,0.824])
        cmd.set_color('color_asp',[0.973,0.875,0.875])
        cmd.set_color('color_lys',[0.899,0.922,0.922])
        cmd.set_color('color_arg',[0.899,0.969,0.969])
        cmd.color("color_ile","("+s+" and resn ile)")
        cmd.color("color_phe","("+s+" and resn phe)")
        cmd.color("color_val","("+s+" and resn val)")
        cmd.color("color_leu","("+s+" and resn leu)")
        cmd.color("color_trp","("+s+" and resn trp)")
        cmd.color("color_met","("+s+" and resn met)")
        cmd.color("color_ala","("+s+" and resn ala)")
        cmd.color("color_gly","("+s+" and resn gly)")
        cmd.color("color_cys","("+s+" and resn cys)")
        cmd.color("color_tyr","("+s+" and resn tyr)")
        cmd.color("color_pro","("+s+" and resn pro)")
        cmd.color("color_thr","("+s+" and resn thr)")
        cmd.color("color_ser","("+s+" and resn ser)")
        cmd.color("color_his","("+s+" and resn his)")
        cmd.color("color_glu","("+s+" and resn glu)")
        cmd.color("color_asn","("+s+" and resn asn)")
        cmd.color("color_gln","("+s+" and resn gln)")
        cmd.color("color_asp","("+s+" and resn asp)")
        cmd.color("color_lys","("+s+" and resn lys)")
        cmd.color("color_arg","("+s+" and resn arg)")
cmd.extend('color_h',color_h)

def color_h2(selection='all'):
        s = str(selection)
        print(s)
        cmd.set_color("color_ile2",[0.938,1,0.938])
        cmd.set_color("color_phe2",[0.891,1,0.891])
        cmd.set_color("color_val2",[0.844,1,0.844])
        cmd.set_color("color_leu2",[0.793,1,0.793])
        cmd.set_color("color_trp2",[0.746,1,0.746])
        cmd.set_color("color_met2",[0.699,1,0.699])
        cmd.set_color("color_ala2",[0.652,1,0.652])
        cmd.set_color("color_gly2",[0.606,1,0.606])
        cmd.set_color("color_cys2",[0.555,1,0.555])
        cmd.set_color("color_tyr2",[0.508,1,0.508])
        cmd.set_color("color_pro2",[0.461,1,0.461])
        cmd.set_color("color_thr2",[0.414,1,0.414])
        cmd.set_color("color_ser2",[0.363,1,0.363])
        cmd.set_color("color_his2",[0.316,1,0.316])
        cmd.set_color("color_glu2",[0.27,1,0.27])
        cmd.set_color("color_asn2",[0.223,1,0.223])
        cmd.set_color("color_gln2",[0.176,1,0.176])
        cmd.set_color("color_asp2",[0.125,1,0.125])
        cmd.set_color("color_lys2",[0.078,1,0.078])
        cmd.set_color("color_arg2",[0.031,1,0.031])
        cmd.color("color_ile2","("+s+" and resn ile)")
        cmd.color("color_phe2","("+s+" and resn phe)")
        cmd.color("color_val2","("+s+" and resn val)")
        cmd.color("color_leu2","("+s+" and resn leu)")
        cmd.color("color_trp2","("+s+" and resn trp)")
        cmd.color("color_met2","("+s+" and resn met)")
        cmd.color("color_ala2","("+s+" and resn ala)")
        cmd.color("color_gly2","("+s+" and resn gly)")
        cmd.color("color_cys2","("+s+" and resn cys)")
        cmd.color("color_tyr2","("+s+" and resn tyr)")
        cmd.color("color_pro2","("+s+" and resn pro)")
        cmd.color("color_thr2","("+s+" and resn thr)")
        cmd.color("color_ser2","("+s+" and resn ser)")
        cmd.color("color_his2","("+s+" and resn his)")
        cmd.color("color_glu2","("+s+" and resn glu)")
        cmd.color("color_asn2","("+s+" and resn asn)")
        cmd.color("color_gln2","("+s+" and resn gln)")
        cmd.color("color_asp2","("+s+" and resn asp)")
        cmd.color("color_lys2","("+s+" and resn lys)")
        cmd.color("color_arg2","("+s+" and resn arg)")
cmd.extend('color_h2',color_h2)
alt text alt text
  • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

alt text alt text Yes it appears to have a hole in the middle.

Part C. Using ML-Based Protein Design Tools

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein.

  1. Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU.
  2. Choose your favorite protein from the PDB.
  3. We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

Picture Source: Bordin, Nicola et al (2023). Novel machine learning approaches revolutionize protein knowledge. Trends in Biochemical Sciences, Volume 48, Issue 4, 345 - 359


  1. Deep Mutational Scans

    1. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.

    Using ESM2 mutational scans, 8U8F looks like: alt text

    2. Can you explain any particular pattern? (choose a residue and a mutation that stands out)

    There appear to be vertical bands in the sequence where the predicted score is low across many different amino acids. These positions are probably highly conserved for functional and structural reasons, so almost any substitution is disfavored. Lysine also shows lots of dark spots and low scores, possibly because it creates a hydrophobic mismatch. There is a yellow band at position 243. It is interesting that lysine, which is charged, has lots of blue bands, while leucine, which is neutral, mostly scores high.

    3. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.
  2. Latent Space Analysis

    1. Use the provided sequence dataset to embed proteins in reduced dimensionality.

    alt text

    2. Analyze the different formed neighborhoods: do they approximate similar proteins?

    They are positioned far away from each other, so they are very different proteins.

    3. Place your protein in the resulting map and explain its position and similarity to its neighbors.

    The G-protein subunits ($\alpha$, $\beta$, and $\gamma$) are much closer to each other on the map. Chain G is much shorter, only 58 amino acids, and is structurally very different from the other proteins: it is essentially just two small alpha-helices connected by a loop.

C2. Protein Folding

Picture Source: Lin et al (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model.


Folding a protein

  1. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
alt text alt text
  1. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

Changing small snippets of the sequence made little visible difference, but inserting longer runs of the same amino acid made the changes in the twists more visible. alt text alt text

C3. Protein Generation

Picture Source: 1. Post from Sergey Ovchinnikov 2. Roney, Ovchinnikov et al (2022). State-of-the-art estimation of protein model accuracy using AlphaFold. Phys. Rev. Lett. 129, 238101


Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN

  1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

Using fixed-backbone design, we kept the 3D shape of 8U8F Chain A and reskinned it with a new sequence. ProteinMPNN ended up rewriting about 75% of the protein (sequence recovery was 23-29% across the five samples), with a high frequency of leucine and lysine. alt text alt text My results look like:

Model weights found in ProteinMPNN/vanilla_model_weights
Using device: cuda:0
Number of edges: 48
Training noise level: 0.2A
Model loaded
{'8u8f': (['A'], [])}
Length of chain A is 381
Generating sequences...
>8u8f, score=2.1622, fixed_chains=[], designed_chains=['A'], model_name=v_48_020
NEEKAQREANKKIEKQLQKDKQVYRATHRLLLLGAGESGKNTIVKQMRIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSGIFETKFQVDKVNFHMFDVGAQRDERRKWIQCFNDVTAIIFVVASSSYXXXXXXXXQTNRLQAALKLFDSIWNNKWLRDTSVILFLNKQDLLAEKVLAGKSKIEDYFPEFARYTTPEDATPEPGEDPRVTRAKYFIRDEFLRISTASGDGRHYCYPHFTCSVDTENIRRVFNDCRDIIQRMHLRQYELL
>T=0.1, sample=0, score=1.0949, seq_recovery=0.2511
ELLKLLEELLKKLAEKLKKEEEEEKKIKKILLLGSPSSGKTTLLKNIKKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEVVEFTIDGKKYKIYDLKNQPPDLREVLAKYKDAKVIIYVFPLGSFXXXXXXXXPEDLEKVALEELEWIWNHPDLKNVPILVIFNRPELLRERVLSGKNPIEERFPEYKGYELPKEVKPPEGVPEEWVKVLAFIIDKILKFANKNRGGIREVYPVISSPESKDIKQIIYDAIKKAEERKKLIAEGKL
>T=0.1, sample=0, score=1.1122, seq_recovery=0.2338
LLLLLLLLLLLLLLVLLLLKLLEESKIKKLLLLGSPSSGKTSLLENIEKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPERVLEFEIDGVKYRIIDLSNLPPDLSDVLSEYSDCEIIIYVFSTGSYXXXXXXXXPEDLESVDLERLKWIWNHPALKNTPILVIFNRPELLAKRVLSGEKPIEERFPEYKGYKLPENVKPPPGVPEETVKVLSFLIDKVLEFANQNRGGIREVYPVISSVKSKEIKEIIYEAVKKAEERKKLIAQGLL
>T=0.1, sample=0, score=1.0975, seq_recovery=0.2554
KEEEKKKELEEKLKKEEEKKKEEEEKVIKLLLLGLPNSGKTTILENIKKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEVIEFEIEGKKYRIVDLKNLPPDLSEILEKYSDCKILVYIFPTGSFXXXXXXXXPENLEKEALELLKRIWNHPSLKNVPLLVIFNRAEKLKEIVLSGEKPIEEYFPEYKGYKLPESAKPPPNTDPEVVKVLSFLIDKILEYANQNRGGIRKVFPVISSPESKDIREIIYKAVKEAEERKKLIALGLL
>T=0.1, sample=0, score=1.1196, seq_recovery=0.2857
AALAEELAKKKALAALKKKEEEEESKVKKLLLLGGPSSGKTTLLENISKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSSIRELEFEIDGVKYKILDLENRPEDLSEILSEFKDCEIIIYVFPLGSFXXXXXXXXPENLLKKALEEFERIWNHPDLKDVPILVLFNRPELLKEKVLSGKKPLEEIFPEYKGWELPEDAKPPPNTPLEWVKALYFLKEKVLEIANKNRGGRREVFPFIVSPKSKDIKEIIYNAVKEAEKRKALIAAGLL
>T=0.1, sample=0, score=1.1445, seq_recovery=0.2381
LLLLLLLALLLALAALLAALAEEEKKVRKLLLLGLPNSGKTTLLKNISKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEPEEILKFEIDGVKYEIKDLKNRPPDLSDILKEYSDCDIIIYVFPSGLFXXXXXXXXPENLEEVALEQLKNLLNNPDLKNVPILVLFNRPELLKKIVESGKRPLEEIFPEYKGYELPESAVCPPNTPLEWCKAIYFLIDKILEFANQNRGGISEVYPHITSPDSKDIKQIIYDAVKKAEERKKLIAAGKL

New Sequence: DKKIKKDDKKIIKDIKIIDDDDDIIHIIHKKKFRNRRFISSKKIMHIIYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYDNDDTTDESHCFIIWIHWCKIMPNNCKQDTKIWICITHHWTENKFREYYYYYYYYNDCKDITKDDKDVHVMGNCKIMTNHKTHEMQNDKKQDQTKRFIMNHDDQENDWIFWDKNIDTINNDFTNDDVTITKEHHCIHKIEMIMQFFHQDTWNTHRRNDRICHIPHHWCHIIDDQIIKHDFIK

============================================================
Summary
============================================================
Sequence 1: score=1.0949, recovery=25.11%
Sequence 2: score=1.1122, recovery=23.38%
Sequence 3: score=1.0975, recovery=25.54%
Sequence 4: score=1.1196, recovery=28.57%
Sequence 5: score=1.1445, recovery=23.81%
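
To make the recovery numbers above easier to read, here is a minimal sketch of how a sequence recovery value can be computed by hand. It assumes recovery means the fraction of compared positions where the designed residue matches the native one, and that `X` marks positions to skip; ProteinMPNN's own calculation may differ in detail. The function name and the toy strings are hypothetical and are not taken from the run above.

```python
def sequence_recovery(native, designed, mask="X"):
    """Fraction of compared positions where designed == native.

    Positions where either sequence carries the mask character are skipped.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    compared = [(n, d) for n, d in zip(native, designed)
                if n != mask and d != mask]
    if not compared:
        raise ValueError("no unmasked positions to compare")
    matches = sum(1 for n, d in compared if n == d)
    return matches / len(compared)


# Toy strings (hypothetical): 8 compared positions, 6 of them identical.
native = "MKTAXXLLGA"
designed = "MKSAXXLIGA"
print(f"recovery = {sequence_recovery(native, designed):.2%}")
```

A recovery of about 25% therefore means roughly three in four compared residues were changed by the design.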

GPU acceleration didn’t work for me on Google Colab, so I cloned the notebook to run it locally.

  1. Input this sequence into ESMFold and compare the predicted structure to your original.

new sequence new sequence alt text alt text

DKKIKKDDKKIIKDIKIIDDDDDIIHIIHKKKFRNRRFISSKKIMHIIYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYDNDDTTDESHCFIIWIHWCKIMPNNCKQDTKIWICITHHWTENKFREYYYYYYYYNDCKDITKDDKDVHVMGNCKIMTNHKTHEMQNDKKQDQTKRFIMNHDDQENDWIFWDKNIDTINNDFTNDDVTITKEHHCIHKIEMIMQFFHQDTWNTHRRNDRICHIPHHWCHIIDDQIIKHDFIK

The predicted structure retains the overall fold, but when compared in PyMOL the new structure (white) looks displaced relative to the original. alt text alt text

Part D. Group Brainstorm on Bacteriophage Engineering

Assignees for the following sections
MIT/Harvard students: Optional
Committed Listeners: Required
  1. Find a group of ~3–4 students

  2. Read through the Phage Reading material listed under “Reading & Resources” below.

  3. Review the Bacteriophage Final Project Goals for engineering the L Protein:

    • Increased stability (easiest)
    • Higher titers (medium)
    • Higher toxicity of lysis protein (hard)
  4. Brainstorm Session

    • Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).

    Optimizing the L protein’s binding affinity to E. coli to accelerate the lysis trigger, and increasing the stability of the L protein so that it folds and integrates into the membrane to perform its function.

    • Write a 1-page proposal (bullet points or short paragraphs) describing:

      • Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).

      We would like to use protein language models such as ESM2 in the Colab notebook to perform in silico mutagenesis. We will score single point mutations across the L protein sequence and try to identify mutations that are more evolutionarily favorable. As in the assignment, I am interested in using ProteinMPNN to redesign and generate a new sequence. Given the backbone structure of the L protein, this tool will help us generate alternative sequences that maintain the same fold but may have higher thermal stability, which is one of our goals. AlphaFold-Multimer is also of interest, as it predicts 3D structures of protein complexes (co-folding multiple chains), so we can check a range of candidate complexes.

      • Why do you think those tools might help solve your chosen sub-problem?

      ProteinMPNN was very robust at producing sequences that fit a specific shape, which gives us a good chance of increasing protein stability, although there is no guarantee. ESM2 lets us scan a very large number of mutations at once, so we can quickly narrow down a direction in a way we couldn’t in a wet-lab setting.

      • Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).

      The L protein is a membrane protein. Most standard protein models, such as AlphaFold-Multimer, seem to be trained primarily on soluble proteins. The specific lipid-protein interactions required for lysis may not be fully captured, leading to “stable” designs that fail to insert into the membrane. From my own assignment, I still don’t understand how the redesigned shape would fit, since it appeared displaced.

      • Include a schematic of your pipeline.
    • This resource may be useful: HTGAA Protein Engineering Tools

  5. Each individually put your plan on your HTGAA website

    • Include your group’s short plan for engineering a bacteriophage

Input an L protein sequence > use ESM2 to generate favorable mutations (the heat map should show us green-light vs no-go directions in the sequence) > use ProteinMPNN to generate and find a skeleton template for core stability > add complexity via AlphaFold-Multimer by predicting an interaction > use PyMOL to check shape and geometry > calculate a binding affinity score via Colab > select the best candidates!


Reading & Resources

Tools

Phage Reading

Week 10: Imaging and Measurement

Week 10: Advanced Imaging & Measurement Technology (Apr 7)

Advanced Imaging & Measurement Tech (Evan Daugharthy, Waters Corp.)
Lab: Mass Spectrometry

This lecture presents a range of advanced technologies to do precision measurement of proteins at atomic scales, characterizing chemical composition, and detecting protein sequence and structure.

Lecture (Tues, Apr 7)

Advanced Imaging & Measurement Tech
(▶️Recording)
Evan Daugharthy, Lindsay Morrison.

Recitation (Wed, Apr 8)

Mass spectrometry
(▶️Recording | 💻Slides)
Waters Corp. Team

Lab (Thurs-Fri, Apr 9 - 10)

Homework — DUE BY START OF Apr 14 LECTURE

Homework is partly based on data that will be generated in the Waters Immerse Lab in Cambridge, MA. Students will characterize green fluorescent protein (eGFP, a recombinant protein standard) structure (primary, secondary/tertiary) in the lab using liquid chromatography and mass spectrometry, as well as Keyhole Limpet Hemocyanin (KLH) oligomeric states using charge detection mass spectrometry (CDMS). Data generated in the lab needed to do the homework is included both within this document and in the Appendix of the laboratory protocol.

Homework: Final Project

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

For your final project:

  • Please identify at least one (ideally many) aspect(s) of your project that you will measure. It could be the mass or sequence of a protein, the presence, absence, or quantity of a biomarker, etc.

I will need to measure how much Nurr1 and FoxA2 were successfully introduced, which will be reflected by fluorescent proteins.

  • Please describe all of the elements you would like to measure, and furthermore describe how you will perform these measurements.

  • What are the technologies you will use (e.g., gel electrophoresis, DNA sequencing, mass spectrometry, etc.)? Describe in detail.

Homework: Waters Part I — Molecular Weight

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

We will analyze an eGFP standard on a Waters Xevo G3 QTof MS system to determine the molecular weight of intact eGFP and observe its charge state distribution in the native and denatured (unfolded) states. The conditions for LC-MS analysis of intact protein cause it to unfold and be detected in its denatured form (due to the solvents and pH used for analysis).

  1. Based on the predicted amino acid sequence of eGFP (see below) and any known modifications, what is the calculated molecular weight? You can use an online calculator like the one at https://web.expasy.org/compute_pi/

    eGFP Sequence:
    MVSKGEELFTG VVPILVELDG DVNGHKFSVS GEGEGDATYG KLTLKFICTT GKLPVPWPTL VTTLTYGVQC FSRYPDHMKQ HDFFKSAMPE GYVQERTIFF KDDGNYKTRA EVKFEGDTLV NRIELKGIDF KEDGNILGHK LEYNYNSHNV YIMADKQKNG IKVNFKIRHN IEDGSVQLAD HYQQNTPIGD GPVLLPDNHY LSTQSALSKD PNEKRDHMVL LEFVTAAGIT LGMDELYKLE HHHHHH
    Note: This contains a His-purification tag (HHHHHH) and a linker (the LE before it).

  2. Calculate the molecular weight of the eGFP using the adjacent charge state approach described in the recitation. Select two charge states from the intact LC-MS data (Figure 1) and:
    1. Determine $z$ for each adjacent pair of peaks $(n, n+1)$ using: $$ {\large z} = {\Large \frac{\frac{m}{z_{n+1}}}{\frac{m}{z_n} - \frac{m}{z_{n+1}}}} $$
    2. Determine the MW of the protein using the relationship between $\frac{m}{z_n}$, $MW$, and $z$
    3. Calculate the accuracy of the measurement using the deconvoluted MW from 2.2 and the predicted weight of the protein from 2.1 using: $$ \text{Accuracy} = \frac{|MW_{\text{experiment}} - MW_{\text{theory}}|}{MW_{\text{theory}}} $$
      Figure 1. Mass Spectrum of intact eGFP protein from the Waters Xevo G3 LC-MS (a mass spectrometer with 30,000 resolution) with individual charge state peaks labeled with $\frac{m}{z}$ values.


  3. Can you observe the charge state for the zoomed-in peak in the mass spectrum for the intact eGFP? If yes, what is it? If no, why not?
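
The adjacent charge state approach in question 2 can be sketched as a short calculation. This is an illustration under stated assumptions, not the recitation's worked answer: the peaks are synthetic values generated for a hypothetical 28,000 Da protein rather than read from Figure 1, the function names are made up, and the relationship $\frac{m}{z} = \frac{MW + z \cdot 1.00728}{z}$ (one proton per charge) is assumed for step 2.

```python
PROTON_MASS = 1.007276  # Da added per positive charge (protonation)


def charge_from_adjacent_peaks(mz_n, mz_n_plus_1):
    """Charge n of the higher-m/z peak, using the formula from step 1.

    mz_n is the peak carrying n charges; mz_n_plus_1 is its lower-m/z
    neighbour carrying n + 1 charges. The ratio is rounded to an integer.
    """
    if mz_n <= mz_n_plus_1:
        raise ValueError("mz_n must be larger than mz_n_plus_1")
    return round(mz_n_plus_1 / (mz_n - mz_n_plus_1))


def neutral_mass(mz, z):
    """Molecular weight from m/z = (MW + z * proton) / z."""
    return z * (mz - PROTON_MASS)


def relative_error(mw_experiment, mw_theory):
    """Accuracy as defined in step 3."""
    return abs(mw_experiment - mw_theory) / mw_theory


# Synthetic peaks for a hypothetical 28,000 Da protein (not lab data).
true_mw = 28000.0
peak_20 = true_mw / 20 + PROTON_MASS   # peak carrying 20 charges
peak_21 = true_mw / 21 + PROTON_MASS   # peak carrying 21 charges

z = charge_from_adjacent_peaks(peak_20, peak_21)
mw = neutral_mass(peak_20, z)
print(z, round(mw, 2), relative_error(mw, true_mw))
```

With real data, the two m/z values would be read from neighbouring labeled peaks in Figure 1, and the theoretical weight would come from the calculator in question 1.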

Homework: Waters Part II — Secondary/Tertiary structure

Assignees for the following sections
MIT/Harvard students: Optional but highly recommended
Committed Listeners: Optional but highly recommended

We will analyze eGFP in its native, folded state and compare it to its denatured, unfolded state on a quadrupole time-of-flight MS. We will be doing MS-only analysis (no liquid chromatography, also known as “direct infusion” experiments) on the Waters Xevo G3-QToF MS.

  1. Based on learnings in the lab, please explain the difference between native and denatured protein conformations. For example, what happens when a protein unfolds? How is that determined with a mass spectrometer? What changes do you see in the mass spectrum between the native and denatured protein analyses (Figure 2)?
    Figure 2.  Comparison of the mass spectra between denatured (top) and native (bottom) eGFP standard on the Waters Xevo G3 QTof MS.


  2. Zooming into the native mass spectrum of eGFP from the Waters Xevo G3 QTof MS (see Figure 3), can you discern the charge state of the peak at ~2800 $\frac{m}{z}$? What is the charge state? How can you tell?
    Figure 3.  Native eGFP mass spectrum from the Waters Xevo G3 Q-Tof MS.  The inset is a zoomed-in view of the charge state at ~2800 $\frac{m}{z}$ on a mass spectrometer with 30,000 resolution.


Homework: Waters Part III — Peptide Mapping - primary structure

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

We will digest the eGFP protein standard into peptides using trypsin (an enzyme that selectively cleaves the peptide bond after Lysine (K) and Arginine (R) residues). The resulting peptides will be analyzed on the Waters BioAccord LC-MS to measure their molecular weights and fragmented to confirm the amino acid sequence within each peptide – generating a “peptide map”. This process is used to confirm the primary structure of the protein.

There are a variety of tools available online to calculate protein molecular weight and predict a list of peptides generated from a tryptic digest. We will be using tools within the online resource Expasy (the bioinformatics resource portal of the Swiss Institute of Bioinformatics (SIB)) to predict a list of tryptic peptides from eGFP.
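
As a rough cross-check on the online tool, the cleavage rule described above can be sketched in a few lines. This is a simplified assumption of how trypsin behaves (cut after every K or R, with an optional switch to leave bonds before proline intact, and no missed cleavages); it does not reproduce the PeptideMass settings in Figure 4, and the function names and toy peptide are hypothetical.

```python
def count_k_and_r(sequence):
    """Return (number of K, number of R), ignoring whitespace and case."""
    seq = "".join(sequence.split()).upper()
    return seq.count("K"), seq.count("R")


def tryptic_digest(sequence, skip_before_proline=False):
    """Split after every K or R; optionally keep K-P and R-P bonds intact."""
    seq = "".join(sequence.split()).upper()
    peptides = []
    start = 0
    for i, residue in enumerate(seq):
        if residue in "KR":
            next_is_proline = i + 1 < len(seq) and seq[i + 1] == "P"
            if skip_before_proline and next_is_proline:
                continue
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide without K/R
    return peptides


toy = "AKRPGK C"  # toy peptide with a space, like the grouped eGFP listing
print(count_k_and_r(toy))
print(tryptic_digest(toy))
print(tryptic_digest(toy, skip_before_proline=True))
```

The eGFP sequence from Waters Part I can be passed in directly, since the spaces between its blocks of ten residues are stripped before counting.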

  1. How many Lysines (K) and Arginines (R) are in eGFP? Please circle or highlight them in the eGFP sequence given in Waters Part I question 1 above. (Note: adding the sequence to Benchling as an amino acid file and clicking biochemical properties tab will show you a count for each amino acid).
  2. How many peptides will be generated from tryptic digestion of eGFP?
    1. Navigate to https://web.expasy.org/peptide_mass/
    2. Copy/paste the sequence above into the input box in the PeptideMass tool to generate expected list of peptides.
    3. Use Figure 4 below as a guide for the relevant parameters to predict peptides from eGFP.
    4. Click “Perform the Cleavage” button in the PeptideMass tool and report the number of peptides generated when using trypsin to perform the digest.
      Figure 4.  Example conditions for predicting the number of tryptic peptides from the eGFP standard.  Please replicate all parameters shown above.


  3. Based on the LC-MS data for the Peptide Map data generated in lab (please use Figure 5a as a reference) how many chromatographic peaks do you see in the eGFP peptide map between 0.5 and 6 minutes? You may count all peaks that are >10% relative abundance.
    Figure 5a. Total ion chromatogram (TIC) of the eGFP peptide map. The peak at 2.78 minutes is circled, and its MS data is shown in the mass spectrum in Figure 5b, below.


  4. Assuming all the peaks are peptides, does the number of peaks match the number of peptides predicted from question 2 above? Are there more peaks in the chromatogram or fewer?
  5. Identify the mass-to-charge ($\frac{m}{z}$) of the peptide shown in Figure 5b. What is the charge ($z$) of the most abundant charge state of the peptide (use the separation of the isotopes to determine the charge state). Calculate the mass of the singly charged form of the peptide ($\small{[M\!\!+\!\!H]^+}$) based on its $\frac{m}{z}$ and $z$.
    Figure 5b. Mass spectrum figure to show $\frac{m}{z}$ for the chromatographic peak at 2.78 min from Figure 5a above. The inset is a zoom-in of the peak at $\frac{m}{z}$ 525.76, to discern the isotope peaks.


    Figure 5c. Fragmentation spectrum of the peptide eluting at retention time 2.78 minutes in Figure 5a (above).


  6. Identify the peptide based on comparison to expected masses in the PeptideMass tool. What is mass accuracy of measurement? Please calculate the error in ppm. (Recall that $ \text{Accuracy} = \frac{|MW_{\text{experiment}} - MW_{\text{theory}}|}{MW_{\text{theory}}} $ )
  7. What is the percentage of the sequence that is confirmed by peptide mapping? (see Figure 6)
    Figure 6.  Amino Acid Coverage Map of eGFP based on BioAccord LC-MS peptide identification data.
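
The arithmetic behind questions 5 and 6 can be sketched in Python. This is a minimal sketch, not a reading of the spectra: the proton mass constant, the function names, and the example isotope spacing of 0.5 are assumptions for illustration, and the fractional accuracy is multiplied by 10^6 to express it in ppm.

```python
PROTON_MASS = 1.00728  # Da; assumed value for the mass of a proton


def charge_from_isotope_spacing(spacing_mz: float) -> int:
    """Isotope peaks differ by ~1 Da in mass, so they sit ~1/z apart in m/z."""
    return round(1.0 / spacing_mz)


def singly_charged_mass(mz: float, z: int) -> float:
    """Convert an observed m/z at charge z into the [M+H]+ value."""
    neutral_mass = z * (mz - PROTON_MASS)  # remove the z added protons
    return neutral_mass + PROTON_MASS      # add one proton back


def ppm_error(observed: float, theoretical: float) -> float:
    """Relative mass error, expressed in parts per million."""
    return abs(observed - theoretical) / theoretical * 1e6


# Hypothetical usage: 0.5 is a placeholder isotope spacing, not a value read from Figure 5b.
z = charge_from_isotope_spacing(0.5)
print(z, singly_charged_mass(525.76, z))
```

The theoretical mass passed to `ppm_error` would come from the PeptideMass prediction for the matching peptide.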


Bonus Peptide Map Questions

  1. Can you determine the peptide sequence for the peptide fragmentation spectrum shown in Figure 5c? (HINT: Use your results from Question 2 above to match the peptide molecular weight that is closest to that shown in Figure 5b. Copy and paste its sequence into this tool online to predict the fragmentation pattern based on its amino acid sequence: http://db.systemsbiology.net/proteomicsToolkit/FragIonServlet.html. What is the sequence of the eGFP peptide that best matches the fragmentation spectrum in Figure 5c?
  2. Does the peptide map data make sense, i.e. do the results indicate the protein is the eGFP standard? Why or why not? Consult with Figure 6, which depicts the % amino acid coverage of peptides positively identified using their calculated mass and fragmentation pattern.

Homework: Waters Part IV — Oligomers

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

We will determine Keyhole Limpet Hemocyanin (KLH)’s oligomeric states using charge detection mass spectrometry (CDMS). CDMS single-particle measurements of KLH allow us to make direct mass measurements to determine what oligomeric states (that is, how many protein subunits combine) are present in solution. Using the known masses of the polypeptide subunits (Table 1) for KLH, identify where the following oligomeric species are on the spectrum shown below from the CDMS (Figure 7):

  • 7FU Decamer
  • 8FU Didecamer
  • 8FU 3-Decamer
  • 8FU 4-Decamer
Polypeptide Subunit Name | Subunit Mass
7FU | 340 kDa
8FU | 400 kDa
Table 1: KLH Subunit Masses
Figure 7.  Mass spectrum of Keyhole Limpet Hemocyanin (KLH) acquired on the CDMS.
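
The expected position of each species on the mass axis follows from the subunit masses in Table 1. The sketch below assumes that a decamer contains 10 subunits, a didecamer 20, and that "3-decamer" and "4-decamer" mean 30 and 40 subunits; the names are illustrative only.

```python
# Subunit masses from Table 1, in kDa.
SUBUNIT_MASS_KDA = {"7FU": 340, "8FU": 400}

# (label, subunit type, assumed number of subunits)
SPECIES = [
    ("7FU decamer", "7FU", 10),
    ("8FU didecamer", "8FU", 20),
    ("8FU 3-decamer", "8FU", 30),
    ("8FU 4-decamer", "8FU", 40),
]


def expected_masses_mda() -> dict:
    """Expected oligomer masses in MDa (1 MDa = 1000 kDa)."""
    return {
        label: SUBUNIT_MASS_KDA[subunit] * count / 1000
        for label, subunit, count in SPECIES
    }


for label, mass in expected_masses_mda().items():
    print(f"{label}: {mass:.1f} MDa")
```

Peaks in the CDMS spectrum near these values are the candidates for each oligomeric state; measured masses may differ somewhat from these idealised values.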


Homework: Waters Part V — Did I make GFP?

Assignees for the following sections
MIT/Harvard students: Required
Committed Listeners: Required

Please fill out this table with the data you acquired from the lab work done at the Waters Immerse Lab in Cambridge, or else the data screenshots in this document if you were unable to have lab work done at Waters.

 | Theoretical | Observed/measured on the Intact LC-MS | PPM Mass Error
Molecular weight (kDa) | | |

Reading & Resources

Week 11: Building Genomes



Homework — DUE BY START OF APR 21 LECTURE

(TBD)

Week 12: Bioproduction



Homework — DUE BY START OF APR 28 LECTURE

(TBD)

Week 13: Bio Design Living Materials



Homework: Work on your Final Project

Present it May 12 (MIT/Harvard) or May 13 (Committed Listeners)

Week 14: Biofabrication



Homework: Finish your Final Project

Present it May 12 (MIT/Harvard) or May 13 (Committed Listeners)

Week 2: DNA Read, Write, and Edit



Part 1: Benchling & In-silico Gel Art

1.1 Import Lambda DNA

Lambda DNA Import

Simulate Restriction Enzyme Digestion

Restriction Digest

Virtual Gel

Virtual Gel

Part 2: Gel Art

I chose to create gel art of a person doing a jumping jack, using the randomization method.

Gel Art

Part 3: DNA Sequence Design

3.1 Protein Selection

I have chosen IL23 because I am interested in autoimmune diseases such as psoriasis. The IL-23 pathway is involved in inflammation, and I am curious to learn more about biologics in general. The sequence used below is the human IL-23 receptor (IL23R, UniProt Q5VWK5).


3.2 Reverse Translation

Reverse translation of sp|Q5VWK5|IL23R_HUMAN Interleukin-23 receptor OS=Homo sapiens OX=9606 GN=IL23R PE=1 SV=3 to a 1887 base sequence of most likely codons.

atgaaccaggtgaccattcagtgggatgcggtgattgcgctgtatattctgtttagctgg
tgccatggcggcattaccaacattaactgcagcggccatatttgggtggaaccggcgacc
atttttaaaatgggcatgaacattagcatttattgccaggcggcgattaaaaactgccag
ccgcgcaaactgcatttttataaaaacggcattaaagaacgctttcagattacccgcatt
aacaaaaccaccgcgcgcctgtggtataaaaactttctggaaccgcatgcgagcatgtat
tgcaccgcggaatgcccgaaacattttcaggaaaccctgatttgcggcaaagatattagc
agcggctatccgccggatattccggatgaagtgacctgcgtgatttatgaatatagcggc
aacatgacctgcacctggaacgcgggcaaactgacctatattgataccaaatatgtggtg
catgtgaaaagcctggaaaccgaagaagaacagcagtatctgaccagcagctatattaac
attagcaccgatagcctgcagggcggcaaaaaatatctggtgtgggtgcaggcggcgaac
gcgctgggcatggaagaaagcaaacagctgcagattcatctggatgatattgtgattccg
agcgcggcggtgattagccgcgcggaaaccattaacgcgaccgtgccgaaaaccattatt
tattgggatagccagaccaccattgaaaaagtgagctgcgaaatgcgctataaagcgacc
accaaccagacctggaacgtgaaagaatttgataccaactttacctatgtgcagcagagc
gaattttatctggaaccgaacattaaatatgtgtttcaggtgcgctgccaggaaaccggc
aaacgctattggcagccgtggagcagcctgttttttcataaaaccccggaaaccgtgccg
caggtgaccagcaaagcgtttcagcatgatacctggaacagcggcctgaccgtggcgagc
attagcaccggccatctgaccagcgataaccgcggcgatattggcctgctgctgggcatg
attgtgtttgcggtgatgctgagcattctgagcctgattggcatttttaaccgcagcttt
cgcaccggcattaaacgccgcattctgctgctgattccgaaatggctgtatgaagatatt
ccgaacatgaaaaacagcaacgtggtgaaaatgctgcaggaaaacagcgaactgatgaac
aacaacagcagcgaacaggtgctgtatgtggatccgatgattaccgaaattaaagaaatt
tttattccggaacataaaccgaccgattataaaaaagaaaacaccggcccgctggaaacc
cgcgattatccgcagaacagcctgtttgataacaccaccgtggtgtatattccggatctg
aacaccggctataaaccgcagattagcaactttctgccggaaggcagccatctgagcaac
aacaacgaaattaccagcctgaccctgaaaccgccggtggatagcctggatagcggcaac
aacccgcgcctgcagaaacatccgaactttgcgtttagcgtgagcagcgtgaacagcctg
agcaacaccatttttctgggcgaactgagcctgattctgaaccagggcgaatgcagcagc
ccggatattcagaacagcgtggaagaagaaaccaccatgctgctggaaaacgatagcccg
agcgaaaccattccggaacagaccctgctgccggatgaatttgtgagctgcctgggcatt
gtgaacgaagaactgccgagcattaacacctattttccgcagaacattctggaaagccat
tttaaccgcattagcctgctggaaaaa

Reverse translation of sp|Q5VWK5|IL23R_HUMAN Interleukin-23 receptor OS=Homo sapiens OX=9606 GN=IL23R PE=1 SV=3 to a 1887 base sequence of consensus codons.

atgaaycargtnacnathcartgggaygcngtnathgcnytntayathytnttywsntgg
tgycayggnggnathacnaayathaaytgywsnggncayathtgggtngarccngcnacn
athttyaaratgggnatgaayathwsnathtaytgycargcngcnathaaraaytgycar
...

3.3 Codon Optimization

Original Sequence

  • GC Content: 49.34%
  • CAI: 0.83
ATGAACCAGGTGACCATTCAGTGGGATGCGGTGATTGCGCTGTATATTCTGTTTAGCTGGTGCCATGGCGGCATTACCAACATTAACTGCAGCGGCCATATTTGGGTGGAACCGGCGACCATTTTTAAAATGGGCATGAACATTAGCATTTATTGCCAGGCGGCGATTAAAAACTGCCAGCCGCGCAAACTGCATTTTTATAAAAACGGCATTAAAGAACGCTTTCAGATTACCCGCATTAACAAAACCACCGCGCGCCTGTGGTATAAAAACTTTCTGGAACCGCATGCGAGCATGTATTGCACCGCGGAATGCCCGAAACATTTTCAGGAAACCCTGATTTGCGGCAAAGATATTAGCAGCGGCTATCCGCCGGATATTCCGGATGAAGTGACCTGCGTGATTTATGAATATAGCGGCAACATGACCTGCACCTGGAACGCGGGCAAACTGACCTATATTGATACCAAATATGTGGTGCATGTGAAAAGCCTGGAAACCGAAGAAGAACAGCAGTATCTGACCAGCAGCTATATTAACATTAGCACCGATAGCCTGCAGGGCGGCAAAAAATATCTGGTGTGGGTGCAGGCGGCGAACGCGCTGGGCATGGAAGAAAGCAAACAGCTGCAGATTCATCTGGATGATATTGTGATTCCGAGCGCGGCGGTGATTAGCCGCGCGGAAACCATTAACGCGACCGTGCCGAAAACCATTATTTATTGGGATAGCCAGACCACCATTGAAAAAGTGAGCTGCGAAATGCGCTATAAAGCGACCACCAACCAGACCTGGAACGTGAAAGAATTTGATACCAACTTTACCTATGTGCAGCAGAGCGAATTTTATCTGGAACCGAACATTAAATATGTGTTTCAGGTGCGCTGCCAGGAAACCGGCAAACGCTATTGGCAGCCGTGGAGCAGCCTGTTTTTTCATAAAACCCCGGAAACCGTGCCGCAGGTGACCAGCAAAGCGTTTCAGCATGATACCTGGAACAGCGGCCTGACCGTGGCGAGCATTAGCACCGGCCATCTGACCAGCGATAACCGCGGCGATATTGGCCTGCTGCTGGGCATGATTGTGTTTGCGGTGATGCTGAGCATTCTGAGCCTGATTGGCATTTTTAACCGCAGCTTTCGCACCGGCATTAAACGCCGCATTCTGCTGCTGATTCCGAAATGGCTGTATGAAGATATTCCGAACATGAAAAACAGCAACGTGGTGAAAATGCTGCAGGAAAACAGCGAACTGATGAACAACAACAGCAGCGAACAGGTGCTGTATGTGGATCCGATGATTACCGAAATTAAAGAAATTTTTATTCCGGAACATAAACCGACCGATTATAAAAAAGAAAACACCGGCCCGCTGGAAACCCGCGATTATCCGCAGAACAGCCTGTTTGATAACACCACCGTGGTGTATATTCCGGATCTGAACACCGGCTATAAACCGCAGATTAGCAACTTTCTGCCGGAAGGCAGCCATCTGAGCAACAACAACGAAATTACCAGCCTGACCCTGAAACCGCCGGTGGATAGCCTGGATAGCGGCAACAACCCGCGCCTGCAGAAACATCCGAACTTTGCGTTTAGCGTGAGCAGCGTGAACAGCCTGAGCAACACCATTTTTCTGGGCGAACTGAGCCTGATTCTGAACCAGGGCGAATGCAGCAGCCCGGATATTCAGAACAGCGTGGAAGAAGAAACCACCATGCTGCTGGAAAACGATAGCCCGAGCGAAACCATTCCGGAACAGACCCTGCTGCCGGATGAATTTGTGAGCTGCCTGGGCATTGTGAACGAAGAACTGCCGAGCATTAACACCTATTTTCCGCAGAACATTCTGGAAAGCCATTTTAACCGCATTAGCCTGCTGGAAAAA

Improved DNA Sequence

  • GC Content: 51.56%
  • CAI: 0.91
ATGAACCAGGTGACTATCCAGTGGGACGCCGTTATCGCACTGTATATCCTGTTCAGCTGGTGCCACGGGGGCATTACCAACATAAACTGTAGCGGGCACATCTGGGTGGAACCTGCGACCATCTTCAAGATGGGCATGAATATCTCTATCTACTGTCAGGCCGCCATTAAGAACTGCCAGCCCAGGAAGCTGCATTTCTATAAGAATGGGATCAAGGAAAGGTTCCAGATCACCCGGATCAATAAGACCACAGCCCGCCTGTGGTACAAGAATTTTCTCGAGCCTCATGCCTCTATGTACTGTACAGCAGAGTGTCCTAAGCACTTCCAGGAGACTCTGATCTGCGGCAAAGATATTAGCTCCGGGTACCCCCCCGACATCCCCGACGAAGTGACCTGCGTGATCTATGAGTACTCCGGGAATATGACCTGCACCTGGAATGCCGGCAAGCTGACTTACATTGATACAAAGTACGTGGTGCATGTGAAGAGTCTGGAAACTGAGGAGGAACAGCAGTACCTGACAAGCTCCTATATCAATATTTCTACCGACTCTCTGCAGGGCGGCAAGAAGTACCTGGTGTGGGTGCAGGCCGCCAACGCTCTGGGCATGGAAGAGTCTAAGCAGCTGCAGATTCACCTAGATGATATTGTGATCCCATCCGCCGCCGTGATCAGCCGTGCAGAGACAATCAACGCCACCGTGCCTAAAACCATCATCTACTGGGACTCCCAAACCACCATTGAAAAGGTGAGTTGCGAAATGAGGTATAAGGCCACCACCAATCAGACCTGGAACGTGAAGGAATTCGACACAAACTTTACATATGTGCAGCAGAGCGAGTTTTATCTGGAGCCTAATATCAAGTACGTGTTCCAGGTCAGGTGTCAGGAGACAGGGAAGCGCTACTGGCAGCCCTGGAGTTCCCTGTTCTTTCACAAAACCCCAGAAACCGTGCCTCAGGTGACCTCCAAGGCCTTTCAGCATGACACCTGGAATTCCGGCCTGACTGTGGCCTCAATCTCAACTGGACATCTGACCAGCGATAATAGAGGAGACATAGGCCTGCTGCTGGGCATGATCGTGTTCGCAGTGATGCTGAGCATCCTGTCCCTGATCGGGATCTTCAATAGGTCTTTCCGCACCGGCATCAAGAGGAGGATCCTGCTGCTGATCCCCAAGTGGCTGTATGAGGATATCCCCAACATGAAGAACTCAAATGTGGTGAAGATGCTGCAGGAGAATTCCGAACTGATGAACAACAACAGCTCTGAGCAGGTGCTGTATGTGGACCCCATGATTACCGAGATCAAGGAAATCTTCATACCTGAGCACAAGCCCACAGACTACAAAAAAGAGAACACCGGACCACTGGAGACAAGGGATTATCCACAGAATAGCCTTTTCGATAATACAACCGTGGTGTACATCCCCGATCTGAACACCGGCTACAAACCCCAGATCTCTAACTTCCTGCCTGAGGGCTCCCACCTGTCCAATAACAACGAGATCACCAGCCTGACCCTGAAGCCCCCAGTGGACTCCCTGGACTCCGGCAATAATCCCAGACTGCAAAAACACCCTAACTTCGCGTTTTCCGTGTCAAGCGTGAATTCCCTGAGCAACACCATTTTCCTGGGCGAGCTGTCACTGATCCTGAACCAGGGCGAGTGCTCAAGCCCAGACATCCAGAACTCTGTCGAGGAGGAGACTACGATGCTGCTGGAGAATGATAGTCCCTCCGAAACAATCCCAGAGCAGACCCTGCTGCCTGATGAGTTTGTCAGCTGCCTGGGCATCGTGAACGAGGAGCTGCCCTCCATAAATACCTATTTCCCCCAGAATATCCTGGAATCCCACTTCAACAGAATTAGCCTGCTGGAGAAG

Avoid Cleavage Sites

  • BbsI
  • BsaI

Why Codon Optimization is Important

Codon optimization matters because of codon usage bias: humans and other organisms such as E. coli prefer different codons for the same amino acid. Expressing a human gene like IL23 in E. coli can be difficult because codons that are common in human cells may be rare in E. coli. If the bacterium has low levels of the corresponding tRNAs, translation slows down at those codons and protein yield drops.

The codon optimization here raised the GC content from 49.34% to 51.56%, which may modestly improve mRNA stability. The codon adaptation index (CAI) also went up, from 0.83 to 0.91.
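
The GC content figures can be reproduced with a few lines of Python. This is a minimal sketch with an illustrative function name; CAI is not computed here because it needs a codon usage table for the chosen host.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Toy sequence; paste the original or improved sequence to compare them.
print(round(gc_content("ATGAACCAGGTGACC"), 2))
```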


3.4 Protein Expression

Now we will use this optimized DNA sequence to produce the IL23 protein. First we clone the codon-optimized sequence into an expression vector and transform the resulting plasmid into E. coli cells; a brief heat shock helps the cells take up the plasmid. Once expression from the vector's promoter is switched on, RNA polymerase reads the DNA and makes an mRNA copy (transcription). Ribosomes then read the mRNA and build the protein using tRNAs (translation).

Once this is done, a chromatography step separates the protein from everything else in the cell, for example affinity chromatography against the His-tag provided by the vector.


Part 4: IL23 Sequence Analysis

Summary

Property | Value
Gene | IL23
Benchling Link | https://benchling.com/s/seq-009SW3mnB5zCD8Vhh1Tp?m=slm-3ISQ8GXHvPtygDx4UjjQ
Start Codon (ATG) | Positions 1–3
Coding Sequence | Positions 1 through the end
Stop Codon | Missing (needs to be added)
Promoter, RBS, His-tag, Terminator | All missing (provided by the vector)
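
The checks in this summary (start codon present, stop codon missing) can be automated. The following is a minimal sketch; the function names and the choice of TAA as the appended stop codon are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq: str) -> dict:
    """Basic sanity checks on a coding sequence read in frame from position 1."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return {
        "length_multiple_of_3": len(seq) % 3 == 0,
        "starts_with_atg": seq.startswith("ATG"),
        "ends_with_stop": bool(codons) and codons[-1] in STOP_CODONS,
        # Stops before the final codon would truncate the protein.
        "internal_stops": [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS],
    }


def add_stop(seq: str, stop: str = "TAA") -> str:
    """Append a stop codon if the sequence does not already end with one."""
    return seq if check_cds(seq)["ends_with_stop"] else seq.upper() + stop


# Toy example built from the first and last codons of the improved sequence.
print(check_cds("ATGAACCTGGAGAAG"))
```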

Download IL23 Plasmid Map (PDF)


Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank).

I would like to sequence genes that could help brains-on-chips research. While human DNA is interesting, I am more curious about biocompatible materials or bio-glues that can help join living neuronal tissue to physical hardware such as microelectrode arrays. These would usually come from microbial or environmental DNA, where we can look for genetic sequences that could be engineered into biocompatible hydrogels.

DNA-based digital data storage technology. Source: Archives in DNA: Workshop Exploring Implications of an Emerging Bio-Digital Technology through Design Fiction - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/DNA-based-digital-data-storage-technology_fig1_353128454 [accessed 11 Feb 2025]

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?

If I use Sanger sequencing, I will need ddNTPs, which terminate the growing chains and produce fragments of different lengths.

Also answer the following questions:

Is your method first-, second- or third-generation or other? How so?

Sanger sequencing is first-generation. First-generation sequencing uses chain termination: a polymerase copies the DNA while ddNTPs tagged with fluorescent dyes are added, so that synthesis stops at different positions and the resulting fragments are separated by electrophoresis. Second-generation sequencing reads only short fragments, but many of them simultaneously. Third-generation sequencing reads single molecules, for example by pulling a single strand through a nanopore in a membrane and reading the bases from changes in current.

What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)? What is the output of your chosen sequencing technology?

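To prepare the input, we grow the bacteriophage and use plasmid purification to extract the DNA, then use Benchling to design a primer. We then set up chain-termination PCR, which mixes the DNA template with a polymerase enzyme, a primer that binds the target DNA, dNTPs, and fluorescently labelled ddNTPs that terminate the chains.

Electrophoresis then separates the terminated fragments as they migrate in an electric field, mainly by size. Reading the fluorescent label at the end of each fragment, from shortest to longest, gives the base sequence (base calling), and the output is a chromatogram with the called sequence.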

5.2 DNA Write

(i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :)

Although it is not relevant to my final project, I have always been fascinated by biologics such as adalimumab, a therapeutic antibody made with recombinant DNA that instructs living cells to synthesize the protein. For the final project, I would probably synthesize something that makes biological tissue adhere better to microelectrodes, to facilitate electrical communication. I am also interested in bioprinting microfluidics.

See some famous examples of DNA design

DNA origami by Paul W. K. Rothemund, California Institute of Technology, 2004. 100 nanometers in diameter.

(ii) What technology or technologies would you use to perform this DNA synthesis and why?

Benchling, a platform for copying and pasting DNA sequences, importing DNA and protein sequences, performing in silico restriction digests, and designing gel layouts. We will cut with restriction enzymes, copy with the polymerase chain reaction, and perform DNA cloning, all in silico.

Also answer the following questions:

What are the essential steps of your chosen sequencing methods? What are the limitations of your sequencing method (if any) in terms of speed, accuracy, scalability?

5.3 DNA Edit

(i) What DNA would you want to edit and why? In class, George shared a variety of ways to edit the genes and genomes of humans and other organisms. Such DNA editing technologies have profound implications for human health, development, and even human longevity and human augmentation. DNA editing is also already commonly leveraged for flora and fauna, for example in nature conservation efforts, (animal/plant restoration, de-extinction), or in agriculture (e.g. plant breeding, nitrogen fixation). What kinds of edits might you want to make to DNA (e.g., human genomes and beyond) and why?

If working on neural tissues, I am curious to edit neuroplasticity-related genes so that I can consider how plasticity can be modified or reinforced. I would like to facilitate electrical and chemical stimulation to make it easier for reinforcement learning experiments.

Colossal Biosciences Inc., a biotechnology company using genetic engineering to de-extinct various historic animals such as the woolly mammoth, dodo, and dire wolf.

(ii) What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions:

How does your technology of choice edit DNA? What are the essential steps? What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing? What are the limitations of your editing methods (if any) in terms of efficiency or precision?

Electrophoresis will help us separate DNA, RNA, or proteins as they migrate in an electric field, mainly by size, so that we can check the products of each step.

I would like to first perform PCR and a restriction digest, and then carry out an assembly that converts GFP into RFP.

Week 5: Protein Design Part II



Homework — DUE BY START OF MAR 10 LECTURE

Part A: SOD1 Binder Peptide Design

Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.

Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.

Your challenge:

  1. Design short peptides that bind mutant SOD1.
  2. Then decide which ones are worth advancing toward therapy.

You will use three models developed in our lab:

  • PepMLM: target sequence-conditioned peptide generation via masked language modeling
  • PeptiVerse: therapeutic property prediction
  • moPPIt: motif-specific multi-objective peptide design using Multi-Objective Guided Discrete Flow Matching (MOG-DFM)

Part 1: Generate Binders with PepMLM

  1. Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.
>sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS
AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV
HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Modified A4V:

>sp|P00441|SODC_HUMAN_A4V
MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS
AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV
HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
  2. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:
  3. Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.

Binder | Pseudo Perplexity
WRSGATVARHAX | 6.030266
WRYGAAAVELKE | 11.785982
WHSGVVGLARGX | 6.638643
WSYPWVALELGK | 16.418794

  4. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.

Pseudo Perplexity for binder ‘FLYRWLPSRRGG’ with protein sequence: MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ is: 20.63523127283615

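  5. Record the perplexity scores that indicate PepMLM’s confidence in the binders.

PepMLM is most confident in WRSGATVARHAX, which has the lowest pseudo-perplexity (6.03); lower perplexity means higher confidence. The known binder FLYRWLPSRRGG has the highest pseudo-perplexity of the five (20.64).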
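
The ranking can be made explicit by sorting the recorded scores. A minimal sketch, using only the pseudo-perplexity values reported above; the variable names are illustrative.

```python
# Pseudo-perplexity scores recorded above (lower = more confident).
scores = {
    "WRSGATVARHAX": 6.030266,
    "WRYGAAAVELKE": 11.785982,
    "WHSGVVGLARGX": 6.638643,
    "WSYPWVALELGK": 16.418794,
    "FLYRWLPSRRGG": 20.63523127283615,  # known SOD1 binder, added for comparison
}

# Sort ascending so the most confident binder comes first.
ranked = sorted(scores.items(), key=lambda item: item[1])

for peptide, ppl in ranked:
    print(f"{peptide}  {ppl:.2f}")
```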

Part 2: Evaluate Binders with AlphaFold3

  1. Navigate to the AlphaFold Server: alphafoldserver.com

  2. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.

  3. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?

WRYGAAAVELKE: ipTM = 0.28, pTM = 0.78

WSYPWVALELGK: ipTM = 0.67, pTM = 0.88

WRSGATVARHAX: ipTM = 0.42, pTM = 0.86

WHSGVVGLARGX: ipTM = 0.32, pTM = 0.85

In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

One of the PepMLM-generated peptides, WSYPWVALELGK (ipTM = 0.67), clearly outperforms the other three (ipTM 0.28 to 0.42). No ipTM for the known binder FLYRWLPSRRGG is recorded above; taking the 0.50–0.60 range it often scores in similar AlphaFold benchmarks as a reference, WSYPWVALELGK appears to exceed it. WRYGAAAVELKE (ipTM = 0.28) failed to find a stable “home” on the SOD1 surface, while the high-scoring candidate suggests that PepMLM identified a sequence that “staples” the protein’s interface. Because ipTM is a model confidence score rather than a measured affinity, this indicates, but does not prove, that the language model can generate de novo sequences that are structurally compatible with the A4V mutant surface.

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:

  1. Paste the peptide sequence.
  2. Paste the A4V mutant SOD1 sequence in the target field.
  3. Check the boxes:
    • Predicted binding affinity
    • Solubility
    • Hemolysis probability
    • Net charge (pH 7)
    • Molecular weight

Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties?

Choose one peptide you would advance and justify your decision briefly.

WSYPWVALELGK,💧 Solubility,Soluble,1.000,Probability
WSYPWVALELGK,🩸 Hemolysis,Non-hemolytic,0.064,Probability
WSYPWVALELGK,🔗 Binding Affinity,Weak binding,6.048,pKd/pKi
WSYPWVALELGK,📏 Length,,12,aa
WSYPWVALELGK,⚖️ Molecular Weight,,1448.7,Da
WSYPWVALELGK,⚡ Net Charge (pH 7),,-0.24,
WSYPWVALELGK,🎯 Isoelectric Point,,6.00,pH
WSYPWVALELGK,💦 Hydrophobicity (GRAVY),,0.02,GRAVY
WHSGVVGLARGX,💧 Solubility,Soluble,1.000,Probability
WHSGVVGLARGX,🩸 Hemolysis,Non-hemolytic,0.039,Probability
WHSGVVGLARGX,🔗 Binding Affinity,Weak binding,5.754,pKd/pKi
WHSGVVGLARGX,📏 Length,,12,aa
WHSGVVGLARGX,⚖️ Molecular Weight,,1120.5,Da
WHSGVVGLARGX,⚡ Net Charge (pH 7),,0.85,
WHSGVVGLARGX,🎯 Isoelectric Point,,9.76,pH
WHSGVVGLARGX,💦 Hydrophobicity (GRAVY),,0.28,GRAVY
WRYGAAAVELKE,💧 Solubility,Soluble,1.000,Probability
WRYGAAAVELKE,🩸 Hemolysis,Non-hemolytic,0.049,Probability
WRYGAAAVELKE,🔗 Binding Affinity,Weak binding,6.266,pKd/pKi
WRYGAAAVELKE,📏 Length,,12,aa
WRYGAAAVELKE,⚖️ Molecular Weight,,1392.6,Da
WRYGAAAVELKE,⚡ Net Charge (pH 7),,-0.23,
WRYGAAAVELKE,🎯 Isoelectric Point,,6.28,pH
WRYGAAAVELKE,💦 Hydrophobicity (GRAVY),,-0.38,GRAVY
WRSGATVARHAX,💧 Solubility,Soluble,1.000,Probability
WRSGATVARHAX,🩸 Hemolysis,Non-hemolytic,0.013,Probability
WRSGATVARHAX,🔗 Binding Affinity,Weak binding,5.451,pKd/pKi
WRSGATVARHAX,📏 Length,,12,aa
WRSGATVARHAX,⚖️ Molecular Weight,,1193.5,Da
WRSGATVARHAX,⚡ Net Charge (pH 7),,1.85,
WRSGATVARHAX,🎯 Isoelectric Point,,12.00,pH
WRSGATVARHAX,💦 Hydrophobicity (GRAVY),,-0.45,GRAVY

I would choose to advance WRYGAAAVELKE as the best option. First, it has the highest predicted binding affinity of the four ($pK_d \approx 6.27$, i.e. a $K_d$ of roughly 0.5 µM), although PeptiVerse still labels it as weak binding; that is a reasonable starting point for a de novo peptide.

Its negative GRAVY score (-0.38) shows that it is more hydrophilic than WSYPWVALELGK (0.02) and WHSGVVGLARGX (0.28); only WRSGATVARHAX (-0.45) is more hydrophilic. This is consistent with its predicted solubility and its low hemolysis probability (0.049).

Structurally, although its ipTM was the lowest of the four (0.28), its chemical makeup makes it a better scaffold than a peptide that might bind tightly but aggregate in the blood.
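
To compare the structural and property predictions side by side, the pKd values can be converted to dissociation constants and set against the ipTM scores. This is a minimal sketch using the values recorded above; it assumes the ipTM of WHSGVVGLARGX is 0.32 (the label on that result is garbled in the record) and the function name is illustrative.

```python
def kd_micromolar(pkd: float) -> float:
    """Convert pKd (= -log10 of Kd in mol/L) to Kd in micromolar."""
    return 10 ** (-pkd) * 1e6


# peptide: (ipTM from AlphaFold Server, pKd/pKi from PeptiVerse)
peptides = {
    "WSYPWVALELGK": (0.67, 6.048),
    "WHSGVVGLARGX": (0.32, 5.754),  # ipTM assumed, see note above
    "WRYGAAAVELKE": (0.28, 6.266),
    "WRSGATVARHAX": (0.42, 5.451),
}

best_iptm = max(peptides, key=lambda p: peptides[p][0])
best_pkd = max(peptides, key=lambda p: peptides[p][1])

for name, (iptm, pkd) in peptides.items():
    print(f"{name}  ipTM={iptm:.2f}  Kd~{kd_micromolar(pkd):.2f} uM")
print("highest ipTM:", best_iptm, "| highest pKd:", best_pkd)
```

In this set the peptide with the highest ipTM is not the one with the highest predicted affinity, which is why the two predictions are weighed together rather than ranked on either alone.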

Part 4: Generate Optimized Peptides with moPPIt

Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

  1. Open the moPPIt Colab linked from the HuggingFace moPPIt model card.
  2. Make a copy and switch to a GPU runtime.
  3. In the notebook:
    • Paste your A4V mutant SOD1 sequence.
    • Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
    • Set peptide length to 12 amino acids.
    • Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.
  4. After generation, briefly describe how these moPPIt peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

Running locally on my machine because of GPU allocation. Results:

Peptide 1: EPTEEEQRTCGT, affinity score 9.17, solubility score 0.70

Peptide 2: YYLRRCGYYQRV, affinity score 8.33, solubility score 0.79

moPPIt optimizes directly for binding at the chosen residues, so its sequences often show higher predicted affinity scores than the PepMLM peptides scored in PeptiVerse (9.17 and 8.33 here versus 5.45 to 6.27), although the two scores may not be on the same scale. Peptide 1, EPTEEEQRTCGT, has the higher affinity score of the two and is predicted to be the better match for the A4V target, while Peptide 2 has the higher solubility score.

Part B: BRD4 Drug Discovery Platform Tutorial (Optional)


Part C: Final Project: L-Protein Mutants

High-level summary: The objective of this assignment is to improve the stability and auto-folding of the lysis protein of the MS2 phage. This mechanism is key to understanding how phages could help address antibiotic resistance.

This homework requires computation that might take you a while to run, so please get started early.



Reading & Resources

Tools

Week 6: Genetic Circuits Part I



Homework — DUE BY START OF MAR 17 LECTURE

Assignment: DNA Assembly

Answer these questions about the protocol in this week’s lab:

  1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?

Phusion High-Fidelity PCR Master Mix is a chemical environment designed to ensure the most accurate duplication of DNA possible. At its core is the Phusion DNA Polymerase, an enzyme created by fusing a Pyrococcus-like proofreading polymerase to a double-stranded DNA-binding domain. This structural modification allows the enzyme to remain attached to the DNA template for much longer than standard enzymes, resulting in high processivity, while the proofreading activity gives a significantly lower error rate. To support this activity, the mix contains a balanced concentration of dNTPs, which serve as the raw material for the new DNA strands, and a specialized reaction buffer. This buffer includes magnesium chloride, a vital cofactor that helps the polymerase coordinate with the phosphate groups of the incoming nucleotides, as well as various stabilizers that prevent the enzyme from denaturing during the intense heat of the thermal cycling process.

  2. What are some factors that determine primer annealing temperature during PCR?

Determining the correct annealing temperature for a PCR reaction requires balancing several molecular factors. The primary driver is the melting temperature of the primers, which is largely dictated by their length and their GC content. Because guanine-cytosine pairs are held together by three hydrogen bonds compared to the two bonds in adenine-thymine pairs, primers with more G and C bases need more energy to separate, and thus have a higher melting temperature. Furthermore, the concentration of salts and ions in the master mix, such as potassium and magnesium, can stabilize the negative charges on the DNA backbone, raising the melting temperature and with it the appropriate annealing temperature. If the temperature is set too low, the primers may bind non-specifically to the wrong parts of the template, while setting it too high may prevent the primers from binding at all, resulting in no amplification.
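
The effect of length and GC content on melting temperature can be illustrated with the Wallace rule, Tm ≈ 2 °C per A/T plus 4 °C per G/C. This is only a rough estimate intended for short oligos and it ignores salt concentration; vendor calculators use more detailed thermodynamic models. The function name is illustrative.

```python
def wallace_tm(primer: str) -> int:
    """Rough melting temperature (deg C) by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


# Two toy primers of equal length: the GC-rich one melts at a higher temperature.
print(wallace_tm("AAAATTTT"), wallace_tm("GGGGCCCC"))
```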

  3. There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.

PCR: PCR is a synthetic process that builds billions of new copies of a specific DNA segment using thermal cycling and a polymerase enzyme, making it the preferred choice when starting with a tiny amount of DNA or when you need to add custom sequences, like Gibson tails, to the ends of a fragment.

Restriction enzyme digests: A restriction digest is an analytical or preparatory process that uses specialized proteins to cut an existing, purified piece of DNA at specific recognition sequences. While PCR is ideal for generating large quantities of modified DNA, restriction digests are often preferred for simpler tasks like subcloning between classic plasmids or performing a diagnostic check to see if a plasmid contains the correct insert.

  4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?

Careful attention must be paid to the design of the fragment ends. Unlike traditional cloning, Gibson Assembly relies on an exonuclease that chews back the 5′ ends of the DNA to reveal single-stranded overlaps. Therefore, you must ensure that each PCR-generated fragment has a “tail” that is identical to the end of the adjacent fragment, typically between 20 and 40 base pairs in length. It is also important to treat your PCR products with an enzyme like DpnI to destroy any original (methylated) template plasmid and to purify the final product through a column or a gel. This prevents leftover primers or incorrect DNA templates from interfering with the assembly process, ensuring that only the intended overlapping fragments are available for the final reaction.
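
The overlap requirement can be checked in silico before ordering primers. A minimal sketch, assuming both fragments are written 5′ to 3′ on the same strand and using the 20 to 40 bp window mentioned above; the function name and toy sequences are illustrative.

```python
def shared_overlap(left: str, right: str, min_len: int = 20, max_len: int = 40) -> str:
    """Longest suffix of `left` that is also a prefix of `right`, within the window.

    Returns an empty string if no overlap of at least `min_len` bases exists.
    """
    left, right = left.upper(), right.upper()
    longest = min(max_len, len(left), len(right))
    for n in range(longest, min_len - 1, -1):
        if left[-n:] == right[:n]:
            return left[-n:]
    return ""


# Toy fragments sharing a 25-base homology region.
homology = "ACGTTGCAAGGCTTACCGATAGGCT"
fragment_a = "TTTTTTTTTT" + homology
fragment_b = homology + "GGGGGGGGGG"
print(len(shared_overlap(fragment_a, fragment_b)))
```

For a circular assembly the same check would also be run between the last fragment and the first.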

  5. How does the plasmid DNA enter the E. coli cells during transformation?

During transformation, a circular plasmid is introduced into a bacterial cell, which relies on first making the E. coli “competent” to receive foreign DNA. In a typical chemical transformation, the cells are treated with a calcium chloride solution that helps neutralize the repulsive negative charges between the DNA and the cell membrane. A sudden heat shock at 42°C is thought to create a pressure difference between the inside and outside of the cell, which momentarily opens up small pores or “adhesion zones” in the lipid bilayer. This allows the plasmid to be pulled into the cytoplasm, where the cell can then begin to express the genes carried on the plasmid, such as antibiotic resistance.

  1. Describe another assembly method in detail (such as Golden Gate Assembly):

    • Explain the other method in 5 - 7 sentences plus diagrams (either handmade or online).

    Golden Gate Assembly is a method that uses Type IIS restriction enzymes, such as BsaI, to assemble multiple parts simultaneously. These enzymes are unusual because they recognize a DNA sequence but cut the DNA several bases away from that site, allowing for the creation of custom four-base overhangs. Because the recognition site is removed during the cutting process, the final assembled product no longer contains the enzyme’s “handle” and cannot be cut again. This creates a “one-pot” reaction where digestion and ligation happen in a continuous cycle, eventually driving the reaction toward the fully assembled, stable circular plasmid. This method is exceptionally modular and is frequently used in synthetic biology toolkits to mix and match different promoters, genes, and terminators with very high efficiency.

    • Model this assembly method with Benchling or Asimov Kernel!
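While the Benchling/Kernel model is on pause, the Type IIS cutting described above can be sketched in a few lines. This assumes the standard BsaI geometry (recognition site GGTCTC, top-strand cut one base after the site, leaving a 4-nt 5' overhang) and only scans the top strand for forward-oriented sites; the function name and the example part are hypothetical.

```python
BSAI_SITE = "GGTCTC"  # BsaI recognition sequence


def bsai_overhangs(seq: str) -> list[str]:
    """Return the 4-nt overhangs produced by forward-oriented BsaI sites.

    Simplification: reverse-oriented sites (GAGACC) are ignored.
    """
    seq = seq.upper()
    overhangs = []
    start = seq.find(BSAI_SITE)
    while start != -1:
        cut = start + len(BSAI_SITE) + 1  # top strand is cut 1 nt after the site
        if cut + 4 <= len(seq):
            overhangs.append(seq[cut:cut + 4])
        start = seq.find(BSAI_SITE, start + 1)
    return overhangs


# Hypothetical part: spacer, BsaI site, one filler base, then the AATG overhang.
part = "TTGGTCTCAAATGCCCGGGTTT"
print(bsai_overhangs(part))
```

Two parts ligate in the one-pot reaction only when the overhang at the end of one matches the overhang at the start of the next, which is what makes the parts modular.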

Assignment: Asimov Kernel

no asimov account - on pause

  1. Create a Repository for your work
  2. Create a blank Notebook entry to document the homework and save it to that Repository
  3. Explore the devices in the Bacterial Demos Repo to understand how the parts work together by running the Simulator on various examples, following the instructions for the simulator found in the “Info” panel (click the “i” icon on the right to open the Info panel)
  4. Create a blank Construct and save it to your Repository:
    • Recreate the Repressilator in that empty Construct by using parts from the Characterized Bacterial Parts repository
    • Search the parts using the Search function in the right menu
    • Drag and drop the parts into the Construct
    • Confirm it works as expected by running the Simulator (“play” button) and compare your results with the Repressilator Construct found in the Bacterial Demos repository
    • Document all of this work in your Notebook entry - you can copy the glyph image and the simulator graphs, and paste them into your Notebook
  5. Build three of your own Constructs using the parts in the Characterized Bacterial Parts Repo:
    • Explain in the Notebook Entry how you think each of the Constructs should function
    • Run the simulator and share your results in the Notebook Entry
    • If the results don’t match your expectations, speculate on why and see if you can adjust the simulator settings to get the expected outcome

Reading & Resources

Week 7: Genetic Circuits Part II



Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

  1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?

An artificial neuron computes a weighted sum of its inputs and passes it through an activation function to produce an output; connecting many of them forms an artificial neural network (ANN). Intracellular artificial neural networks (IANNs) keep the weighted summation and the non-linear activation function, but implement them with gene circuits. The main difference from Boolean circuits is that an IANN has inputs that can be added and subtracted. Addition comes from expression itself: a promoter drives transcription of a gene, and translation then produces protein. To subtract, input x1 can encode an endoribonuclease such as CasE that binds and cleaves its recognition sequence on the output RNA. x1 therefore carries a negative weight and x2 a positive weight, and the function is max(x2 - x1, 0). This is also referred to as sequestration: the endoribonuclease acts like a single-turnover enzyme that takes its target mRNA out of circulation, which produces the non-linearity.
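The max(x2 - x1, 0) behaviour above is the same shape as a ReLU activation. A minimal numerical sketch, assuming unit weights and arbitrary expression units (the function name is made up for illustration):

```python
def iann_neuron(x1: float, x2: float) -> float:
    """Single intracellular 'neuron' with unit weights.

    x1: endoribonuclease input (negative weight, cleaves the output mRNA)
    x2: output-mRNA input (positive weight)
    Output cannot drop below zero: there is no negative expression.
    """
    return max(x2 - x1, 0.0)


print(iann_neuron(1.0, 3.0))  # more output mRNA than nuclease
print(iann_neuron(3.0, 1.0))  # nuclease in excess, output is shut off
```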

  1. Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.

I think an interesting use of IANNs would be in culturing and programming organoids. The Weiss Lab also focuses a lot on programmable patterning to trigger cell changes, and I can see this leading to very useful applications in supporting organoid intelligence. I think using endoribonucleases for waste management in microfluidics might be useful.

Microfluidics supporting organoid growth may end up accumulating extracellular vesicles, shed RNA and dead cells, and these RNA-protein aggregates can clog channels. We could use RNase H to clear RNA, a Cas13 endoribonuclease to cleave transcripts, or CasE to degrade fragments so they can pass through filters, then maybe add more layers of flushing and binding for removal. Limitations: I am not sure how the degraded fragments will exit the microfluidics; more research is needed.

  1. Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation.

Draw a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.

Assignment Part 2: Fungal Materials

  1. What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?

Mycelium leather is widely used and is currently being manufactured into fashion products. It can use agricultural waste as feedstock and, with enough coating, can develop great strength and malleability. Compared with animal leather, making it needs fewer resources and produces much less waste, and it is better for animal welfare.

  1. What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?

It might be possible to create mycelium electronics, which I’m super interested in. I would like to genetically engineer mycelium to have better conductivity or bioelectric activity, or to use a bioglue so it sticks to sensors for activity readout. Fungi are adaptable, respond to their environment quickly, and can act as good living sensors. They ‘learn’ quickly and can be interfaced with EMG sensors without needing to be submerged in culture media or treated with antibiotics.

Assignment Part 3: First DNA Twist Order

  1. Review the Individual Final Project documentation guidelines.
  2. Submit this Google Form with your draft Aim 1, final project summary, HTGAA industry council selections, and shared folder for DNA designs. DUE MARCH 20 FOR MIT/HARVARD/WELLESLEY STUDENTS

Done this!

  1. Review Part 3: DNA Design Challenge of the week 2 homework. Design at least 1 insert sequence and place it into the Benchling/Kernel/Other folder you shared in the Google Form above. Document the backbone vector it will be synthesized in on your website.

Reading & Resources

Week 9: Cell-Free Systems



Homework — DUE BY START OF Apr 7 LECTURE

Homework Part A: General and Lecturer-Specific Questions

General homework questions

  1. Explain the main advantages of cell-free protein synthesis over traditional in vivo methods, specifically in terms of flexibility and control over experimental variables. Name at least two cases where cell-free expression is more beneficial than cell production.

Cell-free systems help us understand biology ‘from scratch’ and bioengineer from smaller units. There is more flexibility for scaffolding biology from the ground up and for controlling the environment in a complete model. Living cells as we know them are already incredibly complex and hence harder to control in experimental settings. Synthetic cell engineering allows flexibility in the size of the cell, in its proteins, and even in expanding the chemistry of the cell. So one scenario is when you want uniform control over cell size, where a cell-free system might be ideal. The other is when you want to engineer a specific chemical environment, or want chemical diversity in the experiment that is not naturally common or compatible with cells. Compared to in vivo expression, where you have to build plasmids, cell-free protein expression is faster and cheaper to set up and supports quick iterations with linear fragments instead of plasmids.

  1. Describe the main components of a cell-free expression system and explain the role of each component.

The anatomy of the synthetic cell has multiple parts:

  1. phospholipids and cholesterol to create strong lipid membranes
  2. cytoplasm contains small molecules
  3. cell extract such as ribosomes and enzymes
  4. tRNAs
  5. plasmids and membrane channels for communication

  1. Why is energy provision regeneration critical in cell-free systems? Describe a method you could use to ensure continuous ATP supply in your cell-free experiment.

Within normal cells, energy is continuously regenerated through metabolism, but cell-free systems are typically run in microfluidics or vesicles and cannot rely on the same glucose-to-ATP metabolism that a living cell does. To achieve continuous protein synthesis we must also introduce additional energy substrates and enzymatic regeneration systems. Common practices include adding phosphoenolpyruvate (PEP), creatine phosphate (CP), or acetyl phosphate (AcP) for rapid ATP regeneration via kinases present in the extract.

  1. Compare prokaryotic versus eukaryotic cell-free expression systems. Choose a protein to produce in each system and explain why.

Transcription happens in nucleus for eukaryotes but in cytoplasm for prokaryotes.

Within prokaryotic cell-free systems, transcription and translation happen at the same time, so they are much faster and more productive. In contrast, eukaryotic genes are split into exons and introns and require specific enzyme machinery to take out the introns. Eukaryotic cell-free systems retain ribosomes, chaperones, and modification enzymes, so complex proteins can be folded and processed correctly.

GFP can be produced in a prokaryotic system and is commonly made in E. coli lysate; it is small and does not require glycosylation.

Complex human proteins are more appropriately made in eukaryotic cell-free systems. Membrane proteins such as GPCRs are often expressed in wheat germ extract. Because they are hydrophobic and have seven transmembrane domains, they are difficult to fold while inserting into a membrane and require a lipid bilayer.

  1. How would you design a cell-free experiment to optimize the expression of a membrane protein? Discuss the challenges and how you would address them in your setup.

Membrane proteins like GPCRs are difficult because they require eukaryotic chaperones to fold correctly. Bacterial systems like E. coli lysate do not have the machinery to handle hydrophobic transmembrane domains without aggregation.

To avoid challenges like aggregation, liposomes or nanodiscs can be added to the reaction mixture to help with co-translational membrane insertion.

Another problem with cell-free systems is that it is hard to separate the protein from everything else in the mixture, so His-tags are a very useful way to pull out a specific protein: the histidine tag binds to the bead while the other components are washed away.

Fusion protein
  |
His-tag (histidine)
  |
Ligand
  |
Bead 
------
Magnetic block 

  1. Imagine you observe a low yield of your target protein in a cell-free system. Describe three possible reasons for this and suggest a troubleshooting strategy for each.

First, ATP may be depleted faster than it is regenerated. In this case we could switch from PEP to another regeneration system, such as creatine phosphate with creatine kinase, for slower reactions.

Secondly, the protein may aggregate because of its hydrophobicity: hydrophobic stretches that would normally embed into a nearby membrane end up clumping together with other hydrophobic sections, leading to misfolding. Again, nanodiscs might help by providing a hydrophobic environment, so that proteins insert into the discs instead of binding to each other.

Thirdly, mRNA is unstable. Cell-free lysates degrade the mRNA template quickly, so there may not be enough translation occurring. RNase inhibitors are typically added to stabilize it.

Homework question from Kate Adamala

Design an example of a useful synthetic minimal cell as follows:

  1. Pick a function and describe it.
    1. What would your synthetic cell do? What is the input and what is the output?
    2. Could this function be realized by cell-free Tx/Tl alone, without encapsulation?
    3. Could this function be realized by genetically modified natural cell?
    4. Describe the desired outcome of your synthetic cell operation.

    I am interested in a synthetic minimal cell that acts as an artificial dopaminergic synapse sensor. The input would be extracellular dopamine released by differentiated PC12 cells. To make this signal visible, the output would be a GFP or RFP fluorescent signal proportional to dopamine concentration, reporting a reward signal. This most likely needs encapsulation: dopamine is sensed through a GPCR, and DRD1 must be expressed in the presence of a lipid environment, so cell-free Tx/Tl alone would not be enough. One way of working with this is a eukaryotic cell-free system with liposomes or nanodiscs introduced directly into the cell-free reaction. The function could also be realized in a genetically modified natural cell: for part of the final project I am working on this exact function by overexpressing the DRD1 gene with a GFP construct in PC12 cells. But natural cells have hundreds of competing pathways activated by dopamine, and it is easier to control density and concentrations in a synthetic cell. The desired outcome is a synthetic minimal cell that operates as a dopamine-responsive optical reporter, with fluorescence that maps dopamine release events.

2. Design all components that would need to be part of your synthetic cell.
  1. What would be the membrane made of?
  2. What would you encapsulate inside? Enzymes, small molecules.
  3. Which organism your Tx/Tl system will come from? Is bacterial OK, or do you need a mammalian system for some reason? (hint: for example, if you want to use small molecule modulated promotors, like Tet-ON, you need mammalian)
  4. How will your synthetic cell communicate with the environment? (hint: are substrates permeable? or do you need to express the membrane channel?)

    The membrane will likely be a cholesterol-containing liposome bilayer. GPCR insertion requires a cholesterol-rich membrane for DRD1 to adopt its correct conformation, and cholesterol also increases membrane rigidity and supports that insertion. For the transcription and translation machinery, we might want to use a HEK293 cell-free lysate system with T7 RNA polymerase for in vitro transcription. The DNA constructs will include a DRD1 insertion plasmid and a GFP construct, either on the same plasmid or separate, driven by a cAMP response signalling sequence. We need a mammalian set-up because the GPCR is a membrane protein, and a HEK293 lysate retains the glycosylation that DRD1 needs to be expressed. The cell will communicate with the environment via cAMP signalling: dopamine binds extracellular DRD1 and triggers intracellular cAMP signalling without needing to cross the membrane.

  1. Experimental details
    1. List all lipids and genes. (bonus: find the specific genes; for example, instead of just saying “small molecule membrane channel” pick the actual gene.)
    2. How will you measure the function of your system?

Lipids: POPC (palmitoyloleoylphosphatidylcholine), DOPG (dioleoylphosphatidylglycerol), and cholesterol to create the liposome bilayer, at POPC:DOPG:Cholesterol = 60:10:30 mol%. Genes: DRD1, protein kinase, cAMP Response Element Binding Protein 1 (CREB1), EGFP for fluorescence, and T7 RNA polymerase (from E. coli phage T7). To measure function, we will use a plate reader fluorescence assay: run a 96-well plate, add extracellular dopamine at 0, 1 nM, 10 nM, 100 nM, 1 µM, and 10 µM, and measure GFP fluorescence every 30 min.
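To make the assay design above concrete, here is a small planning sketch. The dopamine series, the lipid ratio, and the 30 min read interval come from the answer above; the triplicate layout, the 4 hour run length, the total lipid amount, and all function names are assumptions for illustration only.

```python
DOPAMINE_NM = [0, 1, 10, 100, 1_000, 10_000]  # 0, 1 nM ... 10 µM
LIPID_MOL_PERCENT = {"POPC": 60, "DOPG": 10, "Cholesterol": 30}


def lipid_nmol(total_nmol: float) -> dict[str, float]:
    """Split a total lipid amount (nmol) according to the mol% ratio."""
    return {name: total_nmol * pct / 100 for name, pct in LIPID_MOL_PERCENT.items()}


def plate_map(replicates: int = 3) -> dict[str, int]:
    """One column per dopamine concentration, replicates down the rows."""
    rows = "ABCDEFGH"
    layout = {}
    for col, conc in enumerate(DOPAMINE_NM, start=1):
        for row in rows[:replicates]:
            layout[f"{row}{col}"] = conc
    return layout


def read_times(total_min: int = 240, step_min: int = 30) -> list[int]:
    """Plate reader time points in minutes."""
    return list(range(0, total_min + 1, step_min))


print(lipid_nmol(1000))
print(plate_map()["C6"], read_times())
```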

Homework question from Peter Nguyen

Freeze-dried cell-free systems can be incorporated into all kinds of materials as biological sensors or as inducible enzymes to modify the material itself or the surrounding environment. Choose one application field — Architecture, Textiles/Fashion, or Robotics — and propose an application using cell-free systems that are functionally integrated into the material. Answer each of these key questions for your proposal pitch:

  • Write a one-sentence summary pitch sentence describing your concept.
  • How will the idea work, in more detail? Write 3-4 sentences or more.
  • What societal challenge or market need will this address?
  • How do you envision addressing the limitation of cell-free reactions (e.g., activation with water, stability, one-time use)?

I would like to design a biosensor for robotics ‘in the wild’, where a freeze-dried cell-free system is activated only under specific weather conditions for repair or environment-adaptive changes, for example localised self-repair protein expression on the robot’s skin. The idea is to think about robotics as regenerative rather than designed and completed ‘at a factory’. Just like a tub of instant coffee that can be used ‘on tap’, the freeze-dried system is extremely stable until it is activated with water.

Homework question from Ally Huang

Freeze-dried cell-free reactions have great potential in space, where resources are constrained. As described in my talk, the Genes in Space competition challenges students to consider how biotechnology, including cell-free reactions, can be used to solve biological problems encountered in space. While the competition is limited to only high school students, your assignment will be to develop your own mock Genes in Space proposal to practice thinking about biotech applications in space!

For this particular assignment, your proposal is required to incorporate the BioBits® cell-free protein expression system, but you may also use the other tools in the Genes in Space toolkit (the miniPCR® thermal cycler and the P51 Molecular Fluorescence Viewer). For more inspiration, check out https://www.genesinspace.org/ .

  1. Provide background information that describes the space biology question or challenge you propose to address. Explain why this topic is significant for humanity, relevant for space exploration, and scientifically interesting. (Maximum 100 words)

There is a lot of research right now from Alysson Muotri at UC San Diego on brain organoids in space. There is a growing need to think about hybrid brains or neural repair in space, which might be useful for rescuing living substrates mid-journey. Performing neural surgery on brain organoids, or supporting their fusion in space, also means we will need bioprinting or scaffolding using cell-free systems run by a lab robot.

  1. Name the molecular or genetic target that you propose to study. Examples of molecular targets include individual genes and proteins, DNA and RNA sequences, or broader -omics approaches. (Maximum 30 words)

I am interested in the dopamine receptor D1 (DRD1) and the Forkhead Box A2 and Nurr1 transcription factors, which regulate and amplify dopamine reception in cells. We could do more chemical signalling and reinforcement if we are able to make these cells more receptive to signals.

  1. Describe how your molecular or genetic target relates to the space biology question or challenge your proposal addresses. (Maximum 100 words)

DRD1 is the primary receptor mediating dopamine’s effects on motivation, working memory, and motor coordination. By expressing DRD1 in a BioBits cell-free system with fluorescent reporter, we can either detect dopamine concentration in biofluid samples or test receptiveness of cells in space experiments.

  1. Clearly state your hypothesis or research goal and explain the reasoning behind it. (Maximum 150 words)

The hypothesis will be centered around whether dopaminergic functions will be retained after freeze-drying and rehydration under simulated microgravity conditions.

  1. Outline your experimental plan - identify the sample(s) you will test in your experiment, including any necessary controls, the type of data or measurements that will be collected, etc. (Maximum 100 words)

It could be a comparative experiment in which freeze-dried, rehydrated samples are tested against living samples to see whether freeze-drying retains dopaminergic functions. Since PC12 cell-based dopamine research on Earth requires living cell cultures and CO2 incubators, a freeze-dried system would streamline this and could potentially be used as a neurochemical monitoring toolkit.

Homework Part B: Individual Final Project

Check final project page!


Reading & Resources

Labs

Lab writeups:

  • PCR


  • Week 1 Lab: Pipetting

  • Week 11 Lab: Microfluidics


  • Week 7 Lab: Cell-free systems


Subsections of Labs

PCR

PCR Photocopier and amplifier

qPCR quantitative PCR

mastermix pcr tubes

DENATURATION

ANNEALING

EXTENSION

Week 1 Lab: Pipetting


Week 11 Lab: Microfluidics

Bends in microfluidics devices Separate and sort particles


Add weight and shape in particle

Microreactors - cavities in the middle (stationary area - form assay, UV curing, heat, using time to activate the reaction)

Reynolds number

Re is the ratio of inertial forces to viscous forces. Low Re means viscosity dominates and the flow is laminar.
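A quick worked example of the Reynolds number, using the usual form Re = ρ·v·L/μ. Water at room temperature is taken as roughly 1000 kg/m³ and 1e-3 Pa·s; the channel width and flow speed are made-up values, not measurements from the lab.

```python
def reynolds_number(density: float, velocity: float,
                    length: float, viscosity: float) -> float:
    """Re = (density * velocity * characteristic length) / dynamic viscosity.

    SI units: kg/m^3, m/s, m, Pa*s. The result is dimensionless.
    """
    return density * velocity * length / viscosity


# Hypothetical channel: water, 1 mm/s flow, 100 µm wide.
re_channel = reynolds_number(1000.0, 1e-3, 100e-6, 1e-3)
print(re_channel)
```

A value this far below 1 is why flow in microfluidic channels is laminar and why mixing needs features like bends or herringbone grooves.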

Capillary Number

Peclet Number

Types of channels

Rectangular channels Circular Channels Trapezoidal Channels V shaped channels Herringbone or grooves –> things rolling along the pattern

Cavities

Network Architectures

Chamber Filter Tesla Valve Droplet

if shapes are sandwiches - there are sealants e.g. PDMS bonding system

Stereolithography DLP

Syringe pump Flow.io Nano litres per minute

Design Challenge

Fluid3D

Week 7 Lab: Cell-free systems


Protein synthesis requires transcription and translation

Transcription

Happens in the nucleus in eukaryotes or in the cytoplasm in prokaryotes. RNA polymerase binds the promoter and uses nucleotides to make RNA from the DNA template.

Translation

tRNA, amino acids and ribosomes. In eukaryotes the mRNA is processed inside the nucleus first: splicing takes out the introns and leaves the exons.

RBS (ribosome binding site): attaches to the small ribosomal subunit, so the mRNA joins the small subunit. tRNA binds the start codon (AUG in mRNA, ATG in DNA) with a complementary anticodon. Codons are 3 nucleotides. E, P and A are the three ribosome sites where tRNAs work on codons [exit, peptidyl, aminoacyl]: the first tRNA binds at the start codon, the growing protein is held at the P site, the next amino acid arrives at the A site, and empty tRNAs leave the ribosome from the E site.
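The "codons are 3 nucleotides" note can be illustrated by reading a DNA string the way a ribosome reads mRNA: find the start codon, then step three bases at a time until a stop codon. This is a toy sketch with a made-up sequence; the function name is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_from_start(dna: str) -> list[str]:
    """Split `dna` into codons from the first ATG up to and including the first stop."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return []  # no start codon, nothing to translate
    codons = []
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


print(codons_from_start("GGATGGCTCCTTAACC"))
```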

can happen outside of cells

TX transcription TL translation CFPS (Cell free protein synthesis)

Cell lysate

ribosomes for translation, tRNA, initiation, transcription, and translation factors, microsomes (membranes, phospholipid bilayer)

Template

plasmid DNA (more stable) linear PCR

Supplements

Nucleotides amino acids ATP (transcription) GTP (translation)

buffer stabilize the pH

Prokaryotes: transcription and translation can happen at the same time. Eukaryotes: genes are split (exons and introns), so you need specific enzyme machinery that will take out the introns.

Cell-free systems are given the coding sequence directly, so there is no need to handle introns and exons: plasmid - coding sequence - use messenger RNA to produce insulin

Endosymbiotic theory (mitochondria were once bacteria - circular DNA; too complex, uses more energy to keep mitochondria working)

-no time-consuming cloning steps required -reaction conditions can be fully controlled and modified -proteins that are toxic to cells can still be produced

Tx-TL system can be classified by the source of the cell extract

bacterial cell-free system E Coli eukaryotic yeast, mammalian, insect, plant

Insulin is made of two polypeptide chains joined by disulfide bonds

Cell lysis and a lot of purification

Chromoproteins just have colors

GFP Green fluorescent protein requires blacklight

RFP

purification filters the protein you want

his-tag + protein of interest –> the tag will bind to another metal ion

Fusion protein | His-tag (histidine) | Ligand | Bead

Magnetic block

Today’s protocol:

GFP, RFP, mix Affinity Purification to isolate our samples Mixture –> will get separation

1 GFP 2 RFP 3 mix 4 mix

Projects

Final projects:


Subsections of Projects

Group Final Project


Individual Final Project


Wet-lab:

  1. In-person run protocols
  2. Remote Opentrons based
  3. Ginkgo Nebula Cloud

Competent E. coli, cloning transformation kit, plating media (LB), antibiotic selection marker

Adhesion protein binding peptide / cardiac junction protein

PC12: enhance dopaminergic signalling, or find a disease model. What would the cells secrete with dopamine? How to do the readout?

Synaptic dopamine release is positively regulated by SNAP-25 that involves in benzo[a]pyrene-induced neurotoxicity https://www.sciencedirect.com/science/article/abs/pii/S0045653519315991?via%3Dihub

This paper confirms that DRD1 and DRD3 are endogenously expressed in PC12 and co-localise with SNAP-25, and that CRISPR-Cas9 plasmid transfection works effectively in these cells using Lipofectamine 2000, which is directly relevant to your BSL-1 modification approach. The SNAP-25/DRD interaction also suggests that if you’re overexpressing DRD1 for your reinforcement system, SNAP-25 expression levels will influence how effectively that receptor gets trafficked

The role of the dopamine D1 receptor in social cognition: studies using a novel genetic rat model doi:10.1242/dmm.024752

Drd1 dopamine receptor D1 [ Rattus norvegicus (Norway rat) ] https://www.ncbi.nlm.nih.gov/gene/24316

| Component | BSL status | Notes |
|---|---|---|
| PC12 Adh cells | BSL-1 | Rat, non-human, non-primate |
| NGF protein | BSL-1 | Recombinant protein, no pathogen risk |
| pcDNA3.1 plasmid | BSL-1 | Non-replicating in mammalian cells, no viral genes |
| DRD1 insert | BSL-1 | Normal rat receptor gene, not a pathogen or toxin |
| Lipofectamine 2000 | BSL-1 | Chemical transfection reagent |
| Dopamine/adenosine | BSL-1 | Standard lab chemicals |
| HA tag | BSL-1 | Small peptide epitope, inert |

The one thing to flag with your institution: any plasmid transfection technically requires IBC (Institutional Biosafety Committee) notification at most universities, even at BSL-1. It's usually a simple form, not a full review. Check with your UAL/UCL biosafety officer before starting, as it protects you.

Drafting your Twist order — step by step

Step 1: Find the rat DRD1 coding sequence
Go to NCBI Gene database:

Search: "Drd1 Rattus norvegicus"
You want gene ID: 24316
Click through to the RefSeq mRNA record: NM_012546
This gives you the full mRNA — you need the CDS (coding sequence) only, which is the region between the start codon (ATG) and stop codon

The rat DRD1 CDS encodes a 446 amino acid protein. Copy the CDS nucleotide sequence — this is your starting material.

Step 2: Add the HA tag sequence
You're adding a small epitope tag to the N-terminus of DRD1 so you can verify membrane insertion later (as in the Homberg paper).
The HA tag sequence is: TACCCATACGATGTTCCAGATTACGCT
This encodes the peptide: YPYDVPDYA
You insert this immediately after the start ATG and before the DRD1 coding sequence, with a short glycine-serine linker between them to avoid disrupting folding:
Structure of your final insert:
ATG - HA tag - GGS linker - DRD1 CDS (minus its own ATG)
The GGS linker in DNA is: GGCGGCTCC
So your insert reads:
ATG [HA] GGCGGCTCC [DRD1 CDS from codon 2 onward]
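Before pasting the tag into the design, it is worth confirming that the DNA above really encodes the stated peptide. This is a minimal check using a hand-written codon table that only covers the codons in the tag and linker; it is a sketch rather than a general translator, and the names are made up for illustration.

```python
# Only the codons that appear in ATG + HA tag + GGS linker (standard genetic code).
CODON_TABLE = {
    "ATG": "M", "TAC": "Y", "CCA": "P", "GAT": "D",
    "GTT": "V", "GCT": "A", "GGC": "G", "TCC": "S",
}

HA_TAG = "TACCCATACGATGTTCCAGATTACGCT"
GGS_LINKER = "GGCGGCTCC"


def translate(dna: str) -> str:
    """Translate an in-frame DNA string using the small table above."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


print(translate(HA_TAG))
print(translate("ATG" + HA_TAG + GGS_LINKER))
```

Any codon outside the table raises a `KeyError`, which is a useful warning that the tag was mistyped.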

Step 3: Codon optimise for rat
Take your assembled insert sequence and run it through a codon optimisation tool. The best free option is:
Integrated DNA Technologies (IDT) codon optimisation tool

Go to: idtdna.com/CodonOpt
Paste your sequence
Select organism: Rattus norvegicus
Export the optimised sequence

This ensures the synthetic gene is expressed efficiently in your rat PC12 cells. The optimised sequence will look different from the natural sequence at the nucleotide level but encode exactly the same protein.

Step 4: What exactly to order from Twist
You need one construct — a complete circular plasmid. Specifically:
Twist product type: Gene Fragment cloned into expression vector, or full plasmid synthesis
Ask Twist to deliver: pcDNA3.1(+) containing your insert
Your insert from 5' to 3':
[Kozak sequence] - ATG - [HA tag] - [GGS linker] - [Codon-optimised rat DRD1 CDS] - [Stop codon]
The Kozak sequence goes right before your ATG to ensure efficient translation initiation in mammalian cells: GCCACCATG (the ATG is your start codon)
So the full insert order is:
GCCACC - ATG - TACCCATACGATGTTCCAGATTACGCT - GGCGGCTCC - [DRD1 codons 2-446] - TGA
Flanking this, the pcDNA3.1 vector provides:

CMV promoter (upstream)
BGH polyA signal (downstream)
Ampicillin resistance (for bacterial propagation)
Neomycin/G418 resistance (for mammalian selection if needed)
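The insert order in Step 4 can be assembled and sanity-checked in code before it goes into the Twist form. This sketch works on the native CDS (codon optimisation from Step 3 would be applied separately) and uses a four-codon toy CDS as a stand-in; the function name and the toy sequence are assumptions, not the real design.

```python
KOZAK = "GCCACC"
HA_TAG = "TACCCATACGATGTTCCAGATTACGCT"
GGS_LINKER = "GGCGGCTCC"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def build_insert(cds: str) -> str:
    """Kozak - ATG - HA - GGS - CDS (codon 2 to last sense codon) - TGA."""
    cds = cds.upper()
    if not cds.startswith("ATG"):
        raise ValueError("CDS must start with ATG")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    if cds[-3:] not in STOP_CODONS:
        raise ValueError("CDS must end with a stop codon")
    body = cds[3:-3]  # drop the native ATG and the native stop
    return KOZAK + "ATG" + HA_TAG + GGS_LINKER + body + "TGA"


# Toy stand-in for the real CDS: ATG GCT CCT + stop.
toy_insert = build_insert("ATGGCTCCTTGA")
print(toy_insert, len(toy_insert))
```

With the full rat CDS (1341 nt according to the location=439..1779 field in the NCBI header), the same function would give a 1383 nt insert, which is a quick number to compare against the final design file.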


Step 5: Also order a control plasmid
You need one additional construct — same vector, same promoter, but replace the DRD1-HA insert with EGFP.
This gives you:

Visual confirmation that transfection worked (green cells)
A matched negative control for all your cAMP and electrophysiology assays
Something to optimise your Lipofectamine conditions against before using your precious DRD1 plasmid

Twist sells a standard EGFP insert — you can often just specify "pcDNA3.1-EGFP" as a catalogue item rather than custom synthesis.

Summary of what you're ordering
| Order | Construct | Purpose |
|---|---|---|
| #1 (custom) | pcDNA3.1-CMV-HA-DRD1 (codon optimised rat) | Your experimental construct |
| #2 (standard) | pcDNA3.1-CMV-EGFP | Transfection control |

That's it, just two plasmids to start. Once you've validated membrane insertion and functional cAMP response with construct #1, you have everything you need to run the Opentrons RL experiment.

Practical next steps right now

Go to NCBI, pull NM_012546, copy the CDS
Remove the native ATG, then add ATG + HA + linker to the 5' end
Run the coding sequence through IDT codon optimisation, then add the Kozak (GCCACC) directly upstream of the ATG
Submit to Twist as a full plasmid order in pcDNA3.1(+)
Notify your IBC in parallel

https://www.ncbi.nlm.nih.gov/nuccore/NM_012546

>lcl|NM_012546.3_cds_NP_036678.3_1 [gene=Drd1] [db_xref=GeneID:24316,RGD:2518] [protein=D(1A) dopamine receptor] [protein_id=NP_036678.3] [location=439..1779] [gbkey=CDS]
ATGGCTCCTAACACTTCTACCATGGATGAGGCCGGGCTGCCAGCGGAGAGGGATTTCTCCTTTCGCATCC
TCACGGCCTGTTTCCTGTCACTGCTCATCCTGTCCACTCTCCTGGGCAATACCCTTGTCTGTGCGGCCGT
CATCCGGTTTCGACACCTGAGGTCCAAGGTGACCAACTTCTTTGTCATCTCTTTAGCTGTGTCAGATCTC
TTGGTGGCTGTCCTGGTCATGCCCTGGAAAGCTGTGGCCGAGATTGCTGGCTTTTGGCCCTTTGGGTCCT
TTTGTAACATCTGGGTAGCCTTTGACATCATGTGCTCTACGGCGTCCATTCTGAACCTCTGCGTGATCAG
CGTGGACAGGTACTGGGCTATCTCCAGCCCTTTCCAGTATGAGAGGAAGATGACCCCCAAAGCAGCCTTC
ATCCTGATTAGCGTAGCATGGACTCTGTCTGTCCTTATATCCTTCATCCCAGTACAGCTAAGCTGGCACA
AGGCAAAGCCCACATGGCCCTTGGATGGCAATTTTACCTCCCTGGAGGACACCGAGGATGACAACTGTGA
CACAAGGTTGAGCAGGACGTATGCCATTTCATCGTCCCTCATCAGCTTTTACATCCCCGTAGCCATTATG
ATCGTCACCTACACCAGTATCTACAGGATTGCCCAGAAGCAAATCCGGCGCATCTCAGCCTTGGAGAGGG
CAGCAGTCCATGCCAAGAATTGCCAGACCACCGCAGGTAACGGGAACCCCGTCGAATGCGCCCAGTCTGA
AAGTTCCTTTAAGATGTCCTTCAAGAGGGAGACGAAAGTTCTAAAGACGCTGTCTGTGATCATGGGGGTG
TTTGTGTGCTGCTGGCTCCCTTTCTTCATCTCGAACTGTATGGTGCCCTTCTGTGGCTCTGAGGAGACCC
AGCCATTCTGCATCGATTCCATCACCTTCGATGTGTTTGTGTGGTTTGGGTGGGCGAATTCTTCCCTGAA
CCCCATTATTTATGCTTTTAATGCTGACTTCCAGAAGGCGTTCTCAACCCTCTTAGGATGCTACAGACTC
TGCCCTACTACGAATAATGCCATAGAGACGGTGAGCATTAACAACAATGGGGCTGTGGTGTTTTCCAGCC
ACCATGAGCCCCGAGGCTCCATCTCCAAGGACTGTAATCTGGTTTACCTGATCCCTCATGCCGTGGGCTC
CTCTGAGGACCTGAAGAAGGAAGAGGCTGGTGGAATAGCTAAGCCACTGGAGAAGCTGTCCCCAGCCTTA
TCGGTCATATTGGACTATGACACCGATGTCTCTCTAGAAAAGATCCAACCTGTCACACACAGTGGACAGC
ATTCCACTTGA
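Before building on a sequence copied from NCBI, it is worth checking that nothing was lost in the copy-paste. The sketch below joins the FASTA lines and runs basic CDS checks; the function names are illustrative and the `toy_fasta` record is a made-up miniature, not the Drd1 entry. For the real record, the header's location=439..1779 implies a length of 1341 nt.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA string, skipping the header."""
    lines = fasta_text.strip().splitlines()
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()


def check_cds(seq):
    """Return a list of problems; an empty list means the CDS looks intact."""
    problems = []
    if set(seq) - set("ACGT"):
        problems.append("contains characters other than A, C, G, T")
    if len(seq) % 3 != 0:
        problems.append(f"length {len(seq)} is not a multiple of 3")
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if seq[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    # Every in-frame codon except the last one should be a sense codon.
    internal = [seq[i:i + 3] for i in range(0, len(seq) - 3, 3)]
    if any(codon in STOP_CODONS for codon in internal):
        problems.append("contains an in-frame stop codon before the end")
    return problems


# Made-up miniature record, for illustration only.
toy_fasta = ">toy_record\nATGGCT\nCCTTGA\n"
toy_seq = read_fasta_sequence(toy_fasta)
print(len(toy_seq), check_cds(toy_seq))
```

Paste the full FASTA text above in place of `toy_fasta`; the length should print as 1341 with an empty problem list.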
[Kozak] - [ATG] - [HA tag] - [GGS linker] - [rDrd1 CDS] - [TGA stop]
Kozak sequence (GCCACC)
A short regulatory signal immediately before the start codon. Ribosomes use it to recognise exactly where to begin reading and making protein. Without it, translation initiation is less efficient and expression is usually lower.
ATG start codon
Every protein begins here. The ribosome reads this and starts building the amino acid chain.
HA epitope tag (YPYDVPDYA)
A tiny peptide tag derived from influenza hemagglutinin, a surface protein of the virus. It has no function of its own in your system. Its only purpose is to give you something to detect with an anti-HA antibody, so you can confirm your DRD1 protein is being made and has reached the cell membrane correctly.
GGS flexible linker
Three small amino acids (Glycine-Glycine-Serine) acting as a molecular spacer between the HA tag and DRD1. Prevents the tag from physically interfering with DRD1 folding and membrane insertion.
Rat DRD1 coding sequence
The actual gene: 1338 nucleotides encoding 446 amino acids of the dopamine D1 receptor (1341 nucleotides with the stop codon, matching location 439..1779 in the NCBI record). This is the functional part. Codon optimised for efficient expression in rat PC12 cells.
TGA stop codon
Tells the ribosome to stop. Protein synthesis ends here.

pcDNA3.1 - CMV - HA - rDrd1 - coopt
    │         │     │     │       │
    │         │     │     │       └── Codon optimised for rat PC12
    │         │     │     └────────── Rat Dopamine D1 Receptor gene
    │         │     └──────────────── HA detection tag
    │         └────────────────────── Strong mammalian promoter
    └──────────────────────────────── Vector backbone

A circular DNA plasmid with a cytomegalovirus (CMV) promoter that drives high-level expression of the DRD1 gene in mammalian cells. The HA tag, borrowed from influenza virus, acts as a molecular flag on the DRD1 protein so I can detect it with an anti-HA antibody. Codon optimised via Twist.