Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

    First, describe a biological engineering application or tool you want to develop and why. I am working on the epidemiology and development of management practices of Aster Yellows Phytoplasma (AYP) in the Plant Molecular Virology Laboratory at University of Minnesota. AYP was first detected in garlic in Minnesota in 2012, with outbreaks recorded in 2017 and 2021, and spread to planting material in 2018 and 2022. In 2024, infestations were detected throughout Minnesota. However, there are no precise data on its incidence. AYP is an obligate type of bacteria that resides in the phloem and it is transmitted by leafhoppers. AYP is a concern for production, because there are not available effective treatments for this emergent pathogen in Minnesota’s garlic crops, and current diagnostic methods are time-consuming and costly. There is a need to develop biotechnological tools for the detection and management of this plant pathogen.

  • Week 2 HW: DNA Read, Write, & Edit

    My Homework DNA! Part 1: Benchling & In-silico Gel Art See this week’s lab protocol “Gel Art: Restriction Digests and Gel Electrophoresis” for details. Overview: Make a free account at benchling.com Import the Lambda DNA. Simulate Restriction Enzyme Digestion with the following Enzymes: EcoRI HindIII BamHI KpnI EcoRV SacI SalI Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks. You might find Ronan’s website a helpful tool for quickly iterating on designs!

  • Week 3 HW: Lab Automation

    My Homework DNA! Assignment: Python Script for Opentrons Artwork Your task this week is to Create a Python file to run on an Opentrons liquid handling robot. Review this week’s recitation and this week’s lab for details on the Opentrons and programming it. Generate an artistic design using the GUI at opentrons-art.rcdonovan.com. Using the coordinates from the GUI, follow the instructions in the HTGAA26 Opentrons Colab to write your own Python script which draws your design using the Opentrons.

  • Week 4 HW: Protein Design Part I

    My homework DNA! Part A. Conceptual Questions Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip) How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons) Why do humans eat beef but do not become a cow, eat fish but do not become fish? Humans do not become cows or fish after eating them because the proteins we consume are first broken down into their basic components during digestion. Although all living organisms use the same 20 standard amino acids as building blocks of proteins, these proteins are digested into individual amino acids by enzymes such as pepsin in the stomach and trypsin in the small intestine. These amino acids are then absorbed and reused by our cells to synthesize new proteins according to the instructions encoded in human DNA. Therefore, the amino acids from beef or fish are simply raw materials that our bodies use to build human proteins.

  • Week 5 HW: Protein Design part II

    My Homework Part 1: Generate Binders with PepMLM Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card: Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison. Record the perplexity scores that indicate PepMLM’s confidence in the binders. sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2 MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

  • Week 6 HW: Genetic Circuits Design part I

    My Homework DNA Assembly Answer these questions about the protocol in this week’s lab: What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose? Phusion High-Fidelity PCR master mix contains Phusion DNA polymerase, deoxynucleotides and HF reaction buffer with MgCl2. This master mix is used for long or difficult PCR amplifications and applications where high sequence fidelity is critical, such as cloning, mutagenesis, or amplicon sequencing.

Subsections of Homework

Week 1 HW: Principles and Practices

cover image cover image
  1. First, describe a biological engineering application or tool you want to develop and why.

I am working on the epidemiology and development of management practices of Aster Yellows Phytoplasma (AYP) in the Plant Molecular Virology Laboratory at University of Minnesota. AYP was first detected in garlic in Minnesota in 2012, with outbreaks recorded in 2017 and 2021, and spread to planting material in 2018 and 2022. In 2024, infestations were detected throughout Minnesota. However, there are no precise data on its incidence. AYP is an obligate type of bacteria that resides in the phloem and it is transmitted by leafhoppers. AYP is a concern for production, because there are not available effective treatments for this emergent pathogen in Minnesota’s garlic crops, and current diagnostic methods are time-consuming and costly. There is a need to develop biotechnological tools for the detection and management of this plant pathogen.

One of the objectives of my research is to develop a low-cost, field-deployable biosensing platform to detect AYP and growers can have access to this test. The idea is to create a platform free of lab equipment, user-friendly, rapid detection and programmable. The biosensor is a cell-free system that integrates a nucleic acid extraction step with minimal sample preparation, a pre-amplification step (multiplexed RPA reaction for internal control and AYP) to improve sensitivity and specificity and detection step by CRISPR Cas12a or Toehold switch technology. Lateral flow assay could be adapted to visualize the results inside of a microfluidic chip or another Point-of-care testing (POCT) device. The technology will be validated to detect AYP in other potential hosts in Minnesota like grapevines, pennycress and camelina.

  1. Next, describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future, like ensuring non-malfeasance (preventing harm). Break big goals down into two or more specific sub-goals. Below is one example framework (developed in the context of synthetic genomics) you can choose to use or adapt, or you can develop your own. The example was developed to consider policy goals of ensuring safety and security, alongside other goals, like promoting constructive uses, but you could propose other goals for example, those relating to equity or autonomy.

The development of field-deployable diagnostic devices has several ethical risk categories that I consider in the policy governance goals and subgoals to promote safety and security in this area.

  1. Prevent economic loss from false results Describe performance thresholds such as limit of detection “LoD”, specificity and sensitivity that must be met before use in the field. Further validation with additional phytoplasma samples and off-target effects of unrelated species. Establish an internal extraction/amplification control lane in every test to differentiate correctly inhibited samples from negative samples. Confirm results using other detection methods such as qPCR to avoid crop destruction, spread of plant material such as cloves, and unnecessary quarantines. Determine contamination prevention practices such as workflow separation inside the device. Establish clear result categories like negative, positive and invalid.

  2. Support equitable access and benefits Low-cost platform and user-friendly allow farmers, growers and research laboratories to use the platform. Training materials provided to growers and researchers. Inform and explain about results, findings and the technology to farmers. Participation in events, festivals and extension programs, field days to disseminate findings.

  3. Next, describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”). Try to outline a mix of actions (e.g. a new requirement/rule, incentive, or technical strategy) pursued by different “actors” (e.g. academic researchers, companies, federal regulators, law enforcement, etc). Draw upon your existing knowledge and a little additional digging, and feel free to use analogies to other domains (e.g. 3D printing, drones, financial systems, etc.). Purpose: What is done now and what changes are you proposing? Design: What is needed to make it “work”? (including the actor(s) involved - who must opt-in, fund, approve, or implement, etc) Assumptions: What could you have wrong (incorrect assumptions, uncertainties)? Risks of Failure & “Success”: How might this fail, including any unintended consequences of the “success” of your proposed actions?

Accessibility text Accessibility text
  1. Next, score (from 1-3 with, 1 as the best, or n/a) each of your governance actions against your rubric of policy goals. The following is one framework but feel free to make your own:

Governance Goal 1: Prevent economic loss from false results

Accessibility text Accessibility text

Governance Goal 2: Support equitable access and benefits

Accessibility text Accessibility text
  1. Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties. For this, you can choose one or more relevant audiences for your recommendation, which could range from the very local (e.g. to MIT leadership or Cambridge Mayoral Office) to the national (e.g. to President Biden or the head of a Federal Agency) to the international (e.g. to the United Nations Office of the Secretary-General, or the leadership of a multinational firm or industry consortia). These could also be one of the “actor” groups in your matrix. Reflecting on what you learned and did in class this week, outline any ethical concerns that arose, especially any that were new to you. Then propose any governance actions you think might be appropriate to address those issues. This should be included on your class page for this week.

Based on the scoring analysis, the prioritized governance approach includes Option 1 with Option 3. The internal control supports the goal of preventing economic loss by reducing false negative results. Additional tests such as gold standard qPCR are required to validate results in uncertainty scenarios ensuring proper management practices. This prioritization involves balancing speed, cost, and confidence. Although confirmatory testing can cause delays and access difficulties, these are justified by the reduction of irreversible economic damage. The approach is based on the assumption that internal controls are robust in field conditions and that the infrastructure for confirmatory testing is accessible through agricultural agencies or extension services. There may be uncertainty in user behavior, particularly in terms of how people respond to invalid results or delays in procedures. A major ethical concern is the risk of dependence on POCT, where users may consider the results as definitive. To address this issue, it will implement measures such as clear communication of the limitations of the tests and feedback mechanisms to report unexpected results.

Slides Homework

  1. Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?

Error rate of a proofreading polymerase is around 1 in 10⁶ per base. Considering that the human genome is around 3.2 billion base pairs it will be around 3200 errors in every cell division. There are some different repair mechanisms post replication as MutS mechanism. This protein recognizes and binds regions that were not fixed during replication proofreading.

  1. How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

As we know the genetic code is degenerate, which means one amino acid could be encoded by multiple codons. There are 61 sense codons that encode 20 aa, that means a ratio of 3. An average human protein of 500 aa could have around 3 to the power of 500. There are several reasons all of these different codes do not work to code for the protein. One is that organisms do not use synonymic codons equally, instead preferring some codons rather than others and that depends on tRNA abundance and other factors. mRNA secondary structure and GC content can produce hairpins which affect the availability of ribosomes to translate the protein of interest.

  1. What’s the most commonly used method for oligo synthesis currently?

Gold standard method is Phosphoramidite Chemistry.

  1. Why is it difficult to make oligos longer than 200nt via direct synthesis?

Coupling and capping can fail and produce short sequences which decrease the abundance of full-length molecules and makes it difficult to purify. The larger the oligonucleotide, the more errors accumulate and the lower the percentage of correct oligo size.

  1. Why can’t you make a 2000bp gene via direct oligo synthesis?

For the same reasons of question 2, you will have a lot of truncated molecules with a lot of error accumulations which make it impossible to synthesize a gene.

Choose ONE of the following three questions to answer; and please cite AI prompts or paper citations used, if any. [Using Google & Prof. Church’s slide #4] What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

The essential amino acids are methionine, valine, tryptophan, threonine, lysine, histidine, leucine, arginine, isoleucine, and phenylalanine. Lysine contingency is a term introduced in Jurassic Park to control dinosaurs and states that dinosaurs can survive only if they are supplemented with diets rich in lysine. However, this strategy is unrealistic because dinosaurs that escape could survive by eating other sources of lysine, such as plants and animals.

[Given slides #2 & 4 (AA:NA and NA:NA codes)] What code would you suggest for AA:AA interactions? [(Advanced students)] Given the one paragraph abstracts for these real 2026 grant programs sketch a response to one of them or devise one of your own:

Week 2 HW: DNA Read, Write, & Edit

My Homework

DNA!

Part 1: Benchling & In-silico Gel Art

See this week’s lab protocol “Gel Art: Restriction Digests and Gel Electrophoresis” for details. Overview:

Make a free account at benchling.com Import the Lambda DNA. Simulate Restriction Enzyme Digestion with the following Enzymes: EcoRI HindIII BamHI KpnI EcoRV SacI SalI Create a pattern/image in the style of Paul Vanouse’s Latent Figure Protocol artworks. You might find Ronan’s website a helpful tool for quickly iterating on designs!

My in silico design in Benchling

Accessibility text Accessibility textAccessibility text Accessibility textAccessibility text Accessibility text

Part 2: Gel Art - Restriction Digests and Gel Electrophoresis

In the wet-lab perform the lab experiment you designed in Part 1 and outlined in this week’s lab protocol “Gel Art: Restriction Digests and Gel Electrophoresis”.

Part 3: DNA Design Challenge

3.1. Choose your protein.

In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose.

[Example from our group homework, you may notice the particular format — The example below came from UniProt]

sp|P03609|LYS_BPMS2 Lysis protein OS=Escherichia phage MS2 OX=12022 PE=2 SV=1 METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLL EAVIRTVTTLQQLLT

For my final project, I decided to develop a field-deployable diagnostic platform that is user-friendly and low-cost for growers. For this platform, one idea could be to use the CRISPR-Cas12a system; therefore, I chose the enzyme LbCas12a-Ultra. In the article, they improved different components of the CRISPR assay, such as Cas12a and the reporter system. They tested different Cas12a variants, and the best-performing one was LbCas12a-Ultra. This enzyme was generated by IDT, and the sequence is not available in databases or articles. I found this information in UniProt:

tr|A0A5S8WF58|A0A5S8WF58_9FIRM LbCas12a OS=Lachnospiraceae bacterium OX=1898203 PE=1 SV=1 MSKLEKFTNCYSLSKTLRFKAIPVGKTQENIDNKRLLVEDEKRAEDYKGVKKLLDRYYLS FINDVLHSIKLKNLNNYISLFRKKTRTEKENKELENLEINLRKEIAKAFKGNEGYKSLFKKDIIETILPEFLDDKDEIALVNSFNGFTTAFTGFFDNRENMFSEEAKSTSIAFRCINENL TRYISNMDIFEKVDAIFDKHEVQEIKEKILNSDYDVEDFFEGEFFNFVLTQEGIDVYNAIIGGFVTESGEKIKGLNEYINLYNQKTKQKLPKFKPLYKQVLSDRESLSFYGEGYTSDEEV LEVFRNTLNKNSEIFSSIKKLEKLFKNFDEYSSAGIFVKNGPAISTISKDIFGEWNVIRDKWNAEYDDIHLKKKAVVTEKYEDDRRKSFKKIGSFSLEQLQEYADADLSVVEKLKEIIIQ KVDEIYKVYGSSEKLFDADFVLEKSLKKNDAVVAIMKDLLDSVKSFENYIKAFFGEGKETNRDESFYGDFVLAYDILLKVDHIYDAIRNYVTQKPYSKDKFKLYFQNPQFMGGWDKDKET DYRATILRYGSKYYLAIMDKKYAKCLQKIDKDDVNGNYEKINYKLLPGPNKMLPKVFFSKKWMAYYNPSEDIQKIYKNGTFKKGDMFNLNDCHKLIDFFKDSISRYPKWSNAYDFNFSET EKYKDIAGFYREVEEQGYKVSFESASKKEVDKLVEEGKLYMFQIYNKDFSDKSHGTPNLHTMYFKLLFDENNHGQIRLSGGAELFMRRASLKKEELVVHPANSPIANKNPDNPKKTTTLS YDVYKDKRFSEDQYELHIPIAINKCPKNIFKINTEVRVLLKHDDNPYVIGIDRGERNLLYIVVVDGKGNIVEQYSLNEIINNFNGIRIKTDYHSLLDKKEKERFEARQNWTSIENIKELK AGYISQVVHKICELVEKYDAVIALEDLNSGFKNSRVKVEKQVYQKFEKMLIDKLNYMVDKKSNPCATGGALKGYQITNKFESFKSMSTQNGFIFYIPAWLTSKIDPSTGFVNLLKTKYTS IADSKKFISSFDRIMYVPEEDLFEFALDYKNFSRTDADYIKKWKLYSYGNRIRIFRNPKKNNVFDWEEVCLTSAYKELFNKYGINYQQGDIRALLCEQSDKAFYSSFMALMSLMLQMRNS ITGRTDVDFLISPVKNSDGIFYDSRNYEAQENAILPKNADANGAYNIARKVLWAIGQFKKAEDEKLDKVKIAISNKEWLEYAQTSVKH

3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.

The Central Dogma discussed in class and recitation describes the process in which DNA sequence becomes transcribed and translated into protein. The Central Dogma gives us the framework to work backwards from a given protein sequence and infer the DNA sequence that the protein is derived from. Using one of the tools discussed in class, NCBI or online tools (google “reverse translation tools”), determine the nucleotide sequence that corresponds to the protein sequence you chose above.

[Example: Get to the original sequence of phage MS2 L-protein from its genome phage MS2 genome - Nucleotide - NCBI]

Lysis protein DNA sequence atggaaacccgattccctcagcaatcgcagcaaactccggcatctactaatagacgccggccattcaaacatgaggattacccatgtcgaagacaacaaagaagttcaactctttatgtattgatcttcctcgcgatctttctctcgaaatttaccaatcaattgcttctgtcgctactggaagcggtgatccgcacagtgacgactttacagcaattgcttacttaa

LbCas12a from NCBI Genbank: OK557998.1 atgagcaagctggagaagtttacaaactgctactccctgtctaagaccctgaggttcaaggccatccctgtgggcaagacccaggagaacatcgacaataagcggctgctggtggaggacgagaagagagccgaggattataagggcgtgaagaagctgctggatcgctactatctgtcttttatcaacgacgtgctgcacagcatcaagctgaagaatctgaacaattacatcagcctgttccggaagaaaaccagaaccgagaaggagaataaggagctggagaacctggagatcaatctgcggaaggagatcgccaaggccttcaagggcaacgagggctacaagtccctgtttaagaaggatatcatcgagacaatcctgccagagttcctggacgataaggacgagatcgccctggtgaacagcttcaatggctttaccacagccttcaccggcttctttgataacagagagaatatgttttccgaggaggccaagagcacatccatcgccttcaggtgtatcaacgagaatctgacccgctacatctctaatatggacatcttcgagaaggtggacgccatctttgataagcacgaggtgcaggagatcaaggagaagatcctgaacagcgactatgatgtggaggatttctttgagggcgagttctttaactttgtgctgacacaggagggcatcgacgtgtataacgccatcatcggcggcttcgtgaccgagagcggcgagaagatcaagggcctgaacgagtacatcaacctgtataatcagaaaaccaagcagaagctgcctaagtttaagccactgtataagcaggtgctgagcgatcgggagtctctgagcttctacggcgagggctatacatccgatgaggaggtgctggaggtgtttagaaacaccctgaacaagaacagcgagatcttcagctccatcaagaagctggagaagctgttcaagaattttgacgagtactctagcgccggcatctttgtgaagaacggccccgccatcagcacaatctccaaggatatcttcggcgagtggaacgtgatccgggacaagtggaatgccgagtatgacgatatccacctgaagaagaaggccgtggtgaccgagaagtacgaggacgatcggagaaagtccttcaagaagatcggctccttttctctggagcagctgcaggagtacgccgacgccgatctgtctgtggtggagaagctgaaggagatcatcatccagaaggtggatgagatctacaaggtgtatggctcctctgagaagctgttcgacgccgattttgtgctggagaagagcctgaagaagaacgacgccgtggtggccatcatgaaggacctgctggattctgtgaagagcttcgagaattacatcaaggccttctttggcgagggcaaggagacaaacagggacgagtccttctatggcgattttgtgctggcctacgacatcctgctgaaggtggaccacatctacgatgccatccgcaattatgtgacccagaagccctactctaaggataagttcaagctgtattttcagaaccctcagttcatgggcggctgggacaaggataaggagacagactatcgggccaccatcctgagatacggctccaagtactatctggccatcatggataagaagtacgccaagtgcctgcagaagatcgacaaggacgatgtgaacggcaattacgagaagatcaactataagctgctgcccggccctaataagatgctgccaaaggtgttcttttctaagaagtggatggcctactataaccccagcgaggacatccagaagatctacaagaatggcacattcaagaagggcgatatgtttaacctgaatgactgtcacaagctgatcgacttctttaaggatagcatctcccggtatccaaagtggtccaatgcctacgatttcaacttttctgagacagagaagtataaggacatcgccggcttttacagagaggtggaggagcagggctataaggtgagcttcgagtctgccagcaagaaggaggtggataagctggtggaggagggcaagctgtatatgttccagatctataacaaggacttttccgataagtctcacggcacacccaatctgcacaccatgtacttcaagctgctgtttgacgagaacaatcacggacagatcaggctgagcggaggagcagagctgttcatgaggcgcgcctccctgaagaaggaggagctggtggtgcacccagccaactcccctatcgccaacaagaatccagataatcccaagaaaaccacaaccctgtcctacgacgtgtataaggataagaggttttctgaggaccagtacgagctgcacatcccaatcgccatcaataagtgccccaagaacatcttcaagatcaatacagaggtgcgcgtgctgctgaagcacgacgataacccctatgtgatcggcatcgataggggcgagcgcaatctgctgtatatcgtggtggtggacggcaagggcaacatcgtggagcagtattccctgaacgagatcatcaacaacttcaacggcatcaggatcaagacagattaccactctctgctggacaagaaggagaaggagaggttcgaggcccgccagaactggacctccatcgagaatatcaaggagctgaaggccggctatatctctcaggtggtgcacaagatctgcgagctggtggagaagtacgatgccgtgatcgccctggaggacctgaactctggctttaagaatagccgcgtgaaggtggagaagcaggtgtatcagaagttcgagaagatgctgatcgataagctgaactacatggtggacaagaagtctaatccttgtgcaacaggcggcgccctgaagggctatcagatcaccaataagttcgagagctttaagtccatgtctacccagaacggcttcatcttttacatccctgcctggctgacatccaagatcgatccatctaccggctttgtgaacctgctgaaaaccaagtataccagcatcgccgattccaagaagttcatcagctcctttgacaggatcatgtacgtgcccgaggaggatctgttcgagtttgccctggactataagaacttctctcgcacagacgccgattacatcaagaagtggaagctgtactcctacggcaaccggatcagaatcttccggaatcctaagaagaacaacgtgttcgactgggaggaggtgtgcctgaccagcgcctataaggagctgttcaacaagtacggcatcaattatcagcagggcgatatcagagccctgctgtgcgagcagtccgacaaggccttctactctagctttatggccctgatgagcctgatgctgcagatgcggaacagcatcacaggccgcaccgacgtggattttctgatcagccctgtgaagaactccgacggcatcttctacgatagccggaactatgaggcccaggagaatgccatcctgccaaagaacgccgacgccaatggcgcctataacatcgccagaaaggtgctgtgggccatcggccagttcaagaaggccgaggacgagaagctggataaggtgaagatcgccatctctaacaaggagtggctggagtacgcccagaccagcgtgaagcac

3.3. Codon optimization.

Once a nucleotide sequence of your protein is determined, you need to codon optimize your sequence. You may, once again, utilize google for a “codon optimization tool”. In your own words, describe why you need to optimize codon usage. Which organism have you chosen to optimize the codon sequence for and why?

[Example from Codon Optimization Tool | Twist Bioscience while avoiding Type IIs enzyme recognition sites BsaI, BsmBI, and BbsI]

Lysis protein DNA sequence with Codon-Optimization ATGGAAACCCGCTTTCCGCAGCAGAGCCAGCAGACCCCGGCGAGCACCAACCGCCGCCGCCCGTTCAAACATGAAGATTATCCGTGCCGTCGTCAGCAGCGCAGCAGCACCCTGTATGTGCTGATTTTTCTGGCGATTTTTCTGAGCAAATTCACCAACCAGCTGCTGCTGAGCCTGCTGGAAGCGGTGATTCGCACAGTGACGACCCTGCAGCAGCTGCTGACCTAA

I used the codon optimization tool from Twist Bioscience because I plan to produce LbCas12a in E. coli, purify it, and prepare it for use in a cell-free system. E. coli is a well-characterized organism and is relatively simple to handle for protein expression and purification. LbCas12a is an enzyme originally derived from Lachnospiraceae bacterium. Since I chose the expression vector pET-28a(+), I selected the option in the program to optimize codon usage for E. coli and to avoid introducing restriction sites for BamHI and XhoI, as I will use these enzymes for cloning.

LbCas12a DNA sequence with Codon-optimization ATGTCGAAGCTGGAGAAGTTCACGAACTGTTACTCTTTGTCCAAGACGTTACGCTTCAAAGCGATACCTGTGGGTAAAACACAAGAGAACATCGATAACAAGAGACTGCTGGTGGAAGACGAGAAACGAGCTGAAGATTACAAGGGCGTCAAGAAGTTATTGGATCGATACTATCTTTCCTTTATCAACGATGTTCTGCACTCGATTAAATTGAAGAACCTCAATAACTATATATCTCTGTTCCGCAAGAAGACTCGAACCGAAAAGGAAAACAAGGAGCTTGAGAACTTAGAGATTAACCTCAGAAAGGAGATAGCCAAAGCTTTCAAGGGAAACGAGGGGTACAAATCCTTGTTCAAGAAGGATATAATCGAAACGATACTGCCCGAATTTCTGGACGATAAAGATGAAATTGCTCTGGTCAACTCGTTCAATGGCTTCACGACTGCATTTACGGGTTTCTTCGACAACCGTGAGAACATGTTCAGCGAAGAGGCTAAAAGCACGAGCATTGCCTTTCGTTGCATTAACGAGAACCTGACCCGGTACATAAGTAATATGGACATATTTGAGAAAGTAGACGCTATCTTTGATAAGCACGAAGTCCAAGAGATTAAAGAGAAGATCCTTAATTCAGACTACGATGTAGAGGACTTCTTTGAAGGGGAGTTCTTCAACTTCGTACTGACCCAAGAGGGCATAGATGTCTACAATGCAATTATAGGCGGCTTCGTCACCGAGTCGGGCGAAAAGATAAAAGGCTTGAACGAATACATTAACCTGTACAACCAGAAGACAAAACAAAAGCTCCCGAAGTTCAAGCCCCTGTATAAACAAGTTCTCTCAGACCGCGAGAGTTTGTCTTTCTACGGTGAGGGTTACACTTCGGATGAAGAGGTCCTCGAAGTATTTCGCAATACGTTAAACAAGAACTCAGAGATATTTTCTTCCATTAAGAAATTAGAGAAGCTCTTCAAGAACTTTGATGAGTACTCCTCAGCTGGTATATTCGTGAAGAATGGGCCAGCAATCTCCACCATTAGTAAGGATATATTTGGCGAATGGAACGTAATCCGTGACAAATGGAACGCTGAATACGATGATATCCATCTGAAGAAGAAGGCCGTAGTGACGGAGAAGTACGAAGACGACCGTCGGAAGTCCTTTAAGAAGATAGGAAGCTTCAGTCTCGAACAGCTCCAGGAGTACGCAGATGCGGATTTAAGTGTAGTCGAGAAATTGAAGGAGATAATAATTCAAAAGGTTGATGAGATCTACAAGGTCTATGGGTCTAGTGAGAAACTCTTCGACGCCGACTTCGTGTTGGAAAAGTCACTGAAGAAGAATGATGCCGTTGTGGCCATTATGAAGGACCTGCTGGATTCAGTGAAGAGCTTTGAGAACTACATTAAAGCATTCTTCGGAGAAGGTAAGGAAACTAATCGCGACGAGTCTTTCTACGGAGACTTCGTTCTTGCGTATGACATCCTTTTGAAGGTGGACCACATTTACGATGCAATTAGAAACTACGTCACTCAGAAGCCTTACTCCAAGGACAAGTTTAAGCTTTACTTCCAAAACCCTCAATTTATGGGCGGTTGGGACAAAGACAAAGAGACTGACTATCGCGCCACGATCTTACGCTATGGAAGCAAGTACTACTTAGCGATTATGGACAAGAAGTACGCAAAGTGCCTGCAAAAGATAGACAAGGATGATGTTAACGGCAACTACGAGAAAATTAATTACAAGCTGCTTCCTGGCCCAAATAAGATGCTGCCAAAGGTCTTCTTTTCTAAGAAGTGGATGGCTTACTACAACCCATCAGAGGACATACAAAAGATTTATAAGAACGGCACTTTCAAGAAGGGTGACATGTTTAACCTCAATGACTGTCATAAGTTGATTGACTTCTTCAAGGACAGTATATCCCGATACCCTAAGTGGTCCAATGCGTACGACTTCAATTTCTCAGAGACAGAAAAGTACAAAGACATCGCGGGTTTCTATCGCGAGGTGGAAGAACAAGGATACAAGGTTTCTTTCGAGTCAGCTTCTAAGAAAGAGGTCGACAAGCTGGTGGAAGAGGGTAAGCTGTACATGTTTCAGATTTACAATAAGGATTTCTCAGATAAGTCACACGGAACTCCCAACCTGCACACCATGTACTTCAAACTGCTGTTCGATGAGAACAACCACGGTCAAATTCGCTTGTCTGGCGGCGCAGAACTTTTCATGCGGCGCGCTTCTCTGAAGAAAGAGGAACTTGTTGTTCATCCAGCCAACAGCCCTATCGCTAACAAGAATCCTGACAACCCCAAGAAAACTACTACGTTGAGTTACGACGTTTACAAGGATAAGAGATTTTCCGAGGACCAGTACGAGTTGCATATTCCGATAGCCATTAATAAATGCCCTAAGAACATTTTCAAGATCAATACGGAAGTACGTGTCCTTTTGAAACACGACGACAACCCATACGTGATTGGGATTGACAGAGGAGAGCGTAACTTATTATATATCGTCGTCGTCGACGGCAAGGGAAACATAGTTGAACAGTATAGCTTAAATGAGATAATCAACAACTTCAACGGTATCCGTATCAAAACAGATTACCACTCTTTACTTGACAAGAAGGAGAAGGAGCGCTTCGAGGCTCGTCAAAATTGGACTAGCATTGAGAATATTAAAGAGCTCAAGGCTGGGTATATCAGCCAGGTTGTACATAAGATCTGCGAGCTGGTTGAAAAGTACGACGCGGTCATCGCACTTGAGGACCTGAACAGTGGTTTTAAGAACAGCCGTGTGAAGGTCGAGAAGCAGGTATACCAAAAGTTTGAGAAGATGCTGATAGATAAGCTGAACTACATGGTAGACAAGAAATCTAATCCATGTGCGACTGGTGGGGCCTTAAAGGGATATCAGATAACCAACAAATTTGAGTCCTTCAAGAGCATGTCGACCCAGAACGGATTCATATTTTACATCCCGGCTTGGTTGACATCCAAGATCGATCCTAGCACTGGTTTCGTTAACCTCTTGAAGACTAAGTACACGAGTATAGCTGACAGTAAGAAGTTTATCTCATCTTTCGATCGTATTATGTACGTTCCAGAGGAAGACCTTTTCGAGTTTGCGTTAGATTACAAGAACTTCAGCCGAACTGACGCAGACTACATTAAGAAGTGGAAATTGTATAGCTATGGGAACCGAATTCGCATCTTCCGGAATCCAAAGAAGAATAACGTTTTCGACTGGGAAGAGGTGTGTTTAACTTCAGCGTATAAAGAGCTGTTTAATAAGTACGGAATAAATTACCAACAAGGTGACATCCGCGCCCTTCTGTGCGAGCAATCGGATAAAGCATTTTACTCCTCGTTCATGGCATTGATGTCATTAATGTTGCAAATGCGGAACTCCATAACTGGTAGAACCGATGTAGACTTTCTTATCTCACCGGTTAAGAACAGTGATGGTATCTTCTACGATTCTCGGAATTATGAGGCTCAAGAGAACGCAATCCTGCCCAAGAACGCTGACGCGAATGGAGCCTATAATATCGCACGTAAAGTACTGTGGGCTATAGGTCAATTTAAGAAGGCCGAAGACGAGAAATTGGACAAAGTTAAAATCGCAATTTCTAACAAAGAGTGGCTGGAATATGCGCAAACAAGTGTGAAGCAT

Accessibility text Accessibility text

3.4. You have a sequence! Now what?

What technologies could be used to produce this protein from your DNA? Describe in your words the DNA sequence can be transcribed and translated into your protein. You may describe either cell-dependent or cell-free methods, or both.

It could be used both tecnologies cell-dependent system and cell-free system. In a cell-dependent system, the optimized LbCas12a gene will be cloned into an expression vector such as pET-28a(+) and transformed into E. coli expression strains (for example, BL21(DE3)). Once inside the bacteria, the DNA is first transcribed into messenger RNA (mRNA) by RNA polymerase. In the pET system, transcription is typically driven by the T7 promoter, which is activated after induction (commonly with IPTG). The mRNA is then translated by ribosomes in the cytoplasm, where transfer RNAs (tRNAs) recognize codons and incorporate the corresponding amino acids to synthesize the LbCas12a protein. After translation, the protein can be purified, often using Nickle affinity and cation exchange chromatography.

In a cell-free expression system, the same DNA template (plasmid or linear DNA) is added to a reaction mixture containing the molecular machinery required for transcription and translation. This mixture typically includes RNA polymerase, ribosomes, tRNAs, amino acids, nucleotides, enzymes, and energy sources. The DNA is transcribed into mRNA in vitro, and ribosomes translate the mRNA into protein directly in the reaction tube. Cell-free systems allow faster protein production and easier control of reaction conditions, which is useful for rapid prototyping or diagnostic applications.

3.5. [Optional] How does it work in nature/biological systems?

Describe how a single gene codes for multiple proteins at the transcriptional level. Try aligning the DNA sequence, the transcribed RNA, and also the resulting translated Protein!!! See example below. [Example shows the biomolecular flow in central dogma from DNA to RNA to Protein] Special note that all “T” were transcribed into “U” and that the 3-nt codon represents 1-AA.

  1. Splicing can generate different mRNA variants with different exon organization, resulting in the production of different proteins. Alternative promoters within a single gene or different transcription start sites can also generate different protein isoforms.

  2. Accessibility text Accessibility text

Part 4: Prepare a Twist DNA Synthesis Order

In this case, I can order the synthetic fragment (LbCas12a gene) containing the restriction sites and clone it into the expression vector pET-28a(+) using the restriction enzymes BamHI and XhoI.

Order from Twist

Accessibility text Accessibility text

Insert and desired expression vector

Accessibility text Accessibility text

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank). DNA-based digital data storage technology. Source: Archives in DNA: Workshop Exploring Implications of an Emerging Bio-Digital Technology through Design Fiction - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/DNA-based-digital-data-storage-technology_fig1_353128454 [accessed 11 Feb 2025]

For the development of a field-deployable CRISPR platform to detect Aster Yellows Phytoplasma (AYP), I would sequence DNA from symptomatic garlic plants that test positive for AYP. I plan to do this in two ways: (1) cloning and sequencing three molecular marker genes to confirm AYP identity and compare strains, and (2) whole-genome sequencing to fully characterize the entire AYP genome from field isolates. These sequencing approaches are important because they reveal conserved and variable regions of the pathogen, which helps me design guide RNAs that are specific, sensitive, and robust across different AYP strains found in real agricultural samples.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?

Also answer the following questions: Is your method first-, second- or third-generation or other? How so? What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)? What is the output of your chosen sequencing technology?

Accessibility text Accessibility text
5.2 DNA Write

(i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :) See some famous examples of DNA design DNA origami by Paul W. K. Rothemund, California Institute of Technology, 2004. 100 nanometers in diameter.

For now I want to synthetize the individual gene LbCas12a to clone in E.coli and produce and purify it. Also, it could be a genetic circuit in a cell free-system where LbCas12a generates itself or by using Toehold switch system for detection. (ii) What technology or technologies would you use to perform this DNA synthesis and why? Also answer the following questions: What are the essential steps of your chosen sequencing methods? What are the limitations of your sequencing method (if any) in terms of speed, accuracy, scalability?

(i) For this project, I would like to synthesize the individual gene encoding LbCas12a (from Lachnospiraceae bacterium) for heterologous expression in E. coli. The goal is to clone the codon-optimized gene into an expression vector (such as pET-28a(+)), produce the recombinant protein, and purify it for CRISPR-based detection of AYP. Synthesizing this gene allows me to optimize codon usage for E. coli and remove unwanted restriction sites. In addition to expressing LbCas12a as a standalone protein, I am also interested in synthesizing a genetic circuit for a cell-free system. This system could integrate Cas12a or toehold switch technology. In this design, the DNA construct could encode LbCas12a under a T7 promoter, enabling in vitro transcription and translation directly within a cell-free reaction. This would allow on-demand production of the CRISPR effector enzyme during the diagnostic reaction itself, reducing the need for pre-purified protein. I do not yet have a fully defined design, but I plan to refine and optimize the system as the project progresses.

(ii) To synthesize the LbCas12a gene and the genetic circuit constructs, I would use commercial gene synthesis services, such as those provided by Twist Bioscience (synthetic fragments or clonal genes). Twist uses high-throughput, phosphoramidite silicon-based DNA synthesis technology to chemically synthesize DNA fragments with high accuracy and scalability.

Gene synthesis typically follows these essential steps:

  1. Oligonucleotide synthesis Short DNA oligonucleotides (~60–200 nt) are chemically synthesized using phosphoramidite chemistry.

  2. Oligo assembly Overlapping oligos are assembled into longer double-stranded DNA (dsDNA) fragments.

  3. Error correction and amplification Gene fragments are purified and subjected to quality control, including size and length verification. Sequence verification may be performed by next-generation sequencing (NGS), and enzymatic error correction combined with PCR amplification improves sequence fidelity. The validated gene fragments are then ready to be shipped. For clonal genes, additional cloning and bacterial amplification steps may be required.

  4. Cloning into a vector The synthesized gene is inserted into a plasmid backbone for propagation and expression.

  5. Sequence verification The final construct is validated by Sanger sequencing or next-generation sequencing to confirm sequence accuracy.

  6. Although commercial DNA synthesis is highly efficient and accurate, it has several limitations. Errors can still occur during oligonucleotide synthesis and assembly, particularly in repetitive sequences or regions with very high GC content, which may require additional verification and correction. The synthesis of very long DNA constructs can be technically challenging and more expensive, as longer sequences increase the probability of errors and assembly difficulties. Turnaround time may also vary depending on the complexity and size of the construct, sometimes requiring several days to weeks.

5.3 DNA Edit

(i) What DNA would you want to edit and why? In class, George shared a variety of ways to edit the genes and genomes of humans and other organisms. Such DNA editing technologies have profound implications for human health, development, and even human longevity and human augmentation. DNA editing is also already commonly leveraged for flora and fauna, for example in nature conservation efforts, (animal/plant restoration, de-extinction), or in agriculture (e.g. plant breeding, nitrogen fixation). What kinds of edits might you want to make to DNA (e.g., human genomes and beyond) and why? Colossal Biosciences Inc., a biotechnology company using genetic engineering to de-extinct various historic animals such as the woolly mammoth, dodo, and dire wolf.

(ii) What technology or technologies would you use to perform these DNA edits and why? Also answer the following questions: How does your technology of choice edit DNA? What are the essential steps? What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing? What are the limitations of your editing methods (if any) in terms of efficiency or precision?

(i) I am interested in editing plant genomes to improve agriculturally important traits such as disease resistance, stress tolerance (drought, heat), and yield stability. Specifically, I would target genes involved in plant-pathogen interactions, immune signaling pathways, and susceptibility (S) genes that facilitate pathogen infection. (ii) I would use CRISPR-Cas systems, delivered through engineered plant viruses. CRISPR-Cas is highly programmable, efficient, and relatively simple to design compared to earlier genome-editing tools (e.g., ZFNs or TALENs). Using plant viruses as delivery vectors allows systemic spread of guide RNAs throughout the plant, enabling editing directly in planta. First, the target gene sequence must be identified and analyzed to select an appropriate editing site adjacent to a compatible PAM sequence. Guide RNAs (gRNAs) are then computationally designed to ensure high specificity and minimal off-target effects. The selected guide sequence is cloned into a plant viral vector capable of systemic infection. The input materials include the guide RNA construct, a Cas nuclease (either stably expressed in the plant or delivered separately), the engineered viral vector, and the target plant tissue. Once delivered into plant cells, the Cas protein and guide RNA form a ribonucleoprotein complex that recognizes the target DNA sequence and introduces a double-strand break, which is subsequently repaired by the plant’s endogenous DNA repair mechanisms, resulting in the desired mutation. Editing efficiency can vary depending on plant species, en vironemtnal conditions and viral delivery efficiency. Off-target mutations may occur if guide RNAs are not carefully designed. Precise edits using HDR are typically inefficient in plants compared to knockouts generated through NHEJ. Additionally, viral vectors may have cargo size limitations and may not efficiently infect all plant tissues or germline cells, which can affect heritability of edits.

Week 3 HW: Lab Automation

My Homework

DNA!

Assignment: Python Script for Opentrons Artwork

Your task this week is to Create a Python file to run on an Opentrons liquid handling robot. Review this week’s recitation and this week’s lab for details on the Opentrons and programming it. Generate an artistic design using the GUI at opentrons-art.rcdonovan.com. Using the coordinates from the GUI, follow the instructions in the HTGAA26 Opentrons Colab to write your own Python script which draws your design using the Opentrons.

You may use AI assistance for this coding — Google Gemini is integrated into Colab (see the stylized star bottom center); it will do a good job writing functional Python, while you probably need to take charge of the art concept. If you’re a proficient programmer and you’d rather code something mathematical or algorithmic instead of using your GUI coordinates, you may do that instead. If the Python component is proving too problematic even with AI and human assistance, download the full Python script from the GUI website and submit that:

If you use AI to help complete this homework or lab, document how you used AI and which models made contributions.

Sign up for a robot time slot if you are at MIT/Harvard/Wellesley or at a Node offering Opentrons automation. The Python script you created will be run on the robot to produce your work of art!

At MIT/Harvard? Lab times are on Thursday Feb.19 between 10AM and 6PM. At other Nodes? Please coordinate with your Node.

Submit your Python file via this form.

Yes, I used AI to help generate my code. The majority of the script was generated using ChatGPT. I provided detailed instructions describing what I wanted to achieve specifically, generating a Yin-Yang pattern using the Opentrons platform, including how I wanted the shape to look and how the points should be distributed. However, I still needed to review, test, and modify the code to ensure it behaved as expected. Some adjustments were required to correct positioning and curvature details.

My Google Colab

Post-Lab Questions — DUE BY START OF FEB 24 LECTURE

One of the great parts about having an automated robot is being able to precisely mix, deposit, and run reactions without much intervention, and design and deploy experiments remotely.

For this week, we’d like for you to do the following:

Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.

Malcı et al. (2026) developed Slowpoke, an automated Golden Gate cloning workflow for the Opentrons OT-2 and Flex platforms. The system automates DNA assembly, transformation, plating, and colony PCR, with manual colony picking and plasmid purification. The authors validated the workflow by assembling Level 1 transcription units in Saccharomyces cerevisiae using 19 promoters driving sfGFP expression, and five-part GFP constructs in Bacillus subtilis with different promoter and RBS combinations. They further demonstrated scalability by constructing 62 assemblies encoding secreted recombinant proteins such as endolysin and scFv fragments. Assembly efficiencies exceeded 90% for most constructs, demonstrating that affordable liquid-handling automation can enable high-throughput synthetic biology applications.

Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details. While your description/project idea doesn’t need to be set in stone, we would like to see core details of what you would automate. This is due at the start of lecture and does not need to be tested on the Opentrons yet. Example 1: You are creating a custom fabric, and want to deposit art onto specific parts that need to be intertwined in odd ways. You can design a 3D printed holder to attach this fabric to it, and be able to deposit bio art on top. Check out the Opentrons 3D Printing Directory. Example 2: You are using the cloud laboratory to screen an array of biosensor constructs that you design, synthesize, and express using cell-free protein synthesis. Echo transfer biosensor constructs and any required cofactors into specified wells. Bravo stamp in CPFS reagent master mix into all wells of a 96-well / 384-well plate. Multiflo dispense the CFPS lysate to all wells to start protein expression. PlateLoc seal the plate. Inheco incubate the plate at 37°C while the biosensor proteins are synthesized. XPeel remove the seal. PHERAstar measure fluorescence to compare biosensor responses.

For my final project, I intend to use automation tools to optimize a CRISPR-Cas12a diagnostic platform that combines RPA (Recombinase Polymerase Amplification) with cell-free CRISPR detection. The goal is to systematically screen primers, reaction conditions, and crRNAs using high-throughput liquid handling to accelerate assay development. RPA is an isothermal amplification reaction performed at 37-42°C (typically ~39°C). Small changes in primer design, Mg²⁺ concentration, and template input significantly affect performance. Instead of optimizing conditions manually, I will automate reaction matrix setup using an Opentrons OT-2 (or Flex). The robot will:

  • Dispense multiple primer pairs across rows
  • Generate Mg²⁺ concentration gradients across columns
  • Add template dilution series
  • Add RPA master mix

Plates will be sealed and incubated at controlled isothermal temperature using a thermocycler module or external heater. After identifying optimal RPA conditions, I will automate screening of CRISPR detection reactions in a cell-free format.

Variables to test include:

  • Multiple crRNAs targeting different regions
  • Cas12a enzyme concentrations
  • Reporter concentrations
  • Positive and negative controls

Echo transfer RPA products into designated wells. Echo transfer crRNA variants into specific wells. Bravo dispense Cas12a master mix across plate. Multiflo add fluorescent ssDNA reporter. PlateLoc seal plate. Inheco incubate at 37°C for collateral cleavage activation. PHERAstar measure fluorescence kinetics.

Fluorescence over time will be used to calculate signal-to-noise ratios and rank crRNA performance.

Using Ginkgo Nebula or a cloud lab platform, I could upload a crRNA design library and screen 50-200 crRNAs in parallel, integrating automated synthesis, liquid handling, and fluorescence analytics. This would significantly accelerate crRNA optimization.

Final Project Ideas — DUE BY START OF FEB 24 LECTURE

As explained in this week’s recitation, add 1-3 slides in your Node’s section of this slide deck with 3 ideas you have for an Individual Final Project. Be sure to put your name, city, and country on your slide!

Week 4 HW: Protein Design Part I

My homework

DNA!

Part A. Conceptual Questions

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

  1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
Accessibility text Accessibility text
  1. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Humans do not become cows or fish after eating them because the proteins we consume are first broken down into their basic components during digestion. Although all living organisms use the same 20 standard amino acids as building blocks of proteins, these proteins are digested into individual amino acids by enzymes such as pepsin in the stomach and trypsin in the small intestine. These amino acids are then absorbed and reused by our cells to synthesize new proteins according to the instructions encoded in human DNA. Therefore, the amino acids from beef or fish are simply raw materials that our bodies use to build human proteins.

  1. Why are there only 20 natural amino acids?

According to Doig (2016), the amino acids were selected during the early “RNA world” because they allow proteins to form stable, soluble structures with tightly packed hydrophobic cores and functional binding pockets. The chosen amino acids cover a wide range of properties such as size, charge, hydrophobicity, and functional groups, which enable proteins to have diverse and stable 3D structures. Other possible amino acids were likely excluded because they would not improve protein stability, solubility, or biosynthetic efficiency and cost more conformational entropy to fold than their branched isomers.

  1. Where did amino acids come from before enzymes that make them, and before life started?

Before life began and before enzymes existed, amino acids likely formed through prebiotic chemical reactions under early Earth conditions. According to Kirschning (2022), amino acids have been detected in meteorites, suggesting that some were synthesized in space and delivered to Earth through extraterrestrial material. In addition, amino acids could be produced on the early Earth through abiotic reactions involving simple molecules such as CO₂, NH₃, H₂O, and other small compounds under energy sources like lightning, UV radiation, or hydrothermal activity.

  1. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

Natural proteins are made of L-amino acids, which typically form right-handed α-helices. If the chirality is reversed and D-amino acids are used, the geometry is mirrored, producing a left-handed α-helix.

  1. Can you discover additional helices in proteins?

Yes, additional helices in proteins can be identified using experimental structural techniques such as X-ray crystallography and cryo-electron microscopy, as well as computational prediction tools like AlphaFold and ESMFold.

  1. Why are most molecular helices right-handed?

Most molecular helices in proteins are right-handed because proteins are composed of L-amino acids. The stereochemistry of L-amino acids favors backbone conformations that form stable hydrogen bonds between the carbonyl oxygen of residue i and the amide hydrogen of residue i+4, which stabilizes a right-handed α-helix. A left-handed helix would create unfavorable steric interactions for L-amino acids, so the right-handed α-helix is energetically preferred.

  1. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

β-sheets tend to aggregate because they form hydrogen-bonding networks that can interact with neighboring β-strands from other protein molecules. The main driving forces for β-sheet aggregation are hydrogen bonding and hydrophobic interactions, which allow β-strands from different proteins to stack together into stable sheet-like structures.

  1. Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

Many amyloid diseases involve β-sheet structures because misfolded proteins can reorganize into highly stable cross-β sheet fibrils. In these structures, β-strands align perpendicular to the fibril axis and form repetitive hydrogen-bond networks between protein molecules. Amyloid β-sheet structures can be used as biomaterials because they form extremely stable, self-assembling nanofibers with high mechanical strength.

Part B: Protein Analysis and Visualization

In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:

  1. Briefly describe the protein you selected and why you selected it.

I chose the enzyme LbCas12a because I am going to develop a field-deployable diagnostic platform that is user-friendly and low-cost for growers using the CRISPR-Cas12a system.

  1. Identify the amino acid sequence of your protein. How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

Sequence length: 1,228 AA. According to google Colab script, the most common amino acid is: K, which appears 154 times.

Accessibility text Accessibility text

How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs. Does your protein belong to any protein family? According to Interpro, family membership: CRISPR-associated endonuclease Cas12a.

Accessibility text Accessibility text

NCBIFam: type V CRISPR-associated protein Cas12a/Cpf1

  1. Identify the structure page of your protein in RCSB When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å) The structure was deposited in July 2019 and solved at 3.25 Å resolution by cryo-EM. This is considered a good-quality structure for electron microscopy.
Accessibility text Accessibility text

Are there any other molecules in the solved structure apart from protein? Yes, it is a complex. There are several molecules such as AcrVA4 protein, Mg+2 ion and RNA (42-MER) (crRNA).

Does your protein belong to any structure classification family? When I enter the PDB ID in the search bar, no results appear.

  1. Open the structure of your protein in any 3D molecule visualization software: PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands) Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.

Cartoon

Accessibility text Accessibility text

Ribbon

Accessibility text Accessibility text

Ball and stick

Accessibility text Accessibility text

Color the protein by secondary structure. Does it have more helices or sheets? The protein has more helices (red) than sheets (yellow). It is predominantly alpha-helical, with fewer beta-sheet regions.

Accessibility text Accessibility text

Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

Script

Hydrophobic (yellow) select hydrophobic, resn ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO color yellow, hydrophobic

Polar without charge (cyan) select polar, resn SER+THR+ASN+GLN+TYR color cyan, polar

Negatives (red) select negative, resn ASP+GLU color red, negative

Positives (blue) select positive, resn LYS+ARG+HIS color blue, positive

After coloring the protein by residue type, hydrophobic residues (ALA, VAL, LEU, ILE, MET, PHE, TRP, PRO) are mainly located in the interior of the protein structure. These residues form the structural core and help stabilize the protein through hydrophobic interactions. In contrast, hydrophilic and charged residues (SER, THR, ASN, GLN, TYR, ASP, GLU, LYS, ARG, HIS) are predominantly exposed on the surface. Many positively charged residues are concentrated in the nucleic acid-binding region, which is consistent with the protein’s function in binding negatively charged RNA and DNA molecules. Overall, the distribution shows a typical soluble protein organization: a hydrophobic core and a hydrophilic, charged surface.

Accessibility text Accessibility text

Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Yes, the protein surface shows several cavities and grooves. There is a prominent cleft that likely corresponds to the nucleic acid-binding pocket of Cas12a. These surface indentations are consistent with functional binding sites.

Accessibility text Accessibility text

Part C. Using ML-Based Protein Design Tools

In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein. Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a colab instance with GPU. Choose your favorite protein from the PDB. We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:

C1. Protein Language Modeling

  1. Deep Mutational Scans
  2. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
Accessibility text Accessibility text
  1. Can you explain any particular pattern? (choose a residue and a mutation that stands out)

A notable pattern is observed around position 845, where most substitutions show strongly negative scores, indicating high intolerance to mutation. This suggests that the residue at this position is under strong evolutionary constraint, likely contributing to structural stability or functional activity. In contrast, other regions display more neutral scores, suggesting mutational flexibility.

Accessibility text Accessibility text
  1. (Bonus) Find sequences for which we have experimental scans, and compare the prediction of the language model to experiment.

Database with experimental deep mutational scans datasets: https://www.mavedb.org/ and https://proteingym.org/

I used the GB1 protein deep mutational scanning (DMS) dataset from Olson et al. (2014), available in MaveDB. This dataset contains experimental measurements of how single amino acid mutations affect the function of a 55 amino acid protein. I used this protein sequence established in the article: protein_sequence = “QYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE” I generated mutation scores for the same protein using the ESM2 language model. For each possible single mutation, I calculated how likely the mutation is compared to the wild-type amino acid. Then, I compared the model scores with the experimental functional scores using Spearman correlation. Results: Spearman ρ = -0.104 p-value = 8 × 10⁻⁴

Accessibility text Accessibility text

This shows a very weak negative correlation between the model predictions and the experimental data. In simple terms, the language model does not strongly predict which mutations improve or reduce protein function in this case. This may be because: The model is trained to predict natural sequence patterns, not experimental fitness. Protein function depends on structural and biophysical effects that may not be fully captured by sequence likelihood.

  1. Latent Space Analysis

  2. Use the provided sequence dataset to embed proteins in reduced dimensionality.

  3. Analyze the different formed neighborhoods: do they approximate similar proteins?

Accessibility text Accessibility text

The formed neighborhoods approximate structurally similar proteins rather than taxonomically related ones. For example, sequences from Francisella tularensis, Campylobacter jejuni, Streptococcus pyogenes, Vibrio cholerae, and Bacillus subtilis cluster together despite belonging to different bacterial species. All share the SCOP classification a.4.5.0, indicating a common structural fold. This suggests that the embedding captures structural similarity across evolutionary distance.

  1. Place your protein in the resulting map and explain its position and similarity to its neighbors.
Accessibility text Accessibility textAccessibility text Accessibility text

After embedding SCOP domain proteins using ESM2 and reducing dimensionality with t-SNE, I placed LbCas12a into the same latent space. The SCOP proteins form a broad, continuous distribution rather than sharply separated clusters, reflecting the diversity of protein folds in the dataset. LbCas12a appears within the overall distribution rather than being completely isolated. It lies near a local group of SCOP proteins, suggesting that the language model identifies some similarity in sequence patterns or structural features between LbCas12a and those neighboring proteins. Because embeddings were generated using mean pooling over the entire sequence, the model likely captures global sequence properties rather than specific catalytic functions. Therefore, proximity in latent space reflects general evolutionary and structural similarity rather than precise functional classification.

C2. Protein Folding

Folding a protein

  1. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

Since LbCas12a is a very large protein, running the ESMfold and inverse folding analyses would be computationally demanding. For this reason, I chose TEM-1 β-lactamase for parts C2 and C3 of the assignment, as its smaller size makes the analysis faster and easier to perform. The structure predicted by ESMFold closely matches the experimentally determined structure. After structural alignment in PyMOL, the RMSD between the predicted and experimental coordinates was 0.44 Å across 252 aligned atoms, indicating a very high structural similarity. The overall fold and arrangement of secondary structure elements are almost identical. Minor deviations occur mainly in flexible terminal regions, which are typically less structured and more difficult to predict accurately. The ESMFold prediction had a high average pLDDT score (~95.9), indicating very high confidence in the predicted structure.

Accessibility text Accessibility text β-lactamase sequence structure from RCSB PDB visualized in Pymol

Accessibility text Accessibility text β-lactamase sequence structure from ESMFold visualized in Pymol

Accessibility text Accessibility text Structural alignment between the experimental β-lactamase structure and the structure predicted by ESMFold in PyMOL

  1. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

I tested the structural resilience of the protein by introducing point mutations predicted by the mutational scan. I first mutated residue 68 to glutamate (E), which was predicted to be unfavorable (blue), and then mutated residue 59 to aspartate (D), which was predicted to be tolerated (green). For each mutant sequence, I predicted the structure using ESMFold and aligned it with the original predicted structure in PyMOL. The alignments produced extremely low RMSD values (0.057 Å for the mutation at position 68 and 0.038 Å for the mutation at position 59), indicating that the global protein fold remained essentially unchanged. These results suggest that the protein structure is highly resilient to single amino-acid substitutions, as these mutations do not significantly alter the overall structure.

Accessibility text Accessibility text Residue 68 was mutated to glutamate (E), and the mutated structure was aligned with the original ESMFold-predicted structure using PyMOL.

Accessibility text Accessibility text Residue 59 was mutated to aspartate (D), and the mutated structure was aligned with the original ESMFold-predicted structure using PyMOL.

I then mutated a larger segment (residues 8-14) by replacing them with glutamate residues. The initial alignment showed larger deviations, suggesting local structural changes. However, after alignment refinement, the core structure still aligned well with a low RMSD (~0.056 Å), indicating that the overall fold remained stable. These results suggest that the protein structure is highly resilient to both point mutations and small segment mutations, with most structural changes occurring in flexible regions rather than the stable core.

Accessibility text Accessibility text 14 residues were mutated to glutamate (E), and the mutated structure was aligned with the original ESMFold-predicted structure using PyMOL.

C3. Protein Generation

Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN

  1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

Below is the sequence generated by ProteinMPNN.

1ZG4, score=1.4111, fixed_chains=[], designed_chains=[‘A’], model_name=v_48_020 HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPVAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGXERGSRGIIAALGPXDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW T=0.1, sample=0, score=0.7625, seq_recovery=0.4867 SAAVLDVVRAAEERLGAPVGLVVMDLATGEVLAEYNANALFPLNSTSKVFLAAYVLHLVDRGELSLDQLVKYDESDIVPNSPVTKKHIERGMTVRELMEAAIQHSDNTAYNLLLKLIGGPEALTAWLRSIGDTVTRLTHYEPELKANVPGDTADTGTPLATATTLRTILTGDVLSPESRALLREFMAANTVCGARFPSALPAGWALYCKCGSGXKNGSMNIVAAFGPXNGAPTHIVVIFTSGSKASLAEIEAAIAEIAAAIIAHL

Accessibility text Accessibility text

The amino-acid probability heatmap shows the probability of each residue at every position of the protein backbone. Positions with strong probability peaks correspond to residues that are strongly preferred by the model, indicating structurally constrained sites such as buried residues or catalytic positions. Positions with lower probabilities across many amino acids correspond to more flexible or surface-exposed regions. The designed sequence generated by ProteinMPNN has a score of 0.7625, which represents the negative log-likelihood of the sequence given the backbone structure, with lower scores indicating better structural compatibility. The sequence recovery of 0.4867 indicates that approximately 48.7% of the residues match the native sequence, demonstrating that multiple sequences can adopt the same protein fold.

  1. Input this sequence into ESMFold and compare the predicted structure to your original.
Accessibility text Accessibility text

The sequence generated by the inverse folding model (ProteinMPNN) was used as input for ESMFold to predict its structure. The predicted structure was then aligned with the original crystal structure (PDB: 1ZG4) using PyMOL. Despite the sequence recovery being approximately 48%, the structural alignment shows a very low RMSD value of 0.426 Å, indicating that the two structures are nearly identical. This result demonstrates that different amino-acid sequences can adopt very similar three-dimensional folds, highlighting that protein structure is often more conserved than sequence.

Accessibility text Accessibility text

References

Balasco, N., Diaferia, C., Morelli, G., Vitagliano, L., & Accardo, A. (2021). Amyloid-like aggregation in diseases and biomaterials: Osmosis of structural information. Frontiers in Bioengineering and Biotechnology, 9. https://doi.org/10.3389/fbioe.2021.641372

Cheng, P.-N., Liu, C., Zhao, M., Eisenberg, D., & Nowick, J. S. (2012). Amyloid β-sheet mimics that antagonize protein aggregation and reduce amyloid toxicity. Nature Chemistry, 4(11), 927–933. https://doi.org/10.1038/nchem.1433

Doig, A. J. (2017). Frozen, but no accident – why the 20 standard amino acids were selected. The FEBS Journal, 284(9), 1296–1305. https://doi.org/10.1111/febs.13982

Kirschning, A. (2022). On the evolutionary history of the twenty encoded amino acids. Chemistry – A European Journal, 28(55). https://doi.org/10.1002/chem.202201419

Week 5 HW: Protein Design part II

My Homework

Part 1: Generate Binders with PepMLM

  • Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.
  • Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:
    • Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.
    • To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.
    • Record the perplexity scores that indicate PepMLM’s confidence in the binders.

sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2 MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Mutated sequence_SODC_HUMAN MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Accessibility text Accessibility text

Pseudo-perplexity values represent PepMLM’s confidence in each peptide sequence. Lower values indicate sequences that are more consistent with patterns learned by the model. In our results, the generated peptides showed lower perplexity scores (6.2-15.1) compared with the known SOD1-binding peptide FLYRWLPSRRGG (20.6), suggesting the model considers the generated sequences more probable under its learned sequence distribution.

In the PepMLM study, most experimentally validated peptide binders exhibit pseudo-perplexity values below ~40, with many falling between approximately 5 and 20. Lower PPL values indicate higher model confidence in the peptide-protein interaction. In our results, the generated peptides showed PPL values between 6.2 and 15.1, which fall within the range reported in the paper, while the known SOD1-binding peptide had a higher PPL of 20.6, suggesting lower confidence by the model.

Part 2: Evaluate Binders with AlphaFold3

  • Navigate to the AlphaFold Server: alphafoldserver.com
  • For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.
  • Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?
  • In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.

The amino acid X represents an unknown or unspecified residue predicted by PepMLM. Since AlphaFold does not accept ambiguous residues, alanine (A) was used as a replacement. Alanine is commonly used in protein modeling because it has a small, non-reactive side chain, which minimizes steric and chemical effects on the predicted structure. This makes it a reasonable neutral placeholder that is unlikely to strongly bias peptide-protein interactions in the model.

IndexPeptideiPTMBinding description
0WRYGATGVAHKK0.80Peptide binds on the surface of the β-barrel of one SOD1 monomer, away from the N-terminus. The interaction appears surface-bound and does not strongly approach the dimer interface.
1WRYPATALRHKX0.87Peptide binds along the β-barrel surface of one monomer, distant from the N-terminal region. The peptide remains solvent-exposed and mostly surface-associated and it is away from the dimer interface.
2WHYPAAGVEHGX0.65Peptide interacts with loop regions on the protein surface, away from the N-terminus, β-barrel core and dimer interface. Binding appears weak and loosely surface-bound.
3WHYYATGAAHGX0.86Peptide binds near the β-barrel surface, not close to the N-terminal A4V region and dimer interface. Interaction occurs on the outer protein surface.
4FLYRWLPSRRGG (known binder)0.89Peptide binds close to the N-terminus where the A4V mutation is located and lies along the β-barrel surface near the dimer interface.

0-WRYGATGVAHKK & SOD1 Mutant

Accessibility text Accessibility text

1-WRYPATALRHKX & SOD1 Mutant

Accessibility text Accessibility text

2-WHYPAAGVEHGX & SOD1 Mutant

Accessibility text Accessibility text

3-WHYYATGAAHGX & SOD1 Mutant

Accessibility text Accessibility text

4-FLYRWLPSRRGG & SOD1 Mutant

Accessibility text Accessibility text

In AlphaFold predictions, pTM reflects confidence in the overall fold of the protein complex, while ipTM measures the confidence of the interface between interacting chains. ipTM values above ~0.8 indicate high-confidence interfaces, values between 0.6-0.8 represent moderate confidence, and values below 0.6 suggest unreliable interactions. The predicted complexes show ipTM values between 0.65 and 0.89, indicating mostly moderate to high confidence. Most PepMLM-generated peptides bind to surface regions of the SOD1 β-barrel, but do not localize near the N-terminal A4V region and dimer interface. Only the known binder FLYRWLPSRRGG binds close to the N-terminus and near the dimer interface, and it also produces the highest ipTM score (0.89). Although some generated peptides show relatively high ipTM values (0.87 and 0.86), none match or exceed the known binder in both binding location and interface confidence.

Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse

  • Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:
    • Paste the peptide sequence.
    • Paste the A4V mutant SOD1 sequence in the target field.
    • Check the boxes
    • Predicted binding affinity
    • Solubility
    • Hemolysis probability
    • Net charge (pH 7)
    • Molecular weight
  • Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties?
  • Choose one peptide you would advance and justify your decision briefly.

0-WRYGATGVAHKK

Accessibility text Accessibility text

1-WRYPATALRHKX X→ Alanine

Accessibility text Accessibility text

2-WHYPAAGVEHGX X→ Alanine

Accessibility text Accessibility text

3-WHYYATGAAHGX X→ Alanine

Accessibility text Accessibility text

4-FLYRWLPSRRGG (known binder)

Accessibility text Accessibility text

The PeptiVerse analysis shows that all peptides are predicted to be highly soluble (probability = 1.0) and non-hemolytic, indicating favorable safety and formulation properties for therapeutic development. The predicted binding affinities for all peptides fall within a similar range (pKd/pKi ≈ 5.1-5.7), suggesting relatively weak binding interactions overall. These affinity predictions do not perfectly correlate with the structural confidence observed in AlphaFold3. For example, peptide WRYPATALRHKX (index 1) showed a high structural interface score in AlphaFold (ipTM ≈ 0.87) but has the lowest predicted affinity (5.168) in PeptiVerse. Conversely, the known binder FLYRWLPSRRGG shows the highest predicted affinity (5.968) and also had the highest AlphaFold interface confidence. The peptides also fall within a similar molecular weight range (~1200-1500 Da) and most carry a moderate positive charge, which may favor protein-protein interactions. Although several PepMLM-generated peptides show similar therapeutic properties, peptide WHYYATGAAHGX (index 3) provides the best balance between structural and biochemical predictions. It shows a relatively high AlphaFold interface confidence (ipTM ≈ 0.86) together with weaker predicted binding affinity (pKd/pKi ≈ 5.71, but higher than the other ones). In addition, it has favorable therapeutic properties including high solubility, low hemolysis probability, and a smaller molecular weight. Its near-neutral net charge may also reduce nonspecific interactions compared with more positively charged peptides.

Part 4: Generate Optimized Peptides with moPPIt

Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.

  1. Open the moPPit Colab linked from the HuggingFace moPPIt model card
  2. Make a copy and switch to a GPU runtime.
  3. In the notebook:
    1. Paste your A4V mutant SOD1 sequence.
    2. Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
    3. Set peptide length to 12 amino acids.
    4. Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.
  4. After generation, briefly describe how these moPPit peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?

The peptides generated with moPPIt differ from the PepMLM peptides because they are not just sampled based on sequence likelihood, but are actively guided toward multiple objectives, including binding affinity, motif constraints, solubility, and hemolysis. As a result, moPPIt peptides (e.g., GHSYFRGCGTYW, YYTMTCYTTIPY) tend to show higher predicted affinity scores (~6-7.3) compared to PepMLM peptides (~5-6), and also include specific sequence features (motifs) that target selected regions of SOD1. However, this optimization comes with trade-offs. Some moPPIt peptides show high hemolysis probabilities (~0.89-0.97) and only moderate solubility, indicating that improving binding and motif targeting may negatively impact therapeutic safety. In contrast, PepMLM peptides were generally non-hemolytic and highly soluble, but had weaker predicted binding. Before advancing these peptides to clinical studies, structural validation using AlphaFold or docking should confirm that the peptides bind to the intended region on SOD1. Second, in vitro assays (e.g., binding affinity measurements such as SPR or ITC) should verify the predicted interactions. Third, toxicity and stability assays should be performed to assess hemolysis, aggregation, and degradation. Finally, promising candidates would need to be optimized to balance binding strength with safety and pharmacological properties.

PeptideHemolysisSolubilityAffinityMotif
GHSYFRGCGTYW0.950.837.130.64
TDSQMRKFGPFY0.890.666.050.69
YYTMTCYTTIPY0.910.757.280.82
SFGKTCVKTEQV0.980.756.770.90

Week 6 HW: Genetic Circuits Design part I

My Homework

DNA Assembly

  • Answer these questions about the protocol in this week’s lab:

    1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?

Phusion High-Fidelity PCR master mix contains Phusion DNA polymerase, deoxynucleotides and HF reaction buffer with MgCl2. This master mix is used for long or difficult PCR amplifications and applications where high sequence fidelity is critical, such as cloning, mutagenesis, or amplicon sequencing.

2. What are some factors that determine primer annealing temperature during PCR?

It is determined by temperature of melting of the primers which depends mainly on the GC content, length of the primer and sequence composition. Salt concentrations and secondary structures could affect primer annealing temperature as well.

3. There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.

4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?

After PCR, digestion and purification, DNA fragments should be analyzed by agarose gel electrophoresis to confirm the correct size of each fragment. DNA concentration can be measured (e.g., by Nanodrop or Qubit) to ensure proper stoichiometry. It is recommended sequencing the fragment before Gibson assembly to ensure that sequence is free of errors.

5. How does the plasmid DNA enter the E. coli cells during transformation?

It is not necessary to purify Gibson assembly reactions because chemical transformation was performed. Chemical competent cells contain salts such as CaCl2 and MgCl2. Positive ions neutralizes negative charges of DNA and phospholipids in the cell membrane allowing to create a channel during heat shock. Heat-shock creates temporary pores allowing plasmid DNA enter the cells during transformation.

6. Describe another assembly method in detail (such as Golden Gate Assembly)
    1. Explain the other method in 5 - 7 sentences plus diagrams (either handmade or online).
    2. Model this assembly method with Benchling or Asimov Kernel!