🪁 Michael Yang, Spring 2026

Toad.gif Toad.gif

Toad (1970).


Hello! I’m Michael.

I live in sunny Los Angeles, California. I enjoy learning, programming, and running.

I’m also starting up an HTGAA paper reading group. We’d love to have you join us!

More about me:


Homework

Labs

Projects

Subsections of 🪁 Michael Yang, Spring 2026

Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

    Class assignment 1. First, describe a biological engineering application or tool you want to develop and why. I’m heavily inspired by Professor Jacobson’s call for a “bio-FPGA” tool, as well as his lecture about cellular automata. I’d like to develop a bio-FPGA that can be programmed to grow into arbitrary 2D patterns on a petri dish, using the machine learning technique mentioned in the lecture to reverse learn the cellular automata rules for growing a specific pattern. The learned CA rules can be encoded by genetically programming the bio-FPGA then using bacteria with the genes to grow an actual cell culture into the pattern, like the butterfly wing letter patterns in the lecture. If this is feasible, 3D patterns would be the next step, and one might even imagine a wild future of programmable plants that grow into the shapes of houses and furniture.

  • Week 2 HW: DNA Read, Write, and Edit

    Part 0: Basics of Gel Electrophoresis I watched the lecture, recitation, and read the lab. Essentially, we use the negative charge of DNA to pull DNA fragments towards a positive anode in a porous agarose gel. Larger DNA fragments move slower in the agarose gel. Part 1: Benchling & In-silico Gel Art I spent some time playing around with Ronan’s gel art site to make a pattern (below on the left). I noticed that some of the restriction enzymes in the gel art tool weren’t on the HTGAA enzyme list, so I didn’t use them.

  • Week 3 HW: Lab Automation

    Opentrons Artwork My artwork is here: https://rcdonovan.com/?id=vmns94wqt45wpqc I used Ronan’s tool to make this. I uploaded an image of tomatoes but it didn’t render well, so I modified it significantly by hand with the editor. Then, I attended the Saturday session on Zoom with Ronan, Michelle, and Ice at Ginkgo Bioworks. Here’s the end result:

  • Week 4 HW: Protein Design, Part I

    Part A: Conceptual questions Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip) How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons) A dalton is 1.66053906892(52)*10^−24 g, so 500g = 500 / 1.66053906892(52)*10^−24 = 3.0110704e+26 daltons.

Subsections of Homework

Week 1 HW: Principles and Practices

Class assignment

1. First, describe a biological engineering application or tool you want to develop and why.

I’m heavily inspired by Professor Jacobson’s call for a “bio-FPGA” tool, as well as his lecture about cellular automata. I’d like to develop a bio-FPGA that can be programmed to grow into arbitrary 2D patterns on a petri dish, using the machine learning technique mentioned in the lecture to reverse learn the cellular automata rules for growing a specific pattern. The learned CA rules can be encoded by genetically programming the bio-FPGA then using bacteria with the genes to grow an actual cell culture into the pattern, like the butterfly wing letter patterns in the lecture. If this is feasible, 3D patterns would be the next step, and one might even imagine a wild future of programmable plants that grow into the shapes of houses and furniture.

My primary goal is to protect health and safety. The sub-goals are:

  • Prevent the development of biological weapons.
  • Prevent outbreak of harmful bacteria.
  • Maximize productive use-cases.

A bio-FPGA has the possibility to be used for great benefit with many applications, but could also be abused.

3. Next, describe at least three different potential governance “actions” by considering the four aspects below (Purpose, Design, Assumptions, Risks of Failure & “Success”).

Action 1: Engineer a genetic “off switch” into the bio-FPGA to stop proliferation any time.

  • Purpose: We would build a genetic off switch to immediately turn off all genes of the bio-FPGA.
  • Design: This would involve researchers and industry as actors to build this prior to releasing the bio-FPGA. The government could also regulate by requiring that all bio-FPGAs and adjacent tools have such a fail-safe.
  • Assumptions: This assumes that such a technological solution could be reliably engineered and triggered.
  • Risks: The risks are that the technical fail-safe does not work, or could even cause problems if it does work because it could be abused to disable legitimate use cases.

Action 2: Regulate against use for biological warfare.

  • Purpose: Although there are already regulations in place, we could craft regulation to specifically account for bio-FPGA technology.
  • Design: This would require the government to understand the technology and its dangers, and to pass appropriate laws preventing malicious use of bio-FPGAs.
  • Assumptions: This assumes that lawmakers would be motivated to pass regulation and that the public would be accepting of such regulation. It also assumes that lawmakers are able to craft good laws or adapt accordingly.
  • Risks: The risk is that excessive regulation could stifle adoption and research for beneficial use cases. Another risk is that lawmakers don’t understand the science and pass inappropriate laws.

Action 3: Host a conference for researchers and industry to share new developments.

  • Purpose: To share beneficial use cases, foster collaboration, and disseminate research learnings.
  • Design: This requires coordinating and organizing the research and industry community, as well as raising funds to host a venue.
  • Assumptions: I assume that researchers would be interested in attending and discussing.
  • Risks: The conference could be used to develop malicious use-cases, or ethics could be overlooked in favor of scientific progress at all costs.
4. Next, score (from 1-3, with 1 as the best, or n/a) each of your governance actions against your rubric of policy goals.

Does the action:                                     Action 1  Action 2  Action 3
Enhance Biosecurity
• By preventing incidents                               3         1         2
• By helping respond                                    1         3         3
Foster Lab Safety
• By preventing incidents                               3         1         2
• By helping respond                                    1         3         3
Protect the environment
• By preventing incidents                               3         1         2
• By helping respond                                    1         3         3
Other considerations
• Minimizing costs and burdens to stakeholders          2         3         1
• Feasibility?                                          2         1         1
• Not impede research                                   1         3         1
• Promote constructive applications                     3         1         1

Week 2 lecture prep

Homework Questions from Professor Jacobson

  1. Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome. How does biology deal with that discrepancy?

According to the slides, the error rate of polymerase is 1:10^6, or 1 in 1 million (online sources suggest it can be even worse depending on the polymerase). The length of the human genome is 3.2 Gbp (3.2 × 10^9 bp), so at that rate there would be ~3.2 × 10^3 (3,200) errors in the human genome per copy. That would be a lot of errors, but there are additional pathways that perform error correction, such as the MutS mismatch repair pathway.
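As a quick check on that arithmetic, here is a minimal Python sketch. The error rate and genome length are the figures quoted from the slides; the function name is just for illustration.

```python
def expected_errors(error_rate: float, genome_length_bp: float) -> float:
    """Expected number of miscopied bases in one pass over the genome."""
    return error_rate * genome_length_bp


# 1 error per 10^6 bases copied, 3.2 Gbp genome (figures from the slides)
print(expected_errors(1e-6, 3.2e9))  # about 3,200 errors per copy
```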

  1. How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

The slides mention the average human protein is 1036 base pairs. That is 345 codons, so 345 amino acids.

Amino acids can have multiple corresponding codons. With 4 possible nucleotides at each of 3 positions there are 4^3 = 64 codons, 61 of which code for the 20 amino acids, so there are about 3 possible codons per amino acid.

That is an estimate of 3^345 (on the order of 10^164) ways to code for a given protein, a huge number.

However, even though synonymous codons code for the same amino acid, the choice of bases can affect the chemical bonds of the mRNA structure, affecting RNA cleavage rules.

Another view is that there are 4^1035 possible nucleotide sequences of that length, so a random one is very unlikely to code for the specific protein, even with synonymous codons, due to the sheer size of the possibility space.
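The "about 3 codons per amino acid" estimate can be replaced with an exact count for any particular protein. This is a small sketch using the codon degeneracy of the standard genetic code; the function names are my own, and the stop codon is ignored.

```python
import math

# Number of codons per amino acid in the standard genetic code (61 sense codons).
DEGENERACY = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}


def count_encodings(protein: str) -> int:
    """Number of distinct DNA coding sequences for a protein (stop codon ignored)."""
    return math.prod(DEGENERACY[aa] for aa in protein)


def log10_encodings(protein: str) -> float:
    """Same count as a power of ten, which is easier to read for long proteins."""
    return sum(math.log10(DEGENERACY[aa]) for aa in protein)


print(count_encodings("MKW"))  # M and W have one codon each, K has two
print(f"3^345 = 10^{345 * math.log10(3):.1f}")
```

Passing a full protein sequence to `log10_encodings` gives the exponent for that specific protein rather than the average-case estimate.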

Homework Questions from Dr. LeProust

  1. What’s the most commonly used method for oligo synthesis currently?

The phosphoramidite method.

  1. Why is it difficult to make oligos longer than 200nt via direct synthesis?

The coupling step cannot achieve perfect efficiency. That step is repeated per cycle, and each additional base requires the cycle to repeat. This means that longer oligos become dramatically harder to make, even with extremely high efficiencies:

images/oligo-table.png images/oligo-table.png

I found the table from this PDF.

  1. Why can’t you make a 2000bp gene via direct oligo synthesis?

The above answer explains why we can’t synthesize longer oligos. At 2000bp, the yield of full-length product becomes vanishingly small even at the highest efficiencies, which is why genes are assembled from shorter oligos instead.
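To make the length limit concrete, here is a rough sketch of how full-length yield falls off. It assumes one independent coupling step per base with a fixed efficiency, which is a simplification; the efficiencies shown are illustrative values, not figures from the lecture.

```python
def full_length_yield(coupling_efficiency: float, length_nt: int) -> float:
    """Fraction of strands that reach full length, assuming one independent
    coupling step per base (a simplification)."""
    return coupling_efficiency ** length_nt


for eff in (0.99, 0.995, 0.999):
    for n in (20, 200, 2000):
        y = full_length_yield(eff, n)
        print(f"efficiency={eff:.3f}  length={n:>4} nt  yield={y:.2e}")
```

Even at 99.9% per step, only about 13% of strands would be full length at 2000 nt under this model, and at 99% the yield is effectively zero.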

Homework Question from George Church

Choose ONE of the following three questions to answer; and please cite AI prompts or paper citations used, if any.

I’m answering question 1.

  1. [Using Google & Prof. Church’s slide #4] What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

The ten essential amino acids are:

  1. Histidine (H)
  2. Isoleucine (I)
  3. Leucine (L)
  4. Lysine (K)
  5. Methionine (M)
  6. Phenylalanine (F)
  7. Threonine (T)
  8. Tryptophan (W)
  9. Valine (V)
  10. Arginine (R), sometimes considered conditionally essential.

The “Lysine Contingency” is apparently from Jurassic Park. It was a plot element in the movie: a genetic modification that made the dinosaurs unable to produce Lysine, so they would die off without human-provided Lysine supplements.

However, Lysine is one of the 10 essential amino acids, so animals cannot produce it anyway, making this a scientifically dubious plot point (the genetic modification would have done nothing).

  1. [Given slides #2 & 4 (AA:NA and NA:NA codes)] What code would you suggest for AA:AA interactions?
  2. [(Advanced students)] Given the one paragraph abstracts for these real 2026 grant programs sketch a response to one of them or devise one of your own:

Week 2 HW: DNA Read, Write, and Edit

Part 0: Basics of Gel Electrophoresis

I watched the lecture and recitation and read the lab. Essentially, we use the negative charge of DNA to pull DNA fragments towards a positive anode in a porous agarose gel. Larger DNA fragments move more slowly in the agarose gel.

Part 1: Benchling & In-silico Gel Art

I spent some time playing around with Ronan’s gel art site to make a pattern (below on the left). I noticed that some of the restriction enzymes in the gel art tool weren’t on the HTGAA enzyme list, so I didn’t use them.

I think it looks kind of like Darth Vader.

Then, I added the Lambda DNA to Benchling. I made a custom enzyme list with EcoRI, HindIII, BamHI, KpnI, EcoRV, SacI, and SalI. Then, I added the restriction enzymes from the gel art tool to make a virtual digest (below on the right). I had some difficulty ordering the digests properly, so I saved them with the names and ordered them by dragging the tabs afterwards.

images/week-02/gel-art-screenshot.png images/week-02/gel-art-screenshot.png images/week-02/virtual-digest-benchling.png images/week-02/virtual-digest-benchling.png
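For intuition about what a virtual digest computes, here is a toy sketch that cuts a linear sequence at recognition sites and reports fragment lengths. It is not how Benchling implements it. Only three of the enzymes are included, and the example sequence is made up for illustration.

```python
# Recognition site and cut offset (bases from the start of the site, top strand).
# EcoRI cuts G^AATTC, HindIII cuts A^AGCTT, BamHI cuts G^GATCC.
# These sites are palindromic, so searching the top strand is sufficient.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
    "BamHI": ("GGATCC", 1),
}


def digest(seq: str, enzyme_names: list[str]) -> list[int]:
    """Fragment lengths (bp) from digesting a linear sequence, largest first."""
    seq = seq.upper()
    cuts = set()
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        start = seq.find(site)
        while start != -1:
            cuts.add(start + offset)
            start = seq.find(site, start + 1)
    bounds = [0] + sorted(cuts) + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


# Made-up 26 bp sequence with one EcoRI site and one BamHI site.
toy = "AAAA" + "GAATTC" + "CCCCCCCC" + "GGATCC" + "TT"
print(digest(toy, ["EcoRI", "BamHI"]))  # [14, 7, 5]
```

The sorted fragment lengths correspond to the band order on a gel, with the largest fragments staying closest to the wells.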

Part 2: Gel Art - Restriction Digests and Gel Electrophoresis (Wet Lab)

N/A: This is optional for committed listeners and I didn’t have access to a wet lab this week.

Part 3: DNA Design Challenge

3.1. Choose your protein

The protein I chose is Miraculin, which is from the miracle berry and famous for temporarily causing sour things to taste sweet. I picked this because I have tried a miracle berry tasting before and it was an interesting experience.

Here is the UniProt entry for Miraculin. UniProt also tags a number of other taste-modifying proteins.

The sequence for Miraculin (in FASTA) is:

>sp|P13087|MIRA_SYNDU Miraculin OS=Synsepalum dulcificum OX=3743 PE=1 SV=3
MKELTMLSLSFFFVSALLAAAANPLLSAADSAPNPVLDIDGEKLRTGTNYYIVPVLRDHG
GGLTVSATTPNGTFVCPPRVVQTRKEVDHDRPLAFFPENPKEDVVRVSTDLNINFSAFMP
CRWTSSTVWRLDKYDESTGQYFVTIGGVKGNPGPETISSWFKIEEFCGSGFYKLVFCPTV
CGSCKVKCGDVGIYIDQKGRRRLALSDKPFAFEFNKTVYF

3.2. Reverse Translate

I used tblastn (Translated BLAST) to get the nucleotide sequence that corresponds with Miraculin. This found two nucleotide sequences in a database: AB512278.1 and D38598.1. One appears to be the mRNA rather than the gene.

images/week-02/tblastn-miraculin.png images/week-02/tblastn-miraculin.png

Here is the FASTA for the gene, from AB512278.1:

>AB512278.1 Synsepalum dulcificum RdMIR gene for miraculin, complete cds
ATGAAGGAATTAACAATGCTCTCTCTCTCGTTCTTCTTCGTCTCTGCATTGTTGGCAGCAGCGGCCAACC
CACTGCTTAGTGCAGCGGATTCGGCACCCAACCCGGTTCTTGACATAGACGGAGAGAAACTCCGGACGGG
GACCAATTATTACATTGTGCCGGTGCTCCGCGACCATGGCGGCGGCCTTACAGTATCCGCCACCACCCCC
AACGGCACCTTCGTTTGTCCACCCAGAGTTGTCCAAACACGAAAGGAGGTCGACCACGATCGCCCCCTCG
CTTTCTTTCCAGAGAACCCAAAGGAAGACGTTGTTCGAGTCTCCACCGATCTCAACATCAATTTCTCGGC
GTTCATGCCCTGTCGTTGGACCAGTTCCACCGTGTGGCGGCTCGACAAATACGATGAATCCACGGGGCAG
TACTTCGTGACCATCGGCGGTGTCAAAGGAAACCCAGGTCCCGAAACCATTAGTAGCTGGTTTAAGATTG
AGGAGTTTTGTGGTAGTGGTTTTTACAAGCTTGTTTTCTGTCCCACCGTTTGTGGTTCCTGCAAAGTAAA
ATGCGGAGATGTGGGCATTTACATTGATCAGAAGGGAAGAAGGCGTTTGGCTCTCAGCGATAAACCATTC
GCATTCGAGTTCAACAAAACCGTATACTTCTAA
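One way to sanity-check a tblastn hit is to translate the coding sequence back and compare it with the protein. Below is a minimal translation sketch using the standard genetic code; the helper names are my own, and a library such as Biopython would normally be used instead.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("\n", "")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First 21 bases of AB512278.1, copied from the FASTA above.
print(translate("ATGAAGGAATTAACAATGCTC"))  # MKELTML
```

The output matches the first seven residues of the Miraculin protein sequence; passing the whole coding sequence would check the rest.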

3.3. Codon optimization

I used VectorBuilder’s codon optimization tool since Twist’s was down for maintenance. I optimized for E. coli, so the protein could be mass produced in its “cellular factory”. It gave the following:

Pasted Sequence: GC=51.43%, CAI=0.56

Improved DNA[1]: GC=55.81%, CAI=0.94

ATGAAAGAACTGACCATGCTGAGCCTGAGCTTCTTTTTTGTGAGCGCGCTGCTGGCGGCGGCAGCGAACCCGCTGCTGAGCGCGGCAGATAGCGCGCCGAACCCGGTGCTGGATATTGATGGCGAAAAACTGCGCACCGGCACCAATTATTATATTGTGCCGGTGCTGCGCGACCATGGCGGCGGCCTGACCGTAAGCGCGACTACCCCGAACGGCACCTTTGTGTGCCCGCCGCGTGTCGTGCAGACCCGCAAAGAAGTGGACCACGATCGCCCGCTGGCCTTCTTTCCGGAAAACCCGAAAGAAGATGTGGTGCGCGTGAGCACCGATCTGAACATTAACTTCAGCGCCTTCATGCCGTGCCGTTGGACCAGCTCGACCGTTTGGCGCCTGGATAAATATGATGAAAGCACCGGCCAGTACTTTGTTACCATTGGCGGCGTTAAAGGCAACCCGGGCCCGGAAACCATTAGCTCGTGGTTCAAAATTGAAGAATTTTGCGGCAGCGGCTTTTACAAACTGGTGTTTTGCCCGACCGTGTGCGGCAGCTGTAAAGTGAAATGCGGCGACGTGGGCATTTATATTGATCAGAAAGGCCGTCGCCGCCTGGCCCTGAGCGATAAACCGTTCGCGTTTGAATTTAACAAAACCGTGTATTTCTAA
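The GC percentages reported by the tool are simple to reproduce. This is a small sketch with a hypothetical helper name; CAI is more involved because it needs a host codon usage table, so it is not attempted here.

```python
def gc_content(seq: str) -> float:
    """Percentage of bases in the sequence that are G or C."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)


# First 24 bases of the optimized sequence, as a short example.
print(f"{gc_content('ATGAAAGAACTGACCATGCTGAGC'):.2f}%")
```

Running it on the full pasted and optimized sequences should give values close to the 51.43% and 55.81% shown above, assuming the tool counts GC the same way.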

3.4. You have a sequence! Now what?

Since I chose E. coli, we can order the gene with a promoter in a plasmid, then use a cell-dependent method: heat shock the E. coli so they take up the plasmid, then cultivate the E. coli to produce lots of this protein.

This uses the natural gene expression mechanisms of E. coli to transcribe and translate the protein from the plasmid.

Part 4: Prepare a Twist DNA Synthesis Order

I followed the steps to make the provided sfGFP sequence in Benchling. Here is my Benchling project.

Here is the FASTA file of the expression cassette:

>E. coli sfGFP
TTTACGGCTAGCTCAGTCCTAGGTATAGTGCTAGCCATTAAAGAGGAGAAAGGTACCATGAGCAAAGGAGAAGAACTTT
TCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCACAAATTTTCTGTCCGTGGAGAGGGTGA
AGGTGATGCTACAAACGGAAAACTCACCCTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCGTGGCCAACACTT
GTCACTACTCTGACCTATGGTGTTCAATGCTTTTCCCGTTATCCGGATCACATGAAACGGCATGACTTTTTCAAGAGTG
CCATGCCCGAAGGTTATGTACAGGAACGCACTATATCTTTCAAAGATGACGGGACCTACAAGACGCGTGCTGAAGTCAA
GTTTGAAGGTGATACCCTTGTTAATCGTATCGAGTTAAAGGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACAC
AAACTCGAGTACAACTTTAACTCACACAATGTATACATCACGGCAGACAAACAAAAGAATGGAATCAAAGCTAACTTCA
AAATTCGCCACAACGTTGAAGATGGTTCCGTTCAACTAGCAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCC
TGTCCTTTTACCAGACAACCATTACCTGTCGACACAATCTGTCCTTTCGAAAGATCCCAACGAAAAGCGTGACCACATG
GTCCTTCTTGAGTTTGTAACTGCTGCTGGGATTACACATGGCATGGATGAGCTCTACAAACATCACCATCACCATCATC
ACTAACCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAAC
GCTCTCTACTAGAGTCACACTGGCTCACCTTCGGGTGGGCCTTTCTGCGTTTATA

I imported that into Twist and added the vector. Here is a link to the Benchling project with the Twist draft order.

images/week-02/twist-plasmid.png images/week-02/twist-plasmid.png

Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence (e.g., read) and why?

I’d like to sequence DNA of the human microbiome. There’s been recent research about how beneficial flora of the gut and skin microbiome contribute to our health, and there’s already a significant effort to sequence our microbiome in the Human Microbiome Project.

I would also be interested in the widespread sequencing and cataloguing of viruses that make up the common cold. I think it could be useful to detect the geographic spread of these viruses and how they mutate over time, to potentially contribute to a cure.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?

I would use Sanger Sequencing, since it’s a straightforward and well-tested technique, and I understand it the best.

  1. Is your method first-, second- or third-generation or other? How so?

Sanger Sequencing is a first-generation technique. It’s the earliest and most classic form of sequencing.

  1. What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.

The inputs are template DNA, regular nucleotides (dNTPs), chain-terminating nucleotides (ddNTPs), a primer (as in PCR), and DNA polymerase. You prepare the input by amplifying the sample with PCR so there is plenty of DNA.

  1. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?

You thermocycle the amplified sample with the nucleotides, primers, and polymerases to build many DNA fragments. These DNA fragments will be different lengths because, probabilistically, each fragment will incorporate a chain-terminating nucleotide at some position, which stops polymerization. Finally, the fragments are run through electrophoresis and imaged one base at a time to get the sequence.

  1. What is the output of your chosen sequencing technology?

The result is the electrophoresis imaging data, which can be processed to determine the most likely base at each position.

One limitation of the technique is that you need a pure sample of DNA, so it may be inefficient for the volumes of organisms we’d want to sequence.
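The chain-termination idea can be illustrated with a toy simulation. This is a deliberately simplified sketch: it treats the input string as the strand being synthesized (ignoring the primer and complementarity), and the termination probability and fragment count are arbitrary illustrative values.

```python
import random


def sanger_read(strand: str, n_fragments: int = 5000,
                p_terminate: float = 0.05, seed: int = 0) -> str:
    """Toy Sanger read: grow many fragments, stop each at a random ddNTP,
    then read the terminal base of each fragment length in order."""
    rng = random.Random(seed)
    terminal_base_by_length = {}
    for _ in range(n_fragments):
        for length, base in enumerate(strand, start=1):
            if rng.random() < p_terminate:  # a ddNTP was incorporated here
                terminal_base_by_length[length] = base
                break
    # Electrophoresis sorts fragments by length, shortest first.
    return "".join(
        terminal_base_by_length.get(n, "N")  # "N" if no fragment ended here
        for n in range(1, len(strand) + 1)
    )


print(sanger_read("ACGTACGTTAGC"))
```

With enough fragments, every length is represented and the read reproduces the strand; with too few, some positions come back as `N`.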

5.2 DNA Write

(i) What DNA would you want to synthesize (e.g., write) and why?

I would want to synthesize DNA origami, as art! I’m curious what it would take to make the smallest art pieces.

(ii) What technology or technologies would you use to perform this DNA synthesis and why?

I would use the phosphoramidite method.

  1. What are the essential steps of your chosen sequencing [sic; synthesis?] methods?

Deprotection, coupling, capping, oxidation, and repeat.

  1. What are the limitations of your sequencing [sic; synthesis?] method (if any) in terms of speed, accuracy, scalability?

The limitation, as discussed in the last homework, is the length of DNA oligos that can be synthesized with this technique. However, this isn’t a problem for DNA origami, since the staple strands are short oligos and don’t require long synthesized DNA.

5.3 DNA Edit

(i) What DNA would you want to edit and why?

I would edit human DNA, for example to cure hearing loss and tinnitus as a gene therapy. Hearing loss is permanent and affects 5% of the world population. Noise exposure from work and the environment also contributes to increased rates of hearing loss. Birds, unlike humans, can regenerate inner ear cells, and researchers have demonstrated regrowth in cell cultures, so there is a theoretical target for gene therapy. There have already been successful gene therapy treatments for deafness in children due to congenital disorders.

(ii) What technology or technologies would you use to perform these DNA edits and why?

I would use CRISPR-Cas9, since it is the most popular technique today.

  1. How does your technology of choice edit DNA? What are the essential steps?

Cas9 cuts the DNA at the site specified by the guide RNA, and the edit is then introduced via homology-directed repair.

  1. What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?

The guide RNA for Cas9 needs to be designed, as well as the template DNA (for knock-ins).

  1. What are the limitations of your editing methods (if any) in terms of efficiency or precision?

Homology-directed repair is not entirely efficient, since the ends that Cas9 breaks could rejoin without the knock-in sequence (via non-homologous end joining). Performing CRISPR in humans (rather than bacteria or cell cultures) requires sophisticated delivery methods to reach the target cells or tissues.
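As a rough illustration of the guide design step, the sketch below lists candidate 20-nt protospacers that sit immediately 5' of an NGG PAM, which is the PAM used by the common SpCas9. It scans only the given strand and does no off-target or efficiency scoring, so it is a starting point rather than a real design tool. The function name and example sequence are made up.

```python
import re


def find_guides(seq: str, guide_len: int = 20) -> list[tuple[int, str, str]]:
    """Return (start, protospacer, PAM) for every NGG PAM on the given strand
    that has a full-length protospacer upstream of it."""
    seq = seq.upper()
    guides = []
    # Lookahead so that overlapping PAMs are all found.
    for match in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = match.start()
        if pam_start >= guide_len:
            start = pam_start - guide_len
            guides.append((start, seq[start:pam_start], match.group(1)))
    return guides


# Made-up example: 20 A's followed by a TGG PAM.
example = "A" * 20 + "TGG" + "CC"
for start, protospacer, pam in find_guides(example):
    print(start, protospacer, pam)
```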

Week 3 HW: Lab Automation

Opentrons Artwork

My artwork is here: https://rcdonovan.com/?id=vmns94wqt45wpqc

images/week-03/tomato-simulation.png images/week-03/tomato-simulation.png

I used Ronan’s tool to make this. I uploaded an image of tomatoes but it didn’t render well, so I modified it significantly by hand with the editor.

Then, I attended the Saturday session on Zoom with Ronan, Michelle, and Ice at Ginkgo Bioworks. Here’s the end result:

images/week-03/tomato-actual.png images/week-03/tomato-actual.png

Post-Lab Questions

  1. Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.

I read Assembly of small silica nanoparticles using lipid-tethered DNA ‘bonds’. This paper introduced a novel DNA-based assembly method: embedding silica nanoparticles in a lipid bilayer, embedding the cholesterol end of a DNA-cholesterol molecule within the bilayer, then assembling the nanoparticles with complementary sticky-end “bridge” DNA. This is best explained with the image from the paper below:

images/week-03/silica-nanoparticles.gif images/week-03/silica-nanoparticles.gif

They used Opentrons to rapidly iterate on and evaluate different concentrations of DNA-Chol, NaCl, and bridge DNA in the assembly mixture. These were then screened with SAXS to determine the structural qualities of each sample.

  1. Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.

One project idea I have is genetically modifying trees for use in urban areas. I’d like to use automation tools to permute different genetic combinations. I envision custom modules that could germinate and monitor an array of seeds for different qualities.

For example, I could create a module with grow lights and watering capabilities that cares for the different seed variants, with cameras to compare growth rates.
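As a first step toward that, here is a small sketch of how the variant combinations could be enumerated and assigned to positions in a 96-well-style grid. The trait and allele names are hypothetical placeholders, and nothing here depends on the Opentrons API; the resulting layout would feed into an actual protocol later.

```python
from itertools import product
from string import ascii_uppercase


def assign_wells(variants: dict[str, list[str]],
                 rows: int = 8, cols: int = 12) -> dict[str, tuple[str, ...]]:
    """Map each combination of variants to a well name (A1, A2, ...),
    filling the grid row by row."""
    combos = list(product(*variants.values()))
    if len(combos) > rows * cols:
        raise ValueError("more combinations than wells")
    wells = [f"{ascii_uppercase[r]}{c + 1}"
             for r in range(rows) for c in range(cols)]
    return dict(zip(wells, combos))


# Hypothetical traits and alleles, purely for illustration.
layout = assign_wells({"drought": ["wt", "d1"], "root": ["wt", "r1", "r2"]})
for well, combo in layout.items():
    print(well, combo)
```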

Week 4 HW: Protein Design, Part I

Part A: Conceptual questions

Answer any NINE of the following questions from Shuguang Zhang: (i.e. you can select two to skip)

  1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

A dalton is 1.66053906892(52)*10^−24 g, so 500g = 500 / 1.66053906892(52)*10^−24 = 3.0110704e+26 daltons.

If an amino acid averages 100 daltons, then 3e+26 daltons is about 3e+24 amino acids.

~3,000,000,000,000,000,000,000,000 amino acids!
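A cross-check via moles: since a molecule of mass 1 Da corresponds to 1 g/mol, 500 g of 100 Da molecules is 5 mol. The sketch below follows the question's simplification that the whole 500 g is amino acids; the function name is my own.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def amino_acids_in(mass_g: float, avg_mass_da: float = 100.0) -> float:
    """Number of amino acid molecules in a given mass, assuming the whole
    mass is amino acids of the given average mass."""
    moles = mass_g / avg_mass_da  # 1 Da per molecule is 1 g/mol
    return moles * AVOGADRO


print(f"{amino_acids_in(500):.2e}")  # 3.01e+24
```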

  1. Why do humans eat beef but do not become a cow, eat fish but do not become fish?

We digest the meat and it does not alter our genetics.

  1. Why are there only 20 natural amino acids?

Codon redundancy allows for both more efficient genetic coding and error correction. 20 appears to be enough (as evidenced by life).

  1. Can you make other non-natural amino acids? Design some new amino acids.

Yes, you can make non-natural amino acids (e.g. non-proteinogenic amino acids). Amino acids require an amine, a carboxyl, a central carbon, and a side-chain. You could design one by using a unique side chain that doesn’t exist in the natural amino acids.

  1. Where did amino acids come from before enzymes that make them, and before life started?

They formed abiotically through natural reactions in the environment. They have even been found on meteorites.

  1. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

I would expect it to turn the other way, since D-amino acids are mirrored. Normal alpha helices are right-handed, so I would expect a left-handed helix.

  1. Can you discover additional helices in proteins?

Skipped (1/2).

  1. Why are most molecular helices right-handed?

Skipped (2/2).

  1. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

This is because beta sheets can be hydrophilic on one side and hydrophobic on the other, forming a “pleated sheet”.

  1. Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

Shuguang mentions that β-sheets can aggregate to form disease because β-sheets can’t be easily untangled once formed.

Yes, β-sheets give spider silk its unique properties, and synthetic fibers like Kevlar similarly rely on hydrogen-bonded sheets of aligned chains.

  1. Design a β-sheet motif that forms a well-ordered structure.

We can use the rule from Shuguang’s slides: alternate hydrophobic and hydrophilic for every other amino acid.
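The alternation rule can be sketched as a tiny generator. The residue pools are my own illustrative choices (V, I, F as hydrophobic; K, E, T as hydrophilic) and are not taken from the slides, and nothing here predicts whether the result would actually fold.

```python
from itertools import cycle


def alternating_motif(hydrophobic: str, hydrophilic: str, length: int) -> str:
    """Alternate hydrophobic and hydrophilic residues, cycling through each pool."""
    pools = (cycle(hydrophobic), cycle(hydrophilic))
    return "".join(next(pools[i % 2]) for i in range(length))


print(alternating_motif("VIF", "KETE", 12))  # VKIEFTVEIKFE
```

In a β-strand, consecutive side chains point to opposite faces, so this pattern puts the hydrophobic residues on one face and the hydrophilic residues on the other.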

Part B: Protein Analysis and Visualization

I’m picking the same protein as in week 2: Miraculin, a taste-altering protein.

Amino acid sequence:

How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

MKELTMLSLSFFFVSALLAAAANPLLSAADSAPNPVLDIDGEKLRTGTNYYIVPVLRDHGGGLTVSATTPNGTFVCPPRVVQTRKEVDHDRPLAFFPENPKEDVVRVSTDLNINFSAFMPCRWTSSTVWRLDKYDESTGQYFVTIGGVKGNPGPETISSWFKIEEFCGSGFYKLVFCPTVCGSCKVKCGDVGIYIDQKGRRRLALSDKPFAFEFNKTVYF

Using the Colab, I got:

The length of the protein is: 220 aminoacids.
The most common amino acid is: V, which appears 20 times.
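For reference, the same count can be done in a few lines without the notebook. This is a sketch of the idea, not the Colab's actual code, and the function name is my own.

```python
from collections import Counter


def summarize(protein: str) -> tuple[int, str, int]:
    """Return (length, most common amino acid, its count)."""
    counts = Counter(protein)
    amino_acid, n = counts.most_common(1)[0]
    return len(protein), amino_acid, n


length, amino_acid, n = summarize("MKKL")  # short example input
print(f"length={length}, most common={amino_acid} ({n} times)")
```

Passing the full Miraculin sequence in place of the short example reproduces the length and most-frequent-residue report.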

How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

BLAST yielded 249 similar proteins with significant identities (~50%), high scores (~200), and low e-values (< 0.05).

These include proteins from tea, fruit, and tobacco.

Does your protein belong to any protein family?

According to UniProt:

Belongs to the protease inhibitor I3 (leguminous Kunitz-type inhibitor) family.

Identify the structure page of your protein in RCSB

Miraculin isn’t in RCSB, so I switched to D7TY99, a homolog found in grapes.

>5YH4_1|Chain A|mirauclin-like protein|Vitis vinifera (29760)
ESAPDPVLDTEGKQLRSGVDYYILPVIRGRGGGLTLASTGNENCPLDVVQEQHEVSNGLPLTFTPVNPKKGVIRVSTDHNIKFSASTICVQSTLWKLEYDESSGQRFVTTGGVEGNPGRETLDNWFKIEKYEDDYKLVFCPTVCDFCKPVCGDIGIYIQNGYRRLALSDVPFKVMFKKA

This one does have an RCSB entry.

Open the structure of your protein in any 3D molecule visualization software

Here it is visualized in PyMol (different visualization types on the left):

images/week-04/pymol.png images/week-04/pymol.png

It has more sheets than helices. Here is the secondary structure, with sheets colored yellow and helices colored red.

images/week-04/secondary-structure.png images/week-04/secondary-structure.png

Here are the residue types, colored orange for hydrophobic and cyan for hydrophilic. The regions alternate, and there appear to be more hydrophilic residues:

images/week-04/residue.png images/week-04/residue.png

Finally, the surface of the protein. There are at least three “holes” that look like binding sites:

images/week-04/surface.png images/week-04/surface.png

Using ML-based protein design tools

For the ML tools, I’ll be using Fel d 1, which is a major cat allergen protein.

It’s made of two linked peptides: Major allergen I polypeptide chain 1 and Major allergen I polypeptide chain 2.

>1ZKR_1|Chains A, B|Major allergen I polypeptide, fused chain 1, chain 2|Felis catus (9685)
MEICPAVKRDVDLFLTGTPDEYVEQVAQYKALPVVLENARILKNCVDAKMTEEDKENALSLLDKIYTSPLCVKMAETCPIFYDVFFAVANGNELLLDLSLTKVNATEPERTAMKKIQDCYVENGLISRVLDGLVMTTISSSKDCMGEHHHHHH

Here’s the mutation scan heatmap. It looks like the amino acids at the end of the protein are most sensitive to mutation. We can also see that “W” tends to be a bad mutation anywhere in the protein.

images/week-04/mutation-scan-heatmap.png images/week-04/mutation-scan-heatmap.png

And here’s the t-SNE. It looks like it’s not very closely related to the other proteins, which would make sense since it’s a unique allergen (otherwise we might expect cat allergies to correlate with lots of other allergies).

images/week-04/tsne.png images/week-04/tsne.png

Here’s the predicted fold, colored based on confidence. The red at the end means lower confidence.

images/week-04/fold.png images/week-04/fold.png

Here it is compared to the actual structure. On the left in cyan is the experimental structure, and on the right is the predicted fold. We can see that the main structure looks similar but the end is totally wrong, in line with the lower confidence. However, this is probably due to the His-tag at the end of the protein:

images/week-04/fold-comparison.png images/week-04/fold-comparison.png

Note that in PDB data, there were two protein molecules included in the experimental data, apparently as an asymmetric unit, so I removed one for better comparison.

Here’s the output of the inverse fold:

>5MBA, score=1.3652, fixed_chains=[], designed_chains=['A'], model_name=v_48_020
SLSAAEADLAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGKSVADIKASPKLRDVSSRIFTRLNEFVNNAANAGKMSAMLSQFAKEHVGFGVGSAQFENVRSMFPGFVASVAAPPAGADAAWTKLFGLIIDALKAAGA
>T=0.1, sample=0, score=0.7874, seq_recovery=0.5000
ALTAAQAAKLRAAFAPVAANAAANGRAFLLTLFAAYPELRELFPEFRGKSLEEIAASPALDAVATAFMTTLKTLVDTADDAAAMAALLAALAAAHVARGITAAHFERVRDLFPGFVASVAAPPAGADAAWDALWGLVIAALRAAGG
images/week-04/inverse-fold.png images/week-04/inverse-fold.png
Generating sequences...
>5MBA, score=1.3497, fixed_chains=[], designed_chains=['A'], model_name=v_48_020
SLSAAEADLAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGKSVADIKASPKLRDVSSRIFTRLNEFVNNAANAGKMSAMLSQFAKEHVGFGVGSAQFENVRSMFPGFVASVAAPPAGADAAWTKLFGLIIDALKAAGA
>T=0.1, sample=0, score=0.7784, seq_recovery=0.4658
ALTPEEAALLAAAMAPFFADREANGRAFLLRLFAAYPALAELFPAFRGKSLAEIAASPELPAIAGAVMDLLATLVANADDAAAMAALLAALAAAHVALGITAAHFEAIRDIFPGFIASVAPPPPGADAAWDRLLGDVIAALRAAGG

New Sequence:ALTPEEAALLAAAMAPFFADREANGRAFLLRLFAAYPALAELFPAFRGKSLAEIAASPELPAIAGAVMDLLATLVANADDAAAMAALLAALAAAHVALGITAAHFEAIRDIFPGFIASVAPPPPGADAAWDRLLGDVIAALRAAGG
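The `seq_recovery` value in the output can be read as the fraction of positions where the designed sequence matches the native one. Here is a minimal sketch of that idea as plain sequence identity; this is my assumption about the metric, and the tool may restrict it to designed positions.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of aligned positions where the two sequences agree."""
    if len(native) != len(designed):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


print(sequence_recovery("ABCD", "ABXD"))  # 0.75
```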

I tried adding random pointwise mutations to the protein, and it didn’t appear to affect the fold too much. For example:

images/week-04/pointwise-mutation-fold.png images/week-04/pointwise-mutation-fold.png

Subsections of Labs

Week 1 Lab: Pipetting

cover image cover image

Subsections of Projects

Individual Final Project

cover image cover image

Group Final Project

cover image cover image