Part 1: Benchling & In-silico Gel Art This DNA gel art was designed in the style of Paul Vanouse’s Latent Figure Protocol. I chose to create the letter “P” as it is the initial of my name, Paula. To achieve this, I used Ronan’s website, which was a helpful tool for quickly iterating on the designs and determining the best enzyme combinations to form the silhouette of the letter.
Part A: SOD1 Binder Peptide Design (From Pranam) Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.
Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.
Assignment Part 1: Intracellular Artificial Neural Networks (IANNs) 1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions? Boolean functions are limited to discrete on/off states while IANNs are capable of processing analogue signals and, because of that, carry more information. Real world phenomena are analog, inside a cell there is inherent molecular noise, and Boolean circuits are fragile to this, especially at low signal concentrations.
Subsections of Homework
Week 1 HW: Principles and Practices
Class Assignment
1. First, describe a biological engineering application or tool you want to develop and why.
Endometriosis is an inflammatory disease characterized by the growth of endometrial-like tissue outside the uterine cavity. This ectopic growth leads to hormonal imbalances, systemic inflammation, and debilitating pain during menstruation, sexual intercourse, and bodily functions. Although it affects 10–15% of reproductive-age women, there is currently no cure, and diagnosing the disease remains a clinical challenge [1]. Current clinical management is limited to hormonal suppression, pain control, and surgical excision [2]. Consequently, there is a critical need for non-invasive, targeted therapies that can modulate the immune response and minimize recurrence rates without compromising the patient’s reproductive health.
To address these challenges, I propose Endo-Biotics, a vaginal suppository containing probiotic bacteria (Lactobacillus) genetically programmed to deliver bispecific nanobodies that block the IL-17 cytokine inflammatory cascade after specifically anchoring to CD44 receptors.
Nanobodies: Antigen-binding fragments derived from the naturally occurring heavy-chain-only antibodies present in the serum of camelids. Their small size, high stability, strong antigen-binding affinity, water solubility, and natural origin offer new treatment possibilities compared with conventional antibodies, which are limited by their large size and poor penetration into solid tissues [3].
Expression host: Lactobacillus is a genus of commensal bacteria found naturally in the microbiota of the female reproductive tract. This natural affinity enables effective mucosal colonization, ensuring the system persists long enough to deliver a therapeutic dose.
Targeting module: Endometrial cells from women with endometriosis overexpress CD44 variants, which is associated with increased adherence to peritoneal cells and plays a key role in the development of early endometriotic lesions [4]. Targeting CD44 allows the nanobody to be retained at the lesion site and reduces exposure of the surrounding healthy tissue.
Effector module: Elevated levels of IL-17 have been observed in patients during the early stages of the disease. This pro-inflammatory cytokine promotes the proliferation, invasion, and implantation of endometriotic cells by triggering the construction of new blood vessel networks [5]. Blocking IL-17 not only reduces inflammation but also interrupts the development of the blood supply these lesions require to survive and persist outside the uterine cavity.
2. Next, describe one or more governance/policy goals related to ensuring that this application or tool contributes to an “ethical” future.
Biological Containment
Prevent uncontrolled growth: Bacterial growth should be limited to therapeutic levels and can be stopped by discontinuing use.
Prevent dissemination beyond the host: The organism should not spread to other individuals or into the external environment.
Limit horizontal gene transfer: Prevent genetically modified elements from being transferred to the native microbiota or to environmental bacteria.
Patient safety
Minimize off-target effects: Avoid interfering with normal immune functions outside of the tissue affected by endometriosis.
Microbiome integrity: Prevent genetically modified Lactobacillus from altering the balance of the vaginal microbiome.
Responsible patient use: Ensure that patients use the therapy correctly, with adequate understanding of benefits, risks, and limitations.
Biosafety
Laboratory safety: Manufacturing processes should follow established biosafety protocols.
Safe handling and distribution: Ensure appropriate storage, transport, and handling conditions.
Misuse prevention: Prevent unauthorized acquisition, modification, or use for non-therapeutic purposes.
Equitable access
Affordability: Avoid limiting access to high-income populations only.
Inclusive clinical evaluation: Clinical trials and tests will consider diverse populations to reduce bias and ensure effective results.
3. Describe at least three different potential governance “actions” by considering Purpose, Design, Assumptions, Risks of Failure & “Success”
Action 1: Kill Switch - Researchers
Purpose: Prevent bacteria from growing uncontrolled or escaping into the environment.
Design: The bacteria are designed to depend on a nutrient that is absent from the body and the environment and is supplied only in the vaginal suppository.
Assumptions: The bacteria will not mutate to acquire another means of subsistence.
Risks of Failure & “Success”: If this fails, the bacteria could colonize the reproductive system. If successful, the synthetic nutrient could increase the cost of production.
Action 2: Chromosomal Integration - Researchers
Purpose: Prevent genetically modified elements from spreading to the native microbiota or environmental bacteria.
Design: In advanced stages of research, the expectation is to transition from genetic modification using plasmids to incorporating therapeutic DNA directly into the chromosomes of Lactobacillus.
Assumptions: Chromosomal integration is stable and will not negatively affect the growth or therapeutic efficacy of the strain.
Risks of Failure & “Success”: DNA could still be transferred via transduction or natural transformation. However, the risk is significantly lower than with plasmid-based expression.
Action 3: Education and Transparency – User
Purpose: Ensure correct and informed use.
Design: Clear instructions for use and contraindications, with full transparency to support informed decision making.
Assumptions: Patients will read the material and the information will be accessible to everyone.
Risks of Failure & “Success”: If this fails, patients may use the treatment negligently.
Action 4: Access under prescription - Health regulatory agencies (DIGEMID, INS, SUSALUD)
Purpose: Avoid unauthorized acquisition, home modification, or use of the therapy for purposes other than the treatment of endometriosis.
Design: Endo-Biotics must be classified as a prescription-only treatment. Only specialist doctors can issue the prescription after diagnosing endometriosis.
Assumptions: Patients will not try to acquire the product through unofficial channels and specialists are willing to prescribe new therapies.
Risks of Failure & “Success”: If successful, this provides a high level of patient safety and clinical oversight, but it may limit access for those without easy access to specialists.
Action 5: Financing and Subsidy – Public Health Organizations (ProCiencia and MINSA – Perú | WHO and EndoFound - Internationally)
Purpose: Ensure the therapy reaches all women regardless of their socioeconomic status.
Design: Locally, we will work to include the therapy in Peru’s National Petition of Essential Medicines (PNME) to enable coverage through MINSA (SIS); internationally, we will partner with NGOs like the Gates Foundation to lower costs for vulnerable populations in developing regions.
Assumptions: There is sufficient political will and international funding available specifically for endometriosis, which is traditionally an underfunded area.
Risks of Failure & “Success”: Dependence on external financing or subsidies can make the project unstable. If successful, a technology that could improve quality of life would be accessible to all sectors of the population.
Action 6: Rigorous Lab Protocols - Researchers
Purpose: To avoid human error and ensure the modified Lactobacillus is produced with total sterility and verified binding ability.
Design: Mandatory “binding assays” to confirm that the bacteria actually adhere to the CD44 target, and strict sterility protocols to minimize the risk of contamination or environmental release.
Assumptions: Researchers will follow the protocols and equipment will be correctly calibrated.
Risks of Failure & “Success”: Small errors could lead to contamination or a batch with incorrect genetic markers. Even if successful, the constant auditing and verification could slow down the production process.
4. Score (from 1-3 with, 1 as the best, or n/a) each of your governance actions against your rubric of policy goals.
5. Last, drawing upon this scoring, describe which governance option, or combination of options, you would prioritize, and why. Outline any trade-offs you considered as well as assumptions and uncertainties.
I would prioritize the Kill Switch and Chromosomal Integration. Although these technical measures could increase the complexity and cost of production, nanobody treatment is an emerging technology that is not yet fully regulated, so it is necessary to take all possible precautions. The fundamental goal is to improve the quality of life of patients with endometriosis without invasive treatments or harm to their fertility. I am assuming that physicians will be willing to adopt this new biotherapeutic and that public health organizations will maintain a long-term interest in funding endometriosis research so that these treatments can be developed. The biggest uncertainty is whether biological containment can be achieved without altering the vaginal microbiota, given that it is such a complex system.
Reflecting on what you learned and did in class this week, outline any ethical concerns that arose, especially any that were new to you. Then propose any governance actions you think might be appropriate to address those issues.
One of the ethical concerns discussed in class was “who has access.” Synthetic biology is emerging as a powerful tool that can improve quality of life and open new avenues for innovation, but it can also be used negligently in ways that may harm people or the environment. For this reason, hearing about “trust” as a central theme in biotechnology made me reflect on the importance of closing the gap between experts and the general public, and on how doing so could open the door to new approaches and perspectives, as long as it is done in an ethical way.
Assignment (Week 2 Lecture Prep)
Homework Questions from Professor Jacobson:
Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome. How does biology deal with that discrepancy?
The error rate of polymerase is about 1 in 10^6, which means that one error is made for every million nucleotides added. The human genome consists of approximately 3 × 10^9 bp (3,088,269,832 bp [6]), so each time a cell copies its genome there would be approximately 3,000 errors. Biology deals with this discrepancy through DNA polymerase proofreading during extension and the MutS mismatch repair system.
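The estimate above is a single multiplication. A minimal sketch of it, taking the error rate and the genome length from [6] as given; the function name is only illustrative:

```python
def expected_errors(genome_length_bp: int, error_rate_per_base: float) -> float:
    """Expected number of misincorporated bases when copying one genome."""
    return genome_length_bp * error_rate_per_base


GENOME_LENGTH_BP = 3_088_269_832  # human genome length reported in [6]
ERROR_RATE = 1e-6                 # one error per million nucleotides added

errors = expected_errors(GENOME_LENGTH_BP, ERROR_RATE)
print(f"Expected errors per genome copy: {errors:.0f}")
```

The result is on the order of a few thousand errors per copy, which is why proofreading and mismatch repair are needed.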
How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?
Based on the Lecture 2 slides, an average human protein is encoded by about 1036 bp. Since DNA is composed of four nucleotides (A, T, C, G), there are 4^1036 possible DNA sequences of this length. The number that encode one particular protein is smaller but still astronomical: 1036 bp corresponds to about 345 codons, and most amino acids can be specified by several synonymous codons (about three on average), giving on the order of 3^345, or more than 10^164, alternative coding sequences.
In practice, only a small fraction of these sequences work well. Although multiple codons can encode the same amino acid, they are not translated equally efficiently in a given host. The DNA sequence also determines secondary structure formation: high GC content, repetitive sequences, or unfavorable base-pairing energies can produce secondary structures that interfere with transcription, translation, or synthesis.
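One way to count the DNA sequences that encode a given protein is to multiply, residue by residue, the number of synonymous codons in the standard genetic code. The sketch below does this for a short peptide; the peptide is an arbitrary illustration (not a protein from the lecture), and the stop codon is ignored.

```python
import math

# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(protein: str) -> int:
    """Number of distinct coding sequences for the given amino acid sequence."""
    return math.prod(CODON_COUNTS[residue] for residue in protein.upper())


def log10_encodings(protein: str) -> float:
    """Same count on a log10 scale, convenient for long proteins."""
    return sum(math.log10(CODON_COUNTS[residue]) for residue in protein.upper())


peptide = "MLS"  # hypothetical example: Met (1 codon), Leu (6), Ser (6)
print(count_encodings(peptide))
print(round(log10_encodings(peptide), 2))
```

For a full-length protein the count grows exponentially with length, so the log10 form is the practical one to report.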
Homework Questions from Dr. LeProust:
What’s the most commonly used method for oligo synthesis currently?
The most commonly used method is solid-phase chemical synthesis using phosphoramidite chemistry, where nucleotides are added one at a time in repeated cycles.
Why is it difficult to make oligos longer than 200nt via direct synthesis?
Because each step isn’t perfectly efficient. As the oligo gets longer, small mistakes build up, so after around 200 nucleotides the yield drops a lot and many sequences are incomplete or wrong.
Why can’t you make a 2000bp gene via direct oligo synthesis?
At that length, the error accumulation makes getting a fully correct sequence extremely unlikely. That’s why long genes are made by assembling shorter oligos instead of synthesizing them all at once.
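The length limit follows from compounding per-step efficiency: if each coupling succeeds with probability p, the fraction of full-length product after n steps is roughly p^n. A minimal sketch, assuming an illustrative coupling efficiency of 99% (this value is an assumption, not a figure from the lecture):

```python
def full_length_fraction(length_nt: int, coupling_efficiency: float) -> float:
    """Fraction of strands that reach full length.

    The first nucleotide is already attached to the solid support,
    so an n-mer needs n - 1 successful coupling steps.
    """
    return coupling_efficiency ** (length_nt - 1)


ASSUMED_EFFICIENCY = 0.99  # illustrative assumption

for length in (20, 200, 2000):
    fraction = full_length_fraction(length, ASSUMED_EFFICIENCY)
    print(f"{length:>5} nt: {fraction:.3g} of strands are full length")
```

Under this assumption only a small minority of 200-nt strands are complete, and essentially none of the 2000-nt strands, which is consistent with assembling long genes from shorter oligos.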
Homework Question from George Church:
What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?
The 10 essential amino acids in animals are: histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, valine, and arginine [7].
“The lysine contingency is intended to prevent the spread of the animals in case they ever got off the island. Dr. Wu inserted a gene that creates a single faulty enzyme in protein metabolism. The animals can’t manufacture the amino acid lysine. Unless they’re continually supplied with lysine by us, they’ll slip into a coma and die [8].” —Ray Arnold
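Because lysine is already on this list, no animal can synthesize it and all animals obtain it from their diet, so the contingency as described would not contain animals that can eat lysine-containing food. It still highlights how an engineered dependency can enhance biological containment, provided the required compound is unavailable outside the intended environment. I proposed this technique in my biotechnological application as a containment method, since it stops the uncontrolled growth of Lactobacillus by designing it to depend on a component that is absent outside the environment for which it is intended.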
Bibliography
[1] M. Sahni and E. S. Day, “Nanotechnologies for the detection and treatment of endometriosis,” Front. Biomater. Sci., vol. 2, Nov. 2023, doi: 10.3389/fbiom.2023.1279358.
[3] I. Jovčevska and S. Muyldermans, “The Therapeutic Potential of Nanobodies,” BioDrugs Clin. Immunother. Biopharm. Gene Ther., vol. 34, no. 1, pp. 11–26, Feb. 2020, doi: 10.1007/s40259-019-00392-z.
[4] J. F. Knudtson et al., “Overexpression of CD44 is involved in the development of the early endometriotic lesion,” Fertil. Steril., vol. 110, no. 4, p. e390, Sep. 2018, doi: 10.1016/j.fertnstert.2018.07.1090.
[5] J. V. Garmendia, C. V. De Sanctis, M. Hajdúch, and J. B. De Sanctis, “Endometriosis: An Immunologist’s Perspective,” Int. J. Mol. Sci., vol. 26, no. 11, p. 5193, May 2025, doi: 10.3390/ijms26115193.
[6] A. Piovesan, M. C. Pelleri, F. Antonaros, P. Strippoli, M. Caracausi, and L. Vitale, “On the length, weight and GC content of the human genome,” BMC Res. Notes, vol. 12, no. 1, p. 106, Feb. 2019, doi: 10.1186/s13104-019-4137-z.
This DNA gel art was designed in the style of Paul Vanouse’s Latent Figure Protocol. I chose to create the letter “P” as it is the initial of my name, Paula. To achieve this, I used Ronan’s website, which was a helpful tool for quickly iterating on the designs and determining the best enzyme combinations to form the silhouette of the letter.
Part 3: DNA Design Challenge
3.1 Choose your protein
Hydrophobin HFBI from Trichoderma reesei: I chose this protein because I will be participating in a summer research program at Aalto University focused on bio-based foams and mycelium-derived materials. Hydrophobins are proteins naturally produced by fungi and play an important role in fungal growth, particularly in modifying surface properties and mediating interactions at air–water interfaces. These characteristics are directly relevant to mycelium-based biomaterials, where fungal networks interact with substrates to form structured materials with tunable mechanical properties.
Hydrophobin HFBI from Trichoderma reesei AA sequence:
3.4. What technologies could be used to produce this protein from your DNA? Describe in your words how the DNA sequence can be transcribed and translated into your protein.
Cell-Dependent Recombinant Protein Expression: After chemically synthesizing the DNA sequence encoding HFBI, the gene is inserted into a plasmid vector using DNA assembly methods such as Gibson Assembly or restriction enzyme-based cloning methods. Then, the recombinant plasmid is introduced into a host organism, which uses its own transcription and translation machinery to express the protein as the cells grow.
Cell-Free Protein Expression: In this method, instead of using living cells, the DNA sequence encoding HFBI is added to a reaction mixture containing ribosomes, enzymes, nucleotides, amino acids, and energy sources extracted from cells, so that transcription and translation occur directly in vitro [1]. Compared with in vivo techniques based on bacterial or tissue culture cells, in vitro protein expression is considerably faster because it does not require gene transfection, cell culture, or extensive protein purification [2].
5.1 DNA Read
(i) What DNA would you want to sequence (e.g., read) and why?
I would sequence the synthetic bispecific nanobody construct designed to bind CD44 and block IL-17 signaling. Sequencing would allow me to verify that the DNA was synthesized correctly, confirm the absence of mutations, and ensure the construct is suitable for expression in the probiotic host before experimental use.
(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?
I would use next-generation sequencing (NGS), specifically Illumina sequencing, to analyze the engineered construct. Illumina platforms provide high accuracy, relatively low cost per base, and are well suited for short constructs such as nanobody genes as well as targeted panels of inflammatory genes.
1. Is your method first-, second- or third-generation or other? How so?
This method is considered second-generation sequencing because it relies on massively parallel sequencing of many short DNA fragments simultaneously after amplification on a flow cell.
2. What is your input? How do you prepare your input (e.g. fragmentation, adapter ligation, PCR)? List the essential steps.
The input would be DNA extracted from the engineered plasmid containing the nanobody gene.
Essential preparation steps:
DNA extraction and purification
Fragmentation
Adapter ligation to both ends of DNA fragments
PCR amplification to generate sufficient material
Loading onto the sequencing flow cell
3. What are the essential steps of your chosen sequencing technology, how does it decode the bases of your DNA sample (base calling)?
Essential steps:
DNA fragments bind to complementary oligos on a flow cell.
Clonal amplification creates clusters of identical molecules.
Fluorescently labeled nucleotides are incorporated one base at a time.
A camera detects the fluorescent signal after each cycle.
The color detected corresponds to a specific base (A, T, C, or G), which allows the sequence to be reconstructed digitally. This process is called base calling.
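The base-calling idea in the steps above can be illustrated with a toy example: for each cycle, the channel with the strongest signal determines the base. The intensity values are invented for illustration, and real instruments use far more elaborate image analysis and quality models than this sketch.

```python
def call_bases(cycles):
    """Return the base with the highest intensity in each sequencing cycle.

    `cycles` is a list of dicts mapping each base to a fluorescence
    intensity for one cluster (made-up values in the example below).
    """
    return "".join(max(channels, key=channels.get) for channels in cycles)


# Hypothetical intensities for one cluster over three cycles.
cycles = [
    {"A": 0.91, "C": 0.03, "G": 0.04, "T": 0.02},
    {"A": 0.05, "C": 0.02, "G": 0.88, "T": 0.05},
    {"A": 0.02, "C": 0.90, "G": 0.03, "T": 0.05},
]

read = call_bases(cycles)
print(read)
```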
4. What is the output of your chosen sequencing technology?
The output is a large dataset of short DNA reads in digital format (FASTQ files), which can then be aligned to reference sequences or assembled to confirm the construct sequence.
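A FASTQ record has four lines: an identifier starting with `@`, the sequence, a `+` separator, and a quality string with one character per base. The sketch below reads such records and converts the quality characters to Phred scores, assuming the common Phred+33 encoding and unwrapped four-line records; the two reads are made up.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        scores = [ord(char) - 33 for char in quality]  # Phred+33 decoding
        yield header[1:], sequence, scores


# Hypothetical two-read file held in memory for the example.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII5#\n")
records = list(read_fastq(example))

for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

Reads parsed this way could then be aligned to the designed construct to confirm its sequence.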
5.2 DNA Write
(i) What DNA would you want to synthesize (e.g., write) and why?
I would synthesize a genetic construct encoding a bispecific nanobody that can anchor to CD44 receptors while simultaneously blocking IL-17 inflammatory signaling. This construct would be expressed in a probiotic Lactobacillus strain to create a localized therapeutic system for endometriosis.
The goal is to combine targeted binding with immune modulation to reduce inflammation and lesion growth without systemic side effects.
The DNA construct would include:
Promoter for bacterial expression
Secretion signal peptide
Anti-CD44 nanobody domain
Flexible linker
Anti-IL-17 nanobody domain
Terminator sequence
(ii) What technology or technologies would you use to perform this DNA synthesis and why?
I would use phosphoramidite chemical DNA synthesis combined with gene assembly, such as the synthesis services provided by companies like Twist Bioscience.
This method allows precise control of nucleotide sequence and is scalable for custom gene design.
1. What are the essential steps of your chosen sequencing methods?
Essential steps:
Chemical synthesis of short oligonucleotides
Assembly of oligos into full gene fragments
Error correction and cloning into plasmid vectors
Sequence verification
Delivery as plasmid DNA
3. What are the limitations of your sequencing method (if any) in terms of speed, accuracy, scalability?
Error rates increase with longer sequences
Cost increases with length and complexity
Repetitive or GC-rich regions can be difficult to synthesize
Turnaround time can vary depending on design complexity
5.3 DNA Edit
(i) What DNA would you want to edit and why?
I would edit the genome of a probiotic Lactobacillus strain to stably express the therapeutic bispecific nanobody. Genome integration would improve stability compared to plasmid-based expression and reduce the need for antibiotic selection.
This could enable long-term therapeutic delivery directly at mucosal surfaces.
(ii) What technology or technologies would you use to perform these DNA edits and why?
I would use CRISPR-Cas9 genome editing because it allows precise insertion of DNA sequences into specific genomic locations with relatively high efficiency.
1. How does your technology of choice edit DNA? What are the essential steps?
CRISPR-Cas9 uses a guide RNA to direct the Cas9 enzyme to a specific DNA sequence. Cas9 creates a double-strand break at that location. The cell’s repair machinery then inserts the desired DNA sequence using a repair template provided by the researcher.
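Guide design starts by locating candidate target sites. The sketch below assumes the commonly used Streptococcus pyogenes Cas9, which requires an NGG PAM immediately 3' of a 20-nt protospacer; the text above does not specify the Cas9 variant, so this is an assumption. It scans only the given strand, does no off-target scoring, and uses a made-up sequence.

```python
import re


def find_protospacers(sequence: str, guide_length: int = 20):
    """Return (start, protospacer, pam) for every NGG PAM site on the given strand."""
    sequence = sequence.upper()
    # Lookahead so that overlapping candidate sites are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{guide_length}}})([ACGT]GG))")
    return [
        (match.start(), match.group(1), match.group(2))
        for match in pattern.finditer(sequence)
    ]


# Hypothetical target region: 20-nt protospacer followed by a TGG PAM.
target = "ATGCATGCATGCATGCATGC" + "TGG" + "AAA"
hits = find_protospacers(target)

for start, protospacer, pam in hits:
    print(start, protospacer, pam)
```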
2. What preparation do you need to do (e.g. design steps) and what is the input (e.g. DNA template, enzymes, plasmids, primers, guides, cells) for the editing?
Inputs would include:
Guide RNA targeting the desired genomic site
Cas9 enzyme or expression plasmid
Donor DNA template containing the nanobody gene
Host bacterial cells
3. What are the limitations of your editing methods (if any) in terms of efficiency or precision?
Off-target edits may occur
Editing efficiency can vary between organisms
Delivery of CRISPR components into cells can be challenging
Integration success may require screening multiple clones
Your task this week is to Create a Python file to run on an Opentrons liquid handling robot.
1. Generate an artistic design using the GUI at opentrons-art.rcdonovan.com.
2. Using the coordinates from the GUI, follow the instructions in the HTGAA26 Opentrons Colab to write your own Python script which draws your design using the Opentrons.
Post-Lab Questions — DUE BY START OF FEB 24 LECTURE
1. Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.
2. Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.
Week 4 HW: Protein Design part I
Part A. Conceptual Questions
1. How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
Muscle without external fat cover is composed of approximately 70% water, 20% protein, and 9% fat (the exact values vary depending on the animal source) [1]. Therefore, 500 g of meat provides 100 g of protein. Since proteins are chains of amino acids, once digested they break down into individual amino acid molecules. We are told the average molecular weight of an amino acid is ~100 Daltons, which means its molar mass is 100 g/mol.
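Converting grams of protein to moles of amino acids: 100 g ÷ 100 g/mol = 1 mol, and 1 mol contains about 6.022 × 10²³ molecules (Avogadro’s number).
A 500 g piece of meat therefore contains approximately 6 × 10²³ amino acid molecules.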
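The same conversion can be written as a short calculation. This is a minimal sketch using the 20% protein fraction and the ~100 g/mol average molar mass stated above; the function and parameter names are illustrative.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_molecules(meat_g: float,
                         protein_fraction: float = 0.20,
                         aa_molar_mass: float = 100.0) -> float:
    """Estimate the number of amino acid molecules in a piece of meat."""
    protein_g = meat_g * protein_fraction   # grams of protein
    moles = protein_g / aa_molar_mass       # moles of amino acids
    return moles * AVOGADRO


molecules = amino_acid_molecules(500)
print(f"{molecules:.2e} amino acid molecules")
```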
2. Why do humans eat beef but do not become a cow, eat fish but do not become fish?
Biological identity is not preserved through digestion because the human body breaks down the macromolecules we consume into universal building blocks.
Proteins undergo hydrolysis of their peptide bonds by proteases in the stomach and small intestine, releasing amino acids, the same 20 standard amino acids used by all living organisms. Once absorbed, they enter the bloodstream and are either incorporated into new human proteins according to the sequence information encoded in human DNA or further metabolized [2].
3. Why are there only 20 natural amino acids?
There are more than 20 amino acids in nature. However, the standard genetic code is restricted to 20 amino acids because this set provides a balance between structural diversity and metabolic efficiency.
Chemical coverage: The set of 20 amino acids covers the range of hydrophobicity, charge, and molecular size required for complex protein folding [2], for example:
Charge: Positive (Lysine) and Negative (Glutamate).
Polarity: Hydrophilic (Serine) vs. Hydrophobic (Leucine).
Specialized Shapes: Small (Glycine), Rigid (Proline), and Bulky (Tryptophan).
Frozen Accident: This theory, proposed by Francis Crick, states that “the code is universal because at the present time any change would be lethal, or at least very strongly selected against.” Any attempt by an organism to ‘recode’ or add a new amino acid today would trigger a proteome-wide failure, as it would disrupt the sequence of every existing protein simultaneously [3].
Error minimization: Research suggests that our genetic code is ‘one in a million’ in its ability to ensure that a single-point mutation likely results in a chemically similar amino acid, thereby preserving the protein’s overall structure and function [4].
4. Can you make other non-natural amino acids? Design some new amino acids.
Non-natural amino acids are synthesized compounds that differ from the standard set. They can be made by organic synthesis and incorporated into proteins via engineered orthogonal translation systems. By modifying side-chain chemistry, we can expand the functional diversity of proteins beyond the constraints of the canonical genetic code, enabling novel catalytic, structural, and responsive properties.
All amino acids share the same basic backbone:
An amino group (–NH₂)
A carboxyl group (–COOH)
A hydrogen
A variable side chain (R group)
Attached to the same α-carbon
So to design new amino acids, we keep the backbone and modify the R group to give new chemical properties.
Example: Photoresponsive amino acid with the following side chain:
R = –CH₂–C₆H₄–N=N–C₆H₅
This side chain contains an azobenzene group, which can switch between trans and cis conformations when exposed to different wavelengths of light.
Part B: Protein Analysis and Visualization
1. Briefly describe the protein you selected and why you selected it.
Elevated levels of IL-17 have been observed in patients during the early stages of the disease. This pro-inflammatory cytokine promotes the proliferation, invasion, and implantation of endometriotic cells by triggering the construction of new blood vessel networks. Blocking IL-17 not only reduces inflammation but also interrupts the development of the blood supply these lesions require to survive and persist outside the uterine cavity.
2. Identify the amino acid sequence of your protein.
How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
The most frequent amino acid in 8B7W_1 is glycine.
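Counting residue frequencies takes only a few lines. The sequence of 8B7W_1 is not reproduced here, so the sketch below uses a short made-up sequence purely to show the method; the function name is illustrative.

```python
from collections import Counter


def most_frequent_residue(sequence: str):
    """Return (residue, count, sequence_length) for the most common amino acid."""
    counts = Counter(sequence.upper())
    residue, count = counts.most_common(1)[0]
    return residue, count, len(sequence)


example_sequence = "QVQLGGGSGGAASGG"  # hypothetical sequence, not 8B7W_1
residue, count, length = most_frequent_residue(example_sequence)
print(f"Length: {length}; most frequent: {residue} ({count} occurrences)")
```

Replacing `example_sequence` with the FASTA sequence from the RCSB entry would give the length and most frequent residue of the actual protein.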
How many protein sequence homologs are there for your protein?
250 homologs
Does your protein belong to any protein family?
It is part of the single-domain antibody family.
3. Identify the structure page of your protein in RCSB
When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)
The structure was deposited on October 3, 2022 and made publicly available on December 28, 2022. It was solved using X-ray diffraction.
Resolution: 2.85 Å
This is in the moderate quality range: close to, but slightly worse than, the 2.70 Å threshold for a good structure.
Are there any other molecules in the solved structure apart from protein?
Yes, 8B7W is a protein-protein complex containing:
IL-17A: pro-inflammatory cytokine from Homo sapiens
Anti-IL-17A-76: nanobody derived from Lama glama (Llama)
Both proteins were expressed in Escherichia coli and the structure contains mutations.
Does your protein belong to any structure classification family?
Classification: IMMUNE SYSTEM/INHIBITOR
4. Open the structure of your protein in any 3D molecule visualization software:
I used PyMOL to analyze the three-dimensional structure of 8B7W, a protein complex formed by interleukin-17A (IL-17A) bound to an anti-IL-17A nanobody. Since my main interest lies in the anti-IL-17A nanobody, I visualized only this structure, referred to in the PDB entry as “chain H”, and hid “chain B”. This allowed a clearer examination of its structural features, including its secondary structure, residue distribution, and potential interaction regions involved in antigen recognition.
Visualize the protein as:
Cartoon: simplifies the protein structure and highlights the secondary structure elements.
Ribbon: follows the protein backbone and helps visualize the trajectory of the polypeptide chain and how the secondary structure elements are arranged.
Ball and stick: shows atoms as spheres and chemical bonds as sticks. This makes it possible to see individual amino acids, atomic interactions, and contacts between residues.
Color the protein by secondary structure. Does it have more helices or sheets?
Red → α-helices
Yellow → β-sheets
Green → loops
The nanobody is dominated by β-sheets. This is typical of antibody variable domains, which adopt an immunoglobulin fold (many β-sheets and long loops). Here, this fold plays an important role in forming the antigen-binding interface with IL-17A.
Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?
When visualizing the surface of the nanobody, one region looks almost like a missing portion of the structure. This happens because the antigen is hidden in the visualization: the region is the interface where the nanobody interacts with IL-17A in the complex.
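The visualization steps above can be captured as a short PyMOL script so that they are reproducible. The sketch below only builds the command lines as text; the helper name `build_pymol_script` is hypothetical, and the chain letters are the ones reported in this write-up (chain H for the nanobody). The commands themselves (`fetch`, `hide`, `show`, `color`, and the `ss` selector) are standard PyMOL commands.

```python
def build_pymol_script(pdb_id="8B7W", keep_chain="H"):
    """Return PyMOL command lines that isolate one chain and style it.

    Hypothetical helper: it only assembles text, so it runs without PyMOL.
    """
    sel = f"chain {keep_chain}"
    return [
        f"fetch {pdb_id}",                    # download the entry from the PDB
        "hide everything",                    # hides the other chain as well
        f"show cartoon, {sel}",               # secondary-structure view
        f"color red, {sel} and ss h",         # helices
        f"color yellow, {sel} and ss s",      # beta strands
        f"color green, {sel} and ss l+''",    # loops and unassigned residues
        f"show surface, {sel}",               # surface view to look for pockets
        "set transparency, 0.5",              # keep the cartoon visible underneath
    ]


script = build_pymol_script()
print("\n".join(script))
```

Saving the printed lines to a `.pml` file and opening it in PyMOL should reproduce the cartoon, coloring, and surface views; swapping `show cartoon` for `show ribbon` or `show sticks` gives the other two representations.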
Part C: Using ML-Based Protein Design Tools
In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein.
C1. Protein Language Modeling
1. Deep Mutational Scans
a. Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
The ESM2 score represents how evolutionarily likely it is for a given amino acid to appear at a given position, based on millions of protein sequences.
Y-axis: Represents the 20 standard amino acids that could mutate at each position in the sequence.
X-axis: Represents the positions of the amino acids in the input protein sequence.
Yellow (high score) → ESM2 considers this mutation evolutionarily plausible.
This amino acid has appeared at this position in similar sequences.
The protein likely tolerates this mutation.
Blue (low score) → ESM2 considers this mutation evolutionarily improbable.
This amino acid rarely appears at this position.
The mutation likely damages the protein.
b. Can you explain any particular pattern? (choose a residue and a mutation that stands out)
Positions 20, 88, 92 and 94 show a vertical dark blue pattern (low score), indicating that these positions don't tolerate mutations. This suggests that the wild-type residues are critical.
Two horizontal lines also stand out for their dark blue pattern; they correspond to W (tryptophan) and C (cysteine). This suggests that introducing these residues at any position is unfavorable. This makes sense because cysteine can form spurious disulfide bonds that disrupt folding, and tryptophan is bulky enough to cause steric clashes in most sequence contexts. In a lighter shade of blue, methionine (M) and histidine (H) also show moderately negative scores across many positions, likely reflecting their more specialized chemical properties and lower natural abundance in protein sequences.
Position 113 shows predominantly positive scores across many substitutions, indicating that this position is highly tolerant to mutation. This suggests that the identity of the amino acid at this position matters little structurally; it probably sits in a flexible loop or solvent-exposed region.
The C-terminal region exhibits more yellow (high-score) cells, suggesting that the C-terminal end of the nanobody is more tolerant of mutations.
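To make the heatmap reading concrete, here is a minimal sketch of how such scores can be computed from per-position log-probabilities. It assumes the common log-likelihood-ratio rule, score = log p(mutant) − log p(wild type); the exact rule used by the course notebook is not stated here, so treat this as an assumption. The probabilities in the example are toy values, not ESM2 output, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dms_scores(log_probs, wild_type):
    """scores[i][aa] = log p(aa at position i) - log p(wild-type residue at i)."""
    scores = []
    for pos_log_probs, wt in zip(log_probs, wild_type):
        wt_lp = pos_log_probs[wt]
        scores.append({aa: pos_log_probs[aa] - wt_lp for aa in AMINO_ACIDS})
    return scores


def mean_tolerance(scores, wild_type):
    """Average score of the 19 non-wild-type substitutions at each position."""
    out = []
    for pos_scores, wt in zip(scores, wild_type):
        others = [s for aa, s in pos_scores.items() if aa != wt]
        out.append(sum(others) / len(others))
    return out


# Toy input (NOT real model output): position 1 strongly prefers C,
# position 2 has no preference at all.
conserved = {aa: math.log(0.005) for aa in AMINO_ACIDS}
conserved["C"] = math.log(0.905)
flat = {aa: math.log(0.05) for aa in AMINO_ACIDS}

wild_type = "CA"
tolerance = mean_tolerance(dms_scores([conserved, flat], wild_type), wild_type)
print(tolerance)  # strongly negative at position 1, zero at position 2
```

Under this rule, a dark vertical stripe in the heatmap corresponds to a position with a very negative mean score, and a tolerant position such as 113 corresponds to a mean near or above zero.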
2. Latent Space Analysis
a. Use the provided sequence dataset to embed proteins in reduced dimensionality.
b. Analyze the different formed neighborhoods: do they approximate similar proteins?
Yes, I identified a cluster containing different types of proteins from bacteria such as Mycobacterium tuberculosis, Pseudomonas aeruginosa, and Thermus thermophilus.
c. Place your protein in the resulting map and explain its position and similarity to its neighbors.
I think the database is limited and does not contain enough nanobody sequences: the cluster where my protein sits is surrounded by proteins from different animals and even by IL-12. I would have expected to see other VHH sequences from llama.
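One way to check whether a protein's neighbors on the map are genuinely similar is to rank them in the original embedding space, since 2D projections can distort distances. The sketch below uses cosine similarity on toy 2D vectors; the labels and numbers are made up for illustration and are not taken from the provided dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(query, embeddings, k=3):
    """Return the k labels whose embeddings are most similar to the query."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings, for illustration only.
toy_embeddings = {
    "VHH_a": [0.9, 0.1],
    "VHH_b": [0.8, 0.2],
    "cytokine": [0.5, 0.5],
    "bacterial_enzyme": [0.1, 0.9],
}
print(nearest_neighbors([1.0, 0.0], toy_embeddings, k=2))
```

If the nearest neighbors of the nanobody in the full embedding space are also unrelated proteins, that would support the idea that the dataset simply lacks other VHH sequences.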
C2. Protein Folding
1. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
Yes, it is very similar to the structure I saw in PyMOL: the β-barrel structure and the loops are visible.
2. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?
My protein appears resilient to point mutations: although I introduced some mutations that looked unfavorable in the ESM2 mutation scan heatmap, I didn't see drastic changes in the predicted structure. The β-barrel is still there, as are the loops.
These are the changes I made:
Position 20: L → E
Position 65: K → C
Position 94: Y → R
There are probably small changes, but since I couldn't compare the structures side by side, they still appear similar to me.
However, if I remove the first ten amino acids, the structure changes dramatically. This is probably because the first amino acids of a nanobody are almost identical across variants.
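The sequence edits described above can be scripted so that the input to ESMFold is reproducible. This is a minimal sketch with hypothetical helper names; it checks that the stated wild-type residue is really at the stated position before substituting, which catches off-by-one numbering errors. The example uses a toy sequence, not the 8B7W nanobody sequence.

```python
def apply_mutations(sequence, mutations):
    """Apply point mutations written like 'L20E' (wild type, 1-based position, new residue)."""
    residues = list(sequence)
    for mutation in mutations:
        wt, pos, new = mutation[0], int(mutation[1:-1]), mutation[-1]
        found = residues[pos - 1]
        if found != wt:
            raise ValueError(f"{mutation}: expected {wt} at position {pos}, found {found}")
        residues[pos - 1] = new
    return "".join(residues)


def truncate_n_terminus(sequence, n):
    """Remove the first n residues."""
    return sequence[n:]


toy = "ACDEFGHIKL"  # toy sequence for illustration
print(apply_mutations(toy, ["C2W", "K9E"]))
print(truncate_n_terminus(toy, 3))
```

For the nanobody, the same call would take the real chain H sequence with `["L20E", "K65C", "Y94R"]`, and `truncate_n_terminus(sequence, 10)` for the N-terminal deletion.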
C3. Protein Generation
Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN
1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.
2. Input this sequence into ESMFold and compare the predicted structure to your original.
The predicted structure is similar but not identical: it appears to have the same overall structure, but folds in a slightly different way.
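For the sequence comparison in step 1, a simple number to report is the sequence recovery: the fraction of positions where the designed sequence matches the original. The sketch below assumes both sequences have the same length, which holds when the design is built on the same backbone; the function names are hypothetical and the example sequences are toy values.

```python
def sequence_identity(designed, original):
    """Fraction of positions where the designed residue equals the original one."""
    if len(designed) != len(original):
        raise ValueError("sequences must have equal length (same backbone)")
    matches = sum(d == o for d, o in zip(designed, original))
    return matches / len(original)


def differing_positions(designed, original):
    """List (1-based position, original residue, designed residue) for every mismatch."""
    return [
        (i + 1, o, d)
        for i, (d, o) in enumerate(zip(designed, original))
        if d != o
    ]


original = "ACDEFGHIKL"  # toy sequences for illustration
designed = "ACDQFGHIRL"
print(sequence_identity(designed, original))
print(differing_positions(designed, original))
```

Listing the mismatches also shows whether the redesigned positions fall in the loops or in the β-strands, which may help explain small differences in the refolded structure.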
Part D. Group Brainstorm on Bacteriophage Engineering
Choose one or two main goals from the list that you think you can address computationally
Goal: Higher toxicity of lysis protein
Sub-goal: Eliminate the dependence of the L protein on the host chaperone DnaJ (E. coli) to accelerate and enhance bacterial lysis.
Context
The L protein of bacteriophage MS2 lyses the bacterium Escherichia coli. The authors identify that this process depends on the host chaperone DnaJ, specifically through an interaction with its C-terminal domain.
Using mutants, the study demonstrates that the P330Q mutation in DnaJ blocks the lytic capacity of the L protein at certain temperatures.
In the absence of interaction with DnaJ, the N-terminal domain of L interferes with its ability to bind to its unknown target.
The discovery of variants called Lodj mutants revealed that the N-terminal domain of the L protein is dispensable and is responsible for generating this chaperone dependence.
Tools/approaches
ESM2 for deep mutational scanning (DMS): we want to identify residues in the N-terminal domain that can be removed or mutated without disrupting folding or destabilizing the protein, while maintaining the critical residues that are essential for lytic function.
ESMFold for structural prediction: predict the structure of the designed variants.
ProteinMPNN for inverse folding: once we find a stable structure for the L protein we will use ProteinMPNN to generate new sequences that might fold more efficiently.
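Before scoring anything, the candidate variants have to be enumerated. As a sketch of the "residues that can be removed" idea, the snippet below generates N-terminal truncations of a sequence. Keeping the initial methionine as the start residue and the `max_removed` limit are both assumptions made for illustration; the real boundary of the dispensable N-terminal domain is not specified here, and the example is a toy sequence rather than the L protein.

```python
def n_terminal_truncations(sequence, max_removed, step=1):
    """Yield (n_removed, variant), dropping n residues right after the start Met."""
    if not sequence.startswith("M"):
        raise ValueError("expected the sequence to start with methionine")
    for n in range(step, max_removed + 1, step):
        yield n, "M" + sequence[1 + n:]


toy = "MKTAYIAKQR"  # toy sequence for illustration
for n_removed, variant in n_terminal_truncations(toy, max_removed=3):
    print(n_removed, variant)
```

Each variant could then be passed to ESMFold, and only those that keep a plausible fold would move on to ProteinMPNN.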
Why these tools?
Since the N-terminal domain of L is intrinsically disordered and difficult to fold without DnaJ, ESM2 can help identify favorable mutations to improve stability.
ESMFold would help validate whether a drastic mutation causes the rest of the protein to collapse.
ProteinMPNN can help redesign the sequence of the lysis protein so that it folds more efficiently.
Pitfalls
Bacterial cells could lyse too early, before enough phage particles have been produced, resulting in very low final phage titers. This could lead to resistance to the phages.
The interaction between the N-terminal domain and DnaJ confers some stability on the L protein. If we eliminate this dependency, there is a possibility that the redesigned protein becomes unstable and is degraded before reaching the membrane.
---
title: Pipeline schematic
---
graph TD
A[WT L-protein sequence] --> B[ESM2 deep mutational scan]
B --> |Identify favorable mutations| C[ESMFold structural prediction]
C --> |Validate stability| D[ProteinMPNN inverse folding]
D --> |New sequences independent from DnaJ and with more efficient folding| E[AF2-Multimer co-fold with DnaJ]
E --> |Verify loss of interaction| F[Candidate variants for lab]
AI Prompts:
What could be the reason there are horizontal dark blue lines in W and C, assigning low score to that residues? M and H also have low score but a bit higher than the others, why could be the chemical reason behind this?
[2] D. L. Nelson and M. M. Cox, Lehninger Principles of Biochemistry, 8th ed. New York, NY, USA: W.H. Freeman/Macmillan Learning, 2021, ch. 18, pp. 695–750.
[3] G. K. Philip and S. J. Freeland, “Did evolution select a nonrandom ‘alphabet’ of amino acids?,” Astrobiology, vol. 11, no. 3, pp. 235–240, Apr. 2011, doi: 10.1089/ast.2010.0567.
[4] F. H. C. Crick, “The origin of the genetic code,” J. Mol. Biol., vol. 38, no. 3, pp. 367–379, Dec. 1968, doi: 10.1016/0022-2836(68)90392-6.
[5] S. J. Freeland and L. D. Hurst, “The Genetic Code Is One in a Million,” J. Mol. Evol., vol. 47, no. 3, pp. 238–248, Sep. 1998, doi: 10.1007/PL00006381.
Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc.
Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.
Your challenge:
Design short peptides that bind mutant SOD1.
Then decide which ones are worth advancing toward therapy.
You will use three models developed in our lab:
PepMLM: target sequence-conditioned peptide generation via masked language modeling
2. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.
3. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.
4. Record the perplexity scores that indicate PepMLM’s confidence in the binders.
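For context on the recorded scores: pseudo-perplexity is usually defined as the exponential of the negative mean log-probability that the model assigns to each residue when that residue is masked. Lower values mean the model finds the peptide more plausible. The sketch below shows only this arithmetic on toy log-probabilities; how the PepMLM notebook obtains the per-residue values is not reproduced here.

```python
import math


def pseudo_perplexity(token_log_probs):
    """exp(-mean log-probability) over the masked-residue predictions."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy example: a 12-residue peptide where every residue gets probability 0.1.
toy_log_probs = [math.log(0.1)] * 12
print(pseudo_perplexity(toy_log_probs))  # close to 10
```

A pseudo-perplexity near 10 therefore means the model is, on average, about as uncertain as if it were choosing among 10 equally likely residues at each position.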
Part 2: Evaluate Binders with AlphaFold3
2. Navigate to the AlphaFold Server and for each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.
One of the generated peptides contained an “X” residue representing an unspecified amino acid. For AlphaFold modeling, I replaced this position with alanine to allow structure prediction.
3. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?
Peptide #0 (WRYPVVAGRLKK)
N-terminus/A4V site: The peptide binds far from the N-terminus where A4V sits, almost on the opposite side.
β-barrel/dimer interface: The peptide localizes to the face opposite the dimerization surface, away from the free loop termini that would normally contact the second monomer.
Surface-bound or buried: It appears surface-bound, not inside the β-barrel or the loops.
Peptide #1 (KRVPVVAAAHWK)
N-terminus/A4V site: The peptide is binding at the other side of the flat surface from the β-barrel.
β-barrel/dimer interface: Since only a single SOD1 monomer is shown in the model, the exact dimer interface cannot be directly observed. However, the peptide appears to be aligned with the flat surface of the β-barrel.
Surface-bound or buried: It appears surface-bound, not inside the β-barrel or the loops but closer to the surface of SOD1 A4V.
Peptide #2 (WSYPAAGGKWWA)
N-terminus/A4V site:
β-barrel/dimer interface:
Surface-bound or buried:
Peptide #3 (WRYYVVAGKWGE)
N-terminus/A4V site:
β-barrel/dimer interface:
Surface-bound or buried:
Peptide #4 (FLYRWLPSRRGG)
N-terminus/A4V site:
β-barrel/dimer interface:
Surface-bound or buried:
4. In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder
| # | Peptide | Pseudo-perplexity | ipTM |
|---|---|---|---|
| 0 | WRYPVVAGRLKK | 14.84 | 0.29 |
| 1 | KRVPVVAAAHWK | 9.49 | 0.45 |
| 2 | WSYPAAGGKWWA | 16.20 | 0.47 |
| 3 | WRYYVVAGKWGE | 20.02 | 0.39 |
| 4 | FLYRWLPSRRGG | 22.36 | 0.41 |
ipTM measures the accuracy of the predicted relative positions of the subunits within the complex. Values higher than 0.8 represent confident, high-quality predictions, while values below 0.6 suggest a likely failed prediction.
All the ipTM values are below 0.6, ranging from 0.29 to 0.47, suggesting that the predicted complexes may not represent reliable interactions. Nevertheless, compared with the known SOD1-binding peptide FLYRWLPSRRGG (ipTM = 0.41), peptide #2 (ipTM = 0.47) and peptide #1 (ipTM = 0.45) slightly exceed the value of the known binder. These results suggest that, while the predicted interactions are weak, some generated peptides show comparable or slightly higher interface scores than the known binder.
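The comparison above can be reproduced with a few lines of code. The ipTM values are the ones recorded in this write-up; the label for the range between 0.6 and 0.8 is my own placeholder, since only the two outer thresholds are described above.

```python
IPTM = {
    "WRYPVVAGRLKK": 0.29,
    "KRVPVVAAAHWK": 0.45,
    "WSYPAAGGKWWA": 0.47,
    "WRYYVVAGKWGE": 0.39,
    "FLYRWLPSRRGG": 0.41,  # known SOD1 binder
}
KNOWN_BINDER = "FLYRWLPSRRGG"


def confidence_label(iptm):
    """Map an ipTM value onto the thresholds quoted in the text."""
    if iptm > 0.8:
        return "confident"
    if iptm < 0.6:
        return "likely failed"
    return "ambiguous"  # placeholder label for the 0.6-0.8 range (assumption)


def beats_known(results, known):
    """Generated peptides whose ipTM exceeds the known binder's, best first."""
    baseline = results[known]
    better = [p for p, score in results.items() if p != known and score > baseline]
    return sorted(better, key=results.get, reverse=True)


print(beats_known(IPTM, KNOWN_BINDER))
print({p: confidence_label(s) for p, s in IPTM.items()})
```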
Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse
Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:
Paste the peptide sequence.
Paste the A4V mutant SOD1 sequence in the target field.
Check the boxes
Predicted binding affinity
Solubility
Hemolysis probability
Net charge (pH 7)
Molecular weight
Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties?
Choose one peptide you would advance and justify your decision briefly.
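Two of the listed properties, molecular weight and net charge, can be sanity-checked by hand. The sketch below uses standard average residue masses and a deliberately crude charge estimate that counts K and R as +1 and D and E as −1, ignoring histidine and the termini. PeptiVerse may use a different method, so small differences from its output are expected.

```python
# Average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once for the free N- and C-termini


def molecular_weight(peptide):
    """Average molecular weight of a linear, unmodified peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def crude_net_charge(peptide):
    """Integer charge estimate near pH 7: (K + R) - (D + E); His and termini ignored."""
    positive = sum(peptide.count(aa) for aa in "KR")
    negative = sum(peptide.count(aa) for aa in "DE")
    return positive - negative


for pep in ["WRYPVVAGRLKK", "KRVPVVAAAHWK", "WSYPAAGGKWWA", "WRYYVVAGKWGE"]:
    print(pep, round(molecular_weight(pep), 1), crude_net_charge(pep))
```

All four generated peptides come out positively charged under this estimate, which is worth keeping in mind when reading the solubility and hemolysis predictions.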
Part 4: Generate Optimized Peptides with moPPIt
Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.
Open the moPPIt Colab linked from the HuggingFace moPPIt model card
Make a copy and switch to a GPU runtime.
In the notebook:
Paste your A4V mutant SOD1 sequence.
Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
Set peptide length to 12 amino acids.
Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.
After generation, briefly describe how these moPPIt peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?
Part C: Final Project: L-Protein Mutants
High-level summary: The objective of this assignment is to improve the stability and autonomous folding of the lysis protein of the MS2 phage. This mechanism is key to understanding how phages could help address antibiotic resistance.
Week 7 HW: Genetic circuits part II
Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)
1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?
Boolean functions are limited to discrete on/off states, while IANNs can process analog signals and therefore carry more information. Real-world phenomena are analog: inside a cell there is inherent molecular noise, and Boolean circuits are fragile to it, especially at low signal concentrations.
Boolean functions can only handle simple logical relationships (AND, OR, etc.) between inputs. IANNs, through weighted connections and nonlinear activation functions, are capable of solving problems that are not linearly separable [1].
IANNs have potential for adaptability and unsupervised learning. There is a principle known as “neurons that fire together, wire together”:
“This means that the strength of the connection between neurons changes based on how often they are activated. When a connection between two neurons is activated frequently, its weight increases and vice-versa: when the activation is less frequent, the weight weakens.” [1]
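The point about linear separability can be illustrated with the classic XOR example: no single weighted sum followed by a threshold can compute XOR, but a network with one hidden layer can. The sketch below is purely mathematical, with weights chosen by hand; it is not a model of any particular genetic implementation.

```python
def step(x):
    """Threshold activation: fires (1) only when the weighted input is positive."""
    return 1 if x > 0 else 0


def perceptron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs plus bias, then threshold."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1, x2):
    """Two-layer network: hidden OR and AND units feed one output unit."""
    h_or = perceptron((x1, x2), (1, 1), -0.5)    # fires if at least one input is on
    h_and = perceptron((x1, x2), (1, 1), -1.5)   # fires only if both inputs are on
    return perceptron((h_or, h_and), (1, -1), -0.5)  # OR and not AND


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

In a cell, the negative weight on the AND unit would correspond to a repressing interaction, which is the kind of role an endoribonuclease can play in the perceptron diagrams of this assignment.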
2. Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.
3. Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation.
Draw a diagram for an intracellular multilayer perceptron where layer 1 outputs an endoribonuclease that regulates a fluorescent protein output in layer 2.
Assignment Part 2: Fungal Materials
1. What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?
2. What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?
Bibliography
[1] A. Halužan Vasle and M. Moškon, “Synthetic biological neural networks: From current implementations to future perspectives,” BioSystems, vol. 237, p. 105164, Feb. 2024, doi: 10.1016/j.biosystems.2024.105164.