1. “An application of synthetic biology that I recently found out about, and I’m really excited for, is partial cellular reprogramming.”
It’s achieved by inducing the expression of the so-called “Yamanaka factors”, which push a cell toward a pluripotent state; during this process, DNA methylation patterns and chromatin architecture are reset to a younger state, a “rejuvenation” that has been reported to improve health and lifespan. (1, 2) Recently, a treatment for optic neuropathies based on this mechanism was cleared by the FDA for its first clinical trial; it consists of an AAV2 vector that carries the Oct4, Sox2, and Klf4 (OSK) factors. Systemic doxycycline administration is needed for almost two months to activate OSK expression. (3) What if we could engineer a self-regulated genetic circuit that uses biosensors to detect aging or disease markers, such as transcription factors that drive the senescence-associated secretory phenotype (SASP), and then activates itself and begins rewiring? To ensure non-maleficence, the system would incorporate a failsafe kill switch, a safety module that induces apoptosis if the cell loses its differentiated identity or if markers of full pluripotency, such as Nanog, are detected, in order to reduce the risk of cancer. Before a “safe-by-design” technology like this is approved and made available to the general public, some important regulations must be in place.
2. The policy goals
Biosafety:
Genetic safety and vectors: Ensure that reprogramming does not result in oncogenesis and that the vectors used (such as AAV2) do not cause insertional mutagenesis or unforeseen infections (given their viral nature).
Responsible usage: Restrict access to reprogramming vectors to authorized hospitals to prevent self-administration, and prevent the genetic circuit from being altered or replicated at home without adequate security controls. No one should be able to modify or hack the genetic circuit or the vectors.
Public accessibility and equity:
Universal access: Everyone should be able to access such treatment if required, so there should be financing or insurance programs that guarantee it.
Inclusive standards: Treatment must work across global genetic diversity, being effective regardless of ethnic background.
Transparency: Patients should be taught the nature of the treatment applied to them and informed of its possible risks.
3. The potential actions and the actors:
Actors: Researchers, Medics, Industry, Government and Patients
1. Enhance safety by standard design:
Main cast: Researchers, Industry
Researchers must develop a genetic circuit design that induces apoptosis whenever there is a risk of oncogenesis or of full dedifferentiation; industry must standardize that design to reduce risk.
We are assuming that the pluripotency indicators (such as Nanog) are completely accurate.
Failure here results in the formation of teratomas or tumors.
2. Funding to ensure equity and access:
Main cast: Government, Medics, Patients
The government must finance treatment for patients that need it but can’t afford it (through funding the program through taxes on purely cosmetic longevity treatments, for example) and incentivize the creation of genomics databases of different ethnic groups. Medics must contribute to the genomic database by taking samples while doing service in the mentioned groups.
We assume that the cost of such treatment decreases as time passes and advances are made in the field.
Failure here results in discrimination (economic and/or ethnic) and leaves people without support.
3. Development of transgene control systems
Main cast: Medics, Researchers
Researchers should develop sophisticated control systems, not only in the form of a synthetic chemical compound (drug) that inhibits the circuit by marking its proteins for degradation, but also in an inhalation-mediated inhibition mechanism with a volatile compound that triggers an inducible promoter that codes for a regulatory protein that instantly degrades the Yamanaka factors. Physicians should prescribe this drug to stop any unwanted side effects that some patients might experience (oncogenesis).
We assume that the transgene control systems are effective and fast enough to reverse a dedifferentiation process before it becomes irreversible or tumorous.
Failure here could result in untreatable dysregulation of the system.
4. Scoring the options
Does the option: | Option 1 | Option 2 | Option 3
Enhance Biosecurity
• By preventing incidents | 1 | - | -
• By helping respond | - | - | 1
Foster Lab Safety
• By preventing incidents | - | - | -
• By helping respond | - | - | -
Protect the environment
• By preventing incidents | - | - | -
• By helping respond | - | - | -
Other considerations
• Minimizing costs and burdens to stakeholders | 3 | 1 | 3
• Feasibility? | 1 | 3 | 2
• Not impede research | 1 | 2 | 1
• Promote constructive applications | 3 | 2 | 3
5. What to prioritize
I believe that options 1 and 3 are the most important in this case, so I would recommend developing a standard circuit design that is self-regulating and also has an external regulation mechanism, in case a physician judges it necessary to stop the process in a patient. I would suggest that the FDA make this kind of approach a prerequisite for any autonomous reprogramming system intended for human use.
6. Bibliography
Takahashi & Yamanaka (2006) Induction of Pluripotent Stem Cells from Mouse Embryonic and Adult Fibroblast Cultures by Defined Factors. Cell 126, 663–676. http://dx.doi.org/10.1016/j.cell.2006.07.024
Schmidt & Plath (2012) The roles of the reprogramming factors Oct4, Sox2 and Klf4 in resetting the somatic cell epigenome during induced pluripotent stem cell generation. http://genomebiology.com/2012/13/10/251
Macip et al. (2024) Gene Therapy-Mediated Partial Reprogramming Extends Lifespan and Reverses Age-Related Changes in Aged Mice. http://dx.doi.org/10.1089/cell.2023.0072
Life Biosciences (2026) Evaluating ER-100 for Safety in People With Glaucoma or Non-Arteritic Anterior Ischemic Optic Neuropathy (Optic Nerve Conditions). https://clinicaltrials.gov/study/NCT07290244
7. Searches:
Google: “Life Biosciences”
Gemini:
“I would like you to clarify the wording of the introduction to the synthetic biology application I am going to discuss in my assignment.”
“I read these two papers, Takahashi & Yamanaka (2006) and Schmidt & Plath (2012). They discuss the OSK factors. Have there been advances in therapies developed with them? What are the most recent advances?”

Week 2 HW: Read, write and edit DNA
Preparation for Week 2 lecture:
About next generation gene synthesis
What is the error rate of polymerase?
1:10^6
How does this compare to the length of the human genome?
The human genome consists of about 3.2 giga base pairs, so at a rate of one error per 10^6 bases, roughly 3,200 errors would be expected when a single polymerase copies an entire human genome. However, since human metabolic pathways are at most about 10 kbp long, the expected number of errors is negligible in most contexts.
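The estimate above can be reproduced with a few lines of arithmetic. This is only a minimal sketch: the function and variable names are illustrative, and the 10 kbp pathway size is the upper bound quoted in the answer.

```python
# Expected number of replication errors = sequence length x per-base error rate.
ERROR_RATE = 1e-6      # polymerase error rate from the lecture (1 in 10^6)
GENOME_BP = 3.2e9      # approximate human genome size in base pairs
PATHWAY_BP = 10_000    # upper bound used above for a metabolic pathway


def expected_errors(length_bp: float, error_rate: float = ERROR_RATE) -> float:
    """Return the expected number of misincorporated bases for one copy."""
    return length_bp * error_rate


genome_errors = expected_errors(GENOME_BP)
pathway_errors = expected_errors(PATHWAY_BP)
print(f"whole genome: about {genome_errors:.0f} errors per copy")
print(f"10 kbp pathway: about {pathway_errors:.2f} errors per copy")
```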
How does biology deal with that discrepancy?
There are many polymerases and the genome is divided into chromosomes, so the chance of any single polymerase introducing errors is low; there are also DNA mismatch repair systems (the MutS repair system).
How many different ways are there to code for an average human protein?
Considering that the average human protein is encoded by 1036 bp (about 345 codons) and that the genetic code is degenerate, the same protein can be written with different codons that encode the same amino acid. With roughly three synonymous codons per amino acid on average, there are on the order of 3^345 (about 10^164) different ways.
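To make "a lot of different ways" concrete, the count can be computed for any protein by multiplying the number of synonymous codons of each residue. The sketch below assumes the standard genetic code and ignores the stop codon; the function names and the three-residue example are hypothetical.

```python
import math

# Number of synonymous codons per amino acid in the standard genetic code
# (the 61 sense codons; stop codons are not counted).
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(protein: str) -> int:
    """Exact number of DNA sequences that encode the given protein."""
    return math.prod(CODONS_PER_AA[aa] for aa in protein.upper())


def log10_encodings(protein: str) -> float:
    """log10 of the count, convenient for long proteins."""
    return sum(math.log10(CODONS_PER_AA[aa]) for aa in protein.upper())


# Hypothetical three-residue peptide: Leu-Ser-Arg, each with 6 codons.
print(count_encodings("LSR"))
# Rough estimate for 345 codons with ~3 synonymous codons each (an assumption).
print(345 * math.log10(3))
```

For a real protein, pass its amino acid sequence to `log10_encodings` instead of relying on the average of three codons per residue.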
In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?
This is primarily because of base pairing within the sequence: depending on the codons chosen, some stretches will be complementary to others, and the resulting mRNA can form secondary structures such as hairpins, which can stall or terminate translation before the full protein is obtained.
There is also the error rate.
From DNA Synthesis Development and Application:
What’s the most commonly used method for oligo synthesis currently? Phosphoramidite method
Why is it difficult to make oligos longer than 200nt via direct synthesis? An accumulation of errors due to the error rate of chemical synthesis
Why can’t you make a 2000bp gene via direct oligo synthesis? Because the error rate makes it, probabilistically speaking, almost impossible to achieve. So instead gene assembly is used.
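The effect of error accumulation can be illustrated with a simple yield model. This is a sketch under a stated assumption: every coupling step succeeds independently with the same efficiency, and the 99% value is a placeholder rather than a measured figure for any particular synthesizer.

```python
def full_length_fraction(length_nt: int, coupling_efficiency: float = 0.99) -> float:
    """Fraction of molecules that reach full length.

    A chain of length_nt nucleotides needs length_nt - 1 coupling steps,
    each assumed to succeed independently with the given efficiency.
    """
    if length_nt < 1:
        raise ValueError("length_nt must be at least 1")
    return coupling_efficiency ** (length_nt - 1)


for n in (20, 200, 2000):
    print(n, f"{full_length_fraction(n):.3g}")
```

Under this assumption only a small fraction of a 200 nt synthesis is full length, and for 2000 nt the fraction is effectively zero, which is why long genes are assembled from shorter oligos.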
From reading and writing life:
What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?
10 essentials: Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Threonine, Tryptophan, Valine, Arginine
The lysine contingency would work well if dinosaurs were as easy to isolate from external resources as bacteria in a petri dish. After all, essential amino acids are those an animal can’t produce itself but obtains by eating other organisms that contain them. It would be very difficult to cut off every other source of lysine in the environment, so I would recommend another approach.
Google Search: “Lysine Contigency”
Week 2 assignments:
(Part 2 consisted of “Gel Art - Restriction of Digests and Gel Electrophoresis”, which was intended to be carried out in the laboratory. Since I do not have access to one, I did not complete it.)
Using a Life 1 kb Plus ladder and different restriction enzymes (EcoRI, XhoI, SalI, EcoRV, NdeI, BamHI, HindIII and KpnI) on the Lambda phage genome, I tried to draw a Teletubby.
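A gel-art design depends on the fragment sizes each enzyme produces, which can be predicted by a simple in silico digest. The sketch below is not the tool used for the assignment; it is a minimal pure-Python version for a linear molecule, and the 12-base test sequence is a hypothetical stand-in for the Lambda genome, which would have to be loaded separately.

```python
# Recognition site and cut offset (bases from the start of the site, top strand).
SITES = {
    "EcoRI": ("GAATTC", 1), "XhoI": ("CTCGAG", 1), "SalI": ("GTCGAC", 1),
    "EcoRV": ("GATATC", 3), "NdeI": ("CATATG", 2), "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1), "KpnI": ("GGTACC", 5),
}


def cut_positions(seq: str, enzymes: list[str]) -> list[int]:
    """Return sorted top-strand cut positions for the chosen enzymes."""
    seq = seq.upper()
    cuts = set()
    for name in enzymes:
        site, offset = SITES[name]
        start = seq.find(site)
        while start != -1:
            cuts.add(start + offset)
            start = seq.find(site, start + 1)
    return sorted(cuts)


def digest(seq: str, enzymes: list[str]) -> list[int]:
    """Fragment lengths (largest first) for a linear molecule."""
    bounds = [0] + cut_positions(seq, enzymes) + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


toy = "AAAGAATTCTTT"  # hypothetical sequence with a single EcoRI site
print(digest(toy, ["EcoRI"]))
```

Each lane of the gel can then be planned by comparing the predicted fragment lengths against the ladder bands.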
POU5F1 protein DNA sequence
ATGGCGGGACACCTGGCTTCGGATTTCGCCTTCTCGCCCCCTCCAGGTGGTGGAGGTGATGGGCCAGGGGGGCCGGAGCCGGGCTGGGTTGATCCTCGGACCTGGCTAAGCTTCCAAGGCCCTCCTGGAGGGCCAGGAATCGGGCCGGGGGTTGGGCCAGGCTCTGAGGTGTGGGGGATTCCCCCATGCCCCCCGCCGTATGAGTTCTGTGGGGGGATGGCGTACTGTGGGCCCCAGGTTGGAGTGGGGCTAGTGCCCCAAGGCGGCTTGGAGACCTCTCAGCCTGAGGGCGAAGCAGGAGTCGGGGTGGAGAGCAACTCCGATGGGGCCTCCCCGGAGCCCTGCACCGTCACCCCTGGTGCCGTGAAGCTGGAGAAGGAGAAGCTGGAGCAAAACCCGGAGGAGTCCCAGGACATCAAAGCTCTGCAGAAAGAACTCGAGCAATTTGCCAAGCTCCTGAAGCAGAAGAGGATCACCCTGGGATATACACAGGCCGATGTGGGGCTCACCCTGGGGGTTCTATTTGGGAAGGTATTCAGCCAAACGACCATCTGCCGCTTTGAGGCTCTGCAGCTTAGCTTCAAGAACATGTGTAAGCTGCGGCCCTTGCTGCAGAAGTGGGTGGAGGAAGCTGACAACAATGAAAATCTTCAGGAGATATGCAAAGCAGAAACCCTCGTGCAGGCCCGAAAGAGAAAGCGAACCAGTATCGAGAACCGAGTGAGAGGCAACCTGGAGAATTTGTTCCTGCAGTGCCCGAAACCCACACTGCAGCAGATCAGCCACATCGCCCAGCAGCTTGGGCTCGAGAAGGATGTGGTCCGAGTGTGGTTCTGTAACCGGCGCCAGAAGGGCAAGCGATCAAGCAGCGACTATGCACAACGAGAGGATTTTGAGGCTGCTGGGTCTCCTTTCTCAGGGGGACCAGTGTCCTTTCCTCTGGCCCCAGGGCCCCATTTTGGTACCCCAGGCTATGGGAGCCCTCACTTCACTGCACTGTACTCCTCGGTCCCTTTCCCTGAGGGGGAAGCCTTTCCCCCTGTCTCCGTCACCACTCTGGGCTCTCCCATGCATTCAAAC
I went to NCBI GenBank and retrieved a POU5F1 mRNA transcript variant, 1409 bp long.
I learned that there are splice variants of the OCT4 gene that include some introns and can have a more pivotal role in the induction of stemness properties (Yazd et al., 2011).
4. Central Dogma of biology
Part 4: DNA Synthesis order, building my first plasmid
1. Finding an appropriate plasmid backbone:
Reading the Yang et al. (2023) article, I saw in their “Materials and Methods” section that they transfected pLVX-EF1alpha-2xGFP:NES-IRES-2xRFP:NLS to generate NCC-stable cells, so I thought it would be a good backbone for the assembly. This “shuttle vector” was originally described by Mertens et al. (2015) (AddGene ID: 71396).
Directly Reprogrammed Human Neurons Retain Aging-Associated Transcriptomic Signatures and Reveal Age-Related Nucleocytoplasmic Defects. Mertens J, Paquola AC, Ku M, Hatch E, Bohnke L, Ladjevardi S, McGrath S, Campbell B, Lee H, Herdy JR, Goncalves JT, Toda T, Kim Y, Winkler J, Yao J, Hetzer MW, Gage FH. Cell Stem Cell. 2015 Oct 6. pii: S1934-5909(15)00408-7. doi: 10.1016/j.stem.2015.09.001. PubMed 26456686
2. Assembly
I selected restriction enzymes BamHI and EcoRI to cut the GFP reporter out, and then I built my DNA insert sequence (POU5F1 gene for Oct4) with sticky ends ideal to connect with the backbone.
I incorporated a Kozak consensus sequence (GCCACC) upstream of the ATG instead of an RBS.
I also added a 7xHis tag at the C-terminus of the protein, before the stop codon.
The design leverages the backbone’s endogenous promoter and polyadenylation signal, so I didn’t incorporate those to the insert
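The insert layout described above can be sketched as a small string-assembly function. Several details here are assumptions made for illustration and are not stated in the design: BamHI is placed at the 5' end and EcoRI at the 3' end, histidine is encoded as CAC, the stop codon is TAA, and the short coding sequence in the example is hypothetical.

```python
BAMHI = "GGATCC"
ECORI = "GAATTC"
KOZAK = "GCCACC"          # Kozak consensus placed directly upstream of ATG
HIS_TAG = "CAC" * 7       # 7xHis tag (CAC chosen arbitrarily over CAT)
STOP_CODONS = {"TAA", "TAG", "TGA"}


def build_insert(cds: str, stop: str = "TAA") -> str:
    """Assemble BamHI + Kozak + CDS (no stop) + 7xHis + stop + EcoRI."""
    cds = cds.upper()
    if not cds.startswith("ATG") or len(cds) % 3 != 0:
        raise ValueError("CDS must start with ATG and be a multiple of 3")
    if cds[-3:] in STOP_CODONS:
        cds = cds[:-3]    # drop the native stop so the tag is in frame
    return BAMHI + KOZAK + cds + HIS_TAG + stop + ECORI


def internal_sites(cds: str) -> list[str]:
    """List cloning sites that also occur inside the CDS (these would be cut)."""
    cds = cds.upper()
    return [name for name, site in (("BamHI", BAMHI), ("EcoRI", ECORI)) if site in cds]


example = build_insert("ATGGCGTGA")  # hypothetical 3-codon CDS
print(example)
```

Running `internal_sites` on the real coding sequence before ordering is a quick way to check that neither enzyme cuts inside the gene.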
What DNA would you want to sequence (e.g., read) and why?
I would like to sequence the plasmid I modified (pLVX-EF1alpha-POU5F1-7xHis-IRES-mCherry) as a way of experimentally verifying that there were no mutations and that the junctions integrated properly.
In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?
I would use the “Oxford Nanopore” sequencing method because it would allow me to find accidental recombinations and deletions by being able to sequence the entire plasmid.
DNA Write
What DNA would you want to synthesize (e.g., write) and why?
I would like to synthesize de novo the entire modified POU5F1 (Oct4) cassette, because that way I can perform codon optimization and add the necessary elements (histidine tag and Kozak sequence) in the most efficient way.
What technology or technologies would you use to perform this DNA synthesis and why?
I would use silicon-based DNA synthesis on microchips. Given that the Oct4 cDNA is approximately 1 kb, I really appreciate the ability to synthesize thousands of oligonucleotides in parallel with extremely high accuracy and at a much lower cost than traditional column synthesis.
DNA Edit
What DNA would you want to edit and why?
I would like to edit the genome of the plasmid recipient cells so that they possess a specific locus where the POU5F1 cassette can be inserted without activating oncogenes or disrupting vital genes.
What technology or technologies would you use to perform these DNA edits and why?
I would use Prime Editing because it would allow me to edit a small portion of the genome with considerable precision and turn it into a specific recognition site. Since it uses a reverse transcriptase (fused to a Cas9 nickase) to write the new sequence directly into the DNA, without relying on double-strand breaks or homology-directed repair, it is also suited to non-dividing cells such as neurons or resting fibroblasts.
Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications
Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.
Well, of the three projects I’ve proposed, I’ll start with the second one, “Nostoc’s smart gummies”.
The objective here is to characterize the sensitivity of heavy metal biosensors in Nostoc colonies and standardize the globular shape for the production of “gummies”.
Cloud Lab Workflow (Ginkgo Nebula)
Echo: Precise gradient transfer of heavy metal solutions (lead and arsenic) at parts-per-billion (ppb) concentrations to Nostoc culture plates.
Multiflo: Dispensing of culture media optimized for cyanobacteria growth (BG-11).
PHERAstar: Measurement of the color intensity (absorbance) of reporter chromoproteins to establish the biosensor calibration curve.
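As a starting point for the gradient step, the transfer volumes could be planned with a short script like the one below. Every number in it is an assumption for illustration: the 1000 ppb stock, the 50 µL final well volume, the target concentrations, and the 2.5 nL droplet increment should all be replaced with the values of the actual instrument and protocol.

```python
def echo_transfer_nl(target_ppb: float, stock_ppb: float,
                     final_volume_nl: float, droplet_nl: float = 2.5) -> float:
    """Stock volume (nL) to transfer, rounded to whole droplets."""
    if target_ppb > stock_ppb:
        raise ValueError("target concentration exceeds the stock concentration")
    ideal_nl = target_ppb / stock_ppb * final_volume_nl
    return round(ideal_nl / droplet_nl) * droplet_nl


def gradient_plan(targets_ppb: list[float], stock_ppb: float,
                  final_volume_nl: float) -> dict[float, float]:
    """Map each target concentration (ppb) to a transfer volume (nL)."""
    return {t: echo_transfer_nl(t, stock_ppb, final_volume_nl) for t in targets_ppb}


# Hypothetical lead gradient: 0 to 50 ppb in 50 uL (50,000 nL) wells.
plan = gradient_plan([0, 1, 5, 10, 50], stock_ppb=1000, final_volume_nl=50_000)
for target, volume in plan.items():
    print(f"{target:>4} ppb -> {volume:.1f} nL of stock")
```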
Week 4 HW: Protein design - Part 1
Notes about key concepts for deeper understanding
About ESM
Key Concepts to Know:
Evolutionary Scale Model (ESM): A model that generates likelihood values for the existence of an amino acid at a specific position within a protein sequence. It is essentially a Masked Language Modeling (MLM) architecture.
Deep Mutational Scanning (DMS): A 2D map representing the probability values generated by the ESM for every possible mutation in a sequence.
Latent Space: Upon inputting a FASTA sequence, the language model adds start and end tokens, generating a high-dimensional vector (320 dimensions in the case of the recitation model). This vector classifies the protein; when projected into a 3D space, it creates a map that clusters the protein alongside others from a provided database based on functional or family similarities.
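The relation between the masked-language-model likelihoods and the mutational scan map can be sketched without the model itself. In the snippet below the per-position logits are placeholders (in practice they would come from ESM with the position masked), and the scoring rule, log-probability of the mutant minus log-probability of the wild type, is one common convention assumed here rather than taken from the recitation notebook.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def dms_scores(wild_type: str, logits_per_position: list[list[float]]) -> list[list[float]]:
    """scores[i][j] = log p(amino acid j at i) - log p(wild type at i)."""
    scores = []
    for wt, logits in zip(wild_type, logits_per_position):
        logp = log_softmax(logits)
        wt_logp = logp[AMINO_ACIDS.index(wt)]
        scores.append([lp - wt_logp for lp in logp])
    return scores


# Placeholder logits for a one-residue "protein": alanine strongly preferred.
toy_scores = dms_scores("A", [[2.0] + [0.0] * 19])
print(toy_scores[0][:3])
```

Negative scores mark substitutions the model considers less likely than the wild-type residue; plotting the matrix gives the heatmap.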
Steps for 3D Protein Structure Prediction with ESM-2:
Likelihood Generation: We utilize the fundamental property of the ESM (Masked Language Modeling) to calculate amino acid probabilities based on the sequence context.
Attention Mechanism: These data points are processed as Queries (Q), Keys (K), and Values (V). This generates an Attention Matrix, where vectors for each amino acid indicate the relevance and position of other residues in the chain.
Structural Extraction (Simple ESM): The raw attention matrix is symmetrized to ensure physically plausible distances and refined using Average Product Correction (APC) to eliminate correlation noise. Through regression, a 2D contact map is obtained.
Model Optimization: This map is compared against structural databases (like the PDB) to adjust the model. This feedback loop updates the model’s weights, maximizing the log-likelihood of real-world protein structures found in nature.
Enrichment via MLP: In parallel, the correlation vectors from the attention matrix are processed by a Multi-Layer Perceptron (MLP). This generates enriched vectors that describe the physicochemical implications of the attention data.
3D Prediction (ESM-2/ESMFold): In the ESM-2 architecture, these MLP-enriched vectors are fed into a “Folding Trunk” module. This module applies learned rules to predict the 3D structure, including bond angles and atomic coordinates.
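The symmetrization and Average Product Correction mentioned in the structural extraction step can be written in a few lines. This is a plain-Python sketch on nested lists, assuming symmetrization by adding the matrix to its transpose; the toy matrices are hypothetical and real attention maps would be tensors.

```python
Matrix = list[list[float]]


def symmetrize(m: Matrix) -> Matrix:
    """Make the map symmetric by adding it to its transpose."""
    n = len(m)
    return [[m[i][j] + m[j][i] for j in range(n)] for i in range(n)]


def apc(m: Matrix) -> Matrix:
    """Average Product Correction: m[i][j] - row_sum[i] * col_sum[j] / total."""
    n = len(m)
    row = [sum(m[i]) for i in range(n)]
    col = [sum(m[i][j] for i in range(n)) for j in range(n)]
    total = sum(row)
    return [[m[i][j] - row[i] * col[j] / total for j in range(n)] for i in range(n)]


# A background that is a pure product of per-residue terms is removed entirely.
background = [[1.0, 2.0], [2.0, 4.0]]
print(apc(background))
print(symmetrize([[0.0, 1.0], [3.0, 0.0]]))
```

The correction subtracts the part of each entry explained by the overall "loudness" of its row and column, so what remains is the pair-specific signal used for the contact map.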
About AlphaFold and MSA
Key Concepts to Know:
Multiple Sequence Alignments (MSA): The different dialects of the tree of life for the same protein
For this part I used the transcription factor Klf4 and performed a Deep Mutational Scan, a Latent Space Analysis, a protein folding exercise, and an inverse-folding exercise.
Subsections of Week 4 HW: Protein design - Part 1
Conceptual questions
How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)
A: Averaging out the nutritional composition of common meats (like the data from Pereira & Vicente, 2013), meat is about 21.25% protein by weight. If you eat a 500-gram steak, you are consuming roughly 106.25 grams of pure protein. Since 1 Da corresponds to 1 g/mol, an average amino acid weighing 100 Da has a molar mass of 100 g/mol. Dividing the total protein (106.25 g) by the average amino acid mass (100 g/mol) gives 1.0625 moles of amino acids. Multiplying that by Avogadro’s number (6.022e+23) gives approximately 6.398e+23 molecules of amino acids.
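The same calculation as a short script, so the inputs can be changed easily. The default values are the ones used in the answer above (21.25% protein and 100 g/mol per amino acid); the function name is illustrative.

```python
AVOGADRO = 6.022e23  # molecules per mole (rounded, as in the answer above)


def amino_acid_molecules(meat_g: float, protein_fraction: float = 0.2125,
                         aa_molar_mass: float = 100.0) -> float:
    """Number of amino acid molecules in a piece of meat."""
    protein_g = meat_g * protein_fraction
    moles = protein_g / aa_molar_mass
    return moles * AVOGADRO


molecules = amino_acid_molecules(500)
print(f"{molecules:.3e}")
```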
Why do humans eat beef but do not become a cow, eat fish but do not become fish?
A: Because exogenous DNA is either degraded before consumption or degraded by our own enzymes during the digestion process, it cannot be processed to alter our cells. So, the only materials our body takes are the disassembled amino acids; then, it rebuilds them into human proteins, following our human genome.
Why are there only 20 natural amino acids?
A: Certainly, there are only 20 amino acids in Choanoflagellatea, and this set of amino acids provides just enough chemical diversity (acidic, basic, hydrophobic, and hydrophilic shapes) to build complex and versatile proteins. However, I believe it’s a little bit biased to say they are the only “natural” ones, because there are also species of methanogenic archaea that use other amino acids, like selenocysteine and pyrrolysine.
Can you make other non-natural amino acids? Design some new amino acids.
A: Yes, it’s possible to synthesize new amino acids with synthetic biology or organic chemistry. You could also find them as non-canonical amino acids, like 2-azatyrosine, 3-azatyrosine, or azetidine-2-carboxylic acid.
Where did amino acids come from before enzymes that make them, and before life started?
A: It’s believed that they formed completely abiotically, as was proven in the famous Miller-Urey experiment, where they showed that if you mix simple gases present on early Earth (like methane, ammonia, and water) and strike them with electricity (simulating lightning), amino acids form naturally and spontaneously.
If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
A: Left. Because D-amino acids are enantiomers of our natural L-amino acids, the most energetically stable way for them to fold is also completely mirrored.
Why are most molecular helices right-handed?
A: This conformation is much more energetically favorable because it minimizes atomic collisions (steric hindrance) between side chains and the backbone of the L-amino acids.
Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?
A: Because the unpaired hydrogen bond donors and acceptors of a beta strand attain a highly stable conformation by bonding with another strand, reaching a lower energy state. The driving forces are hydrogen bonds and hydrophobic interactions.
Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?
A: Because amyloid diseases are characterized by protein denaturation or misfolding. In this process the proteins expose hydrophobic regions, so their strands end up packing together and “zippering up” into tightly packed cross-β-sheet structures. Yes: because of their stability and self-assembly capacity, amyloid β-sheets could be used in diverse materials such as nanowires, hydrogels, and scaffolds.
References:
Pereira, P. M. de C. C., & Vicente, A. F. dos R. B. (2013). Meat nutritional composition and nutritive role in the human diet. Meat Science, 93(3), 586–592. https://doi.org/10.1016/j.meatsci.2012.09.018
I chose Sox2 primarily because it is the next protein among the Yamanaka factors and it forms a SOX2/OCT4 complex on DNA; OCT4 is the protein I selected for the Week 2 homework.
Regarding the Sox2 amino acid sequence:
Sequence length and composition: This protein is 317 amino acids long. Based on the frequency analysis, Serine (S) is the most frequent amino acid, appearing 36 times (11.36% of the sequence), followed closely by Glycine (G) with 35 occurrences (11.04%).
Homology and BLAST results: When including “predicted” and “homology” evidence levels, the search reached the limit of 250 results in the UniProt BLAST tool. This high number of hits reflects that Sox-2 is an extremely conserved protein across species. However, for high-confidence analysis, I primarily considered the results from the reviewed Swiss-Prot database.
Family Classification: SCOP results showed that Sox-2 contains a High Mobility Group (HMG) box structural domain located between residues 206-282. This domain is characterized by its L-shaped triple α-helix fold that binds to the minor groove of DNA. While it shares this structural “HMG-box” fold with other architectural proteins, Sox-2 functionally belongs to the SOX transcription factor family. Despite recognizing the same DNA consensus motif (5'-(A/T)(A/T)CAA(A/T)G-3'), different Sox factors achieve target gene selectivity through differential affinity for particular flanking sequences next to consensus Sox sites, homo- or heterodimerization among Sox proteins, posttranslational modifications of Sox factors, or interaction with other co-factors (Wegner, 2010).
The structure was solved and published in 2003 by Reményi et al. in the journal Genes & Development (https://doi.org/10.1101/gad.269303).
Since the article is about the crystal structure of a POU/HMG/DNA ternary complex, the resolution obtained is 2.70 Å. This is not ideal for high-detail drug design, but I think it is considered good quality for analyzing protein-DNA ternary complexes.
As I mentioned before, the solved structure is often a ternary complex. Apart from the SOX2 protein, it includes the POU domain of the OCT4 protein and a specific DNA oligonucleotide (the FGF4 enhancer)
According to SCOP, it belongs to the HMG-box family within the all-alpha proteins structural class.
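The residue frequencies reported above for Sox2 (for example, 36 serines out of 317 residues) come from a simple composition count. A minimal sketch of such a count is shown below; the short sequence is a hypothetical placeholder, and the full Sox2 sequence from UniProt would be pasted in its place.

```python
from collections import Counter


def composition(seq: str) -> list[tuple[str, int, float]]:
    """Return (residue, count, percent) tuples, most frequent first."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return [(aa, n, 100 * n / total) for aa, n in counts.most_common()]


toy_sequence = "SSGA"  # hypothetical placeholder, not the Sox2 sequence
for residue, count, percent in composition(toy_sequence):
    print(residue, count, f"{percent:.2f}%")

# Percentages quoted in the text: 36 and 35 occurrences out of 317 residues.
serine_percent = round(100 * 36 / 317, 2)
glycine_percent = round(100 * 35 / 317, 2)
```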
In relation to the molecule visualization:
Since I ran into some strange errors while trying to use Conda, I used pip to import the PyMOL libraries instead
I used py3Dmol to visualize Sox2 in “cartoon” and “ball and stick” representations, and PyMOL for the “ribbon” view
To highlight the secondary structures, I added colors: red for alpha helices, blue for beta sheets, and grey for loops
I then used the following code to count the atoms and determine the predominant structure:
By coloring the hydrophilic residues blue, the hydrophobic residues orange, and others yellow, it becomes clear how they are distributed. You can notice that the hydrophilic residues tend to interact with the phosphates in the DNA chain, while the hydrophobic ones are positioned to fit within the DNA grooves.
As seen in the surface visualization, the “binding pockets” of transcription factors aren’t exactly “holes.” Instead, they act more like a contoured surface that forms “hooks” around the DNA backbone.
Using ML-Based Protein Design Tools
Protein Language Modeling
Deep Mutational Scans
At the beginning and in the middle of the protein there are abundant green/yellow zones, indicating that protein function is not likely to be compromised by a mutation in these parts. There are also many dark violet zones at the end of the sequence, indicating that the probability of a critical mutation increases in that region.
Looking at the Klf4 in the RCSB PDB page, the structural features of the last part of the sequence corresponds to the zinc fingers
So it makes sense that in a zoom in of the last part, the mutational scan looks like this
Latent Space Analysis
After running the code and analyzing the positions of the sequence tokens, I could find some clusters of similar proteins within the 3D depiction, such as these transcription factors, whose positions were almost constant along the TSNE2 axis.
Then I added the Klf4 protein sequence to the “sequences” variable in the colab
And modified some visualization components to highlight the Klf4
The neighbors of the protein were commonly also transcription factors.
There were many Homo sapiens general transcription factors, such as the d2dn5a1 seen in the image below.
There were also bacterial transcriptional regulators.
And, quite surprisingly, some functionally unrelated proteins, like this arsenite-translocating ATPase in E. coli.
Curiously, proteins that at first glance were supposed to be similar to Klf4 were relatively far away from it.
Protein Folding
After running ESMFold with the Klf4
the result was quite similar to the RCSB 3D view of the same protein
I made some point mutations at the end of the sequence (near the last alpha helix), replacing three amino acids with the last three prolines seen in the sequence below
The result at first glance appeared to be the same
But zooming in, you could see that the first turn of the last alpha helix was disrupted by the mutations
original
disrupted
Protein Generation (inverse-folding)
Using the code provided on this page, I inputted the PDB code for Klf4 (6VTX) to extract its sequence and generate a heatmap. I didn’t expect the resulting sequence to be this short compared to the original one
This heatmap is visually quite similar to the one we saw earlier. Indeed, while both the Deep Mutational Scan and this ProteinMPNN heatmap are probabilistic models that calculate conditional probabilities, their approaches are fundamentally different. The former uses a 1D array (as a masked language model) to predict an amino acid based on the sequence context of other amino acids. In contrast, ProteinMPNN uses a spatial graph to predict an amino acid based on the 3D structure of the protein, relying on the distances and angles between atoms.
Finally, I ran ESMFold using the newly generated sequence and obtained this 3D structure. Although it’s visibly shorter than the original, the overall fold looks quite similar, and the alpha helices that predominantly define the structure have been successfully preserved
Using the Colab from https://huggingface.co/ChatterjeeLab/PepMLM-650M, I selected a peptide length of 12, a top-k value of 3 (as the literature suggests this yields better and more flexible outcomes), and set the number of binders to 4.
While running the Colab, I included the provided ground truth binder (FLYRWLPSRRGG) for comparison. Interestingly, I noticed that the known SOD1-binding peptide in the fourth row has an elevated perplexity as calculated by the model.
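For reference, perplexity is the exponential of the average negative log-probability the model assigns to the peptide's residues, so a higher value means the model finds the peptide less expected. The sketch below shows only that formula; the per-residue probabilities are hypothetical and are not outputs of PepMLM.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-probability over the peptide positions."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical 12-residue peptide where every residue has probability 0.5.
uniform_half = [math.log(0.5)] * 12
print(perplexity(uniform_half))
```

A peptide whose residues each receive probability 0.5 has a perplexity of 2, and lower per-residue probabilities push the value up.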
2. Evaluate Binders with AlphaFold3
I submitted the protein sequence followed by each peptide. Since three of the four peptides had ‘X’ as their last amino acid, I replaced the X with a G (Glycine) to successfully submit the jobs to AlphaFold3.
The results were as follows:
Known binder FLYRWLPSRRGG: It binds to the surface near a non structured region and obtained an ipTM = 0.34.
None of the generated peptides appeared to bind near any terminus; all were surface-bound and mostly located in a loop from approximately position 65 to 78.
First generated binder WHYPAVAARWKX with an ipTM = 0.31:
Second generated binder HHYGAVALELKX with an ipTM = 0.45:
Third generated binder WLYPAVVAALKK with an ipTM = 0.24:
Fourth generated binder WRVPAAAVRHGX with an ipTM = 0.36:
The second and fourth generated peptides achieved ipTM scores higher than that of the known binding peptide.
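The comparison above can be reproduced with a short script that also applies the X-to-G substitution used for submission. The ipTM values are the ones reported above; the variable and function names are illustrative.

```python
KNOWN_BINDER = "FLYRWLPSRRGG"

# ipTM values reported above, keyed by the peptide as generated.
IPTM = {
    "FLYRWLPSRRGG": 0.34,
    "WHYPAVAARWKX": 0.31,
    "HHYGAVALELKX": 0.45,
    "WLYPAVVAALKK": 0.24,
    "WRVPAAAVRHGX": 0.36,
}


def submission_sequence(peptide: str) -> str:
    """Replace the unknown residue X with glycine before submitting."""
    return peptide.replace("X", "G")


ranked = sorted(IPTM.items(), key=lambda item: item[1], reverse=True)
better_than_known = [p for p, score in ranked if score > IPTM[KNOWN_BINDER]]

for peptide, score in ranked:
    print(f"{submission_sequence(peptide)}  ipTM={score:.2f}")
print("above known binder:", better_than_known)
```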
3. Evaluate properties of generated peptides in the PeptiVerse
I analyzed the results of the evaluation by PeptiVerse, considering predicted binding affinity, non-fouling predicted value, hemolysis, solubility, and net charge (pH 7).
FLYRWLPSRRGG: As the ground truth peptide, its properties are good, although its predicted binding affinity is considered “weak” by the model.
WHYPAVAARWKX: Has a relatively low predicted non-fouling value, and its binding affinity value diminished by 0.303 when changing X for G.
HHYGAVALELKX: When changing X for G, the binding affinity value diminished from 5.387 to 5.382, and its permeability value decreased from 0.125 to 0.41. Note that it wasn’t predicted to be permeable from the start.
WLYPAVVAALKK: It achieved the highest binding affinity among the generated and known binding peptides; however, its fouling value is too high, and it is not permeable.
WRVPAAAVRHGX: Has good properties and a binding affinity near that of the known binding peptide.
In AlphaFold, the second (HHYGAVALELKX) and fourth (WRVPAAAVRHGX) generated peptides had the highest ipTM. However, the PeptiVerse model considered the third generated peptide (WLYPAVVAALKK) and the known binding peptide (FLYRWLPSRRGG) to be the ones with the highest binding affinity.
4. Generate optimized peptides with moPPIt
I used the provided moPPIt Colab, limiting the optimization parameters so that it could run in a T4 GPU execution environment without running out of computational resources.
I obtained a single sequence (as generating four sequences exceeded the environment’s capacity) and analyzed it.
The AlphaFold 3 predicted ipTM of 0.34 was not considerably different from the non-optimized binders; however, the binding location changed significantly, appearing near the terminus this time.
The predicted structure shows that the peptide localizes near the C-terminus of the protein, within the dimer interface. Although it does not bind directly to the N-terminus, it interacts with it through the C-terminus beta-strand.
The PeptiVerse model predicted a tight binding affinity, but also a higher fouling value.
Bibliography:
Saeed, M., Yang, Y., Deng, H.-X., Hung, W.-Y., Siddique, N., Dellefave, L., Gellera, C., Andersen, P. M., & Siddique, T. (2009). Age and founder effect of SOD1 A4V mutation causing ALS. Neurology, 72(19), 1634–1639. https://doi.org/10.1212/01.wnl.0000343509.76828.2a
Week 6 HW: Genetic Circuits - Part 1
DNA assembly
1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?
While the principal components in every standard PCR reaction are a polymerase (like Taq), dNTPs, nuclease-free water, primers, buffer, and the DNA template to amplify, a Master Mix typically only contains the polymerase, dNTPs, and an optimized buffer (which includes Mg2+ as polymerases require magnesium to function). In the specific case of the Phusion High-Fidelity PCR Master Mix, we must note that the polymerase used is not Taq, but Phusion—a high-fidelity enzyme with 3’ to 5’ exonuclease (proofreading) activity. Its purpose is to amplify DNA with extreme accuracy and speed, preventing mutations during the assembly of sensitive genetic circuits.
2. What are some factors that determine primer annealing temperature during PCR?
First, the primer melting temperature (Tm). The Tm is the temperature at which 50% of the primer is bound to its complementary sequence, and the annealing temperature is typically set a few degrees below this. To calculate this melting point, we need to consider the length of the primer and its G-C content; G-C base pairs are held together by 3 hydrogen bonds each (versus 2 for A-T pairs), so more heat is needed to separate them.
Second, the buffer conditions (specifically salt and magnesium concentrations) are a factor, as they add stability to the DNA double-stranded form by shielding the negative charges of the phosphate backbone, hence increasing the Tm and the required annealing temperature.
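As a rough illustration of how length and G-C content feed into the Tm, the sketch below uses the Wallace rule (2 °C per A/T, 4 °C per G/C). This is only a back-of-the-envelope estimate for short oligos: it ignores salt and magnesium concentrations, and real Tm calculators use nearest-neighbor thermodynamics instead. The example primer and the 5 °C offset in `annealing_temp` are illustrative assumptions, not values from the protocol.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Wallace rule estimate: Tm = 2*(A+T) + 4*(G+C), in degrees Celsius."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def annealing_temp(tm: float, offset: float = 5.0) -> float:
    """Annealing temperature set a few degrees below the Tm (offset is an assumption)."""
    return tm - offset


# Hypothetical 18-nt primer, for illustration only.
primer = "ATGGCTAGCAAGGGCGAG"
tm = wallace_tm(primer)
print(gc_fraction(primer), tm, annealing_temp(tm))
```

Two primers of the same length give different estimates as soon as their G-C content differs, which is the effect described above.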
3. Compare and contrast PCR and restriction enzyme digest methods in terms of protocol and use preference situations.
Protocols: PCR relies on thermal cycling (denaturation, annealing, and extension) using oligonucleotide primers and a thermostable DNA polymerase to synthesize new copies of a specific DNA fragment. In contrast, a restriction enzyme digest is an isothermal incubation (often at 37°C) where an endonuclease cuts pre-existing, double-stranded DNA at specific recognition sites, followed by heat inactivation of the enzyme.
Use Preference: PCR is preferable when you need to amplify a highly specific region from a complex template, when your starting DNA concentration is very low, or when you need to add custom overhangs (like homology arms) to the ends of your sequence. Restriction enzyme digestion is preferable for diagnostic checks (verifying the size of a cloned insert), when preparing plasmids for traditional ligation, or when you must completely avoid the risk of polymerase-induced mutations in a sequence that is already cloned.
4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?
First, in the PCR procedure, you want to end with fragments that have overlapping terminal bases (typically 15-40 base pairs) complementary to those in the backbone and the ones in the adjacent fragments, in order for them to be assembled by homology.
Second, you can verify the processed fragments by gel electrophoresis: bands at the expected sizes indicate the correct products, whereas smears or unexpected bands point to degraded DNA or non-specific amplification. Finally, you can perform a gel extraction to purify these specific fragments away from the template DNA and non-specific PCR products, ensuring a clean reaction.
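Before ordering primers, the overlap requirement can also be checked in silico. The sketch below assumes every fragment is written 5' to 3' on the same strand and in assembly order; the function names and the example sequences are hypothetical, and the 15 bp minimum simply follows the range mentioned above.

```python
def shared_overlap(upstream: str, downstream: str, max_len: int = 40) -> int:
    """Length of the longest suffix of `upstream` that is also a prefix of `downstream`."""
    upstream, downstream = upstream.upper(), downstream.upper()
    limit = min(max_len, len(upstream), len(downstream))
    for n in range(limit, 0, -1):
        if upstream[-n:] == downstream[:n]:
            return n
    return 0


def gibson_ready(fragments: list, min_overlap: int = 15, circular: bool = True) -> bool:
    """True if every adjacent pair of fragments shares at least `min_overlap` bases."""
    pairs = list(zip(fragments, fragments[1:]))
    if circular:
        # For a plasmid, the last fragment must also overlap the first one.
        pairs.append((fragments[-1], fragments[0]))
    return all(shared_overlap(a, b) >= min_overlap for a, b in pairs)


# Made-up sequences: two fragments joined by two 20-nt homology regions.
ov1 = "ACGTTGCAAGCTTAGGCATC"
ov2 = "GGATCCTTAGACCGTAAGCT"
backbone = ov2 + "TTTTT" + ov1
insert = ov1 + "CCCCC" + ov2
print(gibson_ready([backbone, insert]))
```

A check like this only confirms that the ends match; it says nothing about secondary structure or repeats within the overlaps, which can still reduce assembly efficiency.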
5. How does the plasmid DNA enter E. coli cells during transformation?
Essentially, it enters through transient pores that stress opens in the cell envelope. The stress can be physical or chemical. The process is facilitated by cations, normally calcium ions from calcium chloride (CaCl2). These cations shield and neutralize the electrostatic repulsion between the negatively charged phosphate backbone of the plasmid DNA and the negatively charged lipopolysaccharides on the bacterial outer membrane. Once the DNA is resting on the cell surface, a heat shock creates a thermal imbalance that sweeps the DNA through the pores into the cytoplasm. Another physical method is electroporation, which opens pores with a brief high-voltage electrical pulse.
6. Describe another assembly method in detail (such as Golden Gate Assembly)
There are plenty of methods, such as Modular Cloning or Circular Polymerase Extension Cloning. However, most of them are based on either Gibson Assembly or Golden Gate Assembly, so I decided to explain Golden Gate Assembly:
This technique leverages the capacity of specific endonucleases, called Type IIS restriction enzymes, to recognize asymmetric DNA sequences and cleave outside of the recognition site. By cutting typically 1 to 20 base pairs away, these enzymes generate customizable overhangs. The orientation of the recognition sites dictates the assembly mechanics: on the inserts, the sites flank the fragment and point inward, so the cuts fall on the fragment side of each site and the released fragment no longer carries them; in the destination vector, the sites sit inside the region being excised and point outward, so they leave with that region. The goal is a one-pot assembly of multiple DNA fragments: the overhangs, which can be added via PCR primers, are designed to be complementary between adjacent parts, and T4 DNA ligase seals the construct. Simultaneous digestion and ligation works because the recognition sites are absent from the correctly ligated product. Consequently, the assembled construct is resistant to further digestion, while any re-ligated starting material is cut again, which drives the reaction toward the final product.
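To make the "cut outside the recognition site" idea concrete, the sketch below uses BsaI as an example Type IIS enzyme (the choice of enzyme is an assumption; the answer above does not name one). BsaI recognizes GGTCTC and cuts 1 nt downstream on the top strand and 5 nt downstream on the bottom strand, leaving a 4-nt overhang. The function names and the example sequence are hypothetical, and only sites on the forward strand are considered.

```python
BSAI_SITE = "GGTCTC"  # BsaI recognition sequence; cuts at N1 (top strand) / N5 (bottom strand)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def bsai_overhang(seq: str) -> str:
    """Return the 4-nt overhang left by the first forward-strand BsaI site in `seq`."""
    seq = seq.upper()
    i = seq.find(BSAI_SITE)
    if i == -1 or i + 11 > len(seq):
        raise ValueError("no forward BsaI site with room for the overhang")
    # Site occupies i..i+5, one spacer base at i+6, overhang at i+7..i+10.
    return seq[i + 7 : i + 11]


def overhangs_ok(overhangs: list) -> bool:
    """Overhangs must be unique, non-palindromic, and not complementary to each other."""
    seen = set()
    for oh in overhangs:
        oh = oh.upper()
        rc = reverse_complement(oh)
        if oh == rc or oh in seen or rc in seen:
            return False
        seen.add(oh)
    return True


# Made-up part: the BsaI site is followed by a spacer base and the designed overhang.
part = "AAGGTCTCAGGAGTTTT"
print(bsai_overhang(part))
```

Because the overhang lies outside the recognition site, its four bases can be chosen freely, which is what allows each junction in a multi-part assembly to get its own unique sticky end.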