Alexandra — HTGAA Spring 2026

About me

I am a third year student of genetics in Peru 🇵🇪 I’m interested in understanding biological phenomena from different approaches and multiple scales with the aim of developing comprehensive and effective treatments and technologies (mainly in chronic diseases) I enjoy doing a variety of things; lately I’ve been exploring mathematical modeling, music, and quantum physics. I love playing volleyball, cycling, and open water swimming.

Contact info

Weeks

Homework

Weekly homework submissions:

Week 1 HW: Principles and Practices
1. “An application of synthetic biology that I recently found out about, and I’m really excited for, is partial cellular reprogramming.” It’s achieved by inducing the expression of factors called “Yamanaka factors”, which enable a cell to regain a pluripotent state in which DNA methylation patterns and chromatin architecture are reset to a younger state in a “rejuvenation” process that enhances health and lifespan for individuals. (1, 2) Recently, a treatment intended to treat optic neuropathies based on this mechanism was approved by the FDA for its first clinical trials; this treatment consists of an AAV2 vector that carries the Oct4, Sox2, and Klf4 factors. Systemic doxycycline administration is needed for almost two months to activate OSK expression. (3) What if we could engineer a self-regulated genetic circuit that, by utilizing biosensors, detects aging or disease markers like transcription factors that enhance senescence-associated secretory phenotypes (SASPs), and thus activates itself and begins rewiring? To ensure non-malfeasance, the system incorporates a failsafe kill switch as a safety module that induces apoptosis if the cell loses its differentiated identity or if markers of full pluripotency, such as Nanog, are detected, to reduce the risk of cancer. By the time a “safe-by-design” technology like this becomes approved and available to the general public, there must be some important regulations upheld.
Week 2 HW: Read, write and edit DNA
Includes question solving as preparation
Week 3 HW: Lab automation
Using Python to run Opentrons liquid handling robot
Weeek 4 HW: Protein design - Part 1
Exploring the world of AI
Week 5 HW: Protein design - Part 2
Using the latest tools to develop binders, treatments and more
Week 6 HW: Genetic Cirucuits - Part 1
Learning about the juxtaposition of electrical engineering and biology, also getting to know creative tools and standard protocols of the field.
Week 07 HW: Genetic circuits II
Sorry, I will update this later, I have been really bussy this last two weeks
Week 9 HW: Cell-free systems
Putting the "engineering" in biology,
Week 10 HW: Advance imaging & measurement technology
Underscovered yet :D
Week 11 HW: Bioproduction adn cloud labs
Pixel art
Final project advance
Using the designing, synthesizing, and editing whole genomes content learned in "building genomes week"for my final project

Week 1 HW: Principles and Practices

1. “An application of synthetic biology that I recently found out about, and I’m really excited for, is partial cellular reprogramming.”

It’s achieved by inducing the expression of factors called “Yamanaka factors”, which enable a cell to regain a pluripotent state in which DNA methylation patterns and chromatin architecture are reset to a younger state in a “rejuvenation” process that enhances health and lifespan for individuals. (1, 2) Recently, a treatment intended to treat optic neuropathies based on this mechanism was approved by the FDA for its first clinical trials; this treatment consists of an AAV2 vector that carries the Oct4, Sox2, and Klf4 factors. Systemic doxycycline administration is needed for almost two months to activate OSK expression. (3) What if we could engineer a self-regulated genetic circuit that, by utilizing biosensors, detects aging or disease markers like transcription factors that enhance senescence-associated secretory phenotypes (SASPs), and thus activates itself and begins rewiring? To ensure non-malfeasance, the system incorporates a failsafe kill switch as a safety module that induces apoptosis if the cell loses its differentiated identity or if markers of full pluripotency, such as Nanog, are detected, to reduce the risk of cancer. By the time a “safe-by-design” technology like this becomes approved and available to the general public, there must be some important regulations upheld.

2. The policy goals

Biosafety:

Genetic safety and vectors: Ensure that the reprogramming does not result in oncogenesis and that the use of vectors (such as AAV2s) does not cause insertional mutagenesis or unforeseen infections (due to its viral nature).
Responsible usage: Restrict access to reprogramming vectors to authorized hospitals to prevent self-administration and prevent the genetic circuit from being altered or replicated at home without adequate security controls. There shouldn’t be people modifying or hacking the genetic circuit nor the vectors.

Public accessibility and equity:

Universal access: Everyone should be able to access such treatment if required, so there should be financing or insurance programs that guarantee it.
Inclusive standards: Treatment must work on a global genetics level, being effective regardless of ethnic diversity
Transparency: People should be taught the nature of things we apply on them and informed of possible risks

3. The potential actions and the actors:

Actors: Researchers, Medics, Industry, Government and Patients

1. Enhance safety by standard design:

Main cast: Researchers, Industry

The researchers must develop a genetic circuit design that must induce apoptosis whenever there is a risk of oncogenesis or a full dedifferentiation process; industry must standardize that design to diminish risk.
We are assuming that the pluripotency (nanong) indicators are completely accurate.
Failure here results in the formation of teratomas or tumors.

2. Funding to ensure equity and access:

Main cast: Government, Medics, Patients

The government must finance treatment for patients that need it but can’t afford it (through funding the program through taxes on purely cosmetic longevity treatments, for example) and incentivize the creation of genomics databases of different ethnic groups. Medics must contribute to the genomic database by taking samples while doing service in the mentioned groups.
We assume that the cost of such treatment decreases as time passes and advances are made in the field.
Failure here results in discrimination (economic and/or ethnic) and leaves people without support.

3. Development of transgene control systems

Main cast: Medics, Researchers

Researchers should develop sophisticated control systems, not only in the form of a synthetic chemical compound (drug) that inhibits the circuit by marking its proteins for degradation, but also in an inhalation-mediated inhibition mechanism with a volatile compound that triggers an inducible promoter that codes for a regulatory protein that instantly degrades the Yamanaka factors. Physicians should prescribe this drug to stop any unwanted side effects that some patients might experience (oncogenesis).
We assume that the transgene control systems are effective and fast enough to reverse a dedifferentiation process before it becomes irreversible or tumorous.
Failure here results in possible untreatable disregulations of the system.

4. Scoring the options

Does the option:	Option 1	Option 2	Option 3
Enhance Biosecurity
• By preventing incidents	1	-	-
• By helping respond	-	-	1
Foster Lab Safety
• By preventing incident	-	-	-
• By helping respond	-	-	-
Protect the environment
• By preventing incidents	-	-	-
• By helping respond	-	-	-
Other considerations
• Minimizing costs and burdens to stakeholders	3	1	3
• Feasibility?	1	3	2
• Not impede research	1	2	1
• Promote constructive applications	3	2	3

5. What to prioritize

I believe that options 1 and 3 are the most important in this case, so I would recommend developing a standard circuit design that is self-regulating and has an external regulation mechanism in case, in a physician’s judgment, it is considered necessary to stop the process in any patient. I would suggest this kind of approach as a prerequisite for any autonomous reprogramming system intended for human use in the FDA.

6. Bibliography

Takahashi & Yamanaka (2006) Induction of Pluripotent Stem Cells from Mouse Embryonic and Adult Fibroblast Cultures by Defined Factors. Cell 126, 663–676. http://dx.doi.org/10.1016/j.cell.2006.07.024
Schmidt & Plath (2012) The roles of the reprogramming factors Oct4, Sox2 and Klf4 in resetting the somatic cell epigenome during induced pluripotent stem cell generation. http://genomebiology.com/2012/13/10/251
Ocampo et al. (2016) In Vivo Amelioration of Age-Associated Hallmarks by Partial Reprogramming. Cell 167, 1719–1733. http://dx.doi.org/10.1016/j.cell.2016.11.052
Macip et al. (2024) Gene Therapy-Mediated Partial ReprogrammingExtends Lifespan and Reverses Age-RelatedChanges in Aged Mice. http://dx.doi.org/10.1089/cell.2023.0072
Biosciences, L. (2026). Evaluating ER-100 for Safety in People With Glaucoma or Non-Arteritic Anterior Ischemic Optic Neuropathy (Optic Nerve Conditions). https://clinicaltrials.gov/study/NCT07290244

7. Searches:

Google: “Life Biosciences”
Gemini:

“Me gustaría que clarifiques la redacción de la introducción a la aplicación de la biología sintética de la que voy a hablar en mi tarea.”
“Leí estos dos papers Takahashi & Yamanaka (2006) y Schmidt & Plath (2012). Estos hablan de los factores OSK ¿Ya han habido avances de terapias desarrolladas con estos? ¿Cuáles son los avances más recientes?”

![cover image](Molly chiquita y grande.JPG)

Week 2 HW: Read, write and edit DNA

Preparation for Week 2 lecture:

About next generation gene synthesis

What is the error rate of polymerase? 1:10^6
How does this compare to the length of the human genome? Human genome consist on 3.2 Giga base pairs, so at least more than 3 000 errors could be made in the synthesis of an entire human genome by one polymerase. But, since human methabolic pathways are at most 10kbp long, its negligible in the majority of context.
How does biology deal with that discrepancy? There are a lot of polymerase and DNA genome is separated in chromosomes, so chances of committing errors in replication are low, also thera are DNA mismatch reparation systems (MutS Repair System).
How many different ways are there to code for an average human protein? Considering that the average human protein has 1036 bp, and the genetic code is degenerate, you can make the same protein with different codons that in the end encode the same aminoacid, so there should be a lot of different ways.
In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest? This is primarily because, when using different base conjugations, some bases will be compatible with others, and in the case of biological synthesis, we could risk the formation of hairpin-like structures between them (as in the case of RNA proteins), which would force the termination of the translation when we want to obtain the proteins. There is also the error rate.

From DNA Synthesis Development and Application:

What’s the most commonly used method for oligo synthesis currently? Phosphoramidite method
Why is it difficult to make oligos longer than 200nt via direct synthesis? An accumulation of errors due to the error rate of chemical synthesis
Why can’t you make a 2000bp gene via direct oligo synthesis? Because the error rate makes it, probabilistically speaking, almost impossible to achieve. So instead gene assembly is used.

From reading and writing life:

What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

10 essentials: Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Threonine, Tryptophan, Valine, Arginine The lysine contingency It would work well if dinosaurs were as easy to isolate from external resources as bacteria in a petri dish. After all, essential amino acids are those we can’t produce ourselves, but obtain by eating other organisms that do possess them. It would be very difficult to cut off any other available source of lysine from the enviroment, so I would recommend another aproach.

Google Search: “Lysine Contigency”

Week 2 assingments:

(Part 2 consisted of “Gel Art - Restriction of Digests and Gel Electrophoresis”, intended to be carried out in the laboratory, since I do not have access to one, I did not complete it)

Part 1: Benchling & In silico gel art
Using a ladder Life 1 kb Plus and different restriction enzymes (EcoRI, XhoI, SalI, EcoRV, NdeI, BamHI, HindIII and KpnI) on the Lambda fage genome, I tried to draw a Teletubbie.
Part 3: DNA Design Challenge
All started choosing my favourite protein…
Part 4: DNA Synthesis order, building my first plasmid
Stitching together the components to form a plasmid
Part 5: Read, Write and Edit
Why and how

Part 1: Benchling & In silico gel art

My attempt at drawing a Teletubbie

Part 3: DNA Design Challenge

1. Obtaining the protein amino acid sequence

I went to UniProt page and retrieved Homo sapien’s PO5F1 sequence

https://www.uniprot.org/uniprotkb/Q01860/entry#sequences

Q01860-1: This isoform has been chosen as the canonical sequence

sp|Q01860|PO5F1_HUMAN POU domain, class 5, transcription factor 1 OS=Homo sapiens OX=9606 GN=POU5F1 PE=1 SV=1 MAGHLASDFAFSPPPGGGGDGPGGPEPGWVDPRTWLSFQGPPGGPGIGPGVGPGSEVWGI PPCPPPYEFCGGMAYCGPQVGVGLVPQGGLETSQPEGEAGVGVESNSDGASPEPCTVTPG AVKLEKEKLEQNPEESQDIKALQKELEQFAKLLKQKRITLGYTQADVGLTLGVLFGKVFS QTTICRFEALQLSFKNMCKLRPLLQKWVEEADNNENLQEICKAETLVQARKRKRTSIENR VRGNLENLFLQCPKPTLQQISHIAQQLGLEKDVVRVWFCNRRQKGKRSSSDYAQREDFEA AGSPFSGGPVSFPLAPGPHFGTPGYGSPHFTALYSSVPFPEGEAFPPVSVTTLGSPMHSN

OCT4-SOX2-bound nucleosome - SHL-6 From: Science 368 1460-1465 (2020) PMID 32327602 DOI 10.1126/science.abb0074

2. DNA sequence:

PO5F1 protein DNA sequence ATGGCGGGACACCTGGCTTCGGATTTCGCCTTCTCGCCCCCTCCAGGTGGTGGAGGTGATGGGCCAGGGGGGCCGGAGCCGGGCTGGGTTGATCCTCGGACCTGGCTAAGCTTCCAAGGCCCTCCTGGAGGGCCAGGAATCGGGCCGGGGGTTGGGCCAGGCTCTGAGGTGTGGGGGATTCCCCCATGCCCCCCGCCGTATGAGTTCTGTGGGGGGATGGCGTACTGTGGGCCCCAGGTTGGAGTGGGGCTAGTGCCCCAAGGCGGCTTGGAGACCTCTCAGCCTGAGGGCGAAGCAGGAGTCGGGGTGGAGAGCAACTCCGATGGGGCCTCCCCGGAGCCCTGCACCGTCACCCCTGGTGCCGTGAAGCTGGAGAAGGAGAAGCTGGAGCAAAACCCGGAGGAGTCCCAGGACATCAAAGCTCTGCAGAAAGAACTCGAGCAATTTGCCAAGCTCCTGAAGCAGAAGAGGATCACCCTGGGATATACACAGGCCGATGTGGGGCTCACCCTGGGGGTTCTATTTGGGAAGGTATTCAGCCAAACGACCATCTGCCGCTTTGAGGCTCTGCAGCTTAGCTTCAAGAACATGTGTAAGCTGCGGCCCTTGCTGCAGAAGTGGGTGGAGGAAGCTGACAACAATGAAAATCTTCAGGAGATATGCAAAGCAGAAACCCTCGTGCAGGCCCGAAAGAGAAAGCGAACCAGTATCGAGAACCGAGTGAGAGGCAACCTGGAGAATTTGTTCCTGCAGTGCCCGAAACCCACACTGCAGCAGATCAGCCACATCGCCCAGCAGCTTGGGCTCGAGAAGGATGTGGTCCGAGTGTGGTTCTGTAACCGGCGCCAGAAGGGCAAGCGATCAAGCAGCGACTATGCACAACGAGAGGATTTTGAGGCTGCTGGGTCTCCTTTCTCAGGGGGACCAGTGTCCTTTCCTCTGGCCCCAGGGCCCCATTTTGGTACCCCAGGCTATGGGAGCCCTCACTTCACTGCACTGTACTCCTCGGTCCCTTTCCCTGAGGGGGAAGCCTTTCCCCCTGTCTCCGTCACCACTCTGGGCTCTCCCATGCATTCAAAC

3. Gene and CDS

https://www.ncbi.nlm.nih.gov/nuccore/NM_002701.6?report=genbank

I went to NCBI genbank and retrieved POU5F1 mARN transcript variant, 1409bp long I learned that there are spliced variants of OCT4 gene, that includes some introns and can have a more pivotal role in the induction of stemness properties (Yazd et al., 2011)

4. Central Dogma of biology

Part 4: DNA Synthesis order, building my first plasmid

1. Finding an appropriate plasmid backbone:

Reading the Yang et al. (2023) article, I saw in their “Materials and Methods” section, that they transfected pLVX-EF1alpha 2xGFP:NES-IRES-2xRFP:NLS to generate NCC-stable cells, so I thought it would be a good backbone for the assembly. This “shuttle vector” was originally described by Mertens et al. (2015) (AddGene ID: 71396) Directly Reprogrammed Human Neurons Retain Aging-Associated Transcriptomic Signatures and Reveal Age-Related Nucleocytoplasmic Defects. Mertens J, Paquola AC, Ku M, Hatch E, Bohnke L, Ladjevardi S, McGrath S, Campbell B, Lee H, Herdy JR, Goncalves JT, Toda T, Kim Y, Winkler J, Yao J, Hetzer MW, Gage FH. Cell Stem Cell. 2015 Oct 6. pii: S1934-5909(15)00408-7. doi: 10.1016/j.stem.2015.09.001. 10.1016/j.stem.2015.09.001 PubMed 26456686

2. Assembly

I selected restriction enzymes BamHI and EcoRI to cut the GFP reporter out, and then I built my DNA insert sequence (POU5F1 gene for Oct4) with sticky ends ideal to connect with the backbone. I I incorporated a Kozak consensus sequence (GCCACC) upstream of the ATG instead of an RBS And to add the 7xHis at the C-terminus of the protein, before the stop codon The design leverages the backbone’s endogenous promoter and polyadenylation signal, so I didn’t incorporate those to the insert

And this is the result

https://benchling.com/s/seq-dcl08hA5mckl7k6SyMfX?m=slm-6JoFb5a3gBdbPekfEuDq

Twist order

I tried ordering, but couldn’t so I made another insert, with a Trp-less GFP like in recitation, and got my Hight Copy cassette from Twist: https://benchling.com/s/seq-tiZifTIvfRfmji3ky9IP?m=slm-Xdd23sZE45BAObyU1yP8

Part 5: Read, Write and Edit

DNA Read

What DNA would you want to sequence (e.g., read) and why?

I would like to sequence the plasmid I modified (pLVX-EF1alpha-POU5F1-7xHis-IRES-mCherry) as a way of experimentally verifying that there were no mutations and that the junctions integrated properly.

In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?

I would use the “Oxford Nanopore” sequencing method because it would allow me to find accidental recombinations and deletions by being able to sequence the entire plasmid.

DNA Write

What DNA would you want to synthesize (e.g., write) and why?

I would like to synthesize de novo the entire modified POU5F1 (Oct4) cassett, because that way I can perform codon optimization and add the necessary elements (histidine tag and Kozak sequence) in the most efficient way.

What technology or technologies would you use to perform this DNA synthesis and why?

I would use silicon-based DNA synthesis on microchips. Given that the Oct4 cDNA is approximately 1 kb, I really appreciate the ability to synthesize thousands of oligonucleotides in parallel with extremely high accuracy and at a much lower cost than traditional column synthesis.

DNA Edit

What DNA would you want to edit and why?

I would like to edit the genome of the plasmid recipient cells so that they possess a specific locus where the POU5F1 cassette can be inserted without activating oncogenes or disrupting vital genes.

What technology or technologies would you use to perform these DNA edits and why?

I would use Prime Editing because it would allow me to edit a small portion of the genome with considerable precision and turn it into a specific recognition site. Since it uses a Reverse Transcriptase (fused to Cas9 nickase) to write the new sequence directly into the DNA, it works perfectly in neurons or resting fibroblasts.

Week 3 HW: Lab automation

Part 1: Opentrons artwork
I wanted to recreate a solar disc
Part 2: Use of automation tools
References and what I intend to do

Part 1: Opentrons artwork

My attempt

Part 2: Use of automation tools

Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications

Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.

Well, of the three projects I’ve proposed, I’ll start with the second one, “Nostoc’s smart gummies”.

The objective here is to Characterize the sensitivity of heavy metal biosensors in Nostoc colonies and standardize the globular shape for the production of “gummies”

Cloud Lab Workflow (Ginkgo Nebula)

Echo: Precise gradient transfer of heavy metal solutions (lead and arsenic) at parts-per-billion (ppb) concentrations to Nostoc culture plates.

Multiflo: Dispensing of culture media optimized for cyanobacteria growth (BG-11).

PHERAstar: Measurement of the color intensity (absorbance) of reporter chromoproteins to establish the biosensor calibration curve.

Weeek 4 HW: Protein design - Part 1

Notes about key concepts for deeper understanding

About ESM

Key Concepts to Know:

Evolutionary Scale Model (ESM): A model that generates likelihood values for the existence of an amino acid at a specific position within a protein sequence. It is essentially a Masked Language Modeling (MLM) architecture.
Deep Mutational Scanning (DMS): A 2D map representing the probability values generated by the ESM for every possible mutation in a sequence.
Latent Space: Upon inputting a FASTA sequence, the language model adds start and end tokens, generating a high-dimensional vector (320 dimensions in the case of the recitation model). This vector classifies the protein; when projected into a 3D space, it creates a map that clusters the protein alongside others from a provided database based on functional or family similarities.
Multi-Layer Perceptron (MLP): Consist on two linear transformation with an non-linear activation (https://www.ultralytics.com/glossary/gelu-gaussian-error-linear-unit) in between.

Steps for 3D Protein Structure Prediction with ESM-2:

Likelihood Generation: We utilize the fundamental property of the ESM (Masked Language Modeling) to calculate amino acid probabilities based on the sequence context.
Attention Mechanism: These data points are processed as Queries (Q), Keys (K), and Values (V). This generates an Attention Matrix, where vectors for each amino acid indicate the relevance and position of other residues in the chain.
Structural Extraction (Simple ESM): The raw attention matrix is symmetrized to ensure physically plausible distances and refined using Average Product Correction (APC) to eliminate correlation noise. Through regression, a 2D contact map is obtained.
Model Optimization: This map is compared against structural databases (like the PDB) to adjust the model. This feedback loop updates the model’s weights, maximizing the log-likelihood of real-world protein structures found in nature.
Enrichment via MLP: In parallel, the correlation vectors from the attention matrix are processed by a Multi-Layer Perceptron (MLP). This generates enriched vectors that describe the physicochemical implications of the attention data.
3D Prediction (ESM-2/ESMFold): In the ESM-2 architecture, these MLP-enriched vectors are fed into a “Folding Trunk” module. This module applies learned rules to predict the 3D structure, including bond angles and atomic coordinates.

About AlphaFold ans MSA

Key Concepts to Know:

Multiple Sequence Alignments (MSA): The different dialects of the tree of life for the same protein

Conceptual questions
Basic concepts about amino acids and protein functions
Protein analysis and visualization
I selected a small transcription factor protein related to cellular reprogramming
Using ML-Based Protein Design Tools
For this part I used transcription factor Klf4 and realized a Deep mutational Scan, Latent Space Analysis, folding protein exercise, and inverse-folding protein exercise.

Conceptual questions

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

A: Averaging out the nutritional composition of common meats (like the data from Pereira & Vicente, 2013), meat is about 21.25% protein by weight.If you eat a 500-gram steak, you are consuming roughly 106.25 grams of pure protein. Since 1 Da is exactly 1 g/mol, an average amino acid weighing 100 Da has a molar mass of 100 g/mol.By dividing your total protein (106.25 g) by the average amino acid mass (100 g/mol), we get 1.0625 moles of amino acids. Multiply that by Avogadro’s number 6.022e+23, and you get approximately 6.399e+23 molecules of amino acids

Why do humans eat beef but do not become a cow, eat fish but do not become fish?

A: Because exogenous DNA is either degraded before consumption or degraded by our own enzymes during the digestion process, it cannot be processed to alter our cells. So, the only materials our body takes are the disassembled amino acids; then, it rebuilds them into human proteins, following our human genome.

Why are there only 20 natural amino acids?

A: Certainly, there are only 20 amino acids in Choanoflagellatea, and this set of amino acids provides just enough chemical diversity (acidic, basic, hydrophobic, and hydrophilic shapes) to build complex and versatile proteins. However, I believe it’s a little bit biased to say they are the only “natural” ones, because there are also species of methanogenic archaea that use other amino acids, like selenocysteine and pyrrolysine.

Can you make other non-natural amino acids? Design some new amino acids.

A: Yes, it’s possible to synthesize new amino acids with synthetic biology or organic chemistry. You could also find them as non-canonical amino acids, like 2-azatyrosine, 3-azatyrosine, or azetidine-2-carboxylic acid.

Where did amino acids come from before enzymes that make them, and before life started?

A: It’s believed that they formed completely abiotically, as was proven in the famous Miller-Urey experiment, where they showed that if you mix simple gases present on early Earth (like methane, ammonia, and water) and strike them with electricity (simulating lightning), amino acids form naturally and spontaneously.

If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

A: Left. Because D-amino acids are enantiomers of our natural L-amino acids, the most energetically stable way for them to fold is also completely mirrored.

Why are most molecular helices right-handed?

A: This conformation is much more energetically favorable because it minimizes atomic collisions (steric hindrance) between side chains and the backbone of the L-amino acids.

Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

A: Because the unpaired hydrogen bond donors and acceptors of a beta strand attain a highly stable conformation by bonding with another strand, reaching a lower energy state. The driving forces are hydrogen bonds and hydrophobic interactions.

Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

A: Because amyloid diseases are characterized by protein denaturation or misfolding, these proteins expose hydrofobic regions in this process so the end up crashing ther strands together and “zippering up” into tightly packed cross-β-sheet structures. Yes, because of its stability and self-assembly capacity, we could use it as diverse materials: nanowires, hydrogel, scaffolds, etc.

References:

Pereira, P. M. de C. C., & Vicente, A. F. dos R. B. (2013). Meat nutritional composition and nutritive role in the human diet. Meat Science, 93(3), 586–592. https://doi.org/10.1016/j.meatsci.2012.09.018
4.2: Secondary structure and loops. (2018). Biology LibreTexts; Libretexts. https://bio.libretexts.org/Bookshelves/Biochemistry/Fundamentals_of_Biochemistry_(Jakubowski_and_Flatt)/01%3A_Unit_I-_Structure_and_Catalysis/04%3A_The_Three-Dimensional_Structure_of_Proteins/4.02%3A_Secondary_Structure_and_Loops

Protein analysis and visualization

Why Sox2?

This is primarily because is the next protein in the yamanaka factors and participates in a SOX2/OCT4-bound, being OCT-4 the protein I selected for week’s two homework.

In relation to Sox2 aminoacid sequence

Sequence length and composition: This protein is 317 amino acids long. Based on the frequency analysis, Serine (S) is the most frequent amino acid, appearing 36 times (11.36% of the sequence), followed closely by Glycine (G) with 35 occurrences (11.04%).
Homology and BLAST results: When including “predicted” and “homology” evidence levels, the search reached the limit of 250 results in the UniProt BLAST tool. This high number of hits reflects that Sox-2 is an extremely conserved protein across species. However, for high-confidence analysis, I primarily considered the results from the reviewed Swiss-Prot database.
Familiy Classification: SCOP results showed that Sox-2 contains a High Mobility Group (HMG) box structural domain located between residues 206-282. This domain is characterized by its L-shaped triple α-helix fold that binds to the minor groove of DNA. While it shares this structural “HMG-box” fold with other architectural proteins, Sox-2 functionally belongs to the SOX transcription factor family. Despite recognizing the same DNA consensus motif ( 5-(A/T)(A/T)CAA(A/T)G-3 ) different Sox factors achieve target gene selectivity through differential affinity for particular flanking sequences next to consensus Sox sites, homo- or heterodimerization among Sox proteins, posttranslational modifications of Sox factors, or interaction with other co-factors (Wegner, 2010)

In relation to protein structure

RCSB result for P48431 : https://www.rcsb.org/groups/sequence/polymer_entity/P48431

The structure was solved and published in 2003 by Reményi et al. in the journal Genes & Development (https://doi.org/10.1101/gad.269303). Since the article its about the crystal structure of a POU/HMG/DNA ternary complex, the resolution is not below 2.70 Å, wich is not ideal for high-detail drug design, but I think the 2.70 Å resolution obtained is considered a good quality structure for analyzing protein-DNA ternary complexes.
As I mentioned before, the solved structure is often a ternary complex. Apart from the SOX2 protein, it includes the POU domain of the OCT4 protein and a specific DNA oligonucleotide (the FGF4 enhancer)
Acording to SCOP, it belongs to the HMG-box family within the structural class all alpha protein

In relation to the molecule visualization:

Since I ran into some strange errors while trying to use Conda, I used pip to import the PyMOL libraries instead I used py3Dmol to visualize Sox2 in “cartoon” and “ball and stick” representations, and PyMOL for the “ribbon” view

To highlight the secondary structures, I added colors: red for alpha helices, blue for beta sheets, and grey for loops I then used the following code to count the atoms and determine the predominant structure:

By coloring the hydrophilic residues blue, the hydrophobic residues orange, and others yellow, it becomes clear how they are distributed. You can notice that the hydrophilic residues tend to interact with the phosphates in the DNA chain, while the hydrophobic ones are positioned to fit within the DNA grooves.

As seen in the surface visualization, the “binding pockets” of transcription factors aren’t exactly “holes.” Instead, they act more like a contoured surface that forms “hooks” around the DNA backbone.

Using ML-Based Protein Design Tools

Protein Language Modeling

Deep Mutational Scans

At the beggining and the mid part of the protein there are visible and abundant green/yellow zones, indicating that the protein function its not likely to be compromised by a mutation on this parts, but there is also a lot of dark violet zones at the end of the sequences, indicating that the probability of a critical mutation increases in that zone Looking at the Klf4 in the RCSB PDB page, the structural features of the last part of the sequence corresponds to the zinc fingers So it makes sense that in a zoom in of the last part, the mutational scan looks like this

Latent Space Analysis

After running the code and analyzing the positions of tokens of sequences, I could find some clusters of similar proteins within the 3D depiction. just like these transcription factors whose position variability was almost constant on the TSNE2 axis

Then I added the Klf4 protein sequence to the “sequences” variable in the colab

And modified some visualization components to highlight the Klf4 The neighbors of the protein were commongly also transcription factors. There was a lot of Homo sapiens general transcription factors, as the d2dn5a1 seen in the image below Also ther were transcriptional regulator factors of bacteria And quite surprisingly some functionally unrelated proteins like this arsenite translocating ATP-ase in E. colli

Curiously, proteins that at first glance were supposedly to be similr to Kfl4, were relativetely away from it.

Protein Folding

After running ESMFold with the Klf4 the result was quite similar to the RCSB 3D view of the same protein

I did some puntual mutations at the end of the sequence (near the last alpha helix), replacing three amino acids with the last three prolines seen in the sequence below

The result at first glance appeared to be the same

But zooming in, you could se that the first turn of the last alpha helix was disrupted by the mutations original disrupted

Protein Generation (inverse-folding)

Using the code provided on this page, I inputted the PDB code for Klf4 (6VTX) to extract its sequence and generate a heatmap. I didn’t expect the resulting sequence to be this short compared to the original one

This heatmap is visually quite similar to the one we saw earlier. Indeed, while both the Deep Mutational Scan and this ProteinMPNN heatmap are probabilistic models that calculate conditional probabilities, their approaches are fundamentally different. The former uses a 1D array (as a masked language model) to predict an amino acid based on the sequence context of other amino acids. In contrast, ProteinMPNN uses a spatial graph to predict an amino acid based on the 3D structure of the protein, relying on the distances and angles between atoms.

Finally, I ran ESMFold using the newly generated sequence and obtained this 3D structure. Although it’s visibly shorter than the original, the overall fold looks quite similar, and the alpha helices that predominantly define the structure have been successfully preserved

Week 5 HW: Protein design - Part 2

SOD1 Binder Peptide Design
Generate, evaluate properties and optimize binders
L-protein engineering
Improve the stability and auto-folding of the lysis protein of a MS2-phage

SOD1 Binder Peptide Design

1. Generate Binders with PepMLM

First, I retrieve the human SOD1 amino acid sequence from UniProt (https://www.uniprot.org/uniprotkb/P00441/entry):

sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2 MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Then, I induced the alanine to valine mutation at residue 4 (A4V) of SOD1 sequence, resulting in the mutated version:

sp|P00441|SODC_HUMAN Superoxide dismutase [Cu-Zn] OS=Homo sapiens OX=9606 GN=SOD1 PE=1 SV=2 MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTS AGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVV HEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ

Using the collab from https://huggingface.co/ChatterjeeLab/PepMLM-650M, I selected a peptide length of 12, a top-k value of 3 (as the literature suggests this yields better and more flexible outcomes), and set the number of binders to 4.

While running the Colab, I included the provided ground truth binder (FLYRWLPSRRGG) for comparison. Interestingly, I noticed that the known SOD1-binding peptide in the fourth row has an elevated perplexity as calculated by the model.

2. Evaluate Binders with AlphaFold3

I submitted the protein sequence followed by each peptide. Since three of the four peptides had ‘X’ as their last amino acid, I replaced the X with a G (Glycine) to successfully submit the jobs to AlphaFold3.

The results were as following:

Known binder FLYRWLPSRRGG: It binds to the surface near a non structured region and obtained an ipTM = 0.34.

None of the generated peptides appeared to bind near any terminus; all were surface-bound and mostly located in a loop from approximately position 65 to 78.

First generated binder WHYPAVAARWKX with an ipTM = 0.31:

Second generated binder HHYGAVALELKX with an ipTM = 0.45:

Third generated binder WLYPAVVAALKK with an ipTM = 0.24:

Fourth generated binder WRVPAAAVRHGX with an ipTM = 0.36:

The second and fourth generated peptides achieved ipTM scores higher than that of the known binding peptide.

3. Evaluate properties of generated peptides in the PeptiVerse

I analyzed the results of the evaluation by PeptiVerse, considering predicted binding affinity, non-fouling predicted value, hemolysis, solubility, and net charge (pH 7).

FLYRWLPSRRGG: As the ground truth peptide, its properties are good, although its predicted binding affinity is considered “weak” by the model.

WHYPAVAARWKX: Has a relatively low predicted non-fouling value, and its binding affinity value diminished by 0.303 when changing X for G.

HHYGAVALELKX: When changing X for G, the binding affinity value diminished from 5.387 to 5.382, and its permeability value decreased from 0.125 to 0.41. Note that it wasn’t predicted to be permeable from the start.

WLYPAVVAALKK: It achieved the highest binding affinity among the generated and known binding peptides; however, its fouling value is too high, and it is not permeable.

WRVPAAAVRHGX: Has good properties and a binding affinity near that of the known binding peptide.

In AlphaFold the second (HHYGAVALELKX) and fourth (WRVPAAAVRHGX) generated peptide had the higher ipTM. However, the PeptiVerse model considered the third generated peptide (WLYPAVVAALKK) and the known binding peptide (FLYRWLPSRRGG) as the ones with the the highest binding affinity.

4. Generate optimized peptides with moPPIt

I used the provided moPPit Colab, limiting the optimization parameters to work within a GPU T4 execution environment without running into computational power issues. I obtained a single sequence (as generating four sequences exceeded the environment’s capacity) and analyzed it.

The AlphaFold 3 predicted ipTM of 0.34 was not considerably different from the non-optimized binders; however, the binding location changed significantly, appearing near the terminus this time.

The predicted structure shows that the peptide localizes near the C-terminus of the protein, within the dimer interface. Although it does not bind directly to the N-terminus, it interacts with it through the C-terminus beta-strand.

The PeptiVerse model predicted a tight binding affinity, but also a higher foulingness.

Bibliography:

Saeed, M., Yang, Y., Deng, H.-X., Hung, W.-Y., Siddique, N., Dellefave, L., Gellera, C., Andersen, P. M., & Siddique, T. (2009). Age and founder effect of SOD1 A4V mutation causing ALS. Neurology, 72(19), 1634–1639. https://doi.org/10.1212/01.wnl.0000343509.76828.2a

L-protein engineering

Mutagenesis

Mutagenesis using Af2-Multimer

Random mutagenesis

Week 6 HW: Genetic Cirucuits - Part 1

DNA assembly
Questions about the "The Chromophore Color Cloning Quest" lab protocol
Asimov Kernel
Building constructs using the Kernel interface

DNA assembly

1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?

While the principal components in every standard PCR reaction are a polymerase (like Taq), dNTPs, nuclease-free water, primers, buffer, and the DNA template to amplify, a Master Mix typically only contains the polymerase, dNTPs, and an optimized buffer (which includes Mg²⁺ as polymerases require magnesium to function). In the specific case of the Phusion High-Fidelity PCR Master Mix, we must note that the polymerase used is not Taq, but Phusion—a high-fidelity enzyme with 3’ to 5’ exonuclease (proofreading) activity. Its purpose is to amplify DNA with extreme accuracy and speed, preventing mutations during the assembly of sensitive genetic circuits.

2. What are some factors that determine primer annealing temperature during PCR?

First, the primer melting temperature (T_m). The T_m is the temperature at which 50% of the primer is bound to its complementary sequence, and the annealing temperature is typically set a few degrees below this. To calculate this melting point, we need to consider the length of the primer and the G-C content of the DNA template; G-C base pairs are bound by 3 hydrogen bonds each, so more heat is needed to separate them compared to A-T pairs. Second, the buffer conditions (specifically salt and magnesium concentrations) are a factor, as they add stability to the DNA double-stranded form by shielding the negative charges of the phosphate backbone, hence increasing the T_m and the required annealing temperature.

3. Compare and contrast PCR and restriction enzyme digest methods in terms of protocol and use preference situations.

Protocols: PCR relies on thermal cycling (denaturation, annealing, and extension) using oligonucleotide primers and a thermostable DNA polymerase to synthesize new copies of a specific DNA fragment. In contrast, a restriction enzyme digest is an isothermal incubation (often at 37°C) where an endonuclease cuts pre-existing, double-stranded DNA at specific recognition sites, followed by heat inactivation of the enzyme.

Use Preference: PCR is preferable when you need to amplify a highly specific region from a complex template, when your starting DNA concentration is very low, or when you need to add custom overhangs (like homology arms) to the ends of your sequence. Restriction enzyme digestion is preferable for diagnostic checks (verifying the size of a cloned insert), when preparing plasmids for traditional ligation, or when you must completely avoid the risk of polymerase-induced mutations in a sequence that is already cloned.

4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?

First, in the PCR procedure, you want to end with fragments that have overlapping terminal bases (typically 15-40 base pairs) complementary to those in the backbone and the ones in the adjacent fragments, in order for them to be assembled by homology. Second, you can make sure the processed fragments are correct by doing gel electrophoresis and checking if the DNA is degraded or not by seeing if the results are consistent with the expected size. Finally, you can perform a gel extraction to purify these specific fragments from the template DNA and non-specific PCR products, ensuring a clean reaction.

5. How does the plasmid DNA enters the E. coli cells during transformation?

Essentially, it enters through stress-induced pores in the plasmalema. It could be physical or chemical stress. This process is facilitated by cations (normally calcium cations from calcium chloride (CaCl₂). These cations act as a bridge to shield and neutralize the electrostatic repulsion between the negatively charged phosphate backbone of the plasmid DNA and the negatively charged lipopolysaccharides on the bacterial plasmalema. Once the DNA is resting on the membrane, the stress pore induction (often done via heat-shock)creates a thermal imbalance that sweeps the DNA into the bacterial cell, allowing it to reach the cytoplasm. Other physical method could be electroporation, which generates pores with high electrical voltage.

6. Describe another assembly method in detail (such as Golden Gate Assembly)

There are plenty of methods, like Modular Cloning or Circular Polymerase Extentional cloning, etc. But the majority of them are based on Gibson assembly or Golden Gate Assembly. so I decided to explain the Golden Gate Assembly:

This technique leverages the capacity of specific endonucleases, called Type IIS restriction enzymes, to recognize asymmetric DNA sequences and cleave outside of this recognition site. By cutting typically 1 to 20 base pairs away, these enzymes generate customizable overhangs. The orientation of the recognition sites dictates the assembly mechanics: for the inserts, the sites are placed facing inward so the cut occurs between them, whereas in the destination vector, the sites must be outward-facing to flank the region being excised. The goal is to achieve one-pot assembly of multiple DNA fragments by designing these overhangs to be complementary (which can be generated via PCR) and using T4 DNA ligase to seal the construct. This simultaneous digestion and ligation is achievable because the restriction recognition sites are entirely removed during cleavage. Consequently, the assembled fragments and the final backbone become resistant to further digestion, driving the reaction forward seamlessly.

Diagram taken from SnapGene: https://youtu.be/NzQdLQ44I7w?si=UbRcM9ca7adXYMNH

Model this assembly method with Benchling:

Asimov Kernel

Week 07 HW: Genetic circuits II

Week 9 HW: Cell-free systems

Week 10 HW: Advance imaging & measurement technology

Week 11 HW: Bioproduction adn cloud labs

Final project advance

aim1
the proccess
aim 4
protocol details

SECTION 1: ABSTRACT

Abstract Cellular aging and tissue damage are deeply linked to the accumulation of epigenetic errors over time. Although partial reprogramming using Yamanaka factors (OSK) has shown promise for age reversal, it presents a critical problem: uncontrolled overexpression can lead to cellular dedifferentiation. In neural tissue, this entails the catastrophic risk of altering synaptic connections and “erasing” memories. The overall goal of this project is to design a highly regulated epigenetic-error correction genetic circuit that allows cellular rejuvenation without compromising neuronal identity and function. My hypothesis is that, by using Intracellular Artificial Neural Networks (iANN) introduced to us in the week 7, to model and predict regulatory interactions, it is possible to fine-tune promoter strength and the exact concentration of OSK factors needed to achieve safe reprogramming. Specific aims include: (1) in silico assembly and plasmid cloning in bacteria; (2) validation of the circuit’s efficacy and safety in mammalian cells (mice); and (3) eventual clinical translation in human models. Methods will encompass target cell selection, computational modeling with iANNs, DNA synthesis, and cell culture, establishing a new standard of biosafety in epigenetic rejuvenation therapies.

SECTION 2: PROJECT AIMS

Aim 1: Assemble and validate a stable plasmid containing the epigenetic-error correction circuit in bacterial models by utilizing in silico iANN modeling, standard DNA synthesis, and bacterial cloning protocols. Specifically, this aim covers the foundational phase of the circuit. Beggining with the careful selection of the target cell line. Subsequently, I will use Intracellular Artificial Neural Network (iANN) models to calculate and predict the optimal promoter strength and precise concentrations of OSK transcription factors required to prevent dedifferentiation. With these parameters, I will assemble the genetic circuit in silico, order the optimized plasmid (e.g., through Twist Biosciences), and proceed with the transformation and stabilization of the plasmid in bacteria (such as E. coli DH5-alpha) for amplification and storage.

Aim 2: The next step, following a successful Aim 1, will be the transfection and evaluation of the plasmid in mammalian cells, specifically using murine (mouse) cell models and, eventually, in vivo models. This aim seeks to empirically validate the iANN model predictions, verifying that the OSK factors are expressed at the correct levels to reduce epigenetic aging markers without inducing pluripotency. This will involve solving technical limitations related to gene delivery methods (such as the use of adeno-associated viral vectors, AAVs) and monitoring the cellular transcriptome over time.

Aim 3: In the long term, the visionary aim, which I’m considering to serve as the basis for my bachelor’s thesis, is to analyze the effectiveness and biosafety of the plasmid in human models, addressing a major barrier in the field of cellular reprogramming: the preservation of complex functional identity. Currently, there is a justified fear that neuronal reprogramming could interfere with established synaptic networks, carrying the risk of “erasing” memory or altering cognition. If this precision circuit functions as designed, it would challenge the current clinical paradigm, demonstrating that it is possible to decouple the epigenetic clock from the human cellular developmental clock. This would open the door to central nervous system rejuvenation therapies that treat neurodegenerative diseases without neuro-amnesic side effects.

SECTION 3: BACKGROUND

To understand the current state of the field, it is essential to review the advances in in vivo epigenetic reprogramming. For example, the work of Ocampo et al. (2016) pioneered the demonstration that short-term cyclic expression of OSKM factors can improve cellular markers of aging and extend lifespan in mice with progeria, without forming teratomas. More recently, Lu et al. (2020) successfully restored vision in old mice and those with optic nerve damage through the gene expression of OSK (excluding c-Myc due to its oncogenicity). This demonstrated that central nervous system tissue in mammals retains an epigenetic record of youth that can be recovered. Despite these advances, a critical gap exists: the quantitative and temporal control of these factors at the intracellular level remains poor and unpredictable.

This project is deeply innovative because it introduces a layer of computational rationality (iANN) to a highly stochastic biological process. Unlike traditional approaches that rely on constitutive or exogenously drug-inducible promoters, our circuit aims to be self-managed by the cell itself through in silico tuned sensors. In this way, we expand the boundaries of synthetic biology by creating a “state-aware” rejuvenation system that automatically shuts down before erasing the epigenetic marks that define cell identity.

The impact of this project lies in the immense global burden posed by aging-related diseases, especially neurodegenerative conditions like Alzheimer’s or Parkinson’s. The problem we address is the critical safety barrier: the current inability to rejuvenate neurons without risking the network of synaptic connections (the “memory” of the individual). This is significant because, without a solution to this problem, genetic reprogramming therapies will never be safely applicable to the human brain. At the level of advancing technical capability, combining iANNs with in vivo circuit design will provide a new standard methodology for predicting gene dosing. If the aims are achieved, we could catalyze a field-level change in regenerative medicine, moving from palliative maintenance treatments to true neuronal age-reversal interventions without loss of function.

Ethical Implications The ethical implications involved in the epigenetic manipulation of the brain are substantial, appealing directly to the principles of non-maleficence (avoiding harm) and responsibility. The primary ethical risk lies in the potential alteration of an individual’s cognition, personality, or memory if the neuronal reprogramming process inadvertently dedifferentiates the cells, disrupting synapses. There is also a concern related to the principle of justice: if effective rejuvenation therapies are developed, extreme health inequality could arise if these treatments are only accessible to an economic elite, exacerbating the gap in life expectancy and quality of life.

To ensure that the project is ethical both in how the research is conducted and in its broader implications for society, we propose the mandatory implementation of safety “kill switches” in the original circuit design, allowing for the immediate halt of OSK transcription in case of adverse outcomes. An unintended consequence could be cellular hyper-proliferation (tumorigenesis) despite our predictive models. To mitigate these errors and incorrect assumptions (such as the assumption that in silico results will translate perfectly in vivo), validations in Aim 2 with animal models must include comprehensive neurological and behavioral monitoring. As an alternative to this direct gene therapy, the scientific community could consider using small molecules that modulate specific epigenetic pathways transiently, although these are typically less precise than the biological circuit proposed here.

SECTION 4: Aim 1 - The proccess

First I retrieved the sequence of the yamanaka factors from Addgene: Oct3/4 (https://www.addgene.org/13366/) #13366 Sox2 (https://www.addgene.org/13367/) #13367 Klf4 (https://www.addgene.org/13370/) #13370

It was known since the 90’s that OCT4 and SOX2 can form a heterodimer that binds to a composite motif at thousands of sites in the genome, acting as pioneer trascription factors (the ones that maintains the accessibility of chromatin regions previously inaccessible).

Looking in the literature I found the degradation and protein stability values from OSK in mouse embryonic stem cells (ESC). These values were a reference, since the were obtained from studies in 2TS22C cell line and my cassete is intendent to be applied into old fibroblast In the KLF4 protein case, Dhaliwal and peers in 2019 found that its stability was diminished in a 24h of differentiation ESC, but the presence of SOX2 and NANOG expression in differentiated cells restores KLF4 protein stability from t1/2 =2htot1/2=17 h in the case of Sox2 and from t1/2=1.5 h to t1/2=25 h in Nanog’s case.

Fig. 2.A from Dhaliwal et al. (2019) LIF referes to the Leukemia Inhibitory Factor serum in which the stem cells are cultivated to mantain cell plurypotency

These proteins have different prevalence dependent on the cellular type being studied and the life stage of the cell. So, in order to get a better reference I used the half-life information indicated in the supplementary information of Roux et al. (2022), in which they considered the half-life of OSK factors in three days from being transfected into fibroblast

Fig.S1 A and B from Roux et al. (2022) supplementary information

Now, with this search done, I proceded with a simple math model that will leverage iANNs loss function to optimize for the best promoters

Mathematical Framework of the OSK Circuit

Formal Definition: The temporal evolution of the OSK concentration is modeled as a deterministic, continuous-time dynamical system. It is governed by a first-order ordinary differential equation (ODE) that balances non-linear thermodynamic synthesis with first-order kinetic decay.

The core mass-action kinetics of the system is defined by the following ODE:

$$ \frac{d[OSK]}{dt} = \alpha_{0} + \alpha - \gamma[OSK] $$

Where each term represents a distinct biophysical process:

$\alpha_0$: The basal transcription rate (promoter leakiness) in the absence of the inducer.
$\alpha$: The active transcription rate driven by the inducer. This term is constrained by thermodynamic equilibrium and obeys the Hill-equation:
$$ \alpha = \alpha_{max} \frac{[L]^n}{K_d^n + [L]^n} $$
Here, $[L]$ is the concentration of the inducer, $K_d$ is the microscopic dissociation constant (representing the affinity of the inducer to the receptor), and $n$ is the Hill coefficient denoting steric cooperativity.
$\gamma$: The kinetic decay constant, representing the thermodynamic degradation of the OSK protein and cellular dilution. It is inversely proportional to the protein's biological half-life ($t_{1/2}$):
$$ \gamma = \frac{\ln(2)}{t_{1/2}} $$

Computational Integration: Rather than relying on arbitrary parameters, this ODE serves as the explicit physics constraint within our Physics-Informed Neural Network (PINN). By embedding this differential topology into the model's loss function, we ensure that in silico sequence designs strictly adhere to real-world kinetic boundaries.

Now, since I’m still on diapers when it comes to programming, with help of Gemini.AI I code a synthetic transcriptome with the gathered information, almost all based on Roux et al. (2022), like the OSK ARNm virtually indicating 0 on the 10th and 15th days, just like the remaining < 0.05% remaining they got. ``` import numpy as np import scanpy as sc import anndata as ad import pandas as pd

print("Generando transcriptoma sintético de MEFs con datos empíricos...")

# 1. Definimos los días de recolección (Pulse: 0 a 7, Chase: 7 a 15)
dias = [0, 3, 7, 10, 15]
celulas_por_dia = 200

# =========================================================================
# PARÁMETROS EMPÍRICOS EXTRAÍDOS DE LA LITERATURA
# =========================================================================
# Vidas medias de ARNm [horas] (Roux et al. Supplementary Information - Fig S1)
t_half_Pou5f1 = 3.2
t_half_Sox2 = 3.1
t_half_Klf4 = 5.5  # Nota: Dhaliwal et al. reporta >24h en ES, pero usamos el de transito de Roux

# Calculamos la constante de decaimiento kd en [días^-1] (kd = ln(2) / t_half_dias)
# Esto dictará la caída exponencial durante la fase de "Chase"
kd_Pou5f1 = np.log(2) / (t_half_Pou5f1 / 24)  # ~5.2 días^-1
kd_Sox2   = np.log(2) / (t_half_Sox2 / 24)    # ~5.3 días^-1
kd_Klf4   = np.log(2) / (t_half_Klf4 / 24)    # ~3.0 días^-1

X_list = []
obs_list = []

for t in dias:
    # -----------------------------------------------------------------
    # A. DINÁMICA DE LA IDENTIDAD SOMÁTICA (Col1a1, Thy1)
    # -----------------------------------------------------------------
    # [Paper: Roux et al. "Partial reprogramming transiently suppresses somatic identity"]
    # La expresión baja durante la inducción y se recupera tras el chase.
    if t <= 7:
        # Supresión transitoria durante el pulso
        media_somatica = 10 - t 
        varianza = 0.5 + (t * 0.5) # Critical Slowing Down hacia la bifurcación
    else:
        # Re-establecimiento de la identidad tras retirar los factores
        delta_t_chase = t - 7
        media_somatica = 3 + (delta_t_chase * 0.8) # Recuperación lineal/exponencial
        varianza = 1.5 
        
    genes_somaticos = np.random.normal(loc=media_somatica, scale=varianza, size=(celulas_por_dia, 2))
    genes_somaticos = np.clip(genes_somaticos, a_min=0, a_max=None) # No hay scRNA-seq negativo

    # -----------------------------------------------------------------
    # B. DINÁMICA DE LOS FACTORES DE REPROGRAMACIÓN (Pou5f1, Sox2, Klf4)
    # -----------------------------------------------------------------
    # [Paper: Roux et al. Suppl. Info - EDO decay model: p(t) = p(0)e^(-kt)]
    if t <= 7:
        # Fase PULSE: Acumulación hasta estado estacionario (steady-state)
        # Simulamos que alcanzan un nivel máximo (ej. log1p CPM = 8)
        media_Pou5f1 = t * 1.1
        media_Sox2   = t * 1.1
        media_Klf4   = t * 1.1
    else:
        # Fase CHASE: Decaimiento cinético estricto según kd empírico
        delta_t_chase = t - 7
        # Nivel máximo alcanzado en el día 7
        max_expr = 7 * 1.1 
        
        # p(t) = p(0) * exp(-kd * t)
        media_Pou5f1 = max_expr * np.exp(-kd_Pou5f1 * delta_t_chase)
        media_Sox2   = max_expr * np.exp(-kd_Sox2 * delta_t_chase)
        media_Klf4   = max_expr * np.exp(-kd_Klf4 * delta_t_chase)
        
        # [Paper: Roux et al. - "By day 3 of chase, < 0.05% of mRNA remains"]
        # Las medias a partir del día 10 tenderán asintóticamente a 0 rápidamente.

    # Añadimos ruido técnico típico de scRNA-seq (ruido Poisson/Normal)
    gen_Pou5f1 = np.random.normal(loc=media_Pou5f1, scale=0.5, size=(celulas_por_dia, 1))
    gen_Sox2   = np.random.normal(loc=media_Sox2,   scale=0.5, size=(celulas_por_dia, 1))
    gen_Klf4   = np.random.normal(loc=media_Klf4,   scale=0.5, size=(celulas_por_dia, 1))
    
    genes_reprog = np.hstack([gen_Pou5f1, gen_Sox2, gen_Klf4])
    genes_reprog = np.clip(genes_reprog, a_min=0, a_max=None)

    # -----------------------------------------------------------------
    # C. ENSAMBLAJE DE LA MATRIZ DEL TIEMPO t
    # -----------------------------------------------------------------
    X_t = np.hstack([genes_somaticos, genes_reprog])
    X_list.append(X_t)

    # Metadatos
    obs_t = pd.DataFrame({
        'time_days': [t] * celulas_por_dia,
        'phase': ['Pulse' if t <= 7 else 'Chase'] * celulas_por_dia
    })
    obs_list.append(obs_t)

# 3. Ensamblamos el objeto AnnData (.h5ad)
X_total = np.vstack(X_list)
obs_total = pd.concat(obs_list, ignore_index=True)
var_total = pd.DataFrame(index=["Col1a1", "Thy1", "Pou5f1", "Sox2", "Klf4"])

adata_simulado = ad.AnnData(X=X_total, obs=obs_total, var=var_total)

# 4. Guardamos el archivo
ruta_archivo = "mef_simulado_parametros_empiricos.h5ad"
adata_simulado.write_h5ad(ruta_archivo)

print(f"¡Éxito! Archivo {ruta_archivo} creado.")
```

Then, the next step was coding the iANNs training to find the most suitable rate of transcription to match the incoherent feed-forward loop I was planning to add in the plasmid. So, again with Gemini’s help, the code was: ``` import torch import torch.nn as nn import scanpy as sc import numpy as np

  # =====================================================================
  # 1. MÓDULO DE FÍSICA ESTADÍSTICA (Scanpy para Critical Slowing Down)
  # =====================================================================
  def extract_critical_dynamics(h5ad_path, somatic_genes, osk_gene_name):
      """
      Lee datos scRNA-seq, calcula la varianza somática para detectar el
      punto de bifurcación y extrae tensores para PyTorch.
      """
      print("Cargando matriz scRNA-seq empírica...")
      adata = sc.read_h5ad(h5ad_path)
  
      time_points = sorted(adata.obs['time_days'].unique())
      variances = []
      osk_means = []
  
      print("Calculando varianza del atractor somático (Álgebra Lineal)...")
      for t in time_points:
          cells_t = adata[adata.obs['time_days'] == t]
          
          X_somatic = cells_t[:, somatic_genes].X
          if hasattr(X_somatic, "toarray"):
              X_somatic = X_somatic.toarray()
  
          # Varianza como trazador de inestabilidad termodinámica
          var_t = np.var(X_somatic)
          variances.append(var_t)
  
          osk_expr = cells_t[:, osk_gene_name].X
          if hasattr(osk_expr, "toarray"):
              osk_expr = osk_expr.toarray()
          osk_means.append(np.mean(osk_expr))
  
      t_crit_idx = np.argmax(variances)
      t_crit = time_points[t_crit_idx]
      print(f"Bifurcación detectada en t = {t_crit} días. Recortando fase Pulse...")
  
      # Extraemos solo la fase "Pulse" (donde alpha está activo buscando el steady-state)
      t_empirical = np.array(time_points[:t_crit_idx + 1], dtype=np.float32).reshape(-1, 1)
      osk_empirical = np.array(osk_means[:t_crit_idx + 1], dtype=np.float32).reshape(-1, 1)
  
      t_tensor = torch.tensor(t_empirical, requires_grad=True)
      osk_tensor = torch.tensor(osk_empirical)
  
      return t_tensor, osk_tensor, t_crit
  
  
  import torch
  import torch.nn as nn
  
  # =====================================================================
  # 2. ARQUITECTURA PINN IFFL (Incoherent Feed-Forward Loop)
  # =====================================================================
  class iANN_OSK_IFFL(nn.Module):
      def __init__(self, kd_empirico):
          super(iANN_OSK_IFFL, self).__init__()
          self.net = nn.Sequential(
              nn.Linear(1, 32),
              nn.Tanh(),
              nn.Linear(32, 32),
              nn.Tanh(),
              nn.Linear(32, 1)
          )
          
          # PARÁMETROS QUE LA RED DEBE DESCUBRIR
          self.alpha = nn.Parameter(torch.tensor([5.0])) # Tasa transcripcional máxima
          
          # PARÁMETROS FÍSICOS FIJOS (Conocimiento a priori del IFFL)
          self.kd = kd_empirico
          
          # Parámetros del Represor (Puedes volverlos nn.Parameter si quieres que la red los infiera)
          self.tau = 1.5 # Retraso temporal (ej. 1.5 días / 36 horas para que empiece a frenar)
          self.n_hill = 3.0 # Coeficiente de Hill (cooperatividad del represor)
          self.k_represor = 2.0 # Qué tan rápido se acumula el represor tras el retraso
  
      def forward(self, t):
          return self.net(t)
  
  def compute_physics_loss_iffl(model, t_collocation):
      t_collocation.requires_grad = True
      OSK_pred = model(t_collocation)
  
      # 1. Diferenciación Automática para hallar d[OSK]/dt
      dOSK_dt = torch.autograd.grad(
          outputs=OSK_pred,
          inputs=t_collocation,
          grad_outputs=torch.ones_like(OSK_pred),
          create_graph=True
      )[0]
  
      # 2. Ecuación del Represor (Variable Latente)
      # Modelamos el represor [R] saturando el promotor después del tiempo tau
      # Usamos una sigmoide (función logística) para simular la acumulación del represor en el tiempo
      R_t = torch.sigmoid(model.k_represor * (t_collocation - model.tau))
      
      # 3. Función de inhibición de Hill: 1 / (1 + R^n)
      # Si R_t es 0 (antes de tau), el freno es 1 (faucet abierto). 
      # Si R_t es 1 (después de tau), el freno tiende a 0 (faucet cerrado).
      freno_transcripcional = 1.0 / (1.0 + R_t ** model.n_hill)
  
      # 4. Nuevo Residual de la EDO con el IFFL incorporado
      produccion_real = model.alpha * freno_transcripcional
      residual = dOSK_dt - produccion_real + (model.kd * OSK_pred)
      
      return torch.mean(residual ** 2)
  
  # =====================================================================
  # 3. BUCLE DE ENTRENAMIENTO INTEGRADO
  # =====================================================================
  def train_pinn(model, t_data, OSK_data, epochs=3000):
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
      mse_loss_fn = nn.MSELoss()
  
      print("Iniciando inferencia termodinámica para descubrir Alpha...")
      for epoch in range(epochs):
          optimizer.zero_grad()
  
          # A. Pérdida Empírica (ScRNA-seq data)
          OSK_pred_data = model(t_data)
          loss_data = mse_loss_fn(OSK_pred_data, OSK_data)
  
          # B. Pérdida Física Continua
          t_collocation = torch.rand(100, 1) * t_data.max().item()
          loss_ode = compute_physics_loss(model, t_collocation)
  
          # C. Gradiente Global
          loss_total = loss_data + (0.5 * loss_ode)
  
          loss_total.backward()
          optimizer.step()
  
          if epoch % 500 == 0:
              print(f"Epoch {epoch} | Loss: {loss_total.item():.4f} | Alpha inferida: {model.alpha.item():.4f}")
  
  # =====================================================================
  # 4. EJECUCIÓN DEL PIPELINE
  # =====================================================================
  if __name__ == "__main__":
      # 1. Usamos el nuevo archivo con la métrica real
      ruta_geo = "mef_simulado_parametros_empiricos.h5ad"
  
      # 2. Genes reales del nuevo script (quitamos Fap que ya no está)
      genes_fibroblasto = ["Col1a1", "Thy1"]
      gen_diana = "Pou5f1"
  
      t_emp, osk_emp, t_c = extract_critical_dynamics(ruta_geo, genes_fibroblasto, gen_diana)
  
      # 3. Constantes empíricas (Paper: Roux et al. Suppl. Fig S1)
      # Pou5f1 mRNA half-life ~ 3.2 horas -> ~0.1333 días
      # kd = ln(2) / t_half = 5.2 días^-1
      kd_Pou5f1_empirico = np.log(2) / (3.2 / 24.0)
      print(f"Inyectando kd empírico de Pou5f1: {kd_Pou5f1_empirico:.2f} d^-1")
  
      # 4. Entrenamos
      modelo_osk = iANN_OSK(kd_empirico=kd_Pou5f1_empirico)
      train_pinn(modelo_osk, t_emp, osk_emp)
```

Finally obtaining

References:

https://doi.org/10.1101/gad.324319.119

aim1

Fig. 2.A from Dhaliwal et al. (2019) LIF referes to the Leukemia Inhibitory Factor serum in which the stem cells are cultivated to mantain cell plurypotency

Fig.S1 A and B from Roux et al. (2022) supplementary information

Now, with this search done, I proceded with a simple math model that will leverage iANNs loss function to optimize for the best promoters

Mathematical Framework of the OSK Circuit

The core mass-action kinetics of the system is defined by the following ODE:

$$ \frac{d[OSK]}{dt} = \alpha_{0} + \alpha - \gamma[OSK] $$

Where each term represents a distinct biophysical process:

$\alpha_0$: The basal transcription rate (promoter leakiness) in the absence of the inducer.
$\alpha$: The active transcription rate driven by the inducer. This term is constrained by thermodynamic equilibrium and obeys the Hill-equation:
$$ \alpha = \alpha_{max} \frac{[L]^n}{K_d^n + [L]^n} $$
Here, $[L]$ is the concentration of the inducer, $K_d$ is the microscopic dissociation constant (representing the affinity of the inducer to the receptor), and $n$ is the Hill coefficient denoting steric cooperativity.
$\gamma$: The kinetic decay constant, representing the thermodynamic degradation of the OSK protein and cellular dilution. It is inversely proportional to the protein's biological half-life ($t_{1/2}$):
$$ \gamma = \frac{\ln(2)}{t_{1/2}} $$

import numpy as np
import scanpy as sc
import anndata as ad
import pandas as pd

print("Generando transcriptoma sintético de MEFs con datos empíricos...")

# 1. Definimos los días de recolección (Pulse: 0 a 7, Chase: 7 a 15)
dias = [0, 3, 7, 10, 15]
celulas_por_dia = 200

# =========================================================================
# PARÁMETROS EMPÍRICOS EXTRAÍDOS DE LA LITERATURA
# =========================================================================
# Vidas medias de ARNm [horas] (Roux et al. Supplementary Information - Fig S1)
t_half_Pou5f1 = 3.2
t_half_Sox2 = 3.1
t_half_Klf4 = 5.5  # Nota: Dhaliwal et al. reporta >24h en ES, pero usamos el de transito de Roux

# Calculamos la constante de decaimiento kd en [días^-1] (kd = ln(2) / t_half_dias)
# Esto dictará la caída exponencial durante la fase de "Chase"
kd_Pou5f1 = np.log(2) / (t_half_Pou5f1 / 24)  # ~5.2 días^-1
kd_Sox2   = np.log(2) / (t_half_Sox2 / 24)    # ~5.3 días^-1
kd_Klf4   = np.log(2) / (t_half_Klf4 / 24)    # ~3.0 días^-1

X_list = []
obs_list = []

for t in dias:
    # -----------------------------------------------------------------
    # A. DINÁMICA DE LA IDENTIDAD SOMÁTICA (Col1a1, Thy1)
    # -----------------------------------------------------------------
    # [Paper: Roux et al. "Partial reprogramming transiently suppresses somatic identity"]
    # La expresión baja durante la inducción y se recupera tras el chase.
    if t <= 7:
        # Supresión transitoria durante el pulso
        media_somatica = 10 - t 
        varianza = 0.5 + (t * 0.5) # Critical Slowing Down hacia la bifurcación
    else:
        # Re-establecimiento de la identidad tras retirar los factores
        delta_t_chase = t - 7
        media_somatica = 3 + (delta_t_chase * 0.8) # Recuperación lineal/exponencial
        varianza = 1.5 
        
    genes_somaticos = np.random.normal(loc=media_somatica, scale=varianza, size=(celulas_por_dia, 2))
    genes_somaticos = np.clip(genes_somaticos, a_min=0, a_max=None) # No hay scRNA-seq negativo

    # -----------------------------------------------------------------
    # B. DINÁMICA DE LOS FACTORES DE REPROGRAMACIÓN (Pou5f1, Sox2, Klf4)
    # -----------------------------------------------------------------
    # [Paper: Roux et al. Suppl. Info - EDO decay model: p(t) = p(0)e^(-kt)]
    if t <= 7:
        # Fase PULSE: Acumulación hasta estado estacionario (steady-state)
        # Simulamos que alcanzan un nivel máximo (ej. log1p CPM = 8)
        media_Pou5f1 = t * 1.1
        media_Sox2   = t * 1.1
        media_Klf4   = t * 1.1
    else:
        # Fase CHASE: Decaimiento cinético estricto según kd empírico
        delta_t_chase = t - 7
        # Nivel máximo alcanzado en el día 7
        max_expr = 7 * 1.1 
        
        # p(t) = p(0) * exp(-kd * t)
        media_Pou5f1 = max_expr * np.exp(-kd_Pou5f1 * delta_t_chase)
        media_Sox2   = max_expr * np.exp(-kd_Sox2 * delta_t_chase)
        media_Klf4   = max_expr * np.exp(-kd_Klf4 * delta_t_chase)
        
        # [Paper: Roux et al. - "By day 3 of chase, < 0.05% of mRNA remains"]
        # Las medias a partir del día 10 tenderán asintóticamente a 0 rápidamente.

    # Añadimos ruido técnico típico de scRNA-seq (ruido Poisson/Normal)
    gen_Pou5f1 = np.random.normal(loc=media_Pou5f1, scale=0.5, size=(celulas_por_dia, 1))
    gen_Sox2   = np.random.normal(loc=media_Sox2,   scale=0.5, size=(celulas_por_dia, 1))
    gen_Klf4   = np.random.normal(loc=media_Klf4,   scale=0.5, size=(celulas_por_dia, 1))
    
    genes_reprog = np.hstack([gen_Pou5f1, gen_Sox2, gen_Klf4])
    genes_reprog = np.clip(genes_reprog, a_min=0, a_max=None)

    # -----------------------------------------------------------------
    # C. ENSAMBLAJE DE LA MATRIZ DEL TIEMPO t
    # -----------------------------------------------------------------
    X_t = np.hstack([genes_somaticos, genes_reprog])
    X_list.append(X_t)

    # Metadatos
    obs_t = pd.DataFrame({
        'time_days': [t] * celulas_por_dia,
        'phase': ['Pulse' if t <= 7 else 'Chase'] * celulas_por_dia
    })
    obs_list.append(obs_t)

# 3. Ensamblamos el objeto AnnData (.h5ad)
X_total = np.vstack(X_list)
obs_total = pd.concat(obs_list, ignore_index=True)
var_total = pd.DataFrame(index=["Col1a1", "Thy1", "Pou5f1", "Sox2", "Klf4"])

adata_simulado = ad.AnnData(X=X_total, obs=obs_total, var=var_total)

# 4. Guardamos el archivo
ruta_archivo = "mef_simulado_parametros_empiricos.h5ad"
adata_simulado.write_h5ad(ruta_archivo)

print(f"¡Éxito! Archivo {ruta_archivo} creado.")

  import torch
  import torch.nn as nn
  import scanpy as sc
  import numpy as np
  
  # =====================================================================
  # 1. MÓDULO DE FÍSICA ESTADÍSTICA (Scanpy para Critical Slowing Down)
  # =====================================================================
  def extract_critical_dynamics(h5ad_path, somatic_genes, osk_gene_name):
      """
      Lee datos scRNA-seq, calcula la varianza somática para detectar el
      punto de bifurcación y extrae tensores para PyTorch.
      """
      print("Cargando matriz scRNA-seq empírica...")
      adata = sc.read_h5ad(h5ad_path)
  
      time_points = sorted(adata.obs['time_days'].unique())
      variances = []
      osk_means = []
  
      print("Calculando varianza del atractor somático (Álgebra Lineal)...")
      for t in time_points:
          cells_t = adata[adata.obs['time_days'] == t]
          
          X_somatic = cells_t[:, somatic_genes].X
          if hasattr(X_somatic, "toarray"):
              X_somatic = X_somatic.toarray()
  
          # Varianza como trazador de inestabilidad termodinámica
          var_t = np.var(X_somatic)
          variances.append(var_t)
  
          osk_expr = cells_t[:, osk_gene_name].X
          if hasattr(osk_expr, "toarray"):
              osk_expr = osk_expr.toarray()
          osk_means.append(np.mean(osk_expr))
  
      t_crit_idx = np.argmax(variances)
      t_crit = time_points[t_crit_idx]
      print(f"Bifurcación detectada en t = {t_crit} días. Recortando fase Pulse...")
  
      # Extraemos solo la fase "Pulse" (donde alpha está activo buscando el steady-state)
      t_empirical = np.array(time_points[:t_crit_idx + 1], dtype=np.float32).reshape(-1, 1)
      osk_empirical = np.array(osk_means[:t_crit_idx + 1], dtype=np.float32).reshape(-1, 1)
  
      t_tensor = torch.tensor(t_empirical, requires_grad=True)
      osk_tensor = torch.tensor(osk_empirical)
  
      return t_tensor, osk_tensor, t_crit
  
  
  import torch
  import torch.nn as nn
  
  # =====================================================================
  # 2. ARQUITECTURA PINN IFFL (Incoherent Feed-Forward Loop)
  # =====================================================================
  class iANN_OSK_IFFL(nn.Module):
      def __init__(self, kd_empirico):
          super(iANN_OSK_IFFL, self).__init__()
          self.net = nn.Sequential(
              nn.Linear(1, 32),
              nn.Tanh(),
              nn.Linear(32, 32),
              nn.Tanh(),
              nn.Linear(32, 1)
          )
          
          # PARÁMETROS QUE LA RED DEBE DESCUBRIR
          self.alpha = nn.Parameter(torch.tensor([5.0])) # Tasa transcripcional máxima
          
          # PARÁMETROS FÍSICOS FIJOS (Conocimiento a priori del IFFL)
          self.kd = kd_empirico
          
          # Parámetros del Represor (Puedes volverlos nn.Parameter si quieres que la red los infiera)
          self.tau = 1.5 # Retraso temporal (ej. 1.5 días / 36 horas para que empiece a frenar)
          self.n_hill = 3.0 # Coeficiente de Hill (cooperatividad del represor)
          self.k_represor = 2.0 # Qué tan rápido se acumula el represor tras el retraso
  
      def forward(self, t):
          return self.net(t)
  
  def compute_physics_loss_iffl(model, t_collocation):
      t_collocation.requires_grad = True
      OSK_pred = model(t_collocation)
  
      # 1. Diferenciación Automática para hallar d[OSK]/dt
      dOSK_dt = torch.autograd.grad(
          outputs=OSK_pred,
          inputs=t_collocation,
          grad_outputs=torch.ones_like(OSK_pred),
          create_graph=True
      )[0]
  
      # 2. Ecuación del Represor (Variable Latente)
      # Modelamos el represor [R] saturando el promotor después del tiempo tau
      # Usamos una sigmoide (función logística) para simular la acumulación del represor en el tiempo
      R_t = torch.sigmoid(model.k_represor * (t_collocation - model.tau))
      
      # 3. Función de inhibición de Hill: 1 / (1 + R^n)
      # Si R_t es 0 (antes de tau), el freno es 1 (faucet abierto). 
      # Si R_t es 1 (después de tau), el freno tiende a 0 (faucet cerrado).
      freno_transcripcional = 1.0 / (1.0 + R_t ** model.n_hill)
  
      # 4. Nuevo Residual de la EDO con el IFFL incorporado
      produccion_real = model.alpha * freno_transcripcional
      residual = dOSK_dt - produccion_real + (model.kd * OSK_pred)
      
      return torch.mean(residual ** 2)
  
  # =====================================================================
  # 3. BUCLE DE ENTRENAMIENTO INTEGRADO
  # =====================================================================
  def train_pinn(model, t_data, OSK_data, epochs=3000):
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
      mse_loss_fn = nn.MSELoss()
  
      print("Iniciando inferencia termodinámica para descubrir Alpha...")
      for epoch in range(epochs):
          optimizer.zero_grad()
  
          # A. Pérdida Empírica (ScRNA-seq data)
          OSK_pred_data = model(t_data)
          loss_data = mse_loss_fn(OSK_pred_data, OSK_data)
  
          # B. Pérdida Física Continua
          t_collocation = torch.rand(100, 1) * t_data.max().item()
          loss_ode = compute_physics_loss(model, t_collocation)
  
          # C. Gradiente Global
          loss_total = loss_data + (0.5 * loss_ode)
  
          loss_total.backward()
          optimizer.step()
  
          if epoch % 500 == 0:
              print(f"Epoch {epoch} | Loss: {loss_total.item():.4f} | Alpha inferida: {model.alpha.item():.4f}")
  
  # =====================================================================
  # 4. EJECUCIÓN DEL PIPELINE
  # =====================================================================
  if __name__ == "__main__":
      # 1. Usamos el nuevo archivo con la métrica real
      ruta_geo = "mef_simulado_parametros_empiricos.h5ad"
  
      # 2. Genes reales del nuevo script (quitamos Fap que ya no está)
      genes_fibroblasto = ["Col1a1", "Thy1"]
      gen_diana = "Pou5f1"
  
      t_emp, osk_emp, t_c = extract_critical_dynamics(ruta_geo, genes_fibroblasto, gen_diana)
  
      # 3. Constantes empíricas (Paper: Roux et al. Suppl. Fig S1)
      # Pou5f1 mRNA half-life ~ 3.2 horas -> ~0.1333 días
      # kd = ln(2) / t_half = 5.2 días^-1
      kd_Pou5f1_empirico = np.log(2) / (3.2 / 24.0)
      print(f"Inyectando kd empírico de Pou5f1: {kd_Pou5f1_empirico:.2f} d^-1")
  
      # 4. Entrenamos
      modelo_osk = iANN_OSK(kd_empirico=kd_Pou5f1_empirico)
      train_pinn(modelo_osk, t_emp, osk_emp)

Finally obtaining

References:

https://doi.org/10.1101/gad.324319.119

aim 4

The Biological Timeline (Empirical Phase in Fibroblasts)

If you transfect this Incoherent Feed-Forward Loop (IFFL) plasmid into fibroblasts (MEFs), the thermodynamic timeline dictated by mammalian synthesis rates will look approximately like this:

0 to 12 hours (Latency): The plasmid enters the nucleus. RNA polymerase II machinery and ribosomes begin to assemble. There is no functional protein yet.
12 to 36 hours (The Autonomous "Pulse"): Strong transcription of OSK and the repressor begins. Because the repressor must reach a critical concentration threshold (K_D) to bind efficiently to the DNA, it remains ineffective during this window. OSK accumulates almost freely, reaching its maximum peak of cellular expression around 24 to 36 hours.
36 to 48 hours (Bifurcation Point): The repressor crosses its critical threshold ([R] > K_D). The binding sites on the plasmid promoter become saturated. The production term α in the thermodynamic equation rapidly collapses to zero.
48 to 96+ hours (The Autonomous "Chase"): With the transcriptional "faucet" closed from the inside by the repressor, the system enters free fall. From hour 48 onward, the system is entirely dominated by the empirical degradation constant k_d. Recalling that the half-lives of KLF4 and OCT4 range from 2 to 12 hours, by hour 72 to 96, the OSK pulse will have been "cleared" by the proteasome, leaving the cell in its new state or allowing the somatic identity to attempt to re-establish itself.

Labs

Lab writeups:

Week 1 Lab: Pipetting

Week 1 Lab: Pipetting

Projects

Final projects:

Individual Final Project
Using the designing, synthesizing, and editing whole genomes content learned in “building genomes week"for my final project
Group Final Project

Individual Final Project

SECTION 1: ABSTRACT

SECTION 2: PROJECT AIMS

SECTION 3: BACKGROUND

Aim 1: The literature search

Fig. 2.A from Dhaliwal et al. (2019) LIF referes to the Leukemia Inhibitory Factor serum in which the stem cells are cultivated to mantain cell plurypotency

Fig.S1 A and B from Roux et al. (2022) supplementary information

Now, with this search done, I proceded with a simple math model that will leverage iANNs loss function to optimize for the best promoters

Aim 1: The simple math model

Mathematical Framework of the OSK Circuit

The core mass-action kinetics of the system is defined by the following ODE:

$$ \frac{d[OSK]}{dt} = \alpha_{0} + \alpha - \gamma[OSK] $$

Where each term represents a distinct biophysical process:

$\alpha_0$: The basal transcription rate (promoter leakiness) in the absence of the inducer.
$\alpha$: The active transcription rate driven by the inducer. This term is constrained by thermodynamic equilibrium and obeys the Hill-equation:
$$ \alpha = \alpha_{max} \frac{[L]^n}{K_d^n + [L]^n} $$
Here, $[L]$ is the concentration of the inducer, $K_d$ is the microscopic dissociation constant (representing the affinity of the inducer to the receptor), and $n$ is the Hill coefficient denoting steric cooperativity.
$\gamma$: The kinetic decay constant, representing the thermodynamic degradation of the OSK protein and cellular dilution. It is inversely proportional to the protein's biological half-life ($t_{1/2}$):
$$ \gamma = \frac{\ln(2)}{t_{1/2}} $$

Aim 1: Code

print("Generando transcriptoma sintético de MEFs con datos empíricos...")
1. Definimos los días de recolección (Pulse: 0 a 7, Chase: 7 a 15)
dias = [0, 3, 7, 10, 15]
celulas_por_dia = 200
=========================================================================
PARÁMETROS EMPÍRICOS EXTRAÍDOS DE LA LITERATURA
=========================================================================
Vidas medias de ARNm [horas] (Roux et al. Supplementary Information - Fig S1)
t_half_Pou5f1 = 3.2
t_half_Sox2 = 3.1
t_half_Klf4 = 5.5  # Nota: Dhaliwal et al. reporta >24h en ES, pero usamos el de transito de Roux
Calculamos la constante de decaimiento kd en [días^-1] (kd = ln(2) / t_half_dias)
Esto dictará la caída exponencial durante la fase de "Chase"
kd_Pou5f1 = np.log(2) / (t_half_Pou5f1 / 24)  # ~5.2 días^{-1
kd_Sox2   = np.log(2) / (t_half_Sox2 / 24)    # ~5.3 días}-1
kd_Klf4   = np.log(2) / (t_half_Klf4 / 24)    # ~3.0 días^-1
X_list = []
obs_list = []
for t in dias:
# —————————————————————–
# A. DINÁMICA DE LA IDENTIDAD SOMÁTICA (Col1a1, Thy1)
# —————————————————————–
# [Paper: Roux et al. "Partial reprogramming transiently suppresses somatic identity"]
# La expresión baja durante la inducción y se recupera tras el chase.
if t <= 7:
# Supresión transitoria durante el pulso
media_somatica = 10 - t
varianza = 0.5 + (t * 0.5) # Critical Slowing Down hacia la bifurcación
else:
# Re-establecimiento de la identidad tras retirar los factores
delta_t_chase = t - 7
media_somatica = 3 + (delta_t_chase * 0.8) # Recuperación lineal/exponencial
varianza = 1.5
genes_somaticos = np.random.normal(loc=media_somatica, scale=varianza, size=(celulas_por_dia, 2))
genes_somaticos = np.clip(genes_somaticos, a_min=0, a_max=None) # No hay scRNA-seq negativo

# -----------------------------------------------------------------
# B. DINÁMICA DE LOS FACTORES DE REPROGRAMACIÓN (Pou5f1, Sox2, Klf4)
# -----------------------------------------------------------------
# [Paper: Roux et al. Suppl. Info - EDO decay model: p(t) = p(0)e^(-kt)]
if t &lt;= 7:
    # Fase PULSE: Acumulación hasta estado estacionario (steady-state)
    # Simulamos que alcanzan un nivel máximo (ej. log1p CPM = 8)
    media_Pou5f1 = t * 1.1
    media_Sox2   = t * 1.1
    media_Klf4   = t * 1.1
else:
    # Fase CHASE: Decaimiento cinético estricto según kd empírico
    delta_t_chase = t - 7
    # Nivel máximo alcanzado en el día 7
    max_expr = 7 * 1.1 
    
    # p(t) = p(0) * exp(-kd * t)
    media_Pou5f1 = max_expr * np.exp(-kd_Pou5f1 * delta_t_chase)
    media_Sox2   = max_expr * np.exp(-kd_Sox2 * delta_t_chase)
    media_Klf4   = max_expr * np.exp(-kd_Klf4 * delta_t_chase)
    
    # [Paper: Roux et al. - &quot;By day 3 of chase, &lt; 0.05% of mRNA remains&quot;]
    # Las medias a partir del día 10 tenderán asintóticamente a 0 rápidamente.

# Añadimos ruido técnico típico de scRNA-seq (ruido Poisson/Normal)
gen_Pou5f1 = np.random.normal(loc=media_Pou5f1, scale=0.5, size=(celulas_por_dia, 1))
gen_Sox2   = np.random.normal(loc=media_Sox2,   scale=0.5, size=(celulas_por_dia, 1))
gen_Klf4   = np.random.normal(loc=media_Klf4,   scale=0.5, size=(celulas_por_dia, 1))

genes_reprog = np.hstack([gen_Pou5f1, gen_Sox2, gen_Klf4])
genes_reprog = np.clip(genes_reprog, a_min=0, a_max=None)

# -----------------------------------------------------------------
# C. ENSAMBLAJE DE LA MATRIZ DEL TIEMPO t
# -----------------------------------------------------------------
X_t = np.hstack([genes_somaticos, genes_reprog])
X_list.append(X_t)

# Metadatos
obs_t = pd.DataFrame({
    'time_days': [t] * celulas_por_dia,
    'phase': ['Pulse' if t &lt;= 7 else 'Chase'] * celulas_por_dia
})
obs_list.append(obs_t)
3. Ensamblamos el objeto AnnData (.h5ad)
X_total = np.vstack(X_list)
obs_total = pd.concat(obs_list, ignore_index=True)
var_total = pd.DataFrame(index=["Col1a1", "Thy1", "Pou5f1", "Sox2", "Klf4"])
adata_simulado = ad.AnnData(X=X_total, obs=obs_total, var=var_total)
4. Guardamos el archivo
ruta_archivo = "mef_simulado_parametros_empiricos.h5ad"
adata_simulado.write_h5ad(ruta_archivo)
print(f"¡Éxito! Archivo {ruta_archivo} creado.")
</code></pre>
<p>Then, the next step was coding the iANNs training to find the most suitable rate of transcription to match the incoherent feed-forward loop I was planning to add in the plasmid. So, again with Gemini&rsquo;s help, the code was:
import torch
import torch.nn as nn
import scanpy as sc
import numpy as np
  # =====================================================================
  # 1. MÓDULO DE FÍSICA ESTADÍSTICA (Scanpy para Critical Slowing Down)
  # =====================================================================
  def extract_critical_dynamics(h5ad_path, somatic_genes, osk_gene_name):
      """
      Lee datos scRNA-seq, calcula la varianza somática para detectar el
      punto de bifurcación y extrae tensores para PyTorch.
      """
      print("Cargando matriz scRNA-seq empírica...")
      adata = sc.read_h5ad(h5ad_path)
  
      time_points = sorted(adata.obs['time_days'].unique())
      variances = []
      osk_means = []
  
      print("Calculando varianza del atractor somático (Álgebra Lineal)...")
      for t in time_points:
          cells_t = adata[adata.obs['time_days'] == t]
          
          X_somatic = cells_t[:, somatic_genes].X
          if hasattr(X_somatic, "toarray"):
              X_somatic = X_somatic.toarray()
  
          # Varianza como trazador de inestabilidad termodinámica
          var_t = np.var(X_somatic)
          variances.append(var_t)
  
          osk_expr = cells_t[:, osk_gene_name].X
          if hasattr(osk_expr, "toarray"):
              osk_expr = osk_expr.toarray()
          osk_means.append(np.mean(osk_expr))
  
      t_crit_idx = np.argmax(variances)
      t_crit = time_points[t_crit_idx]
      print(f"Bifurcación detectada en t = {t_crit} días. Recortando fase Pulse...")
  
      # Extraemos solo la fase "Pulse" (donde alpha está activo buscando el steady-state)
      t_empirical = np.array(time_points[:t_crit_idx + 1], dtype=np.float32).reshape(-1, 1)
      osk_empirical = np.array(osk_means[:t_crit_idx + 1], dtype=np.float32).reshape(-1, 1)
  
      t_tensor = torch.tensor(t_empirical, requires_grad=True)
      osk_tensor = torch.tensor(osk_empirical)
  
      return t_tensor, osk_tensor, t_crit
  
  
  import torch
  import torch.nn as nn
  
  # =====================================================================
  # 2. ARQUITECTURA PINN IFFL (Incoherent Feed-Forward Loop)
  # =====================================================================
  class iANN_OSK_IFFL(nn.Module):
      def __init__(self, kd_empirico):
          super(iANN_OSK_IFFL, self).__init__()
          self.net = nn.Sequential(
              nn.Linear(1, 32),
              nn.Tanh(),
              nn.Linear(32, 32),
              nn.Tanh(),
              nn.Linear(32, 1)
          )
          
          # PARÁMETROS QUE LA RED DEBE DESCUBRIR
          self.alpha = nn.Parameter(torch.tensor([5.0])) # Tasa transcripcional máxima
          
          # PARÁMETROS FÍSICOS FIJOS (Conocimiento a priori del IFFL)
          self.kd = kd_empirico
          
          # Parámetros del Represor (Puedes volverlos nn.Parameter si quieres que la red los infiera)
          self.tau = 1.5 # Retraso temporal (ej. 1.5 días / 36 horas para que empiece a frenar)
          self.n_hill = 3.0 # Coeficiente de Hill (cooperatividad del represor)
          self.k_represor = 2.0 # Qué tan rápido se acumula el represor tras el retraso
  
      def forward(self, t):
          return self.net(t)
  
  def compute_physics_loss_iffl(model, t_collocation):
      t_collocation.requires_grad = True
      OSK_pred = model(t_collocation)
  
      # 1. Diferenciación Automática para hallar d[OSK]/dt
      dOSK_dt = torch.autograd.grad(
          outputs=OSK_pred,
          inputs=t_collocation,
          grad_outputs=torch.ones_like(OSK_pred),
          create_graph=True
      )[0]
  
      # 2. Ecuación del Represor (Variable Latente)
      # Modelamos el represor [R] saturando el promotor después del tiempo tau
      # Usamos una sigmoide (función logística) para simular la acumulación del represor en el tiempo
      R_t = torch.sigmoid(model.k_represor * (t_collocation - model.tau))
      
      # 3. Función de inhibición de Hill: 1 / (1 + R^n)
      # Si R_t es 0 (antes de tau), el freno es 1 (faucet abierto). 
      # Si R_t es 1 (después de tau), el freno tiende a 0 (faucet cerrado).
      freno_transcripcional = 1.0 / (1.0 + R_t ** model.n_hill)
  
      # 4. Nuevo Residual de la EDO con el IFFL incorporado
      produccion_real = model.alpha * freno_transcripcional
      residual = dOSK_dt - produccion_real + (model.kd * OSK_pred)
      
      return torch.mean(residual ** 2)
  
  # =====================================================================
  # 3. BUCLE DE ENTRENAMIENTO INTEGRADO
  # =====================================================================
  def train_pinn(model, t_data, OSK_data, epochs=3000):
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
      mse_loss_fn = nn.MSELoss()
  
      print("Iniciando inferencia termodinámica para descubrir Alpha...")
      for epoch in range(epochs):
          optimizer.zero_grad()
  
          # A. Pérdida Empírica (ScRNA-seq data)
          OSK_pred_data = model(t_data)
          loss_data = mse_loss_fn(OSK_pred_data, OSK_data)
  
          # B. Pérdida Física Continua
          t_collocation = torch.rand(100, 1) * t_data.max().item()
          loss_ode = compute_physics_loss(model, t_collocation)
  
          # C. Gradiente Global
          loss_total = loss_data + (0.5 * loss_ode)
  
          loss_total.backward()
          optimizer.step()
  
          if epoch % 500 == 0:
              print(f"Epoch {epoch} | Loss: {loss_total.item():.4f} | Alpha inferida: {model.alpha.item():.4f}")
  
  # =====================================================================
  # 4. EJECUCIÓN DEL PIPELINE
  # =====================================================================
  if __name__ == "__main__":
      # 1. Usamos el nuevo archivo con la métrica real
      ruta_geo = "mef_simulado_parametros_empiricos.h5ad"
  
      # 2. Genes reales del nuevo script (quitamos Fap que ya no está)
      genes_fibroblasto = ["Col1a1", "Thy1"]
      gen_diana = "Pou5f1"
  
      t_emp, osk_emp, t_c = extract_critical_dynamics(ruta_geo, genes_fibroblasto, gen_diana)
  
      # 3. Constantes empíricas (Paper: Roux et al. Suppl. Fig S1)
      # Pou5f1 mRNA half-life ~ 3.2 horas -> ~0.1333 días
      # kd = ln(2) / t_half = 5.2 días^-1
      kd_Pou5f1_empirico = np.log(2) / (3.2 / 24.0)
      print(f"Inyectando kd empírico de Pou5f1: {kd_Pou5f1_empirico:.2f} d^-1")
  
      # 4. Entrenamos
      modelo_osk = iANN_OSK(kd_empirico=kd_Pou5f1_empirico)
      train_pinn(modelo_osk, t_emp, osk_emp)
```
Finally obtaining


Aim 1 Plasmid making in Benchling

First I retrieved the sequence of the yamanaka factors from Addgene: Oct3/4 (https://www.addgene.org/13366/) #13366 Sox2 (https://www.addgene.org/13367/) #13367 Klf4 (https://www.addgene.org/13370/) #13370

and made a hOCT4:2A:hSOX2:2A:hKLF4 cassette to insert

Alexandra — HTGAA Spring 2026

About me

Contact info

Weeks

Subsections of Alexandra — HTGAA Spring 2026

Homework

Weekly homework submissions:

Subsections of Homework

Week 1 HW: Principles and Practices

1. “An application of synthetic biology that I recently found out about, and I’m really excited for, is partial cellular reprogramming.”

2. The policy goals

Biosafety:

Public accessibility and equity:

3. The potential actions and the actors:

1. Enhance safety by standard design:

2. Funding to ensure equity and access:

3. Development of transgene control systems

4. Scoring the options

5. What to prioritize

6. Bibliography

7. Searches:

Week 2 HW: Read, write and edit DNA

About next generation gene synthesis

From DNA Synthesis Development and Application:

From reading and writing life:

Week 2 assingments:

Subsections of Week 2 HW: Read, write and edit DNA

Part 1: Benchling & In silico gel art

My attempt at drawing a Teletubbie

Part 3: DNA Design Challenge

1. Obtaining the protein amino acid sequence

2. DNA sequence:

3. Gene and CDS

4. Central Dogma of biology

Part 4: DNA Synthesis order, building my first plasmid

1. Finding an appropriate plasmid backbone:

2. Assembly

And this is the result

Twist order

Part 5: Read, Write and Edit

DNA Read

What DNA would you want to sequence (e.g., read) and why?

In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?

DNA Write

What DNA would you want to synthesize (e.g., write) and why?

What technology or technologies would you use to perform this DNA synthesis and why?

DNA Edit

What DNA would you want to edit and why?

What technology or technologies would you use to perform these DNA edits and why?

Week 3 HW: Lab automation

Subsections of Week 3 HW: Lab automation

Part 1: Opentrons artwork

My attempt

Part 2: Use of automation tools

Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications

Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.

Cloud Lab Workflow (Ginkgo Nebula)

Weeek 4 HW: Protein design - Part 1

Notes about key concepts for deeper understanding

Key Concepts to Know:

Steps for 3D Protein Structure Prediction with ESM-2:

Key Concepts to Know:

Subsections of Weeek 4 HW: Protein design - Part 1

Conceptual questions

How many molecules of amino acids do you take with a piece of 500 grams of meat? (on average an amino acid is ~100 Daltons)

Why do humans eat beef but do not become a cow, eat fish but do not become fish?

Why are there only 20 natural amino acids?

Can you make other non-natural amino acids? Design some new amino acids.

Where did amino acids come from before enzymes that make them, and before life started?

If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?

Why are most molecular helices right-handed?

Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?

Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?

References:

Protein analysis and visualization

Why Sox2?

In relation to Sox2 aminoacid sequence

In relation to protein structure

In relation to the molecule visualization:

Using ML-Based Protein Design Tools

`Group Final Project`