I am a student researcher focused on peptide design and experimental validation, with strong interests in bioinformatics and computational biology. My work centers on the AI-assisted design of inhibitory peptides, followed by in vitro and in vivo experimental testing, combining computational approaches with molecular biology and biochemical validation.
I actively participate in scientific outreach initiatives and the organization of academic events, and I am involved in communities focused on omics sciences, bioinformatics, and science education. My goal is to develop computationally guided biomolecular tools with real experimental impact in biomedicine, microbiology, and biotechnology.
Governance of AI-Driven Biological Design 1. Biological Engineering Application or Tool Description As the core idea of my project, I would like to develop a concept that has been under discussion in my laboratory for some time: a computational platform for the de novo design of peptide and protein ligands capable of inhibiting essential microbial processes, such as translation, with the goal of suppressing or controlling microbial growth.
Part 1: Benchling & In-silico Gel Art
Simulate Restriction Enzyme Digestion with the following Enzymes:
Part 3: DNA Design Challenge
3.1. Choose your protein.
In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose.
Article / case study: Automation at Adaptyv Bio and protein binder design competitions
A prominent example of the use of automation in biology is the work carried out by Adaptyv Bio, a company specialized in laboratory automation and the integration of artificial intelligence for protein design and validation. In particular, Adaptyv Bio organized international protein design competitions, such as the Protein Binder Competition, in which thousands of computationally generated designs were experimentally tested using fully automated workflows.
Part A. Conceptual Questions
Why do humans eat beef but not turn into cows, and eat fish but not turn into fish? Organisms do not incorporate intact proteins. Instead, they degrade them into amino acids, which are then reassembled according to the organism’s own genome. Molecular identity is determined by genetic information, not by the origin of the raw material.
Why are there only 20 natural amino acids? Only 20 were selected by early evolution because they possessed a specific chemistry that allowed for a functional balance of hydrophobicity, specific folding patterns, and various electrical charges.
Part 1: Generation of Peptide Binders with PepMLM
The human SOD1 sequence (P00441) was retrieved from UniProt and the A4V mutation was introduced. Using PepMLM, four peptides of length 12 amino acids were generated conditioned on the mutant SOD1 sequence. A known SOD1-binding peptide (FLYRWLPSRRGG) was added for comparison.
Generated Peptides and Perplexity Scores Peptide Sequence Perplexity PepMLM-0 WRYPAAAAAHKE 8.27 PepMLM-1 WLYYVVALEWGK 23.99 PepMLM-2 WLYYAAALELKE 18.84 PepMLM-3 WRYGVAAVEWKK 15.52 Control FLYRWLPSRRGG N/A
Phusion Master Mix Components This mix contains a high-fidelity polymerase with proofreading activity (3 to 5 exonuclease) to minimize sequence errors. It also includes dNTPs as DNA building blocks, an optimized buffer with salts like magnesium chloride that act as enzymatic cofactors, and stabilizers to maintain the pH and ionic strength required for the reaction.
Factors Determining Annealing Temperature The annealing temperature primarily depends on the melting temperature (T_m) of the primers, which is influenced by sequence length and GC content. Other external factors include the concentration of salts (monovalent and divalent) in the PCR buffer and the concentration of the primers themselves in the mixture.
Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)
What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions? IANNs provide a significant advantage over Boolean genetic circuits by enabling analog computation, which allows cells to process a continuous range of signal concentrations rather than simple on/off states. This capability leads to more efficient signal integration, as a single layer can replace complex cascades of logic gates, while offering greater tunability by adjusting molecular weights like promoter strengths without re-engineering the entire system.
PART 1
Question 1: What are the main advantages of cell-free protein synthesis (CFPS) regarding flexibility and control, and name two cases where it is more beneficial?
Since it is an open system, you have direct control over the chemical environment (pH, redox potential, and salts) and can add synthetic components like non-natural amino acids without being restricted by a cell membrane.
Case 1: Production of cytotoxic proteins that would otherwise kill a living host cell. Case 2: Efficient incorporation of labeled isotopes or synthetic tags for precise protein engineering.
Subsections of Homework
Week 1 HW: Principles and Practices
Governance of AI-Driven Biological Design
1. Biological Engineering Application or Tool
Description
As the core idea of my project, I would like to develop a concept that has been under discussion in my laboratory for some time: a computational platform for the de novo design of peptide and protein ligands capable of inhibiting essential microbial processes, such as translation, with the goal of suppressing or controlling microbial growth.
This platform would integrate computational biochemistry, focused on deep-learning-based structure prediction, together with generative sequence design and molecular dynamics simulations to generate ligands that selectively bind to key molecular interactions in microbial metabolism, including protein–protein and protein–nucleic acid interactions. Tools such as RosettaFold Diffusion and BindCraft have demonstrated remarkable capabilities for designing ligands with high affinity and specificity, consolidating this approach as a promising strategy for the rational development of new antimicrobial agents.
2. Governance and Policy Goals
General Governance Objective
Ensure that computational platforms for antimicrobial ligand design are developed and implemented in ways that maximize public health benefits while minimizing risks of misuse, accidental harm, and unethical applications.
Objective 1: Prevent Malicious or Irresponsible Use of Molecular Design Platforms
Sub-goals:
Limit the use of ligand design tools to prevent the generation of molecules that could increase microbial virulence, toxicity, immune evasion, or environmental persistence.
Establish mechanisms to identify, evaluate, and manage cases of dual-use research of concern (DURC) arising from computational molecular engineering.
Implement educational mechanisms focused on responsible AI use, aimed at preventing unintentionally harmful applications.
Objective 2: Strengthen Biosecurity Throughout the Research Process
Sub-goals:
Ensure that computational design workflows incorporate early biological risk assessment, and promote rigorous experimental validation protocols that evaluate toxicity and potential ecological impact before advancing projects.
Objective 3: Promote Responsible Innovation and Transparency in AI-Driven Bioengineering
Sub-goals:
Encourage documentation, traceability, and auditability of molecular design decisions, supported by open scientific communication and the sharing of best practices related to risk prevention and mitigation.
3. Governance Actions
Action 1 — Integrated Biosecurity Filters in Molecular Design Platforms
Purpose:
Current molecular design tools optimize binding and stability without systematic safety analysis. The proposed change is to integrate mandatory biosecurity filters that detect or log potentially dangerous sequences or misuse, particularly those related to pathogens or viral components.
Design:
This action would involve collaboration with the laboratories and organizations responsible for developing these software platforms to implement built-in safety checks.
Assumptions:
It is assumed that harmful biological functions can be predicted and flagged computationally with sufficient accuracy.
Risks of Failure & Success: Failures include false negatives, where harmful designs are not detected. Additionally, many platforms operate locally and offline, limiting centralized monitoring.
Action 2 — Institutional Oversight for AI-Driven Molecular Engineering
Purpose:
Establish specialized review processes to evaluate dual-use risks before experimental implementation.
Design: Universities, ethics committees, funding agencies, and regulatory bodies would implement multidisciplinary review panels and mandatory risk assessments prior to project approval and funding.
Assumptions:
This approach assumes institutional capacity for technical risk evaluation and researcher compliance.
Risks of Failure & Success: Excessive regulation may discourage exploratory research*, particularly in low-resource environments.
Action 3 — Tiered Access and Licensing of Advanced Molecular Design Platforms
Purpose: Implement user identification and credential-based access models to monitor and deter misuse.
Design: Platform developers, regulatory agencies, and academic consortia would establish access levels, authentication systems, and activity monitoring.
Assumptions: This approach assumes credential-based access control is enforceable and accepted by researchers.
Risks of Failure & Success:
Failures include excluding under-resourced researchers and the emergence of unregulated alternative tools.
4. Governance Option Evaluation Matrix
Does the option:
Option 1
Option 2
Option 3
Enhance Biosecurity
• By preventing incidents
1
2
1
• By helping respond
2
1
2
Foster Lab Safety
• By preventing incident
2
1
2
• By helping respond
2
1
2
Protect the environment
• By preventing incidents
2
1
2
• By helping respond
3
1
2
Other considerations
• Minimizing costs and burdens to stakeholders
2
3
3
• Feasibility?
1
2
3
• Not impede research
2
3
1
• Promote constructive applications
1
2
2
5. Governance Prioritization and Recommendation
Based on the scoring and overall evaluation, I would prioritize interinstitutional oversight mechanisms and tiered-access systems as the most effective governance options. The main reason for this prioritization is that the primary risk factor for misuse lies in the broad open access to these software tools, combined with limited monitoring of their actual use. Since many AI-based biological design platforms are freely accessible, traceability, accountability, and early risk detection are currently limited, substantially increasing the risk of accidental misuse or deliberate malicious exploitation.
The administrative burden, slower research workflows, and barriers for resource-limited institutions were considered key trade-offs. However, the significant improvements in risk mitigation, accountability, and governance transparency outweigh these disadvantages. Furthermore, these limitations can be reduced through careful system design, including fast-track approval procedures and international collaboration frameworks.
This recommendation assumes that institutions possess the organizational and technical capacity to implement oversight systems and that researchers will comply with access regulations. Major uncertainties remain regarding global regulatory harmonization, consistent enforcement, and the adaptability of governance frameworks in response to the rapid evolution of AI capabilities.
Target Audience:
This recommendation is primarily directed at major research institutions, international scientific organizations, and national regulatory agencies, aiming to establish coordinated governance structures that balance innovation, safety, and public protection.
Homework Questions & Answers
Homework Questions from Professor Jacobson
Questions
Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?
How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice, what are some of the reasons that all of these different codes don’t work to code for the protein of interest?
Answers
DNA polymerase has an intrinsic error rate of approximately 1 mistake per 10⁶ nucleotides incorporated. Given that the human genome is roughly 3 × 10⁹ base pairs long, this would result in thousands of errors per replication cycle in the absence of correction mechanisms. Biology addresses this discrepancy through the proofreading activity of DNA polymerase and post-replicative mismatch repair systems, which dramatically reduce the final mutation rate.
Due to the degeneracy of the genetic code, there exists an astronomically large number of DNA sequences capable of encoding an average human protein. However, in practice, not all of these sequences are equally viable. Factors such as mRNA stability, codon usage bias, translational efficiency, secondary structure formation, and regulatory sequence constraints limit the set of functional coding sequences.
Homework Questions from Dr. LeProust
Questions
What’s the most commonly used method for oligo synthesis currently?
Why is it difficult to make oligos longer than 200 nt via direct synthesis?
Why can’t you make a 2000 bp gene via direct oligo synthesis?
Answers
The most commonly used method for oligonucleotide synthesis is solid-phase chemical synthesis using phosphoramidite chemistry.
It is difficult to synthesize oligonucleotides longer than ~200 nucleotides because the coupling efficiency at each synthesis cycle is not perfect, leading to the progressive accumulation of errors and truncated products as the length increases.
A 2000 bp gene cannot be synthesized directly because the cumulative error rate and product truncation become overwhelmingly high, preventing the recovery of a correct full-length sequence in sufficient yield and purity.
Homework Question from George Church
Question
What are the 10 essential amino acids in all animals, and how does this affect your view of the “Lysine Contingency”?
Answer
The ten essential amino acids in animals are:
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Threonine
Tryptophan
Valine
Arginine
title: ‘Week 2’
weight: 10
Homework Questions & Answers
Homework Questions from Professor Jacobson
Questions
Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?
How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice, what are some of the reasons that all of these different codes don’t work to code for the protein of interest?
Answers
DNA polymerase has an intrinsic error rate of approximately 1 mistake per 10⁶ nucleotides incorporated. Given that the human genome is roughly 3 × 10⁹ base pairs long, this would result in thousands of errors per replication cycle in the absence of correction mechanisms. Biology addresses this discrepancy through the proofreading activity of DNA polymerase and post-replicative mismatch repair systems, which dramatically reduce the final mutation rate.
Due to the degeneracy of the genetic code, there exists an astronomically large number of DNA sequences capable of encoding an average human protein. However, in practice, not all of these sequences are equally viable. Factors such as mRNA stability, codon usage bias, translational efficiency, secondary structure formation, and regulatory sequence constraints limit the set of functional coding sequences.
Homework Questions from Dr. LeProust
Questions
What’s the most commonly used method for oligo synthesis currently?
Why is it difficult to make oligos longer than 200 nt via direct synthesis?
Why can’t you make a 2000 bp gene via direct oligo synthesis?
Answers
The most commonly used method for oligonucleotide synthesis is solid-phase chemical synthesis using phosphoramidite chemistry.
It is difficult to synthesize oligonucleotides longer than ~200 nucleotides because the coupling efficiency at each synthesis cycle is not perfect, leading to the progressive accumulation of errors and truncated products as the length increases.
A 2000 bp gene cannot be synthesized directly because the cumulative error rate and product truncation become overwhelmingly high, preventing the recovery of a correct full-length sequence in sufficient yield and purity.
Homework Question from George Church
Question
What are the 10 essential amino acids in all animals, and how does this affect your view of the “Lysine Contingency”?
Answer
The ten essential amino acids in animals are:
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Threonine
Tryptophan
Valine
Arginine
Week 2 — DNA Read, Write, & Edit
Part 1: Benchling & In-silico Gel Art
Simulate Restriction Enzyme Digestion with the following Enzymes:
Part 3: DNA Design Challenge
3.1. Choose your protein.
In recitation, we discussed that you will pick a protein for your homework that you find interesting. Which protein have you chosen and why? Using one of the tools described in recitation (NCBI, UniProt, google), obtain the protein sequence for the protein you chose.
The protein selected for this assignment is Translation Initiation Factor IF-2 (IF2) from Escherichia coli (strain K12).
I chose IF2 because it plays a central role in the initiation of protein synthesis, a critical and highly regulated step of gene expression. IF2 is responsible for promoting the binding of the initiator tRNA to the ribosomal P site and facilitating the correct assembly of the translation initiation complex. Due to its essential function, IF2 is a key factor in controlling translational efficiency and fidelity. Additionally, IF2 is highly conserved among bacteria, making it an important target for studies in molecular biology, ribosome dynamics, and antibiotic development. Its biological relevance and mechanistic complexity make it a particularly interesting protein to study.
The amino acid sequence of IF2 was obtained from UniProt.
3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.
The Central Dogma discussed in class and recitation describes the process in which DNA sequence becomes transcribed and translated into protein. The Central Dogma gives us the framework to work backwards from a given protein sequence and infer the DNA sequence that the protein is derived from. Using one of the tools discussed in class, NCBI or online tools (google “reverse translation tools”), determine the nucleotide sequence that corresponds to the protein sequence you chose above.
Using the reverse translation tool available in Benchling, the amino acid sequence of Translation Initiation Factor IF-2 (IF2) from Escherichia coli (strain K12) was converted into a corresponding nucleotide sequence.
I’ve shortened the sequense because it’s very long.
3.3. Codon optimization.
Once a nucleotide sequence of your protein is determined, you need to codon optimize your sequence. You may, once again, utilize google for a “codon optimization tool”. In your own words, describe why you need to optimize codon usage. Which organism have you chosen to optimize the codon sequence for and why?
Codon optimization is necessary because, although multiple codons can encode the same amino acid, different organisms show preferences for specific codons. These preferences are related to the abundance of corresponding tRNAs and directly affect translation efficiency, protein yield, and overall expression levels.
The organism selected for codon optimization was Escherichia coli, because it is one of the most widely used hosts for recombinant protein expression. E. coli offers fast growth, low cost, well-established genetic tools, and high-level protein production.
I honestly don’t know how to represent it, so I’m going to include the evidence from Bencling.
3.4. You have a sequence! Now what?
What technologies could be used to produce this protein from your DNA? Describe in your words the DNA sequence can be transcribed and translated into your protein. You may describe either cell-dependent or cell-free methods, or both.
In a cell-dependent system, the optimized DNA is cloned into an expression plasmid and introduced into E. coli. Inside the cell, the DNA is transcribed into mRNA and then translated by ribosomes into a protein, which folds into its functional form and can later be purified using techniques such as affinity chromatography.
From personal experience I use cell-dependent but the approach of both systems, the underlying process follows the Central Dogma of Molecular Biology, where DNA is transcribed into mRNA and then translated into protein.
Part 4: Prepare a Twist DNA Synthesis Order
4.1. and 4.2. Build Your DNA Insert Sequence
The final DNA construct includes the essential regulatory and coding elements required for protein expression: a promoter to initiate transcription, a ribosome binding site (RBS) to enable efficient translation, a start codon (ATG), the codon-optimized coding sequence of the target protein, a C-terminal 7×His tag to facilitate protein purification, a stop codon, and a transcription terminator to properly end transcription. Together, these components ensure efficient transcription, translation, and purification of the recombinant protein in E. coli.
4.3. to 4.6.
The pET target (AMP), a recombinant expression vector intended for Escherichia coli IF2 protein production, displays the IF2_MOD plasmid map in the figure. Strong, controlled transcription and translation of the inserted gene are made possible by the plasmid’s T7 promoter and T7 ribosome binding site (RBS). The lac operator (lacO) and the lacI repressor regulate expression, enabling IPTG induction. The plasmid also has a high-copy-number replication origin, which guarantees effective plasmid maintenance, and the ampicillin resistance gene (ampR), which is used to select transformed cells.
Part 5: DNA Read/Write/Edit
(i) What DNA would you want to sequence (e.g., read) and why? This could be DNA related to human health (e.g. genes related to disease research), environmental monitoring (e.g., sewage waste water, biodiversity analysis), and beyond (e.g. DNA data storage, biobank).
I would want to sequence genes related to antibiotic resistance in pathogenic bacteria, because antimicrobial resistance is a major global health problem. By sequencing these genes, we can identify resistance mechanisms, track how they spread among bacterial populations, and monitor the emergence of new resistant strains. This information is essential for improving disease surveillance, guiding treatment decisions, and developing new strategies to control infections.
(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?
For exaple: This method is a second-generation sequencing technology that is ideal for studying antibiotic resistance genes in pathogenic bacteria. The input is purified DNA extracted from bacterial or environmental samples, which is prepared by fragmentation, adapter ligation, and PCR amplification to generate a sequencing library. Sequencing is performed using sequencing-by-synthesis, where fluorescently labeled nucleotides are incorporated and detected to accurately decode each DNA base. The output consists of millions of short DNA sequence reads, which can be analyzed to identify resistance genes, detect mutations, and monitor their spread in bacterial populations.
5.2 DNA Write
(i) What DNA would you want to synthesize (e.g., write) and why? These could be individual genes, clusters of genes or genetic circuits, whole genomes, and beyond. As described in class thus far, applications could range from therapeutics and drug discovery (e.g., mRNA vaccines and therapies) to novel biomaterials (e.g. structural proteins), to sensors (e.g., genetic circuits for sensing and responding to inflammation, environmental stimuli, etc.), to art (DNA origamis). If possible, include the specific genetic sequence(s) of what you would like to synthesize! You will have the opportunity to actually have Twist synthesize these DNA constructs! :)
I would like to synthesize genes with optimized codons that encode antimicrobial peptides and inhibitors of proteins involved in bacterial translation, such as peptides that interact with IF2, as these could be used for drug discovery and the development of new antibiotics. These synthetic DNA sequences could then be inserted into expression vectors to rapidly produce and test novel therapeutic proteins.
(ii) What technology or technologies would you use to perform this DNA synthesis and why?
I would use solid-phase phosphoramidite DNA synthesis, followed by FPLC purification, because this method allows precise and reliable chemical synthesis of custom DNA sequences. The essential steps include stepwise nucleotide coupling, oxidation, capping, and deprotection to build the DNA strand, followed by cleavage from the solid support. The main limitations of this method are sequence length constraints (typically up to ~200 bp per fragment), synthesis errors that accumulate with longer sequences, and moderate scalability, which requires assembly of longer constructs from shorter fragments.
5.3 DNA Edit
(i) What DNA would you want to edit and why?
I want to edit bacterial genes involved in antibiotic resistance and translation initiation, such as infB (encoding IF2), to study their function and to develop new antimicrobial strategies. Editing this DNA would allow precise modification of key residues to understand their role in protein synthesis and to identify vulnerabilities that can be targeted for drug development.
(ii) What technology or technologies would you use to perform these DNA edits and why?
I would use CRISPR-Cas9 gene editing technology because it allows precise, efficient, and targeted modification of DNA sequences. Its high accuracy, simplicity, and adaptability make it ideal for editing bacterial genes such as infB to investigate protein function and antibiotic resistance mechanisms.
Week 3 — Lab Automation
1. Article / case study: Automation at Adaptyv Bio and protein binder design competitions
A prominent example of the use of automation in biology is the work carried out by Adaptyv Bio, a company specialized in laboratory automation and the integration of artificial intelligence for protein design and validation. In particular, Adaptyv Bio organized international protein design competitions, such as the Protein Binder Competition, in which thousands of computationally generated designs were experimentally tested using fully automated workflows.
In these competitions, participants used AI models to design proteins capable of binding to specific therapeutic targets, such as the EGFR receptor and, in similar events, emerging viral proteins such as those from Nipah virus. Subsequently, the best designs were synthesized, expressed, and characterized using automated robotic pipelines, including cloning, protein expression, and affinity assays through Bio-Layer Interferometry (BLI). The entire experimental process was conducted in high-throughput robotic laboratories, enabling the rapid, reproducible, and standardized evaluation of hundreds of proteins.
This approach demonstrated how the integration of artificial intelligence with robotic automation can dramatically accelerate the design–build–test–learn (DBTL) cycle, reducing costs, human error, and experimental time, while simultaneously generating large volumes of reproducible data to improve predictive models.
2. Automation proposal for the final project
For my final project, I plan to implement an automated flow for the design, expression, and functional evaluation of protein binders targeted at essential bacterial targets, combining artificial intelligence, structural bioinformatics, and experimental automation.
Automated general flow
A. Computational design of binders
Use of generative AI models and structural prediction (ProteinMPNN, RFdiffusion, AlphaFold2).
Molecular docking assessment and molecular dynamics simulations.
Automatic prioritization of candidates with better affinity and stability.
B. Experimental automation with Opentrons
Automated cloning of binder genes into expression vectors.
Bacterial transformation, expression induction and culture preparation.
Automated bacterial growth assays and inhibition measurement.
C. Functional validation
Automated reading of OD600.
Analysis of growth curves.
Statistical comparison between controls and strains expressing binders.
Week 4 — Protein Design Part I
Part A. Conceptual Questions
Why do humans eat beef but not turn into cows, and eat fish but not turn into fish?
Organisms do not incorporate intact proteins. Instead, they degrade them into amino acids, which are then reassembled according to the organism’s own genome. Molecular identity is determined by genetic information, not by the origin of the raw material.
Why are there only 20 natural amino acids?
Only 20 were selected by early evolution because they possessed a specific chemistry that allowed for a functional balance of hydrophobicity, specific folding patterns, and various electrical charges.
Can unnatural amino acids be created? Design new ones.
Yes. Thousands have already been created; the limitation is functional rather than chemical.
Rational Design Examples:
Fluorinated side-chain amino acid: Increases hydrophobicity and thermal stability.
Azide or Alkyne group amino acid: Enables “click chemistry” for selective molecular labeling.
Where did amino acids come from before enzymes and life?
There is solid evidence from abiotic synthesis, most notably demonstrated by the Miller–Urey experiments, which showed that organic compounds could form from inorganic precursors under simulated primitive Earth conditions.
If you build an α-helix with D-amino acids, what chirality would it have?
It would form a left-handed (levogyre) helix.
Can you discover additional helices in proteins?
The common assumption that only α-helices and 3
10
-helices exist is false. Others include:
π-helices
Polyproline helices (I and II)
Mixed helices
Transient dynamic helices
Why are most molecular helices right-handed?
Biological homochirality (the prevalence of L-amino acids) imposes a geometry that:
Minimizes steric collisions.
Optimizes hydrogen bonding.
Maximizes overall structural stability.
Why do β-sheets tend to aggregate?
In physical terms, aggregation reduces free energy, making the clustered state more thermodynamically favorable.
What is the driving force behind β-aggregation?
It is not a single force, but rather a balance of:
Cooperative hydrogen bonding.
The hydrophobic effect.
Entropy gain from released water molecules.
Steric “zipper” stacking.
Can β-amyloid sheets be used as materials?
Yes, this is a key emerging field.
Applications include:
Structure Name: Nipah virus attachment glycoprotein in complex with human cell surface receptor ephrinB2.
Resolution and Quality: 1.80 Å (High quality, resolved on May 20, 2008).
Other Molecules: Contains NAG (N-acetylglucosamine) ligands and the human ephrin-B2 receptor.
Structural Classification: Classified as a Hydrolase.
3D Visualization Guide
Cartoon Representation:
Ribbon Representation:
Ball and Stick Representation:
Part C. Using ML-Based Protein Design Tools
C1. Protein Language Modeling
Deep Mutational ScansTo analyze the functional robustness of the Nipah virus G glycoprotein, the ESM2 language model was used to generate an unsupervised deep mutational scan (DMS) based on sequence likelihoods.
Heatmap Patterns: The mutation scan reveals distinct vertical bands of high conservation (dark purple), indicating positions where any amino acid substitution is highly penalized.
Critical Residues: A specific sensitive region is observed near the beginning of the sequence (approx. positions 100–120), where substitutions with large hydrophobic or charged residues significantly lower the model score, suggesting a critical structural core.
Functional Insights: These high-sensitivity areas likely correspond to the $\beta$-propeller folds required for ephrin-B2 receptor binding.
Latent Space Analysis
A latent space analysis was performed by embedding the protein sequence dataset into a reduced dimensionality using t-SNE.
Neighborhood Clusters: The 3D map shows clear clusters where proteins are organized by structural and evolutionary similarity.
Protein Positioning: The Nipah G protein is located within a specific neighborhood alongside other Henipavirus homologs (such as Hendra virus), confirming that the model captures biological relationships without explicit labels.
C2. Protein Folding
Folding with ESMFold
The atomic-level structure of the protein was predicted from its primary sequence using ESMFold.
Coordinate Matching: The predicted coordinates show a high degree of structural alignment with the original experimental structure 2VSM.
Structural Features: The model accurately recovers the six-bladed $\beta$-propeller globular domain responsible for receptor recognition.
Mutation Resilience: While the protein core is resilient to single-point mutations suggested by the DMS, large segment deletions lead to a collapse in structural confidence (low pLDDT) in those regions.
C3. Protein Generation
Inverse-Folding with ProteinMPNN
Using the backbone of the 2VSM structure, we utilized ProteinMPNN to propose new sequence candidates that could adopt the same fold.
Sequence Probabilities: The probability map indicates that the model has high confidence in specific “anchor” residues necessary to maintain internal packing.
Sequence Recovery: A sequence recovery of 49.6% was achieved (seq_recovery = 0.4964), showing that the model retains the essential native core while exploring variations on the protein surface.
Structural Validation: When the designed sequence was re-folded using ESMFold, the resulting structure matched the original backbone, validating the success of the inverse-folding design.
Part D. Group Brainstorm on Bacteriophage Engineering
Main Goals
Stabilize the Lysis Protein (Protein L): Specifically focused on enhancing the stability of the transmembrane domain.
Maintain Functional Motif Interaction: Ensuring that any modifications do not disrupt the critical interaction of the Leu48–Ser49 (LS) motif with its membrane target.
Proposed Tools and Approaches (Pipeline)
The strategy combines evolutionary, structural, and physical perspectives to optimize the protein:
Protein Language Models (pLMs): Used for directed in silico mutagenesis to generate variants consistent with evolutionary constraints.
AlphaFold2: Employed for 3D structural prediction to verify that the transmembrane topology and functional orientation of the LS motif remain intact.
FoldX / Rosetta: Used to estimate $\Delta\Delta G$ (Folding Free Energy), allowing for the prioritization of mutants with high thermodynamic stability.
GROMACS: Utilized for Molecular Dynamics (MD) in a membrane environment to evaluate structural stability and flexibility within a realistic bacterial lipid bilayer.
Why These Tools?
This integrated approach allows for a rational exploration of mutations while minimizing the risk of disrupting lytic function:
pLMs ensure designed variants are likely to remain properly folded and functional.
AlphaFold2 enables rapid structural screening of the transmembrane region.
Energy-based predictions prioritize candidates, reducing the need for exhaustive downstream analysis.
MD simulations provide a physiological context (bacterial membrane) necessary for validating transmembrane protein behavior.
Potential Pitfalls
Prediction Accuracy: Current tools may have limited reliability for small, membrane-associated, and partially disordered proteins.
Stability-Function Trade-off: Increasing thermodynamic stability might reduce the conformational flexibility required to interact with membrane targets, potentially impairing lytic activity.
Schematic of the Pipeline
Week 5 — Protein Design Part II
Part 1: Generation of Peptide Binders with PepMLM
The human SOD1 sequence (P00441) was retrieved from UniProt and the A4V mutation was introduced. Using PepMLM, four peptides of length 12 amino acids were generated conditioned on the mutant SOD1 sequence. A known SOD1-binding peptide (FLYRWLPSRRGG) was added for comparison.
The perplexity scores indicate PepMLM’s confidence in the generated binders, with lower scores representing higher confidence. PepMLM-0 (WRYPAAAAAHKE) shows the lowest perplexity (8.27), suggesting the model is most confident in this peptide as a potential binder.
(I ran it twice because I was getting amino acids incompatible with Alphafold.)
Part 2: AlphaFold3 Complex Modeling
Each peptide was modeled in complex with the A4V mutant SOD1 using AlphaFold3. The ipTM (interface predicted Template Modeling) score provides a measure of predicted binding confidence, while pTM indicates overall fold confidence.
AlphaFold3 Results Summary
Peptide Sequence ipTM Score pTM Score Binding Region Observations
PepMLM-0 WRYPAAAAAHKE 0.29 0.82 Appears to bind near the β-barrel region, surface-bound orientation
PepMLM-1 WLYYVVALEWGK 0.29 0.78 Localizes near the C-terminal region, partially surface-exposed
PepMLM-2 WLYYAAALELKE 0.24 0.68 Engages the β-barrel region but with lower confidence
PepMLM-3 WRYGVAAVEWKK 0.34 0.80 Binds near the N-terminal region, in proximity to A4V mutation site
Control FLYRWLPSRRGG 0.41 0.78 Binds near the N-terminal region, engaging the A4V mutation site
ipTM Analysis:
The ipTM scores range from 0.24 to 0.41 across all peptides, with the known binder (control) achieving the highest score (0.41). Among the PepMLM-generated peptides, PepMLM-3 (WRYGVAAVEWKK) shows the highest ipTM score at 0.34, approaching but not exceeding the control peptide’s binding confidence. PepMLM-0 and PepMLM-1 both show moderate ipTM scores of 0.29, while PepMLM-2 demonstrates the weakest predicted binding with an ipTM of 0.24.
Binding localization observations:
The control peptide (FLYRWLPSRRGG) and PepMLM-3 appear to bind near the N-terminus where the A4V mutation is located, potentially making them sensitive to the mutation’s effects and capable of modulating its impact.
PepMLM-0 engages the β-barrel region, a structurally important area for SOD1 stability and aggregation propensity.
PepMLM-1 localizes near the C-terminal region, away from the mutation site, which may offer a different mechanism of action.
PepMLM-2 shows the weakest interaction confidence and appears more diffusely associated with the protein surface.
Part 3: Therapeutic Property Evaluation with PeptiVerse
Each peptide was evaluated for key therapeutic properties including solubility, hemolysis risk, binding affinity prediction, and physicochemical characteristics.
Control:
Peptide 0:
Peptide 1:
Peptide 2:
Peptide 3:
Comparing the AlphaFold3 structural predictions with PeptiVerse properties reveals several insights:
Correlation between ipTM and predicted binding affinity: There is partial alignment between structural confidence and predicted binding strength. PepMLM-3, which had the highest ipTM (0.34) among generated peptides, also shows the strongest predicted binding affinity (6.916 pKd). PepMLM-0 and PepMLM-1 share identical ipTM scores (0.29) but show different predicted affinities (4.974 vs 6.584 pKd), suggesting that structural confidence doesn’t always correlate directly with biochemically measured binding strength. The control peptide achieves the highest ipTM (0.41) but only moderate predicted affinity (5.968 pKd).
Hemolysis risk assessment: PepMLM-3 shows notably higher hemolysis probability (0.848) compared to other peptides, despite being structurally promising. This represents a significant therapeutic concern. In contrast, PepMLM-0 demonstrates excellent hemolysis safety (0.013) while maintaining reasonable structural confidence. PepMLM-1 shows moderate hemolysis risk (0.209), higher than PepMLM-0 and the control but significantly lower than PepMLM-3.
Solubility and physicochemical properties: All peptides are predicted to be soluble. PepMLM-0 and PepMLM-3 are hydrophilic (negative GRAVY scores), while PepMLM-1 and PepMLM-2 are somewhat hydrophobic. The control peptide is the most hydrophilic and carries the highest positive charge at physiological pH. PepMLM-1 has a near-neutral charge at pH 7 (-0.24), which may favor membrane permeability.
Therapeutic balance assessment:
PepMLM-0: Excellent safety profile (very low hemolysis), good solubility, moderate structural confidence (ipTM 0.29), but weakest predicted binding affinity
PepMLM-1: Moderate structural confidence (ipTM 0.29), good predicted binding affinity (6.584 pKd), acceptable hemolysis risk (0.209), hydrophobic character may affect bioavailability
PepMLM-2: Good hemolysis safety, but weakest structural confidence (ipTM 0.24) and moderate binding prediction
PepMLM-3: Best structural confidence and predicted binding affinity, but concerning hemolysis probability (0.848) - highest among all candidates
Control: Best structural confidence, moderate binding prediction, excellent safety profile, but largest size and highest charge
Part 4: Generate Optimized Peptides with moPPIt
The moPPIt-generated peptides represent a significant advancement over the initial PepMLM designs by incorporating multi-objective optimization directly into the generative process. The explicit targeting of residues 1-5 (containing the A4V mutation) and the weighting of affinity, motif, and specificity objectives should theoretically yield peptides with:
Higher specificity for the intended binding site
Improved therapeutic properties
Greater potential for clinical success
The shift from passive sampling (PepMLM) to active steering (moPPIt) enables rational design of peptide candidates with built-in consideration of both binding and drug-like properties—a critical advantage for therapeutic development.
Part C:
Execution of the Mutation Score Notebook
An attempt was made to run the provided notebook to generate mutation scores for each position of the L protein. However, the following difficulties were encountered:
Problems running the notebook: I tried to run the notebook, but due to internet connectivity issues in my country, it failed to load completely even after waiting for a considerable amount of time.
Execution time: The computational process for the complete protein (75 amino acids × 19 possible mutations = ~1425 predictions) exceeded the available time in the free Colab session.
Memory errors: The underlying model consumed more RAM than was available, causing the kernel to restart before completing the analysis.
Result: It was not possible to obtain the .csv file with scores for all mutations.
Week 6 — Genetic Circuits Part I: Assembly Technologies
Phusion Master Mix Components
This mix contains a high-fidelity polymerase with proofreading activity (3 to 5 exonuclease) to minimize sequence errors. It also includes dNTPs as DNA building blocks, an optimized buffer with salts like magnesium chloride that act as enzymatic cofactors, and stabilizers to maintain the pH and ionic strength required for the reaction.
Factors Determining Annealing Temperature
The annealing temperature primarily depends on the melting temperature (T_m) of the primers, which is influenced by sequence length and GC content. Other external factors include the concentration of salts (monovalent and divalent) in the PCR buffer and the concentration of the primers themselves in the mixture.
Comparison: PCR vs. Restriction Digest
PCR exponentially amplifies specific sequences using a polymerase and thermal cycles, while digestion cuts pre-existing DNA at specific recognition sites using endonucleases. PCR is preferable when there is a low amount of template or when mutations and overlap sites need to be introduced; digestion is faster for verifying fragments or when pure genomic/plasmid DNA with exact cleavage sites is available.
Ensuring Fragments for Gibson Assembly
It is essential for fragments to share overlap sequences at their ends (usually 20 to 40 base pairs) so that the exonuclease can generate the necessary cohesive ends. Additionally, post-PCR DNA purification must be performed, and the original template should be treated with DpnI to eliminate the background of circular parental plasmid that lacks the designed overlaps.
DNA Entry into E. coli
Plasmid DNA enters through the creation of temporary pores in the cell membrane induced by thermal or electrical shock. In the heat shock method, the sudden temperature change alters the fluidity of the lipid bilayer, allowing negatively charged DNA to cross the cell wall and membrane into the cytoplasm via diffusion.
Golden Gate Assembly
Golden Gate Assembly uses Type IIS restriction enzymes that cut outside their recognition site, generating unique and programmable cohesive ends. This allows for the assembly of multiple fragments in a single step and a “single-tube” reaction through cycles of digestion and ligation. Unlike Gibson, the design requires the cleavage sites to be removed during ligation, leaving a specific scar or none at all. It is the preferred method for building combinatorial libraries or complex genetic circuits modullarly.
Modeling in Asimov Kernel
To model this process in a language like Asimov Kernel, fragments would be defined as sequences with BsaI sites in opposite orientations. The code would represent the combination of parts through ligation operators that recognize the Type IIS cleavage sites, simulating the formation of the final circular plasmid without the original restriction sites.
Week 7 — Genetic Circuits Part II: Neuromorphic Circuits
Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)
What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?
IANNs provide a significant advantage over Boolean genetic circuits by enabling analog computation, which allows cells to process a continuous range of signal concentrations rather than simple on/off states. This capability leads to more efficient signal integration, as a single layer can replace complex cascades of logic gates, while offering greater tunability by adjusting molecular weights like promoter strengths without re-engineering the entire system.
Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.
A practical application for an IANN is a smart metabolic regulator for conditions such as diabetes, where inputs like glucose and GLP-1 levels are weighted to produce a graded output of insulin. This mimics natural pancreatic behavior more closely than a binary response, although it faces limitations such as molecular noise and metabolic burden, which can lead to leaky expression or reduced cellular fitness during high computational demands.
Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation.
In the provided diagram, the Csy4 endoribonuclease acts as an inhibitory input that regulates the output of the fluorescent protein by cleaving its mRNA. The system functions as a molecular perceptron where the final fluorescence depends on the balance between the transcription of the reporter gene and the repressive activity of Csy4, effectively performing a subtraction-based calculation within the cell.
Assignment Part 2: Fungal Materials
What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?
Existing fungal materials, often called mycelium-based composites, include sustainable alternatives for packaging, construction bricks, and vegan leather. In packaging, mycelium is grown on agricultural waste to replace polystyrene, while in construction, it is used to create lightweight, fire-resistant insulation panels or acoustic tiles. Compared to traditional materials like plastic or concrete, fungal materials are biodegradable, carbon-negative, and require significantly less energy to produce; however, they currently face disadvantages such as lower structural strength, sensitivity to high humidity, and slower production times compared to synthetic manufacturing.
What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?
Genetically engineering fungi could enable the production of specialized enzymes for plastic degradation or the synthesis of complex pharmaceutical compounds like precursor molecules for antibiotics and anticancer drugs. Fungi are particularly advantageous for synthetic biology because, as eukaryotes, they possess advanced protein folding and post-translational modification machinery that bacteria lack, allowing them to produce functional human-like proteins. Additionally, their robust secretome allows them to export large quantities of products directly into the growth medium, simplifying the purification process compared to the intensive recovery methods often required for bacterial intracellular production.
Week 9 — Cell-Free Systems
PART 1
Question 1: What are the main advantages of cell-free protein synthesis (CFPS) regarding flexibility and control, and name two cases where it is more beneficial?
Since it is an open system, you have direct control over the chemical environment (pH, redox potential, and salts) and can add synthetic components like non-natural amino acids without being restricted by a cell membrane.
Case 1: Production of cytotoxic proteins that would otherwise kill a living host cell.
Case 2: Efficient incorporation of labeled isotopes or synthetic tags for precise protein engineering.
Question 2: What are the main components of a cell-free expression system and what is the role of each?
Cell Extract (Lysate): Provides the biological machinery (ribosomes, tRNAs, and translation factors).
Energy Substrates (NTPs): Supplies the chemical fuel (ATP and GTP) to power the synthesis.
Amino Acids: The building blocks used to assemble the protein chain.
DNA/RNA Template: The genetic instructions for the specific target protein.
Salts and Buffers: Maintain pH and provide essential ions like $Mg^{2+}$ and $K^{+}$ for enzyme activity.
Question 3: Why is energy provision regeneration critical, and what method would you use to ensure a continuous ATP supply?
Protein synthesis is energy-intensive and produces inorganic phosphate as a byproduct, which inhibits the reaction. Without regeneration, the system runs out of fuel and stops within minutes.
Method: The Creatine Phosphate / Creatine Kinase system. The enzyme (kinase) transfers a phosphate group from creatine phosphate back to ADP, continuously recycling it into ATP.
Question 4: Compare prokaryotic versus eukaryotic cell-free systems. Choose a protein for each and explain why.
Prokaryotic (E. coli): Fast, high-yield, and inexpensive. I would choose T7 RNA Polymerase, as it is a bacterial protein that functions perfectly without complex folding or sugars.
Eukaryotic (e.g., Wheat Germ or CHO): Slower and more expensive, but capable of complex folding. I would choose Human Antibodies, because they require specific disulfide bonds and glycosylation (sugars) that only eukaryotic systems can provide.
Question 5: How would you design an experiment for a membrane protein, and what challenges would you address?*
Design: Incorporate artificial hydrophobic environments like nanodiscs or liposomes directly into the reaction mix.
Challenges: Membrane proteins are hydrophobic and tend to aggregate (clump together) in water-based buffers.
Solution: Use co-translational insertion, where the protein embeds itself into the added lipids as it is being synthesized by the ribosome, keeping it stable and functional.
Question 6: Describe three possible reasons for low yield and suggest a troubleshooting strategy for each.
Magnesium Imbalance: Incorrect Mg levels stop ribosome function. Strategy: Perform a magnesium titration to find the optimal concentration.
Template Degradation: Nucleases in the extract break down the DNA/RNA. Strategy: Add RNase inhibitors or protective proteins like GamS to shield the template.
Codon Bias: The gene uses instructions that are “rare” for the extract’s machinery. Strategy: Perform codon optimization on the DNA sequence to match the tRNA abundance of the lysate.
PART 2
Pick a function and describe it
Function: synthetic cell that detects lactate and produces fluorescence
a. What would your synthetic cell do What is the input and what is the output
Input: lactate
Output: GFP
b. Could this function be realized by cell free Tx Tl alone without encapsulation
Partially yes, but with low stability and no environmental control
c. Could this function be realized by genetically modified natural cell
Yes, easier in bacteria such as E coli
d. Describe the desired outcome of your synthetic cell operation
Detect lactate and generate a proportional fluorescence response
Design all components that would need to be part of your synthetic cell
a. What would be the membrane made of
Liposomes made of POPC and cholesterol
b. What would you encapsulate inside Enzymes small molecules
Bacterial Tx Tl system
DNA with lactate responsive circuit
GFP
LDH
NAD+
c. Which organism your Tx Tl system will come from
E coli based system for simplicity
d. How will your synthetic cell communicate with the environment
Passive diffusion of lactate or transport via LldP
Experimental details
a. List all lipids and genes
Lipids
POPC
cholesterol
Genes
lldR
lldPRD
gfp
ldhA
lldP
b. How will you measure the function of your system