Bioindicator for Microplastic Contamination in Agricultural Soils 1. Biological Engineering Application Project Description The proposed biological engineering application is the development of a bioindicator for detecting contamination by microplastics and their chemical additives in agricultural soils.
Part I: Benchling & In-silico Gel Art Begin by importing your DNA sequence and use the Digests tool to test the effects of different restriction enzyme(s). Export your final design as a png and compare with your lab results on your Notion page. See the images below for where to find the Digests tool, selecting the “NEB 2-log” ladder in the Virtual Digest tab, and how to have multiple Digests appear in the same Virtual Digest.
WEEK 3 — LAB AUTOMATION LAB PROTOCOL 1. Review of Materials First, I reviewed the available documentation on HTGAA (LAB–Week 3 – Opentrons Art). Key information required to prepare the design was found at the beginning of the Google Colab notebook, including technical constraints and recommended parameters for droplet spacing and volume.
Part A. Conceptual Questions How many molecules of amino acids do you take with a piece of 500 grams of meat? First, I researched how much protein the meat contains. I assumed it was beef, and I saw that the amount of protein per gram varies depending on the cut (FEN, 2012). I calculated the average protein content of 10 cuts and got 19.46, which I rounded up to 20g per 100g.
Part 1: Generate Binders with PepMLM 1. Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation. Superoxide Dismutase 1 (SOD1) is a human enzyme that plays a critical role in protecting cells from oxidative stress by catalyzing the conversion of superoxide radicals into oxygen and hydrogen peroxide. It is a small, 154-amino-acid protein that typically forms a stable homodimer and contains a β-βarrel core structure with metal cofactors, copper and zinc, essential for its catalytic activity. Mutations in SOD1, such as the A4V variant, are associated with familial amyotrophic lateral sclerosis (ALS), a neurodegenerative disorder. SOD1 is widely expressed in the cytoplasm and is a key model protein for studying protein folding, aggregation, and targeted protein degradation strategies.
Answer these questions about the protocol in this week’s lab:
What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?
Components of the Phusion High-Fidelity PCR Master Mix and their purpose:
Phusion DNA polymerase – a high-fidelity DNA polymerase that synthesizes new DNA strands with a low error rate during PCR. Primers (forward and reverse) – short DNA sequences that bind to the target DNA and define the region that will be amplified. dNTPs (dATP, dTTP, dCTP, dGTP) – the nucleotide building blocks used by the polymerase to synthesize new DNA strands. Reaction buffer – maintains optimal pH and ionic conditions for proper enzyme activity. Mg²⁺ ions – an essential cofactor required for DNA polymerase catalytic activity. Nuclease-free water – maintains the correct reaction volume and prevents degradation of DNA. 2. What are some factors that determine primer annealing temperature during PCR?
Subsections of Homework
Week 1 HW: Principles and Practices
Bioindicator for Microplastic Contamination in Agricultural Soils
1. Biological Engineering Application
Project Description
The proposed biological engineering application is the development of a bioindicator for detecting contamination by microplastics and their chemical additives in agricultural soils.
The system would be based on a genetically modified soil bacterium (Pseudomonas spp.), a microorganism naturally associated with the rhizosphere of crops such as:
Maize
Wheat
Functional Concept
The engineered bacterium would be designed to:
Detect stress caused by microplastic additives present in soil
Produce a visible signal (e.g., fluorescence) when exposed to such stress
This mechanism would enable:
Early detection of soil contamination
Identification of potential stress conditions affecting crops
On-site monitoring without exclusive reliance on laboratory-based analytical methods
Motivation
Microplastics are increasingly present in agricultural soils due to multiple sources, including agricultural plastics and contaminated inputs. Current detection methods are:
Technically complex
Costly
Primarily limited to laboratory analysis
This project aims to provide a field-deployable, cost-effective biological monitoring tool to complement existing analytical techniques.
Governance and Policy Framework
2. Governance and Policy Goals
The following goals guide the evaluation of governance actions for this project.
Biosecurity
Ensure that the use of a genetically modified microorganism as a bioindicator:
Does not create biological or environmental risks
Protects sustainable soil use
Avoids negative impacts on:
Soil ecosystems
Crops (maize and wheat)
Non-target organisms
Ethics
Ensure responsible and socially acceptable use of the technology by:
Protecting local communities
Promoting transparency
Limiting the application strictly to environmental monitoring purposes
Other Considerations
Ensure technical feasibility
Consider economic costs
Minimize burdens to stakeholders
Avoid unnecessary impediments to scientific research
Promote constructive and responsible applications of biotechnology
3. Potential Governance Actions
Action 1: Environmental Biosecurity Assessment Before Field Use
Require a formal environmental biosecurity assessment prior to field application in order to prevent biological and ecological risks in agricultural soils.
Design
Researchers prepare risk assessment protocols
Regulatory or institutional bodies review and approve these protocols before field testing
Assumptions
Laboratory experiments can reasonably predict environmental behavior and associated risks
Risks of Failure
Long-term environmental effects may be underestimated
Risks of Over-Strict Implementation
Excessively strict requirements may slow down research and innovation
Reduce environmental risks by limiting the survival or activity of the bioindicator microorganism outside controlled soil conditions.
Design
Genetic containment mechanisms are incorporated during the design stage
Containment is built directly into the microorganism’s genetic architecture
Assumptions
Containment systems are stable and effective
Containment does not significantly reduce bioindicator performance
Risks of Failure
Containment mechanisms fail under real environmental conditions
Risks of Over-Strict Implementation
Excessive biological control reduces detection sensitivity or functionality
Action 3: Transparency and Engagement with Local Farming Communities
Type: Ethical governance / incentive mechanism
Actors: Researchers, institutions, farmers
Purpose
Ensure ethical and socially responsible implementation by informing and engaging farming communities.
Design
Provide clear information regarding:
Purpose of the bioindicator
Operational limits
Potential risks
Involve local communities in decision-making processes
Assumptions
Transparency increases trust and acceptance
Risks of Failure
Communication efforts may increase resistance or skepticism
Risks of Over-Expansion
Broad acceptance may lead to unintended or expanded uses
4. Comparative Evaluation
Scoring table to be inserted here.
5. Prioritization of Governance Options
Based on the updated scoring table, a combination of:
Option 1: Environmental Biosecurity Assessment
Option 2: Biological Containment Strategies
is prioritized.
These two options are essential for project viability because they:
Perform best in preventing biological and environmental risks
Address core ethical concerns
Strengthen long-term sustainability of the project
Although they may involve:
Higher implementation costs
Potential constraints on research
they are necessary to ensure responsible development.
Role of Option 3
Transparency and community engagement is not critical for technical viability but is essential for:
Implementation in agricultural plantations
Social acceptance
Practical adoption by farming communities
Conclusion
A combined governance strategy integrating:
Environmental biosecurity assessment
Technical biological containment
Community engagement
provides a balanced framework that aligns:
Safety
Technical feasibility
Ethical responsibility
Real-world applicability
This integrated approach supports the responsible development and deployment of a bioindicator system for detecting microplastic contamination in agricultural soils.
DNA Replication, Oligo Synthesis, and Molecular Coding Concepts
Questions from Professor Jacobson
DNA Replication Fidelity
Nature’s machinery for copying DNA is DNA polymerase, which has an error rate of approximately 1 in 10⁶ bases due to its 3’→5’ proofreading activity.
The human genome contains approximately 3 × 10⁹ base pairs. In principle, thousands of errors could occur during a complete replication cycle. However, biological systems maintain genomic integrity through multiple mechanisms:
Polymerase proofreading activity
Exonuclease activity
DNA repair systems
Together, these systems significantly reduce the final mutation rate and preserve genomic stability.
Combinatorial Space of Protein-Coding Sequences
There are approximately:
3400 ≈ 10190
possible DNA sequences that could encode an average human protein of 400 amino acids.
In practice, many theoretical sequences are non-functional due to constraints such as:
Codon bias
mRNA secondary structure
Translation efficiency
Splicing motifs
Protein folding constraints
Thus, while the combinatorial sequence space is vast, functional protein-coding sequences represent only a small subset.
Questions from Dr. LeProust
Most Common Method for Oligonucleotide Synthesis
The most widely used method for oligonucleotide synthesis today is solid-phase phosphoramidite chemistry (β-cyanoethyl phosphoramidite method).
It is performed on a solid support, typically:
Controlled pore glass (CPG)
Functionalized silica
The synthesis proceeds through iterative cycles consisting of:
Coupling of a protected nucleoside phosphoramidite
Capping of unreacted hydroxyl groups
Oxidation of the phosphite triester to a phosphate triester
DMT deprotection (deblocking)
This cycle is repeated until the desired sequence is assembled.
Why Is It Difficult to Synthesize Oligos Longer Than 200 nt?
Direct solid-phase phosphoramidite synthesis becomes inefficient beyond approximately 200 nucleotides because each nucleotide addition step is not 100% efficient (typically ~99%).
Since each step introduces a small loss, the total yield decreases exponentially with length.
Consequences include:
Cumulative yield loss
Exponential decrease in full-length product
Accumulation of truncated sequences
Increased error rates
Increasingly difficult purification
For example, if each step is 99% efficient, after 200 cycles the overall yield is:
0.99^200
which results in a significant reduction of full-length product.
Therefore, beyond ~200 nt, direct synthesis becomes impractical.
Why Can’t a 2000 bp Gene Be Synthesized Directly?
A 2000 base pair gene cannot be synthesized by direct chemical oligo synthesis for the same reason: cumulative inefficiency.
Attempting 2000 consecutive synthesis cycles would result in:
Negligible yield of full-length product
Massive accumulation of truncated fragments
Increased error rates
Impractical purification requirements
Instead, modern approaches use high-throughput synthesis platforms (e.g., chip-based synthesis) to generate millions of short oligos (typically 60–150 nt) in parallel.
These shorter fragments are then assembled into full-length genes using methods such as:
PCR-based assembly
Gibson Assembly
This strategy allows efficient production of thousands of genes (e.g., 9,600 genes) in a practical and scalable way.
In summary, long genes are not synthesized directly because the chemistry is inefficient at that scale; instead, short oligos are synthesized and subsequently assembled.
Question from George Church
The 10 Essential Amino Acids in Animals
The essential amino acids (those that must be obtained from the diet) are:
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Threonine
Tryptophan
Valine
Arginine (essential in many animals, especially during growth)
Implications for the “Lysine Contingency”
Since lysine is already an essential amino acid in animals, metabolic dependence on lysine is not unusual in biology.
However, lysine is naturally present in many environments and food sources. Therefore, engineering an organism to depend on lysine as a containment strategy may offer limited biosafety, because the amino acid is neither rare nor synthetic.
This suggests that dependence on a non-natural amino acid would provide a stronger and more reliable biocontainment strategy.
Coding Relationships in Molecular Biology
AA:NA and NA:NA Codes
Definitions:
AA = Amino Acid
NA = Nucleic Acid (DNA or RNA)
NA:NA Code
Refers to nucleic acid–nucleic acid interactions, specifically base-pairing rules:
A–T (or A–U in RNA)
C–G
These interactions govern information storage and replication.
AA:NA Code
Refers to the genetic code: codons in nucleic acids specify which amino acids are incorporated into a protein.
This code determines the primary amino acid sequence of a protein.
What Code Would Describe AA:AA Interactions?
AA:AA interactions refer to amino acid–amino acid interactions within or between proteins, including:
Hydrophobic interactions
Hydrogen bonds
Ionic interactions
Disulfide bonds
Van der Waals forces
These interactions determine:
Protein folding
Three-dimensional structure
Stability
Function
Unlike the genetic code, AA:AA interactions do not form a simple digital code. Instead, they constitute a physicochemical interaction framework that governs protein behavior.
Using the AA:NA code, we can program the amino acid sequence of a protein. By selecting specific amino acids, we indirectly influence AA:AA interactions and therefore affect structure and function.
However, although protein sequences can be designed, final folding and functionality cannot always be predicted with certainty. Protein structure emerges from complex and sometimes unpredict
Week 2 HW: DNA read, write and edit.
Part I: Benchling & In-silico Gel Art
Begin by importing your DNA sequence and use the Digests tool to test the effects of different restriction enzyme(s). Export your final design as a png and compare with your lab results on your Notion page. See the images below for where to find the Digests tool, selecting the “NEB 2-log” ladder in the Virtual Digest tab, and how to have multiple Digests appear in the same Virtual Digest.
First, I accessed the website https://rcdonovan.com/gel-art to create a sketch of a design based on the lambda sequence with the restriction enzymes:
EcoRI-HF
HindIII-HF
BamHI-HF
KpnI-HF
EcoRV-HF
SacI-HF
SalI-HF
which I will perform using electrophoresis.
After experimenting a bit with the enzymes and the sequence (lamba), I decided to draw a “sunrise in a forest”.
Electrophoresis Overview
Now let’s talk about electrophoresis. We start with a lambda DNA sequence of about 48,500 base pairs, which we cut into fragments of different sizes using restriction enzymes. During electrophoresis, these DNA fragments are loaded into wells in a gel and move through the gel when an electric current is applied.
The fragments travel at different speeds depending on their size:
Larger fragments move more slowly and cover shorter distances.
Smaller fragments move faster and travel farther.
By observing how far each fragment migrates, we can use this pattern to create the chosen visual representation.
Benchling Simulation
Let’s go to the Benchling website (benchling.com) to perform a simulation of electrophoresis using the proposed DNA sequence and restriction enzymes.
To find the GenBank accession number for the lambda sequence (#lambda sequence in GenBank) Google was used. The number for the lambda is J02459.1. Clicking on this number opens the sequence directly on the NIH website (National Library of Medicine, www.ncbi.nlm.nih.gov).
Click on FASTA, which will open a menu. Adjust the options as needed and then download the file.
Next, select the scissors icon on the right to perform a digest in the Virtual Digest tab. Choose the restriction enzyme and carry out the digestion.
The program will simulate electrophoresis for each well, showing how far the DNA fragments migrate. After performing digests for all nine wells and selecting the ladder, we will have the final proposed image.
Figure: “Sunrise in the Forest,” sun and tree in the image.
Part 2: Gel Art – Restriction digests and Gel Electrophoresis
Since I did not have access to the lab to perform this section, I will use a gel that I ran a couple of weeks ago:
To determine the fragment size, we compared it with the ladder (on the right of each row).
The fragments are approximately 6,000 bp.
The color does not determine the size. The DNA size is determined by the migration distance.
The bands are clear and well separated, indicating that the electrophoresis ran correctly. No lateral smearing or fuzzy bands are observed, which indicates good gel preparation and proper sample loading.
Part 3: DNA Design Challenge (proposal)
Promoter
Name: BBa_J23106
Type: Constitutive promoter
Sequence: TTTACGGCTAGCTCAGTCCTAGGTATAGTGCTAGC
Purpose: Initiates transcription of PETase.
RBS + Start codon (ATG)
Sequence: CATTAAAGAGGAGAAAGGTACCATG
Purpose: Ribosome binding and translation initiation.
CDS (PETase codon-optimized)
Name: PETase_Ecoli_codopt
Sequence: tu secuencia codón optimizada aquí
Purpose: Codes for the PETase protein.
His-tag (optional, C-terminal)
Sequence: CATCACCATCACCATCATCAC (7x His)
Purpose: Facilitates purification of PETase protein.
Purpose: Selects for bacteria that successfully took up the plasmid.
3.1. Choose your protein
PETase Protein Sequence Description
The protein used in this project is a Poly(ethylene terephthalate) hydrolase (PETase), also known as a PET-digesting enzyme. Its UniProt/NCBI accession is A0A0K8P6T7.1, and it is composed of 290 amino acids. This enzyme originates from a bacterial source and is capable of degrading PET plastic. The sequence has been selected for codon optimization to enable expression in E. coli, facilitating the construction of a plasmid-based reporter system for PET detection in environmental samples.
The PETase protein sequence was obtained from Ideonella sakaiensis by accessing the NIH protein database at https://www.ncbi.nlm.nih.gov/protein and searching for “PETase Ideonella sakaiensis.” The protein entry was selected, and the sequence was downloaded in FASTA format for subsequent use in codon optimization and plasmid design.
I downloaded the Proteine.fasta file.
I examined and visualized the features of the protein sequence through the Features option.
3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence.
There are available on internet many Codon Optimization Tools (IDT, Twist, GenScript, VectorBuilder, ExoOptimizer…). I chose Vector Builder Optimization Tool (https://en.vectorbuilder.com/).
The PETase protein sequence in FASTA format was used as input in VectorBuilder → Tools → Codon Optimization Tool.
The amino acid sequence was pasted directly, specifying it as a protein sequence.
The host organism was set to Escherichia coli str. K-12 substr. MG1655 to optimize codon usage for this bacterium.
No restriction sites were selected for avoidance.
The tool generated a DNA sequence (ORF) optimized for E. coli, with the following characteristics:
Start codon: ATG and stop codon: TAA
GC content: 59.56%
Codon Adaptation Index (CAI): 0.91, indicating high compatibility with E. coli codon preferences.
The sequence is now ready to be inserted into a plasmid for expression.
3.3. Codon optimization
NOTE: Codon optimization was performed along with reverse translation in Vector Builder (VB).
Codon optimization is important because, although multiple codons can encode the same amino acid, different organisms preferentially use certain codons more frequently. If a gene contains codons that are rarely used in the host organism, translation may be inefficient, leading to low protein expression. By optimizing codon usage, I adapted the PETase nucleotide sequence to match the codon preferences of Escherichia coli without altering the amino acid sequence of the protein.
For this project, I chose to optimize PETase for E. coli because it is the host organism where the plasmid will be expressed. Since PETase originates from the bacterium Ideonella sakaiensis, its native codon usage differs from E. coli, so codon optimization improves expression efficiency in this bacterial system.
During codon optimization, restriction enzyme recognition sites were not specifically removed because the sequence was designed for direct insertion into a plasmid using VectorBuilder, which ensures proper assembly. This optimization ensures high expression of PETase in E. coli, allowing the gene to be used effectively in the plasmid as a biological marker for PET detection.
3.4. You have a sequence! Now what?
Producing PETase from the Optimized DNA Sequence
The codon-optimized DNA sequence for PETase can be expressed using cell-dependent or cell-free systems. In bacteria like E. coli, the plasmid is transcribed into mRNA and translated into protein using the host’s machinery, with selection markers ensuring only cells carrying the gene produce PETase. Alternatively, cell-free systems synthesize protein directly from the DNA in vitro, without living cells. Optimizing the sequence ensures efficient production of functional PETase for detection or degradation of PET.
Both approaches follow the central dogma of molecular biology: DNA is transcribed into RNA, and RNA is translated into protein.
3.5. How does it work in nature/biological systems?
In nature, a single gene contains the information to produce a protein through transcription and translation. The DNA sequence of the PETase gene is first transcribed into messenger RNA (mRNA) by RNA polymerase. This mRNA serves as a template for translation, where ribosomes read the codons in sets of three nucleotides to assemble the corresponding amino acids into the PETase protein.
Some genes in biological systems can produce multiple protein variants through mechanisms such as alternative splicing, alternative start codons, or frameshifting, but PETase from Ideonella sakaiensis produces a single functional protein. Aligning the DNA sequence, the transcribed RNA, and the translated protein shows how the nucleotide code directly determines the amino acid sequence of the protein.
The construct of DNA and amino acids was obtained in genscript during codon optimization sequence.
Part 4: Prepare a Twist DNA Synthesis Order
4.1. Create a Twist account and a Benchling account
Twist account doesn’t work properly. I used instead VectorBuilder and Benchling.
4.2. Build a DNA Insert Sequence
I adjusted the proposed PETase plasmid design as follows:
I have to build a DNA Insert Sequence from the Promoter to Terminator, which includes:
Promoter – initiates transcription of the gene. (BBa_J23106 TTTACGGCTAGCTCAGTCCTAGGTATAGTGCTAGC, position 1-35).
Ribosome Binding Site (RBS / Kozak) – ensures proper ribosome binding and efficient translation initiation (Kozak GCCACC, position 36 - 41).
Start Codon – defines the beginning of the open reading frame (ATG, position 42 - 44).
Codon-Optimized PETase ORF – DNA codon optimized sequence for E. coli (position 45 – 911).
Antibiotic Selection Marker – allows selection of bacteria carrying the plasmid (Ampicillin - Amp).
In Benchling, I create a new sequence that will result from sequentially adding the DNA corresponding to the proposed plasmid design. I am going to create an insert sequence manually.
Then click Sequence Map and right-clicking over the sequence and create an annotation that describes what each piece is.
This final construct ensures correct transcription and translation of the PETase protein in a bacterial expression system.
And I also exported and downloaded the FASTA file for the constructed sequence.
The final assembled sequence is known as an expression cassette, since it contains all the necessary elements required for gene expression, including the promoter, RBS, start codon, codon-optimized PETase coding sequence, stop codon, and terminator.
This cassette can be downloaded as a FASTA file for further use. In this format, the sequence can be shared, analyzed, synthesized, or inserted into a circular plasmid backbone for recombinant protein expression.
To visualize the genetic construct, I used SBOL Canvas to create a standardized graphical representation of the expression cassette. This tool allowed me to clearly illustrate the organization of the promoter, Kozak, codon-optimized PETase coding sequence, and terminator.
4.3. VectorBuilder: clonal genes
For this project, we selected the Clonal Genes option because it provides the DNA sequence cloned into a circular plasmid backbone, ready for direct transformation into E. coli.
Since the Twist Bioscience platform was not accessible, VectorBuilder was used as an alternative to insert the cassette into the plasmid backbone. VectorBuilder offers similar gene synthesis and cloning services, enabling the insertion of the designed expression cassette into a suitable plasmid backbone for bacterial expression. This alternative allows the construct to be obtained as a ready-to-use circular DNA molecule for transformation and recombinant protein production.
Open VectorBuilder and choose a vector system. I have chosen a bacterial recombinant protein expression vector (pET Guide) to express the codon-optimized PETase in E. coli. This vector is designed for high-level protein expression, supports all essential elements (RBS, start/stop codons, optional tags, terminator), and allows selection with antibiotics. It provides a reliable and compatible platform for producing functional PETase in a bacterial system.
I did not add a purification tag to PETase at this stage. The protein can be expressed and remain functional without a tag, and a His-tag or other purification tag can be added later if needed for downstream applications.
The vector was constructed using the codon-optimized DNA of the PETase protein. A PDF file (VB260217-1695xnz(pET-{PETase ORF codon_optimized_DNA}).pdf) containing the vector information, a FASTA file (VB260217-1695xnz.fasta) with the sequence, and a GenBank file (VB260217-1695xnz.gb) were downloaded.
When selecting the vector, I was presented with other options that I was not certain were appropriate. It is important to carefully evaluate which vector is most suitable and to ensure that the promoters, RBS, and other components included in the vector are optimal for this project.
Part 5: DNA Read/Write/Edit
5.1 DNA Read
1. What DNA would you want to sequence (e.g., read) and why?
I would sequence DNA from soil or ocean microbial communities exposed to plastic, aiming to detect PETase or other plastic-degrading genes. This would help identify microorganisms capable of PET degradation and monitor their presence in different environments. I would also like to work with fossil and isolated organisms to better understand evolutionary adaptations.
2. What technology or technologies would you use to perform sequencing on your DNA and why?
I would use Illumina NGS (Sequencing by Synthesis) for high-throughput and accurate sequencing of environmental DNA. For fossil or isolated organisms, long-read technologies such as PacBio or Oxford Nanopore could help assemble fragmented genomes.
3. Is your method first-, second-, or third-generation? How so?
For detecting PETase genes in environmental samples, Illumina NGS (Sequencing by Synthesis) is a second-generation sequencing method. It produces millions of short reads in parallel, offering high throughput and accuracy, which makes it ideal for complex samples.
4. What is your input? How do you prepare your input? List the essential steps.
The input for Illumina NGS is extracted DNA from environmental samples. Essential steps include: DNA extraction, fragmentation, adapter ligation, PCR amplification, and library quality control before sequencing.
5. What are the essential steps of your chosen sequencing technology? How does it decode the bases of your DNA sample (base calling)?
Illumina Sequencing by Synthesis works by attaching DNA fragments to a flow cell and amplifying them into clusters. Fluorescently labeled nucleotides are incorporated one at a time, and after each incorporation, imaging captures the fluorescent signal. Software then converts these signals into base calls (A, C, G, T).
6. What is the output of your chosen sequencing technology?
The output of Illumina sequencing consists of FASTQ files containing DNA sequences and quality scores for each base. If multiple samples are sequenced together, the reads are demultiplexed using sample-specific barcodes.
5.2 DNA Write
1. What DNA would you want to synthesize and why?
I would like to synthesize genes related to environmental rescue, including biosensors, plastic-degrading enzymes such as PETase, and biomaterials for bioremediation. These DNA constructs could allow microorganisms to detect pollutants, respond to environmental stimuli, and break down plastics. For example, I would include the codon-optimized PETase ORF used in this project, along with regulatory elements such as promoters and ribosome binding sites to ensure proper expression.
2. What technology or technologies would you use to perform this DNA synthesis and why?
I would use commercial DNA synthesis platforms such as Twist Bioscience or enzymatic DNA synthesis methods. These technologies allow accurate and customizable synthesis of codon-optimized genes and regulatory elements. They are scalable, precise, and suitable for constructing complete expression cassettes.
3. What are the essential steps of your chosen DNA synthesis method?
DNA synthesis typically involves chemically or enzymatically synthesizing short oligonucleotides, assembling them into longer DNA fragments, verifying sequence accuracy, and cloning the construct into a plasmid backbone. The final product is then amplified and quality-checked before delivery.
4. What are the limitations of your DNA synthesis method in terms of speed, accuracy, and scalability?
DNA synthesis can be limited by cost, turnaround time, and potential synthesis errors in long or repetitive sequences. Very large constructs may require hierarchical assembly, increasing complexity and time.
5.3 DNA Edit
1. What DNA would you want to edit and why?
I would like to edit genes in microorganisms involved in plastic degradation to improve their efficiency, stability, or environmental tolerance. For example, I could modify the PETase gene to enhance its catalytic activity or thermal stability. Editing could also be applied to environmental bacteria to optimize metabolic pathways for bioremediation.
2. What technology or technologies would you use to perform these DNA edits and why?
I would use CRISPR-Cas systems because they provide precise, efficient, and versatile genome editing capabilities in microorganisms, plants, or animals.
3. How does your technology of choice edit DNA? What are the essential steps?
CRISPR-Cas uses a guide RNA designed to match a specific DNA target sequence. The Cas nuclease creates a double-strand break at the target site. The cell then repairs the break either through non-homologous end joining (causing insertions or deletions) or homology-directed repair (allowing precise edits if a repair template is provided).
4. What are the limitations of your editing method in terms of efficiency or precision?
CRISPR-Cas systems may have off-target effects, variable editing efficiency depending on the target sequence, and limitations related to delivery methods. Additionally, precise edits require efficient homology-directed repair, which may not occur at high frequency in all cell types.
Week 03 HM: Lab Automation
WEEK 3 — LAB AUTOMATION
LAB PROTOCOL
1. Review of Materials
First, I reviewed the available documentation on HTGAA (LAB–Week 3 – Opentrons Art). Key information required to prepare the design was found at the beginning of the Google Colab notebook, including technical constraints and recommended parameters for droplet spacing and volume.
Initially, I had questions regarding droplet size and spacing between points. The Google Colab notebook provided specific recommendations about these parameters.
Before identifying these constraints, I created a version of the design with reduced spacing between points to increase visual detail. However, decreasing the distance between droplets increased the risk of unintended merging during robotic dispensing. After reviewing the guidelines, I adjusted the design to comply with the recommended 3.5 mm spacing.
From Donovan’s platform, I downloaded the coordinate sets to be used in the Python script. The coordinates were grouped by color and already respected the recommended spacing (3.5 mm).
I opened the HTGAA26 Opentrons Colab notebook and created a personal copy to develop the script. The notebook included reference examples from previous students, which were useful to understand how coordinate-based dispensing is implemented on agar plates.
The script was written in Python. Since the laboratory setup provides only two available colors for execution, I adapted the original design as follows:
Blue droplets were replaced with green droplets using a higher volume (1.2 µL) to create visual differentiation.
Green and red droplets were dispensed using the recommended volume (1 µL).
After completing the implementation, I executed the simulation within the Opentrons environment. The simulation ran successfully without errors, confirming correct tip usage, aspiration logic, and coordinate positioning.
Here, the script:
fromopentronsimporttypesmetadata={# see https://docs.opentrons.com/v2/tutorial.html#tutorial-metadata'author':'María José Pérez Crespo','protocolName':'Peacock','description':'Print peacock using two colors: green 1 µL, green 1.2 µL, red 1 µL','source':'HTGAA 2026 Opentrons Lab','apiLevel':'2.20'}################################################################################# Robot deck setup constants - don't change these##############################################################################TIP_RACK_DECK_SLOT=9COLORS_DECK_SLOT=6AGAR_DECK_SLOT=5PIPETTE_STARTING_TIP_WELL='A1'well_colors={'A1':'Red','B1':'Green','C1':'Orange'}defrun(protocol):################################################################################# Load labware, modules and pipettes##############################################################################tips_20ul=protocol.load_labware('opentrons_96_tiprack_20ul',TIP_RACK_DECK_SLOT,'Opentrons 20uL Tips')pipette_20ul=protocol.load_instrument("p20_single_gen2","right",[tips_20ul])temperature_module=protocol.load_module('temperature module gen2',COLORS_DECK_SLOT)temperature_plate=temperature_module.load_labware('opentrons_96_aluminumblock_generic_pcr_strip_200ul','Cold Plate')color_plate=temperature_plateagar_plate=protocol.load_labware('htgaa_agar_plate',AGAR_DECK_SLOT,'Agar Plate')center_location=agar_plate['A1'].top()pipette_20ul.starting_tip=tips_20ul.well(PIPETTE_STARTING_TIP_WELL)################################################################################# Helper functions##############################################################################deflocation_of_color(color_string):forwell,colorinwell_colors.items():ifcolor.lower()==color_string.lower():returncolor_plate[well]raiseValueError(f"No well found with color {color_string}")defdispense_and_detach(pipette,volume,location):assertisinstance(volume,(int,float))above_location=location.move(types.Point(z=location.point.z+5))pipette.move_to(above_location)pipette.dispense(volume,location)pipette.move_to(above_location)################################################################################# Definición de puntos##############################################################################sfgfp_points=[...]mrfp1_points=[...]electra2_points=[...]################################################################################# Centrado automático##############################################################################all_points=sfgfp_points+mrfp1_points+electra2_pointsmin_x=min(pt[0]forptinall_points)max_x=max(pt[0]forptinall_points)min_y=min(pt[1]forptinall_points)max_y=max(pt[1]forptinall_points)offset_x=(min_x+max_x)/2offset_y=(min_y+max_y)/2################################################################################# Green (new tip)##############################################################################green_loc=location_of_color('Green')pipette_20ul.pick_up_tip()remaining_total_volume=len(sfgfp_points)*1.0+len(electra2_points)*1.2current_volume=0.0forpointinsfgfp_points:vol=1.0ifcurrent_volume<vol:load_volume=min(20,remaining_total_volume)pipette_20ul.aspirate(load_volume,green_loc)current_volume+=load_volumex_mm,y_mm=pointtarget_location=center_location.move(types.Point(x=x_mm-offset_x,y=y_mm-offset_y,z=0))dispense_and_detach(pipette_20ul,vol,target_location)current_volume-=volremaining_total_volume-=volforpointinelectra2_points:vol=1.2ifcurrent_volume<vol:load_volume=min(20,remaining_total_volume)pipette_20ul.aspirate(load_volume,green_loc)current_volume+=load_volumex_mm,y_mm=pointtarget_location=center_location.move(types.Point(x=x_mm-offset_x,y=y_mm-offset_y,z=0))dispense_and_detach(pipette_20ul,vol,target_location)current_volume-=volremaining_total_volume-=volpipette_20ul.drop_tip()################################################################################# RED color (New tip)##############################################################################red_loc=location_of_color('Red')pipette_20ul.pick_up_tip()remaining_total_volume=len(mrfp1_points)*1.0current_volume=0.0forpointinmrfp1_points:vol=1.0ifcurrent_volume<vol:load_volume=min(20,remaining_total_volume)pipette_20ul.aspirate(load_volume,red_loc)current_volume+=load_volumex_mm,y_mm=pointtarget_location=center_location.move(types.Point(x=x_mm-offset_x,y=y_mm-offset_y,z=0))dispense_and_detach(pipette_20ul,vol,target_location)current_volume-=volremaining_total_volume-=volpipette_20ul.drop_tip()
4. Use of AI Assistance
Artificial intelligence tools were used to support the development of the script. The structural logic was based on the provided examples in the notebook, particularly Example 7 (Microbial Earth), which clarified how grouped coordinates are iterated and dispensed.
AI assistance was used to:
Understand how the center of the agar plate is calculated.
Implement coordinate offset correction for centering the design.
Structure iteration over grouped coordinate lists while maintaining color consistency.
Optimize aspiration logic to minimize reagent waste.
The questions posed were specific and focused on resolving implementation details. AI was used as a technical support tool rather than as a replacement for understanding the protocol logic.
5. Robot Scheduling
I already booked Friday 27, 14.00
6. Submission
The Python script was submitted via the corresponding Google Form, and confirmation was received.
1. Find and describe a published paper that utilizes the Opentrons or an automation tool to achieve novel biological applications.**
Rosini, E., Battaglia, C., Miani, D., Molinari, F., Arrigoni, F., Piarulli, U., … & Pollegioni, L. (2025). Valuable compounds from pollutants: converting PET into enantiopure alanine. ACS Catalysis, 15(21), 17829-17843.
In this study, the Opentrons OT‑2 automated pipetting system was used to perform a colorimetric assay with PSP dye, which allows measurement of pH changes associated with terephthalic acid (TPA) production during PET depolymerization. The automation enabled rapid and consistent processing of multiple samples, improving the reproducibility and efficiency of depolymerizing enzyme screening. The Opentrons OT‑2 automated system was used to perform a PSP colorimetric assay to measure pH changes during PET depolymerization, enabling fast and reproducible enzyme screening.
2. Write a description about what you intend to do with automation tools for your final project. You may include example pseudocode, Python scripts, 3D printed holders, a plan for how to use Ginkgo Nebula, and more. You may reference this week’s recitation slide deck for lab automation details.**
No ideas already.
FINAL PROJECT IDEAS
No ideas already.
Week 4 HW: Protein Design - Part I
Part A. Conceptual Questions
How many molecules of amino acids do you take with a piece of 500 grams of meat?
First, I researched how much protein the meat contains. I assumed it was beef, and I saw that the amount of protein per gram varies depending on the cut (FEN, 2012). I calculated the average protein content of 10 cuts and got 19.46, which I rounded up to 20g per 100g.
A Dalton is a unit of molecular mass defined as 1 atomic mass unit (amu) which is 1/12 of the mass of a carbon 12 atom. 1 Da ~ mass of a single proton or neutron ~ 1.66 x 10-24 g 100 Da ~ 1.66 x 10-22 g/molecule, that is the mass of a single amino acid (one molecule).
g of Protein to moles of amino acids: Moles of amino acids = (mass of protein (g)) / (average molecular weight per amino acid (g/mol)) (100 g)/(100 g/mol) = 1 mol of amino acid
If (Avogadro’s number): 1 mol = 6.022 x 10²³ molecules/mol
How many molecules in 1 mol of amino acids? ~6.022 x 10²³ molecules/mol in 500 gr of meat
Why do humans eat beef but do not become a cow, eat fish but do not become fish?
Because humans eat food that provides molecules, not organisms.
Why are there only 20 natural amino acids?
The 20 amino acids were selected based on availability, stability, and functional suitability, forming an optimal set for building proteins in early life.
There are only 20 standard amino acids because evolution selected a set that is chemically diverse, structurally compatible with protein folding, and efficient for accurate translation.
While more amino acids were likely available early on, translation precision and the need to reduce errors led to the retention of this specific set (Weber & Miller, 1981).
Can you make other non-natural amino acids? Design some new amino acids.
Non-natural amino acids can be designed to introduce new functions, enhance stability, or modify reactivity in proteins, extending the capabilities of the 20 standard amino acids; however, their exact behavior cannot be fully predicted, so careful experimental validation and optimization are essential.
To design a new amino acid, I need to study their structure, variability, and functionality:
Structure and Function of α-Amino Acids
Central α-carbon – the connection point for all groups.
Amino group (-NH₂) – basic, participates in peptide bonds.
Carboxyl group (-COOH) – acidic, participates in peptide bonds and proton transfer.
Hydrogen (H) – small, allows proper folding.
Side chain (R group) – variable, determines the chemical properties and identity of the amino acid.
Variability:
Differences between amino acids come from the R group, which can be hydrophobic, polar, acidic, basic, aromatic, or special (like cysteine or proline).
Functional roles:
Amino and carboxyl groups: form the peptide backbone and participate in chemical reactions. Side chain (R): defines polarity, charge, reactivity, hydrogen bonding, and overall protein structure.
Design implications:
Synthetic amino acids often modify the side chain (R) to introduce new chemical functions, stability, or reactivity. Modifying the α-carbon, amino, or carboxyl groups is less common but can create β-amino acids or N-alkylated variants, affecting folding.
The side chain (R group) determines the identity and functionality of an amino acid, while the backbone maintains the ability to form proteins. Synthetic modifications expand the chemical possibilities beyond the 20 standard amino acids.
Now, I do not know much about synthetic modifications, and I asked chatgpt3:
I also found an interesting article about synthetic amino acids (Rovner et al., 2015), it shows a practical application of synthetic amino acids (sAAs) to control and engineer organisms, and expand protein functionality beyond natural amino acids:
Incorporation into essential proteins: TAG codons in key genes are reassigned to encode synthetic amino acids, demonstrating how sAAs can be integrated into functional proteins.
Biocontainment: Cells only grow when supplied with sAAs, highlighting their use for controlling viability.
Expanding the genetic code: sAAs enable new chemical functionalities beyond the 20 natural amino acids, allowing the design of synthetic proteins with novel properties.
Where did amino acids come from before enzymes that make them, and before life started?
Before enzymes and living cells existed, amino acids likely formed through natural (abiotic) chemical processes. The main explanations are:
Chemical reactions on early times of earth: On the early Earth, simple atmospheric gases such as carbon dioxide (CO₂), methane (CH₄), ammonia (NH₃), and hydrogen (H₂) are thought to have reacted under energy sources such as lightning, volcanic heat, and ultraviolet radiation. Within the theoretical framework proposed by Alexander Oparin (1924, 1938) and J. B. S. Haldane (1929), these energy-driven processes could have led to the abiotic formation of organic molecules in the primitive oceans. Experimental support for this hypothesis was later provided by the Miller-Urey experiment (Miller, 1953), which demonstrated that amino acids can form under simulated early Earth conditions.
Formation in space and delivery to Earth: Amino acids may also have formed in space and later been delivered to Earth through cometary and asteroidal material. Organic compounds, including amino acids and phosphorus-bearing molecules, have been detected in the coma of comet 67P/Churyumov-Gerasimenko, demonstrating that prebiotic chemicals can form in extraterrestrial environments (Altwegg et al., 2016). More recently, samples from the asteroid Bennu returned by NASA’s OSIRIS-REx revealed amino acids whose isotopic signatures suggest formation in very cold primordial ices in the early Solar System (Baczynski et al 2026).
Mineral catalysis before enzymes: Mineral surfaces and metal ions on the early Earth could have acted as simple catalysts before enzymes evolved, helping to concentrate organic compounds and promote chemical reactions. Recent reviews show that clays and other minerals can adsorb and organize amino acids and other prebiotic molecules, enhancing their reactions and potentially aiding the formation of more complex organics such as peptides and polymers in prebiotic settings (Nogal et al., 2023).
Overall, evidence suggests that amino acids existed before life began, and enzymes evolved later to make these natural chemical processes more efficient.
If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
If an α-helix is constructed entirely from D-amino acids, the helix will adopt the opposite handedness compared to a helix made of L-amino acids and its functionality can change. Natural proteins are composed of L-amino acids, which form right-handed α-helices. D-amino acids are mirror images of L-amino acids, so when used exclusively, they form left-handed α-helices.
If an α-helix composed of L-amino acids is instead made entirely from D-amino acids, the helix becomes a mirror image of the original. L-amino acids form right-handed helices, while D-amino acids form left-handed helices, flipping the spatial arrangement of the side chains. Because many biological interactions are chiral-specific, this mirror-image helix often cannot interact with the same enzymes, receptors, or partners as the L-helix. As a result, the functionality can change: D-amino acid helices may be more resistant to proteases, more stable, or have novel chemical properties, but they rarely replicate the exact biological activity of the original L-amino acid helix.
Can you discover additional helices in proteins?
Yes. Additional helices in proteins can be discovered through experimental methods such as X-ray crystallography, NMR, or cryo-EM, as well as through computational structure prediction tools like AlphaFold. In some cases, regions that appear disordered can form helices only under certain conditions (for example, upon binding to another molecule), revealing previously unrecognized helical structures.
Why are most molecular helices right-handed?
Most molecular helices are right-handed because biological molecules are chiral. Proteins are built from L-amino acids, whose geometry makes the right-handed α-helix energetically more stable and free of steric clashes than the left-handed version. Similarly, the stereochemistry of sugars in DNA favors a right-handed double helix (B-DNA).
Why do β-sheets tend to aggregate?
β-sheets tend to aggregate because their structure allows strong interactions between different protein molecules.
When a protein is partially unfolded, its backbone and hydrophobic regions become exposed. β-strands from different molecules can then align side by side and form many hydrogen bonds. Their flat and extended shape also allows close packing (antiparallel, parallel, mixed).
These interactions lower the overall free energy of the system, making the aggregated β-sheet structure (such as amyloid fibrils) more stable than the unfolded proteins (Maury 2015).
Protein aggregation does not mean the native protein becomes more stable. Instead, when a protein is partially unfolded or misfolded, it is in a relatively unstable, high-energy state. Exposed hydrophobic regions and backbone groups can then form intermolecular interactions (hydrophobic contacts and hydrogen bonds, often arranged as β-sheets).
These interactions lower the overall free energy of the system, making the aggregated state more thermodynamically stable than the unfolded monomers — even though it is not the functional native state.
So, aggregation is energetically favorable compared to the misfolded/unfolded state, but it represents a loss of normal protein function.
In short, β-sheets are fundamental structural elements of proteins that are crucial for both normal function and, in some cases, disease.
What is the driving force for β-sheet aggregation?
The main driving force for β-sheet aggregation is the formation of extensive intermolecular hydrogen bonds combined with hydrophobic interactions, which lower the free energy of the system and stabilize the cross-β fibril structure.
Why do many amyloid diseases form β-sheets?
Many amyloid diseases form β-sheets because misfolded proteins can easily reorganize into this structure.
An amyloid disease is a disorder in which certain proteins misfold and form abnormal aggregates called amyloid fibrils. The proteins misfold, and form structures rich in β-sheets, sticking together and building stable fibrils. These fibrils accumulate in tissues or organs and interfere with normal cell function. Over time, this accumulation can damage cells and lead to disease (Alzheimer’s disease - amyloid-β plaques in the brain; Parkinson’s disease - α-synuclein aggregates, Systemic amyloidosis - amyloid deposits in organs like the heart or kidneys (Monsellier & Chiti 2007, Cheng et al 2012, Bolshette et al 2014).
In simple terms, an amyloid disease happens when misfolded proteins aggregate and build up in the body, causing tissue damage.
Can you use amyloid β-sheets as materials?
Yes. Amyloid β-sheets can be used as materials because their structure forms highly ordered and stable fibrils. These fibrils are strong, resistant to heat and chemical degradation, and can self-assemble into nanofibers. Because of these properties, researchers are exploring their use in nanotechnology, tissue engineering scaffolds, biosensors, and drug delivery systems. Despite their association with disease in the body, amyloid β-sheets from vegetable protein have promising applications as engineered biomaterials. They have been applied in renewable and biodegradable bioplastics and in water purification membranes for heavy metal removal (Li et al 2023).
References
Altwegg, K., Balsiger, H., Bar-Nun, A., et al. (2016). Prebiotic chemicals—amino acid and phosphorus—in the coma of comet 67P/Churyumov-Gerasimenko. Science Advances, 2(5), e1600285.
Blachier, F. (2025). Amino Acids Before Life and in the First Living Organisms. In: The Evolutionary Journey of Amino Acids. Springer, Cham.
Bolshette, N. B., Thakur, K. K., Bidkar, A. P., et al. (2014). Protein folding and misfolding in the neurodegenerative disorders: a review. Revue Neurologique, 170(3), 151-161.
Cheng, P. N., Liu, C., Zhao, M., Eisenberg, D., & Nowick, J. S. (2012). Amyloid β-sheet mimics that antagonize protein aggregation and reduce amyloid toxicity. Nature Chemistry, 4(11), 927-933.
Fundación Española de la Nutrición (FEN). (2012). Guía Nutricional de la Carne.
Haldane, J. B. S. (1929). Origin of life. Ration. Annu., 148, 3–10.
Li, T., Zhou, J., Peydayesh, M., et al. (2023). Plant protein amyloid fibrils for multifunctional sustainable materials. Advanced Sustainable Systems, 7(4), 2200414.
Maury, C. P. J. (2015). Primordial genetics: Information transfer in a pre-RNA world based on self-replicating beta-sheet amyloid conformers. Journal of Theoretical Biology, 382, 292-297.
Miller, S. L. (1953). A production of amino acids under possible primitive earth conditions. Science, 117(3046), 528-529.
Monsellier, E., & Chiti, F. (2007). Prevention of amyloid-like aggregation as a driving force of protein evolution. EMBO Reports, 8(8), 737.
Nogal, N., Sanz-Sánchez, M., Vela-Gallego, S., et al. (2023). The protometabolic nature of prebiotic chemistry. Chemical Society Reviews, 52(21), 7359-7388.
Rovner, A. J., Haimovich, A. D., Katz, S. R., et al. (2015). Recoded organisms engineered to depend on synthetic amino acids. Nature, 518(7537), 89-93.
Weber, A. L., & Miller, S. L. (1981). Reasons for the occurrence of the twenty coded protein amino acids. Journal of Molecular Evolution, 17(5), 273-284.
Part B. Proteine Analysis and Visualization
In this part of the homework, you will be using online resources and 3D visualization software to answer questions about proteins. Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:
1. Briefly describe the protein you selected and why you selected it.
PETase (Poly(ethylene terephthalate) hydrolase, sp|A0A0K8P6T7.1|PETH_PISS1) is a bacterial enzyme specialized in degrading PET (polyethylene terephthalate), a polymer widely used in plastic bottles and packaging. This protein belongs to the α/β-hydrolase fold family, which includes enzymes such as lipases, esterases, and cutinases, sharing a characteristic structural fold of alternating α-helices and β-sheets, as well as a conserved catalytic triad (Ser–His–Asp/Glu) essential for its hydrolytic activity. PETase functions by breaking the ester bonds in PET, facilitating plastic biodegradation and making it a biotechnologically relevant model for enzymatic recycling studies.
This protein was selected for the molecular analysis and visualization section because its structure is well-characterized and its function is clear and measurable. It allows the application of bioinformatics tools to explore features such as folds, active sites, hydrophobic regions, and interaction patterns, demonstrating in a practical way how an enzyme’s structure is directly linked to its biological function and industrial applications.
2. Identify the amino acid sequence of your protein.
This is the amino acid sequence of the protein PETase.
How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.
The lenght of the proteine is 290 aminoacids, and the most common amino acid is S (Serine).
How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.
– How to use BLAST Analysis of PETase
BLAST (Basic Local Alignment Search Tool) is a program that compares a query sequence, either DNA or protein, against sequences in public databases. It is used to identify similar or homologous sequences, predict the family or likely function of a protein, and analyze evolutionary conservation or important functional motifs. For protein sequences, the BLASTp algorithm is typically used to compare amino acid sequences.
**Step 1: Input Sequence: Copy the amino acid sequence of the protein you wish to analyze (for example, PETase from UniProt). Paste it into the “Enter Query Sequence” field. The sequence can be provided in FASTA format or as a plain amino acid string.
**Step 2: Select Target Database: I used UniProtKB/Swiss-prot, the Best for reliable and curated results. Then check the rest of parameters:
Pres Run BLAST and wait until they notify your results are ready.
File .tsv was downloaded, and also I downloaded .png file.
How to interpret the results from Blast?
The BLAST analysis against UniProtKB returned 250 highly significant hits (E-value < 10⁻⁸²). Most sequences exhibit 47–52% identity, with some closely related homologs reaching 83–100% identity.
The identified proteins have an average length of 280–320 amino acids, which is consistent with enzymes belonging to the α/β hydrolase fold superfamily.
Functionally, the main homologs correspond to cutinases, lipases, and PET hydrolases, many of which are found in bacteria within the phylum Actinobacteria.
The presence of reviewed Swiss-Prot entries and homologs with available three-dimensional structures further supports the functional prediction.
Overall, the results indicate that the analyzed protein belongs to the α/β hydrolase superfamily and likely exhibits hydrolase activity toward ester bonds, with potential capability for polyester degradation such as PET.
3. Identify the structure page of your protein in RCSB
When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)
The crystal structure of poly(ethylene terephthalate) hydrolase (PETase) from Piscinibacter sakaiensis corresponds to PDB entry 5XH3. The structure was published in 2018 in Proceedings of the National Academy of Sciences and determined using X-ray diffraction. The reported resolution is 0.92 Å, indicating exceptionally high structural quality. Since lower resolution values reflect greater atomic precision, this structure provides near-atomic detail and is considered extremely reliable. The analyzed chain (Chain A) shows 100% sequence identity with the query sequence across residues 1–290, confirming that this structure corresponds exactly to the studied protein. In addition to the protein chain, the structure contains water molecules and other crystallographic components typical of high-resolution X-ray structures. According to SCOP structural classification, the protein belongs to the alpha and beta (α/β) protein class, consistent with the α/β-hydrolase fold family identified through sequence analysis and BLAST results.
In the amino acid sequence, the catalytic triad was identified at positions Ser160, Asp206, and His237. These conserved residues form the active site of the enzyme and are responsible for its hydrolytic activity. Ser160 acts as the nucleophile, His237 functions as a general base, and Asp206 stabilizes the histidine residue. Together, this catalytic machinery enables the hydrolysis of ester bonds, consistent with the characteristic mechanism of α/β-hydrolase fold enzymes such as PETase.
See 6EQE in the picture below:
As example, Position Asp 206 (S):
Are there any other molecules in the solved structure apart from protein? Yes, the crystal structure includes additional molecules such as water and crystallization agents. These are not part of the protein itself but are present due to the experimental conditions and may contribute to structural stabilization. The figure below (RCSB) shows other molecules such as Polymer (solid dark blue), Water (doots lighter blue around), Ion (dark blue dots) and clashes (yellow structures with pink rings).
Does your protein belong to any structure classification family? Yes, the protein belongs to the α/β-hydrolase fold structural family, characterized by a central β-sheet surrounded by α-helices and a conserved Ser–Asp–His catalytic triad (explanation above).
4. Open the structure of your protein in any 3D molecule visualization software
PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)
Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
Figure description (Cartoon representation):
The protein structure is shown in cartoon representation, which highlights the secondary structure elements. Red regions correspond to α-helices, yellow regions indicate β-sheets, and green regions represent loops or turns connecting the secondary structure elements. This visualization allows us to quickly see the overall fold of the protein and the organization of helices, sheets, and loops within the 3D structure.
The protein is shown in ribbon representation, highlighting its secondary structure. α-helices are colored red, β-sheets yellow, and loops/turns green. This view emphasizes the overall fold and connectivity of the protein backbone, even though short β-strands may appear as simple lines rather than arrows.
Ball-and-Stick Representation:
The protein is visualized using the ball-and-stick model, where atoms are represented as spheres and bonds as sticks. The spheres are scaled to clearly show individual atoms without cluttering the image. In this representation, the distribution of atoms along the protein backbone and side chains is visible, highlighting the chemical connectivity within the enzyme. The colors correspond to residue types: for example, carbon atoms in green, oxygen in red, nitrogen in blue, and sulfur in yellow, allowing differentiation of polar, nonpolar, and reactive groups. This view emphasizes the three-dimensional arrangement of amino acids, which is critical for understanding active sites, substrate binding, and potential catalytic interactions.
Color the protein by secondary structure. Does it have more helices or sheets? The protein is colored by secondary structure: red for α-helices, yellow for β-sheets, and green for loops/turns. Visually, the PETase structure contains more α-helices than β-sheets, indicating that helical regions dominate the fold, while sheets are present but less abundant. Loops and turns (green) connect these elements and form the flexible parts of the protein.
Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues? The protein is visualized in cartoon representation, with residues colored according to their chemical properties. Hydrophobic residues (Ala, Val, Leu, Ile, Met, Phe, Trp, Tyr) are shown in orange, hydrophilic residues (Asp, Glu, Lys, Arg, His, Ser, Thr, Asn, Gln) are in cyan, and other residues such as Pro and Gly are colored gray. The distribution shows that hydrophobic residues are mostly buried within the protein core, stabilizing the structure, while hydrophilic residues are more exposed on the surface, potentially interacting with the solvent or other molecules.
Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)? The protein is displayed as a semi-transparent surface on a black background, highlighting the overall shape and potential binding pockets. The catalytic triad is distinctly colored: serine 160 in magenta, aspartate 206 in cyan, and histidine 237 in yellow. The triad is clearly visible on the surface, within a main binding pocket. This coloring emphasizes the spatial arrangement of the catalytic residues relative to the protein fold, allowing easy identification of the active site and nearby substrate-binding regions.
As the protein PETase belongs to the PETase / PET-digesting enzyme family, which includes hydrolases capable of degrading poly(ethylene terephthalate) (EC 3.1.1.101). This family information will be used to restrict the BLAST search to proteins with the same functional classification, ensuring that the results are relevant homologs. In Restrict by Taxonomy field, I included “bacteria”. This configuration will focus the results on relevant, functional homologs in bacterial species.
Protein AI Modeling and Analysis
In this section, we will learn about the capabilities of modern protein AI models and test some of them in your chosen protein.
a. Copy the HTGAA_ProteinDesign2026.ipynb notebook and set up a Colab instance with GPU.
c.We will now try multiple things in the three sections below; report each of these results in your homework writeup on your HTGAA website:
C1. Protein Language Modeling
1. Deep Mutational Scans
a) Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.
Model Scores | Mutation Scan Heatmap | Position in Protein Sequence | Amino Acid Mutations: The heatmap represents the log-likelihood ratio (LLR) between a mutant residue and the original amino acid (wild type, WT). Positive or less negative values indicate that a mutation is relatively tolerated by the model, whereas strongly negative values suggest that the substitution is unlikely and potentially deleterious.
b)Can you explain any particular pattern?
To answer this section, the script was improved to get visual data to support the explanation:
To investigate the mutational tolerance of the protein sequence, an unsupervised deep mutational scan (DMS) was generated using the ESM2 protein language model. In this analysis, every position of the sequence is systematically mutated to each of the 20 standard amino acids, and the model estimates the likelihood of each substitution based on learned evolutionary patterns from large protein databases.
Overall, positions 11 to 30 show that more than half of the amino acids have negative probabilities, indicating these sites are less tolerant to substitutions and likely structurally or functionally constrained. In contrast, positions 262 to 289 display predominantly positive probabilities, suggesting these regions are more flexible and can accommodate a wider range of amino acid changes without compromising stability or function. This pattern highlights differential mutational tolerance across the protein sequence.
In the figure, red markers indicate the wild-type residues of the original protein sequence. These serve as a reference point for comparing the predicted likelihood of mutations at each position.
The catalytic triad residues (Ser161, Asp207, and His238) are highlighted in fuchsia. These positions show very low tolerance to mutation, consistent with their essential role in enzymatic catalysis.
Two additional positions illustrate different mutational behaviors predicted by the model:
I82 (mustard yellow): relatively tolerant region; several substitutions are accepted.
Overall, positions associated with catalytic activity or structural stability show strong intolerance to mutation, whereas more flexible regions exhibit greater mutational tolerance. These results illustrate how protein language models can infer functional signals directly from sequence data without experimental supervision.
c (Bonus): Compare predictions to experimental DMS data: Simonich, C., McMahon, T. E., & Bloom, J. (2026, January). Deep Mutational Scanning of the RSV Fusion Protein Reveals Mutational Constraint and Antibody Escape Mutations. Open Forum Infectious Diseases, 13(Supplement_1), ofaf695-1973.
The experimental scans in Simonich et al. (2026) cover the RSV F ectodomain, testing nearly all single amino-acid mutations. These sequences can be input into a protein language model, and its predictions of mutational effects can be compared directly to the DMS results to assess how well the model captures functional constraints and antibody escape.
2. Latent Space Analysis
a. Use the provided sequence dataset to embed proteins in reduced dimensionality
Protein embeddings were generated for each sequence using a pretrained protein language model. Each sequence was converted into a fixed-length vector by averaging the token-level representations from the final hidden layer. The embeddings were combined into a single matrix and reduced to three dimensions using t-distributed Stochastic Neighbor Embedding (t-SNE), which preserves local relationships so that proteins with similar embeddings appear close in the 3D space. In the final visualization, each protein is represented as a point and colored according to its SCOPe family classification, enabling the identification of structural or functional groups. The dataset contains 4,742 SCOPe families.
b. Analyze the different formed neighborhoods: do they approximate similar proteins?. The neighborhoods observed in the t-SNE visualization correspond to groups of proteins with similar embeddings. Because the embeddings capture sequence patterns and structural information learned by the language model, proteins with similar biological characteristics tend to cluster together. This is supported by the coloring based on SCOPe families, as many clusters contain proteins from the same or closely related families, reflecting shared structural folds or evolutionary relationships.
c. Place your protein (PETase) in the resulting map and explain its position and similarity to its neighbors.
The PETase sequence was embedded using the same procedure as the dataset proteins and projected into the same t-SNE space. In the visualization:
PETase highlighted in red.
SCOPe family c.69.1.0 in salmon (72 sequences).
Ten nearest neighbors identified by cosine similarity (orange).
PETase is located within the region occupied by members of this family, indicating that the embedding correctly captures its relationship with structurally related proteins.
The identifiers and SCOPe family annotations of the ten closest proteins are shown in the next Table:
Dimensionality reduction also performed with UMAP, preserving both local and global structure.
Both dimensionality reduction methods produced similar but not identical results. The sets of the ten nearest proteins identified by t-SNE and Uniform Manifold Approximation and Projection overlap partially, with three sequences differing between the two top-10 lists. This difference (in yellow) arises because t-distributed Stochastic Neighbor Embedding focuses mainly on preserving local neighbor relationships, whereas UMAP attempts to preserve both local and global structure. Consequently, the overall clustering patterns are comparable, but the exact nearest neighbors of PETase vary slightly, illustrating how different dimensionality reduction methods can influence similarity interpretations.
a.Fold your protein with ESMFold. Do the predicted coordinates match your original structure?
The structure was predicted with ESMFold (copies = 1, num_recycles = 3) and visualized with py3Dmol, colored by pLDDT (blue = high confidence, red = low confidence).
The resulting protein sequence has a total length of 290 residues. The prediction produced a pTM score of 0.913 and an average pLDDT confidence score of 96.27, indicating a high-confidence structural model.
Two output files were generated. The file ptm0.913_r3_default.pdb contains the predicted three-dimensional structure of the protein, including atomic coordinates, and will be used later for structural visualization and further structural analysis. The file ptm0.913_r3_default.pae.txt contains the Predicted Aligned Error (PAE) matrix, which describes the expected positional error between residue pairs and will be referenced in a later section for model confidence evaluation.
I used ESMFold in ESMatlas.com, which allows you to hover over amino acid positions in the protein and displays the confidence at that position. For example, the protein prediction shows areas in lighter blue. According to ESMAtlas (see the red circle in the next figure), this particular position (ASN 225) has a pLDDT of 74.30%. So a pLDDT of 74.3% indicates that ASN 225’s predicted position is moderately reliable, though there could be some flexibility or uncertainty in its exact placement.
b. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?b) Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations? I performed a mutation in Catalytic triad to disrupt PETase catalytic function by replacing the three key residues with alanine (S160, D206, H237):
S160 (Serine) – nucleophile in the active site → mutated to A (Alanine)
D206 (Aspartate) – stabilizes histidine and participates in proton relay → mutated to A
H237 (Histidine) – acts as a general base in catalysis → mutated to A
(Positions are marked in red, orange arrow pointing to it; see 3 figures below, level of pLDDT included in the figure for each position before mutation.)
Sequence to be mutated and mutations positions in black:
The mutated sequence was inserted in ESMatlas.com (ESMFold): The local changes in color reflect a reduction in confidence/pLDDT near the active site, consistent with structural perturbation. It was faster to get the level of pLDDT in ESMatlas.com (ESMFold) where I insert the mutated sequence after.
Most of the protein remains blue, but certain areas got light blue meaning that the pLDDT falls nearly below 90% of confidence.
The mutations effectively target the catalytic function in the three positions without dramatically altering the overall fold.
C3. Protein Generation
Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN.
a. Inverse protein folding with ProteinMPNN using backbone from ptm0.913_r3_default.pdb. The input backbone for sequence design was the 3D structure predicted by ESMFold for PETase, stored in: /content/original_58458/ptm0.913_r3_default.pdb
b. Pretrained model weights v_48_020.pt were downloaded from Hugging Face.
c. Model parameters:
Version: v_48_020
Edges: 48
Hidden dimension: 128
Encoder layers: 3
Decoder layers: 3
Training noise: 0.2 Å
Backbone noise: 0.0 Å
d. Native sequence evaluated: Score = 1.4683 (negative log-likelihood).
e. Designed sequence generated (temperature = 0.1):
f. ProteinMPNN predicts the probability of each of the 21 amino acids at every position of the sequence. The predicted log-probabilities were converted into probabilities using the exponential function and averaged to obtain a position-specific probability matrix.
These probabilities were visualized using a heatmap, where:
• X-axis: sequence position
• Y-axis: amino acid type
• Color intensity: predicted probability
Two markers were added to the heatmap:
• Coral points: amino acids with the highest predicted probability at each position (ProteinMPNN prediction).
• Jade points: residues from the native sequence.
This visualization allows a direct comparison between the predicted optimal sequence and the original protein sequence, which account 45% of position recovery between them.
Analyze predicted sequence probabilities vs original: ProteinMPNN predicted sequence has lower negative log-likelihood (0.7572) compared to the native (1.4683), showing higher compatibility. Heatmap highlights conserved and flexible positions, sequence recovery ~0.455.
Input designed sequence into ESMFold and compare:
Prediction visualized in esmatlas.com.
Predicted structure shows local variations (color differences) but overall fold largely preserved.
The ESMFold-predicted structure shows regions in light blue and yellow absent in the original. These colors reflect lower pLDDT confidence, indicating higher uncertainty or flexibility in the predicted positions. The designed sequence preserves the overall fold, but local substitutions introduce slight variability in some regions, while dark blue areas remain well-defined and conserved.
Part D. Group Brainstorm on Bacteriophage Engineering
Bacteriophage Engineering Proposal: L Protein Stabilization
Primary Goal: Increased stability (easiest). Specific Approach: Engineering DnaJ-independence by reducing chaperone-recognition signals while preserving the structural scaffold of the L protein.
1. Computational Tools and Pipeline Justification
To achieve this goal, we propose a three-step computationally efficient pipeline:
Step 1: Sequence-level Mutational Scanning using ESM2
Approach: We will perform a zero-shot in silico mutational scan across the L protein sequence using the ESM2 Protein Language Model (PLM). We aim to identify exposed hydrophobic patches (typical DnaJ recognition motifs) and propose polar/hydrophilic substitutions.
Why this helps: ESM2 has learned deep evolutionary constraints across millions of protein sequences. It allows us to rapidly differentiate between highly constrained residues (which are structurally vital and “untouchable”) and mutation-tolerant positions. This ensures we only disrupt chaperone-binding motifs without breaking the core evolutionary scaffold of the protein, all at a fraction of the computational cost of molecular dynamics.
Step 2: Rapid Structural Filtering using ESMFold
Approach: The top candidate sequences from the ESM2 scan will be predicted using ESMFold. We will filter out any variants that collapse, show low pLDDT (confidence) scores, or have a high RMSD compared to the Wild-Type (WT) backbone.
Why this helps: While ESM2 evaluates sequence-level fitness, we need explicit 3D structural validation. ESMFold is significantly faster than AlphaFold2, making it ideal for high-throughput filtering. This step ensures that our hydrophilic mutations do not inadvertently destroy the L protein’s ability to fold independently.
Step 3: Complex Modeling using Boltz-1
Approach: We will model the L protein + DnaJ complex for both the WT and our top folded mutant candidates. We will analyze the predicted interface contacts and Predicted Aligned Error (PAE) to assess binding affinity.
Why this helps: Folding correctly in isolation is not enough; we must explicitly prove reduced chaperone dependency. By comparing the mutant-DnaJ interface against the WT-DnaJ interface, we can prioritize variants that maintain a stable fold but show a significantly weakened or abolished interaction with the DnaJ chaperone.
2. Potential Pitfalls
Pitfall 1: Overlapping Reading Frames and Genomic Constraints. Phage genomes are highly compact, meaning the DNA sequence encoding the L protein might also encode parts of other proteins or regulatory elements in alternative reading frames. Our targeted mutations could have unintended, fatal consequences for the phage’s overall viability. While genomic foundation models like Evo could assess these genome-wide constraints, their computational cost is prohibitive for our current scope.
Pitfall 2: The Stability vs. Function Trade-off. ESMFold guarantees that the protein adopts a stable 3D conformation in solution, but it does not guarantee biological function (membrane lysis). Lytic activity heavily depends on complex factors like membrane insertion dynamics, oligomerization, and reaction kinetics. Furthermore, completely abolishing chaperone interaction might inadvertently prevent the L protein from being properly delivered to its target membrane.
Week 5 HW: Protein Design Part II
Part 1: Generate Binders with PepMLM
1. Begin by retrieving the human SOD1 sequence from UniProt (P00441) and introducing the A4V mutation.
Superoxide Dismutase 1 (SOD1) is a human enzyme that plays a critical role in protecting cells from oxidative stress by catalyzing the conversion of superoxide radicals into oxygen and hydrogen peroxide. It is a small, 154-amino-acid protein that typically forms a stable homodimer and contains a β-βarrel core structure with metal cofactors, copper and zinc, essential for its catalytic activity. Mutations in SOD1, such as the A4V variant, are associated with familial amyotrophic lateral sclerosis (ALS), a neurodegenerative disorder. SOD1 is widely expressed in the cytoplasm and is a key model protein for studying protein folding, aggregation, and targeted protein degradation strategies.
2. Using the PepMLM Colab linked from the HuggingFace PepMLM-650M model card:
3. Generate four peptides of length 12 amino acids conditioned on the mutant SOD1 sequence.
As the original script generated four peptides containing some invalid characters, I improved the script to produce only valid peptides without those characters. You can access the script cell here:
Additionally, I included a scritp to calculate the pseudo-perplexity of a known peptide.
The .cvs file gives the peptides and the Pseudo Perplexity values:
Pseudo-perplexity measures the model’s confidence in a peptide sequence given the target protein context. Lower values indicate that the peptide is more consistent with patterns of protein–peptide interactions learned during training, while higher values suggest lower confidence. In the table, WRSGAAGAAWWK has the lowest pseudo-perplexity (7.12), indicating the model is most confident in this generated peptide, whereas the known binder FLYRWLPSRRGG has the highest value (20.64), reflecting comparatively lower model confidence. The other generated peptides fall in between, showing moderate confidence according to the model.
4. To your generated list, add the known SOD1-binding peptide FLYRWLPSRRGG for comparison.
I open the test.csv file in excel, added a line and insert the SOD1-binding peptide FLYRWLPSRRGG. Record the perplexity scores that indicate PepMLM’s confidence in the binders.
PepMLM assigns lower pseudo-perplexity scores to several generated peptides compared with the known SOD1-binding peptide FLYRWLPSRRGG. The peptide WRSGAAGAAWWK shows the lowest perplexity (7.12), indicating that the model predicts it as the most probable binder among the candidates.
1. Navigate to the AlphaFold Server: alphafoldserver.com
2. For each peptide, submit the mutant SOD1 sequence followed by the peptide sequence as separate chains to model the protein-peptide complex.
3. Record the ipTM score and briefly describe where the peptide appears to bind. Does it localize near the N-terminus where A4V sits? Does it engage the β-barrel region or approach the dimer interface? Does it appear surface-bound or partially buried?
The predicted binders 1–4 localize near residues 96–102 of the protein, a region that forms one of the parallel β-strands contributing to the β-sheet structure. In contrast, the known binder (binder 5) appears near residues 11–15. None of the binders are located close to the N-terminus or C-terminus of the protein. Additionally, the peptides do not appear tightly attached to the protein surface or inserted into a binding pocket, suggesting that they remain relatively exposed rather than strongly surface-bound or buried.
The predicted interaction confidence, measured by ipTM, is relatively low for all binders, with values ranging from 0.25 to 0.44. Binder 2 shows the highest interaction score (ipTM = 0.44), followed by binder 4 (0.39) and binder 1 (0.34), while binders 3 (0.28) and 5 (0.25) show lower confidence. These values indicate moderate to low confidence in the predicted peptide–protein interactions, suggesting that the binding poses may be weak or uncertain.
4. In a short paragraph, describe the ipTM values you observe and whether any PepMLM-generated peptide matches or exceeds the known binder.
The predicted complexes show low ipTM values (0.25–0.44), indicating weak confidence in protein–peptide interactions. The PepMLM-generated peptide WRSGAAGAAWWK has the highest ipTM score (0.44), exceeding that of the known binder FLYRWLPSRRGG (0.25), but none of the peptides appear to form a stable interaction with SOD1 in the predicted models.
Part 3: Evaluate Properties of Generated Peptides in the PeptiVerse
Structural confidence alone is insufficient for therapeutic development. Using PeptiVerse, let’s evaluate the therapeutic properties of your peptide! For each PepMLM-generated peptide:
1. Paste the peptide sequence.
2. Paste the A4V mutant SOD1 sequence in the target field.
Check the boxes:
Predicted binding affinity
Solubility
Hemolysis probability
Net charge (pH 7)
Molecular weight
Compare these predictions to what you observed structurally with AlphaFold3. In a short paragraph, describe what you see. Do peptides with higher ipTM also show stronger predicted affinity? Are any strong binders predicted to be hemolytic or poorly soluble? Which peptide best balances predicted binding and therapeutic properties?
Choose one peptide you would advance and justify your decision briefly.
The predicted binding affinities range from pKd/pKi values of ~5.4 to 6.9, indicating weak to moderate binding overall. The peptide WRSGAAGAAWWK shows both the highest predicted affinity (pKd 6.94) and the highest ipTM value (0.44), suggesting some agreement between the structural prediction from AlphaFold3 and the affinity prediction from Peptiverse. However, most peptides did not appear to form stable interactions with SOD1 in the structural models, remaining mostly surface-proximal rather than clearly bound. All peptides show good predicted solubility (score = 1) and very low hemolysis probabilities (<0.05), indicating favorable therapeutic properties. Among the candidates, WRSGAAGAAWWK (2) best balances predicted binding strength, structural confidence, and safety properties.
See all values from alphaFold web y pepTiVerse per peptide:
Binder
Pseudo Perplexity
ipTM
pTM
Solubility
Hemolysis
Binding Affinity (pKd/pKi)
Length (aa)
Molecular Weight (Da)
Net Charge (pH 7)
Isoelectric Point (pH)
Hydrophobicity (GRAVY)
WRYGAAAVEHKK
12.17656988
0.34
0.75
1
0.021
5.436
12
1415.6
1.85
9.7
-1
WRSGAAGAAWWK
7.11655395
0.44
0.84
1
0.032
6.939
12
1346.5
1.76
11
-0.46
WRYYAAGLAWKK
14.14682634
0.28
0.73
1
0.024
6.735
12
1512.8
2.76
10
-0.66
WLYYAAGARHKE
18.40388821
0.39
0.87
1
0.029
5.887
12
1464.6
0.85
8.5
-0.82
FLYRWLPSRRGG
20.63523127
0.25
0.83
1
0.047
5.968
12
1507.7
2.76
11.71
-0.71
Notes:
ipTM score estimates the confidence of protein–peptide interactions. Values above 0.7 indicate a reliable interaction, values between 0.5 and 0.7 suggest a possible interaction, and values below 0.5 indicate that an interaction is unlikely.
Solubility is predicted as a binary value where 0 indicates not soluble and 1 indicates soluble.
Hemolysis probability predicts whether a peptide may damage red blood cells; 0 indicates non-hemolytic and 1 indicates hemolytic.
Binding affinity (pKd/pKi) reflects binding strength (−log10 of Kd or Ki), where higher values indicate stronger binding. Values <5 indicate very weak binding (>100 µM), 5–6 weak (~100 µM), 6–7 moderate (~10 µM), 7–8 good (~1 µM), and >8 strong binding (<100 nM).
Part 4: Generate Optimized Peptides with moPPIt
Now, move from sampling to controlled design. moPPIt uses Multi-Objective Guided Discrete Flow Matching (MOG-DFM) to steer peptide generation toward specific residues and optimize binding and therapeutic properties simultaneously. Unlike PepMLM, which samples plausible binders conditioned on just the target sequence, moPPIt lets you choose where you want to bind and optimize multiple objectives at once.
1. Open the moPPit Colab linked from the HuggingFace moPPIt model card
2. Make a copy and switch to a GPU runtime.
3. In the notebook:
Paste your A4V mutant SOD1 sequence.
Choose specific residue indices on SOD1 that you want your peptide to bind (for example, residues near position 4, the dimer interface, or another surface patch).
Set peptide length to 12 amino acids.
Enable motif and affinity guidance (and solubility/hemolysis guidance if available). Generate peptides.
4. After generation, briefly describe how these moPPit peptides differ from your PepMLM peptides. How would you evaluate these peptides before advancing them to clinical studies?
moPPIt Generated Peptides
Binder
Hemolysis
Solubility
Affinity
Motif
GGGRQKCFTLNM
0.9349
0.75
6.5764
0.5096
SKKQKTITCELC
0.9677
0.8333
7.2126
0.6561
KTCKKSFEKKQN
0.9739
0.9167
6.2008
0.6707
GTTCIQGKKKDE
0.9785
0.9167
6.3731
0.5156
The peptides generated with moPPIt differ from the PepMLM peptides because their generation was guided by multiple objectives defined in the notebook, including affinity, motif targeting, solubility, hemolysis, and specificity. In contrast, PepMLM generates peptides based mainly on sequence likelihood conditioned on the target protein sequence. As a result, the moPPIt peptides show moderate predicted affinities (pKd ≈ 6.2–7.2), relatively high motif scores (≈0.51–0.67), and good solubility predictions (0.75–0.92). The PepMLM peptides also show moderate affinities but were not optimized simultaneously for these therapeutic objectives during generation.
However, when one of the generated peptides was modeled with the SOD1 structure using AlphaFold, the peptide did not appear to form a stable interaction with the protein surface. This suggests that although the predicted affinity and motif scores are moderate, the structural predictions do not clearly support strong binding.
Comparison of PepMLM and moPPIt Generated Peptides
Method
Binder
Hemolysis
Solubility
Affinity
PepMLM
WRYGAAAVEHKK
0.021
Yes
5.436
PepMLM
WRSGAAGAAWWK
0.032
Yes
6.939
PepMLM
WRYYAAGLAWKK
0.024
Yes
6.735
PepMLM
WLYYAAGARHKE
0.029
Yes
5.887
PepMLM
FLYRWLPSRRGG
0.047
Yes
5.968
moPPIt
GGGRQKCFTLNM
0.9349
0.75
6.5764
moPPIt
SKKQKTITCELC
0.9677
0.8333
7.2126
moPPIt
KTCKKSFEKKQN
0.9739
0.9167
6.2008
moPPIt
GTTCIQGKKKDE
0.9785
0.9167
6.3731
Before advancing any peptide toward clinical studies, further validation would be necessary. Additional structural modeling and docking analyses should confirm stable binding to the intended region of SOD1. Promising candidates should then be synthesized and tested experimentally through in vitro binding assays, hemolysis and cytotoxicity tests, and stability analyses to verify their predicted properties. Only peptides showing clear binding and favorable safety profiles would be considered for further development.
Week 6 HW: Genetic Circuits: Part I
Answer these questions about the protocol in this week’s lab:
1. What are some components in the Phusion High-Fidelity PCR Master Mix and what is their purpose?
Components of the Phusion High-Fidelity PCR Master Mix and their purpose:
Phusion DNA polymerase – a high-fidelity DNA polymerase that synthesizes new DNA strands with a low error rate during PCR.
Primers (forward and reverse) – short DNA sequences that bind to the target DNA and define the region that will be amplified.
dNTPs (dATP, dTTP, dCTP, dGTP) – the nucleotide building blocks used by the polymerase to synthesize new DNA strands.
Reaction buffer – maintains optimal pH and ionic conditions for proper enzyme activity.
Mg²⁺ ions – an essential cofactor required for DNA polymerase catalytic activity.
Nuclease-free water – maintains the correct reaction volume and prevents degradation of DNA.
2. What are some factors that determine primer annealing temperature during PCR?
Primer melting temperature (Tm) – The annealing temperature is usually set about 3–5 °C below the primer Tm to allow specific binding to the DNA template.
Primer length – Longer primers generally have higher Tm values, which increases the annealing temperature.
GC content – Primers with higher GC content bind more strongly (three hydrogen bonds), increasing Tm and the annealing temperature.
Primer–template complementarity – Mismatches between the primer and the template reduce binding efficiency and may require a lower annealing temperature.
Reaction conditions – Salt concentration and Mg²⁺ levels in the PCR mix influence DNA stability and can affect the optimal annealing temperature.
3. There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.
Comparison of PCR and Restriction Enzyme Digestion
Criterion
PCR (Polymerase Chain Reaction)
Restriction Enzyme Digestion
Underlying mechanism
DNA amplification using primers and a thermostable DNA polymerase through repeated thermal cycles.
Enzymatic cleavage of DNA at specific recognition sequences by restriction endonucleases.
Key reagents
DNA template, forward and reverse primers, DNA polymerase (e.g., Taq or Phusion), dNTPs, buffer.
Restriction enzyme(s), compatible reaction buffer, and DNA substrate.
Sequence constraints
Only requires primer binding regions; primers can be designed to introduce mutations or overlaps.
Requires pre-existing restriction sites within the DNA sequence.
Experimental flexibility
Highly adaptable; enables mutagenesis, sequence insertion, and creation of overlaps for cloning strategies such as Gibson Assembly.
Limited flexibility; modification depends on the presence and position of restriction sites.
Resulting DNA product
Defined amplified DNA fragment.
DNA fragments generated by site-specific cleavage.
Typical applications
Gene amplification, site-directed mutagenesis, preparation of fragments for advanced cloning methods.
Traditional cloning workflows where compatible restriction sites are available.
When it is preferred
When sequence modification or precise amplification is required.
When simple and reliable DNA cutting is sufficient for cloning.
4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?
Design 20–40 bp complementary overhangs between adjacent DNA fragments.
Include the correct 5′ overhangs in PCR primers so fragments share matching overlaps.
Ensure fragments are in the correct 5′→3′ orientation for proper assembly.
Avoid secondary structures (e.g., hairpins or primer dimers) in the overlap regions.
5. How does the plasmid DNA enter the E. coli cells during transformation?
Heat shock: A sudden temperature change creates temporary pores in the membrane, allowing plasmid DNA to enter.
Electroporation: An electrical pulse creates pores in the membrane through which plasmid DNA enters the cell.
6. Describe another assembly method in detail (such as Golden Gate Assembly). Explain the other method in 5 - 7 sentences plus diagrams (either handmade or online).
Golden Gate Assembly is a molecular cloning method that allows the assembly of multiple DNA fragments in a single reaction. It uses Type IIS restriction enzymes (such as BsaI) together with DNA ligase. These enzymes cut DNA outside of their recognition site, creating specific overhangs that can be designed to match between fragments. The complementary overhangs allow the DNA fragments to anneal in a predefined order. DNA ligase then joins the fragments together to form a continuous DNA molecule. Because the restriction sites are removed during the process, the final assembled DNA can no longer be cut again, making the reaction efficient and directional. This method allows the simultaneous and seamless assembly of multiple fragments in one tube.
DNA Fragment A [ATGCGTACGTTAGCTAGCTAGCGATCGATCGTAGCTAGCTAGCTA] DNA Fragment B [AGCTATTTGCGGATCGATCA] DNA Fragment C [ATCAGGCTAGCGTATCGTAA]
Golden Gate Assembly – Conceptual Diagram
These overhangs ensure correct directional assembly of fragments A, B, and C. The sequences have been prepared and modified to include the complementary overhangs for accurate Golden Gate assembly simulation in Benchling.