Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices

    (The Challenge) Mastitis is an inflammation of the mammary gland in dairy cows, often caused by bacterial pathogens. It is a common and costly issue in dairy farms, leading to significant economic losses and affecting overall milk quality (Damasceno et al., 2025). The condition can arise from various factors, including poor hygiene, stress, and injuries to the udder. Common bacterial pathogens responsible for mastitis include Staphylococcus aureus, Escherichia coli, and Streptococcus uberis, which can enter the udder through damaged skin or during milking. Mastitis presents in either clinical or subclinical form. Coliforms such as Escherichia coli, Klebsiella spp., and Enterobacter spp. account for 40% of clinical mastitis cases.

  • Week 2 HW: Read, write and Edit

    Part 1: Benchling & In-silico Gel Art I began by signing into my Benchling account and creating a new folder called HTGAA-read, write, and edit. I then added a new sequence, the Lambda_NEB FASTA sequence. On the right-hand side, I used the icon that looks like scissors to enter the different restriction enzymes and then viewed the different digestion sites on the virtual digest tab. Below is the image from the virtual digest for the different enzymes.

  • Week 3 HW: Lab Automation

    Python Script for Opentrons Artwork I first designed a sunflower using the Opentrons-Art website. https://opentrons-art.rcdonovan.com/?id=e3z1i8r73863y1k I then downloaded the Excel file and, in the HTGAA26 Opentrons Colab, used Gemini to help write the Python script. I uploaded the Excel sheet and prompted Gemini to assist in developing a script that would give me the sunflower design. Link to the Colab: https://colab.research.google.com/drive/1arsozAVNQhs-4Ol0LMIRKZ4QGVld0Kgf?usp=sharing

  • Week 4 HW: protein-design-part-i

    Part A. Conceptual Questions How many molecules of amino acids in 500g of meat? A quick search shows that, for example, beef contains ~20–22g of protein per 100g of meat. Also, it would be good to mention that Raw meat contains more water, so the total weight of amino acids is lower per 100g compared to cooked meat, where water loss concentrates the nutrients (often increasing protein to 28–36g per 100g).

Subsections of Homework

Week 1 HW: Principles and Practices


(The Challenge)

Mastitis is an inflammation of the mammary gland in dairy cows, often caused by bacterial pathogens. It is a common and costly issue in dairy farms, leading to significant economic losses and affecting overall milk quality (Damasceno et al., 2025). The condition can arise from various factors, including poor hygiene, stress, and injuries to the udder. Common bacterial pathogens responsible for mastitis include Staphylococcus aureus, Escherichia coli, and Streptococcus uberis, which can enter the udder through damaged skin or during milking. Mastitis presents in either clinical or subclinical form. Coliforms such as Escherichia coli, Klebsiella spp., and Enterobacter spp. account for 40% of clinical mastitis cases.

While antibiotics remain the primary treatment for clinical mastitis, their overuse and misuse on farms contribute to antimicrobial resistance (AMR) and the emergence of drug-resistant mastitis-causing pathogens, making infections harder to control and complicating disease management. Mastitis is difficult to eradicate, but its prevalence can be significantly reduced through proper farm management and preventive measures that keep dairy production profitable. One such method is the use of teat-dipping disinfectants before and after milking. Current post-milking disinfectants are designed to reduce bacterial exposure and control infections. However, their effectiveness can vary, and there are limitations to their use.

Commercial teat dips typically contain active ingredients such as iodine, chlorhexidine, and lactic acid, each chosen for their bactericidal properties. Iodine-based products have long been favored for their broad-spectrum efficacy; however, concerns about iodine residues in milk are prompting farmers to consider alternatives. Chlorhexidine is another common disinfectant known for its effectiveness against various pathogens, but its performance can vary significantly depending on the formulation. Lactic acid is gaining popularity as well, particularly when combined with hydrogen peroxide, which has shown promising results in reducing bacterial loads. Nevertheless, these traditional disinfectants are not without limitations. For example, their effectiveness might diminish in the presence of organic matter, which can inhibit their action on bacteria (Fitzpatrick et al., 2021).

Reliance on these products may lead to issues such as the development of resistant bacterial strains, raising questions about their long-term efficacy (Damasceno et al., 2025). Given these challenges, the dairy industry is increasingly looking for innovative solutions to enhance mastitis control. This exploration includes synthetic antimicrobial peptides, which could provide a more effective and safer alternative to conventional disinfectants, addressing both efficacy and the concerns associated with traditional products (Ózsvári & Ivanyos, 2022).

(The Intervention)

Synthetic antimicrobial peptides (AMPs) are a promising substitute for traditional post-milking teat dips in dairy farming. The peptides can be designed to control a broad range of bacteria, making them effective against the pathogens responsible for mastitis. AMPs can act by disrupting bacterial cell membranes, which leads to cell death. This mode of attack differs from that of many conventional disinfectants, which often rely on chemical reactions.

(The Promise)

A safer, more effective preventative solution to control mastitis.


Week 1 HW: Governance Policy Goals


Week 1 HW: Current Status & Potential Governance Actions

In Kenya, there are currently no defined regulations for products developed through synthetic biology. Medicines and veterinary products are regulated mainly through the Pharmacy and Poisons Board and the Directorate of Veterinary Services. Products that may contain foreign DNA would be considered GMOs and would therefore be regulated by the National Biosafety Authority.

There is a need to explore the development of a regulatory pathway within existing Kenyan institutions for synthetic biology-based products. This will ensure that the products meet all safety requirements and that there is public confidence in their safety. However, there is concern that the introduction of these regulations may over-extend approval timelines and regulatory processes, potentially delaying research progress and the deployment of synthetic biology–derived products. The following governance actions are meant to help balance building public trust and ensuring that the products from synthetic biology reach the end users.

Development of a regulatory pathway for the regulation of Syn-Bio products

Drawing lessons from how gene-edited products came to be regulated, a similar approach can be used to decide how syn-bio products are regulated. It can be two-tiered: level 1 covers products that do not incorporate foreign DNA, and level 2 covers those that do. A clear regulatory pathway can be developed for each level, with the different government agencies working together on factors such as the maximum residue limit in milk and milk products, as in the case of antimicrobial peptides.

For this to be achieved, there is a need for buy-in from the National Government, researchers, and regulatory bodies such as the NBA (biotechnology regulation), NEMA (environmental regulation), the Directorate of Veterinary Services, and the public health institutes.

One of the main assumptions is that the various organizations have the technical capacity and know-how to regulate synthetic biology products. Another is that an existing risk assessment structure can easily be adapted for antimicrobial peptides, and for ranking syn-bio products as high, moderate, or low risk.

Develop and Implement a stewardship plan

Drawing from best practices from stewardship plans from biotech products, a detailed stewardship plan can be developed by regulators and researchers to ensure that farmers use the products as stipulated and that there is a way of ensuring that they are not abused. A quick example would be guidelines on how to ensure traceability and compliance, and use AMP-based dips as part of integrated mastitis management plans and not as the magic bullet that would solve all the problems.

The assumptions made are that the products can be easily traceable and that the farmers would be willing to keep proper records without tokenism involved. The expectation is that the farmers will follow the farm management and record-keeping plan provided, and that they will be honest and not falsify the records.


Week 1 HW: Week 2 Lecture Prep

Homework Questions from Professor Jacobson

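Nature’s machinery for copying DNA is called polymerase. What is the error rate of polymerase? How does this compare to the length of the human genome? How does biology deal with that discrepancy?

Error rate: about 1 in 10⁶ bases copied. The human genome is roughly 3.2 × 10⁹ base pairs long, so at that rate each round of replication would introduce on the order of thousands of errors. Biology deals with this through proofreading and by replacing the mismatched base pairs.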
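The comparison between the error rate and the genome length is simple arithmetic, and a minimal sketch of it is shown here. The genome length of 3.2 × 10⁹ bp is an assumed round figure and the function name is only illustrative.

```python
# Expected copying errors per round of genome replication.
ERROR_RATE = 1e-6        # errors per base copied (1 in 10^6, as in the answer)
GENOME_SIZE_BP = 3.2e9   # assumed approximate length of the human genome


def expected_errors(error_rate: float, genome_size_bp: float) -> float:
    """Expected number of misincorporated bases in one full copy."""
    return error_rate * genome_size_bp


errors = expected_errors(ERROR_RATE, GENOME_SIZE_BP)
print(f"Expected errors per replication without repair: {errors:.0f}")
```

With these assumed inputs the estimate comes to roughly 3,200 errors per copy, which is why proofreading and repair are needed.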

How many different ways are there to code (DNA nucleotide code) for an average human protein? In practice what are some of the reasons that all of these different codes don’t work to code for the protein of interest?

Homework Questions from Dr. LeProust

What’s the most commonly used method for oligo synthesis currently? Phosphoramidite synthesis

Why is it difficult to make oligos longer than 200nt via direct synthesis? Each base-addition step is slightly less than 100% efficient, and these inefficiencies compound, so the fraction of full-length product falls with every added base.
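The compounding effect can be sketched numerically. The per-step coupling efficiency of 99% used below is an assumption chosen for illustration, not a figure from the lecture, and the function name is hypothetical.

```python
def full_length_fraction(coupling_efficiency: float, length_nt: int) -> float:
    """Fraction of chains that reach full length.

    A chain of n bases needs n - 1 coupling steps after the first base,
    and every step must succeed.
    """
    return coupling_efficiency ** (length_nt - 1)


COUPLING_EFFICIENCY = 0.99  # assumed per-step efficiency, illustration only

for n in (20, 100, 200, 2000):
    fraction = full_length_fraction(COUPLING_EFFICIENCY, n)
    print(f"{n:>5} nt: {fraction:.2e} of chains are full length")
```

Under this assumption fewer than 15% of chains are full length at 200 nt, and the fraction at 2000 nt is vanishingly small.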

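Why can’t you make a 2000bp gene via direct oligo synthesis? Errors accumulate with every added base, so at that length the error rate would be too high to recover a correct full-length product.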

Homework Question from George Church

What are the 10 essential amino acids in all animals and how does this affect your view of the “Lysine Contingency”?

Arginine, Histidine, Isoleucine, Leucine, Lysine, Methionine, Phenylalanine, Threonine, Tryptophan, and Valine.

Lysine is an essential amino acid, meaning that we have to eat certain foods to obtain it. Based on the story from Jurassic Park, the scientist inserted a gene that created a single faulty enzyme in protein metabolism. The animals could therefore not manufacture the amino acid lysine. This technically does not matter because with or without the engineered enzyme, the dinosaurs would still be dependent on the food they consume to get the nutrients. In the event they escaped, they would still eat other lysine-rich foods in nature and would get lysine and therefore survive without being in captivity.

Week 2 HW: Read, write and Edit

Part 1: Benchling & In-silico Gel Art

I began by signing into my Benchling account and creating a new folder called HTGAA-read, write, and edit. I then added a new sequence, the Lambda_NEB FASTA sequence. On the right-hand side, I used the icon that looks like scissors to enter the different restriction enzymes and then viewed the different digestion sites on the virtual digest tab. Below is the image from the virtual digest for the different enzymes.

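What a virtual digest computes can be sketched in a few lines of Python. This is not how Benchling implements it. It is a minimal illustration that assumes a linear molecule, uses EcoRI (G^AATTC) only as a familiar example enzyme, and runs on a made-up toy sequence rather than Lambda DNA.

```python
def find_cut_positions(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Return 0-based positions at which the top strand is cut."""
    sequence, site = sequence.upper(), site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence: str, cut_positions: list[int]) -> list[int]:
    """Fragment lengths for a linear molecule cut at the given positions."""
    bounds = [0] + sorted(cut_positions) + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy sequence with two EcoRI sites (hypothetical, for illustration only).
toy = "AAAA" + "GAATTC" + "CCCCCC" + "GAATTC" + "TT"
cuts = find_cut_positions(toy, "GAATTC", cut_offset=1)  # EcoRI cuts after the G
print("Cut positions:", cuts)
print("Fragment lengths:", fragment_lengths(toy, cuts))
```

The fragment lengths are what determine the band pattern on the gel, so changing the enzyme changes the picture.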

Part 3: DNA Design Challenge

3.1. Choose your protein

Through a literature search on PubMed, I looked for papers that have explored naturally occurring peptides with antimicrobial activity against both Gram-positive and Gram-negative bacteria. Staphylococcus aureus (Gram-positive), Escherichia coli (Gram-negative), and Streptococcus uberis (Gram-positive) are the most common mastitis-causing bacteria in dairy farming.

Pituitary adenylate cyclase-activating polypeptide (PACAP) is a naturally occurring cationic peptide known for its strong immunosuppressive and cell-protective effects. It belongs to the secretin, growth hormone-releasing hormone (GHRH), and vasoactive intestinal peptide (VIP) family, and exhibits significant anti-inflammatory and cytoprotective capabilities. While PACAP is most concentrated in the brain, it is also present in notable amounts in other tissues, such as the thymus, spleen, lymph nodes, and duodenal mucosa (Vaudry et al., 2009). Despite its therapeutic potential, the use of PACAP as a pharmaceutical agent is hindered by its extremely short half-life in the bloodstream after systemic delivery, largely due to rapid degradation—particularly by the enzyme DPP IV, which targets the peptide’s amino terminus through exopeptidase activity (Green et al., 2006). PACAP occurs in two bioactive forms, both amidated, consisting of either 38 or 27 amino acids.

Producing PACAP synthetically is crucial for scalability, uniformity, and ethical reasons (Starr et al., 2018). Since the peptide is primarily found in sensitive tissues like the brain and immune organs, extracting it from natural sources is neither feasible nor appropriate for widespread agricultural applications. Instead, recombinant methods using engineered microorganisms enable large-scale, cost-efficient production through fermentation, ensuring consistent quality and purity.

The UniProt Sequence

>sp|Q29W19|PACA_BOVIN Pituitary adenylate cyclase-activating polypeptide OS=Bos taurus OX=9913 GN=ADCYAP1 PE=2 SV=1
MTMCSGARLALLVYGILMHSSVYGSPAASGLRFPGIRPENEVYDEDGNPQQDFYDSESLG
VGSPASALRDAYALYYPAEERDVAHGILNKAYRKVLDQPSARRSPADAHGQGLGWDPGGS
ADDDSEPLSKRHSDGIFTDSYSRYRKQMAVKKYLAAVLGKRYKQRVKNKGRRIPYL

REFERENCES

Green, B. D., Irwin, N., & Flatt, P. R. (2006). Pituitary adenylate cyclase-activating peptide (PACAP): Assessment of dipeptidyl peptidase IV degradation, insulin-releasing activity and antidiabetic potential. Peptides, 27(6), 1349–1358. https://doi.org/10.1016/j.peptides.2005.11.010

Starr, C. G., Maderdrut, J. L., He, J., Coy, D. H., & Wimley, W. C. (2018). Pituitary adenylate cyclase-activating polypeptide is a potent broad-spectrum antimicrobial peptide: Structure-activity relationships. Peptides, 104, 35–40. https://doi.org/10.1016/j.peptides.2018.04.006

Vaudry, D., Falluel-Morel, A., Bourgault, S., Basille, M., Burel, D., Wurtz, O., Fournier, A., Chow, B. K. C., Hashimoto, H., Galas, L., & Vaudry, H. (2009). Pituitary Adenylate Cyclase-Activating Polypeptide and Its Receptors: 20 Years after the Discovery. Pharmacological Reviews, 61(3), 283–357. https://doi.org/10.1124/pr.109.001370

3.2. Reverse Translate: Protein (amino acid) sequence to DNA (nucleotide) sequence

I used the “Reverse Translate” tool of the Sequence Manipulation Suite to get the nucleotide sequence, and then confirmed that it was accurate using NCBI BlastX.

atgaccatgtgcagcggcgcgcgcctggcgctgctggtgtatggcattctgatgcatagc agcgtgtatggcagcccggcggcgagcggcctgcgctttccgggcattcgcccggaaaac gaagtgtatgatgaagatggcaacccgcagcaggatttttatgatagcgaaagcctgggc gtgggcagcccggcgagcgcgctgcgcgatgcgtatgcgctgtattatccggcggaagaa cgcgatgtggcgcatggcattctgaacaaagcgtatcgcaaagtgctggatcagccgagc gcgcgccgcagcccggcggatgcgcatggccagggcctgggctgggatccgggcggcagc gcggatgatgatagcgaaccgctgagcaaacgccatagcgatggcatttttaccgatagc tatagccgctatcgcaaacagatggcggtgaaaaaatatctggcggcggtgctgggcaaa cgctataaacagcgcgtgaaaaacaaaggccgccgcattccgtatctg

Screenshot of the BlastX results confirming accurate reverse translation
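The reverse-translation step and the check by translating back can be sketched as follows. This is not the Sequence Manipulation Suite's own code. The one-codon-per-amino-acid table is read off the codons that appear in the reverse-translated sequence above, and the function names are illustrative.

```python
# One codon per amino acid, as observed in the reverse-translated sequence.
PREFERRED_CODON = {
    "A": "gcg", "R": "cgc", "N": "aac", "D": "gat", "C": "tgc",
    "Q": "cag", "E": "gaa", "G": "ggc", "H": "cat", "I": "att",
    "L": "ctg", "K": "aaa", "M": "atg", "F": "ttt", "P": "ccg",
    "S": "agc", "T": "acc", "W": "tgg", "Y": "tat", "V": "gtg",
}

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(ALL_CODONS, AMINO_ACIDS))


def reverse_translate(protein: str) -> str:
    """Pick one fixed codon for every amino acid in the protein."""
    return "".join(PREFERRED_CODON[aa] for aa in protein.upper())


def translate(dna: str) -> str:
    """Translate complete codons of a coding sequence ('*' marks a stop)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


protein_start = "MTMCSGAR"  # first eight residues of the UniProt sequence
dna_start = reverse_translate(protein_start)
print(dna_start)
print(translate(dna_start) == protein_start)
```

Translating the DNA back and comparing it with the starting protein is the same kind of check that the BlastX search provides, just done locally.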

3.3. Codon optimization

Codon optimization is essential because, while several DNA codons can code for the same amino acid, different species show distinct preferences in codon usage. Expressing the bovine PACAP sequence directly in a microbial host without adjustment could lead to poor translation efficiency and, consequently, low protein output. To support economical, large-scale production, the gene sequence was adapted for use in Escherichia coli, a widely adopted system for recombinant protein expression. By aligning codon usage with the host’s natural preferences, translation becomes more efficient, protein yields improve, and overall production costs decrease, enabling viable industrial synthesis of the antimicrobial peptide.

Codon optimization using VectorBuilder

Improved sequences from VectorBuilder

The following is the improved DNA sequence post-codon optimization ATGACCATGTGCAGCGGCGCCCGTCTGGCCCTGCTGGTGTATGGCATTCTGATGCACAGCAGCGTGTATGGCTCGCCGGCGGCGAGCGGCCTGCGATTCCCGGGTATTCGTCCGGAAAATGAAGTGTACGATGAAGATGGCAACCCGCAGCAGGATTTTTATGATAGCGAAAGCCTGGGCGTTGGCAGCCCGGCGAGCGCGCTGCGCGATGCGTATGCGCTGTATTATCCGGCAGAAGAACGTGATGTGGCGCACGGCATTCTGAATAAAGCGTATCGCAAAGTGCTGGATCAGCCGAGCGCGCGCCGCAGCCCGGCAGATGCGCATGGCCAGGGTCTGGGCTGGGATCCGGGTGGCAGCGCCGATGATGATAGCGAACCGCTGAGCAAACGCCATAGCGATGGCATTTTTACCGATAGCTATAGCCGCTATC
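Codon optimization should change codons without changing the encoded protein. A minimal sketch of that check is shown here, using only the first nine codons of the reverse-translated and optimized sequences above. The function names are illustrative and this is not VectorBuilder's method.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip([a + b + c for a in BASES for b in BASES for c in BASES], AMINO_ACIDS)
)


def split_codons(dna: str) -> list[str]:
    """Split a coding sequence into complete upper-case codons."""
    dna = dna.upper()
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


def translate(dna: str) -> str:
    return "".join(CODON_TABLE[codon] for codon in split_codons(dna))


def changed_codons(original: str, optimized: str) -> list[tuple[int, str, str]]:
    """List (1-based codon number, old codon, new codon) for every difference."""
    pairs = zip(split_codons(original), split_codons(optimized))
    return [(i, old, new) for i, (old, new) in enumerate(pairs, 1) if old != new]


original_start = "atgaccatgtgcagcggcgcgcgcctg"   # from the reverse translation
optimized_start = "ATGACCATGTGCAGCGGCGCCCGTCTG"  # from the optimized sequence

print(translate(original_start) == translate(optimized_start))
print(changed_codons(original_start, optimized_start))
```

In this fragment two codons change (alanine and arginine) while the translated peptide stays the same.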

3.4. You have a sequence! Now what?

Once the codon-optimized DNA sequence for PACAP is available, it needs to be integrated into a biological system capable of transcribing the DNA into mRNA and subsequently translating that mRNA into the PACAP protein. Given the objective of producing PACAP as an affordable antimicrobial peptide for managing mastitis, a cell-based production system using E. coli fermentation is the most suitable choice. This approach offers high protein output, low manufacturing costs, scalability for industrial use, and practical potential for commercial application. The PACAP gene would first be incorporated into an expression vector, typically a plasmid, which is then delivered into Escherichia coli cells. Within the bacterial host, RNA polymerase recognizes the promoter region and initiates transcription of the PACAP gene into mRNA. Ribosomes then attach to the mRNA and begin translation, synthesizing the PACAP peptide with the help of tRNAs that correspond to the optimized codons. As the bacteria grow and divide, they generate substantial quantities of the peptide, which can later be harvested and purified.

Part 4: Prepare a Twist DNA Synthesis Order

4.2. Build Your DNA Insert Sequence

Following the instructions from the homework, I manually inserted the following sequences sequentially into the bases section:

  • Promoter (e.g. BBa_J23106): TTTACGGCTAGCTCAGTCCTAGGTATAGTGCTAGC
  • RBS (e.g. BBa_B0034 with spacers for optimal expression): CATTAAAGAGGAGAAAGGTACC
  • Coding Sequence (the codon-optimized sequence already had a start codon, so I did not put two start codons)
  • 7x His Tag (a 7×His tag added at the C-terminus of the protein to enable protein purification from E. coli): CATCACCATCACCATCATCAC
  • Stop Codon: TAA
  • Terminator (e.g. BBa_B0015): CCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGGTGAACGCTCTCTACTAGAGTCACACTGGCTCACCTTCGGGTGGGCCTTTCTGCGTTTATA
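The order of the parts and the reading frame of the tagged coding sequence can be checked with a short script. In this minimal sketch the part sequences are the ones listed above, but the coding sequence is a placeholder made of only the first six codons of the optimized gene, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Parts in 5' to 3' order. The "cds" entry is a short placeholder, not the full gene.
PARTS = [
    ("promoter", "TTTACGGCTAGCTCAGTCCTAGGTATAGTGCTAGC"),
    ("rbs", "CATTAAAGAGGAGAAAGGTACC"),
    ("cds", "ATGACCATGTGCAGCGGC"),
    ("his_tag", "CATCACCATCACCATCATCAC"),
    ("stop", "TAA"),
    ("terminator",
     "CCAGGCATCAAATAAAACGAAAGGCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTTTGTCGG"
     "TGAACGCTCTCTACTAGAGTCACACTGGCTCACCTTCGGGTGGGCCTTTCTGCGTTTATA"),
]


def assemble(parts: list[tuple[str, str]]) -> str:
    """Concatenate the parts in the order given."""
    return "".join(sequence for _, sequence in parts)


def orf_is_valid(orf: str) -> bool:
    """Check start codon, frame, a single stop at the end, and no internal stop."""
    orf = orf.upper()
    if len(orf) % 3 != 0 or not orf.startswith("ATG"):
        return False
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    no_internal_stop = not any(codon in STOP_CODONS for codon in codons[:-1])
    return codons[-1] in STOP_CODONS and no_internal_stop


parts = dict(PARTS)
orf = parts["cds"] + parts["his_tag"] + parts["stop"]
insert = assemble(PARTS)
print("Insert length:", len(insert))
print("Tagged ORF in frame with a single stop:", orf_is_valid(orf))
```

The same check would flag a coding sequence that still carries its own stop codon before the His tag, which would prevent the tag from being translated.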

Here is the link to the annotated sequence: https://benchling.com/s/seq-7dMM0y6i3bzKtgtg4ARN?m=slm-ovXPs57Txzw1F9ZhxeXZ

Below are the screenshots from the various steps during the annotation process.

Annotation window

Linear map of the annotated sequence.

Twist Workflow


Construct - Twist

Plasmid with expression cassette - Benchling


Part 5: DNA Read/Write/Edit

5.1 DNA Read

(i) What DNA would you want to sequence (e.g., read) and why?

In the context of mastitis control and epidemiological surveillance, I would sequence metagenomic DNA extracted from milk, teat swabs, and environmental samples collected from dairy farms with and without biosecurity measures. Sequencing this DNA would allow identification of the dominant mastitis-causing bacteria circulating within these farms and enable detection of antibiotic resistance genes present in the microbial populations. This information would provide real-world insight into pathogen prevalence and resistance patterns, which would directly inform the rational design and optimisation of my PACAP-based antimicrobial peptide to ensure it is effective against the most relevant and resistant strains. In addition, I would sequence my final PACAP expression construct to confirm sequence accuracy and ensure that no mutations were introduced during cloning or synthesis.

(ii) In lecture, a variety of sequencing technologies were mentioned. What technology or technologies would you use to perform sequencing on your DNA and why?

For metagenomic analysis of dairy farm samples, I would use shotgun next-generation sequencing (NGS), specifically a short-read platform such as Illumina sequencing. This method is classified as a second-generation sequencing technology.

The input for this method would be total DNA extracted from milk, teat swabs, or environmental samples collected from dairy farms. After extraction, the DNA must be prepared into a sequencing library. (Usually, a step-by-step protocol is provided for preparing the libraries, including how to pool them and send them for sequencing.)

The output of this sequencing technology is a large dataset of short DNA reads in FASTQ format, which includes both the nucleotide sequence and associated quality scores for each base. These reads can then be assembled or mapped against reference databases to identify bacterial species present in the samples and detect antibiotic resistance genes.
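To make the FASTQ output concrete, here is a minimal sketch of reading records and decoding their quality scores. It assumes the common four-lines-per-record layout and Phred+33 encoding, and the example read is made up.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality_line = handle.readline().strip()
        # Phred+33: quality score = ASCII code of the character minus 33.
        qualities = [ord(char) - 33 for char in quality_line]
        yield header[1:], sequence, qualities


def mean_quality(qualities: list[int]) -> float:
    return sum(qualities) / len(qualities)


# Made-up single read, for illustration only.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
for read_id, sequence, qualities in records:
    print(read_id, sequence, qualities, mean_quality(qualities))
```

In a real analysis the reads would come from the sequencer's files and low-quality reads would be filtered before assembly or mapping.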

5.2 DNA Write

(i) What DNA would you want to synthesize (e.g., write) and why?

I would synthesise a codon-optimised version of the PACAP-27 gene designed for efficient expression in Escherichia coli. The sequence would be optimised to match the preferred codon usage of the host organism in order to maximise protein yield. Additionally, based on insights gained from metagenomic surveillance data, I may incorporate rational modifications to improve antimicrobial activity, stability, or resistance to proteolytic degradation. Synthesising the gene rather than amplifying it from a natural source allows full control over sequence design and ensures the construct is tailored specifically to the pathogens identified in dairy farm environments.

(ii) What technology or technologies would you use to perform this DNA synthesis and why?

I would use commercial chemical DNA synthesis technology, which relies on phosphoramidite chemistry to generate short oligonucleotides that are enzymatically assembled into full-length genes. This method allows precise sequence customisation, incorporation of optimised codons, and removal of unwanted restriction sites. It is reliable, accurate, and particularly suitable for small genes such as PACAP, making it an efficient approach for generating a ready-to-clone expression construct.

5.3 DNA Edit

(i) What DNA would you want to edit and why?

I would edit the PACAP gene sequence to enhance its antimicrobial properties while maintaining safety and specificity. Edits may include amino acid substitutions that increase cationic charge, improve membrane disruption, or enhance activity against Gram-negative bacteria and biofilm-forming strains identified in the metagenomic analysis. These modifications would be guided by epidemiological data to ensure the peptide is effective against the most prevalent mastitis pathogens. I may also edit regulatory elements within the expression vector to optimise protein yield or secretion efficiency.

(ii) What technology or technologies would you use to perform these DNA edits and why?

To introduce precise amino acid substitutions into the PACAP gene, I would use site-directed mutagenesis, as it enables targeted and controlled sequence modifications without altering the rest of the construct. Site-directed mutagenesis would be sufficient to generate improved peptide variants informed by metagenomic surveillance data.
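At the sequence level, a site-directed change amounts to swapping one codon while leaving the rest of the construct untouched. The sketch below shows this on the first nine codons of the mature peptide (HSDGIFTDS) taken from the reverse-translated sequence. The D3K substitution is purely hypothetical, chosen to show a change in charge and not proposed as a design, and the charge calculation is a crude count that ignores histidine and the termini.

```python
def substitute_codon(cds: str, residue_number: int, new_codon: str) -> str:
    """Replace the codon for a 1-based residue number, leaving the rest unchanged."""
    if len(new_codon) != 3:
        raise ValueError("new_codon must be exactly three bases")
    start = (residue_number - 1) * 3
    if start < 0 or start + 3 > len(cds):
        raise ValueError("residue_number is outside the coding sequence")
    return cds[:start] + new_codon + cds[start + 3:]


def net_charge(peptide: str) -> int:
    """Crude net charge: +1 per K or R, -1 per D or E."""
    positive = sum(peptide.count(residue) for residue in "KR")
    negative = sum(peptide.count(residue) for residue in "DE")
    return positive - negative


fragment = "catagcgatggcatttttaccgatagc"            # encodes HSDGIFTDS
mutant = substitute_codon(fragment, 3, "aaa")       # hypothetical D3K: gat -> aaa

print(mutant)
print(net_charge("HSDGIFTDS"), "->", net_charge("HSKGIFTDS"))
```

In the lab the new codon would be introduced with mutagenic primers, and the construct would be sequenced afterwards to confirm that only the intended codon changed.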

Week 3 HW: Lab Automation

Python Script for Opentrons Artwork

I first designed a sunflower using the Opentrons-Art website. https://opentrons-art.rcdonovan.com/?id=e3z1i8r73863y1k


I then downloaded the Excel file and, in the HTGAA26 Opentrons Colab, used Gemini to help write the Python script. I uploaded the Excel sheet and prompted Gemini to assist in developing a script that would give me the sunflower design. Link to the Colab: https://colab.research.google.com/drive/1arsozAVNQhs-4Ol0LMIRKZ4QGVld0Kgf?usp=sharing

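The Colab script itself is linked above and is not reproduced here. As a rough illustration of one step such a script needs, the sketch below turns a small design grid into per-colour lists of offsets from the plate centre. The grid symbols, the 3 × 3 toy design, and the spacing are all hypothetical, and no Opentrons API calls are used.

```python
def grid_to_points(grid: list[str], spacing_mm: float = 1.0) -> dict:
    """Map a design grid to {symbol: [(x_mm, y_mm), ...]} with the origin at the centre.

    Each string is one row of the picture, '.' marks an empty position,
    and row 0 is the top of the picture (positive y).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    points = {}
    for row_index, row in enumerate(grid):
        for col_index, symbol in enumerate(row):
            if symbol == ".":
                continue
            x = (col_index - (n_cols - 1) / 2) * spacing_mm
            y = ((n_rows - 1) / 2 - row_index) * spacing_mm
            points.setdefault(symbol, []).append((x, y))
    return points


# Toy 3 x 3 "flower": Y = petal colour, B = centre colour (hypothetical symbols).
design = [
    ".Y.",
    "YBY",
    ".Y.",
]
points = grid_to_points(design, spacing_mm=2.0)
for symbol, coordinates in points.items():
    print(symbol, coordinates)
```

In a protocol, each list of offsets would be visited with the pipette loaded with the matching colour, dispensing a small drop at every point.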

Publication Title: Automation of biochemical assays using an open-sourced, inexpensive robotic liquid handler

While going through the paper, I analyzed two things.

  1. How the Opentrons OT-2 was used.
  2. How it made the work easier.

The Opentrons OT-2 was being used to automate biochemical assays for protein and DNA measurement. Specifically, the researchers developed two assays: the Bradford assay for protein quantification and the PicoGreen assay for double-stranded DNA measurement. These assays are highly relevant to vaccine development, as they provide information on efficacy, potency, safety, and quality control.

Based on this article, the OT-2 simplifies laboratory automation in several ways:

  1. Cost-effective: Unlike traditional commercial liquid handlers that cost over $250,000, this system costs under $10,000, making automation accessible to smaller labs.
  2. User-friendly programming: It uses Python programming language with open-source flexibility, reducing training requirements.
  3. Accurate and precise: The study found the pipettes exhibited excellent accuracy and precision, with relative inaccuracy of 1.30% for the P20 pipette and 0.53% for the P300 pipette.
  4. Time-saving: The Bradford assay completed in 75 minutes and the PicoGreen assay in 41 minutes, reducing manual labor.

How would I leverage this for my work

My work is aimed at producing antimicrobial peptides at industrial scale, and there are several ways I can leverage the OT-2 to my advantage.

The OT-2 could be adapted for this application in the following ways:

  • Quality control testing: The automated protein and DNA quantification assays demonstrated here could monitor protein expression levels and detect residual DNA contamination during antimicrobial peptide synthesis and purification.

  • Assay automation: The system’s ability to run customized Python protocols means I could develop automated workflows for peptide synthesis optimization, mixing reagents, and conducting potency assays.

  • High-throughput screening: The medium-throughput capability would allow testing multiple peptide variants or production conditions simultaneously to identify the most effective formulations.

Week 4 HW: protein-design-part-i

Part A. Conceptual Questions

  1. How many molecules of amino acids in 500g of meat? A quick search shows that, for example, beef contains ~20–22g of protein per 100g of meat.

Note that raw meat contains more water, so the weight of amino acids per 100g is lower than in cooked meat, where water loss concentrates the nutrients (often increasing protein to 28–36g per 100g).

Taking 20g of protein per 100g of meat:
Protein in 500g of meat = (500 × 20) / 100 = 100g
Assuming an average amino acid molecular weight of ~100 Da (100 g/mol):
Moles of amino acids = 100g ÷ 100 g/mol = 1 mol
Number of molecules = 1 mol × 6.022 × 10²³ per mol ≈ 6 × 10²³ amino acid molecules
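The same estimate can be written as a small function so the inputs are easy to vary, for example for cooked meat. This is a minimal sketch using the values assumed in the calculation, and the function name is illustrative.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_molecules(meat_g: float, protein_fraction: float,
                         residue_mass_g_per_mol: float) -> float:
    """Estimate the number of amino acid molecules in a portion of meat."""
    protein_g = meat_g * protein_fraction
    moles = protein_g / residue_mass_g_per_mol
    return moles * AVOGADRO


# 500 g of meat, 20% protein, ~100 g/mol per amino acid (values assumed above).
estimate = amino_acid_molecules(500, 0.20, 100)
print(f"{estimate:.1e} amino acid molecules")
```

Using a higher protein fraction for cooked meat scales the result proportionally.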

  2. Why do humans eat beef but do not become a cow, eat fish but do not become fish? The short answer is digestion. Proteases in the gut break all dietary proteins down to their constituent free amino acids, meaning the informational content (sequence) of the cow’s proteins is destroyed. Ribosomes then reassemble the amino acids in the order dictated by one’s own mRNA, which encodes human proteins.

Interesting fact I discovered: any intact foreign protein that does slip through (fragments sometimes escape digestion) is attacked by one’s immune system as a non-self antigen, which is the basis of food allergies.

  3. Why are there only 20 natural amino acids? One of the main explanations is evolutionary: the “frozen accident”, or historical contingency, hypothesis. The genetic code was fixed early in evolution (~3.5–4 billion years ago) and became locked in. Changing it would be catastrophic, since every codon reassignment would corrupt thousands of proteins simultaneously.

  4. Can you make other non-natural amino acids? Design some new amino acids. Yes, numerous non-natural (or non-canonical) amino acids (ncAAs) can be designed and synthesized using advanced chemical methods, such as palladium-catalyzed C-H bond functionalization. By incorporating unnatural amino acids, scientists can extend the genetic code, creating proteins with unique structural and functional properties.

  5. Where did amino acids come from before enzymes that make them, and before life started? Some of the theories of the origin of amino acids include:

    a) The Miller-Urey experiment (1953)
    b) Meteorite delivery
    c) Hydrothermal vents
    d) HCN and formaldehyde chemistry
    e) The “RNA world” scenario

  6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect? It would be left-handed. Proteins built from D-amino acids form left-handed α-helices, which are mirror images of the right-handed helices formed by L-amino acids.

  7. Can you discover additional helices in proteins? Proteins can adopt several types of helical structures, each with distinct geometric and hydrogen-bonding characteristics. The α-helix is the most common, but several other helical structures exist. These include:

  • 3₁₀-helix is a tighter helix with 3.0 residues per turn. Its hydrogen bonds form between residue i and i + 3, resulting in a rise of about 2.0 Å per residue. This helix often appears at the termini of α-helices rather than as long, independent structures.

  • π-helix is a rare and wider helix, containing 4.4 residues per turn. Its hydrogen bonding occurs between residue i and i + 5, with a rise of approximately 1.1 Å per residue. π-helices are uncommon but are frequently found at functional sites in proteins.

  • Polyproline II (PPII) helix has 3.0 residues per turn and lacks intramolecular hydrogen bonds. Instead, it adopts an extended left-handed conformation with a rise of about 3.1 Å per residue. This structure plays an important role in signaling, particularly in binding interactions such as those involving SH3 domains.

  • Polyproline I (PPI) helix contains 3.3 residues per turn and also lacks intramolecular hydrogen bonds. It is right-handed and more compact, with a rise of about 1.9 Å per residue. This form is typically observed in organic solvents.

  • Collagen helix consists of approximately 3.3 residues per turn and is stabilized by interchain hydrogen bonds rather than intramolecular ones. Each chain has a rise of about 2.9 Å per residue, and three chains wind together to form a characteristic triple helix. This structure is distinguished by its repeating Gly–X–Y sequence pattern.
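The geometric values above can be compared more easily by computing the pitch of each helix, that is, the rise per full turn (residues per turn × rise per residue). This is a minimal sketch that uses only the approximate numbers quoted in the list, and the dictionary keys are informal labels.

```python
# (residues per turn, rise per residue in angstroms), as quoted in the list above.
HELICES = {
    "3_10": (3.0, 2.0),
    "pi": (4.4, 1.1),
    "PPII": (3.0, 3.1),
    "PPI": (3.3, 1.9),
    "collagen": (3.3, 2.9),
}


def pitch(residues_per_turn: float, rise_per_residue: float) -> float:
    """Axial distance covered by one full turn, in angstroms."""
    return residues_per_turn * rise_per_residue


for name, (residues_per_turn, rise) in HELICES.items():
    print(f"{name:>8}: pitch = {pitch(residues_per_turn, rise):.2f} Å")
```

The extended helices (PPII and collagen) come out with the largest pitch, which matches their description as stretched-out structures.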

  8. Why are most molecular helices right-handed? Natural proteins are composed mainly of L-amino acids. The stereochemistry of the peptide backbone energetically favors right-handed α-helices.

  9. Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation? β-sheets hydrogen bond through their edge strands, which have free NH and C=O groups pointing outward, ready to hydrogen bond with another β-sheet edge. Unlike α-helices (where all H-bond donors/acceptors are internally satisfied), β-sheet edges are “sticky.”

Part B: Protein Analysis and Visualization

One of the final individual projects I want to do is the development of a biosensor for the diagnosis of Vibrio cholerae. To help me learn more about V. cholerae, I will use this assignment to study one of the main toxins associated with its toxicity.

Cholera toxin (CT) is the major virulence factor of the bacterium Vibrio cholerae and causes severe diarrheal symptoms in infected individuals. It is a member of the AB toxin family with an A-B5 architecture: a single catalytically active A subunit (described as heterodimeric, since it is made up of the A1 and A2 fragments) linked to a homopentamer of five identical B subunits.

  1. Briefly describe the protein you selected and why you selected it. The protein I have selected is the Cholera enterotoxin subunit A. Cholera toxin subunit A (CTA) is the catalytically active component of the AB₅ cholera enterotoxin secreted by Vibrio cholerae, and it is the central molecular component responsible for the devastating secretory diarrhea that defines cholera disease.

  2. Identify the amino acid sequence of your protein. (For this part I searched for cholera toxin on UniProt, after which I selected the Cholera enterotoxin subunit A entry.)

The amino acid sequence MVKIIFVFFIFLSSFSYANDDKLYRADSRPPDEIKQSGGLMPRGQSEYFDRGTQMNINLYDHARGTQTGFVRHDDGYVSTSISLRSAHLVGQTILSGHSTYYIYVIATAPNMFNVNDVLGAYSPHPDEQEVSALGGIPYSQIYGWYRVHFGVLDEQLHRNRGYRDRYYSNLDIAPAADGYGLAGFPPEHRAWREEPWIHHAPPGCGNAPRSSMSNTCDEKTQSLGVKFLDEYQSKVKRQIFSGYQSDIDTHNRIKDEL

Sequence length: 258 amino acids

Amino Acid Frequency

Using the provided Colab notebook, I obtained the amino acid counts.

alt text alt text

The output is as follows: S: 23, G: 22, D: 19, Y: 18, R: 17, I: 16, L: 16, A: 15, P: 14, V: 13, F: 12, Q: 12, N: 11, E: 11, H: 11, T: 10, K: 8, M: 5, W: 3, C: 2

The letters represent the amino acids: S = Serine, G = Glycine, D = Aspartate, Y = Tyrosine, R = Arginine, I = Isoleucine, L = Leucine, A = Alanine, P = Proline, V = Valine, F = Phenylalanine, Q = Glutamine, N = Asparagine, E = Glutamate, H = Histidine, T = Threonine, K = Lysine, M = Methionine, W = Tryptophan, C = Cysteine.

The most abundant amino acid is Serine (S, n=23), closely followed by Glycine (G, n=22) and Aspartate (D, n=19). The overall composition hints at a hydrophilic, surface-exposed protein, given the prevalence of polar and charged residues (S, D, R, Y, Q, N, E, H).
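For reference, the counting step can be reproduced outside the notebook. This is a minimal sketch using Python's standard `collections.Counter`, not the course notebook's own code; the names `CTA_SEQUENCE` and `amino_acid_counts` are illustrative, and the sequence is the one pasted above.

```python
from collections import Counter

# Cholera enterotoxin subunit A sequence, copied from the text above.
CTA_SEQUENCE = (
    "MVKIIFVFFIFLSSFSYANDDKLYRADSRPPDEIKQSGGLMPRGQSEYFD"
    "RGTQMNINLYDHARGTQTGFVRHDDGYVSTSISLRSAHLVGQTILSGHST"
    "YYIYVIATAPNMFNVNDVLGAYSPHPDEQEVSALGGIPYSQIYGWYRVHF"
    "GVLDEQLHRNRGYRDRYYSNLDIAPAADGYGLAGFPPEHRAWREEPWIHH"
    "APPGCGNAPRSSMSNTCDEKTQSLGVKFLDEYQSKVKRQIFSGYQSDIDT"
    "HNRIKDEL"
)


def amino_acid_counts(sequence: str) -> Counter:
    """Count how often each one-letter amino acid code occurs."""
    return Counter(sequence.upper())


counts = amino_acid_counts(CTA_SEQUENCE)

# Print residues from most to least frequent, with their share of the sequence.
for residue, n in counts.most_common():
    print(f"{residue}: {n} ({100 * n / len(CTA_SEQUENCE):.1f}%)")
```

`Counter.most_common()` sorts by count, so the first line printed is the most abundant residue.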

Protein Sequence Homologs

Protein homologs are proteins that share a common evolutionary origin. They come from the same ancestral gene, even if their sequences or functions have changed over time. There are two main types:

  • Orthologs – Proteins in different species that evolved from a common ancestral gene after a speciation event. They often retain similar functions.
  • Paralogs – Proteins within the same species that arose by gene duplication. They may evolve new or specialized functions over time.

The BlastKB search returned 250 hits.

The search for cholera toxin homologs yielded 250 results distributed across three groups: fungi, bacteria, and viruses. The vast majority of results (207 sequences, 83%) came from fungi, particularly Ascomycota species, with notable representation from entomopathogenic fungi in the families Ophiocordycipitaceae, Clavicipitaceae, and Cordycipitaceae, as well as the plant pathogen Colletotrichum (84 results). Bacterial homologs comprised 42 sequences (17%), dominated by Pseudomonadota, including pathogenic species such as Escherichia coli, Vibrio cholerae, and Paraburkholderia, alongside other bacteria such as Bartonella and Leptospira. A single viral sequence was identified from Vibrio phage CTXphi, the well-characterized prophage known to carry the cholera toxin genes themselves, which supports the validity of the search. The predominance of fungal and bacterial sequences suggests that toxin-like proteins with homology to cholera toxin are widespread across microbial organisms, though the biological significance of the fungal matches warrants further investigation, since cholera-like enterotoxins are not typically associated with fungi.
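The percentages quoted above can be checked from the raw hit counts. This is a small sketch that only uses the three counts reported in the paragraph; the variable names are illustrative.

```python
# Hit counts per group, as reported in the paragraph above.
hits = {"fungi": 207, "bacteria": 42, "viruses": 1}

total = sum(hits.values())

# Share of each group as a percentage of all hits, rounded to one decimal.
shares = {group: round(100 * n / total, 1) for group, n in hits.items()}

for group, share in shares.items():
    print(f"{group}: {hits[group]} hits ({share}%)")
print(f"total: {total}")
```

The three groups add up to the 250 hits reported, and the fungal and bacterial shares round to the 83% and 17% quoted in the text.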

alt text alt text

Protein Family

Based on the results, the protein belongs to the Cholera enterotoxin subunit A family.

  1. Identify the structure page of your protein in RCSB. When was the structure solved? Is it a good quality structure? A good quality structure is one with good resolution; the smaller the better (Resolution: 2.70 Å).

It was solved on 2004-04-06. It has a resolution of 1.9 Å, which was obtained through X-ray diffraction. I would say it is a good quality structure, since the resolution is smaller than 2.70 Å.

alt text alt text

Are there any other molecules in the solved structure apart from protein?

There are two unique ligands present in the structure: beta-D-galactopyranose and a sodium ion.

alt text alt text

Does your protein belong to any structure classification family?

Yes. It belongs to the ADP-ribosylating toxins. (I am not sure about this answer.)

alt text alt text

  1. Open the structure of your protein in any 3D molecule visualization software: I used PyMOL to visualize my protein of interest. I downloaded the PDB file for the protein, then in PyMOL went to File, then Open, and selected the downloaded PDB file. (ChatGPT helped with the code for the different visualizations.)
alt text alt text
  • Cartoon

I used the following commands

PyMOL>hide everything

PyMOL>show cartoon

PyMOL>color red

PyMOL>util.cbss (color by secondary structure; util.cbc would color by chain instead)

alt text alt text
  • Ribbon

I used the following commands

PyMOL>hide everything

PyMOL>show ribbon

alt text alt text

  • Ball and Stick

I used the following commands

PyMOL>hide everything

PyMOL>show sticks

PyMOL>show spheres

To improve the appearance of the structure

PyMOL>set sphere_scale, 0.25

PyMOL>set stick_radius, 0.15

alt text alt text

To answer the following questions I used ChatGPT to explain how to identify the different structures and to understand the residues.

  • Color the protein by secondary structure. Does it have more helices or sheets? When coloring by secondary structure with util.cbss, PyMOL uses these colors by default:

Red → α-helices

Yellow → β-sheets

Green → loops/coils

The protein has α-helices, which appeared as red spiral (corkscrew) shapes resembling long cylindrical coils. It also has β-sheets, which appeared as flat, arrow-shaped yellow strands, with multiple arrows lying next to each other. It also has loops, which appeared as thin green connecting regions.

To tell whether the structure has more helices or sheets, I ran the following commands

PyMOL>select helices, ss h

Selector: selection “helices” defined with 3854 atoms.

PyMOL>select sheets, ss s

Selector: selection “sheets” defined with 2744 atoms.

My protein has more helices: 3854 atoms are in helices versus 2744 atoms in sheets.
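As a cross-check on the atom counts, helix and sheet content can also be counted per residue from the HELIX and SHEET records in the downloaded PDB file. This is a sketch of an alternative approach, not what PyMOL does internally; the function name `count_ss_residues` is illustrative, the two demo lines are made-up records in the standard fixed-column layout (not lines from my structure), and the result may differ from PyMOL's atom-level selections.

```python
def count_ss_residues(pdb_text: str) -> dict:
    """Count unique residues covered by HELIX and SHEET records of a PDB file."""
    helix, sheet = set(), set()
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "HELIX":
            # Chain ID in column 20; start/end residue numbers in columns 22-25 and 34-37.
            chain = line[19]
            start, end = int(line[21:25]), int(line[33:37])
            helix.update((chain, i) for i in range(start, end + 1))
        elif record == "SHEET":
            # Chain ID in column 22; start/end residue numbers in columns 23-26 and 34-37.
            chain = line[21]
            start, end = int(line[22:26]), int(line[33:37])
            # Sets avoid double counting strands listed in more than one sheet.
            sheet.update((chain, i) for i in range(start, end + 1))
    return {"helix": len(helix), "sheet": len(sheet)}


# Made-up demo records in PDB fixed-column format (not from the real structure).
demo = "\n".join([
    "HELIX    1  HA GLY A   86  GLY A   94  1                                   9",
    "SHEET    1   A 5 THR A 107  ARG A 110  0",
])
demo_counts = count_ss_residues(demo)
print(demo_counts)
```

To use it on a real structure, read the downloaded file with `open(path).read()` and pass the text to `count_ss_residues`. Insertion codes and negative residue numbering are not handled specially in this sketch.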

  • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

PyMOL reported: “hydrophobic” defined with 4211 atoms. “hydrophilic” defined with 3626 atoms.

There are more hydrophobic atoms than hydrophilic atoms.
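The same question can be looked at per residue using the amino acid counts reported earlier for subunit A. This is a rough sketch under one stated assumption: the hydrophobic set used here (A, V, L, I, M, F, W, P) is a common textbook choice, not necessarily the selection used in PyMOL, and other schemes also include G, C, or Y. It covers only the subunit A sequence, while the atom counts above come from the whole loaded structure, so the two views need not agree.

```python
# Residue counts as reported in the Amino Acid Frequency output above.
COUNTS = {
    "S": 23, "G": 22, "D": 19, "Y": 18, "R": 17, "I": 16, "L": 16,
    "A": 15, "P": 14, "V": 13, "F": 12, "Q": 12, "N": 11, "E": 11,
    "H": 11, "T": 10, "K": 8, "M": 5, "W": 3, "C": 2,
}

# ASSUMPTION: one common hydrophobic set; classification schemes differ.
HYDROPHOBIC = set("AVLIMFWP")

total = sum(COUNTS.values())
hydrophobic = sum(n for aa, n in COUNTS.items() if aa in HYDROPHOBIC)
hydrophilic = total - hydrophobic

print(f"hydrophobic residues: {hydrophobic} ({100 * hydrophobic / total:.1f}%)")
print(f"other residues:       {hydrophilic} ({100 * hydrophilic / total:.1f}%)")
```

Changing the `HYDROPHOBIC` set shows how much the answer depends on the classification chosen, which is worth keeping in mind when comparing against the atom-based counts.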

  • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

  1. Deep Mutational Scans

Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.

alt text alt text

Can you explain any particular pattern? (choose a residue and a mutation that stands out) The protein has several highly conserved positions (deep blue vertical stripes).

C2. Protein Folding

C3. Protein Generation

Part D. Group Brainstorm on Bacteriophage Engineering