Homework

Weekly homework submissions:

  • Week 1 HW: Principles and Practices


  • DNA Read, Write, and Synthesis


  • Genetic Circuits - 1


  • Protein design - 1


  • Week-07 Genetic Circuits - 2


Subsections of Homework

Week 1 HW: Principles and Practices

  1. Biological Engineering Application or Tool

The proposed application is an AI-guided protein therapeutic discovery and bioproduction platform. The system uses machine learning–based protein design models to generate novel therapeutic protein candidates, such as antimicrobial proteins, enzymes, or biologics optimized for stability and activity. These candidates are then evaluated for manufacturability and functional performance using controlled bioproduction workflows, including microbial expression or cell-free systems.

This application reflects an emerging paradigm in biopharmaceutical development, where AI accelerates early-stage discovery while scalable bioproduction determines clinical and commercial feasibility. However, as AI enables rapid de novo protein design, many generated sequences may lack homology to known natural proteins, introducing novel biosecurity and safety risks if not properly governed.

  2. Governance / Policy Goals

The overarching governance goal is to ensure that AI-enabled protein drug discovery and bioproduction contribute to a safe, ethical, and socially beneficial future, while preventing misuse or unintended harm. This goal can be divided into the following sub-goals:

2.1. Non-malfeasance and biosecurity

  Prevent the accidental or intentional creation of harmful, toxic, or dual-use proteins enabled by AI-assisted design.

2.2. Responsible scale-up and traceability
Ensure that the transition from digital protein design to physical bioproduction is secure, auditable, and accountable.

2.3. Preservation of constructive innovation
Maintain open scientific collaboration and efficient therapeutic development without imposing unnecessary regulatory burdens that would slow innovation.

These goals align with arguments advanced by Baker and Church, who emphasize that enhanced biosecurity should be embedded into protein design and DNA synthesis infrastructure without undermining transparency or information sharing.

  3. Governance Action (Purpose, Design, Assumptions, Risks)

3.1 Governance Action 1: Integrated Safety Screening and Secure Sequence Logging

Purpose

Currently, AI protein design pipelines primarily optimize for functional performance, and existing biosecurity measures rely heavily on sequence homology screening at the DNA synthesis stage. As Baker and Church note, this approach is increasingly insufficient for de novo designed proteins. This project proposes an integrated governance mechanism that embeds mandatory AI-based safety screening and secure sequence logging directly into the protein design and bioproduction pipeline.

Design

This governance approach would be implemented through collaboration among AI tool developers, biopharmaceutical companies, and DNA synthesis or bioproduction providers. All AI-generated protein sequences would undergo computational screening for toxicity, virulence, and dual-use potential before synthesis approval. Once synthesized, sequences would be logged in encrypted repositories tied to production systems, with access restricted to exceptional circumstances such as public health investigations. This design enables traceability and accountability while protecting intellectual property and minimizing interference with normal research workflows.
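The screen-then-log step described above can be sketched in a few lines. This is only an illustration under stated assumptions: `RISK_THRESHOLD`, `screen_and_log`, and the externally supplied `risk_score` are hypothetical names, and a SHA-256 digest stands in for the encrypted repository entry (a real deployment would need actual encryption and access control).

```python
import hashlib
from datetime import datetime, timezone

RISK_THRESHOLD = 0.5  # hypothetical cut-off; a real value would need validation


def screen_and_log(sequence: str, risk_score: float, log: list[dict]) -> bool:
    """Approve a sequence only if its risk score is below the threshold.

    Every decision is logged, approved or not. Only a digest of the
    sequence is stored, so the log itself does not reveal the design.
    """
    approved = risk_score < RISK_THRESHOLD
    log.append({
        "sequence_sha256": hashlib.sha256(sequence.upper().encode("ascii")).hexdigest(),
        "risk_score": risk_score,
        "approved": approved,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return approved


audit_log: list[dict] = []
print(screen_and_log("MEEPQSDPSV", risk_score=0.1, log=audit_log))  # low-risk candidate
print(screen_and_log("MEEPQSDPSV", risk_score=0.9, log=audit_log))  # flagged candidate
```

Logging rejected candidates as well as approved ones is a design choice here: it keeps the audit trail complete even when synthesis is blocked.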

Assumptions

This approach assumes that predictive models for protein toxicity and risk are sufficiently accurate to identify high-risk candidates and that industry actors are willing to adopt shared security standards. It also assumes that secure logging can be implemented in a way that does not expose proprietary information or discourage legitimate research.

Risks of Failure and “Success”

Potential failure modes include false negatives that allow harmful proteins to proceed or false positives that block legitimate therapeutic candidates. Additionally, if logging systems are unevenly implemented, malicious actors may bypass regulated platforms. A potential risk of “success” is increased centralization of bioproduction infrastructure, which could disadvantage smaller labs or researchers in low-resource settings if access is not equitably managed.

3.2 Governance Action 2: Tiered Access and Credentialing for Advanced Protein Design Models

Purpose

Currently, many AI protein design tools are becoming increasingly accessible with minimal differentiation between low-risk exploratory use and high-risk de novo protein generation. This action proposes a tiered access system where more powerful generative protein design capabilities require additional credentials, training, or institutional affiliation.

Design

AI tool providers and research institutions would implement access tiers based on user role, training completion, and intended application. Basic design and analysis features would remain widely accessible, while advanced generative functions (e.g., unrestricted de novo protein design) would require completion of biosecurity and ethics training, institutional oversight, or project-level approval. This mirrors governance models used in high-performance computing, clinical data access, and human-subjects research.

Assumptions

This approach assumes that access restrictions can meaningfully reduce misuse without pushing users toward unregulated alternatives. It also assumes institutions are capable of fairly and consistently evaluating access requests.

Risks of Failure and “Success”

If too restrictive, tiered access could slow innovation or disadvantage independent researchers and low-resource institutions. If too permissive, it may fail to deter misuse. A risk of “success” is the normalization of credential-based gatekeeping that could reinforce existing inequities in global research participation.

3.3 Governance Action 3: Safety-by-Design Standards Linked to Incentives and Recognition

Purpose

While safety measures are often framed as compliance requirements, this action reframes governance as an incentive-based system that rewards early integration of biosecurity and safety considerations into AI-driven protein design and bioproduction.

Design

Funding agencies, journals, and investors would establish safety-by-design criteria as part of grant evaluation, publication standards, and due diligence. Projects that demonstrate integrated risk assessment, secure production workflows, and ethical reflection would receive preferential funding, expedited review, or public recognition. This approach aligns governance with existing academic and commercial reward structures rather than relying solely on enforcement.

Assumptions

This approach assumes that researchers and companies respond strongly to funding, publication, and reputational incentives. It also assumes evaluators have sufficient expertise to assess safety claims without turning the process into box-checking.

Risks of Failure and “Success”

If poorly designed, incentives may encourage superficial compliance rather than genuine risk mitigation. A risk of “success” is that safety standards become rigid or outdated, unintentionally discouraging novel approaches that do not fit existing evaluation frameworks.

  4. Does the option:                                   Option 1  Option 2  Option 3

    Enhance Biosecurity
    • By preventing incidents                              1         2         2
    • By helping respond                                   1         3         3
    Foster Lab Safety
    • By preventing incidents                              2         2         1
    • By helping respond                                   1         3         2
    Protect the environment
    • By preventing incidents                              2         3         2
    • By helping respond                                   1         3         2
    Other considerations
    • Minimizing costs and burdens to stakeholders         2         1         1
    • Feasibility?                                         1         2         2
    • Not impede research                                  2         1         1
    • Promote constructive applications                    1         2         1

  5. Evaluation and Prioritization of Governance Approach

Overall, Option 1 (integrated safety screening and secure sequence logging) performs well across the major policy goals of biosecurity, lab safety, and responsible innovation. By focusing on prevention at the design stage and accountability at the production stage, it strengthens biosecurity while remaining feasible and compatible with existing biopharmaceutical workflows. Although the approach introduces some additional cost and procedural overhead, it does not fundamentally impede research and instead helps reduce downstream failures and regulatory risk.
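As a rough cross-check of this prioritization, the scores from the matrix can be totalled per option. The sketch below assumes that every criterion carries equal weight and that 1 is the most favourable score; neither assumption is stated in the matrix itself, and the row labels are shortened for readability.

```python
# Scores per criterion as (Option 1, Option 2, Option 3), copied from the matrix.
scores = {
    "Biosecurity: preventing incidents": (1, 2, 2),
    "Biosecurity: helping respond": (1, 3, 3),
    "Lab safety: preventing incidents": (2, 2, 1),
    "Lab safety: helping respond": (1, 3, 2),
    "Environment: preventing incidents": (2, 3, 2),
    "Environment: helping respond": (1, 3, 2),
    "Minimizing costs and burdens": (2, 1, 1),
    "Feasibility": (1, 2, 2),
    "Not impede research": (2, 1, 1),
    "Promote constructive applications": (1, 2, 1),
}

# Equal-weight totals; lower is better under the assumed scale.
totals = [sum(row[i] for row in scores.values()) for i in range(3)]
best_option = totals.index(min(totals)) + 1

print(totals)       # [14, 22, 17]
print(best_option)  # 1
```

Under these assumptions Option 1 has the lowest total, which is consistent with the recommendation that follows.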

  6. Final Recommendation and Trade-offs

Based on this evaluation, the integrated safety screening and secure sequence logging approach should be prioritized as the primary governance mechanism for AI-enabled protein drug discovery and bioproduction. This strategy addresses the highest-risk stages—design and scale-up—while remaining technically feasible and aligned with existing biopharmaceutical practices. The key trade-off involves balancing innovation speed with safety and accountability. While additional screening and logging may introduce modest overhead, these costs are outweighed by reduced downstream failures, increased regulatory confidence, and improved public trust.

This recommendation is directed toward biopharmaceutical R&D leadership and regulatory agencies, where early alignment between AI-driven discovery and governance expectations can ensure that emerging therapeutic technologies are both innovative and trustworthy.


DNA Read, Write, and Synthesis

This week, we were tasked with using different tools to virtually read, write, and visualize DNA, using samples such as lambda DNA (from bacteriophage lambda, a virus that infects Escherichia coli) and the human tumor suppressor gene TP53.

Part 1 - Introduction and DNA digest.

Gel Electrophoresis

  • Gel - material
  • Electro - Electric
  • Phoresis - to transport
  • It is a method used to separate charged molecules, such as DNA fragments, by moving them through a gel (a semi-solid matrix) under an electric field.

Digested fragments of Lambda DNA

[Image: Design created]

[Image: Restriction enzyme digest of lambda DNA on Benchling]
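What a virtual digest computes can be sketched as follows. This is a simplified illustration, not Benchling's implementation: it assumes a linear molecule, searches only the top strand, and uses EcoRI (recognition site GAATTC, cut after the first G) purely as an example enzyme, since the enzymes used in my digest are not listed here. The function name `digest` and the toy sequence are made up for the example.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut a linear DNA sequence at every occurrence of a recognition site.

    cut_offset is the position within the site where the top strand is cut
    (1 for EcoRI, which cuts G^AATTC).
    """
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)

    fragments = []
    previous = 0
    for cut in cuts:
        fragments.append(sequence[previous:cut])
        previous = cut
    fragments.append(sequence[previous:])
    return fragments


toy = "AAAGAATTCCCCGAATTCTT"  # toy sequence with two EcoRI sites
fragments = digest(toy)
# On a gel, larger fragments migrate more slowly, so list sizes largest first.
print(sorted((len(f) for f in fragments), reverse=True))  # [9, 7, 4]
```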

Part 2

For this assignment, I chose the human tumor suppressor protein p53 (encoded by the TP53 gene). I chose it because I have previously carried out a comparative analysis with the mouse p53 protein (encoded by Trp53).

3.1 The full amino acid sequence of the p53 protein in FASTA format

>AAH03596.1 Tumor protein p53 [Homo sapiens]
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA
PRVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKT
CPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRN
TFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACAGR
DRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALEL
KDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

3.2 Reverse Translated Sequence

atggaagaaccgcagagcgatccgagcgtggaaccgccgctgagccaggaaacctttagc gatctgtggaaactgctgccggaaaacaacgtgctgagcccgctgccgagccaggcgatg gatgatctgatgctgagcccggatgatattgaacagtggtttaccgaagatccgggcccg gatgaagcgccgcgcatgccggaagcggcgccgcgcgtggcgccggcgccggcggcgccg accccggcggcgccggcgccggcgccgagctggccgctgagcagcagcgtgccgagccag aaaacctatcagggcagctatggctttcgcctgggctttctgcatagcggcaccgcgaaa agcgtgacctgcacctatagcccggcgctgaacaaaatgttttgccagctggcgaaaacc tgcccggtgcagctgtgggtggatagcaccccgccgccgggcacccgcgtgcgcgcgatg gcgatttataaacagagccagcatatgaccgaagtggtgcgccgctgcccgcatcatgaa cgctgcagcgatagcgatggcctggcgccgccgcagcatctgattcgcgtggaaggcaac ctgcgcgtggaatatctggatgatcgcaacacctttcgccatagcgtggtggtgccgtat gaaccgccggaagtgggcagcgattgcaccaccattcattataactatatgtgcaacagc agctgcatgggcggcatgaaccgccgcccgattctgaccattattaccctggaagatagc agcggcaacctgctgggccgcaacagctttgaagtgcgcgtgtgcgcgtgcgcgggccgc gatcgccgcaccgaagaagaaaacctgcgcaaaaaaggcgaaccgcatcatgaactgccg ccgggcagcaccaaacgcgcgctgccgaacaacaccagcagcagcccgcagccgaaaaaa aaaccgctggatggcgaatattttaccctgcagattcgcggccgcgaacgctttgaaatg tttcgcgaactgaacgaagcgctggaactgaaagatgcgcaggcgggcaaagaaccgggc ggcagccgcgcgcatagcagccatctgaaaagcaaaaaaggccagagcaccagccgccat aaaaaactgatgtttaaaaccgaaggcccggatagcgat
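Reverse translation can be sketched as a lookup from each amino acid to one codon. The table below is read off the reverse-translated sequence above (one codon per residue), so it reproduces that output; the function name `reverse_translate` is my own, and the tool I used may apply a different rule internally.

```python
# One codon per amino acid, as observed in the reverse-translated sequence above.
CODON_TABLE = {
    "A": "gcg", "C": "tgc", "D": "gat", "E": "gaa", "F": "ttt",
    "G": "ggc", "H": "cat", "I": "att", "K": "aaa", "L": "ctg",
    "M": "atg", "N": "aac", "P": "ccg", "Q": "cag", "R": "cgc",
    "S": "agc", "T": "acc", "V": "gtg", "W": "tgg", "Y": "tat",
}


def reverse_translate(protein: str) -> str:
    """Map each residue to its single chosen codon and join the codons."""
    return "".join(CODON_TABLE[residue] for residue in protein.upper())


def gc_content(dna: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    dna = dna.lower()
    return (dna.count("g") + dna.count("c")) / len(dna)


first_20 = "MEEPQSDPSVEPPLSQETFS"  # first 20 residues of the p53 sequence
dna = reverse_translate(first_20)
print(dna)
print(round(gc_content(dna), 2))
```

Because each amino acid always maps to the same codon, this kind of reverse translation ignores codon-usage balance, which is what the codon optimization step then adjusts.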

3.3 Codon-optimized sequence

ATGGAAGAACCACAAAGTGACCCCAGCGTTGAACCGCCGCTGAGCCAGGAAACCTTCAGTGATCTGTGGAAACTGCTGCCGGAAAACAACGTGCTGAGCCCGCTGCCGAGCCAGGCGATGGATGATCTGATGCTGTCTCCGGATGACATTGAGCAGTGGTTCACCGAAGACCCCGGACCGGATGAAGCGCCGCGTATGCCGGAAGCAGCACCGCGCGTAGCACCGGCACCGGCAGCACCGACCCCGGCTGCACCTGCACCGGCACCCTCATGGCCGCTCAGCAGCTCAGTGCCCAGCCAGAAAACCTATCAGGGCAGCTATGGCTTCCGCCTGGGCTTCCTGCACAGCGGCACGGCAAAATCGGTGACCTGCACCTACAGCCCTGCGCTGAACAAGATGTTCTGCCAGCTGGCGAAAACCTGCCCGGTGCAGCTGTGGGTTGACTCCACACCGCCGCCAGGCACCCGTGTGCGTGCGATGGCGATCTATAAACAGAGCCAGCACATGACCGAAGTGGTGCGTCGCTGCCCGCACCATGAGCGCTGCTCTGACAGCGACGGTCTGGCACCGCCGCAGCATCTGATCCGCGTTGAAGGTAACCTGCGTGTGGAGTATCTGGATGACCGCAACACCTTCCGCCACAGCGTGGTGGTGCCGTATGAACCGCCGGAAGTGGGCAGCGACTGCACCACCATCCACTACAACTACATGTGCAACTCCTCCTGCATGGGCGGTATGAACCGCCGTCCGATTCTGACCATTATCACCCTGGAAGACTCCAGCGGTAACCTGCTGGGCCGTAACAGCTTTGAAGTGCGTGTGTGTGCCTGTGCCGGCCGCGATCGCCGCACGGAAGAAGAAAACCTGCGCAAGAAAGGTGAACCGCACCACGAACTGCCGCCGGGCAGCACCAAGCGTGCGCTGCCGAACAACACCTCCTCCAGCCCGCAGCCGAAGAAGAAACCGCTGGATGGCGAGTACTTCACCCTGCAGATCCGTGGGCGTGAACGTTTTGAAATGTTCCGTGAGCTGAACGAAGCGCTGGAGCTGAAAGATGCGCAGGCGGGTAAAGAGCCGGGTGGCTCACGTGCGCACAGCAGCCACCTGAAATCCAAAAAAGGTCAGAGCACCAGCCGTCACAAAAAACTGATGTTTAAAACTGAAGGGCCGGACAGCGAT

3.4 Expression Method

Cell-dependent

  • Transform plasmid into E. coli
  • Cells replicate plasmid + express protein
  • Induce expression (e.g., IPTG)
  • Lyse cells, purify protein

Cell-free

  • Add DNA/RNA to cell extract
  • Extract contains ribosomes + factors
  • Protein made in a test tube
  • Faster; good for toxic proteins
3.5 Protein Alignment

The same gene can give rise to different protein isoforms mainly through:

  • Alternative Splicing
  • Alternative transcriptional and translational initiation.

[Image: Benchling protein alignment of our protein]

4 Preparing a Twist DNA Synthesis Order

In this part, I created an expression cassette that can be inserted into a plasmid vector and used in a cell-free or cell-dependent system to express a desired protein. To practice the entire procedure of building a construct and obtaining a customized plasmid vector, I used Benchling and Twist. I took the sGFP gene sequence from NCBI and annotated its promoter, ribosome-binding site, codon-optimized coding region, and terminator in Benchling, and then inserted the cassette into a pTwist Amp High Copy vector downloaded from Twist.

[Image: Finalized DNA construct]

5. Tools and Techniques to Read, Write, and Edit DNA.

DNA Read

  • Next-generation sequencing (NGS / Illumina)
  • Nanopore sequencing (Oxford Nanopore)
  • Sanger sequencing

DNA Write

  • Phosphoramidite synthesis (column synthesis)
  • Array-based synthesis
  • Enzymatic DNA synthesis

DNA Edit

  • CRISPR-Cas9
  • Base editing
  • Prime editing

Genetic Circuits - 1

Part - 1

  1. What are some components in the Phusion High-Fidelity PCR Master Mix, and what is their purpose?

    • Phusion High-Fidelity PCR Master Mix, available from suppliers such as Thermo Fisher Scientific, contains a high-fidelity DNA polymerase with proofreading ability, a reaction buffer that maintains optimal conditions, Mg²⁺ ions as a cofactor, dNTPs as building blocks, and stabilizing additives. Together, these components enable accurate and efficient DNA amplification with a low error rate.
  2. What are some factors that determine primer annealing temperature during PCR?

    • Primer annealing temperature in PCR is mainly determined by the melting temperature of the primers, which depends on their length and GC content. Higher GC content and longer primers increase the melting temperature, leading to a higher annealing temperature, while mismatches and low salt conditions can reduce it.
  3. There are two methods from this class that create linear fragments of DNA: PCR, and restriction enzyme digests. Compare and contrast these two methods, both in terms of protocol as well as when one may be preferable to use over the other.

    • PCR and restriction enzyme digestion both generate linear DNA fragments but differ fundamentally in approach. PCR amplifies DNA from a template using a polymerase and primers, making it ideal when starting material is limited or when sequence modifications are needed, while restriction digestion cuts existing DNA at specific sequences using enzymes, making it preferable when precise, predefined sites are available and no amplification is required.
  4. How can you ensure that the DNA sequences that you have digested and PCR-ed will be appropriate for Gibson cloning?

    • To ensure DNA fragments are suitable for Gibson Assembly, the sequences must be designed with overlapping ends of about 20 to 40 base pairs that are shared between adjacent fragments. These overlaps must have appropriate melting temperatures and correct sequence alignment so that the fragments can anneal properly and be joined seamlessly.
  5. How does the plasmid DNA enter the E. coli cells during transformation?

    • Plasmid DNA enters competent E. coli cells when the cell envelope is made transiently permeable, typically by a brief heat shock applied to chemically competent cells or by an electrical pulse (electroporation). After a recovery period, the cells that took up the plasmid are selected on antibiotic-containing plates.
  6. Describe another assembly method in detail (such as Golden Gate Assembly) Explain the other method in 5 - 7 sentences plus diagrams (either handmade or online).

    • Golden Gate Assembly works by repeatedly cycling between digestion and ligation in one reaction mixture containing DNA fragments, a Type IIS enzyme, and ligase. The enzyme cuts to create specific overhangs, fragments anneal based on complementary ends, and ligase seals them together. Because the recognition sites are eliminated after cutting, correctly assembled products accumulate over time. This enables efficient and accurate multi-fragment assembly without leaving extra sequences between parts. The method is widely used in synthetic biology for building complex constructs.
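
Two of the answers above (primer annealing temperature and Gibson overlaps) lend themselves to a quick computational check. The sketch below is only illustrative: it uses the Wallace rule of thumb, Tm ≈ 2 °C × (A+T) + 4 °C × (G+C), which is a rough estimate for short primers and not what a primer-design tool would use, and the function names and example sequences are made up.

```python
def gc_content(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rule-of-thumb melting temperature in °C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def shared_overlap(left: str, right: str, min_len: int = 20, max_len: int = 40) -> int:
    """Length of the longest end of `left` that equals the start of `right`.

    Only overlaps between min_len and max_len bases count; returns 0 if
    the two fragments do not share such an overlap.
    """
    left, right = left.upper(), right.upper()
    for n in range(min(max_len, len(left), len(right)), min_len - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0


primer = "ATGCATGCATGCATGCATGC"          # made-up 20-mer, 50% GC
print(gc_content(primer), wallace_tm(primer))  # 0.5 60

overlap = "ACGTTGCAGGATCCATGGCTAA"        # made-up 22-base overlap
print(shared_overlap("TTTT" + overlap, overlap + "GGGG"))  # 22
```

Longer primers and higher GC content both raise the estimate, which matches the trend described in the answer to question 2.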

Protein design - 1

A. Conceptual Questions

  1. How many molecules of amino acids do you take with a piece of 500 grams of meat?

(On average, an amino acid is ~100 Daltons)

Answer

1 Dalton ≈ 1 g/mol

Average amino acid ≈ 100 g/mol

If you eat 500 g of (pure) amino acids:

number of moles = mass / molar mass = 500 g / (100 g/mol) = 5 mol

Using Avogadro’s number: 5 mol × 6.022 × 10²³ molecules/mol ≈ 3.0 × 10²⁴ molecules

So you consume roughly 3 septillion amino acid molecules.
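
The same arithmetic as a short script, under the same simplification that all 500 g is amino acids. The `protein_fraction` parameter is a hypothetical addition of mine to show how the estimate scales if only part of the mass is protein; the 0.25 used below is an illustrative value, not a measured one.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_molecules(mass_g: float,
                         molar_mass_g_per_mol: float = 100.0,
                         protein_fraction: float = 1.0) -> float:
    """Estimate the number of amino acid molecules in a given mass."""
    moles = mass_g * protein_fraction / molar_mass_g_per_mol
    return moles * AVOGADRO


print(f"{amino_acid_molecules(500):.2e}")                         # 3.01e+24
print(f"{amino_acid_molecules(500, protein_fraction=0.25):.2e}")  # 7.53e+23
```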

2. Why do humans eat beef but do not become cows, eat fish but do not become fish?
Answer

Proteins are digested into individual amino acids in the stomach and small intestine.

Your body:

  • Breaks proteins down.
  • Absorbs amino acids.
  • Reassembles them into human proteins according to your DNA.
3. Why are there only 20 natural amino acids?
Answer

Because they have been created by an intelligent design in such a way.

4. Can you make other non-natural amino acids? Design some new amino acids.
Answer

Yes. Scientists create non-natural amino acids by chemical synthesis and incorporate them into proteins using synthetic biology tools such as genetic code expansion.

Examples of designs:

  • A fluorescent amino acid (attach a fluorophore to the side chain)
  • A metal-binding amino acid (add a bipyridine group)
  • A photo-switchable amino acid (add an azobenzene group)
  • A redox-active amino acid

These can:

  • Expand protein function
  • Create new biomaterials
  • Enable bioelectronics
5. Where did amino acids come from before enzymes that make them, and before life started?
Answer
  • Everything was created by the almighty God, who is an intelligent being.
6. If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
Answer
  • Natural proteins use L-amino acids and form right-handed α-helices.

  • If you use D-amino acids, you would expect a left-handed α-helix.

The handedness flips due to stereochemistry.

7. Can you discover additional helices in proteins?
Answer

Yes.

Beyond the α-helix, proteins contain:

  • 3₁₀ helix

  • π-helix

  • Collagen triple helix

Structural biology and protein design can reveal or engineer new helix types.

[Image: Helices]

8. Why are most molecular helices right-handed?
Answer

Because biological systems predominantly use L-amino acids.

Their stereochemistry naturally favors right-handed packing for minimal steric clash and optimal hydrogen bonding.

9. Why do β-sheets tend to aggregate?

What is the driving force for β-sheet aggregation? Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials? Design a β-sheet motif that forms a well-ordered structure.

Answer

Why β-sheets aggregate: β-strands expose backbone hydrogen bonding groups. They stack via intermolecular hydrogen bonds.

Driving force:

  • Hydrogen bonding

  • Hydrophobic interactions

  • π–π stacking (aromatic residues)

Amyloid diseases: Proteins misfold and form stable β-sheet fibrils.

Examples include:

  • Alzheimer’s disease

  • Parkinson’s disease

Amyloid β-peptides form cross-β sheet structures.

Materials applications: Yes, amyloid fibrils can be used as:

  • Nanowires

  • Hydrogels

  • Biocompatible scaffolds

Design idea: Create a repeating sequence like:

  • Val–Ile–Val–Ile–Tyr–Val–Ile–Val

The β-branched hydrophobic residues (Val, Ile), which favor β-strands, together with the aromatic Tyr promote stacking and ordered β-sheet assembly.

B. Protein Analysis

I have chosen Herceptin (trastuzumab) for this section. Herceptin is a monoclonal antibody mainly involved in recognising cancer cells. It binds specifically to the HER2 receptor on cancer cells and blocks signaling pathways that promote tumor growth. I selected this protein because it is an important example of a therapeutic antibody widely used in breast cancer treatment.

Amino Acid Sequence of HER2, the target of trastuzumab (UniProt P04626-1)

MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDMLRHLYQGCQVVQGNLELTYLPTNASLSFLQDIQEVQGYVLIAHNQVRQVPLQRLRIVRGTQLFEDNYALAVLDNGDPLNNTTPVTGASPGGLRELQLRSLTEILKGGVLIQRNPQLCYQDTILWKDIFHKNNQLALTLIDTNRSRACHPCSPMCKGSRCWGESSEDCQSLTRTVCAGGCARCKGPLPTDCCHEQCAAGCTGPKHSDCLACLHFNHSGICELHCPALVTYNTDTFESMPNPEGRYTFGASCVTACPYNYLSTDVGSCTLVCPLHNQEVTAEDGTQRCEKCSKPCARVCYGLGMEHLREVRAVTSANIQEFAGCKKIFGSLAFLPESFDGDPASNTAPLQPEQLQVFETLEEITGYLYISAWPDSLPDLSVFQNLQVIRGRILHNGAYSLTLQGLGISWLGLRSLRELGSGLALIHHNTHLCFVHTVPWDQLFRNPHQALLHTANRPEDECVGEGLACHQLCARGHCWGPGPTQCVNCSQFLRGQECVEECRVLQGLPREYVNARHCLPCHPECQPQNGSVTCFGPEADQCVACAHYKDPPFCVARCPSGVKPDLSYMPIWKFPDEEGACQPCPINCTHSCVDLDDKGCPAEQRASPLTSIISAVVGILLVVVLGVVFGILIKRRQQKIRKYTMRRLLQETELVEPLTPSGAMPNQAQMRILKETELRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFVVIQNEDLGPASPLDSTFYRSLLEDDDMGDLVDAEEYLVPQQGFFCPDPAPGAGGMVHHRHRSSSTRSGGGDLTLGLEPSEEEAPRSPLAPSEGAGSDVFDGDLGMGAAKGLQSLPTHDPSPLQRYSEDPTVPLPSETDGYVAPLTCSPQPEYVNQPDVRPQPPSPREGPLPAARPAGATLERPKTLSPGKNGVVKDVFAFGGAVENPEYLTPQGGAAPQPHPPPAFSPAFDNLYYWDQDPPERGAPPSTFKGTPTAENPEYLGLDVPV

Total Length: 1255 amino acids

Most Common Amino Acid: Leucine (L)
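
The length and the most common residue can be obtained with a simple count. The sketch below runs on only the first 30 residues of the sequence above to keep the example short, so its output describes that fragment, not the full protein; `summarize_sequence` is a name I made up.

```python
from collections import Counter


def summarize_sequence(sequence: str) -> tuple[int, str, int]:
    """Return (length, most common residue, its count) for a protein sequence."""
    residues = "".join(sequence.split()).upper()  # drop spaces and line breaks
    residue, count = Counter(residues).most_common(1)[0]
    return len(residues), residue, count


fragment = "MELAALCRWGLLLALLPPGAASTQVCTGTD"  # first 30 residues only
print(summarize_sequence(fragment))  # (30, 'L', 7)
```

Passing the full sequence instead of the fragment gives the totals reported above.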

  • Trastuzumab belongs to the immunoglobulin G1 (IgG1) subclass within the immunoglobulin superfamily, and it is part of the L-domain family (immunoglobulin light-chain domain).

  • Resolution: 4.36 Å, which is a relatively low resolution for the model.

  • The crystal structure of the trastuzumab Fab bound to HER2 was first reported in 2003.

Blast Analysis

[Image: BLAST analysis]

  • The BLAST search identified homologous ERBB2 (HER2) protein sequences in several primates, including chimpanzee, bonobo, gorilla, and orangutan. These sequences show very high similarity (98–99% identity) with the query sequence, indicating that the HER2 receptor is highly conserved among primates.

PYMOL Analysis of Trastuzumab

Ribbon Representation

[Image: Ribbon view]

Ball and Stick

[Image: Ball and stick]

Protein Surface

[Image: Surface]

Hydrophobic Region

[Image: Hydrophobic region]

Secondary structures

[Image: Alpha and beta structure]

C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

Deep Mutational Scans

  1. [Image: Heatmap-1]
  2. [Image: Heatmap-2]
  3. [Image: Heatmap-3]

Latent Space Analysis

[Image: Latent space analysis]

  • The latent space analysis shows a 3D representation of different proteins. This plot is a map of protein similarity: proteins close together are similar in sequence, function, or structure; the dense center contains common proteins; and the scattered edges contain unusual ones. The color encodes an additional property (likely functional or structural) layered on top of the spatial layout.

Explanation

Shape

  • One large continuous cloud with no hard, separate clusters.
  • This reflects that protein sequence space is smooth and gradual, not divided into distinct categories.

The Dense Purple Core

  • This is where most proteins sit.
  • These are common, well-represented protein families that ESM2 has seen many times.

The Scattered Orange/Yellow Periphery

  • Outlier proteins that are unusual or specialized.
  • They score higher on whatever the colorbar is measuring (likely a biological property or cluster score ranging from -7 to +7).

The Elongated Arms

  • Streaks radiating outward from the core.
  • They represent protein subfamilies that share a common origin but have diverged over evolution.

ESM fold Prediction

  • N.B. For this section, I selected insulin because it is much smaller than HER2; the prediction for HER2 kept crashing.

[Image: Predicted structure]

[Image: Real structure]

  • ESMFold correctly predicted the beta sheet topology of insulin, identifying the major secondary structure elements consistent with the experimental RCSB structure. However, the predicted structure is notably more extended and loosely packed, with larger irregular loops compared to the compact real structure. This discrepancy is most likely due to insulin’s three disulfide bonds (two linking Chain A to Chain B and one within Chain A), which ESMFold does not explicitly model; these bonds are critical for anchoring the loops and achieving the tight globular shape seen in the experimental structure. The TM-score and RMSD would quantify this difference precisely, but visually, the fold class is correct while the fine-grained packing is not.
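
For reference, the RMSD mentioned above is the square root of the mean squared distance between matched atoms. The sketch below assumes the two structures are already superimposed and that the coordinates are listed in the same atom order (for example, one Cα atom per residue); it does not perform the alignment step, and the coordinates are made up.

```python
import math

Coord = tuple[float, float, float]


def rmsd(coords_a: list[Coord], coords_b: list[Coord]) -> float:
    """Root-mean-square deviation between two matched, pre-aligned coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the "predicted" atoms are each displaced by 5 Å.
experimental = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(x + 3.0, y + 4.0, z) for x, y, z in experimental]
print(rmsd(experimental, predicted))  # 5.0
```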

Inverse folding using ProteinMPNN

For this part, I used the PDB file of the HER2 protein. After uploading the PDB file, I ran inverse folding, which predicted 20 candidate sequences for the protein backbone. Among the results, the one with the lowest score was identified through manual screening and was folded using the ESMFold model. The predicted sequence and the folded protein are shown below.

Predicted Sequence

ALTPEQAALLAAAWAPVFADREANARAFVLDLFRAYPSLADLFPEFKGKTLEQIAASPALGPYAGAFADRLAQFVASSDNAAKMATFWENYANEHIRRGITASHFEQVRAVFPGFVASVAEPPPGAAAAWDQFWGGIIDALKKAGG

T=0.5, sample=0, score=0.9440, seq_recovery=0.4932

T = 0.5 (Temperature)

  • Controls how diverse the designed sequences are.
  • 0.5 is moderate: balanced between staying close to the original and exploring new sequences.
  • Lower (0.1) = conservative; higher (1.0) = more exploratory.

sample = 0

  • This is the first designed sequence (counting starts from 0).
  • If 10 sequences are generated, they are labeled sample=0 through sample=9.
  • Each sample is an independent design attempt for the same backbone.

score = 0.9440

  • Negative log likelihood, which measures model confidence.
  • Lower is better: the model is more confident that this sequence fits the backbone.
  • The score of 0.9440 is below 1.0, which is better than my insulin results (1.06 and 1.08).

seq_recovery = 0.4932

  • 49.32% of positions match the original protein sequence exactly.
  • Roughly 1 in 2 residues is identical to the original.
  • This is my best recovery so far, slightly higher than insulin’s ~46%.
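
The manual screening step could also be scripted. The sketch below parses header lines in the format shown above and picks the lowest score; the second header is a hypothetical line added only so there is something to compare against, and `parse_mpnn_header` and `sequence_recovery` are names I made up, not ProteinMPNN functions.

```python
def parse_mpnn_header(header: str) -> dict[str, float]:
    """Parse 'T=0.5, sample=0, score=0.9440, ...' into a dict of floats."""
    fields = {}
    for item in header.lstrip(">").split(","):
        key, value = item.strip().split("=")
        fields[key.strip()] = float(value)
    return fields


def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


headers = [
    "T=0.5, sample=0, score=0.9440, seq_recovery=0.4932",
    "T=0.5, sample=1, score=1.0600, seq_recovery=0.4600",  # hypothetical second design
]
best = min(headers, key=lambda h: parse_mpnn_header(h)["score"])
print(parse_mpnn_header(best)["sample"])   # 0.0
print(sequence_recovery("ACDE", "ACDF"))   # 0.75
```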

[Image: Refolding of the sequence]

Week-07 Genetic Circuits - 2

Assignment Part 1: Intracellular Artificial Neural Networks (IANNs)

  1. What advantages do IANNs have over traditional genetic circuits, whose input/output behaviors are Boolean functions?

    • Intracellular artificial neural networks provide more flexible and nuanced behavior than traditional Boolean genetic circuits because they can process inputs in a graded, continuous manner rather than simple on or off states. This allows cells to integrate multiple signals and produce proportional responses, making them better suited for complex decision making and pattern recognition inside biological systems.
  2. Describe a useful application for an IANN; include a detailed description of input/output behavior, as well as any limitations an IANN might face to achieve your goal.

    • A useful application of an intracellular artificial neural network would be in disease sensing, such as detecting cancer-specific molecular signatures. Inputs could be multiple biomarkers like microRNAs or metabolites, and the output could be the expression of a therapeutic protein only when a specific combination and threshold of signals is reached. This enables precise targeting and reduces off-target effects, although limitations include noise in gene expression, slow response times, and difficulty in tuning weights accurately inside living cells.
  3. Below is a diagram depicting an intracellular single-layer perceptron where the X1 input is DNA encoding for the Csy4 endoribonuclease and the X2 input is DNA encoding for a fluorescent protein output whose mRNA is regulated by Csy4. Tx: transcription; Tl: translation.

    • The perceptron system described works by using inputs that influence gene expression levels, where one input produces the Csy4 enzyme that regulates the mRNA of another gene encoding a fluorescent protein. Transcription and translation convert DNA inputs into proteins, and the interaction between Csy4 and the target mRNA effectively acts as a weighted connection, allowing the system to compute a combined output similar to a neural network node.
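
The difference between a graded perceptron and a Boolean gate can be illustrated numerically. This is a toy model, not a description of the actual circuit: it assumes that Csy4 lowers the fluorescent output (hence a negative weight for X1), uses a logistic activation, and picks arbitrary weights; all names and values are made up.

```python
import math


def perceptron_output(x1_csy4: float, x2_reporter: float,
                      w1: float = -1.0, w2: float = 1.0, bias: float = 0.0) -> float:
    """Graded output between 0 and 1 from a weighted sum of the two inputs."""
    z = w1 * x1_csy4 + w2 * x2_reporter + bias
    return 1.0 / (1.0 + math.exp(-z))


def boolean_output(x1_csy4: float, x2_reporter: float, threshold: float = 0.5) -> bool:
    """Boolean counterpart: output is ON only for X2 AND NOT X1."""
    return x2_reporter > threshold and not x1_csy4 > threshold


# Increasing the Csy4 input lowers the graded output step by step,
# while the Boolean version only flips once the threshold is crossed.
for x1 in (0.0, 0.4, 1.0, 2.0):
    print(x1, round(perceptron_output(x1, 2.0), 3), boolean_output(x1, 2.0))
```

Tuning `w1`, `w2`, and `bias` here corresponds to the weight-tuning difficulty mentioned in the limitations above.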

Assignment Part 2: Fungal Materials

  1. What are some examples of existing fungal materials and what are they used for? What are their advantages and disadvantages over traditional counterparts?

    • Fungal materials include products like mycelium-based packaging, leather alternatives, and construction materials, often developed by companies such as Ecovative. These materials are biodegradable, sustainable, and require less energy to produce than plastics or animal-based materials, but they can have limitations in durability, scalability, and consistency compared to traditional materials.
  2. What might you want to genetically engineer fungi to do and why? What are the advantages of doing synthetic biology in fungi as opposed to bacteria?

    • Genetically engineering fungi could allow them to produce specialized biomaterials, degrade environmental pollutants, or synthesize valuable compounds such as pharmaceuticals. Fungi are advantageous over bacteria because they naturally secrete large amounts of proteins, can grow into structured materials like mycelium networks, and are better suited for producing complex molecules, although they are generally slower growing and harder to genetically manipulate.