Week 04 hw protein design part

Part A. Conceptual Questions

  1. How many amino acid molecules do you ingest when eating 500 grams of meat? (Average amino acid ≈ 100 Daltons)

Approximately 20% of meat is protein. Therefore, 500 g of meat contains about 100 g of protein. If we assume an average amino acid molecular weight of 100 g/mol (≈100 Daltons), then:

100 g protein ÷ 100 g/mol ≈ 1 mol of amino acids.

One mole corresponds to 6 × 10²³ molecules (Avogadro’s number). Therefore, eating 500 g of meat provides on the order of 6 × 10²³ amino acid molecules.

  1. Why do humans eat beef but do not become a cow, or eat fish but not become fish?

Because we do not absorb intact proteins. During digestion, proteins are broken down into peptides and then into free amino acids. The original protein sequence and structure are completely destroyed. Our cells then use those amino acids as building blocks to synthesize human proteins according to the instructions encoded in our DNA. Biological identity comes from genetic information, not from dietary molecules.

  1. Why are there only 20 natural amino acids?

The 20 standard amino acids represent the set that became fixed in the genetic code during evolution. Together, they provide sufficient chemical diversity to build functional proteins: hydrophobic, polar, charged, aromatic, small, rigid, and flexible residues are all included. This set appears to be an optimal balance between chemical versatility and biological efficiency. Although additional amino acids such as selenocysteine and pyrrolysine exist, the core genetic code relies on 20.

  1. Can you make non-natural amino acids? Design some new amino acids.

Yes, hundreds of non-natural amino acids have been synthesized in organic chemistry and synthetic biology. They can be designed by modifying the side chain.

Examples of designed amino acids:

A fluorinated phenylalanine (to alter hydrophobicity and electronic properties).

An amino acid with an azide or alkyne group for bioorthogonal “click” chemistry.

A metal-binding amino acid containing a bipyridine group.

A photoactivatable amino acid with a diazirine group for crosslinking experiments.

An amino acid with a bulky polyaromatic side chain to enhance π–π stacking.

Some engineered organisms have even expanded their genetic code to incorporate such amino acids into proteins.

  1. Where did amino acids come from before enzymes and life existed?

Amino acids likely formed through prebiotic chemistry. The Miller–Urey experiment demonstrated that amino acids can form spontaneously under simulated early Earth conditions (reducing atmosphere + electrical discharge). Amino acids have also been detected in meteorites such as the Murchison meteorite, suggesting that they can form in space. Therefore, amino acids likely existed before enzymes and before life emerged.

  1. If you make an α-helix using D-amino acids, what handedness would you expect?

Natural proteins use L-amino acids and form right-handed α-helices. If you construct an α-helix entirely from D-amino acids, the helix would be left-handed. Inverting the chirality of the monomers inverts the handedness of the secondary structure.

  1. Can you discover additional helices in proteins?

Yes. Besides the α-helix, proteins can contain 3₁₀ helices and π-helices. These are less common but structurally distinct. Other biological polymers, such as DNA, also form helical structures (e.g., the right-handed B-DNA helix).

  1. Why are most molecular helices right-handed?

Most biological helices are right-handed because proteins are built from L-amino acids. The stereochemistry of the alpha carbon restricts backbone dihedral angles (φ and ψ), making right-handed helices energetically favorable.

  1. Why do β-sheets tend to aggregate?

β-sheets can extend through backbone hydrogen bonding between neighboring strands. Their structure allows repeating hydrogen bond networks and tight packing of side chains. This makes it easy for partially unfolded proteins to associate and form extended β-sheet assemblies.

  1. What is the driving force for β-sheet aggregation?

The primary driving forces are:

Extensive backbone hydrogen bonding.

Hydrophobic interactions between side chains.

Reduction of exposed hydrophobic surface area in aqueous environments.

Thermodynamic stabilization through ordered packing.

Together, these factors favor the formation of highly stable aggregated structures.

  1. Why do many amyloid diseases form β-sheets?

When proteins misfold, hydrophobic regions that are normally buried become exposed. These regions can reorganize into extended β-sheet structures that stack into fibrils. These fibrils are extremely stable and resistant to degradation. Many diseases, such as Alzheimer’s and Parkinson’s, involve accumulation of such amyloid β-sheet aggregates.

  1. Can amyloid β-sheets be used as materials?

Yes. Amyloid fibrils are mechanically strong, resistant to degradation, and capable of self-assembly. They are being explored as nanomaterials, scaffolds for tissue engineering, and components in bioelectronics. Some organisms even produce functional amyloids naturally.

  1. Design a β-sheet motif that forms a well-ordered structure.

A well-ordered β-sheet motif often uses alternating polar and nonpolar residues to create amphipathic strands.

Example sequence: Val–Ser–Val–Ser–Val–Ser–Val–Ser

In this design:

Valine (hydrophobic) promotes tight packing.

Serine (polar) allows hydrogen bonding and solubility.

Alternating pattern promotes regular β-strand geometry.

To enhance order:

Add aromatic residues (e.g., phenylalanine) for π–π stacking.

Place charged residues at termini to control aggregation.

Use repeating patterns to encourage uniform self-assembly.

Such rational design principles are commonly used in peptide-based biomaterials research.

Part B: Protein Analysis and Visualization

Briefly describe the protein you selected and why you selected it. Formate dehydrogenase FDH2: These electrons are transferred to components of the electron transport chain or to specific electron carriers, depending on the organism. In many bacteria, FDH2 plays an important role in energy metabolism by allowing formate to serve as an electron donor during respiration.

Identify the amino acid sequence of your protein.:

Bioe Bioe

MSEQNLYFQGLADYEKLKFGLDSTTVTPLSVTLSSPSLNDLKANAVVTVPEEGLIIAVADGTNVGKSTTTGRILNAVQQFFETLA ETDGLDAVLRRQLGELAEHPWPMVAVNADGALLPVIDAGGTGTITLHAGATPKFTITDAGYGIPFADKPLEHFLHGEDFVEIIPM GMDHQAVAVTDVKGGTMGGLAVAEANGLNKATGTLAKVIDAQGSLGVSIVGATGAGGIGSAEATYLHDMTVPGHFDLSMAEGLG YVADTGGVNATVWVVPGSRADVAVEGATIVMAXXXXVTLVVATGGNGLANASALDGLGKIIQADGATVLEGASLIGMGEDVMPNV VILDGGEGQSVDALMVDAGFSPALIRAVAGVDGIVATGDSAANLYEKLGKDAVIIAENAGFVATPAGGSMGGLIIADGATNAGT HVTVGGPLGLGGIVAVGTNTGERTLTVADTGPAVIGMSIQAGGEASVIVGTNGTSM

How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

ParameterValue
PDB ID6T8C
Protein NameFormate dehydrogenase FDH2
Sequence Length386 amino acids
Most Frequent Amino AcidGlycine (G)
Number of Occurrences (Gly)120

How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

Bioe Bioe

Does your protein belong to any protein family?

Yes, the protein FDH2 (PDB 6T8C) belongs to the formate dehydrogenase (FDH) family, which is part of the larger oxidoreductase class of enzymes. More specifically, it is a molybdenum-dependent formate dehydrogenase and is classified within the DMSO reductase superfamily. These enzymes are widely distributed in bacteria and archaea and are involved in C1 metabolism, catalyzing the oxidation of formate to carbon dioxide while transferring electrons to cellular electron carriers. They share conserved structural motifs for binding the molybdenum cofactor and often contain iron–sulfur clusters for electron transfer.

When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å) Bioe Bioe

Good quality :1.98

Are there any other molecules in the solved structure apart from protein?

Yes, NAPD

Does your protein belong to any structure classification family?

it is part of the molybdenum-containing oxidoreductase family within the DMSO reductase superfamily. In structural classification databases such as SCOP and CATH, proteins of this type fall into the α/β class, characterized by mixed alpha helices and beta sheets arranged in a conserved fold Bioe Bioe

Open the structure of your protein in any 3D molecule visualization software:

Bioe Bioe Bioe Bioe

Color the protein by secondary structure. Does it have more helices or sheets?

Bioe Bioe

Have more helices

Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues? select hydrophobic, resn ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO select polar, resn SER+THR+ASN+GLN+TYR+CYS select charged, resn ASP+GLU+LYS+ARG+HIS color orange, hydrophobic color green, polar color blue, charged

Bioe Bioe

Hydrophobic residues tend to cluster in the core, while polar/charged residues tend to be more surface exposed (typical of soluble proteins) Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

Bioe Bioe

Yes., Grooves and indentations at subunit interfaces and along the surfaceb NAPD interaction regions.

Bioe Bioe Bioe Bioe

Part C. Using ML-Based Protein Design Tools

https://colab.research.google.com/drive/1HtlIHDKFsrMeEC5p8ENv6Kzdlhpz-VvQ?authuser=1#scrollTo=33580eea

Protein Language Modeling

Deep Mutational Scans Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods. Bioe Bioe

Can you explain any particular pattern? (choose a residue and a mutation that stands out)

Bioe Bioe

A clear pattern in the mutation heatmap is that substitutions to cysteine (C) consistently show strongly negative scores across most positions in the protein, indicated by the dark purple color. This suggests that introducing cysteine is generally destabilizing. A likely explanation is that cysteine contains a reactive thiol group capable of forming disulfide bonds or interfering with the local chemical environment, which can disrupt proper folding or structural stability. In contrast, some regions appear more tolerant to mutations to glycine (G), shown by greener or yellowish areas, likely because glycine’s small size allows flexibility without causing major steric clashes. This overall pattern indicates that chemical properties of amino acid side chains strongly influence mutation tolerance in this protein.

Latent Space Analysis

Use the provided sequence dataset to embed proteins in reduced dimensionality.

Bioe Bioe

Analyze the different formed neighborhoods: do they approximate similar proteins?

Yes, the formed neighborhoods generally approximate similar proteins. In the 3D embedding plot, clusters of points appear grouped together, suggesting that proteins within the same neighborhood share higher sequence similarity. These clusters likely represent proteins belonging to similar families, functional classes, or structural folds

Place your protein in the resulting map and explain its position and similarity to its neighbors.

The protein of interest (FDH2) appears embedded within a dense cluster rather than isolated at the periphery. This position suggests that it shares significant sequence similarity with nearby proteins in the dataset. Its neighbors likely correspond to other formate dehydrogenases or closely related oxidoreductases within the same structural and functional family. The fact that it is not separated into a distant cluster indicates that it is not an outlier but rather part of a conserved evolutionary group.

Input this sequence into ESMFold and compare the predicted structure to your original. New Sequence:ALTPEEAALLRAAWAPVAADRAANGRAFVLALFAAYPELAELFPEFRGKTLAEIAASPALDAISTAIFDRLDTLVANADDAAAMATLFADLAARHVARGVTAAHFEAIRALFPGFVASVAPPPPGAAAAWDRLFGDVIAALRAAGG

Part D. Group Brainstorm on Bacteriophage Engineering Group: (James, USE), (Karen, Ecuador)

Review the Bacteriophage Final Project Goals for engineering the L Protein: Increased stability (easiest) Higher titers (medium) Higher toxicity of lysis protein (hard) Brainstorm Session | Goal | Tunable toxicity | | Strategy | Design variants with graded L–DnaJ binding strength: attenuated (stronger binding) → wild-type → enhanced (weaker binding) | | Tools | AlphaFold-Multimer (L + DnaJ), Rosetta interface ΔΔG |

Write a 1-page proposal (bullet points or short paragraphs) describing: Engineer the MS2 L lysis protein (75 aa) into a small panel of variants that produce graded lysis phenotypes
(attenuated → WT-like → enhanced) with predictable control over dose and timing.

Because MS2 lysis depends on the E. coli chaperone DnaJ, and DnaJ physically associates with L, we propose to computationally tune the L–DnaJ interaction as a controllable “toxicity knob.”


Proposed Computational Strategy

Tools and Rationale

ToolPurposeWhy It Helps
Protein Language Models (PLMs)In silico mutagenesisSuggest substitutions that are evolutionarily plausible and tolerated
AlphaFold-MultimerPredict L–DnaJ complex structureIdentify interface residues and assess structural stability
Rosetta Interface ΔΔGPredict binding energy changesQuantify strengthening or weakening of L–DnaJ interaction

Together, these tools allow rational selection of variants expected to modulate binding strength and therefore toxicity.


Computational Pipeline

flowchart TD A[Input: MS2 L sequence (75 aa)] –> B[Constraint check: overlapping coding region / allowed codons] B –> C[Generate variants (PLM-guided mutagenesis)] C –> D[AlphaFold-Multimer: predict L–DnaJ complex + interface residues] D –> E[Rosetta interface ΔΔG: rank binding changes across variants] E –> F[Pick graded panel: weak / mid / strong binders] F –> G[Wet lab: lysis timing + dose-response curves] e propose:

Weak binding ───────► Attenuated lysis Medium binding ─────► WT-like phenotype Strong binding ─────► Enhanced / faster lysis

──────────────────────────────── DESIGN STRATEGY ────────────────────────────────

(1) Sequence Level → Protein Language Models • Suggest tolerated mutations • Preserve evolutionary plausibility • Maintain fold stability

        │
        ▼

(2) Structural Level → AlphaFold-Multimer • Model L–DnaJ complex • Identify interface residues • Evaluate model confidence

        │
        ▼

(3) Energetic Level → Rosetta Interface ΔΔG • Quantify binding shifts • Rank weak → strong binders • Select graded panel (~8–15 variants)

Potential Pitfalls and Mitigation Strategies Challenge Why It Matters Mitigation Strategy Overlapping coding constraints Limits mutation space Apply codon filtering before ranking Small protein size Increases structural prediction uncertainty Use consensus scoring across PLM + AF-Multimer + Rosetta Membrane association Can reduce model reliability Focus on interface trends rather than absolute ΔΔG