Week 4 HW: Protein Design Part I

Part A. Conceptual Questions

Question: Why do humans eat beef but do not become a cow, or eat fish but do not become fish?

Answer: When humans consume beef or fish, our digestive system (using stomach acid and proteases such as pepsin and trypsin) breaks down the animal’s species-specific proteins into their building blocks: individual amino acids and very short peptides. These are absorbed through the gut into our bloodstream. Our own cellular machinery (ribosomes) then uses these generic amino acids to synthesize new human proteins, following the instructions encoded in our human DNA. The animal’s DNA is digested into nucleotides in the same way and is not incorporated into our genome. We are simply recycling the universal building blocks of life, not taking on the animal’s genetic blueprint.


Part B. Protein Analysis and Visualization

  • Protein Selected: Brain-Derived Neurotrophic Factor (BDNF)
  • PDB ID: 1B8M (a BDNF/NT-4 heterodimer; the BDNF chain is the one analyzed here)
  • Reason for selection: BDNF is a crucial protein for the survival, growth, and maintenance of neurons, including the spiral ganglion neurons of the inner ear. Understanding and engineering neurotrophins like BDNF is a fundamental step toward regenerative therapies for sensorineural hearing loss and tinnitus.
  • Amino Acid Sequence: HSDPARRGELSVCDSISEWVTAADKKTAVDMSGGTVTVLEKVPVSKGQLKQYFYETKCNPMGYTKEGCRGIDKRHWNSQCRTTQSYVRALTMDSKKRIGWRFIRIDTSCVCTLTIKRGR
  • Sequence Length & Most Frequent AA: The mature chain is 119 amino acids long. Threonine (T) is the most frequent residue (12 occurrences), closely followed by Arginine (R) and Lysine (K) with 11 each. Basic residues (K, R, H: 24 in total) outnumber acidic ones (D, E: 12 in total), giving the protein a basic nature.
  • Homologs: There are hundreds of homologs. A BLASTp (protein BLAST) search reveals extremely high conservation across mammals (mice, rats, macaques), indicating its critical evolutionary role in nervous system development.
  • Protein Family: It belongs to the Neurotrophin family (which includes NGF, NT-3, and NT-4).
  • Structure Details (RCSB): The structure (1B8M) was solved in 1999 using X-ray diffraction. The resolution is 2.30 Å, which is good enough to place the backbone and most side chains reliably, although it is moderate rather than atomic resolution.
  • Other Molecules: Besides the protein chains, crystal structures like this one typically contain ordered water molecules (HOH) and sometimes stabilizing ions such as sulfate (SO4) from the crystallization buffer. The ligands actually present in 1B8M are listed on its RCSB entry page.
  • Structure Classification: It belongs to the cystine-knot cytokine superfamily, characterized by three disulfide bonds arranged in a knot-like topology that stabilizes the structure.
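
The length and composition figures in the list above can be checked directly from the sequence. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and the sequence is the one given under "Amino Acid Sequence".

```python
from collections import Counter

# Mature BDNF chain as listed under "Amino Acid Sequence" above.
BDNF = (
    "HSDPARRGELSVCDSISEWVTAADKKTAVDMSGGTVTVLEKVPVSKGQLKQYFYETKCNPMGYTKEGCRG"
    "IDKRHWNSQCRTTQSYVRALTMDSKKRIGWRFIRIDTSCVCTLTIKRGR"
)

counts = Counter(BDNF)
length = len(BDNF)

# Residues sorted from most to least frequent.
ranked = counts.most_common()

# Rough charge balance: basic (K, R, H) versus acidic (D, E) residue counts.
basic = sum(counts[aa] for aa in "KRH")
acidic = sum(counts[aa] for aa in "DE")

# 1-based positions of every cysteine in the chain.
cys_positions = [i + 1 for i, aa in enumerate(BDNF) if aa == "C"]

print(f"length: {length}")
print(f"most frequent residue: {ranked[0]}")
print(f"basic: {basic}, acidic: {acidic}")
print(f"cysteines at: {cys_positions}")
```

The cysteine positions printed here are the ones referred to in the mutational scan discussion in Part C.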

(Note: Images of the PyMOL visualizations to be uploaded)


Part C. Using ML-Based Protein Design Tools

Protein Language Models

  • Deep Mutational Scans (ESM2): The unsupervised scan generated by ESM2 shows a stark drop in likelihood (low predicted fitness) when the conserved cysteine residues are mutated (e.g., Cys13, Cys58, Cys109). This pattern makes sense because the six cysteines of the mature chain (positions 13, 58, 68, 80, 109, and 111) form the three disulfide bonds of the “cystine knot” that holds the neurotrophin fold together. Replacing any of them with another amino acid removes a disulfide bond and is expected to destabilize the fold.
  • Latent Space Analysis: In the dimensionality-reduced embedding, BDNF maps close to its evolutionary neighbors, such as Nerve Growth Factor (NGF) and Neurotrophin-3 (NT-3). It is positioned distinctly away from unrelated metabolic enzymes, which suggests that the protein language model groups proteins by functional and structural homology without being explicitly trained to do so.
  • Attention Maps: The attention heads in the deeper layers of ESM2 correlate strongly with the 2D contact map of BDNF. The model places high attention weight on residue pairs that form the beta-hairpin structures and the stabilizing disulfide bridges.
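
The scan described above scores each substitution by comparing the model’s log-probability for the mutant residue with that for the wild-type residue at the same position. Below is a minimal sketch of that scoring step only, assuming the per-position log-probabilities have already been obtained from ESM2. The function names `scan_scores` and `mean_effect` and the toy numbers are hypothetical and are not ESM2 output.

```python
import math

def scan_scores(sequence, log_probs):
    """Return {(position, wt, mut): score} for every single substitution.

    sequence  : wild-type amino acid string
    log_probs : one dict per position mapping amino acid -> log-probability
                (assumed to come from a masked language model such as ESM2)
    score     : log p(mut) - log p(wt); a negative value means the model
                considers the mutant less likely than the wild type
    """
    if len(sequence) != len(log_probs):
        raise ValueError("need one log-probability dict per residue")
    scores = {}
    for i, wt in enumerate(sequence):
        for mut, lp in log_probs[i].items():
            if mut != wt:
                scores[(i + 1, wt, mut)] = lp - log_probs[i][wt]
    return scores

def mean_effect(scores, position):
    """Average score over all substitutions at one 1-based position."""
    vals = [s for (pos, _, _), s in scores.items() if pos == position]
    return sum(vals) / len(vals)

# Toy example with made-up probabilities over a 3-letter alphabet:
# position 1 is a strongly preferred cysteine, position 2 is tolerant.
toy_seq = "CA"
toy_log_probs = [
    {"C": math.log(0.98), "A": math.log(0.01), "S": math.log(0.01)},
    {"C": math.log(0.30), "A": math.log(0.40), "S": math.log(0.30)},
]
toy_scores = scan_scores(toy_seq, toy_log_probs)
print(mean_effect(toy_scores, 1))  # strongly negative
print(mean_effect(toy_scores, 2))  # mildly negative
```

Averaging the scores per position, as `mean_effect` does, is one simple way to spot intolerant sites such as the conserved cysteines in a heatmap of the full scan.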

Protein Folding & Generation

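  • ESMFold: When the wild-type BDNF sequence is given to ESMFold, the predicted structure superimposes closely on the 1B8M crystal structure, and the core is predicted with high confidence (high pLDDT scores). Note that pLDDT measures the model’s per-residue confidence, not agreement with the crystal structure; agreement is judged from the superposition itself (for example, by Cα RMSD). The main deviations are in a few flexible loops and the chain termini, which typically lack structural definition.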
  • Inverse-Folding (ProteinMPNN): Feeding the 1B8M backbone into ProteinMPNN yielded several novel sequence candidates. Interestingly, the designed sequences shared only about 60% sequence identity with wild-type human BDNF. However, when these sequences were fed back into ESMFold, the predicted backbones closely matched the original fold. This demonstrates the power of inverse folding: we can design sequences that differ substantially from the native one yet are predicted to adopt the same overall shape.
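
The sequence identity quoted above can be computed position by position, because fixed-backbone designs keep the chain length and need no alignment gaps. Below is a minimal sketch under that assumption. The helper names are illustrative, `wild_type` is the first ten residues of mature BDNF, and `designed` is a made-up variant, not an actual ProteinMPNN output.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with identical residues.

    Assumes equal-length, ungapped sequences (true for fixed-backbone
    redesigns, which preserve the number of residues).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

def differing_positions(seq_a, seq_b):
    """List (1-based position, residue in a, residue in b) where they differ."""
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]

wild_type = "HSDPARRGEL"  # first 10 residues of mature BDNF
designed = "HSEPAKRGDL"   # hypothetical redesign, for illustration only

print(percent_identity(wild_type, designed))
print(differing_positions(wild_type, designed))
```

Listing the differing positions is a quick way to confirm that a design has kept the conserved cysteines before it is sent back to ESMFold.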

Part D. Group Brainstorm on Bacteriophage Engineering

Computational Engineering of Phage Capsid for Increased Thermal Stability

Goal: Increased stability (preventing phage degradation during storage or in human physiological conditions).

  • Proposed Tools & Approaches:

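    • ProteinMPNN: We propose running ProteinMPNN on the major capsid protein backbone of our target phage to perform inverse folding. The goal is to generate sequence variants that are highly compatible with the existing backbone, in the hope that better-packed cores translate into higher thermal stability. ProteinMPNN does not compute energies itself, so any stability gain would need to be verified separately.
    • ESM2 (Deep Mutational Scanning): We will run a zero-shot mutational scan on the capsid sequence to identify point mutations that the language model scores as more likely than the wild-type residue. These scores are a proxy for evolutionary plausibility rather than a direct prediction of thermal stability, so we will treat high-scoring mutations as candidates that are unlikely to disrupt the overall fold.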
    • AlphaFold-Multimer: Since the capsid is a complex of many repeating units, we will use AF-Multimer to predict how our new mutant monomers assemble. We need to ensure that stabilizing the monomer doesn’t prevent the self-assembly of the full viral head.
  • Why these tools? Phage therapy is often limited by the short shelf-life of the phages. By using generative models (ProteinMPNN), we move away from random trial and error. Instead, we ask the model for sequences that fit an existing backbone geometry well, which narrows the search to a small set of promising candidates and should speed up the design phase.

  • Potential Pitfalls:

    • The “Rigid” Problem: If we over-stabilize the capsid protein, the phage might become too rigid. Bacteriophages need a degree of flexibility to assemble, package their genome, and release their DNA into the host bacterium upon binding. We risk creating a phage that is very stable in storage but functionally inert.
    • Compute Limits: Predicting the assembly of an entire viral capsid with AlphaFold-Multimer is computationally very expensive and will likely exceed our available Colab GPU limits. A practical fallback is to model only a small subassembly (for example, a few neighboring subunits) and inspect the interfaces there.
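
One way to connect the proposed tools with the pitfalls above is a simple filter: keep only point mutations that the language model scores favorably and that sit away from subunit interfaces, so that assembly contacts are left untouched. The sketch below assumes such scores and interface annotations are already available. The `Candidate` type, the `shortlist` function, the thresholds, and all example values are hypothetical placeholders, not results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    position: int     # 1-based residue index in the capsid monomer
    wild_type: str
    mutant: str
    lm_score: float   # log p(mutant) - log p(wild type), e.g. from ESM2

    @property
    def label(self):
        return f"{self.wild_type}{self.position}{self.mutant}"

def shortlist(candidates, interface_positions, min_score=0.0, max_picks=5):
    """Rank candidates by score, skipping interface sites and weak scores."""
    kept = [
        c for c in candidates
        if c.position not in interface_positions and c.lm_score > min_score
    ]
    kept.sort(key=lambda c: c.lm_score, reverse=True)
    return kept[:max_picks]

# Hypothetical inputs for illustration only.
candidates = [
    Candidate(12, "A", "V", 0.8),
    Candidate(45, "G", "P", -2.1),   # disfavored by the model
    Candidate(77, "S", "T", 1.3),
    Candidate(101, "K", "R", 0.9),   # at an interface, so excluded
]
interface_positions = {101, 102, 103}

picks = shortlist(candidates, interface_positions)
print([c.label for c in picks])
```

Because the language-model score is only a proxy, a shortlist like this would still need to go through the AlphaFold-Multimer assembly check and experimental stability testing.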