Week 04 HW: Protein Design Part 1

1: Estrogen Receptor Beta (ESR2, human). Selected because of its direct relevance to endometriosis, PCOS, and female reproductive health. PDB ID: 1QKM

Information about ESR2 from PDB site

Protein Classification and Amino acid frequencies

Protein structure quality graphs

Binding Pocket Ligand

The visualisations below use GFP instead of ESR2, because I switched proteins for my final project.

Cartoon: G1

Ribbon: G2

Ball & Stick: G3

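Secondary structure of GFP: the fold is dominated by β-strands (the 11-stranded β-barrel), with a few short helical segments, including the central helix that carries the chromophore. G4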

Part C1. Protein Language Modeling

C1.1 Deep Mutational Scan

ESM2 is a protein language model trained on tens of millions of protein sequences from UniRef. It generates per-residue probability distributions over all 20 amino acids by learning co-evolutionary patterns from sequence context alone, without any structural input. For a deep mutational scan, the key output is the log-likelihood ratio (LLR): for every position i and every possible amino acid m, LLR(i, m) = log P(m | context) − log P(wildtype at i | context). The LLR of the wild-type residue is zero by definition. A strongly negative LLR means ESM2 considers that substitution evolutionarily disfavored; a near-zero or positive LLR means it is tolerated.
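To make the LLR definition concrete, here is a minimal sketch in plain Python. It does not call ESM2: it assumes the per-position log-probabilities have already been obtained from the model, and the `toy_distribution` helper and its numbers are hypothetical stand-ins, not real model output.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def llr_matrix(log_probs, wildtype):
    """Return LLR[i][m] = log P(m | context) - log P(wildtype_i | context).

    log_probs: one dict per position, mapping amino acid -> log-probability.
    wildtype:  the wild-type sequence, same length as log_probs.
    """
    if len(log_probs) != len(wildtype):
        raise ValueError("need one distribution per residue")
    matrix = []
    for dist, wt in zip(log_probs, wildtype):
        wt_log_prob = dist[wt]
        matrix.append([dist[aa] - wt_log_prob for aa in AMINO_ACIDS])
    return matrix


def toy_distribution(favoured, p_favoured):
    """Hypothetical model output: one favoured residue, the rest uniform."""
    p_other = (1.0 - p_favoured) / (len(AMINO_ACIDS) - 1)
    return {
        aa: math.log(p_favoured if aa == favoured else p_other)
        for aa in AMINO_ACIDS
    }


# Toy three-residue example (made-up probabilities, not ESM2 predictions).
wildtype = "SYG"
log_probs = [
    toy_distribution("S", 0.90),
    toy_distribution("Y", 0.95),
    toy_distribution("G", 0.50),
]
llr = llr_matrix(log_probs, wildtype)  # shape: len(wildtype) x 20
```

For a real scan, `log_probs` would hold one distribution per GFP residue, so the resulting matrix has one row per position and 20 columns.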

Running this scan across all 239 residues of GFP and all 20 amino acids produces a 239 × 20 LLR heatmap: G5

The most striking pattern is the sharp, strongly negative signal at the chromophore triad: Ser65, Tyr66, and Gly67. These three residues form GFP’s fluorophore through spontaneous backbone cyclization and oxidation. ESM2 assigns extremely low likelihood to any substitution at these positions, which is consistent with deep evolutionary conservation.
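One way to pick out such strongly constrained positions from the heatmap is to average the LLR over the 19 substitutions at each position and sort. The sketch below assumes the LLR matrix is a list of 20-value rows with the wild-type entry equal to zero; the function names and the toy matrix are hypothetical and only illustrate the ranking step.

```python
def mean_llr_per_position(llr):
    """Average LLR over the 19 non-wild-type substitutions at each position.

    The wild-type entry is 0 by definition, so it adds nothing to the sum;
    dividing by len(row) - 1 averages over the real substitutions only.
    """
    return [sum(row) / (len(row) - 1) for row in llr]


def most_constrained(llr, first_residue_number=1, top_k=3):
    """Return (residue_number, mean_llr) pairs, most negative first."""
    means = mean_llr_per_position(llr)
    ranked = sorted(range(len(means)), key=lambda i: means[i])
    return [(i + first_residue_number, means[i]) for i in ranked[:top_k]]


# Toy 4 x 20 matrix with made-up values: column 0 plays the wild type.
toy_llr = [
    [0.0] + [-0.5] * 19,
    [0.0] + [-8.0] * 19,
    [0.0] + [-9.0] * 19,
    [0.0] + [-1.0] * 19,
]
top_two = most_constrained(toy_llr, top_k=2)
```

A mean is only one possible summary; the minimum or the fraction of substitutions below some cutoff would highlight slightly different positions.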

C1.2 Latent Space Analysis

ESM2’s internal transformer layers produce high-dimensional embedding vectors (1280 dimensions for ESM2-650M) that encode evolutionary, structural, and functional information simultaneously. Dimensionality reduction via UMAP projects these into 2D, allowing visual inspection of how proteins relate to one another: G6

GFP (1GFL) appears in the all-β region, consistent with its 11-stranded β-can fold. Its nearest neighbors are other fluorescent protein family members and other β-barrel proteins. This is consistent with ESM2 embeddings encoding functional as well as structural similarity.
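The nearest-neighbor observation can also be checked directly in embedding space rather than by eye on the 2D plot. The sketch below assumes each protein is summarised by mean-pooling its per-residue embeddings and that similarity is measured by cosine similarity; both choices are assumptions, and the three-dimensional vectors and protein names are toy placeholders rather than real ESM2 embeddings.

```python
import math


def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length vector per protein."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, library, k=2):
    """Return the k (name, similarity) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" with hypothetical names and values.
library = {
    "barrel_like_A": [0.9, 0.1, 0.0],
    "barrel_like_B": [0.8, 0.3, 0.1],
    "helical_C": [0.0, 0.2, 0.9],
}
query_residues = [[1.0, 0.0, 0.0], [0.8, 0.4, 0.0]]  # two residues
query = mean_pool(query_residues)
neighbours = nearest_neighbours(query, library)
```

Because UMAP distorts distances when projecting to 2D, neighbors found this way in the full embedding space can differ from the points that sit closest on the plot.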

C2. Protein Folding

ESMFold is a single-sequence structure predictor that bypasses multiple sequence alignments (MSAs), instead leveraging latent structural knowledge from a protein language model. Given the 239-residue GFP sequence, ESMFold produces full-atom coordinates along with per-residue pLDDT confidence scores (0–100, where >90 = very high confidence).
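A short sketch of how the per-residue confidence could be summarised from a predicted structure file. It assumes the prediction was saved as a PDB file with pLDDT stored in the B-factor column on a 0–100 scale; some ESMFold outputs use 0–1 instead, so the scale should be checked first. The helper names and the toy records are hypothetical and exist only so the example runs on its own.

```python
def make_atom_line(serial, name, res, chain, resseq, x, y, z, b):
    """Build a fixed-column PDB ATOM record (used only to create toy input)."""
    return (
        "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
    ).format(serial, name, res, chain, resseq, x, y, z, 1.00, b)


def ca_plddt(pdb_text):
    """Map residue number -> value of the B-factor column for each CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        # PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def summarise(scores, threshold=90.0):
    """Return (mean score, fraction of residues above the threshold)."""
    values = list(scores.values())
    mean = sum(values) / len(values)
    fraction = sum(v > threshold for v in values) / len(values)
    return mean, fraction


# Toy three-residue structure with made-up coordinates and scores.
toy_pdb = "\n".join([
    make_atom_line(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0, 95.0),
    make_atom_line(2, "CA", "MET", "A", 1, 1.5, 0.0, 0.0, 95.0),
    make_atom_line(3, "CA", "SER", "A", 2, 3.0, 1.0, 0.0, 92.5),
    make_atom_line(4, "CA", "LYS", "A", 3, 4.5, 2.0, 0.0, 60.5),
])
scores = ca_plddt(toy_pdb)
mean_plddt, frac_very_high = summarise(scores)
```

Reading only the CA atoms gives one value per residue, which matches how pLDDT is usually reported.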

Folding a protein (all other parts were performed as part of the final project as well, so they are visible in the Google Colab files documentation; more files cannot seem to be added here)