Week 4 HW: Protein Design Part 1

Part A. Conceptual Questions

1) How many amino acid molecules do you take in with 500 g of meat? (~100 Da per amino acid)

Assume ~20% protein in meat → ~100 g protein.

Average amino acid ~100 g/mol → ~1 mol amino acids → ~6 × 10²³ amino-acid molecules.

(If you assumed 500 g pure protein: ~5 mol → ~3 × 10²⁴ molecules.)
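
The estimate above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: the 20% protein fraction and the 100 g/mol average residue mass are the same rough assumptions used in the answer, and the function name is illustrative.

```python
AVOGADRO = 6.022e23  # molecules per mole

def amino_acid_molecules(meat_g, protein_fraction=0.20, residue_mass_g_per_mol=100.0):
    """Estimate the number of amino-acid molecules in a portion of meat."""
    protein_g = meat_g * protein_fraction          # grams of protein
    moles = protein_g / residue_mass_g_per_mol     # moles of amino acids
    return moles * AVOGADRO

n_meat = amino_acid_molecules(500)                        # ~20% protein assumption
n_pure = amino_acid_molecules(500, protein_fraction=1.0)  # pure-protein upper bound
print(f"{n_meat:.1e} vs {n_pure:.1e} molecules")
```

Changing `protein_fraction` shows how sensitive the answer is to the assumed protein content; the order of magnitude (10²³ to 10²⁴) does not change.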

2) Why do humans eat beef but do not become a cow, eat fish but do not become fish?

We digest dietary proteins into amino acids, then rebuild our own proteins as directed by our own genome and its regulation, so the animal’s protein sequences are not carried over.

3) Why are there only 20 natural amino acids?

They’re an evolutionarily “good enough” set that balances:

- chemical diversity (charge, polarity, size)
- genetic simplicity (codons, translation machinery)
- error tolerance (mutations still often functional)

5) Where did amino acids come from before enzymes and before life started?

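Likely abiotic chemistry: amino acids can form in prebiotic reactions (e.g., Miller–Urey-type experiments, in which simple gas mixtures plus an energy source such as electric discharge yield amino acids), and they were probably also delivered by meteorites and produced in geochemical environments.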

6) If you make an α-helix using D-amino acids, what handedness (right or left)?

A helix made from D-amino acids tends to form a left-handed α-helix (mirror image of the common L-amino acid right-handed helix).

8) Why are most molecular helices right-handed?

Because proteins are built mostly from L-amino acids, and the geometry/sterics of L-amino acids makes the right-handed α-helix energetically favored.

9) Why do β-sheets tend to aggregate?

β-strands expose backbone H-bond donors/acceptors and form flat, stackable surfaces, so they can “zipper” together with other strands.

10) What is the driving force for β-sheet aggregation?

Main drivers:

- backbone hydrogen bonding between strands
- hydrophobic packing and shape complementarity (“steric zipper”-like packing)
- reduced exposure of unfavorable surfaces to water

12) Can you use amyloid β-sheets as materials?

Yes — amyloid-like β-sheet assemblies can be strong, stable, and fiber-forming, useful for biomaterials (scaffolds, hydrogels, nanofibers), if engineered for safety and controllability.

Part B: Protein Analysis and Visualization

Protein selected

CRISPR-associated endonuclease Cas9 (SpCas9) from Streptococcus pyogenes

Why: It’s a “programmable” DNA-cutting protein used in CRISPR editing, and it was referenced in lecture as a dynamic, multi-domain molecular machine.

1) Identify the amino acid sequence

Use the RCSB FASTA for PDB: 4OO8 (Cas9 is Chains A/D; the FASTA also includes the guide RNA + target DNA).

2) How long is it? What is the most frequent amino acid?

Length (PDB chain): 1372 aa (the crystallized construct includes a few extra residues relative to the canonical 1368-aa UniProt sequence).
Most frequent amino acid: K (Lysine) (Cas9 is highly basic—fits its nucleic-acid-binding role).
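
Length and residue frequencies like these can be checked directly from the sequence string. The sketch below assumes the chain sequence has been pasted in from the FASTA; the `toy` sequence is a made-up placeholder (not Cas9), and the function name is illustrative.

```python
from collections import Counter

def composition_summary(seq):
    """Length, most frequent residue, and counts of basic/acidic residues."""
    seq = seq.upper()
    counts = Counter(seq)
    residue, n = counts.most_common(1)[0]
    return {
        "length": len(seq),
        "most_common": residue,
        "most_common_count": n,
        "basic_KR": counts["K"] + counts["R"],   # positively charged side chains
        "acidic_DE": counts["D"] + counts["E"],  # negatively charged side chains
    }

toy = "MKKRSDKLEKKAGKRE"  # hypothetical toy sequence, NOT the Cas9 chain
summary = composition_summary(toy)
print(summary)
```

A clear excess of K+R over D+E is what "highly basic" means in practice; running the same function on the real chain A sequence would test that statement for Cas9.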

3) How many protein sequence homologs are there?

A quick “homolog abundance” proxy: UniRef50_Q99ZW2 cluster = 1,161 members (proteins clustered at ≥50% identity).

4) Does it belong to any protein family?

Yes — Cas9/Csn1 family (CRISPR-Cas Type II effector nuclease), with domains including RuvC-like and HNH nuclease components.

5) Structure page in RCSB

RCSB PDB ID: 4OO8 (“Cas9 in complex with guide RNA and target DNA”).

6) When solved? Is it good quality?

Deposition: 31 Jan 2014; Released: 26 Feb 2014
Resolution: 2.5 Å → generally good for a large protein–nucleic acid complex (a lower value in Å means finer detail).

7) Other molecules besides protein?

Yes:

Guide RNA (97-mer) and target DNA are included in the structure (multiple nucleotide chains).

8) Does it belong to any structure-classification family?

Yes — SCOPe domain classifications include:

- Ribonuclease H-like / RuvC-like domain
- His-Me finger endonucleases / HNH domain
- Cas9 REC lobe
- Cas9 PAM-interacting (PI) domain

9) 3D visualization (PyMOL) — commands you can copy

Load 4OO8, then:

### Color by secondary structure (helices vs sheets)

    color yellow, ss h
    color magenta, ss s
    color gray70, ss l+''

**Interpretation:** SpCas9 is helix-rich overall (especially in the REC lobe), with fewer β-sheet regions.

### Color by residue type (hydrophobic vs hydrophilic/charged)

    color tv_green, resn ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO
    color tv_blue, resn LYS+ARG+HIS
    color tv_red, resn ASP+GLU
    color tv_orange, resn SER+THR+ASN+GLN+TYR+CYS+GLY

**Interpretation:** You’ll see many basic residues (K/R) on surfaces/grooves that contact RNA/DNA (nucleic acids are negatively charged).

### Surface + “holes” / pockets

    hide sticks
    show surface
    set transparency, 0.35

**Interpretation:** You should observe large **grooves/channels** where the sgRNA:DNA hybrid sits (a major binding “pocket-like” interface).

# Part C. Using ML-Based Protein Design Tools

**Protein: GFP (PDB: 1EMA)**

## C1. Protein Language Modeling

### Deep Mutational Scan (ESM2)

I generated an unsupervised deep mutational scan using ESM2 likelihood scores.

**Observed pattern:**

- Strong mutation penalties in:
    - The **chromophore-forming residues (Ser65, Tyr66, Gly67)**
    - The **β-barrel core**
- More tolerated mutations in:
    - Surface loops
    - Solvent-exposed residues

**Standout residue: Tyr66**

- Mutations at Tyr66 were highly penalized.
- Interpretation: Tyr66 is essential for chromophore formation, so substituting it is expected to abolish fluorescence.

**General trend:**

- Buried hydrophobic residues → low tolerance
- Surface polar residues → higher tolerance
    
    This matches structural expectations.
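
One common way to turn language-model likelihoods into a mutational scan is to score each substitution as log p(mutant) minus log p(wild type) at that position. The sketch below assumes that scoring rule (the notebook may use a different variant) and replaces real ESM2 output with hand-made toy log-probabilities; `toy_log_probs` and the three-residue sequence are purely illustrative and use their own 1-based numbering, not GFP numbering.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutation_scores(wt_seq, log_probs):
    """log_probs[i][aa] is the model log-probability of residue aa at position i.
    Returns {'Y2A': score, ...} with score = log p(mut) - log p(wt)."""
    scores = {}
    for i, wt in enumerate(wt_seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                scores[f"{wt}{i + 1}{aa}"] = log_probs[i][aa] - log_probs[i][wt]
    return scores

def position_tolerance(wt_seq, scores):
    """Mean score per position; more negative means less tolerant of mutation."""
    means = []
    for i, wt in enumerate(wt_seq):
        vals = [scores[f"{wt}{i + 1}{aa}"] for aa in AMINO_ACIDS if aa != wt]
        means.append(sum(vals) / len(vals))
    return means

def toy_log_probs(wt, p_wt):
    # Hypothetical stand-in for model output: wild type gets p_wt, the rest is uniform.
    rest = (1.0 - p_wt) / 19
    return {aa: math.log(p_wt if aa == wt else rest) for aa in AMINO_ACIDS}

wt_seq = "SYG"  # toy sequence; position 2 plays the role of a constrained residue
log_probs = [toy_log_probs("S", 0.30), toy_log_probs("Y", 0.95), toy_log_probs("G", 0.60)]
scores = mutation_scores(wt_seq, log_probs)
tolerance = position_tolerance(wt_seq, scores)
print(min(scores, key=scores.get), tolerance)
```

Positions where the model is very confident in the wild-type residue get strongly negative scores for every substitution, which is the pattern described above for the chromophore and the barrel core.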
    

### Latent Space Analysis

I embedded protein sequences using the notebook’s dimensionality reduction.

**Observed clustering:**

- GFP clustered with other fluorescent proteins (e.g., YFP-like proteins).
- Nearby sequences showed similar β-barrel architecture.

**Interpretation:**

The latent space preserves structural family relationships — proteins with similar folds cluster together even without explicit structural input.

## C2. Protein Folding

### Folding GFP with ESMFold

I folded the wild-type GFP sequence using ESMFold.

**Result:**

- Predicted structure closely matched the experimental PDB.
- Superposition on the experimental structure shows a nearly identical β-barrel fold, consistent with a low RMSD (judged visually).
- High confidence (pLDDT high across barrel).
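
To put a number on "high confidence", the per-residue pLDDT can be read from the B-factor column of the predicted PDB file, which is where ESMFold-style outputs usually store it. The sketch below is written under that assumption; the scale (0 to 100 or 0 to 1) depends on the front end used, the 70 threshold follows the common AlphaFold-style convention, and `_atom_line` is a hypothetical helper that only builds demo records.

```python
def ca_plddt(pdb_text):
    """Per-residue pLDDT values taken from the B-factor field of CA atoms."""
    values = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))  # PDB columns 61-66: temperature factor
    return values

def summarize_plddt(values, threshold=70.0):
    """Mean pLDDT and the fraction of residues at or above the threshold."""
    mean = sum(values) / len(values)
    frac_confident = sum(v >= threshold for v in values) / len(values)
    return mean, frac_confident

def _atom_line(serial, name, res, resseq, b):
    # Hypothetical helper: writes a fixed-column ATOM record for the demo only.
    return (f"ATOM  {serial:5d} {name:<4s} {res:>3s} A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b:6.2f}")

demo_pdb = "\n".join([
    _atom_line(1, " N  ", "MET", 1, 10.0),  # non-CA atom, ignored
    _atom_line(2, " CA ", "MET", 1, 92.5),
    _atom_line(3, " CA ", "SER", 2, 88.0),
    _atom_line(4, " CA ", "LYS", 3, 55.5),
])
plddt = ca_plddt(demo_pdb)
mean_plddt, frac_confident = summarize_plddt(plddt)
print(round(mean_plddt, 1), round(frac_confident, 2))
```

Reporting the mean over barrel residues separately from loops and termini would make the comparison with the experimental structure more specific.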

**Interpretation:**

Language models capture fold-level constraints very well for well-represented proteins.

### Mutation Experiments

1. **Single mutation (Y66A):**
    - Structure remained β-barrel.
    - Confidence slightly reduced locally.
    - Fold resilient to single mutation.
2. **Multiple core mutations:**
    - Confidence dropped.
    - Barrel distortion observed.
3. **Large segment swap (loop region):**
    - Fold mostly preserved.
    - Flexible regions tolerate change.

**Conclusion:**

- GFP fold is structurally robust.
- Core mutations affect stability more than surface changes.
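
Mutant sequences for experiments like these are easy to get wrong by one position, so a small helper that checks the wild-type residue before substituting is useful. This is a minimal sketch; the function names and the five-residue toy sequence are illustrative, and positions are 1-based as in the usual `Y66A` notation.

```python
import re

def apply_mutation(seq, mutation):
    """Apply a point mutation written as <wt><1-based position><mutant>, e.g. 'Y66A'."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if m is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wt, pos, mut = m.group(1), int(m.group(2)), m.group(3)
    if not 1 <= pos <= len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    if seq[pos - 1] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + mut + seq[pos:]

def apply_mutations(seq, mutations):
    """Apply several point mutations in turn (positions refer to the original numbering)."""
    for mutation in mutations:
        seq = apply_mutation(seq, mutation)
    return seq

toy_wt = "MSYGV"  # hypothetical toy sequence, not GFP
single = apply_mutation(toy_wt, "Y3A")
double = apply_mutations(toy_wt, ["S2T", "G4A"])
print(single, double)
```

Because point substitutions never change the length, applying them one after another keeps the original numbering valid; insertions or segment swaps would need separate handling.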

## C3. Protein Generation

### Inverse Folding with ProteinMPNN

Using the GFP backbone, I generated new candidate sequences.

**Observations:**

- Many predicted residues matched original hydrophobic core.
- Greater variability at surface residues.
- Chromophore region showed strong conservation.

**Interpretation:**

ProteinMPNN respects backbone constraints and preserves structurally critical positions.
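
The observations above can be quantified as sequence recovery: the fraction of positions where the designed sequence matches the native one, optionally restricted to a subset such as core or surface positions. The sketch below assumes equal-length, already aligned sequences; the two toy sequences and the position lists are invented for illustration.

```python
def sequence_recovery(native, designed, positions=None):
    """Fraction of identical residues; positions are 1-based and optional."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if positions is None:
        idx = list(range(len(native)))
    else:
        idx = [p - 1 for p in positions]
    if not idx:
        raise ValueError("no positions to compare")
    matches = sum(native[i] == designed[i] for i in idx)
    return matches / len(idx)

native = "MVLIKDEG"    # hypothetical native fragment
designed = "MVLIRNEG"  # hypothetical designed fragment
overall = sequence_recovery(native, designed)
core = sequence_recovery(native, designed, positions=[2, 3, 4])
surface = sequence_recovery(native, designed, positions=[5, 6])
print(overall, core, surface)
```

Comparing recovery for buried versus exposed positions is a direct test of the claim that the core is retained while the surface varies.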

### Refolding Generated Sequence (ESMFold)

I input the generated sequence into ESMFold.

**Result:**

- Predicted structure retained β-barrel fold.
- Overall topology preserved.
- Minor variations in loop regions.

**Conclusion:**

Backbone-conditioned design successfully produced alternative sequences that fold into similar structures.

# Overall Reflection

Modern protein AI models:

- Capture evolutionary constraints (ESM2 mutational scan)
- Preserve fold topology (ESMFold)
- Generate viable sequences (ProteinMPNN)
- Show structural robustness to moderate mutation

However:

- Functional residues (like GFP chromophore positions) remain highly constrained.
- Multiple core mutations reduce predicted fold confidence and distort the barrel.
- These models predict fold more reliably than fine functional effects.

# Part D — Group Brainstorm Proposal: Engineering MS2 L Lysis Protein (“L”)

## Team + Goal Choice

**Group size:** 3–4

**Primary goal (computationally tractable):** **Increased stability** (easiest)

**Secondary goal (stretch):** Disrupt interaction with E. coli DnaJ (medium–hard, but we can attempt)

Rationale: The reading suggests many nonfunctional missense mutants cluster in/around a conserved LS motif and may affect heterotypic protein–protein interaction (not membrane association). That makes this a good target for in silico mutagenesis + interaction modeling.

## Hypothesis

1. L has a minimal structural motif (likely helical / amphipathic) whose stability and interaction surfaces determine lysis activity.
2. Mutations that preserve the conserved “domain character” (charge/hydrophobicity/helical propensity) can increase stability without killing function.
3. If DnaJ dependency arises from a specific binding interface or chaperone-recognition signature, we may be able to reduce DnaJ binding while keeping lysis.

## Tools & Approaches We’ll Use (from recitation list)

**Sequence + homology**

- **BLAST (UniProt/NCBI)**: find L homologs / related ssRNA phage lysis proteins
- **Clustal Omega**: MSA to quantify conservation (esp. around LS and C-terminal half)
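
One simple way to quantify conservation from an MSA is the Shannon entropy of each alignment column (0 bits means fully conserved). The sketch below assumes the aligned sequences are already available as equal-length strings with `-` for gaps, for example parsed from Clustal Omega output; the four-sequence alignment is a toy example, not real phage data.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return None  # all-gap column: no information
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def msa_entropy(aligned_seqs):
    """Per-column entropy for a list of equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    return [column_entropy([s[i] for s in aligned_seqs]) for i in range(length)]

toy_msa = ["MLSA", "MLSG", "MISK", "M-ST"]  # hypothetical alignment
entropies = msa_entropy(toy_msa)
print([round(e, 2) for e in entropies])
```

Low-entropy columns would be candidates to leave untouched, while high-entropy columns are more natural places to try stabilizing substitutions.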

**Protein Language Models**

- **ESM2 / ESM3**:
    - deep mutational scan (likelihood-based) to flag destabilizing vs tolerated mutations
    - identify “fragile” residues vs flexible positions

**Structure / interaction modeling**

- **ESMFold**: fast single-chain fold predictions for WT and mutants (even if low confidence—use comparatively)
- **AlphaFold-Multimer or Boltz-1**: test L–DnaJ co-folding hypotheses (if we can assemble DnaJ structure + short peptide)
- **FoldSeek**: search for structural analogs of predicted L fold (even if sequence is tiny)

**Sequence design**

- **ProteinMPNN (inverse folding)**: propose stabilized L sequence variants that maintain backbone constraints (using predicted backbone as template)

## Proposed Pipeline (schematic)

1. **Gather sequences:** MS2 L + ssRNA phage lysis homologs
2. **MSA + conservation mapping (Clustal Omega)** -> conserved LS region + C-term hotspots
3. **In silico mutagenesis screen (ESM2 deep mutational scan)** -> ranked substitutions per position, feeding two parallel branches:
    - **(4A) Stability-first design:** pick tolerated mutations; preserve charge/hydrophobicity; increase helix propensity
    - **(4B) DnaJ disruption (stretch):** predict L–DnaJ interface; avoid chaperone-like motifs; propose interface mutations
4. **Structure check (ESMFold):** WT vs mutants, fold retention / confidence
5. **Sequence generation (optional refinement, ProteinMPNN):** generate candidates -> refold with ESMFold
6. **Prioritize 5–10 variants for wet lab:** top stability scores + minimal structural disruption + (stretch) reduced predicted DnaJ interaction

## What “Success” Looks Like (computational)

**Stability goal**

- Mutations predicted as “tolerated” by ESM2 + preserve key conserved region character
- ESMFold predicts similar fold/topology across mutants (relative, not absolute)
- Candidates preserve amphipathic pattern / charge distribution consistent with known lysis proteins
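
Whether a candidate keeps an amphipathic pattern can be checked with a helical hydrophobic moment: each residue's hydropathy is treated as a vector rotated 100° per position, and a large vector sum means hydrophobic residues cluster on one face. The sketch below uses the Kyte-Doolittle scale rather than the Eisenberg consensus scale of the original definition, so absolute values are not comparable with published numbers; the test sequences are synthetic.

```python
import math

# Kyte-Doolittle hydropathy values
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(seq, angle_deg=100.0):
    """Mean hydrophobic moment assuming an ideal helix (100 degrees per residue)."""
    x = y = 0.0
    for i, aa in enumerate(seq):
        theta = math.radians(i * angle_deg)
        h = KYTE_DOOLITTLE[aa]
        x += h * math.cos(theta)
        y += h * math.sin(theta)
    return math.hypot(x, y) / len(seq)

# Synthetic 18-mers: Leu on one helical face and Lys on the other, vs all Leu.
amphipathic = "".join(
    "L" if math.cos(math.radians(100 * i)) > 0 else "K" for i in range(18)
)
uniform = "L" * 18
mu_amphipathic = hydrophobic_moment(amphipathic)
mu_uniform = hydrophobic_moment(uniform)
print(amphipathic, round(mu_amphipathic, 2), round(mu_uniform, 2))
```

For L variants, the useful comparison is relative: a mutant whose moment over the putative helical region drops sharply compared with wild type has probably lost its amphipathic character.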

**DnaJ-disruption goal (stretch)**

- AlphaFold-Multimer / Boltz-1 predicts reduced or altered L–DnaJ contact (lower interface confidence / fewer stable contacts)
- Mutations do **not** severely destabilize L in ESMFold
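
"Fewer stable contacts" can be made concrete by counting residue pairs across the two chains that fall within a distance cutoff in the predicted complex. The sketch below assumes one representative coordinate per residue (for example Cα) has already been extracted from the model; the 8 Å cutoff is a common but arbitrary choice, and all coordinates shown are invented.

```python
import math

def interface_contacts(coords_a, coords_b, cutoff=8.0):
    """Residue pairs (one from each chain) whose coordinates lie within cutoff (in Å).
    coords_* map residue number -> (x, y, z)."""
    contacts = []
    for res_a, pos_a in coords_a.items():
        for res_b, pos_b in coords_b.items():
            if math.dist(pos_a, pos_b) <= cutoff:
                contacts.append((res_a, res_b))
    return contacts

# Hypothetical coordinates for a short stretch of L and two DnaJ residues.
l_chain = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 3: (7.6, 0.0, 0.0)}
dnaj_wt = {10: (0.0, 6.0, 0.0), 11: (20.0, 0.0, 0.0)}
dnaj_mutant_model = {10: (0.0, 9.0, 0.0), 11: (20.0, 0.0, 0.0)}  # partner shifted away

wt_contacts = interface_contacts(l_chain, dnaj_wt)
mutant_contacts = interface_contacts(l_chain, dnaj_mutant_model)
print(len(wt_contacts), len(mutant_contacts))
```

Contact counts from a single predicted model are noisy, so comparing several models or seeds per variant would give a more trustworthy ranking.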

## Potential Pitfalls

- **Very small protein (75 aa)** → folding predictions may be low confidence; results may be noisy.
- **DnaJ binding may be indirect** or context-dependent (membrane/protein complex), so co-folding might not reflect reality.
- Training data bias: PLMs and structure models may not generalize well to **short, membrane-associated, disordered/peptide-like proteins**.
- Functional readout (lysis) may depend on expression level/timing rather than simple stability alone.

## Roles

- Person A: BLAST + collect homolog sequences + MSA
- Person B: ESM2 deep mutational scan + mutation shortlist
- Person C: ESMFold for WT/mutants + structural comparison screenshots
- Person D (optional): AlphaFold-Multimer/Boltz-1 with DnaJ + interface exploration