Week 4 HW: Protein Design Part 1

Part A. Conceptual Questions

1) How many molecules of amino acids do you take with 500 g of meat? (~100 Da per amino acid)

Assume ~20% protein in meat → ~100 g protein.

Average amino acid ~100 g/mol → ~1 mol amino acids → ~6 × 10²³ amino-acid molecules.

(If you assumed 500 g pure protein: ~5 mol → ~3 × 10²⁴ molecules.)
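
A quick check of the arithmetic above, as a minimal sketch. The 20% protein fraction and the ~100 g/mol average are the assumptions already stated; the function name is only for illustration.

```python
AVOGADRO = 6.022e23  # molecules per mole

def amino_acid_molecules(meat_g, protein_fraction=0.20, residue_mass_g_per_mol=100.0):
    """Estimate the number of amino-acid molecules in a portion of meat."""
    protein_g = meat_g * protein_fraction          # grams of protein
    moles = protein_g / residue_mass_g_per_mol     # moles of amino acids
    return moles * AVOGADRO

print(f"{amino_acid_molecules(500):.1e}")       # 20% protein assumption -> 6.0e+23
print(f"{amino_acid_molecules(500, 1.0):.1e}")  # pure-protein assumption -> 3.0e+24
```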

2) Why do humans eat beef but do not become a cow, eat fish but do not become fish?

We digest dietary proteins into amino acids, then build our own proteins from them as directed by our own genome and its regulation. The amino acids carry no information about the animal's sequences.

3) Why are there only 20 natural amino acids?

They’re an evolutionarily “good enough” set that balances:

  • chemical diversity (charge, polarity, size)
  • genetic simplicity (codons, translation machinery)
  • error tolerance (mutations still often functional)

5) Where did amino acids come from before enzymes and before life started?

Likely abiotic chemistry: amino acids can form via prebiotic reactions (e.g., simple gas mixtures + an energy source, as in Miller–Urey-type experiments), and may also have been delivered by meteorites or produced in geochemical environments.

6) If you make an α-helix using D-amino acids, what handedness (right or left)?

A helix made from D-amino acids tends to form a left-handed α-helix (mirror image of the common L-amino acid right-handed helix).

8) Why are most molecular helices right-handed?

Because proteins are built mostly from L-amino acids, and the geometry/sterics of L-amino acids makes the right-handed α-helix energetically favored.

9) Why do β-sheets tend to aggregate?

β-strands expose backbone H-bond donors/acceptors and form flat, stackable surfaces, so they can “zipper” together with other strands.

10) What is the driving force for β-sheet aggregation?

Main drivers:

  • backbone hydrogen bonding between strands
  • hydrophobic packing and shape complementarity (“steric zipper”-like packing)
  • reduced exposure of unfavorable surfaces to water

12) Can you use amyloid β-sheets as materials?

Yes — amyloid-like β-sheet assemblies can be strong, stable, and fiber-forming, useful for biomaterials (scaffolds, hydrogels, nanofibers), if engineered for safety and controllability.

Part B: Protein Analysis and Visualization

Protein selected

CRISPR-associated endonuclease Cas9 (SpCas9) from Streptococcus pyogenes

Why: It’s a “programmable” DNA-cutting protein used in CRISPR editing, and it was explicitly referenced in the lecture context as a dynamic, multi-domain molecular machine.

1) Identify the amino acid sequence

Use the RCSB FASTA for PDB: 4OO8 (Cas9 is Chains A/D; the FASTA also includes the guide RNA + target DNA).

2) How long is it? What is the most frequent amino acid?

  • Length (PDB chain): 1372 aa (this structure construct includes a few extra residues vs the canonical UniProt length).
  • Most frequent amino acid: K (Lysine) (Cas9 is highly basic—fits its nucleic-acid-binding role).
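
The length and most frequent residue can be read off the FASTA with a few lines of Python. This is a minimal sketch: `parse_fasta` and `composition` are illustrative helper names, and the toy record below is a placeholder, not the real 4OO8 sequence.

```python
from collections import Counter

def parse_fasta(text):
    """Return {header: sequence} for a FASTA-formatted string."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

def composition(seq):
    """Return (length, most frequent residue, its fraction of the sequence)."""
    counts = Counter(seq)
    residue, n = counts.most_common(1)[0]
    return len(seq), residue, n / len(seq)

# Placeholder record; paste the downloaded RCSB FASTA text here instead.
toy_fasta = """>toy_chain
MKKLA
KEKRK
"""

for header, seq in parse_fasta(toy_fasta).items():
    print(header, composition(seq))
```

For the real file, the Cas9 record would be picked out by its header (the protein chains rather than the RNA/DNA chains).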

3) How many protein sequence homologs are there?

A quick “homolog abundance” proxy: UniRef50_Q99ZW2 cluster = 1,161 members (proteins clustered at ≥50% identity).

4) Does it belong to any protein family?

Yes — Cas9/Csn1 family (CRISPR-Cas Type II effector nuclease), with domains including RuvC-like and HNH nuclease components.

5) Structure page in RCSB

RCSB PDB ID: 4OO8 (“Cas9 in complex with guide RNA and target DNA”).

6) When solved? Is it good quality?

  • Deposition: 31 Jan 2014; Released: 26 Feb 2014
  • Resolution: 2.5 Å → generally good (lower Å = better).

7) Other molecules besides protein?

Yes:

  • Guide RNA (97-mer) and target DNA are included in the structure (multiple nucleotide chains).

8) Does it belong to any structure-classification family?

Yes — SCOPe domain classifications include:

  • Ribonuclease H-like / RuvC-like domain
  • His-Me finger endonucleases / HNH domain
  • Cas9 REC lobe
  • Cas9 PAM-interacting (PI) domain

9) 3D visualization (PyMOL) — commands you can copy

Load 4OO8, then:

### Color by secondary structure (helices vs sheets)

    color yellow, ss h
    color magenta, ss s
    color gray70, ss l+''

**Interpretation:** SpCas9 is helix-rich overall (especially in the REC lobe), with fewer β-sheet regions.

### Color by residue type (hydrophobic vs hydrophilic/charged)

    color tv_green, resn ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO
    color tv_blue, resn LYS+ARG+HIS
    color tv_red, resn ASP+GLU
    color tv_orange, resn SER+THR+ASN+GLN+TYR+CYS+GLY

**Interpretation:** You’ll see many basic residues (K/R) on surfaces/grooves that contact RNA/DNA (nucleic acids are negatively charged).

### Surface + “holes” / pockets

    hide sticks
    show surface
    set transparency, 0.35

**Interpretation:** You should observe large **grooves/channels** where the sgRNA:DNA hybrid sits (a major binding “pocket-like” interface).

# Part C. Using ML-Based Protein Design Tools

**Protein: GFP (PDB: 1EMA)**

## C1. Protein Language Modeling

### Deep Mutational Scan (ESM2)

I generated an unsupervised deep mutational scan using ESM2 likelihood scores.
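
One common way to turn language-model likelihoods into a mutational scan is to score each substitution as log p(mutant) − log p(wild type) at that position. The sketch below assumes the per-position probabilities are already available; the numbers are made up for illustration and are not ESM2 outputs, and I am assuming (not confirming) that the notebook uses a score of this form.

```python
import math

def mutation_scores(wt_seq, probs):
    """Score every substitution as log p(mut) - log p(wt) at each position.

    probs[i] maps amino acid -> model probability at position i (0-based list,
    reported with 1-based mutation names such as 'Y2A').
    """
    scores = {}
    for i, (wt, dist) in enumerate(zip(wt_seq, probs), start=1):
        for aa, p in dist.items():
            if aa != wt:
                scores[f"{wt}{i}{aa}"] = math.log(p) - math.log(dist[wt])
    return scores

# Toy 3-residue window with invented probabilities (hypothetical values).
toy_seq = "SYG"
toy_probs = [
    {"S": 0.60, "T": 0.30, "A": 0.10},
    {"Y": 0.90, "F": 0.05, "A": 0.05},
    {"G": 0.95, "A": 0.05},
]

scores = mutation_scores(toy_seq, toy_probs)
# Most negative scores first: these are the most strongly penalized mutations.
for mut, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(mut, round(s, 2))
```

More negative scores mean the model considers the substitution less likely than the wild-type residue.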

**Observed pattern:**

- Strong mutation penalties in:
    - The **chromophore-forming residues (Ser65, Tyr66, Gly67)**
    - The **β-barrel core**
- More tolerated mutations in:
    - Surface loops
    - Solvent-exposed residues

**Standout residue: Tyr66**

- Mutations at Tyr66 were highly penalized.
- Interpretation: Tyr66 is part of the chromophore and is essential for fluorescence, so substitutions there are expected to impair function.

**General trend:**

- Buried hydrophobic residues → low tolerance
- Surface polar residues → higher tolerance
    
This matches structural expectations.
    

### Latent Space Analysis

I embedded protein sequences using the notebook’s dimensionality reduction.

**Observed clustering:**

- GFP clustered with other fluorescent proteins (e.g., YFP-like proteins).
- Nearby sequences showed similar β-barrel architecture.

**Interpretation:**

The latent space preserves structural family relationships — proteins with similar folds cluster together even without explicit structural input.

## C2. Protein Folding

### Folding GFP with ESMFold

I folded the wild-type GFP sequence using ESMFold.

**Result:**

- Predicted structure closely matched the experimental PDB.
- RMSD appeared low, judged by visual comparison (the overlay shows a nearly identical β-barrel fold).
- High confidence (pLDDT high across barrel).
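
The visual comparison could be backed by a number. The sketch below computes a plain coordinate RMSD and assumes the two structures have already been superposed (for example with PyMOL's `align` or `super`) and that the Cα atoms are paired one-to-one. The coordinates shown are invented placeholders.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired, pre-superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Placeholder C-alpha coordinates in angstroms (hypothetical values).
predicted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
experimental = [(0.0, 0.5, 0.0), (3.8, 0.5, 0.0), (7.6, 0.5, 0.0)]
print(f"RMSD = {rmsd(predicted, experimental):.2f} A")
```

Note that this function does no superposition itself; without a prior alignment step the value would mostly reflect the different coordinate frames.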

**Interpretation:**

Language models capture fold-level constraints very well for well-represented proteins.

### Mutation Experiments

1. **Single mutation (Y66A):**
    - Structure remained β-barrel.
    - Confidence slightly reduced locally.
    - Fold resilient to single mutation.
2. **Multiple core mutations:**
    - Confidence dropped.
    - Barrel distortion observed.
3. **Large segment swap (loop region):**
    - Fold mostly preserved.
    - Flexible regions tolerate change.
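
Mutant sequences for refolding can be built programmatically, which avoids off-by-one errors when editing by hand. A minimal sketch: `apply_mutation` is an illustrative helper, and the toy sequence is a placeholder rather than the GFP sequence.

```python
import re

def apply_mutation(seq, mutation):
    """Apply a point mutation written like 'Y66A' (1-based position)."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if m is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wt, pos, new = m.group(1), int(m.group(2)), m.group(3)
    if not 1 <= pos <= len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    if seq[pos - 1] != wt:
        # Guards against numbering mismatches between the file and the literature.
        raise ValueError(f"expected {wt} at {pos}, found {seq[pos - 1]}")
    return seq[: pos - 1] + new + seq[pos:]

toy_seq = "MSYGV"  # placeholder sequence (hypothetical)
print(apply_mutation(toy_seq, "Y3A"))
```

The wild-type check matters in practice: PDB construct numbering and literature numbering do not always agree, so a mismatch should fail loudly instead of silently mutating the wrong residue.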

**Conclusion:**

- GFP fold is structurally robust.
- Core mutations affect stability more than surface changes.

## C3. Protein Generation

### Inverse Folding with ProteinMPNN

Using the GFP backbone, I generated new candidate sequences.

**Observations:**

- Many predicted residues matched original hydrophobic core.
- Greater variability at surface residues.
- Chromophore region showed strong conservation.
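
These observations can be quantified as sequence recovery: the fraction of positions where the designed sequence matches the native one, either overall or restricted to a subset such as core positions. A minimal sketch with placeholder sequences; the helper name and the example positions are hypothetical.

```python
def sequence_recovery(native, designed, positions=None):
    """Fraction of identical residues, optionally limited to 1-based positions."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if positions is None:
        idx = list(range(len(native)))
    else:
        idx = [p - 1 for p in positions]
    if not idx:
        raise ValueError("no positions to compare")
    matches = sum(native[i] == designed[i] for i in idx)
    return matches / len(idx)

native = "ACDEFG"      # placeholder native sequence
designed = "ACDKFG"    # placeholder designed sequence
print(sequence_recovery(native, designed))                # overall recovery
print(sequence_recovery(native, designed, [1, 2, 3]))     # chosen subset only
```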

**Interpretation:**

ProteinMPNN respects backbone constraints and preserves structurally critical positions.

### Refolding Generated Sequence (ESMFold)

I input the generated sequence into ESMFold.

**Result:**

- Predicted structure retained β-barrel fold.
- Overall topology preserved.
- Minor variations in loop regions.

**Conclusion:**

Backbone-conditioned design successfully produced alternative sequences that fold into similar structures.

# Overall Reflection

Modern protein AI models:

- Capture evolutionary constraints (ESM2 mutational scan)
- Preserve fold topology (ESMFold)
- Generate viable sequences (ProteinMPNN)
- Show structural robustness to moderate mutation

However:

- Functional residues (like GFP chromophore positions) remain highly constrained.
- Multiple core mutations destabilize the fold, while loop changes are better tolerated.
- These models predict structure more reliably than fine functional effects.

# Part D — Group Brainstorm Proposal: Engineering MS2 L Lysis Protein (“L”)

## Team + Goal Choice

**Group size:** 3–4

**Primary goal (computationally tractable):** **Increased stability** (easiest)

**Secondary goal (stretch):** Disrupt interaction with E. coli DnaJ (medium–hard, but we can attempt)

Rationale: The reading suggests many nonfunctional missense mutants cluster in/around a conserved LS motif and may affect heterotypic protein–protein interaction (not membrane association). That makes this a good target for in silico mutagenesis + interaction modeling.

## Hypothesis

1. L has a minimal structural motif (likely helical / amphipathic) whose stability and interaction surfaces determine lysis activity.
2. Mutations that preserve the conserved “domain character” (charge/hydrophobicity/helical propensity) can increase stability without killing function.
3. If DnaJ dependency arises from a specific binding interface or chaperone-recognition signature, we may be able to reduce DnaJ binding while keeping lysis.

## Tools & Approaches We’ll Use (from recitation list)

**Sequence + homology**

- **BLAST (UniProt/NCBI)**: find L homologs / related ssRNA phage lysis proteins
- **Clustal Omega**: MSA to quantify conservation (esp. around LS and C-terminal half)
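
One simple way to quantify conservation from the MSA is per-column Shannon entropy (0 bits = fully conserved). The sketch below assumes the alignment is already loaded as equal-length strings; the toy alignment is invented and the function names are illustrative.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return sum(-(k / n) * math.log2(k / n) for k in counts.values())

def conservation_profile(aligned):
    """Entropy for every column of a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_entropy(col) for col in zip(*aligned)]

# Toy alignment (hypothetical sequences, not real L homologs).
toy_alignment = ["LSA", "LSG", "LTA", "LT-"]
for pos, h in enumerate(conservation_profile(toy_alignment), start=1):
    print(pos, round(h, 3))
```

Low-entropy columns would be the candidates to leave untouched, such as the conserved LS region.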

**Protein Language Models**

- **ESM2 / ESM-3**:
    - deep mutational scan (likelihood-based) to flag destabilizing vs tolerated mutations
    - identify “fragile” residues vs flexible positions

**Structure / interaction modeling**

- **ESMFold**: fast single-chain fold predictions for WT and mutants (even if low confidence—use comparatively)
- **AlphaFold-Multimer or Boltz-1**: test L–DnaJ co-folding hypotheses (if we can assemble DnaJ structure + short peptide)
- **FoldSeek**: search for structural analogs of predicted L fold (even if sequence is tiny)

**Sequence design**

- **ProteinMPNN (inverse folding)**: propose stabilized L sequence variants that maintain backbone constraints (using predicted backbone as template)

## Proposed Pipeline (schematic)

1. **Gather sequences:** MS2 L + ssRNA phage lysis homologs
2. **MSA + conservation mapping (Clustal Omega)** -> conserved LS region + C-term hotspots
3. **In silico mutagenesis screen (ESM2 deep mut scan)** -> ranked substitutions per position
4. **Design (two parallel branches):**
    - **(4A) Stability-first design**
        - pick tolerated muts
        - preserve charge/hydrophob
        - increase helix propensity
    - **(4B) DnaJ-disruption (stretch)**
        - predict L–DnaJ interface
        - avoid chaperone-like motifs
        - propose interface mutations
5. **Structure check (ESMFold):** WT vs mutants, fold retention / confidence
6. **Sequence generation (optional refinement, ProteinMPNN):** generate candidates -> refold with ESMFold
7. **Prioritize 5–10 variants for wet lab:** top stability scores + minimal structural disruption + (stretch) reduced predicted DnaJ interaction

## What “Success” Looks Like (computational)

**Stability goal**

- Mutations predicted as “tolerated” by ESM2 + preserve key conserved region character
- ESMFold predicts similar fold/topology across mutants (relative, not absolute)
- Candidates preserve amphipathic pattern / charge distribution consistent with known lysis proteins
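
The amphipathic-pattern criterion can be made measurable with a helical hydrophobic moment: each residue's hydrophobicity is treated as a vector rotated 100° per residue (ideal α-helix), and the length of the summed vector is divided by the sequence length. This is a sketch under assumptions: it uses the Kyte-Doolittle scale, which is my choice here and not something fixed by the assignment, and the example peptides are invented.

```python
import math

# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(seq, angle_deg=100.0):
    """Mean helical hydrophobic moment; angle_deg is the rotation per residue."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    delta = math.radians(angle_deg)
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)

# Invented 18-residue peptides: a uniform one and one with Leu on a single helical face.
uniform = "L" * 18
amphipathic = "".join(
    "L" if math.cos(math.radians(100 * i)) > 0 else "K" for i in range(18)
)
print(round(hydrophobic_moment(uniform), 3), round(hydrophobic_moment(amphipathic), 3))
```

Comparing the moment of each candidate against WT L over the same window would flag designs that lose the amphipathic character, treated as a relative rather than absolute measure.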

**DnaJ-disruption goal (stretch)**

- AlphaFold-Multimer / Boltz-1 predicts reduced or altered L–DnaJ contact (lower interface confidence / fewer stable contacts)
- Mutations do **not** severely destabilize L in ESMFold

## Potential Pitfalls

- **Very small protein (75 aa)** → folding predictions may be low confidence; results may be noisy.
- **DnaJ binding may be indirect** or context-dependent (membrane/protein complex), so co-folding might not reflect reality.
- Training data bias: PLMs and structure models may not generalize well to **short, membrane-associated, disordered/peptide-like proteins**.
- Functional readout (lysis) may depend on expression level/timing rather than simple stability alone.

## Roles

- Person A: BLAST + collect homolog sequences + MSA
- Person B: ESM2 deep mutational scan + mutation shortlist
- Person C: ESMFold for WT/mutants + structural comparison screenshots
- Person D (optional): AlphaFold-Multimer/Boltz-1 with DnaJ + interface exploration