error tolerance (mutations still often functional)
5) Where did amino acids come from before enzymes and before life started?
Likely abiotic chemistry: amino acids can form via prebiotic reactions (e.g., simple gas mixtures + energy), and were also delivered/augmented by meteorites and geochemical environments.
6) If you make an α-helix using D-amino acids, what handedness (right or left)?
A helix made from D-amino acids tends to form a left-handed α-helix (mirror image of the common L-amino acid right-handed helix).
8) Why are most molecular helices right-handed?
Because proteins are built mostly from L-amino acids, and the geometry/sterics of L-amino acids makes the right-handed α-helix energetically favored.
9) Why do β-sheets tend to aggregate?
β-strands expose backbone H-bond donors/acceptors and form flat, stackable surfaces, so they can “zipper” together with other strands.
10) What is the driving force for β-sheet aggregation?
Main drivers:
backbone hydrogen bonding between strands
hydrophobic packing and shape complementarity (“steric zipper”-like packing)
reduced exposure of unfavorable surfaces to water
12) Can you use amyloid β-sheets as materials?
Yes — amyloid-like β-sheet assemblies can be strong, stable, and fiber-forming, useful for biomaterials (scaffolds, hydrogels, nanofibers), if engineered for safety and controllability.
Part B: Protein Analysis and Visualization
Protein selected
CRISPR-associated endonuclease Cas9 (SpCas9) from Streptococcus pyogenes
Why: It’s a “programmable” DNA-cutting protein used in CRISPR editing, and it was explicitly referenced in the lecture context as a dynamic, multi-domain molecular machine.
1) Identify the amino acid sequence
Use the RCSB FASTA for PDB: 4OO8 (Cas9 is Chains A/D; the FASTA also includes the guide RNA + target DNA).
2) How long is it? What is the most frequent amino acid?
Length (PDB chain): 1372 aa (this crystallized construct includes a few extra residues compared with the canonical UniProt length).
Most frequent amino acid: K (lysine). Cas9 is highly basic, which fits its nucleic-acid-binding role.
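The length and most-frequent-residue answers can be re-checked with a few lines of Python. This is a minimal sketch: the function name `composition` and the toy sequence are illustrative assumptions (the toy string is not the Cas9 chain). To reproduce the numbers above, paste the chain A sequence from the RCSB FASTA in its place.

```python
from collections import Counter


def composition(seq: str) -> tuple[int, str, float]:
    """Return (length, most frequent residue, its fraction) for a protein sequence."""
    seq = "".join(seq.split()).upper()  # drop whitespace/newlines from a pasted FASTA body
    counts = Counter(seq)
    residue, n = counts.most_common(1)[0]
    return len(seq), residue, n / len(seq)


# Toy sequence for illustration only (NOT the Cas9 chain).
toy = "MKKLAKDEKRLKG"
length, top, frac = composition(toy)
print(f"length={length}, most frequent={top} ({frac:.1%})")
```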
3) How many protein sequence homologs are there?
A quick “homolog abundance” proxy: UniRef50_Q99ZW2 cluster = 1,161 members (proteins clustered at ≥50% identity).
4) Does it belong to any protein family?
Yes — Cas9/Csn1 family (CRISPR-Cas Type II effector nuclease), with domains including RuvC-like and HNH nuclease components.
5) Structure page in RCSB
RCSB PDB ID: 4OO8 (“Cas9 in complex with guide RNA and target DNA”).
6) When solved? Is it good quality?
Deposition: 31 Jan 2014; Released: 26 Feb 2014
Resolution: 2.5 Å → generally good (lower Å = better).
7) Other molecules besides protein?
Yes:
Guide RNA (97-mer) and target DNA are included in the structure (multiple nucleotide chains).
8) Does it belong to any structure-classification family?
Yes — SCOPe domain classifications include:
Ribonuclease H-like / RuvC-like domain
His-Me finger endonucleases / HNH domain
Cas9 REC lobe
Cas9 PAM-interacting (PI) domain
9) 3D visualization (PyMOL) — commands you can copy
Load 4OO8 (e.g., fetch 4OO8 at the PyMOL prompt), then:
### Color by secondary structure (helices vs sheets)
color yellow, ss h
color magenta, ss s
color gray70, ss l+''
**Interpretation:** SpCas9 is helix-rich overall (especially in the REC lobe), with fewer β-sheet regions.
### Color by residue type (hydrophobic vs hydrophilic/charged)
color tv_green, resn ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO
color tv_blue, resn LYS+ARG+HIS
color tv_red, resn ASP+GLU
color tv_orange, resn SER+THR+ASN+GLN+TYR+CYS+GLY
**Interpretation:** You’ll see many basic residues (K/R) on surfaces/grooves that contact RNA/DNA (nucleic acids are negatively charged).
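To put a number on the basic-residue enrichment, the same four classes used in the coloring commands above can be tallied over a sequence. This is a rough sketch: `CLASSES`, `class_fractions`, and `net_charge_estimate` are names I made up, and the charge estimate is deliberately crude (it ignores His, the termini, and pH).

```python
from collections import Counter

# One-letter codes grouped the same way as the PyMOL selections above.
CLASSES = {
    "hydrophobic": "AVLIMFWP",
    "basic": "KRH",
    "acidic": "DE",
    "polar": "STNQYCG",
}
RESIDUE_CLASS = {aa: name for name, letters in CLASSES.items() for aa in letters}


def class_fractions(seq: str) -> dict[str, float]:
    """Fraction of residues in each class; unknown letters are counted as 'other'."""
    counts = Counter(RESIDUE_CLASS.get(aa, "other") for aa in seq.upper())
    return {name: counts[name] / len(seq) for name in (*CLASSES, "other")}


def net_charge_estimate(seq: str) -> int:
    """Crude net charge: +1 per K/R, -1 per D/E (His and termini ignored)."""
    s = seq.upper()
    return sum(s.count(a) for a in "KR") - sum(s.count(a) for a in "DE")


# Toy sequence for illustration only.
print(class_fractions("KKRDEALS"), net_charge_estimate("KKRDEALS"))
```

A clearly positive net charge and a high basic fraction would be consistent with the nucleic-acid-binding interpretation, but surface location still has to be checked in the structure itself.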
### Surface + “holes” / pockets
hide sticks
show surface
set transparency, 0.35
**Interpretation:** You should observe large **grooves/channels** where the sgRNA:DNA hybrid sits (a major binding “pocket-like” interface).
# Part C. Using ML-Based Protein Design Tools
**Protein: GFP (PDB: 1EMA)**
## C1. Protein Language Modeling
### Deep Mutational Scan (ESM2)
I generated an unsupervised deep mutational scan using ESM2 likelihood scores.
**Observed pattern:**
- Strong mutation penalties in:
  - The **chromophore-forming residues (Ser65, Tyr66, Gly67)**
  - The **β-barrel core**
- More tolerated mutations in:
  - Surface loops
  - Solvent-exposed residues
**Standout residue: Tyr66**
- Mutations at Tyr66 were highly penalized.
- Interpretation: Tyr66 is part of the chromophore itself, so substitutions there abolish or alter fluorescence.
**General trend:**
- Buried hydrophobic residues → low tolerance
- Surface polar residues → higher tolerance
This matches structural expectations.
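For reference, a likelihood-based scan of this kind typically scores each substitution as a log-odds ratio between the mutant and wild-type residue at that position. The sketch below shows only that scoring step, assuming the per-position log-probabilities have already been obtained from the model. It does not call ESM2, and `dms_scores`, `mean_tolerance`, and the toy probabilities are made up for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dms_scores(wt_seq, log_probs):
    """Score every substitution as log p(mutant) - log p(wild type) at that position.

    log_probs[i] maps residue letter -> log-probability at position i
    (assumed to come from a masked language model; not computed here).
    Returns {"Y1A": score, ...} with 1-based positions.
    """
    scores = {}
    for i, wt in enumerate(wt_seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                scores[f"{wt}{i + 1}{aa}"] = log_probs[i][aa] - log_probs[i][wt]
    return scores


def mean_tolerance(scores, position):
    """Average score over all substitutions at one position (more negative = less tolerated)."""
    vals = [s for mut, s in scores.items() if int(mut[1:-1]) == position]
    return sum(vals) / len(vals)


def _toy_row(favoured, p):
    """Made-up distribution: probability p on one residue, the rest spread evenly."""
    rest = (1 - p) / 19
    return {aa: math.log(p if aa == favoured else rest) for aa in AMINO_ACIDS}


# Toy 2-residue "protein": position 1 is tightly constrained, position 2 is permissive.
toy_scores = dms_scores("YG", [_toy_row("Y", 0.95), _toy_row("G", 0.10)])
print(mean_tolerance(toy_scores, 1), mean_tolerance(toy_scores, 2))
```

In this toy case the constrained position gets a much more negative mean score, which is the same qualitative signature seen at Tyr66.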
### Latent Space Analysis
I embedded protein sequences using the notebook’s dimensionality reduction.
**Observed clustering:**
- GFP clustered with other fluorescent proteins (e.g., YFP-like proteins).
- Nearby sequences showed similar β-barrel architecture.
**Interpretation:**
The latent space preserves structural family relationships — proteins with similar folds cluster together even without explicit structural input.
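One way to make "nearby in latent space" concrete is to rank sequences by cosine similarity of their embedding vectors. The sketch below is only an illustration: the three-dimensional vectors and protein labels are invented placeholders (real embeddings have hundreds of dimensions), and `nearest_neighbors` is a hypothetical helper, not something from the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_neighbors(query, embeddings, k=3):
    """Names of the k embeddings most similar to embeddings[query]."""
    ranked = sorted(
        (name for name in embeddings if name != query),
        key=lambda name: cosine_similarity(embeddings[query], embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


# Invented toy vectors, for illustration only.
toy_embeddings = {
    "GFP": [1.0, 0.1, 0.0],
    "YFP-like": [0.9, 0.2, 0.0],
    "unrelated": [0.0, 1.0, 0.2],
}
print(nearest_neighbors("GFP", toy_embeddings, k=1))
```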
## C2. Protein Folding
### Folding GFP with ESMFold
I folded the wild-type GFP sequence using ESMFold.
**Result:**
- Predicted structure closely matched the experimental PDB.
- By visual comparison the β-barrel folds were nearly identical, consistent with a low RMSD.
- High confidence (pLDDT high across barrel).
**Interpretation:**
Language models capture fold-level constraints very well for well-represented proteins.
### Mutation Experiments
1. **Single mutation (Y66A):**
   - Structure remained β-barrel.
   - Confidence slightly reduced locally.
   - Fold resilient to single mutation.
2. **Multiple core mutations:**
   - Confidence dropped.
   - Barrel distortion observed.
3. **Large segment swap (loop region):**
   - Fold mostly preserved.
   - Flexible regions tolerate change.
**Conclusion:**
- GFP fold is structurally robust.
- Core mutations affect stability more than surface changes.
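When building mutant sequences for folding, it helps to apply mutations from their standard notation and verify the wild-type letter, so that an off-by-one in numbering is caught before folding. A minimal sketch follows; `apply_mutation` and `apply_mutations` are hypothetical helpers and the five-residue string is a toy, not GFP.

```python
import re


def apply_mutation(seq: str, mutation: str) -> str:
    """Apply a point mutation such as 'Y66A' (wild type, 1-based position, new residue).

    Raises ValueError if the notation is malformed or the wild-type letter does not match.
    """
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if not m:
        raise ValueError(f"unrecognised mutation: {mutation!r}")
    wt, pos, new = m.group(1), int(m.group(2)), m.group(3)
    if not 1 <= pos <= len(seq) or seq[pos - 1] != wt:
        raise ValueError(f"{mutation}: sequence does not have {wt} at position {pos}")
    return seq[: pos - 1] + new + seq[pos:]


def apply_mutations(seq: str, mutations: list[str]) -> str:
    """Apply several point mutations in order (positions refer to the original numbering)."""
    for mut in mutations:
        seq = apply_mutation(seq, mut)
    return seq


# Toy sequence for illustration only.
print(apply_mutation("ACDYG", "Y4A"))
print(apply_mutations("ACDYG", ["Y4A", "C2S"]))
```

Note that if the construct's numbering is shifted relative to the sequence string, the wild-type check will fail rather than silently mutating the wrong residue.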
## C3. Protein Generation
### Inverse Folding with ProteinMPNN
Using the GFP backbone, I generated new candidate sequences.
**Observations:**
- Many predicted residues matched original hydrophobic core.
- Greater variability at surface residues.
- Chromophore region showed strong conservation.
**Interpretation:**
ProteinMPNN respects backbone constraints and preserves structurally critical positions.
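The "matched the original" observation can be quantified as sequence recovery: the fraction of positions where the designed residue equals the native one, optionally restricted to a subset such as core positions. A small sketch, with `sequence_recovery` as a hypothetical helper and toy eight-residue sequences standing in for the real native and designed sequences:

```python
def sequence_recovery(native: str, designed: str, positions=None) -> float:
    """Fraction of positions where the designed residue equals the native one.

    positions: optional iterable of 1-based positions (e.g. core residues) to restrict to.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length (same backbone)")
    if positions is None:
        idx = list(range(len(native)))
    else:
        idx = [p - 1 for p in positions]
    matches = sum(native[i] == designed[i] for i in idx)
    return matches / len(idx)


# Toy sequences: pretend positions 1-4 are core and 5-8 are surface.
native, designed = "LVIAKDES", "LVIARNEQ"
print(sequence_recovery(native, designed))                # overall
print(sequence_recovery(native, designed, [1, 2, 3, 4]))  # "core"
print(sequence_recovery(native, designed, [5, 6, 7, 8]))  # "surface"
```

Higher recovery at core positions than at surface positions would be the numeric version of the pattern described above.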
### Refolding Generated Sequence (ESMFold)
I input the generated sequence into ESMFold.
**Result:**
- Predicted structure retained β-barrel fold.
- Overall topology preserved.
- Minor variations in loop regions.
**Conclusion:**
Backbone-conditioned design successfully produced alternative sequences that fold into similar structures.
# Overall Reflection
Modern protein AI models:
- Capture evolutionary constraints (ESM2 mutational scan)
- Preserve fold topology (ESMFold)
- Generate viable sequences (ProteinMPNN)
- Show structural robustness to moderate mutation
However:
- Functional residues (like GFP chromophore positions) remain highly constrained.
- Multiple core mutations destabilize the predicted fold.
- The models preserve structure better than they predict fine functional effects.
# Part D — Group Brainstorm Proposal: Engineering MS2 L Lysis Protein (“L”)
## Team + Goal Choice
**Group size:** 3–4
**Primary goal (computationally tractable):** **Increased stability** (easiest)
**Secondary goal (stretch):** Disrupt interaction with E. coli DnaJ (medium–hard, but we can attempt)
Rationale: The reading suggests many nonfunctional missense mutants cluster in/around a conserved LS motif and may affect heterotypic protein–protein interaction (not membrane association). That makes this a good target for in silico mutagenesis + interaction modeling.
## Hypothesis
1. L has a minimal structural motif (likely helical / amphipathic) whose stability and interaction surfaces determine lysis activity.
2. Mutations that preserve the conserved “domain character” (charge/hydrophobicity/helical propensity) can increase stability without killing function.
3. If DnaJ dependency arises from a specific binding interface or chaperone-recognition signature, we may be able to reduce DnaJ binding while keeping lysis.
## Tools & Approaches We’ll Use (from recitation list)
**Sequence + homology**
- **BLAST (UniProt/NCBI)**: find L homologs / related ssRNA phage lysis proteins
- **Clustal Omega**: MSA to quantify conservation (esp. around LS and C-terminal half)
**Protein Language Models**
- **ESM2 / ESM3**:
  - deep mutational scan (likelihood-based) to flag destabilizing vs tolerated mutations
  - identify “fragile” residues vs flexible positions
**Structure / interaction modeling**
- **ESMFold**: fast single-chain fold predictions for WT and mutants (even if low confidence—use comparatively)
- **AlphaFold-Multimer or Boltz-1**: test L–DnaJ co-folding hypotheses (if we can assemble DnaJ structure + short peptide)
- **FoldSeek**: search for structural analogs of predicted L fold (even if sequence is tiny)
**Sequence design**
- **ProteinMPNN (inverse folding)**: propose stabilized L sequence variants that maintain backbone constraints (using predicted backbone as template)
## Proposed Pipeline (schematic)
    (1) Gather sequences
        MS2 L + ssRNA phage lysis homologs
                    |
                    v
    (2) MSA + conservation mapping
        (Clustal Omega) -> conserved LS region + C-term hotspots
                    |
                    v
    (3) In silico mutagenesis screen
        (ESM2 deep mut scan) -> ranked substitutions per position
                    |
          +---------+-------------------------+
          |                                   |
          v                                   v
    (4A) Stability-first design         (4B) DnaJ-disruption (stretch)
      - pick tolerated muts               - predict L–DnaJ interface
      - preserve charge/hydrophob         - avoid chaperone-like motifs
      - increase helix propensity         - propose interface mutations
          |                                   |
          v                                   v
    (5) Structure check
        (ESMFold) WT vs mutants: fold retention / confidence
                    |
                    v
    (6) Sequence generation (optional refinement)
        (ProteinMPNN) generate candidates -> refold with ESMFold
                    |
                    v
    (7) Prioritize 5–10 variants for wet lab
        top stability scores + minimal structural disruption
        + (stretch) reduced predicted DnaJ interaction
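Step (2) of the pipeline above could be prototyped as a per-column Shannon entropy over the alignment, where low entropy marks conserved positions such as the LS region. This is a sketch under assumptions: the alignment is passed in as equal-length strings (for example parsed from Clustal Omega output), gaps are ignored, and the four toy rows are invented, not real L homologs.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())


def conservation_profile(msa: list[str]) -> list[float]:
    """Per-column entropy for equal-length aligned sequences (0 = fully conserved)."""
    if len({len(s) for s in msa}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_entropy("".join(col)) for col in zip(*msa)]


# Invented toy alignment: three conserved columns, one variable column.
toy_msa = ["MLSA", "MLSG", "MLST", "MLS-"]
profile = conservation_profile(toy_msa)
conserved = [i + 1 for i, h in enumerate(profile) if h < 0.5]  # 0.5 bits is an arbitrary cutoff
print(profile, conserved)
```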
## What “Success” Looks Like (computational)
**Stability goal**
- Mutations predicted as “tolerated” by ESM2 + preserve key conserved region character
- ESMFold predicts similar fold/topology across mutants (relative, not absolute)
- Candidates preserve amphipathic pattern / charge distribution consistent with known lysis proteins
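A simple proxy for "preserves the amphipathic pattern" is the helical hydrophobic moment: project each residue's hydrophobicity onto a helical wheel (100° per residue for an ideal α-helix) and take the length of the summed vector. The sketch below uses the Kyte-Doolittle hydropathy scale as a stand-in (the original Eisenberg formulation uses a different scale), and the 18-residue peptides are toy sequences, not L variants.

```python
import math

# Kyte-Doolittle hydropathy values (swapped in here; other scales are equally valid).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}


def hydrophobic_moment(seq: str, angle_deg: float = 100.0) -> float:
    """Mean hydrophobic moment per residue, assuming an ideal helix (angle_deg per residue)."""
    delta = math.radians(angle_deg)
    sin_sum = sum(KD[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    cos_sum = sum(KD[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)


# Same composition (9 Leu, 9 Lys), different arrangement.
amphipathic = "LKKLLKKLLKLLKKLLKK"   # Leu placed on one face of the helical wheel
segregated = "LLLLLLLLLKKKKKKKKK"    # Leu and Lys in two blocks along the sequence
print(hydrophobic_moment(amphipathic), hydrophobic_moment(segregated))
```

Comparing the wild-type moment with each candidate's moment over the same window gives a quick filter: candidates whose moment collapses have probably lost the amphipathic face, even if their overall hydrophobicity is unchanged.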
**DnaJ-disruption goal (stretch)**
- AlphaFold-Multimer / Boltz-1 predicts reduced or altered L–DnaJ contact (lower interface confidence / fewer stable contacts)
- Mutations do **not** severely destabilize L in ESMFold
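"Fewer stable contacts" can be made countable by listing residue pairs across the two chains that fall within a distance cutoff in each predicted complex. The sketch assumes one representative atom per residue (e.g. Cα) has already been extracted into dictionaries. The 8 Å cutoff is a common convention rather than a fixed rule, and the coordinates below are toy values.

```python
import math


def interface_contacts(chain_a, chain_b, cutoff=8.0):
    """Residue pairs (one from each chain) whose atoms lie within `cutoff` angstroms.

    chain_a / chain_b: dict mapping residue number -> (x, y, z) of one atom per residue.
    """
    pairs = []
    for res_a, xyz_a in chain_a.items():
        for res_b, xyz_b in chain_b.items():
            if math.dist(xyz_a, xyz_b) <= cutoff:
                pairs.append((res_a, res_b))
    return sorted(pairs)


# Toy coordinates: chain L (2 residues) against a DnaJ-like partner (2 residues).
l_chain = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0)}
partner_wt = {10: (0.0, 5.0, 0.0), 11: (30.0, 0.0, 0.0)}
partner_mut = {10: (0.0, 15.0, 0.0), 11: (30.0, 0.0, 0.0)}  # pretend the mutant model moved away

print(interface_contacts(l_chain, partner_wt))
print(interface_contacts(l_chain, partner_mut))
```

A drop in contact count between wild-type and mutant predictions is only suggestive, since co-folding models can place a weakly bound peptide almost anywhere; it should be read alongside the interface confidence scores.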
## Potential Pitfalls
- **Very small protein (75 aa)** → folding predictions may be low confidence; results may be noisy.
- **DnaJ binding may be indirect** or context-dependent (membrane/protein complex), so co-folding might not reflect reality.
- Training data bias: PLMs and structure models may not generalize well to **short, membrane-associated, disordered/peptide-like proteins**.
- Functional readout (lysis) may depend on expression level/timing rather than simple stability alone.
## Roles
- Person A: BLAST + collect homolog sequences + MSA
- Person B: ESM2 deep mutational scan + mutation shortlist
- Person C: ESMFold for WT/mutants + structural comparison screenshots
- Person D (optional): AlphaFold-Multimer/Boltz-1 with DnaJ + interface exploration