Week 5 HW: Protein Design Part 2
Part A: SOD1 A4V Peptide Binder Design
Superoxide dismutase 1 (SOD1) is a cytosolic antioxidant enzyme that converts superoxide radicals into hydrogen peroxide and oxygen. In its native state, it forms a stable homodimer and binds copper and zinc. Mutations in SOD1 cause familial Amyotrophic Lateral Sclerosis (ALS). Among them, the A4V mutation (Alanine → Valine at residue 4) leads to one of the most aggressive forms of the disease. The mutation subtly destabilizes the N-terminus, perturbs folding energetics, and promotes toxic aggregation.
The goal is to design short peptides that bind mutant SOD1, then decide which ones are worth advancing toward therapy, using three models: PepMLM, PeptiVerse, and moPPIt.
Generate four 12-mer peptide binders with PepMLM and record the perplexity scores.
Four 12-residue peptides were generated using PepMLM-650M conditioned on the SOD1 A4V mutant sequence, alongside the known binder FLYRWLPSRRGG.
| Peptide ID | Sequence | Source | Perplexity |
|---|---|---|---|
| 1 | WRYYVAAVRWGE | generated | 21.23 |
| 2 | WRSPPVGVEHKA | generated | 22.21 |
| 3 | WLYYPVGAELKE | generated | 16.06 |
| 4 | WHSGVVVLALKA | generated | 13.84 |
| 5 | FLYRWLPSRRGG | known_binder | 20.64 |
Lower pseudo-perplexity indicates higher model confidence in the peptide as a binder for the target. Peptide 4 (WHSGVVVLALKA, PPL=13.84) shows the highest PepMLM confidence, followed by Peptide 3 (WLYYPVGAELKE, PPL=16.06). Both score better than the known binder (PPL=20.64), suggesting the model considers them plausible binders; Peptides 1 and 2 score slightly worse than it (21.23 and 22.21). All four generated peptides begin with Trp (W). This points to a strong positional preference of the model for an aromatic N-terminal residue, although perplexity alone cannot show whether that residue actually anchors the peptide to SOD1.
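To make the metric concrete, here is a minimal sketch of how a pseudo-perplexity is obtained from per-residue log-probabilities and how the reported values rank. It assumes the usual definition (the exponential of the negative mean log-probability of each true residue); the function name and the uniform toy input are illustrative and are not taken from the PepMLM code.

```python
import math

def pseudo_perplexity(token_log_probs):
    """Return exp(-mean log-probability) over the residues of one peptide."""
    if not token_log_probs:
        raise ValueError("need at least one residue log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical input: every residue of a 12-mer predicted with probability 1/20.
uniform_ppl = pseudo_perplexity([math.log(1 / 20)] * 12)

# Perplexities reported in the table above; lower means more confident.
reported_ppl = {
    "WRYYVAAVRWGE": 21.23,
    "WRSPPVGVEHKA": 22.21,
    "WLYYPVGAELKE": 16.06,
    "WHSGVVVLALKA": 13.84,
    "FLYRWLPSRRGG": 20.64,  # known binder
}
ranked_by_confidence = sorted(reported_ppl, key=reported_ppl.get)

print(f"uniform 1/20 per residue -> PPL {uniform_ppl:.2f}")
print("most to least confident:", ranked_by_confidence)
```

A peptide whose residues are each assigned probability 1/20 has a pseudo-perplexity of 20, which gives a rough sense of scale for the values in the table.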
Evaluate binders with AlphaFold3. Record ipTM scores and describe binding locations.
All five peptide-SOD1 complexes were submitted to AlphaFold Server (fold date: 2026-03-09). Each job modeled the SOD1 A4V monomer (154 residues, chain A) with one 12-mer peptide (chain B). Results are stored in peptides/af3_results/.
| Peptide | ipTM (best) | Binding Location | Surface/Buried | Notes |
|---|---|---|---|---|
| WRYYVAAVRWGE | 0.31 | Dimer interface / β-barrel | Surface-bound | Moderate confidence; PAE 9.07 Å |
| WRSPPVGVEHKA | 0.36 | Extended surface groove | Surface-bound | Second-best ipTM; extended conformation |
| WLYYPVGAELKE | 0.24 | β-barrel region | Surface-bound | Lowest confidence; PAE 10.81 Å, uncertain binding |
| WHSGVVVLALKA | 0.48 | Dimer interface pocket | Partially buried | Best model; PAE 4.97 Å, well-defined binding |
| FLYRWLPSRRGG | 0.31 | β-barrel / dimer interface | Surface-bound | Known binder; PAE 8.60 Å |
ipTM values range from 0.24 to 0.48 across the five complexes. While all fall below the 0.6 threshold typically considered high-confidence for protein-peptide interactions, they show meaningful differentiation among candidates. Peptide 4 (WHSGVVVLALKA, ipTM=0.48) clearly stands out: its ipTM exceeds that of the known binder FLYRWLPSRRGG (0.31) by 55%, and its PAE of 4.97 Å is about 40% lower than the next-lowest recorded PAE (8.60 Å, for the known binder), indicating a better-resolved binding pose at the dimer interface pocket. This peptide is also the only one predicted to be partially buried, suggesting tighter engagement with the SOD1 surface.
Peptide 2 (WRSPPVGVEHKA, ipTM=0.36) ranks second structurally, adopting an extended conformation along a surface groove. Peptides 1 and 5 tie at ipTM=0.31, with Peptide 1 localizing to the dimer interface / β-barrel region and Peptide 5 (known binder) similarly positioned. Peptide 3 (WLYYPVGAELKE, ipTM=0.24) has the weakest structural prediction despite having the second-lowest PepMLM perplexity (16.06), with a high PAE (10.81 Å) indicating uncertain binding geometry.
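The relative comparisons in this paragraph are simple ratios, and the short sketch below reproduces them from the table values. The helper name `relative_gain_percent` is illustrative, and PAE is only compared among the four complexes for which the table records a value.

```python
def relative_gain_percent(value, baseline):
    """Percentage by which value exceeds baseline."""
    return (value - baseline) / baseline * 100.0

# ipTM values from the table above.
iptm = {
    "WRYYVAAVRWGE": 0.31,
    "WRSPPVGVEHKA": 0.36,
    "WLYYPVGAELKE": 0.24,
    "WHSGVVVLALKA": 0.48,
    "FLYRWLPSRRGG": 0.31,  # known binder
}
# PAE (in angstroms) is recorded for four of the five complexes.
pae = {
    "WRYYVAAVRWGE": 9.07,
    "WLYYPVGAELKE": 10.81,
    "WHSGVVVLALKA": 4.97,
    "FLYRWLPSRRGG": 8.60,
}

best = max(iptm, key=iptm.get)
iptm_gain = relative_gain_percent(iptm[best], iptm["FLYRWLPSRRGG"])

# Ratio of the best peptide's PAE to the next-lowest recorded PAE.
next_lowest_pae = min(v for name, v in pae.items() if name != best)
pae_ratio = pae[best] / next_lowest_pae

print(f"{best}: ipTM gain over known binder = {iptm_gain:.0f}%")
print(f"PAE ratio to next-lowest recorded value = {pae_ratio:.2f}")
```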
Notably, none of the five peptides bind near the N-terminus where the A4V mutation resides (position 4). All predicted binding sites localize to the dimer interface or β-barrel region, suggesting these peptides may act through general fold stabilization or dimerization modulation rather than direct mutation-site engagement.
Evaluate therapeutic properties with PeptiVerse. Which peptide would you advance?
| Peptide | Source | Perplexity | Binding Affinity (pKd) | Solubility | Hemolysis | Net Charge (pH 7) | MW (Da) |
|---|---|---|---|---|---|---|---|
| WRYYVAAVRWGE | generated | 21.23 | 7.021 (Medium) | 1.000 (Soluble) | 0.093 (Non-hemolytic) | +0.77 | 1555.7 |
| WRSPPVGVEHKA | generated | 22.21 | 4.826 (Weak) | 1.000 (Soluble) | 0.013 (Non-hemolytic) | +0.85 | 1362.5 |
| WLYYPVGAELKE | generated | 16.06 | 5.722 (Weak) | 1.000 (Soluble) | 0.033 (Non-hemolytic) | -1.23 | 1467.7 |
| WHSGVVVLALKA | generated | 13.84 | 6.055 (Weak) | 1.000 (Soluble) | 0.079 (Non-hemolytic) | +0.85 | 1279.5 |
| FLYRWLPSRRGG | known_binder | 20.64 | 5.968 (Weak) | 1.000 (Soluble) | 0.047 (Non-hemolytic) | +2.76 | 1507.7 |
ipTM vs. PeptiVerse affinity: AlphaFold3 structural confidence and PeptiVerse-predicted binding affinity disagree on the top candidate. Peptide 4 (WHSGVVVLALKA) dominates structurally (ipTM=0.48, PAE=4.97 Å), but PeptiVerse places its affinity in the “Weak” class (pKd=6.055, the second-highest of the five). Conversely, Peptide 1 (WRYYVAAVRWGE) has the best PeptiVerse affinity (pKd=7.021, “Medium binding”) but an unremarkable ipTM of 0.31. This divergence likely reflects that PeptiVerse predicts binding strength from sequence features while AF3 models 3D structural complementarity, so the two offer different and complementary views of the interaction.
PepMLM perplexity vs. ipTM: These two metrics show better agreement. Peptide 4 ranks first in both (PPL=13.84, ipTM=0.48), supporting its candidacy from two independent structural/sequence perspectives. However, the correlation is imperfect: Peptide 3 ranks second in PepMLM (PPL=16.06) but last in AF3 (ipTM=0.24), indicating that low perplexity does not guarantee a well-resolved binding pose.
Therapeutic safety: All five peptides are predicted to be fully soluble (probability=1.000) and non-hemolytic (all below 0.10). No candidates present safety red flags. Peptide 2 (WRSPPVGVEHKA) has the lowest hemolysis risk (0.013) but also the weakest binding (pKd=4.826).
Physicochemical properties: Net charges range from -1.23 to +2.76 at pH 7. None of the peptides is strongly cationic in the way classical cell-penetrating peptides are, so cellular uptake cannot be assumed from charge alone. The known binder FLYRWLPSRRGG has the highest positive charge (+2.76), consistent with its three arginine residues. Molecular weights are all in the 1280-1556 Da range, typical for 12-mer peptides.
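The molecular weights in the table can be checked by hand. The sketch below sums standard average residue masses and adds one water for the free termini; it assumes unmodified linear peptides, and the function name is illustrative rather than part of PeptiVerse.

```python
# Average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01528  # added once for the free N- and C-termini

def peptide_mass(sequence):
    """Average molecular weight of a linear, unmodified peptide in Da."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS

for seq in ("WHSGVVVLALKA", "WRYYVAAVRWGE", "FLYRWLPSRRGG"):
    print(seq, round(peptide_mass(seq), 1))
```

The computed values agree with the table to within about 0.1 Da, the small differences coming from rounding.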
Peptide 4 (WHSGVVVLALKA) is the top candidate to advance, with Peptide 1 (WRYYVAAVRWGE) as a strong alternative.
Peptide 4 has the best PepMLM confidence (PPL=13.84) and the best AlphaFold3 structural prediction by a wide margin (ipTM=0.48, PAE=4.97 Å). Two independent methods, one sequence-based (PepMLM) and one structure-based (AF3), agree that this peptide has the most credible interaction with SOD1. Its predicted binding at the dimer interface pocket, where it is partially buried, suggests a geometrically specific interaction rather than nonspecific surface adhesion. Although PeptiVerse places its predicted affinity in the “Weak” class (pKd=6.055, still the second-highest of the five), the structural evidence from AF3 provides stronger support for a real binding event. It is fully soluble, non-hemolytic (0.079), and has the lowest molecular weight (1279.5 Da) among all candidates.
Peptide 1 (WRYYVAAVRWGE) remains a compelling alternative: it has the strongest predicted binding affinity (pKd=7.021, the only “Medium binding” peptide), excellent safety properties, and a moderate ipTM (0.31). If PeptiVerse affinity predictions are weighted more heavily than AF3 structural models, Peptide 1 would be the preferred choice.
For experimental validation, both peptides merit testing — Peptide 4 as the structurally favored lead and Peptide 1 as the affinity-favored alternative.
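One way to make this choice reproducible is to rank the peptides on each metric and average the ranks. The sketch below does that for perplexity, ipTM, and pKd. Weighting the three metrics equally is an assumption made here for illustration, not something prescribed by any of the tools, and tied values share the mean of their ranks.

```python
def average_ranks(values, higher_is_better):
    """Map name -> rank (1 = best); tied values share the mean of their ranks."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        shared_rank = ((i + 1) + (j + 1)) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = shared_rank
        i = j + 1
    return ranks

# Values copied from the tables above.
ppl = {"WRYYVAAVRWGE": 21.23, "WRSPPVGVEHKA": 22.21, "WLYYPVGAELKE": 16.06,
       "WHSGVVVLALKA": 13.84, "FLYRWLPSRRGG": 20.64}
iptm = {"WRYYVAAVRWGE": 0.31, "WRSPPVGVEHKA": 0.36, "WLYYPVGAELKE": 0.24,
        "WHSGVVVLALKA": 0.48, "FLYRWLPSRRGG": 0.31}
pkd = {"WRYYVAAVRWGE": 7.021, "WRSPPVGVEHKA": 4.826, "WLYYPVGAELKE": 5.722,
       "WHSGVVVLALKA": 6.055, "FLYRWLPSRRGG": 5.968}

per_metric = [
    average_ranks(ppl, higher_is_better=False),   # lower perplexity is better
    average_ranks(iptm, higher_is_better=True),
    average_ranks(pkd, higher_is_better=True),
]
mean_rank = {name: sum(r[name] for r in per_metric) / len(per_metric) for name in ppl}
consensus_order = sorted(mean_rank, key=mean_rank.get)

for name in consensus_order:
    print(f"{name}  mean rank = {mean_rank[name]:.2f}")
```

Under equal weighting, Peptide 4 comes first and Peptide 1 second, matching the recommendation above. A different weighting could reorder the lower-ranked peptides.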
Generate optimized peptides with moPPIt. How do they differ from PepMLM peptides?
The moPPIt model (discrete flow matching with multi-objective gradient guidance) was used to generate 11 peptides targeting the SOD1 A4V mutant. Target motifs were set to residues 1-15 (N-terminus, near the A4V mutation) and residues 49-54 (dimer interface near the EFGDN loop). Peptide length was 12 amino acids. Objective weights were set to [1, 1, 1, 4, 4, 2]. Assuming the weights follow the column order of the results table, affinity and motif were weighted 4x and specificity 2x to prioritize on-target binding. Results are stored in peptides/moPPIt/sod1_moppit_results.csv.
| Peptide | Hemolysis | Non-Fouling | Half-Life | Affinity | Motif | Specificity |
|---|---|---|---|---|---|---|
| QKRRLLSLPVFK | 0.902 | 0.602 | 0.80 | 6.00 | 0.478 | 0.622 |
| YPPCAYYWQATD | 0.929 | 0.587 | 3.42 | 7.10 | 0.563 | 0.686 |
| SIVKTGVTFLTK | 0.920 | 0.186 | 1.81 | 6.38 | 0.584 | 0.699 |
| PPLIHRWYAATM | 0.922 | 0.321 | 3.49 | 6.30 | 0.444 | 0.660 |
| EEQVVKRIKVGP | 0.953 | 0.736 | 0.68 | 6.54 | 0.580 | 0.679 |
| CVQNKKPTFLII | 0.911 | 0.497 | 1.56 | 6.14 | 0.668 | 0.647 |
| LKKKIREFLKLG | 0.952 | 0.561 | 1.16 | 6.19 | 0.512 | 0.660 |
| YDPLPCAWTPTH | 0.935 | 0.726 | 2.69 | 6.57 | 0.482 | 0.699 |
| KPFVFFAKTEIM | 0.932 | 0.130 | 1.41 | 6.25 | 0.589 | 0.538 |
| PTWVIETKKKFR | 0.979 | 0.611 | 2.30 | 5.73 | 0.609 | 0.667 |
| GPKGWTGKQCFI | 0.888 | 0.711 | 2.07 | 7.00 | 0.474 | 0.635 |
Hemolysis: probability of being non-hemolytic (higher = safer). Affinity: predicted binding score (higher = stronger). Motif: fraction of binding at target residues (higher = more on-target).
All 11 peptides have high predicted non-hemolytic probabilities (0.888-0.979), indicating low hemolytic risk. Affinity predictions range from 5.73 to 7.10, with YPPCAYYWQATD (7.10) and GPKGWTGKQCFI (7.00) showing the strongest predicted binding. Half-lives vary considerably (0.68-3.49 hours), with PPLIHRWYAATM (3.49 h) and YPPCAYYWQATD (3.42 h) predicted to be the most stable.
Top candidates:
- Highest affinity: YPPCAYYWQATD (7.10) — also has good half-life (3.42) and high specificity (0.686)
- Best motif targeting: CVQNKKPTFLII (0.668) — strongest on-target binding to N-terminus + dimer interface
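- Best therapeutic profile: EEQVVKRIKVGP, with the best non-fouling score (0.736), the second-highest non-hemolytic score (0.953, behind PTWVIETKKKFR at 0.979), and strong affinity (6.54), though it also has the shortest predicted half-life (0.68)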
- Best overall balance: YDPLPCAWTPTH — high affinity (6.57), good non-fouling (0.726), long half-life (2.69), high specificity (0.699)
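The per-objective leaders in this list can be read directly off the results table. The sketch below does so programmatically, returning every peptide tied for the top value of a column. The column keys are illustrative names for the table headers, and the values are copied from the table rather than reloaded from the CSV.

```python
COLUMNS = ("non_hemolytic", "non_fouling", "half_life", "affinity", "motif", "specificity")

# Rows copied from the moPPIt results table, in COLUMNS order.
SCORES = {
    "QKRRLLSLPVFK": (0.902, 0.602, 0.80, 6.00, 0.478, 0.622),
    "YPPCAYYWQATD": (0.929, 0.587, 3.42, 7.10, 0.563, 0.686),
    "SIVKTGVTFLTK": (0.920, 0.186, 1.81, 6.38, 0.584, 0.699),
    "PPLIHRWYAATM": (0.922, 0.321, 3.49, 6.30, 0.444, 0.660),
    "EEQVVKRIKVGP": (0.953, 0.736, 0.68, 6.54, 0.580, 0.679),
    "CVQNKKPTFLII": (0.911, 0.497, 1.56, 6.14, 0.668, 0.647),
    "LKKKIREFLKLG": (0.952, 0.561, 1.16, 6.19, 0.512, 0.660),
    "YDPLPCAWTPTH": (0.935, 0.726, 2.69, 6.57, 0.482, 0.699),
    "KPFVFFAKTEIM": (0.932, 0.130, 1.41, 6.25, 0.589, 0.538),
    "PTWVIETKKKFR": (0.979, 0.611, 2.30, 5.73, 0.609, 0.667),
    "GPKGWTGKQCFI": (0.888, 0.711, 2.07, 7.00, 0.474, 0.635),
}

def top_scorers(column):
    """Return all peptides sharing the highest value in the given column."""
    idx = COLUMNS.index(column)
    best = max(row[idx] for row in SCORES.values())
    return sorted(name for name, row in SCORES.items() if row[idx] == best)

best_per_objective = {column: top_scorers(column) for column in COLUMNS}
for column, names in best_per_objective.items():
    print(f"{column:14s} {names}")
```

No single peptide leads on more than one objective, which is why the list above separates affinity, motif targeting, and therapeutic profile.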
Comparison to PepMLM peptides:
- Design philosophy: PepMLM generates peptides via masked language modeling conditioned on the target sequence — it learns what peptide “looks right” next to SOD1 based on evolutionary patterns. moPPIt uses discrete flow matching with explicit multi-objective gradient guidance — it actively optimizes for binding affinity, motif specificity, and therapeutic properties simultaneously.
- Binding specificity: PepMLM peptides are generated without any notion of where on SOD1 they should bind. moPPIt peptides are explicitly guided toward residues 1-15 and 49-54 via the BindEvaluator motif score, with a specificity penalty that discourages off-target binding elsewhere on SOD1.
- Sequence composition: PepMLM peptides all start with W (tryptophan), suggesting the model has a strong bias for aromatic N-terminal anchors. moPPIt peptides are more diverse — no single residue dominates, and the compositions vary based on which objective trade-offs the sampler explores.
- Affinity: moPPIt’s highest-affinity peptide (YPPCAYYWQATD, 7.10) is comparable to PepMLM’s best (WRYYVAAVRWGE, 7.02 via PeptiVerse). However, moPPIt’s predictions fall in a narrower band (5.7-7.1) than PepMLM’s (4.8-7.0), and 10 of its 11 peptides score 6.0 or higher, suggesting moPPIt’s affinity guidance is effective.
- Non-fouling trade-off: PepMLM peptides all have a predicted solubility of 1.000. moPPIt reports a non-fouling score rather than solubility, so the two are not directly comparable, but some moPPIt peptides score poorly on it (e.g., SIVKTGVTFLTK non-fouling = 0.186, KPFVFFAKTEIM = 0.130). This reflects the multi-objective nature of the sampler: with affinity and motif weighted most heavily, sequences can drift toward hydrophobic compositions at the expense of lower-weighted objectives.
Evaluation before clinical advancement:
In silico validation:
- Molecular dynamics simulations of peptide-SOD1 complexes (starting from AF3 structures) to assess binding stability
- Binding free energy calculations (MM/PBSA or MM/GBSA) for ranking candidates
- Aggregation prediction (AGGRESCAN, TANGO)
In vitro validation:
- Surface plasmon resonance (SPR) or isothermal titration calorimetry (ITC) to measure actual Kd to A4V SOD1
- Hemolysis assay with human red blood cells
- Serum stability to validate half-life predictions
- ThT fluorescence / aggregation assays to test whether the peptide inhibits A4V SOD1 aggregation
Cell-based assays:
- Cell viability (MTT/MTS) to confirm non-cytotoxicity
- Cell-penetrating peptide assessment — SOD1 is cytosolic, so the peptide must enter cells
- Co-immunoprecipitation to confirm peptide-SOD1 interaction in cellular context
In vivo preclinical:
- Pharmacokinetics (bioavailability, clearance, tissue distribution)
- Efficacy testing in SOD1-G93A transgenic ALS mouse model
- Standard safety pharmacology panel
The key bottleneck for peptide therapeutics is typically delivery (cell penetration + proteolytic stability), not binding affinity. Strategies to address this include D-amino acid substitution, cyclization, stapling, or conjugation to cell-penetrating peptide motifs.
Part B: BRD4 Drug Discovery with Boltz Lab
Tutorial designed by Geoffrey Smith, Boltz Lab
Target: BRD4 (Bromodomain-containing protein 4) — an epigenetic reader protein and validated oncology target. BRD4 is a member of the BET (Bromodomain and Extra-Terminal) family. It recognises acetylated lysine residues on histone tails and recruits transcriptional machinery to gene promoters, driving expression of oncogenes including c-Myc. Dysregulated BRD4 activity is implicated in haematological malignancies, solid tumours, and inflammatory disease.
Reference: Filippakopoulos P. et al. Selective inhibition of BET bromodomains. Nature 468, 1067-1073 (2010). Crystal structure PDB: 3MXF
Compound Progression (Hit → Lead → Candidate)
| Stage | Compound | SMILES |
|---|---|---|
| Hit | Stripped Back Core | CC1C2C(=C(SC=2NCCN=1)C)C |
| Lead | Triazole + Acid | O=C(C[C@@H]1N=C(C)C2C(=C(SC=2N2C1=NN=C2C)C)C)O |
| Candidate | (+)-JQ1 | O=C(C[C@H]1C2=NN=C(N2C3=C(C(C4=CC=C(C=C4)Cl)=N1)C(C)=C(S3)C)C)OC(C)(C)C |
Boltz-2 Metrics
| Metric | Range | Meaning | Trust Threshold |
|---|---|---|---|
| Binding Confidence | 0-1 | How confidently Boltz-2 places the ligand in the binding site | > 0.7 reliable; > 0.8 high confidence |
| Optimization Score | 0-1 | Relative affinity ranking for congeneric series | Use for relative ranking |
| Structure Confidence | 0-1 | Confidence in the predicted structure | > 0.8 high confidence |
All three metrics need to be high to trust a prediction.
Run Boltz-2 predictions for the Hit, Lead, and JQ1 against BRD4.
| Compound | Binding Confidence | Optimization Score | Structure Confidence |
|---|---|---|---|
| Hit | 0.43 | 0.22 | 0.93 |
| Lead | 0.74 | 0.27 | 0.98 |
| JQ1 | 0.96 | 0.44 | 0.98 |
Does Binding Confidence increase from hit to clinical candidate?
Yes, Binding Confidence increases monotonically across the drug discovery progression: Hit (0.43) → Lead (0.74) → JQ1 (0.96). This is exactly what we would expect: each optimization stage adds chemical features that improve shape complementarity and specific interactions with the BRD4 acetyl-lysine binding pocket. The Hit (stripped back core) contains only the minimal thienodiazepine scaffold with no substituents to make specific contacts, so Boltz-2 has low confidence in placing it. The Lead adds a triazole and carboxylic acid that mimic the acetyl-lysine pharmacophore, raising Binding Confidence by roughly 70% (0.43 → 0.74). JQ1 adds the chlorophenyl group and tert-butyl ester, filling the WPF shelf and ZA channel of the bromodomain pocket, pushing Binding Confidence to 0.96, well above the 0.8 high-confidence threshold.
The Structure Confidence is high for all three compounds (0.93-0.98), indicating that the protein structure itself is well-predicted regardless of the ligand. This makes sense since BRD4 is a well-characterized, rigid globular domain.
Inspect the predicted binding pose for JQ1. Can you identify key binding interactions?
JQ1 scores 0.96 Binding Confidence with 0.98 Structure Confidence, indicating a highly reliable predicted pose. Key binding interactions expected from the known crystal structure (PDB: 3MXF) include:
- The triazole ring and methyl group occupy the acetyl-lysine recognition site, forming a hydrogen bond with the conserved asparagine (N140) in the BC loop — the hallmark interaction of BET bromodomain inhibitors
- The chlorophenyl ring packs against the WPF shelf (W81, P82, F83), providing hydrophobic anchoring
- The tert-butyl ester group extends into the ZA channel, contributing additional hydrophobic contacts and shape complementarity
- The thienodiazepine core sits at the mouth of the pocket, bridging the ZA and BC loops
Compare the Optimization Scores. How do JQ1 and the Lead compare?
The Optimization Scores track the same progression: Hit (0.22) → Lead (0.27) → JQ1 (0.44). JQ1’s score (0.44) is roughly 63% higher than the Lead’s (0.27), reflecting the substantial affinity gain from adding the chlorophenyl and tert-butyl ester groups. The Hit-to-Lead jump is more modest (0.22 → 0.27, ~23% increase), consistent with the triazole and acid adding some specific contacts but not yet achieving the full pocket occupancy of the clinical candidate.
Using the categorization thresholds: JQ1 falls squarely in the “high confidence binder” range (Binding Confidence > 0.80, Opt. Score > 0.40). The Lead sits at moderate confidence (Binding Confidence 0.74, Opt. Score 0.27 — both within the 0.65-0.80 and 0.25-0.40 ranges). The Hit falls in the low confidence / non-binder category (Binding Confidence 0.43, Opt. Score 0.22), which aligns with its role as an unoptimized screening hit.
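The categorization just described can be written as a small function. This is a minimal sketch: the thresholds come from the text, but how to treat compounds that satisfy one criterion and not the other (for example, a high Binding Confidence with a mid-range Optimization Score) is not specified, so the fall-through order used here is an assumption.

```python
def categorize(binding_confidence, optimization_score):
    """Assign a confidence category from the two Boltz-2 scores."""
    if binding_confidence > 0.80 and optimization_score > 0.40:
        return "high"
    # Assumption: anything at or above the moderate floor on BOTH metrics
    # that is not "high" is treated as moderate.
    if binding_confidence >= 0.65 and optimization_score >= 0.25:
        return "moderate"
    return "low"

# Reference compounds with the scores reported above.
references = {
    "Hit": (0.43, 0.22),
    "Lead": (0.74, 0.27),
    "JQ1": (0.96, 0.44),
}
categories = {name: categorize(bc, opt) for name, (bc, opt) in references.items()}
print(categories)
```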
Create a Design Project and run a 1K virtual screen.
A design project was created in Boltz Lab using PDB 3MXF (BRD4 bromodomain 1 co-crystallized with JQ1) as the structural template. JQ1 was specified as the molecular probe to define the acetyl-lysine binding pocket. The platform automatically detected the binding site from the JQ1 co-crystal pose, identifying the key pocket residues including the WPF shelf (W81, P82, F83), BC loop (N140), and ZA channel. Project ID: VS-BRD4WO-5P52.
A virtual screen of 993 AI-designed small molecules was generated from the Enamine REAL chemical space with Drug-Like filtering. All compounds were scored by Boltz-2 against the BRD4 binding pocket.
Score distributions across the library:
| Metric | Min | Max | Mean |
|---|---|---|---|
| Binding Confidence | 0.07 | 0.85 | 0.30 |
| Optimization Score | 0.00 | 0.48 | 0.23 |
| Structure Confidence | >0.84 | >0.96 | ~0.92 |
The vast majority of compounds cluster at low Binding Confidence (<0.40), consistent with the expectation that a first-pass library that has not been optimized against the target yields few genuine binders. Structure Confidence remains high throughout (>0.84), indicating that the protein structure predictions are reliable regardless of ligand quality.
Top 5 compounds by Binding Confidence:
| Rank | ID | Binding Confidence | Opt. Score | SMILES |
|---|---|---|---|---|
| 1 | SM-AQ8GBD73 | 0.85 | 0.35 | Cc1cc(-c2cc(C)c(Cl)c(C)c2)cc(C)c1O |
| 2 | SM-VP5CRXFK | 0.84 | 0.25 | CN1Cc2c(NC(=O)c3cccnc3)cccc2C1=O |
| 3 | SM-2MZLAGQT | 0.80 | 0.48 | Cc1nc2c(cc1C(=O)Nc1cnn(CC(C)(C)O)c1C)c(C)nn2C |
| 4 | SM-G95H15CR | 0.76 | 0.20 | CCC(=O)N(C)c1ccc2c(c1)CN(C)C2 |
| 5 | SM-1ASUYQAA | 0.74 | 0.34 | CCN(C(=O)C(C)C)c1ccc(Cl)cc1F |
Categorize the results and benchmark against JQ1.
| Category | Criteria | Count | % of Library |
|---|---|---|---|
| High confidence binders | BC > 0.80, OS > 0.40 | 1 | 0.1% |
| Moderate confidence | BC 0.65-0.80, OS 0.25-0.40 | 13 | 1.3% |
| Low confidence / non-binders | BC < 0.65, OS < 0.25 | 979 | 98.6% |
The reference compounds validate the scoring system:
| Compound | Category |
|---|---|
| JQ1 | High confidence binder (0.96 / 0.44) |
| Lead | Moderate confidence (0.74 / 0.27) |
| Hit | Low confidence (0.43 / 0.22) |
The sole high-confidence AI hit (its displayed Binding Confidence of 0.80 sits right at the category boundary):
| ID | Binding Confidence | Opt. Score | Structure Confidence | SMILES |
|---|---|---|---|---|
| SM-2MZLAGQT | 0.80 | 0.48 | 0.92 | Cc1nc2c(cc1C(=O)Nc1cnn(CC(C)(C)O)c1C)c(C)nn2C |
SM-2MZLAGQT contains a trimethylated pyrazolo[3,4-b]pyridine core joined by an amide linker to a second pyrazole that carries a 2-hydroxy-2-methylpropyl (tertiary alcohol) side chain. It is structurally distinct from JQ1 but shares its nitrogen-rich heterocyclic character.
How does JQ1 rank alongside the AI-generated library?
JQ1 scores BC=0.96, OS=0.44 — substantially outperforming every AI-generated compound in Binding Confidence. By BC alone, JQ1 ranks #1 by a wide margin (0.96 vs the next-best AI compound SM-AQ8GBD73 at 0.85). No AI-generated molecule approaches JQ1’s level of binding confidence.
However, SM-2MZLAGQT (the only high-confidence AI hit) achieves a higher Optimization Score (0.48) than JQ1 (0.44). This is notable but should be read cautiously: the Optimization Score is intended for relative affinity ranking within a congeneric series, and SM-2MZLAGQT and JQ1 are different scaffolds. At most, the higher OS suggests SM-2MZLAGQT could reach comparable affinity, despite the lower Binding Confidence in its predicted pose.
| Compound | BC Rank | OS Rank | BC | OS |
|---|---|---|---|---|
| JQ1 (benchmark) | 1 | 2 | 0.96 | 0.44 |
| SM-2MZLAGQT | 4 | 1 | 0.80 | 0.48 |
| SM-AQ8GBD73 | 2 | 6 | 0.85 | 0.35 |
| SM-VP5CRXFK | 3 | — | 0.84 | 0.25 |
JQ1 does not score as the top compound by Optimization Score, but it dominates Binding Confidence. This is expected — JQ1 is a highly optimized clinical candidate with known high-affinity binding to BRD4, whereas the AI compounds are generated from general chemical space without iterative medicinal chemistry optimization.
How do the top scoring binders compare in binding pose to JQ1?
The top-scoring AI compound SM-2MZLAGQT (Cc1nc2c(cc1C(=O)Nc1cnn(CC(C)(C)O)c1C)c(C)nn2C) contains a fused pyrazolo[3,4-b]pyridine bicyclic core decorated with methyl groups and an amide-linked pyrazole bearing a 2-hydroxy-2-methylpropyl (tertiary alcohol) group. Comparing to JQ1’s thienodiazepine scaffold:
Shared pharmacophoric features:
- Both molecules feature nitrogen-rich heterocyclic cores capable of occupying the acetyl-lysine recognition site and forming hydrogen bonds with N140
- Multiple methyl substituents in both compounds provide hydrophobic contacts with the pocket walls
- Both have molecular weights in the drug-like range (SM-2MZLAGQT ~356 Da, computed from its SMILES as C18H24N6O2, vs JQ1 ~457 Da)
Key structural differences:
- JQ1 uses a thienotriazolodiazepine (a 7-membered diazepine ring fused to a sulfur-containing thiophene and a triazole), whereas SM-2MZLAGQT uses a pyrazolo[3,4-b]pyridine (fused 6+5 rings whose only heteroatoms are nitrogen)
- JQ1’s chlorophenyl group fills the WPF shelf; SM-2MZLAGQT has no chlorophenyl equivalent (its only pendant ring is the amide-linked pyrazole), potentially explaining its lower Binding Confidence
- JQ1’s tert-butyl ester extends into the ZA channel; SM-2MZLAGQT’s 2-hydroxy-2-methylpropyl group (CC(C)(C)O) may partially mimic this interaction, but with a tertiary alcohol instead of an ester
- SM-2MZLAGQT is more compact and lacks the extended hydrophobic features that give JQ1 its high shape complementarity
SM-AQ8GBD73 (Cc1cc(-c2cc(C)c(Cl)c(C)c2)cc(C)c1O), the AI compound with the highest BC (second only to JQ1 overall), is a simple biaryl phenol with chlorine and methyl substitution, structurally much simpler than JQ1. Its high BC (0.85) but moderate OS (0.35) suggests it may sit in the pocket with good shape complementarity but lack the specific pharmacophoric interactions (N140 hydrogen bond, ZA channel occupancy) that drive high affinity.
Selectivity analysis: BRD4 vs BRD2
This analysis was not performed. A selectivity screen against BRD2 (PDB: 5UEN) would require re-running the top-scoring compounds from the BRD4 screen against the BRD2 bromodomain structure and comparing Binding Confidence and Optimization Scores across the two targets. Compounds scoring highly for BRD4 but poorly for BRD2 would indicate selectivity — a desirable property for reducing off-target effects, since BRD4 and BRD2 share highly conserved acetyl-lysine binding pockets. JQ1 itself is a pan-BET inhibitor (binds BRD2, BRD3, and BRD4), so identifying BRD4-selective compounds from the AI screen would represent a potential advantage over the benchmark.
Resources
| Resource | Link |
|---|---|
| Boltz Lab Platform | docs.boltz.bio |
| Key BRD4 Paper | Filippakopoulos P. et al. Nature 468, 1067-1073 (2010) |
| JQ1 PDB Structure | rcsb.org/structure/3MXF |
Part C: Phage Lysis Protein Design Challenge
L-Protein (Lysis Protein) — 75 residues
- Soluble domain (residues 1-40): Responsible for interaction with DnaJ
- Transmembrane domain (residues 41-75): Affects lysis activity
Engineering goals: (1) DnaJ independence — L-protein folds/functions without requiring DnaJ; (2) Faster or more efficient lysis — reduces the window for E. coli to acquire resistance; (3) Higher L-protein expression — increases the amount of functional protein produced.
Approach: ESM-2 mutational scanning, experimental mutant data from PMC5775895, and conservation analysis via pBLAST + ClustalOmega were integrated to design 5 mutant L-protein sequences.
Generate mutational effect scores with ESM-2.
The ESM-2 protein language model (650M parameters) was run on the 75-residue L-protein sequence. For each position, all 19 alternative amino acid substitutions were scored by computing the log-likelihood ratio (LLR = mutant log probability - wildtype log probability). Results are saved in ms2/mutation_scores.csv (1,425 total mutations across 75 positions).
| Metric | Value |
|---|---|
| Total mutations scored | 1,425 |
| Positions | 75 |
| Soluble region (1-40) | 760 mutations |
| Transmembrane region (41-75) | 665 mutations |
| Positive LLR (predicted beneficial) | 400 (28.1%) |
| Negative LLR (predicted deleterious) | 1,025 (71.9%) |
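The scan summarized in this table can be sketched as follows. The code assumes that a per-position table of log-probabilities over the 20 amino acids has already been obtained from the language model; whether those came from masking each position or from a single unmasked pass is not stated here, and the uniform toy distribution below is a hypothetical stand-in for real model output.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def llr_scan(sequence, log_probs):
    """Score all 19 substitutions per position.

    log_probs[i][aa] is the model log-probability of residue aa at
    0-based position i. Returns (mutation, LLR) pairs, where
    LLR = log p(mutant) - log p(wild type).
    """
    scores = []
    for i, wild_type in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue
            llr = log_probs[i][aa] - log_probs[i][wild_type]
            scores.append((f"{wild_type}{i + 1}{aa}", llr))
    return scores

# Hypothetical toy input: a 4-residue sequence with uniform model output,
# so every LLR is zero. Real scores would come from the language model.
toy_sequence = "METR"
toy_log_probs = [{aa: math.log(1 / 20) for aa in AMINO_ACIDS} for _ in toy_sequence]
toy_scores = llr_scan(toy_sequence, toy_log_probs)

print(len(toy_scores), "mutations scored; first:", toy_scores[0])
print("expected count for a 75-residue protein:", 19 * 75)
```

A positive LLR means the model assigns the substitution a higher probability than the wild-type residue; it is a statement about sequence plausibility, not a direct measurement of function.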
Top 10 highest-scoring substitutions (positive LLR):
| Mutation | LLR | Region |
|---|---|---|
| C29R | +3.64 | Soluble |
| K50P | +3.56 | TM |
| C29P | +3.17 | Soluble |
| C29Q | +3.06 | Soluble |
| C29S | +3.04 | Soluble |
| K50L | +2.96 | TM |
| C29K | +2.76 | Soluble |
| C29L | +2.74 | Soluble |
| C29A | +2.55 | Soluble |
| C29T | +2.52 | Soluble |
Two positions dominate the positive LLR landscape: C29 (cysteine at position 29 in the soluble domain) and K50 (lysine at position 50 in the TM domain). ESM-2 strongly prefers substituting the cysteine at position 29, likely because free cysteines are rare in most proteins and the model considers them destabilizing. K50 scores highly because the model views a charged residue in a hydrophobic TM context as unfavorable. The most strongly disfavored mutations are all at the initiator methionine (M1).
Review the experimental mutant data.
Experimental mutant data was obtained from PMC5775895 and is stored in ms2/L-Protein Mutants - Sheet1.csv. The dataset contains 139 total entries representing 82 unique mutations across 49 positions in the L-protein.
| Category | Count |
|---|---|
| Total entries | 139 |
| Unique mutations | 82 |
| Missense mutations | 100 entries (59 unique) |
| Stop codon mutations | 39 entries |
| Missense with lysis = 1 (functional) | 35 entries (19 unique) |
| Missense with lysis = 0 (non-functional) | 65 entries (40 unique) |
Soluble domain (residues 1-40): This region is remarkably tolerant of mutation. Substitutions at R18, R19, R20 (arginine-rich region) all retain lysis activity despite dramatically changing the charge profile (R18G, R18I, R19H, R19S, R20L, R20W; all lysis = 1). Positions 23 (K→E) and 25 (E→V, E→G, E→D) are also fully tolerant. Notable exceptions: M1 (initiator Met, essential), P6L (lysis = 0), Q8L (lysis = 0), and Y39H (lysis = 0). C29R retains lysis, indicating that C29 itself is non-essential for function despite being one of the more conserved positions (conservation 0.82).
Transmembrane domain (residues 41-75): This region is far less tolerant. Most substitutions abolish lysis. K50 is functionally critical: all four tested substitutions (K50E, K50I, K50N, K50Q) show lysis = 0, yet the protein is still expressed (protein level = 1 for most), indicating K50 is required for the lysis mechanism itself, not for protein stability. Proline substitutions in the TM helix generally abolish lysis (L48P, L56P, L57P, L60P all lysis = 0). Rare functional TM mutations include L44P and A45P, so prolines at the TM boundary are tolerated, possibly because they sit at the helix-membrane interface. Positions 49-53 (S49, K50, F51, T52, N53) form a particularly intolerant stretch.
Does the experimental data correlate with the language model scores?
The ESM-2 log-likelihood ratio (LLR) scores show no meaningful correlation with experimental lysis outcomes.
- Point-biserial correlation: r_pb = -0.041, p = 0.757
- Mann-Whitney U test: U = 421, p = 0.511
- Mean LLR for lysis = 1 (functional): -0.560
- Mean LLR for lysis = 0 (non-functional): -0.433
The correlation is essentially zero and far from statistical significance. If anything, the slight negative trend (functional mutations have marginally lower LLR) contradicts the expected direction. The Mann-Whitney U test confirms that the LLR distributions for functional and non-functional mutations are not distinguishable.
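For reference, both statistics can be computed without external libraries. The sketch below implements the point-biserial correlation (a Pearson correlation between a 0/1 label and a continuous score) and the Mann-Whitney U statistic by direct pair counting. The four-point dataset is a hypothetical toy example, not the experimental data, and p-values are omitted because they would normally come from a statistics package.

```python
import math

def point_biserial(labels, values):
    """Pearson correlation between a binary (0/1) label and a continuous score."""
    n = len(labels)
    mean_x = sum(labels) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(labels, values))
    sxx = sum((x - mean_x) ** 2 for x in labels)
    syy = sum((y - mean_y) ** 2 for y in values)
    return sxy / math.sqrt(sxx * syy)

def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: count pairs with a > b, ties counting 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )

# Hypothetical toy data: lysis outcome and LLR for four mutations.
toy_lysis = [0, 0, 1, 1]
toy_llr = [1.0, 2.0, 3.0, 4.0]

toy_r = point_biserial(toy_lysis, toy_llr)
functional = [v for lab, v in zip(toy_lysis, toy_llr) if lab == 1]
non_functional = [v for lab, v in zip(toy_lysis, toy_llr) if lab == 0]
toy_u = mann_whitney_u(functional, non_functional)

print(f"r_pb = {toy_r:.3f}, U = {toy_u}")
```

The two U statistics for a pair of groups always sum to the product of the group sizes. With 19 functional and 40 non-functional unique mutations that product is 760, so the value expected under no difference is 380, close to the reported U of 421.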
Of the 59 matched mutations, ESM-2 predictions agree with experiment in approximately 30 cases (roughly 50%). That is no better than a coin flip, and below the 68% (40 of 59) that would be achieved by always predicting lysis = 0.
What does this tell you about how well protein language model embeddings capture functional information for the L-protein?
ESM-2’s evolutionary signal does not capture the functional constraints of the L-protein. Several factors explain this:
Extreme sequence rarity. The L-protein is a 75-residue protein encoded by an overlapping reading frame in the MS2 genome. It has very few homologs in sequence databases — only 2-3 close relatives (fr, M12) and a handful of distantly related levivirus lysis proteins. ESM-2 was trained on millions of protein sequences, but its effectiveness depends on having sufficient evolutionary depth to learn residue co-variation. The L-protein’s shallow phylogenetic tree means the model has little evolutionary signal to leverage.
Unusual evolutionary constraints. Because the lysis gene overlaps the coat protein and replicase genes, its evolution is constrained by the reading frames of two other genes. The selective pressures captured in ESM-2’s training data reflect these overlapping constraints, not the intrinsic functional requirements of the L-protein itself.
Non-standard function. The L-protein is a single-pass transmembrane toxin whose function (membrane disruption) may not follow the same structure-function relationships that ESM-2 captures well for globular enzymes and structured proteins.
The protein-level correlation is equally absent (r = 0.039, p = 0.768), confirming that ESM-2 does not predict expression or stability for this protein either.
Where does the model succeed and where does it fail?
Where ESM-2 succeeds:
- Strongly deleterious mutations at conserved positions: M1I and M1T (LLR = -6.13 and -5.63) are correctly predicted as non-functional. The initiator methionine is universally conserved and essential. Similarly, I42N (LLR = -1.43, lysis = 0) and I46N (LLR = -1.43, lysis = 0) in the transmembrane domain are correctly identified — replacing hydrophobic residues with polar asparagine disrupts TM helix packing.
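- Helix-disrupting substitutions in the TM helix: the proline substitutions L48P (LLR = -2.31), L56P (LLR = -1.22), L57P (LLR = -0.42), and L60P (LLR = -0.84), along with L56H (LLR = -2.11), all correctly receive negative LLR and experimentally show no lysis. This is consistent with ESM-2 having learned that proline is poorly tolerated within alpha-helical transmembrane segments, although the model does not apply this uniformly: K50P is among its highest-scoring substitutions (LLR = +3.56).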
Where ESM-2 fails:
- The arginine-rich soluble region (R18, R19, R20): R18G (LLR = -1.02), R18I (LLR = -1.37), R19H (LLR = -1.03), R19S (LLR = -0.30), R20L (LLR = -0.23), and R20W (LLR = -2.30) are all predicted as deleterious, yet every one permits lysis. This is because the soluble N-terminal domain (residues 1-40) is largely dispensable for lysis activity — the amino-terminal half of the protein can tolerate extensive mutation as long as the transmembrane domain is intact. ESM-2 cannot distinguish “conserved for overlapping gene constraints” from “conserved for L-protein function.”
- Position K50 in the TM domain: K50E (LLR = +0.50), K50I (LLR = +2.41), K50N (LLR = +0.86), and K50Q (LLR = +0.78) all receive positive LLR scores, yet all four experimentally show no lysis. K50 is a charged residue in the TM domain (“snorkeling lysine”) that is apparently critical for membrane disruption. ESM-2 treats a charged residue in a hydrophobic context as unlikely and therefore favors replacing it, when in fact the lysine is functionally essential.
- The failure pattern is region-dependent: Per-region analysis shows a slight positive trend in the soluble domain (r_pb = +0.134) but a slight negative trend in the transmembrane domain (r_pb = -0.166). Both correlations are weak: ESM-2 is marginally informative in the soluble domain and slightly anti-correlated with lysis in the transmembrane domain, likely because the functional rules for single-pass TM toxins differ from the evolutionary patterns in ESM-2’s training set.
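As a rough illustration of how a point-biserial coefficient such as r_pb is obtained, the sketch below computes it for only the mutations cited in the bullets above. The names `point_biserial` and `region`, and the use of position 41 as the start of the TM domain (following the 1-40 / 41-75 split used in this write-up), are assumptions of this sketch. Because the cited mutations are a hand-picked subset, the result will not match the values computed on the full dataset.

```python
from math import sqrt

# (mutation, ESM-2 LLR, lysis) for the mutations cited in the bullets above.
CITED = [
    ("M1I", -6.13, 0), ("M1T", -5.63, 0), ("I42N", -1.43, 0), ("I46N", -1.43, 0),
    ("L48P", -2.31, 0), ("L56P", -1.22, 0), ("L56H", -2.11, 0), ("L57P", -0.42, 0),
    ("L60P", -0.84, 0),
    ("R18G", -1.02, 1), ("R18I", -1.37, 1), ("R19H", -1.03, 1), ("R19S", -0.30, 1),
    ("R20L", -0.23, 1), ("R20W", -2.30, 1),
    ("K50E", 0.50, 0), ("K50I", 2.41, 0), ("K50N", 0.86, 0), ("K50Q", 0.78, 0),
]


def point_biserial(scores, labels):
    """Point-biserial r between a continuous score and a 0/1 label.

    Equivalent to Pearson's r with the label treated as a number.
    Returns None when either variable has zero variance.
    """
    n = len(scores)
    if n != len(labels) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_s = sum(scores) / n
    mean_l = sum(labels) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, labels))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    if var_s == 0 or var_l == 0:
        return None
    return cov / sqrt(var_s * var_l)


def region(mutation, tm_start=41):
    """Assign 'Soluble' or 'TM' from the position in a label like 'K50E'."""
    position = int(mutation[1:-1])
    return "TM" if position >= tm_start else "Soluble"


# Overall coefficient, then one per region (None if a region has one outcome only).
print("all", point_biserial([s for _, s, _ in CITED], [l for _, _, l in CITED]))
for name in ("Soluble", "TM"):
    subset = [(s, l) for m, s, l in CITED if region(m) == name]
    print(name, point_biserial([s for s, _ in subset], [l for _, l in subset]))
```

Every cited TM mutation is non-lytic, so the TM coefficient is undefined for this subset and the function returns `None`. A per-region correlation needs both outcomes represented in each region.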
Conservation analysis via pBLAST + ClustalOmega
A pBLAST search of the L-protein sequence identified 10 levivirus lysis protein homologs: fr (CAA33137), M12 (AAF19634), GA (CAA27498), JP34 (AAA72211), KU1 (AAF67675), BZ13 (ACT66727), Hgal1 (YP_007237174), C1 (YP_007237128), PP7 (NP_042306), and PRR1 (YP_717670). These were aligned with ClustalOmega, and conservation scores were computed per position (stored in ms2/conservation_scores.csv and ms2/alignment.fasta).
The alignment spans 11 sequences total (MS2 L-protein + 10 homologs), though not all sequences cover every position — the N-terminal and C-terminal regions have variable sequence coverage (2-11 sequences per position).
Highly conserved positions (conservation ≥ 0.80):
| Position | Residue | Conservation | Shannon Entropy | Region |
|---|---|---|---|---|
| 1 | M | 1.00 | 0.00 | Soluble |
| 2 | E | 1.00 | 0.00 | Soluble |
| 3 | T | 1.00 | 0.00 | Soluble |
| 4 | R | 1.00 | 0.00 | Soluble |
| 9 | S | 0.80 | 0.72 | Soluble |
| 12 | T | 0.80 | 0.72 | Soluble |
| 29 | C | 0.82 | 0.68 | Soluble |
| 46 | I | 0.82 | 0.87 | TM |
| 48 | L | 0.82 | 0.87 | TM |
| 64 | I | 0.88 | 0.54 | TM |
| 69 | T | 0.88 | 0.54 | TM |
| 70 | L | 0.88 | 0.54 | TM |
| 73 | L | 1.00 | 0.00 | TM |
| 75 | T | 1.00 | 0.00 | TM |
The first four residues (METR) are universally conserved across all homologs. C29 (conservation = 0.82) is notable as the only cysteine in the protein and is highly conserved despite ESM-2 strongly favoring its substitution, highlighting a disconnect between evolutionary conservation and model preferences.
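The per-position numbers in the table can be sketched as follows. This assumes that "conservation" is the fraction of non-gap sequences carrying the most common residue and that Shannon entropy is measured in bits over non-gap residues. Neither definition is stated explicitly here, but both reproduce the tabulated pairs: a 4-of-5 column gives 0.80 / 0.72, and a 7-of-8 column gives 0.88 / 0.54. The function name `column_stats` and the example columns are hypothetical, not taken from ms2/alignment.fasta.

```python
from collections import Counter
from math import log2

GAP_CHARS = {"-", "."}


def column_stats(column):
    """Summarize one alignment column given as a string of residues and gaps.

    Gaps are ignored, so coverage can be lower than the number of sequences.
    Returns None for an all-gap column.
    """
    residues = [aa for aa in column if aa not in GAP_CHARS]
    if not residues:
        return None
    n = len(residues)
    counts = Counter(residues)
    most_common_aa, top_count = counts.most_common(1)[0]
    # Shannon entropy in bits; log2(n / c) >= 0 keeps every term non-negative.
    entropy = sum((c / n) * log2(n / c) for c in counts.values())
    return {
        "coverage": n,
        "most_common": most_common_aa,
        "conservation": top_count / n,
        "entropy": entropy,
    }


# Hypothetical columns chosen to mirror patterns in the table above.
for label, column in [
    ("4 of 5 identical", "SSSST------"),
    ("7 of 8 identical", "IIIIIIIV---"),
    ("fully conserved", "MMMMMMMMMMM"),
]:
    stats = column_stats(column)
    print(label, round(stats["conservation"], 2), round(stats["entropy"], 2))
```

Because gaps are excluded, a position covered by only two sequences can score 1.00 as easily as one covered by all eleven, so coverage should be read alongside the conservation score.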
Highly variable positions (conservation ≤ 0.30):
| Position | Residue | Conservation | Most Common AA | Region |
|---|---|---|---|---|
| 6 | P | 0.20 | P | Soluble |
| 17 | N | 0.30 | M | Soluble |
| 18 | R | 0.18 | G | Soluble |
| 19 | R | 0.09 | L | Soluble |
| 25 | E | 0.27 | K | Soluble |
| 26 | D | 0.18 | E | Soluble |
| 28 | P | 0.27 | L | Soluble |
| 30 | R | 0.18 | S | Soluble |
| 37 | T | 0.27 | R | Soluble |
| 41 | L | 0.27 | W | TM |
| 43 | F | 0.27 | A | TM |
| 50 | K | 0.30 | D | TM |
| 53 | N | 0.30 | S | TM |
| 56 | L | 0.30 | S | TM |
| 74 | L | 0.29 | P | TM |
The soluble domain (positions 1-40) shows a gradient: the first four residues are perfectly conserved, then conservation drops substantially in the R18-R20 arginine-rich region (0.09-0.38) and at E25, D26, and P28 (0.18-0.27). The transmembrane domain (positions 41-75) has a mix of well-conserved structural residues (I46, L48, I64, T69, L70, L73, T75) and highly variable positions (L41, F43, K50, N53, L56, L74), suggesting that TM helix geometry is maintained but specific side chains can vary.
Design 5 mutant variants.
The variants below were selected by integrating three data sources: ESM-2 LLR scores (predicted mutational effect), conservation analysis (10 levivirus lysis protein homologs aligned via ClustalOmega), and experimental lysis data (59 characterized mutations). Selection criteria: positive LLR, non-conserved position (conservation < 0.8), and experimentally supported where available.
Variant 1: L-K23E
| Item | Value |
|---|---|
| Full mutant sequence | METRFPQQSQQTPASTNRRRPFEHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| Region | Soluble (position 23) |
| Mutations made | K23 → E (lysine to glutamate) |
| Language model score(s) | +0.289 (predicted beneficial) |
| Experimental support | Lysis = 1 (functional), Protein level = 0 (not detected by Western blot) |
| Conservation status | 0.545 (moderately variable; Shannon entropy = 2.05) |
| Criteria met | 3/3 (positive LLR + non-conserved + experimentally supported) |
| Rationale | Charge reversal (positive K to negative E) in the soluble domain’s basic region near the DnaJ interaction interface. Position 23 is moderately conserved but shows high entropy (2.05), indicating tolerance for diverse amino acids across levivirus lysis proteins. The K→E substitution replaces the most common residue at this position with a negatively charged alternative, which may alter the electrostatic interaction surface with DnaJ. Experimentally confirmed to retain lysis activity. |
| Target goal | DnaJ independence — the charge reversal at the chaperone interaction surface may weaken DnaJ binding while the protein retains lysis function through an alternative folding pathway. |
Variant 2: L-E25G
| Item | Value |
|---|---|
| Full mutant sequence | METRFPQQSQQTPASTNRRRPFKHGDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| Region | Soluble (position 25) |
| Mutations made | E25 → G (glutamate to glycine) |
| Language model score(s) | +0.251 (predicted beneficial) |
| Experimental support | Lysis = 1 (functional), Protein level = 0 |
| Conservation status | 0.273 (highly variable; most common AA at this position is K, not E) |
| Criteria met | 3/3 |
| Rationale | Position 25 is poorly conserved (0.273) — across the 11-sequence alignment, this site shows K, E, A, I, R, D, and other amino acids, indicating minimal functional constraint. The E→G substitution removes a bulky charged side chain and introduces maximum backbone flexibility. Experimentally confirmed functional. |
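| Target goal | Higher expression: glycine at this unconstrained position may improve co-translational folding efficiency and reduce the protein’s dependence on chaperone-assisted folding. The measured protein level for E25G is 0, so this goal is a hypothesis to test rather than an observed effect. |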
Variant 3: L-K50P
| Item | Value |
|---|---|
| Full mutant sequence | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSPFTNQLLLSLLEAVIRTVTTLQQLLT |
| Region | Transmembrane (position 50) |
| Mutations made | K50 → P (lysine to proline) |
| Language model score(s) | +3.561 (highest LLR of all candidates) |
| Experimental support | No direct data for K50P. Caution: K50E, K50I, K50N, and K50Q all show lysis = 0 experimentally, indicating K50 may be functionally essential. |
| Conservation status | 0.300 (variable; most common AA at this position is D across the alignment) |
| Criteria met | 2/3 (positive LLR + non-conserved; no direct experimental data) |
| Rationale | ESM-2 assigns the highest LLR to this mutation because K50 is a charged residue in a hydrophobic TM context, which the model scores as unlikely. This is a known ESM-2 blind spot: K50 appears to be a functionally critical “snorkeling” lysine whose charge is required for membrane disruption. The score is also at odds with ESM-2’s own penalties for proline elsewhere in the TM helix (L48P, L56P, L57P, L60P). This variant is included as a hypothesis-testing candidate: if K50P retains lysis, it would show that the helix-breaking property of proline can substitute for the charge-based mechanism. |
| Target goal | Faster/more efficient lysis — if functional, the proline-induced helix kink could create a more aggressive membrane disruption geometry. This is the highest-risk, highest-reward variant. |
Variant 4: L-K50L
| Item | Value |
|---|---|
| Full mutant sequence | METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSLFTNQLLLSLLEAVIRTVTTLQQLLT |
| Region | Transmembrane (position 50) |
| Mutations made | K50 → L (lysine to leucine) |
| Language model score(s) | +2.956 (second highest LLR) |
| Experimental support | No direct data for K50L. Same caution as Variant 3: four other K50 substitutions are non-functional. |
| Conservation status | 0.300 (variable) |
| Criteria met | 2/3 (positive LLR + non-conserved) |
| Rationale | Leucine is the most common residue in alpha-helical TM segments and represents the “default” hydrophobic substitution. Unlike the proline in Variant 3, leucine maintains helix geometry. This variant tests whether loss of K50’s charge is itself what abolishes lysis, or whether the failures of K50E/I/N/Q reflect the specific side chains introduced. Because K50I, K50N, and K50Q are already uncharged and non-functional, K50L is expected to fail if charge is the requirement. Together, Variants 3 and 4 test two hypotheses: (3) can a structural perturbation compensate for charge loss, and (4) is a helix-favoring hydrophobic residue tolerated where I, N, and Q are not? |
| Target goal | Faster/more efficient lysis — if the TM domain can function with a fully hydrophobic helix, this would indicate that membrane insertion efficiency can compensate for the loss of charge-mediated disruption. |
Variant 5: L-E25V
| Item | Value |
|---|---|
| Full mutant sequence | METRFPQQSQQTPASTNRRRPFKHVDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT |
| Region | Soluble (position 25) |
| Mutations made | E25 → V (glutamate to valine) |
| Language model score(s) | +0.152 (predicted mildly beneficial) |
| Experimental support | Lysis = 1 (functional), Protein level = 0 |
| Conservation status | 0.273 (highly variable) |
| Criteria met | 3/3 |
| Rationale | Same position as Variant 2 (E25G) but with a different substitution strategy. While E25G maximizes flexibility, E25V introduces a branched hydrophobic side chain. This provides a paired comparison at a known-tolerant position: flexibility (G) vs. hydrophobicity (V). Position 25 lies four residues upstream of the conserved C29 (conservation = 0.818), so mutations here probe the boundary between the variable N-terminal region and the more constrained core. Both E25G and E25V are experimentally confirmed functional. |
| Target goal | DnaJ independence — replacing the charged glutamate with hydrophobic valine at the soluble-domain surface creates a local hydrophobic patch that may reduce the protein’s requirement for DnaJ-mediated folding assistance. |
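The five full mutant sequences above can be regenerated and checked mechanically. The sketch below is one minimal way to do it, assuming the wild-type sequence is what results from reverting each listed mutation (all five variant sequences agree on it). The helper name `apply_mutation` is hypothetical.

```python
import re

# Wild-type MS2 L-protein (75 residues), inferred by reverting the variants above.
WILD_TYPE = (
    "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYV"
    "LIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
)

_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutation(sequence, mutation):
    """Apply a point mutation written like 'K23E' (1-based position).

    Raises ValueError if the label is malformed, the position is out of
    range, or the stated wild-type residue does not match the sequence.
    """
    match = _MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {mutation!r}")
    wt_aa, pos_text, mut_aa = match.groups()
    position = int(pos_text)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    found = sequence[position - 1]
    if found != wt_aa:
        raise ValueError(f"expected {wt_aa} at position {position}, found {found}")
    return sequence[: position - 1] + mut_aa + sequence[position:]


for label in ("K23E", "E25G", "K50P", "K50L", "E25V"):
    print(f"L-{label}", apply_mutation(WILD_TYPE, label))
```

Checking the stated wild-type residue before substituting catches off-by-one numbering errors, which are easy to make when positions are counted by hand.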
Summary
| Variant | Mutation | Region | LLR | Conservation | Exp. Lysis | Target Goal |
|---|---|---|---|---|---|---|
| 1 | K23E | Soluble | +0.289 | 0.545 | Yes | DnaJ independence |
| 2 | E25G | Soluble | +0.251 | 0.273 | Yes | Higher expression |
| 3 | K50P | TM | +3.561 | 0.300 | No data* | Faster lysis |
| 4 | K50L | TM | +2.956 | 0.300 | No data* | Faster lysis |
| 5 | E25V | Soluble | +0.152 | 0.273 | Yes | DnaJ independence |
*Other K50 substitutions (E, I, N, Q) experimentally show no lysis.
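The "Criteria met" counts can be reproduced from the summary table with a short check. This is a sketch under stated assumptions: the `Candidate` class and `criteria_met` function are hypothetical names, "experimentally supported" is read as a recorded lysis value of 1 for that exact mutation, and untested variants are represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    mutation: str
    llr: float
    conservation: float
    lysis: Optional[int]  # 1 = lysis, 0 = no lysis, None = not tested


# Values copied from the summary table above.
CANDIDATES = [
    Candidate("K23E", 0.289, 0.545, 1),
    Candidate("E25G", 0.251, 0.273, 1),
    Candidate("K50P", 3.561, 0.300, None),
    Candidate("K50L", 2.956, 0.300, None),
    Candidate("E25V", 0.152, 0.273, 1),
]


def criteria_met(candidate, conservation_cutoff=0.8):
    """Return (count, per-criterion results) for the three selection criteria."""
    checks = {
        "positive LLR": candidate.llr > 0,
        "non-conserved": candidate.conservation < conservation_cutoff,
        "experimentally supported": candidate.lysis == 1,
    }
    return sum(checks.values()), checks


for candidate in CANDIDATES:
    count, checks = criteria_met(candidate)
    failed = [name for name, ok in checks.items() if not ok]
    print(candidate.mutation, f"{count}/3", "missing: " + ", ".join(failed) if failed else "")
```

Keeping the three checks in a named dictionary makes it explicit which criterion each K50 variant misses, rather than reporting only the 2/3 total.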
Caveats
K50 risk: Variants 3 and 4 target position K50, where 4/4 tested mutations are non-functional. These are included as hypothesis-testing variants, not safe bets. If a lower-risk TM selection is preferred, alternatives include L44P (lysis = 1, LLR = -1.84) or A45P (lysis = 1, LLR = -0.43), though these have negative ESM-2 scores.
Position redundancy: The design includes two mutations at position 25 and two at position 50. This enables paired comparisons (flexibility vs. hydrophobicity at pos 25; helix-breaking vs. helix-maintaining at pos 50) but reduces the diversity of positions tested.
ESM-2 limitations for L-protein: As documented in the correlation analysis, ESM-2 LLR scores do not predict lysis outcomes for this protein (r_pb = -0.041). The conservation analysis and experimental data were therefore weighted more heavily in the final selection.