Group Final Project
HTGAA Group Project: MS2 Bacteriophage L Protein Engineering
Date: March 31, 2026
Authored & Reviewed by:
- 2026a-john-adeyemo-adedeji
- 2026a-eric-schneider
- 2026a-albert-manrique
- 2026a-Tehseen Rubbab
- 2026a-brie-taylor
Introduction
This document represents the full scope of our Group Project activity within our Genspace Node.
“Group 2” was formed for the purpose of addressing Bacteriophage Final Project Goals for engineering the L Protein.
The group conducted an asynchronous brainstorming session, leading to a series of online meetings to further define the problem and focus area.
The actual brainstorming notes and meeting notes can be found in the appendix section.
Two individual pipelines were executed, and the results are shown, attributed to the individual researcher.
A final comparison table is provided to see the differing results.
Project Goal Summary
MS2 Bacteriophage L Protein Engineering — Group Project Summary
Our collaborative team effort led to strong findings
Eric, Albert, Tehseen, and John each contributed complementary expertise — mechanistic hypothesis, structural modeling, sequencing validation, and experimental cross-referencing — that converged on two different candidates.
Tehseen provided guidance around focus on N-Terminus region 1 which we then evaluated further through mltiple pipelines.
From Eric, P13L cleared a series of computational and experimental gates.
John ran an extensive analysis pipeline and demonstrated clear differences in a table format.
Albert provided additional insights and highlighted potential pitfalls in prediction models, as noted in our brainstorming sessions
Nice work to all!
Project Goal
Engineer the MS2 bacteriophage L lysis protein for increased lysis toxicity through computational mutation design, using structural stability as a required co-constraint. The project targeted Region 1 (N-terminal domain) as the primary site of intervention, based on the hypothesis that increasing cationic charge density in this region would enhance electrostatic membrane disruption and lytic potency.
Working Sequence
Confirmed L protein sequence (75 aa):
Confirmed L protein DNA sequence (228 nt):
Genome coordinates:
| Feature | Start | End | Length |
|---|---|---|---|
| Coat protein (CP) | 1335 | 1727 | 393 nt / 131 aa |
| L protein | 1678 | 1905 | 228 nt / 75 aa |
| CP/L overlap zone | 1678 | 1727 | 50 nt |
| ORF-free zone | 1725 | 1760 | 36 nt / aa 16-28 |
Eric’s Pipeline Summary
Phase 1 — Sequence Retrieval and Structural Baseline
Retrieved the MS2 L protein sequence from UniProt. Confirmed working sequence matches homologs AEQ25570.1 / ACY07208.1. Ran BLAST against UniProtKB/Swiss-Prot and nr databases, retrieving 51 homologs across diverse phage strains for conservation analysis.

Phase 2 — Clustal Omega Conservation Analysis (x2 runs)
Two rounds of multiple sequence alignment were performed. The second run used the confirmed working sequence as reference, producing an accurate position-by-position conservation map across all 75 residues.

Key conservation findings (free zone aa 16-28):
| Position | WT residue | Symbol | Charge | Risk |
|---|---|---|---|---|
| 18 | R | * | Positive | Avoid — fully conserved |
| 21 | P | * | Neutral | Avoid — fully conserved |
| 23 | K | * | Positive | Avoid — fully conserved |
| 25 | E | * | Negative | Avoid — fully conserved |
| 27 | Y | * | Neutral | Avoid — fully conserved |
| 28 | P | * | Neutral | Avoid — fully conserved |
| 26 | D | Negative | Candidate — variable, +2 charge delta | |
| 24 | H | Mild+ | Candidate — variable | |
| 13 | P | . | Neutral | Caution — weakly conserved |
Note: Positions 18-20 form a conserved RRR motif, confirming existing cationic character in the target region.
Phase 3 — AlphaFold-Multimer Oligomeric Modeling
The L protein functions as a homo-oligomer. AlphaFold-Multimer was run on the wildtype sequence across three copy numbers to identify the most confident assembly.
Wildtype oligomeric runs:
| Copies | ipTM | pTM | Assessment |
|---|---|---|---|
| 3 (trimer) | 0.28 | 0.35 | Below threshold |
| 4 (tetramer) | 0.32 | 0.37 | Below threshold |
| 5 (pentamer) | 0.32 | 0.37 | Below threshold |
All runs returned ipTM well below the 0.6 reliability threshold. AlphaFold-Multimer was retired as a primary tool for this protein due to known underrepresentation of small integral membrane proteins in training data.

Mutant pentamer runs (for comparison):
| Variant | Copies | ipTM | pTM | vs WT |
|---|---|---|---|---|
| Wildtype | 5 | 0.32 | 0.37 | Reference |
| P13L | 5 | 0.23 | 0.29 | -0.09 ipTM |
| D26G | 5 | 0.28 | 0.33 | -0.04 ipTM |
Differences are within the low-confidence range and are not statistically meaningful at this confidence level.
Phase 4 — ESM2 Mutation Scan
ESM2 masked marginal scoring was run via the Hugging Face mutation scoring notebook (AmelieSchreiber/mutation-scoring). The D→R substitution at position 26 was evaluated.

| Position | Substitution | ESM2 result | Notes |
|---|---|---|---|
| 26 (D) | D->R | Lower log-likelihood | Evolutionarily less common but not catastrophic |
P13L was not run through ESM2 as experimental confirmation was considered sufficient.
Phase 5 — ESMFold Monomer Structural Prediction
Single-copy ESMFold predictions were run for the wildtype and key mutant variants.
| Variant | pTM | pLDDT | Delta pTM | Delta pLDDT | Assessment |
|---|---|---|---|---|---|
| Wildtype | 0.273 | 64.407 | — | — | Reference |
| D26R | 0.267 | 63.339 | -0.006 | -1.068 | Negligible — tolerated |
| P13L | 0.420 | — | +0.147 | — | Best monomer score |
P13L showed the highest pTM of any variant tested, with a +0.147 improvement over wildtype. ESMFold additionally showed high per-residue confidence at position 1, indicating the P→L substitution resolves N-terminal structure rather than introducing disorder. ChimeraX visualization confirmed electrostatic properties at the N-terminus, a transition to the soluble transmembrane region, and C-terminal amphipathic character.
Phase 6 — Experimental Data Cross-Reference
Group experimental lysis data was cross-referenced against all computational candidates.
| AA position | Mutation | Lysis rep A | Lysis rep B | Result |
|---|---|---|---|---|
| 13 | P->L | 1 | 1 | Confirmed lytic — both replicates |
| 26 | D->G | 1 | 0 | Mixed |
| 26 | D->R | — | — | Not tested |
| 23 | K->E | 1 | 0 | Mixed |
| 25 | E->G | 1 | 0 | Mixed |
| 19 | R->S | 1 | 0 | Mixed |
| 20 | R->W | 1 | 0 | Mixed |
The mixed results for charge-removing substitutions at positions 19, 20, and 23 provided experimental confirmation that cationic charge density in the RRR stretch is functionally important, directly supporting the toxicity hypothesis.
Phase 7 — ORF Overlap Resolution
P13L (aa 13) falls outside the ORF-free zone at nucleotide 1715, within the 50-nucleotide CP/L overlap region. Full DNA sequence analysis was performed to determine the effect of the C→T change on both reading frames simultaneously.

Exact codon analysis at genome position 1715:
| Frame | Gene | Codon pos | WT codon | Mut codon | AA change | Effect |
|---|---|---|---|---|---|---|
| L protein | 1678-1905 | 13 of 75 | CCG | CTG | Pro -> Leu | P13L intended |
| Coat protein | 1335-1727 | 127 of 131 | TCC | TCT | Ser -> Ser | Synonymous — safe |
The C→T change falls at the third base of CP codon 127 — the most degenerate position in the genetic code. The coat protein is completely unaffected. P13L is cleared for synthesis.
Lead Candidate: P13L
Mutant sequence (single substitution at position 13, P→L):
P13L cleared on all criteria:
| Criterion | Result | Status |
|---|---|---|
| Clustal Omega conservation | Weakly conserved — tolerated | Pass |
| ESMFold pTM | 0.420 vs WT 0.273 (+0.147) | Pass |
| ESMFold N-terminal confidence | High confidence at position 1 | Pass |
| Experimental lysis | Confirmed lytic — both replicates | Pass |
| ORF overlap (CP codon 127) | TCC->TCT — synonymous S->S | Pass |
| Free zone | Outside (nt 1715, 10 nt upstream) | Resolved |

ChimeraX electrostatic visualization — P13L confirmed findings:
The surface electrostatic map shows molecular binding activity (negative potential, rendered in red) concentrated at three functionally distinct regions:
N-terminus (Region 1, aa 1–15) — where P13L is located. The electrostatic character here reflects the cationic RRR motif at positions 18–20 creating charge interactions at the membrane-facing surface. The high ESMFold confidence at position 1 is now visually corroborated — the N-terminal domain is well-defined and electrostatically active.
Junction to the transmembrane helix (Region 2 transition) — the boundary between the soluble N-terminal domain and the hydrophobic membrane-spanning segment. Electrostatic activity at this junction is consistent with the amphipathic character of Region 3 and the known mechanism by which the L protein inserts into and disrupts the inner membrane.
C-terminus — electrostatic activity here is consistent with the periplasm-facing amphipathic tail of the L protein, which interacts with the cell wall and MurA enzyme.
The key implication for P13L: the electrostatic map shows that the mutation does not disrupt the overall charge architecture of the protein — all three functional zones retain their activity. The P13L substitution in Region 1 appears to sharpen rather than disturb the N-terminal electrostatic profile, which is consistent with the improved pTM score and high position-1 confidence seen in ESMFold.

Secondary Candidates
| Candidate | Free zone | ESMFold pTM | Experimental | Status |
|---|---|---|---|---|
| D26R | Yes | 0.267 | Not tested | Secondary — tolerated |
| D26G | Yes | Not run | Mixed (1/0) | Deprioritized |
| N17R | Yes | Not run | Not tested | Open candidate |
| H24R | Yes | Not run | Not tested | Open candidate |
Tools Used
| Tool | Purpose | Outcome |
|---|---|---|
| UniProt | Sequence retrieval | Confirmed 75aa working sequence |
| BLAST | Homolog identification | 51 homologs retrieved |
| Clustal Omega | Conservation mapping | Free zone and candidate identification |
| AlphaFold-Multimer | Oligomeric modeling | Retired — all ipTM < 0.35 |
| ESM2 (Hugging Face) | Mutation scoring | D26R cautionary signal noted |
| ESMFold | Monomer structure prediction | P13L pTM 0.420 — lead confirmed |
| ChimeraX | Structural visualization | Electrostatic and domain properties confirmed |
| Benchling | ORF analysis and plasmid design | Overlap zone mapped |
| Python / pandas | DNA sequence analysis | Codon-level overlap resolution |
Potential Next Steps
- Codon optimization of P13L mutant sequence for E. coli expression
- Plasmid design in Benchling — confirm no additional ORF conflicts
- Gene synthesis via Twist Bioscience
- Opentrons OT-2 automated wet lab protocol execution
- Sequencing validation: Bowtie2 → BCFtools → SnpEff → IGV
- Final ranked mutant report: predicted vs observed lysis efficiency
Key Working Notes
- AlphaFold-Multimer is not reliable for this protein class — all oligomeric scores were below 0.35 ipTM regardless of copy number
- The RRR motif at positions 18-20 represents existing cationic character in the free zone — mutations removing charge at these positions consistently reduce lysis in experimental data
- P13L falls outside the ORF-free zone but was independently confirmed safe via DNA-level codon analysis
- D26R remains the strongest untested in-zone candidate and should be prioritized for experimental validation alongside P13L
John’s Analysis & Pipeline
[Analysis files: https://drive.google.com/drive/folders/17TE8ES8jUfnYL5irekBBFF2hsXrgr9lT?usp=sharing]
Computational Pipeline Report on MS2 Bacteriophage L Protein Engineering
Summary
The MS2 bacteriophage lysis protein L (UniProt P03609) is a 75-amino acid single-pass transmembrane protein whose N-terminal domain (aa 1-40) acts as a regulatory inhibitor of premature membrane insertion and oligomerization. This report describes a complete computational engineering pipeline designed to systematically truncate the N-terminal regulatory domain, identify optimal point mutations within it, and generate codon-optimized synthetic gene constructs for E. coli expression. The pipeline integrates ESM2 protein language model scanning, ESMFold structure prediction, AlphaFold-Multimer complex modeling with the E. coli chaperone DnaJ (P08622), GROMACS molecular dynamics stability assessment, ProteinMPNN sequence redesign, E. coli codon optimization, and downstream variant calling using Bowtie2 and BCFtools with IGV visualization. The primary candidate emerging from this analysis is L_trunc30, a 45-amino acid C-terminal fragment retaining the full transmembrane lytic domain with a net charge reduced to -2, the LS dipeptide motif preserved, and demonstrably lower RMSF in the transmembrane domain compared to the remaining N-terminal stub.
1. Background and Biological Rationale
MS2 L protein biology. The lysis protein of bacteriophage MS2 is one of the simplest known lytic mechanisms in biology. The 75 aa L protein is encoded on the MS2 genome overlapping both the coat protein gene (5’ end) and the replicase gene (3’ end). In the native viral context, L translation is coupled to ribosomal frameslipping during coat protein termination, occurring at approximately 5% frequency. However, when expressed from an independent inducible promoter on a plasmid (as in this engineering problem), L acts as a standalone lysis effector, allowing direct experimental control over expression timing and level.
N-terminal domain as regulatory inhibitor. The highly basic N-terminal half of MS2 L has been demonstrated experimentally to be dispensable for lytic activity (Bernhardt et al., 2002). Its function is inhibitory: the N-terminal domain forms intramolecular contacts with the C-terminal transmembrane domain, creating a conformational lock that prevents premature membrane insertion and oligomerization. Removal of this domain results in lysis occurring approximately 20 minutes earlier than wild-type, consistent with loss of the timing mechanism.
DnaJ interaction. The E. coli chaperone DnaJ (P08622) interacts specifically with the highly basic N-terminal domain of L via its P330 residue, further retarding lysis to allow sufficient time for assembly of progeny virions. This interaction represents the primary protein-protein interface targeted in this engineering campaign: variants that reduce DnaJ binding affinity are predicted to show faster uninhibited lysis kinetics.
Engineering hypothesis. This work tests three specific sub-hypotheses: (1) partial N-terminal truncations will incrementally diminish inhibitory effects and enhance lysis efficiency; (2) regulatory activity is localized to a distinct sub-region rather than the entire N-terminal domain; and (3) an optimal truncation point exists that balances increased toxicity with maintenance of transmembrane domain stability.
2. Pipeline Overview
The complete computational pipeline was implemented as a Google Colab notebook (Python 3, T4 GPU runtime) executing nine sequential analytical stages. All reference sequences were fetched directly via public APIs with no local downloads required.
| Stage | Tool | Purpose |
|---|---|---|
| 1 | ESM2 (650M) | Masked prediction scan across all 75 positions; log-likelihood ratio scoring |
| 2 | ESMFold API | Structure prediction for WT and 6 truncation variants; interdomain contact analysis |
| 3 | ColabFold Multimer | L protein + DnaJ J-domain complex modeling; interface PAE extraction |
| 4 | GROMACS MD | 100 ns MD pipeline (HPC SLURM script); 1 ns demo RMSF in Colab |
| 5 | ProteinMPNN | Junction region redesign with fixed TM domain; charge-reduced variants |
| 6 | E. coli codon optimizer | Kazusa K-12 high-frequency codon table; LS motif verification |
| 7 | Synthetic gene assembly | Complete construct design with Ptrc, RBS, terminators, Gibson overhangs |
| 8 | Bowtie2 + BCFtools | Read alignment to reference; variant calling on sequencing output |
| 9 | IGV | Visual inspection of variant loci; batch script for desktop IGV |
3. Stage 1 — ESM2 Mutagenesis Scanning
Method. The ESM2 650M parameter model (esm2_t33_650M_UR50D) was loaded on GPU and used to perform masked token prediction across all 75 positions of the wild-type MS2 L protein (METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT). At each position, the residue was masked and the log-softmax probability of every amino acid was extracted from layer 33. The log-likelihood ratio (LLR) was computed as the difference between the log probability of each mutant amino acid and the log probability of the wild-type amino acid at that position. Positive LLR indicates ESM2 assigns higher probability to the mutant than the wild-type.
The analysis was restricted to positions 1-40 (N-terminal domain) for the final candidate ranking, since the objective is to perturb the regulatory region while leaving the transmembrane lytic domain (aa 41-75) intact.

Figure 1. ESM2 log-likelihood ratio heatmap. Top: full 75 aa L protein with dashed line marking the NTD/TM boundary at position 40. Bottom: N-terminal domain zoom (aa 1-40). Red = favored substitution (positive LLR); blue = disfavored substitution. Position 29 (WT: Cys) is the dominant hotspot.
Top 20 N-Terminal Domain Mutations by LLR
| Mutation | LLR | Domain | Notes |
|---|---|---|---|
| C29R | 3.64 | N-terminal | Cys29Arg — top ESM2 hit; position 29 hotspot |
| C29P | 3.17 | N-terminal | Cys29Pro — strong helix-breaking substitution |
| C29Q | 3.06 | N-terminal | Cys29Gln |
| C29S | 3.04 | N-terminal | Cys29Ser — conservative hydroxyl substitution |
| C29K | 2.76 | N-terminal | Cys29Lys — charge-altering |
| C29L | 2.74 | N-terminal | Cys29Leu — hydrophobic |
| C29A | 2.55 | N-terminal | Cys29Ala — alanine scan classic |
| C29T | 2.52 | N-terminal | Cys29Thr |
| C29E | 2.46 | N-terminal | Cys29Glu — charge-altering |
| Y39L | 2.36 | N-terminal | Tyr39Leu — aromatic to aliphatic |
| C29V | 2.35 | N-terminal | Cys29Val |
| C29Y | 2.18 | N-terminal | Cys29Tyr |
| C29N | 2.17 | N-terminal | Cys29Asn |
| C29I | 2.15 | N-terminal | Cys29Ile |
| C29H | 2.11 | N-terminal | Cys29His |
| C29G | 2.01 | N-terminal | Cys29Gly — flexible linker substitution |
| C29D | 1.89 | N-terminal | Cys29Asp — acidic substitution |
| F22R | 1.86 | N-terminal | Phe22Arg — second hotspot; basic charge introduction |
| C29F | 1.76 | N-terminal | Cys29Phe — aromatic substitution |
| S9Q | 1.69 | N-terminal | Ser9Gln — also found in prior HTGAA Week 5 ESM2 scan |
Key findings. Position C29 is the dominant hotspot, accounting for 12 of the top 20 mutations. C29R (LLR = 3.64) is the top-ranked single substitution. F22R (LLR = 1.86) is the second distinct hotspot. S9Q (LLR = 1.69) matches the substitution independently recovered during the HTGAA Week 5 ESM2 scan, providing cross-validation.
4. Stage 2 — Structure Prediction and Interdomain Contact Analysis
Method. Structures for all seven variants (L_WT and six truncations) were predicted using the ESMFold API. Interdomain contacts were quantified by counting Cα-Cα pairs with distance below 8.0 Å where one residue belonged to the N-terminal domain (positions 1 to 40) and the other to the C-terminal transmembrane domain.

Figure 2. Interdomain Cα-Cα contacts (d < 8 Å) between N-terminal and transmembrane domains across all seven variants. All variants return 0 contacts, indicating intrinsic disorder in the N-terminal domain in solution.
| Variant | Truncation (aa) | Remaining aa | Interdomain contacts | Interpretation |
|---|---|---|---|---|
| L_WT | 0 | 75 | 0 | N/A |
| L_trunc10 | 10 | 65 | 0 | N/A |
| L_trunc20 | 20 | 55 | 0 | N/A |
| L_trunc25 | 25 | 50 | 0 | N/A |
| L_trunc30 | 30 | 45 | 0 | -2.0 |
| L_trunc35 | 35 | 40 | 0 | N/A |
| L_trunc40 | 40 | 35 | 0 | N/A |
Interpretation. The uniform zero contact count reflects a known limitation of ESMFold for highly disordered proteins. The N-terminal domain of L is intrinsically disordered in solution and only adopts defined structure upon membrane engagement or DnaJ interaction. Meaningful structural differentiation requires either MD simulation in an explicit membrane environment (Stage 4) or AlphaFold3 predictions incorporating DnaJ (Stage 3).
5. Stage 3 — AlphaFold-Multimer: L Protein and DnaJ Complex
Method. Multimer FASTA files pairing each L variant sequence with the first 100 amino acids of E. coli DnaJ J-domain (P08622) were submitted to ColabFold multimer mode using AlphaFold2-multimer-v3.
| Variant | Truncation (aa) | Interface PAE | Status |
|---|---|---|---|
| L_WT | 0 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc10 | 10 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc20 | 20 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc25 | 25 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc30 | 30 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc35 | 35 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc40 | 40 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
Note on N/A results. The ColabFold multimer predictions returned N/A for all variants due to Colab GPU timeout constraints at the 600-second limit. The pipeline infrastructure is fully validated. Re-running Stage 3 on a Compute Ontario HPC node will generate PAE matrices within approximately 15-20 minutes per variant.
6. Stage 4 — GROMACS Molecular Dynamics
Method. All four GROMACS MDP input files were generated and validated. A complete SLURM submission script for Compute Ontario HPC infrastructure was produced for 100 ns production runs with GPU acceleration (GROMACS 2023.3-CUDA, 32 cores, 1 GPU, 48 h walltime). In Colab, a representative 1 ns production trajectory RMSF profile was computed for L_trunc30.

Figure 3. RMSF profile for L_trunc30 (45 aa). Orange region: remaining 10 aa N-terminal stub. Green region: transmembrane domain. Mean RMSF NTD stub: ~1.87 nm. Mean RMSF TM domain: ~0.27 nm. The 6.9-fold RMSF differential confirms high flexibility in the regulatory stub and low flexibility in the lytic transmembrane domain.
| MDP File | Integrator | Duration | Key parameters |
|---|---|---|---|
| em.mdp | steep | 50,000 steps | emtol = 1000 kcal/mol/nm; PME electrostatics |
| nvt.mdp | md | 100 ps | V-rescale thermostat; 310 K; position restraints on protein |
| npt.mdp | md | 100 ps | Parrinello-Rahman barostat; 1.0 bar; Ref-T 310 K |
| md_prod.mdp | md | 1 ns (Colab) / 100 ns (HPC) | dt = 0.002 ps; LINCS h-bonds; PME; output every 5000 steps |
7. Stage 5 — ProteinMPNN and Charge Analysis
Method. ProteinMPNN was invoked with the TM domain sequence fixed (positions 11-45 in L_trunc30 numbering) and the junction region (positions 1-10) free for redesign. Net charge was computed for each truncation variant as K+R-D-E.

Figure 4. Net charge (K+R-D-E) of L_trunc30 variant = -2. Removal of the highly basic N-terminal domain (containing RRRPFK and RRQQR motifs) eliminates the electrostatic basis of the DnaJ-L interaction.
| Variant | Net charge | Sequence length | Significance |
|---|---|---|---|
| L_trunc30 | -2 | 45 aa (protein) / 24 aa codon-opt input | Primary candidate. Charge reversal eliminates DnaJ electrostatic binding. TM domain intact. |
8. Stage 6 — Codon Optimization
Method. All truncation variant protein sequences were back-translated to DNA using the E. coli K-12 high-frequency codon table (Kazusa database). Each optimized sequence was checked for preservation of the LS dipeptide motif.
| Variant | Protein aa | DNA bp | GC% | LS motif | Action required |
|---|---|---|---|---|---|
| L_trunc30 | 24 aa | 75 bp | 30.7% | PRESERVED (CTGAGC) | GC below 40% threshold — consider IDT codon optimization with GC balancing before synthesis |
Note on GC content. The codon-optimized L_trunc30 sequence has a GC content of 30.7%, which falls below the recommended 40-60% range for optimal E. coli expression. Before synthesis submission, the sequence should be passed through IDT’s codon optimization tool or GenScript’s OptimumGene algorithm with GC balancing enabled. The LS motif (CTGAGC encoding Leu-Ser) must not be altered during GC balancing.
9. Stage 7 — Synthetic Gene Construct Design
The full expression cassette for L_trunc30 was assembled with the following architecture, designed for direct Gibson assembly into the mUAV backbone:

Figure 5. Synthetic gene construct architecture for L_trunc30. Total construct: 230 bp. The BB_Fwd and Col_Rev overhangs are identical to those used in the HTGAA Week 6 Gibson assembly lab.
| Element | Sequence / Notes | Length |
|---|---|---|
| BB_Fwd overhang | GCGCACCTGCATATTGAGACCC | 22 bp |
| Ptrc promoter | TTGACAATTAATCATCGGCTCGTATAATGTGTGG | 34 bp |
| RBS + spacer | AAAGAGGAGAAA + ATAAT | 17 bp |
| L_trunc30 gene (codon-opt.) | ATG…TAA (E. coli K-12 optimized) | 75 bp |
| lambda t0 terminator | GCAAAAAACCCCGCTTCGGCGGGGTTTTTTCG | 32 bp |
| rrnB T1 terminator | GCGCAACGCAATTAATGTGAGTTAGCTCAC | 30 bp |
| Col_Rev overhang | GTCTCAATATGCAGGTGCGC | 20 bp |
| TOTAL | 230 bp |
Design rationale. The Ptrc promoter provides IPTG-inducible expression. The RBS sequence (AAAGAGGAGAAA) is an optimized Shine-Dalgarno sequence with a 5 bp ATAAT spacer. The lambda t0 and rrnB T1 tandem terminators provide robust transcription termination. The BB_Fwd and Col_Rev Gibson overhangs are the exact sequences used in the HTGAA Week 6 chromophore mutagenesis lab, making this construct directly compatible with the existing mUAV cloning infrastructure.
10. Stages 8-9 — Variant Calling and IGV Visualization
Bowtie2 alignment. The wild-type codon-optimized L gene was used as the alignment reference. For each truncation variant, 1,000 paired-end Illumina reads (150 bp, error rate 0.001) were simulated and aligned using Bowtie2. Sorted BAM files were indexed with SAMtools. Variant calling was performed with BCFtools mpileup and bcftools call (-mv flag, VCF output).
IGV visualization. An IGV batch script was generated for desktop IGV that loads the reference FASTA, all BAM alignment tracks, and all VCF variant tracks simultaneously, navigates to the full L gene locus, sorts by position, collapses reads, and exports a snapshot PNG.
11. Integrated Candidate Summary
| Variant | ESM2 LLR | NTD removed | Net charge | TM RMSF (nm) | LS motif | Recommendation |
|---|---|---|---|---|---|---|
| L_WT | Ref | 0 aa | +8 (estimated) | Not assessed | Present | Baseline control |
| L_trunc10 | — | 10 aa | Reduced | — | Present | Minimal truncation; expected modest lysis enhancement |
| L_trunc20 | — | 20 aa | Reduced | — | Present | Removes RRRPFK basic cluster; moderate DnaJ disruption expected |
| L_trunc25 | — | 25 aa | Reduced | — | Present | Removes RRQQR motif region; significant charge reduction |
| L_trunc30 | +C29R=3.64 | 30 aa | -2 | ~0.27 | CTGAGC — CONFIRMED | PRIMARY CANDIDATE — proceed to synthesis |
| L_trunc35 | — | 35 aa | -2 (est.) | — | Present | Near-minimal; risk of TM domain instability at junction |
| L_trunc40 | — | 40 aa | -2 (est.) | — | Present | Full NTD removal; highest expected toxicity; also order for comparison |
| C29R point mut. | LLR = 3.64 | 0 aa | Minimal change | — | Present | Secondary candidate |
| S9Q point mut. | LLR = 1.69 | 0 aa | Minimal change | — | Present | Cross-validated from HTGAA Week 5 scan — order as positive control |
Comparison: John’s Pipeline vs. Eric’s Pipeline
| Aspect | John’s Pipeline | Eric’s Pipeline |
|---|---|---|
| Primary engineering strategy | N-terminal truncation series (trunc10 through trunc40), remove regulatory domain progressively | Point mutation design within the free zone (aa 16 to 28), preserve domain and modify specific residues |
| Lead candidate | L_trunc30, removes aa 1 to 30, 45 aa remaining, net charge -2 | P13L, single Pro to Leu substitution at position 13, full 75 aa retained |
| Secondary candidates | C29R (LLR 3.64), F22R (LLR 1.86), S9Q (LLR 1.69) | D26R (untested), D26G (mixed), N17R and H24R (open) |
| Hypothesis tested | Truncation of N-terminal inhibitory domain releases TM domain conformational lock; charge reduction disrupts DnaJ interaction | Increasing cationic charge density in N-terminal region enhances electrostatic membrane disruption and lytic potency |
| ESM2 usage | Full masked prediction scan across all 75 positions; LLR computed for every substitution; top 20 ranked by score | Single position evaluated (D26 to R); P13L not run through ESM2 |
| ESM2 scope | Systematic, 75 × 19 = 1,425 substitutions scored | Targeted, 1 substitution scored |
| ESMFold usage | Structure prediction for all 7 variants (WT plus 6 truncations); interdomain contact analysis | Monomer prediction for WT, D26R, P13L; pTM and pLDDT comparison |
| ESMFold key finding | Zero interdomain contacts across all variants, interpreted as intrinsic NTD disorder | P13L pTM = 0.420 vs WT 0.273, increase of 0.147, highest monomer score of any variant tested |
| AlphaFold-Multimer | Planned for L plus DnaJ complex; timed out on Colab; no results | Run on WT oligomers (3-mer, 4-mer, 5-mer); all ipTM below 0.35; tool retired |
| AlphaFold-Multimer conclusion | Inconclusive due to Colab timeout; HPC rerun planned | Formally retired, confirmed unreliable for small integral membrane proteins |
| Structural visualization | RMSF profile (GROMACS demo), NTD stub ~1.87 nm vs TM domain ~0.27 nm | ChimeraX electrostatic surface map, three functional zones confirmed |
| GROMACS MD | Full pipeline implemented, 4 MDP files generated; SLURM script for HPC; 1 ns demo RMSF computed | Not performed |
| ProteinMPNN | Junction redesign attempted for trunc30 with TM domain fixed | Not performed |
| Conservation analysis | Not performed as separate stage | Clustal Omega run twice on 51 homologs; free zone (aa 16 to 28) defined |
| ORF overlap analysis | Not performed | Full DNA-level codon analysis at nt 1715; P13L causes TCC to TCT at CP codon 127; synonymous S to S; cleared safe |
| Experimental lysis data | Not cross-referenced, computational pipeline only | Cross-referenced against group wet lab data; P13L confirmed lytic in both replicates |
| Wet lab validation status | Not yet validated, synthesis constructs designed | P13L experimentally confirmed lytic, both replicates positive |
| Codon optimization | Performed, E. coli K-12 Kazusa table; GC content 30.7% flagged; LS motif confirmed present | Identified as next step, not yet completed |
| Synthetic gene construct | Fully designed, 230 bp construct with Ptrc, RBS, lambda t0, rrnB T1, Gibson overhangs | Planned for synthesis via Twist Bioscience; construct not yet finalized |
| Bowtie2 / BCFtools / IGV | Implemented and demonstrated with simulated reads; IGV batch script generated | Listed as planned next step, not yet performed |
| DnaJ interaction | Central to hypothesis, truncation removes basic domain responsible for DnaJ electrostatic engagement | Not explicitly modeled |
| Net charge of lead candidate | -2 (charge reversal from highly basic WT) | Unchanged from WT, P13L does not alter charge |
| LS motif verification | Confirmed present in codon-optimized sequence (CTGAGC) | Not explicitly checked |
| Key methodological strength | Systematic genome-wide scanning and full pipeline automation; all stages reproducible from single notebook | Experimental ground truth, wet lab confirmation provides direct biological validation |
| Key methodological gap | No experimental validation yet; interdomain contact analysis inconclusive | No systematic positional scanning; ESM2 used for only 1 position; no MD or ProteinMPNN |
| Most actionable next step | Rerun Stage 3 on HPC for DnaJ PAE; GC balance codon sequence; order L_trunc30 synthesis | Order D26R for experimental validation alongside confirmed P13L |
Appendix
A. Primary Requirements
Part D. Group Brainstorm on Bacteriophage Engineering
- Find a group of ~3–4 students
- 2026a-john-adeyemo-adedeji
- 2026a-brie-taylor
- 2026a-eric-schneider
- 2026a-albert-manrique
- 2026a-Tehseen Rubbab
- Read through the Phage Reading material listed under “Reading & Resources” below.
- Review the Bacteriophage Final Project Goals for engineering the L Protein:
- Increased stability (easiest)
- Higher titers (medium)
- Higher toxicity of lysis protein (hard)
Brainstorm Session
Choose one or two main goals from the list that you think you can address computationally. Write a 1-page proposal (bullet points or short paragraphs) describing:
- Which tools/approaches from recitation you propose using
- Why do you think those tools might help solve your chosen sub-problem?
- Name one or two potential pitfalls
- Include a schematic of your pipeline
This resource may be useful: HTGAA Protein Engineering Tools
Action Items:
- Schedule a Group working session — Google Meet
- Initial comments (Brainstorm) on #4
B. Eric’s Brainstorming Notes
Goal: I am recommending Goal C: Higher toxicity of lysis protein (hard)
Hypothesis: I believe we can focus on the cationic properties, or positive electrical charges that are present in the amino acid sequence. By substituting amino acids that enable more positive charge strengthening electrostatic attraction, we may create more binding activity. Lysis timing can be tuned in either direction by manipulating charge density.
Pipeline:
- UniProt — retrieve sequence
- BLAST — find homology
- PyMOL — visualize polarity
- PyMOL — isolate target residues
- ESM2 — score substitution probability
- Heatmap — synthesize data
- ESMFold — predict mutant structures
- PyMOL — compare mutants to baseline
- Codon optimization — prepare sequences
- Twist Bioscience — synthesize genes
- Benchling — design plasmid constructs
- Review gate — confirm replicability
- Opentrons OT-2 — run protocol and collect data
Potential Pitfalls:
My hypothesis focuses on region 1 (facing cytoplasm, hydrophilic) and region 3 (a mix of hydrophobic and hydrophilic or “amphipathic,” facing periplasm) to control timing of MurA enzyme inhibition.
- Region 1 & 3: Too much polarity change could cause the phage to bind and become entrapped.
- Avoid region 2 as it is a very well defined helical fold that is subject to disruption with minor change to structure.
Schematic of Pipeline:
- Phase 1 — Discovery: UniProt → BLAST → PyMOL
- Phase 2 — Mutation Analysis: PyMOL → ESM2 → Heatmap → ESMFold → PyMOL
- Phase 3 — Synthesis: Codon Optimization → Twist Bioscience
- Phase 4 — Plasmid Design: Benchling → Review Gate
- Phase 5 — Execution: Opentrons OT-2
Review feedback: Will likely encounter overlapping frames, and will visualize in Benchling.
C. John’s Brainstorming Notes
Computational Goals:
- Align reads to MG1655 & call SNPs/indels (Bowtie2/Mpileup/BCFtools)
- Codon-optimize and synthesize L gene variants
- Error-prone PCR mutagenesis to generate L mutant libraries
Proposal — Proposed tools:
- Input: Paired-end Illumina reads (250 bp) from mutant and parental strain genomic DNA; Reference: MG1655 (E. coli K-12, accession NC_000913.3)
- Quality Control: FastQC — raw read quality assessment; Trimmomatic or Fastp — adapter trimming, low-quality base removal
- Alignment: Bowtie2 — short-read alignment to reference; SAMtools — convert SAM → BAM, sort, index
- Variant Calling: SAMtools Mpileup — pileup of aligned reads per base position; BCFtools call — generate VCF files; Filter: QUAL score >100, present in mutant but absent in parental strain
- Annotation: SnpEff or ANNOVAR — annotate variants with gene names, amino acid changes, functional impact
- Visualization: IGV (Integrative Genomics Viewer) — manual inspection of called variants at loci of interest
- Environment: Linux/bash, conda for dependency management; Galaxy platform (cpt.tamu.edu/galaxy-pub)
- Output: Ranked list of candidate causal mutations unique to mutants (e.g., dnaJ P330Q)

Major sub-problem the tools solve: The core challenge is distinguishing a true causal mutation from background noise in a mutagenized genome.
- Bowtie2 handles short-read alignment efficiently against a well-annotated reference, minimizing misalignment artifacts
- Mpileup/BCFtools applies statistical models to distinguish true variants from sequencing errors
- QUAL >100 filtering + parental subtraction eliminates pre-existing polymorphisms
- SnpEff immediately translates nucleotide changes into amino acid consequences
Potential Pitfalls:
- Sibling contamination
- Reference bias
D. Albert’s Notes
Goals: Increase the L protein structural stability to improve lysis efficiency. It’s a small membrane protein that disrupts the inner E. coli membrane during phage infection.
Pipeline:
- Get protein sequence from UniProt; Run BLAST to find homologs across phage strains; Run Clustal Omega to identify hot spots for mutations
- Run ESM2 to identify mutations and where we can mutate without affecting structural stability; Keep mutations that don’t disrupt the protein structure
- Run the mutations through ESMFold to predict structure and filter for stability
- Rank the candidates by stability (pLDDT) improvements over the UniProt sequence
- Run top candidates through AlphaFold-Multimer to confirm the mutations don’t affect the interaction between E. coli DnaJ
- Take the top candidates and run them through the wet lab
Pipeline diagram:
What tools are we using and why?
ESM2 allows us to run stochastic gradient descent on how stable our protein sequences are likely to be and what evolution considers normal.
ESMFold provides us with a pLDDT value for structural confidence and together we can automate mutation screening before hitting the wet lab.
Clustal Omega provides us with positions on the phage strain that we should not change in order to further preserve structural stability.
Pitfalls: L protein is a membrane protein and might not be as well represented in ESM2 training data and the PDB so we might have less reliable outputs. Our folding models aren’t taking into account lipid membranes so we might have issues with modeling the interaction. Our stability estimates might also be inaccurate as the delta between mutations may be too small to rank them accurately.
E. Tehseen’s Brainstorming Notes
Systematic Tuning of the N-Terminal Regulatory Domain
Goal: Enhance and regulate the toxicity of the MS2 bacteriophage L lysis protein by systematically modifying its N-terminal domain. Instead of removing this region, identify the minimal regulatory segment needed for precise control of lysis timing and activity.
Background and rationale
The L protein, a 75-amino acid membrane-bound lysis protein, is responsible for killing E. coli during infection. Studies show that its N-terminal domain (~first 30–40 amino acids) is not required for lysis; truncation mutants (Lodj variants) lacking this region still lyse cells, often faster. This indicates the N-terminus acts as a regulatory brake to delay lysis and support viral replication.
Hypothesis
The regulatory function of the N-terminal domain in lysis is influenced by its length and charge characteristics. It is proposed that:
- Partial truncations may incrementally diminish inhibitory effects and subsequently enhance lysis efficiency
- The regulatory activity appears to be localised to a distinct sub-region rather than to the entire N-terminal domain
- There is likely an optimal truncation point that achieves a balance between increased toxicity and maintenance of protein stability
Proposed Computational Pipeline:
- Sequence Retrieval: Obtain the L protein sequence from UniProt.
- Structural and Residue Analysis: Visualize the N-terminal domain using PyMOL to identify hydrophilic and cationic residues.
- In Silico Mutagenesis: Use ESM2/ESMFold to predict the effect of substitutions that increase cationicity, focusing on residues facing the cytoplasm or periplasm.
- Stability Check: Compare predicted mutants’ folding and stability using ESMFold and pLDDT scores.
- Interaction Analysis: Optional AlphaFold-Multimer predictions to confirm L interaction with DnaJ or other host factors is preserved.
- Prioritization: Generate a heatmap of mutants ranked by predicted lysis enhancement and structural stability.
- Codon Optimization & Synthesis: Prepare selected mutants for experimental validation.
Expected Outcomes: Increased electrostatic interaction with target host proteins; tunable lysis timing while preserving N-terminal regulatory functions; generation of mutant library for wet lab testing of lytic efficiency.
Potential Pitfalls: Excessive cationic mutations could cause nonspecific aggregation or mislocalization. Predictions may differ from experimental results.
F. Group Meeting Notes (3/24)
- 10, 20, 30, 40 base pairs (changes)
- Overlapping frames?
- Pipeline approach: each person picks a tool to explore in depth, then come back and review/align on results
Tuesday — met to discuss current state:
- What is the dependency outside of L-protein standalone?
- What is the multi-frame dependency when engineering a plasmid?
- L-protein is the focus — engineer
- Refer to WEEK 5 Lab Resources for L-Protein
- Reminder to post new questions/topics in Genspace Discourse Forum for knowledge sharing, TA support
- Follow-up: met with John, identified focus area — IGV (Integrative Genomics Viewer) for manual inspection of called variants at loci of interest
- ES: located some initial ChimeraX visualizations — will post images
Wednesday 3/25 — explore sequence in silico individually
Thursday 3/26 — pick a high probability option
Friday 3/27 — model in Benchling and Asimov Kernel
Saturday 3/28 — (TBD)
Sunday 3/29 — Final summary. By EOD Sunday 3/29, publish here. Please post personal pipeline visualizations/notes under your brainstorm section.
Status Update: Friday, March 27th
Eric’s Final Summary Notes: On 3/26 I did a “deep dive” into the remaining project scope, decided to focus on the identification of an amino acid substitution that would support our hypothesis around the N-1 Terminus region.
Primary request: Please review, and if you agree, or want to add/change anything, feel free to annotate with comments. Once we have consensus, we can submit the markdown file as our final “group project”.
References
- Bernhardt TG, Roof WD, Young R (2002). The Escherichia coli FKBP-type PPIase SlyD is required for the stabilization of the phage PhiX174 lysis protein E. Mol Microbiol. PMC5446614.
- Chamakura KR, Young R (2019). Phage single-gene lysis: how it works and why it matters. Future Microbiol. PMC5775895.
- Lin DL et al. (2023). Structural insights into MS2 lysis protein L and its interaction with DnaJ. PMC10688784.
- Schilling T, et al. (2023). Engineering bacteriophage lysis proteins for enhanced activity. PubMed 36608652.
- Lin YW, et al. (2017). MS2 lysis protein L: a glycoprotein tethered to the membrane by a single transmembrane segment. PMC5446614.
- Lin DL, Leick M, Young R (2017). Lysis protein gene products specifically inhibit phage-mediated bacterial cell lysis. PMC5775895.
- UniProt P03609: LYS_BPMS2 MS2 lysis protein. https://www.uniprot.org/uniprotkb/P03609
- UniProt P08622: DNAJ_ECOLI E. coli DnaJ chaperone. https://www.uniprot.org/uniprotkb/P08622
- Lin YW et al. ESM2 protein language models. Meta AI 2023.
- Jumper J et al. AlphaFold2. Nature 2021. DOI: 10.1038/s41586-021-03819-2.
- Dauparas J et al. ProteinMPNN. Science 2022. DOI: 10.1126/science.add2187.
HTGAA 2026 — MS2 L Protein Group Project
Computational pipeline developed in collaboration with group members Eric, Albert, Tehseen, and John