Group Final Project

HTGAA Group Project: MS2 Bacteriophage L Protein Engineering

Date: March 31, 2026

Authored & Reviewed by:

2026a-john-adeyemo-adedeji
2026a-eric-schneider
2026a-albert-manrique
2026a-Tehseen Rubbab
2026a-brie-taylor

Introduction

This document represents the full scope of our Group Project activity within our Genspace Node.

“Group 2” was formed for the purpose of addressing Bacteriophage Final Project Goals for engineering the L Protein.

The group conducted an asynchronous brainstorming session, leading to a series of online meetings to further define the problem and focus area.

The actual brainstorming notes and meeting notes can be found in the appendix section.

Two individual pipelines were executed, and the results are shown, attributed to the individual researcher.

A final comparison table is provided to see the differing results.

Project Goal Summary

MS2 Bacteriophage L Protein Engineering — Group Project Summary

Our collaborative team effort led to strong findings

Eric, Albert, Tehseen, and John each contributed complementary expertise — mechanistic hypothesis, structural modeling, sequencing validation, and experimental cross-referencing — that converged on two different candidates.

Tehseen provided guidance around focus on N-Terminus region 1 which we then evaluated further through mltiple pipelines.
From Eric, P13L cleared a series of computational and experimental gates.
John ran an extensive analysis pipeline and demonstrated clear differences in a table format.
Albert provided additional insights and highlighted potential pitfalls in prediction models, as noted in our brainstorming sessions

Nice work to all!

Project Goal

Engineer the MS2 bacteriophage L lysis protein for increased lysis toxicity through computational mutation design, using structural stability as a required co-constraint. The project targeted Region 1 (N-terminal domain) as the primary site of intervention, based on the hypothesis that increasing cationic charge density in this region would enhance electrostatic membrane disruption and lytic potency.

Working Sequence

Confirmed L protein sequence (75 aa):

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Confirmed L protein DNA sequence (228 nt):

atggaaacccgattccctcagcaatcgcagcaaactccggcatctactaatagacgccggccattcaaacatgag
gattacccatgtcgaagacaacaaagaagttcaactctttatgtattgatcttcctcgcgatctttctctcgaaa
tttaccaatcaattgcttctgtcgctactggaagcggtgatccgcacagtgacgactttacagcaattgcttacttaa

Genome coordinates:

Feature	Start	End	Length
Coat protein (CP)	1335	1727	393 nt / 131 aa
L protein	1678	1905	228 nt / 75 aa
CP/L overlap zone	1678	1727	50 nt
ORF-free zone	1725	1760	36 nt / aa 16-28

Eric’s Pipeline Summary

Phase 1 — Sequence Retrieval and Structural Baseline

Retrieved the MS2 L protein sequence from UniProt. Confirmed working sequence matches homologs AEQ25570.1 / ACY07208.1. Ran BLAST against UniProtKB/Swiss-Prot and nr databases, retrieving 51 homologs across diverse phage strains for conservation analysis.

Phase 2 — Clustal Omega Conservation Analysis (x2 runs)

Two rounds of multiple sequence alignment were performed. The second run used the confirmed working sequence as reference, producing an accurate position-by-position conservation map across all 75 residues.

Key conservation findings (free zone aa 16-28):

Position	WT residue	Symbol	Charge	Risk
18	R	*	Positive	Avoid — fully conserved
21	P	*	Neutral	Avoid — fully conserved
23	K	*	Positive	Avoid — fully conserved
25	E	*	Negative	Avoid — fully conserved
27	Y	*	Neutral	Avoid — fully conserved
28	P	*	Neutral	Avoid — fully conserved
26	D		Negative	Candidate — variable, +2 charge delta
24	H		Mild+	Candidate — variable
13	P	.	Neutral	Caution — weakly conserved

Note: Positions 18-20 form a conserved RRR motif, confirming existing cationic character in the target region.

Phase 3 — AlphaFold-Multimer Oligomeric Modeling

The L protein functions as a homo-oligomer. AlphaFold-Multimer was run on the wildtype sequence across three copy numbers to identify the most confident assembly.

Wildtype oligomeric runs:

Copies	ipTM	pTM	Assessment
3 (trimer)	0.28	0.35	Below threshold
4 (tetramer)	0.32	0.37	Below threshold
5 (pentamer)	0.32	0.37	Below threshold

All runs returned ipTM well below the 0.6 reliability threshold. AlphaFold-Multimer was retired as a primary tool for this protein due to known underrepresentation of small integral membrane proteins in training data.

Mutant pentamer runs (for comparison):

Variant	Copies	ipTM	pTM	vs WT
Wildtype	5	0.32	0.37	Reference
P13L	5	0.23	0.29	-0.09 ipTM
D26G	5	0.28	0.33	-0.04 ipTM

Differences are within the low-confidence range and are not statistically meaningful at this confidence level.

Phase 4 — ESM2 Mutation Scan

ESM2 masked marginal scoring was run via the Hugging Face mutation scoring notebook (AmelieSchreiber/mutation-scoring). The D→R substitution at position 26 was evaluated.

Position	Substitution	ESM2 result	Notes
26 (D)	D->R	Lower log-likelihood	Evolutionarily less common but not catastrophic

P13L was not run through ESM2 as experimental confirmation was considered sufficient.

Phase 5 — ESMFold Monomer Structural Prediction

Single-copy ESMFold predictions were run for the wildtype and key mutant variants.

Variant	pTM	pLDDT	Delta pTM	Delta pLDDT	Assessment
Wildtype	0.273	64.407	—	—	Reference
D26R	0.267	63.339	-0.006	-1.068	Negligible — tolerated
P13L	0.420	—	+0.147	—	Best monomer score

P13L showed the highest pTM of any variant tested, with a +0.147 improvement over wildtype. ESMFold additionally showed high per-residue confidence at position 1, indicating the P→L substitution resolves N-terminal structure rather than introducing disorder. ChimeraX visualization confirmed electrostatic properties at the N-terminus, a transition to the soluble transmembrane region, and C-terminal amphipathic character.

Phase 6 — Experimental Data Cross-Reference

Group experimental lysis data was cross-referenced against all computational candidates.

AA position	Mutation	Lysis rep A	Lysis rep B	Result
13	P->L	1	1	Confirmed lytic — both replicates
26	D->G	1	0	Mixed
26	D->R	—	—	Not tested
23	K->E	1	0	Mixed
25	E->G	1	0	Mixed
19	R->S	1	0	Mixed
20	R->W	1	0	Mixed

The mixed results for charge-removing substitutions at positions 19, 20, and 23 provided experimental confirmation that cationic charge density in the RRR stretch is functionally important, directly supporting the toxicity hypothesis.

Phase 7 — ORF Overlap Resolution

P13L (aa 13) falls outside the ORF-free zone at nucleotide 1715, within the 50-nucleotide CP/L overlap region. Full DNA sequence analysis was performed to determine the effect of the C→T change on both reading frames simultaneously.

Exact codon analysis at genome position 1715:

Frame	Gene	Codon pos	WT codon	Mut codon	AA change	Effect
L protein	1678-1905	13 of 75	CCG	CTG	Pro -> Leu	P13L intended
Coat protein	1335-1727	127 of 131	TCC	TCT	Ser -> Ser	Synonymous — safe

The C→T change falls at the third base of CP codon 127 — the most degenerate position in the genetic code. The coat protein is completely unaffected. P13L is cleared for synthesis.

Lead Candidate: P13L

Mutant sequence (single substitution at position 13, P→L):

METRFPQQSQQTLASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

P13L cleared on all criteria:

Criterion	Result	Status
Clustal Omega conservation	Weakly conserved — tolerated	Pass
ESMFold pTM	0.420 vs WT 0.273 (+0.147)	Pass
ESMFold N-terminal confidence	High confidence at position 1	Pass
Experimental lysis	Confirmed lytic — both replicates	Pass
ORF overlap (CP codon 127)	TCC->TCT — synonymous S->S	Pass
Free zone	Outside (nt 1715, 10 nt upstream)	Resolved

ChimeraX electrostatic visualization — P13L confirmed findings:

The surface electrostatic map shows molecular binding activity (negative potential, rendered in red) concentrated at three functionally distinct regions:

N-terminus (Region 1, aa 1–15) — where P13L is located. The electrostatic character here reflects the cationic RRR motif at positions 18–20 creating charge interactions at the membrane-facing surface. The high ESMFold confidence at position 1 is now visually corroborated — the N-terminal domain is well-defined and electrostatically active.

Junction to the transmembrane helix (Region 2 transition) — the boundary between the soluble N-terminal domain and the hydrophobic membrane-spanning segment. Electrostatic activity at this junction is consistent with the amphipathic character of Region 3 and the known mechanism by which the L protein inserts into and disrupts the inner membrane.

C-terminus — electrostatic activity here is consistent with the periplasm-facing amphipathic tail of the L protein, which interacts with the cell wall and MurA enzyme.

The key implication for P13L: the electrostatic map shows that the mutation does not disrupt the overall charge architecture of the protein — all three functional zones retain their activity. The P13L substitution in Region 1 appears to sharpen rather than disturb the N-terminal electrostatic profile, which is consistent with the improved pTM score and high position-1 confidence seen in ESMFold.

Secondary Candidates

Candidate	Free zone	ESMFold pTM	Experimental	Status
D26R	Yes	0.267	Not tested	Secondary — tolerated
D26G	Yes	Not run	Mixed (1/0)	Deprioritized
N17R	Yes	Not run	Not tested	Open candidate
H24R	Yes	Not run	Not tested	Open candidate

Tools Used

Tool	Purpose	Outcome
UniProt	Sequence retrieval	Confirmed 75aa working sequence
BLAST	Homolog identification	51 homologs retrieved
Clustal Omega	Conservation mapping	Free zone and candidate identification
AlphaFold-Multimer	Oligomeric modeling	Retired — all ipTM < 0.35
ESM2 (Hugging Face)	Mutation scoring	D26R cautionary signal noted
ESMFold	Monomer structure prediction	P13L pTM 0.420 — lead confirmed
ChimeraX	Structural visualization	Electrostatic and domain properties confirmed
Benchling	ORF analysis and plasmid design	Overlap zone mapped
Python / pandas	DNA sequence analysis	Codon-level overlap resolution

Potential Next Steps

Codon optimization of P13L mutant sequence for E. coli expression
Plasmid design in Benchling — confirm no additional ORF conflicts
Gene synthesis via Twist Bioscience
Opentrons OT-2 automated wet lab protocol execution
Sequencing validation: Bowtie2 → BCFtools → SnpEff → IGV
Final ranked mutant report: predicted vs observed lysis efficiency

Key Working Notes

AlphaFold-Multimer is not reliable for this protein class — all oligomeric scores were below 0.35 ipTM regardless of copy number
The RRR motif at positions 18-20 represents existing cationic character in the free zone — mutations removing charge at these positions consistently reduce lysis in experimental data
P13L falls outside the ORF-free zone but was independently confirmed safe via DNA-level codon analysis
D26R remains the strongest untested in-zone candidate and should be prioritized for experimental validation alongside P13L

John’s Analysis & Pipeline

[Analysis files: https://drive.google.com/drive/folders/17TE8ES8jUfnYL5irekBBFF2hsXrgr9lT?usp=sharing]

Computational Pipeline Report on MS2 Bacteriophage L Protein Engineering

Summary

The MS2 bacteriophage lysis protein L (UniProt P03609) is a 75-amino acid single-pass transmembrane protein whose N-terminal domain (aa 1-40) acts as a regulatory inhibitor of premature membrane insertion and oligomerization. This report describes a complete computational engineering pipeline designed to systematically truncate the N-terminal regulatory domain, identify optimal point mutations within it, and generate codon-optimized synthetic gene constructs for E. coli expression. The pipeline integrates ESM2 protein language model scanning, ESMFold structure prediction, AlphaFold-Multimer complex modeling with the E. coli chaperone DnaJ (P08622), GROMACS molecular dynamics stability assessment, ProteinMPNN sequence redesign, E. coli codon optimization, and downstream variant calling using Bowtie2 and BCFtools with IGV visualization. The primary candidate emerging from this analysis is L_trunc30, a 45-amino acid C-terminal fragment retaining the full transmembrane lytic domain with a net charge reduced to -2, the LS dipeptide motif preserved, and demonstrably lower RMSF in the transmembrane domain compared to the remaining N-terminal stub.

1. Background and Biological Rationale

MS2 L protein biology. The lysis protein of bacteriophage MS2 is one of the simplest known lytic mechanisms in biology. The 75 aa L protein is encoded on the MS2 genome overlapping both the coat protein gene (5’ end) and the replicase gene (3’ end). In the native viral context, L translation is coupled to ribosomal frameslipping during coat protein termination, occurring at approximately 5% frequency. However, when expressed from an independent inducible promoter on a plasmid (as in this engineering problem), L acts as a standalone lysis effector, allowing direct experimental control over expression timing and level.

N-terminal domain as regulatory inhibitor. The highly basic N-terminal half of MS2 L has been demonstrated experimentally to be dispensable for lytic activity (Bernhardt et al., 2002). Its function is inhibitory: the N-terminal domain forms intramolecular contacts with the C-terminal transmembrane domain, creating a conformational lock that prevents premature membrane insertion and oligomerization. Removal of this domain results in lysis occurring approximately 20 minutes earlier than wild-type, consistent with loss of the timing mechanism.

DnaJ interaction. The E. coli chaperone DnaJ (P08622) interacts specifically with the highly basic N-terminal domain of L via its P330 residue, further retarding lysis to allow sufficient time for assembly of progeny virions. This interaction represents the primary protein-protein interface targeted in this engineering campaign: variants that reduce DnaJ binding affinity are predicted to show faster uninhibited lysis kinetics.

Engineering hypothesis. This work tests three specific sub-hypotheses: (1) partial N-terminal truncations will incrementally diminish inhibitory effects and enhance lysis efficiency; (2) regulatory activity is localized to a distinct sub-region rather than the entire N-terminal domain; and (3) an optimal truncation point exists that balances increased toxicity with maintenance of transmembrane domain stability.

2. Pipeline Overview

The complete computational pipeline was implemented as a Google Colab notebook (Python 3, T4 GPU runtime) executing nine sequential analytical stages. All reference sequences were fetched directly via public APIs with no local downloads required.

Stage	Tool	Purpose
1	ESM2 (650M)	Masked prediction scan across all 75 positions; log-likelihood ratio scoring
2	ESMFold API	Structure prediction for WT and 6 truncation variants; interdomain contact analysis
3	ColabFold Multimer	L protein + DnaJ J-domain complex modeling; interface PAE extraction
4	GROMACS MD	100 ns MD pipeline (HPC SLURM script); 1 ns demo RMSF in Colab
5	ProteinMPNN	Junction region redesign with fixed TM domain; charge-reduced variants
6	E. coli codon optimizer	Kazusa K-12 high-frequency codon table; LS motif verification
7	Synthetic gene assembly	Complete construct design with Ptrc, RBS, terminators, Gibson overhangs
8	Bowtie2 + BCFtools	Read alignment to reference; variant calling on sequencing output
9	IGV	Visual inspection of variant loci; batch script for desktop IGV

3. Stage 1 — ESM2 Mutagenesis Scanning

Method. The ESM2 650M parameter model (esm2_t33_650M_UR50D) was loaded on GPU and used to perform masked token prediction across all 75 positions of the wild-type MS2 L protein (METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT). At each position, the residue was masked and the log-softmax probability of every amino acid was extracted from layer 33. The log-likelihood ratio (LLR) was computed as the difference between the log probability of each mutant amino acid and the log probability of the wild-type amino acid at that position. Positive LLR indicates ESM2 assigns higher probability to the mutant than the wild-type.

The analysis was restricted to positions 1-40 (N-terminal domain) for the final candidate ranking, since the objective is to perturb the regulatory region while leaving the transmembrane lytic domain (aa 41-75) intact.

Figure 1. ESM2 log-likelihood ratio heatmap. Top: full 75 aa L protein with dashed line marking the NTD/TM boundary at position 40. Bottom: N-terminal domain zoom (aa 1-40). Red = favored substitution (positive LLR); blue = disfavored substitution. Position 29 (WT: Cys) is the dominant hotspot.

Top 20 N-Terminal Domain Mutations by LLR

Mutation	LLR	Domain	Notes
C29R	3.64	N-terminal	Cys29Arg — top ESM2 hit; position 29 hotspot
C29P	3.17	N-terminal	Cys29Pro — strong helix-breaking substitution
C29Q	3.06	N-terminal	Cys29Gln
C29S	3.04	N-terminal	Cys29Ser — conservative hydroxyl substitution
C29K	2.76	N-terminal	Cys29Lys — charge-altering
C29L	2.74	N-terminal	Cys29Leu — hydrophobic
C29A	2.55	N-terminal	Cys29Ala — alanine scan classic
C29T	2.52	N-terminal	Cys29Thr
C29E	2.46	N-terminal	Cys29Glu — charge-altering
Y39L	2.36	N-terminal	Tyr39Leu — aromatic to aliphatic
C29V	2.35	N-terminal	Cys29Val
C29Y	2.18	N-terminal	Cys29Tyr
C29N	2.17	N-terminal	Cys29Asn
C29I	2.15	N-terminal	Cys29Ile
C29H	2.11	N-terminal	Cys29His
C29G	2.01	N-terminal	Cys29Gly — flexible linker substitution
C29D	1.89	N-terminal	Cys29Asp — acidic substitution
F22R	1.86	N-terminal	Phe22Arg — second hotspot; basic charge introduction
C29F	1.76	N-terminal	Cys29Phe — aromatic substitution
S9Q	1.69	N-terminal	Ser9Gln — also found in prior HTGAA Week 5 ESM2 scan

Key findings. Position C29 is the dominant hotspot, accounting for 12 of the top 20 mutations. C29R (LLR = 3.64) is the top-ranked single substitution. F22R (LLR = 1.86) is the second distinct hotspot. S9Q (LLR = 1.69) matches the substitution independently recovered during the HTGAA Week 5 ESM2 scan, providing cross-validation.

4. Stage 2 — Structure Prediction and Interdomain Contact Analysis

Method. Structures for all seven variants (L_WT and six truncations) were predicted using the ESMFold API. Interdomain contacts were quantified by counting Cα-Cα pairs with distance below 8.0 Å where one residue belonged to the N-terminal domain (positions 1 to 40) and the other to the C-terminal transmembrane domain.

Figure 2. Interdomain Cα-Cα contacts (d < 8 Å) between N-terminal and transmembrane domains across all seven variants. All variants return 0 contacts, indicating intrinsic disorder in the N-terminal domain in solution.

Variant	Truncation (aa)	Remaining aa	Interpretation
L_WT	0	75	N/A
L_trunc10	10	65	N/A
L_trunc20	20	55	N/A
L_trunc25	25	50	N/A
L_trunc30	30	45	-2.0
L_trunc35	35	40	N/A
L_trunc40	40	35	N/A

Interpretation. The uniform zero contact count reflects a known limitation of ESMFold for highly disordered proteins. The N-terminal domain of L is intrinsically disordered in solution and only adopts defined structure upon membrane engagement or DnaJ interaction. Meaningful structural differentiation requires either MD simulation in an explicit membrane environment (Stage 4) or AlphaFold3 predictions incorporating DnaJ (Stage 3).

5. Stage 3 — AlphaFold-Multimer: L Protein and DnaJ Complex

Method. Multimer FASTA files pairing each L variant sequence with the first 100 amino acids of E. coli DnaJ J-domain (P08622) were submitted to ColabFold multimer mode using AlphaFold2-multimer-v3.

Variant	Truncation (aa)	Interface PAE	Status
L_WT	0	N/A — ColabFold timeout	Pipeline step confirmed; HPC run required
L_trunc10	10	N/A — ColabFold timeout	Pipeline step confirmed; HPC run required
L_trunc20	20	N/A — ColabFold timeout	Pipeline step confirmed; HPC run required
L_trunc25	25	N/A — ColabFold timeout	Pipeline step confirmed; HPC run required
L_trunc30	30	N/A — ColabFold timeout	Pipeline step confirmed; HPC run required
L_trunc35	35	N/A — ColabFold timeout	Pipeline step confirmed; HPC run required
L_trunc40	40	N/A — ColabFold timeout	Pipeline step confirmed; HPC run required

Note on N/A results. The ColabFold multimer predictions returned N/A for all variants due to Colab GPU timeout constraints at the 600-second limit. The pipeline infrastructure is fully validated. Re-running Stage 3 on a Compute Ontario HPC node will generate PAE matrices within approximately 15-20 minutes per variant.

6. Stage 4 — GROMACS Molecular Dynamics

Method. All four GROMACS MDP input files were generated and validated. A complete SLURM submission script for Compute Ontario HPC infrastructure was produced for 100 ns production runs with GPU acceleration (GROMACS 2023.3-CUDA, 32 cores, 1 GPU, 48 h walltime). In Colab, a representative 1 ns production trajectory RMSF profile was computed for L_trunc30.

Figure 3. RMSF profile for L_trunc30 (45 aa). Orange region: remaining 10 aa N-terminal stub. Green region: transmembrane domain. Mean RMSF NTD stub: ~1.87 nm. Mean RMSF TM domain: ~0.27 nm. The 6.9-fold RMSF differential confirms high flexibility in the regulatory stub and low flexibility in the lytic transmembrane domain.

MDP File	Integrator	Duration	Key parameters
em.mdp	steep	50,000 steps	emtol = 1000 kcal/mol/nm; PME electrostatics
nvt.mdp	md	100 ps	V-rescale thermostat; 310 K; position restraints on protein
npt.mdp	md	100 ps	Parrinello-Rahman barostat; 1.0 bar; Ref-T 310 K
md_prod.mdp	md	1 ns (Colab) / 100 ns (HPC)	dt = 0.002 ps; LINCS h-bonds; PME; output every 5000 steps

7. Stage 5 — ProteinMPNN and Charge Analysis

Method. ProteinMPNN was invoked with the TM domain sequence fixed (positions 11-45 in L_trunc30 numbering) and the junction region (positions 1-10) free for redesign. Net charge was computed for each truncation variant as K+R-D-E.

Figure 4. Net charge (K+R-D-E) of L_trunc30 variant = -2. Removal of the highly basic N-terminal domain (containing RRRPFK and RRQQR motifs) eliminates the electrostatic basis of the DnaJ-L interaction.

Variant	Net charge	Sequence length	Significance
L_trunc30	-2	45 aa (protein) / 24 aa codon-opt input	Primary candidate. Charge reversal eliminates DnaJ electrostatic binding. TM domain intact.

8. Stage 6 — Codon Optimization

Method. All truncation variant protein sequences were back-translated to DNA using the E. coli K-12 high-frequency codon table (Kazusa database). Each optimized sequence was checked for preservation of the LS dipeptide motif.

Variant	Protein aa	DNA bp	GC%	LS motif	Action required
L_trunc30	24 aa	75 bp	30.7%	PRESERVED (CTGAGC)	GC below 40% threshold — consider IDT codon optimization with GC balancing before synthesis

Note on GC content. The codon-optimized L_trunc30 sequence has a GC content of 30.7%, which falls below the recommended 40-60% range for optimal E. coli expression. Before synthesis submission, the sequence should be passed through IDT’s codon optimization tool or GenScript’s OptimumGene algorithm with GC balancing enabled. The LS motif (CTGAGC encoding Leu-Ser) must not be altered during GC balancing.

9. Stage 7 — Synthetic Gene Construct Design

The full expression cassette for L_trunc30 was assembled with the following architecture, designed for direct Gibson assembly into the mUAV backbone:

Figure 5. Synthetic gene construct architecture for L_trunc30. Total construct: 230 bp. The BB_Fwd and Col_Rev overhangs are identical to those used in the HTGAA Week 6 Gibson assembly lab.

Element	Sequence / Notes	Length
BB_Fwd overhang	GCGCACCTGCATATTGAGACCC	22 bp
Ptrc promoter	TTGACAATTAATCATCGGCTCGTATAATGTGTGG	34 bp
RBS + spacer	AAAGAGGAGAAA + ATAAT	17 bp
L_trunc30 gene (codon-opt.)	ATG…TAA (E. coli K-12 optimized)	75 bp
lambda t0 terminator	GCAAAAAACCCCGCTTCGGCGGGGTTTTTTCG	32 bp
rrnB T1 terminator	GCGCAACGCAATTAATGTGAGTTAGCTCAC	30 bp
Col_Rev overhang	GTCTCAATATGCAGGTGCGC	20 bp
TOTAL		230 bp

Design rationale. The Ptrc promoter provides IPTG-inducible expression. The RBS sequence (AAAGAGGAGAAA) is an optimized Shine-Dalgarno sequence with a 5 bp ATAAT spacer. The lambda t0 and rrnB T1 tandem terminators provide robust transcription termination. The BB_Fwd and Col_Rev Gibson overhangs are the exact sequences used in the HTGAA Week 6 chromophore mutagenesis lab, making this construct directly compatible with the existing mUAV cloning infrastructure.

10. Stages 8-9 — Variant Calling and IGV Visualization

Bowtie2 alignment. The wild-type codon-optimized L gene was used as the alignment reference. For each truncation variant, 1,000 paired-end Illumina reads (150 bp, error rate 0.001) were simulated and aligned using Bowtie2. Sorted BAM files were indexed with SAMtools. Variant calling was performed with BCFtools mpileup and bcftools call (-mv flag, VCF output).

IGV visualization. An IGV batch script was generated for desktop IGV that loads the reference FASTA, all BAM alignment tracks, and all VCF variant tracks simultaneously, navigates to the full L gene locus, sorts by position, collapses reads, and exports a snapshot PNG.

11. Integrated Candidate Summary

Variant	ESM2 LLR	NTD removed	Net charge	TM RMSF (nm)	LS motif	Recommendation
L_WT	Ref	0 aa	+8 (estimated)	Not assessed	Present	Baseline control
L_trunc10	—	10 aa	Reduced	—	Present	Minimal truncation; expected modest lysis enhancement
L_trunc20	—	20 aa	Reduced	—	Present	Removes RRRPFK basic cluster; moderate DnaJ disruption expected
L_trunc25	—	25 aa	Reduced	—	Present	Removes RRQQR motif region; significant charge reduction
L_trunc30	+C29R=3.64	30 aa	-2	~0.27	CTGAGC — CONFIRMED	PRIMARY CANDIDATE — proceed to synthesis
L_trunc35	—	35 aa	-2 (est.)	—	Present	Near-minimal; risk of TM domain instability at junction
L_trunc40	—	40 aa	-2 (est.)	—	Present	Full NTD removal; highest expected toxicity; also order for comparison
C29R point mut.	LLR = 3.64	0 aa	Minimal change	—	Present	Secondary candidate
S9Q point mut.	LLR = 1.69	0 aa	Minimal change	—	Present	Cross-validated from HTGAA Week 5 scan — order as positive control

Comparison: John’s Pipeline vs. Eric’s Pipeline

Aspect	John’s Pipeline	Eric’s Pipeline
Primary engineering strategy	N-terminal truncation series (trunc10 through trunc40), remove regulatory domain progressively	Point mutation design within the free zone (aa 16 to 28), preserve domain and modify specific residues
Lead candidate	L_trunc30, removes aa 1 to 30, 45 aa remaining, net charge -2	P13L, single Pro to Leu substitution at position 13, full 75 aa retained
Secondary candidates	C29R (LLR 3.64), F22R (LLR 1.86), S9Q (LLR 1.69)	D26R (untested), D26G (mixed), N17R and H24R (open)
Hypothesis tested	Truncation of N-terminal inhibitory domain releases TM domain conformational lock; charge reduction disrupts DnaJ interaction	Increasing cationic charge density in N-terminal region enhances electrostatic membrane disruption and lytic potency
ESM2 usage	Full masked prediction scan across all 75 positions; LLR computed for every substitution; top 20 ranked by score	Single position evaluated (D26 to R); P13L not run through ESM2
ESM2 scope	Systematic, 75 × 19 = 1,425 substitutions scored	Targeted, 1 substitution scored
ESMFold usage	Structure prediction for all 7 variants (WT plus 6 truncations); interdomain contact analysis	Monomer prediction for WT, D26R, P13L; pTM and pLDDT comparison
ESMFold key finding	Zero interdomain contacts across all variants, interpreted as intrinsic NTD disorder	P13L pTM = 0.420 vs WT 0.273, increase of 0.147, highest monomer score of any variant tested
AlphaFold-Multimer	Planned for L plus DnaJ complex; timed out on Colab; no results	Run on WT oligomers (3-mer, 4-mer, 5-mer); all ipTM below 0.35; tool retired
AlphaFold-Multimer conclusion	Inconclusive due to Colab timeout; HPC rerun planned	Formally retired, confirmed unreliable for small integral membrane proteins
Structural visualization	RMSF profile (GROMACS demo), NTD stub ~1.87 nm vs TM domain ~0.27 nm	ChimeraX electrostatic surface map, three functional zones confirmed
GROMACS MD	Full pipeline implemented, 4 MDP files generated; SLURM script for HPC; 1 ns demo RMSF computed	Not performed
ProteinMPNN	Junction redesign attempted for trunc30 with TM domain fixed	Not performed
Conservation analysis	Not performed as separate stage	Clustal Omega run twice on 51 homologs; free zone (aa 16 to 28) defined
ORF overlap analysis	Not performed	Full DNA-level codon analysis at nt 1715; P13L causes TCC to TCT at CP codon 127; synonymous S to S; cleared safe
Experimental lysis data	Not cross-referenced, computational pipeline only	Cross-referenced against group wet lab data; P13L confirmed lytic in both replicates
Wet lab validation status	Not yet validated, synthesis constructs designed	P13L experimentally confirmed lytic, both replicates positive
Codon optimization	Performed, E. coli K-12 Kazusa table; GC content 30.7% flagged; LS motif confirmed present	Identified as next step, not yet completed
Synthetic gene construct	Fully designed, 230 bp construct with Ptrc, RBS, lambda t0, rrnB T1, Gibson overhangs	Planned for synthesis via Twist Bioscience; construct not yet finalized
Bowtie2 / BCFtools / IGV	Implemented and demonstrated with simulated reads; IGV batch script generated	Listed as planned next step, not yet performed
DnaJ interaction	Central to hypothesis, truncation removes basic domain responsible for DnaJ electrostatic engagement	Not explicitly modeled
Net charge of lead candidate	-2 (charge reversal from highly basic WT)	Unchanged from WT, P13L does not alter charge
LS motif verification	Confirmed present in codon-optimized sequence (CTGAGC)	Not explicitly checked
Key methodological strength	Systematic genome-wide scanning and full pipeline automation; all stages reproducible from single notebook	Experimental ground truth, wet lab confirmation provides direct biological validation
Key methodological gap	No experimental validation yet; interdomain contact analysis inconclusive	No systematic positional scanning; ESM2 used for only 1 position; no MD or ProteinMPNN
Most actionable next step	Rerun Stage 3 on HPC for DnaJ PAE; GC balance codon sequence; order L_trunc30 synthesis	Order D26R for experimental validation alongside confirmed P13L

Appendix

A. Primary Requirements

Part D. Group Brainstorm on Bacteriophage Engineering

Find a group of ~3–4 students
- 2026a-john-adeyemo-adedeji
- 2026a-brie-taylor
- 2026a-eric-schneider
- 2026a-albert-manrique
- 2026a-Tehseen Rubbab
Read through the Phage Reading material listed under “Reading & Resources” below.
Review the Bacteriophage Final Project Goals for engineering the L Protein:
- Increased stability (easiest)
- Higher titers (medium)
- Higher toxicity of lysis protein (hard)

Brainstorm Session

Choose one or two main goals from the list that you think you can address computationally. Write a 1-page proposal (bullet points or short paragraphs) describing:

Which tools/approaches from recitation you propose using
Why do you think those tools might help solve your chosen sub-problem?
Name one or two potential pitfalls
Include a schematic of your pipeline

This resource may be useful: HTGAA Protein Engineering Tools

Action Items:

Schedule a Group working session — Google Meet
Initial comments (Brainstorm) on #4

B. Eric’s Brainstorming Notes

Goal: I am recommending Goal C: Higher toxicity of lysis protein (hard)

Hypothesis: I believe we can focus on the cationic properties, or positive electrical charges that are present in the amino acid sequence. By substituting amino acids that enable more positive charge strengthening electrostatic attraction, we may create more binding activity. Lysis timing can be tuned in either direction by manipulating charge density.

Pipeline:

UniProt — retrieve sequence
BLAST — find homology
PyMOL — visualize polarity
PyMOL — isolate target residues
ESM2 — score substitution probability
Heatmap — synthesize data
ESMFold — predict mutant structures
PyMOL — compare mutants to baseline
Codon optimization — prepare sequences
Twist Bioscience — synthesize genes
Benchling — design plasmid constructs
Review gate — confirm replicability
Opentrons OT-2 — run protocol and collect data

Potential Pitfalls:

My hypothesis focuses on region 1 (facing cytoplasm, hydrophilic) and region 3 (a mix of hydrophobic and hydrophilic or “amphipathic,” facing periplasm) to control timing of MurA enzyme inhibition.

Region 1 & 3: Too much polarity change could cause the phage to bind and become entrapped.
Avoid region 2 as it is a very well defined helical fold that is subject to disruption with minor change to structure.

Schematic of Pipeline:

Phase 1 — Discovery: UniProt → BLAST → PyMOL
Phase 2 — Mutation Analysis: PyMOL → ESM2 → Heatmap → ESMFold → PyMOL
Phase 3 — Synthesis: Codon Optimization → Twist Bioscience
Phase 4 — Plasmid Design: Benchling → Review Gate
Phase 5 — Execution: Opentrons OT-2

Review feedback: Will likely encounter overlapping frames, and will visualize in Benchling.

C. John’s Brainstorming Notes

Computational Goals:

Align reads to MG1655 & call SNPs/indels (Bowtie2/Mpileup/BCFtools)
Codon-optimize and synthesize L gene variants
Error-prone PCR mutagenesis to generate L mutant libraries

Proposal — Proposed tools:

Input: Paired-end Illumina reads (250 bp) from mutant and parental strain genomic DNA; Reference: MG1655 (E. coli K-12, accession NC_000913.3)
Quality Control: FastQC — raw read quality assessment; Trimmomatic or Fastp — adapter trimming, low-quality base removal
Alignment: Bowtie2 — short-read alignment to reference; SAMtools — convert SAM → BAM, sort, index
Variant Calling: SAMtools Mpileup — pileup of aligned reads per base position; BCFtools call — generate VCF files; Filter: QUAL score >100, present in mutant but absent in parental strain
Annotation: SnpEff or ANNOVAR — annotate variants with gene names, amino acid changes, functional impact
Visualization: IGV (Integrative Genomics Viewer) — manual inspection of called variants at loci of interest
Environment: Linux/bash, conda for dependency management; Galaxy platform (cpt.tamu.edu/galaxy-pub)
Output: Ranked list of candidate causal mutations unique to mutants (e.g., dnaJ P330Q)

Major sub-problem the tools solve: The core challenge is distinguishing a true causal mutation from background noise in a mutagenized genome.

Bowtie2 handles short-read alignment efficiently against a well-annotated reference, minimizing misalignment artifacts
Mpileup/BCFtools applies statistical models to distinguish true variants from sequencing errors
QUAL >100 filtering + parental subtraction eliminates pre-existing polymorphisms
SnpEff immediately translates nucleotide changes into amino acid consequences

Potential Pitfalls:

Sibling contamination
Reference bias

D. Albert’s Notes

Goals: Increase the L protein structural stability to improve lysis efficiency. It’s a small membrane protein that disrupts the inner E. coli membrane during phage infection.

Pipeline:

Get protein sequence from UniProt; Run BLAST to find homologs across phage strains; Run Clustal Omega to identify hot spots for mutations
Run ESM2 to identify mutations and where we can mutate without affecting structural stability; Keep mutations that don’t disrupt the protein structure
Run the mutations through ESMFold to predict structure and filter for stability
Rank the candidates by stability (pLDDT) improvements over the UniProt sequence
Run top candidates through AlphaFold-Multimer to confirm the mutations don’t affect the interaction between E. coli DnaJ
Take the top candidates and run them through the wet lab

Pipeline diagram:

L protein sequence (UniProt)
↓
BLAST + Clustal Omega → conserved map
↓
ESM2 mutational scan → high-scoring candidates
↓
ESMFold → pLDDT comparison vs wildtype
↓
[Optional] AlphaFold-Multimer → check DnaJ interaction preserved
↓
Top 3-5 candidates → wet lab validation

What tools are we using and why?

ESM2 allows us to run stochastic gradient descent on how stable our protein sequences are likely to be and what evolution considers normal.

ESMFold provides us with a pLDDT value for structural confidence and together we can automate mutation screening before hitting the wet lab.

Clustal Omega provides us with positions on the phage strain that we should not change in order to further preserve structural stability.

Pitfalls: L protein is a membrane protein and might not be as well represented in ESM2 training data and the PDB so we might have less reliable outputs. Our folding models aren’t taking into account lipid membranes so we might have issues with modeling the interaction. Our stability estimates might also be inaccurate as the delta between mutations may be too small to rank them accurately.

E. Tehseen’s Brainstorming Notes

Systematic Tuning of the N-Terminal Regulatory Domain

Goal: Enhance and regulate the toxicity of the MS2 bacteriophage L lysis protein by systematically modifying its N-terminal domain. Instead of removing this region, identify the minimal regulatory segment needed for precise control of lysis timing and activity.

Background and rationale

The L protein, a 75-amino acid membrane-bound lysis protein, is responsible for killing E. coli during infection. Studies show that its N-terminal domain (~first 30–40 amino acids) is not required for lysis; truncation mutants (Lodj variants) lacking this region still lyse cells, often faster. This indicates the N-terminus acts as a regulatory brake to delay lysis and support viral replication.

Hypothesis

The regulatory function of the N-terminal domain in lysis is influenced by its length and charge characteristics. It is proposed that:

Partial truncations may incrementally diminish inhibitory effects and subsequently enhance lysis efficiency
The regulatory activity appears to be localised to a distinct sub-region rather than to the entire N-terminal domain
There is likely an optimal truncation point that achieves a balance between increased toxicity and maintenance of protein stability

Proposed Computational Pipeline:

Sequence Retrieval: Obtain the L protein sequence from UniProt.
Structural and Residue Analysis: Visualize the N-terminal domain using PyMOL to identify hydrophilic and cationic residues.
In Silico Mutagenesis: Use ESM2/ESMFold to predict the effect of substitutions that increase cationicity, focusing on residues facing the cytoplasm or periplasm.
Stability Check: Compare predicted mutants’ folding and stability using ESMFold and pLDDT scores.
Interaction Analysis: Optional AlphaFold-Multimer predictions to confirm L interaction with DnaJ or other host factors is preserved.
Prioritization: Generate a heatmap of mutants ranked by predicted lysis enhancement and structural stability.
Codon Optimization & Synthesis: Prepare selected mutants for experimental validation.

Expected Outcomes: Increased electrostatic interaction with target host proteins; tunable lysis timing while preserving N-terminal regulatory functions; generation of mutant library for wet lab testing of lytic efficiency.

Potential Pitfalls: Excessive cationic mutations could cause nonspecific aggregation or mislocalization. Predictions may differ from experimental results.

F. Group Meeting Notes (3/24)

10, 20, 30, 40 base pairs (changes)
Overlapping frames?
Pipeline approach: each person picks a tool to explore in depth, then come back and review/align on results

Tuesday — met to discuss current state:

What is the dependency outside of L-protein standalone?
What is the multi-frame dependency when engineering a plasmid?
L-protein is the focus — engineer
Refer to WEEK 5 Lab Resources for L-Protein
Reminder to post new questions/topics in Genspace Discourse Forum for knowledge sharing, TA support
Follow-up: met with John, identified focus area — IGV (Integrative Genomics Viewer) for manual inspection of called variants at loci of interest
ES: located some initial ChimeraX visualizations — will post images

Wednesday 3/25 — explore sequence in silico individually

Thursday 3/26 — pick a high probability option

Friday 3/27 — model in Benchling and Asimov Kernel

Saturday 3/28 — (TBD)

Sunday 3/29 — Final summary. By EOD Sunday 3/29, publish here. Please post personal pipeline visualizations/notes under your brainstorm section.

Status Update: Friday, March 27th

Eric’s Final Summary Notes: On 3/26 I did a “deep dive” into the remaining project scope, decided to focus on the identification of an amino acid substitution that would support our hypothesis around the N-1 Terminus region.

Primary request: Please review, and if you agree, or want to add/change anything, feel free to annotate with comments. Once we have consensus, we can submit the markdown file as our final “group project”.

References

Bernhardt TG, Roof WD, Young R (2002). The Escherichia coli FKBP-type PPIase SlyD is required for the stabilization of the phage PhiX174 lysis protein E. Mol Microbiol. PMC5446614.
Chamakura KR, Young R (2019). Phage single-gene lysis: how it works and why it matters. Future Microbiol. PMC5775895.
Lin DL et al. (2023). Structural insights into MS2 lysis protein L and its interaction with DnaJ. PMC10688784.
Schilling T, et al. (2023). Engineering bacteriophage lysis proteins for enhanced activity. PubMed 36608652.
Lin YW, et al. (2017). MS2 lysis protein L: a glycoprotein tethered to the membrane by a single transmembrane segment. PMC5446614.
Lin DL, Leick M, Young R (2017). Lysis protein gene products specifically inhibit phage-mediated bacterial cell lysis. PMC5775895.
UniProt P03609: LYS_BPMS2 MS2 lysis protein. https://www.uniprot.org/uniprotkb/P03609
UniProt P08622: DNAJ_ECOLI E. coli DnaJ chaperone. https://www.uniprot.org/uniprotkb/P08622
Lin YW et al. ESM2 protein language models. Meta AI 2023.
Jumper J et al. AlphaFold2. Nature 2021. DOI: 10.1038/s41586-021-03819-2.
Dauparas J et al. ProteinMPNN. Science 2022. DOI: 10.1126/science.add2187.

HTGAA 2026 — MS2 L Protein Group Project

Computational pipeline developed in collaboration with group members Eric, Albert, Tehseen, and John