Group Final Project

Phage Therapy

Background: The Antibiotic Resistance Crisis and Phage Therapy

Antibiotic resistance is one of the most urgent threats to global health. On current trends, antimicrobial-resistant infections are projected to cause deaths on a scale comparable to cancer by 2050 (O’Neill Report, 2016). The overuse and misuse of broad-spectrum antibiotics has accelerated the selection of resistant bacterial strains, while the pipeline for novel antibiotics has nearly run dry. A compelling alternative is phage therapy: the therapeutic use of bacteriophages (phages) to target and kill pathogenic bacteria.

Phages are highly specific: they typically infect only a single species, and sometimes only a single strain, leaving the rest of the microbiome intact. This precision is a major advantage over antibiotics, which disrupt the commensal microbiota alongside the pathogen. The clinical promise of phage therapy has been dramatically illustrated by the case of Tom Patterson, whose multidrug-resistant Acinetobacter baumannii infection was ultimately resolved only after cocktails of phages selected for activity against his bacterial isolate were administered (Schooley et al., 2017).

However, a critical limitation emerged in that case and others: bacteria can acquire resistance to phages rapidly, often within days. When Patterson’s bacterial population evolved resistance, a new phage cocktail had to be assembled. This highlights the need for proactive phage engineering: designing phages that anticipate likely resistance routes before bacterial counter-evolution occurs.

This project focuses on MS2 bacteriophage, a well-characterised RNA phage that infects Escherichia coli via the F-pilus, and specifically on engineering its lysis protein L to improve MS2’s ability to kill E. coli even as the host acquires resistance.

The MS2 Bacteriophage and Its Lysis Protein

MS2 is one of the simplest known viruses, with a single-stranded RNA genome encoding only four proteins:

The maturation protein (A)
The coat protein
The lysis protein (L)
The replicase (rep)

The phage infects E. coli by attaching to the F-pilus on the host cell surface, after which its RNA genome enters the cell. The viral RNA is translated by host ribosomes, producing the phage proteins, including coat protein and replicase. After replication and capsid assembly, the lysis protein triggers destruction of the bacterial cell wall, releasing approximately 10,000 new phage particles per lysed cell.

The lysis protein L is a 75-amino acid, predominantly hydrophobic protein that is thought to oligomerise and insert into the host inner membrane, forming pores that disrupt membrane integrity and ultimately cause osmotic lysis (Chamakura et al., 2017). Its exact mechanism remains incompletely understood, but two things are established:

L depends on the host chaperone DnaJ. Chamakura et al. (2017, PMC5446614) showed that E. coli strains with a mutated dnaJ gene are resistant to L-mediated lysis, which suggests that L cannot reach a lysis-competent state efficiently without DnaJ assistance.
Lysis-defective mutations cluster in the transmembrane (TM) domain and the C-terminal region of L, suggesting these regions are essential for membrane integration and pore formation (Chamakura & Young, 2018).

These observations define the two principal vulnerabilities that bacterial resistance exploits, and hence the two engineering targets for this project.
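The hydrophobic TM segment described above can be located directly from the sequence with a sliding-window hydropathy scan. The sketch below is illustrative only and is not part of the project pipeline: it uses the standard Kyte-Doolittle scale, and the 11-residue window and the names `hydropathy_profile` and `peak_center` are assumptions made for this example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

WT_L = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFT"
        "NQLLLSLLEAVIRTVTTLQQLLT")


def hydropathy_profile(seq, window=11):
    """Return {centre residue (1-based): mean hydropathy over the window}."""
    half = window // 2
    profile = {}
    for center in range(half, len(seq) - half):
        segment = seq[center - half:center + half + 1]
        profile[center + 1] = sum(KD[aa] for aa in segment) / window
    return profile


profile = hydropathy_profile(WT_L)
# Residue at the centre of the most hydrophobic window.
peak_center = max(profile, key=profile.get)
print(peak_center, round(profile[peak_center], 2))
```

With this window size the most hydrophobic stretch is centred inside the ~37–52 segment that the text treats as the TM domain, while the N-terminal half scores as strongly hydrophilic.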

Engineering Goals

We selected two complementary engineering goals for the MS2 lysis protein L:

Goal 1 — Increased Stability (primary): Stabilise L so it remains functional across a wider range of expression conditions and temperatures. A more stable L is less susceptible to premature proteolytic degradation before it can reach the membrane, improving the reproducibility and efficiency of lysis. This goal is also directly relevant to Stage 4 of the group pipeline, where L’s structural integrity is tested using the Nuclera cell-free expression system.
Goal 2 — Resistance to DnaJ-Dependent Inhibition (secondary): Engineer L variants that either:
1. Tighten the L–DnaJ interaction to compensate for partially impaired DnaJ mutants.
2. Reduce L’s dependence on DnaJ altogether, allowing lysis even in E. coli strains that have evolved DnaJ mutations as a resistance mechanism.

This directly addresses the primary route of bacterial resistance identified by Chamakura et al. (2017). These goals are mechanistically coupled: a more stable L is less likely to be prematurely degraded before it can recruit DnaJ, and a redesigned L–DnaJ interface can amplify the lytic effect once L is membrane-inserted.

Computational Pipeline

Step 1 — In Silico Deep Mutational Scan (ESM2)

We used the ESM2 protein language model (650M parameter version; Lin et al., 2023) to compute a zero-shot deep mutational scan of the full 75-amino acid L sequence. For every possible single-point substitution, ESM2 assigns a log-likelihood score reflecting evolutionary tolerance — high scores indicate mutations likely to be structurally or functionally neutral, while very low scores flag mutations that disrupt folding or function.

This produced a 75 × 20 mutational fitness landscape at zero experimental cost. Consistent with the literature, the ESM2 scan was expected to show low tolerance for mutations in the TM domain (residues ~37–52) and C-terminal region, which are essential for membrane integration (Chamakura & Young, 2018). Candidate stabilising substitutions were drawn from positions in the disordered N-terminal region that showed elevated ESM2 scores under alternative amino acids.
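The scoring arithmetic behind such a scan can be sketched as follows, assuming that per-position log-probabilities over the 20 amino acids have already been obtained from the language model. The model call itself is deliberately omitted; `score_substitutions`, `toy_log_probs`, and the toy probabilities are hypothetical and stand in for real ESM2 output. Each substitution is scored as log p(mutant) minus log p(wild-type) at that position, which is one common convention and may differ from the exact score used in the project.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def score_substitutions(wt_seq, log_probs):
    """Return {mutation: log-likelihood ratio} for every single substitution.

    log_probs[i] maps each amino acid to its log-probability at position i
    (0-based), as a protein language model would supply (assumed input).
    """
    scores = {}
    for i, wt_aa in enumerate(wt_seq):
        for aa in AMINO_ACIDS:
            if aa == wt_aa:
                continue  # the wild-type residue is the reference, not a mutation
            name = f"{wt_aa}{i + 1}{aa}"  # 1-based label, e.g. "S35K"
            scores[name] = log_probs[i][aa] - log_probs[i][wt_aa]
    return scores


def toy_log_probs(wt_seq, p_wt=0.62):
    """Hypothetical stand-in for model output: wild-type gets p_wt, rest uniform."""
    p_other = (1.0 - p_wt) / (len(AMINO_ACIDS) - 1)
    return [
        {aa: math.log(p_wt if aa == wt_aa else p_other) for aa in AMINO_ACIDS}
        for wt_aa in wt_seq
    ]


wt = "METRF"  # first five residues of L, for illustration only
scan = score_substitutions(wt, toy_log_probs(wt))
print(len(scan))  # number of scored substitutions (positions x 19)
```

Scores near zero mark substitutions the model considers about as likely as the wild-type residue; strongly negative scores mark substitutions it considers unlikely.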

Step 2 — Structural Prediction and Inverse Folding (ESMFold + ProteinMPNN)

The wild-type L sequence was folded using ESMFold to generate a predicted 3D structure, with per-residue pLDDT confidence scores used as a proxy for local disorder. The TM helix (residues ~37–52) consistently showed high pLDDT, consistent with it being a structurally ordered region.

ProteinMPNN inverse folding was then applied: the backbone geometry of the WT L structure was fixed, and ProteinMPNN proposed alternative sequences likely to pack into the same fold with improved stability. This is particularly informative for the TM region, where ProteinMPNN can suggest hydrophobic substitutions that improve membrane anchoring without altering helix geometry. Candidate sequences were filtered by:

pLDDT > 70 across the TM domain
RMSD < 1.5 Å versus wild-type backbone
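The two filters above can be expressed as a short predicate. This is a sketch under stated assumptions: "pLDDT > 70 across the TM domain" is read here as every TM residue exceeding 70 (a mean-based reading is equally plausible), and the `Candidate` class, its fields, and the toy candidates are hypothetical rather than taken from the project.

```python
from dataclasses import dataclass

TM_START, TM_END = 37, 52  # approximate TM residues quoted in the text (1-based)


@dataclass
class Candidate:
    name: str
    plddt: list        # per-residue pLDDT, index 0 = residue 1
    rmsd_to_wt: float  # backbone RMSD versus wild-type, in angstroms


def passes_filters(c, min_plddt=70.0, max_rmsd=1.5):
    """True if every TM residue exceeds min_plddt and RMSD is below max_rmsd."""
    tm_scores = c.plddt[TM_START - 1:TM_END]
    return min(tm_scores) > min_plddt and c.rmsd_to_wt < max_rmsd


# Toy 75-residue pLDDT profiles (made-up numbers, not real predictions).
good_profile = [60.0] * 36 + [85.0] * 16 + [65.0] * 23
weak_tm_profile = [60.0] * 36 + [85.0] * 15 + [68.0] + [65.0] * 23

candidates = [
    Candidate("toy_pass", good_profile, 0.9),
    Candidate("toy_low_tm", weak_tm_profile, 0.9),
    Candidate("toy_high_rmsd", good_profile, 2.1),
]
kept = [c.name for c in candidates if passes_filters(c)]
print(kept)
```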

Step 3 — Interaction Modelling (AlphaFold-Multimer)

For the top stability candidates, we modelled the L–DnaJ complex using AlphaFold-Multimer (Evans et al., 2022). DnaJ (UniProt P08622; PDB: 1BQZ) is well-characterised. We compared interface predicted aligned error (PAE) scores and estimated binding energy ($\Delta\Delta G$, computed via FoldX after AF2 modelling) between WT L and the redesigned variants.

Variants showing simultaneously improved pLDDT (stability) and reduced interface PAE (tighter or maintained DnaJ interaction) were prioritised as candidates for experimental validation.
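One simple way to reduce a PAE matrix to a single interface number is to average the inter-chain blocks. The sketch below assumes the PAE matrix is available as a square list of lists over the concatenated L and DnaJ chains; `mean_interface_pae` and the toy matrix are hypothetical, and averaging over the whole inter-chain block (rather than only residues in contact) is a simplification made for this example.

```python
def mean_interface_pae(pae, len_a):
    """Average PAE over the two off-diagonal (inter-chain) blocks.

    pae is a square matrix over the concatenated chains; the first len_a
    rows and columns belong to chain A, the remainder to chain B.
    """
    n = len(pae)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if (i < len_a) != (j < len_a):  # residue pair spans both chains
                total += pae[i][j]
                count += 1
    return total / count


# Toy complex: chain A has 2 residues, chain B has 2 residues (made-up values).
toy_pae = [
    [1, 1, 8, 8],
    [1, 1, 8, 8],
    [6, 6, 1, 1],
    [6, 6, 1, 1],
]
print(mean_interface_pae(toy_pae, len_a=2))
```

Comparing this value between wild-type and variant complexes gives the kind of "interface PAE" difference the text refers to; lower values indicate higher predicted confidence in the relative placement of the two chains.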

Step 4 — Random Mutagenesis (Complementary Screen)

In parallel with the structure-guided design, we implemented random mutagenesis to generate combinatorial variants outside the hypothesis-driven search space. This approach was guided by the mutational tolerance map generated in Step 1: only residue positions with ESM2 scores above a permissive threshold were included in the random mutation pool, preventing the random screen from exploring lysis-inactivating territory.
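A constrained random screen of this kind can be sketched as below. The set of permissive positions is a placeholder chosen for illustration, not the project's actual ESM2 output, and `random_double_mutant` is a hypothetical helper rather than the function the group used.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

WT_L = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFT"
        "NQLLLSLLEAVIRTVTTLQQLLT")

# Placeholder permissive positions (1-based); assumed, not derived from ESM2.
PERMISSIVE = {5, 9, 12, 16, 31, 35, 60, 63, 67, 71}


def random_double_mutant(wt_seq, permissive_positions, rng):
    """Mutate two distinct permissive positions to random non-wild-type residues."""
    positions = sorted(rng.sample(sorted(permissive_positions), 2))
    seq = list(wt_seq)
    labels = []
    for pos in positions:
        wt_aa = seq[pos - 1]
        new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != wt_aa])
        seq[pos - 1] = new_aa
        labels.append(f"{wt_aa}{pos}{new_aa}")
    return ", ".join(labels), "".join(seq)


rng = random.Random(0)  # fixed seed so the draw is reproducible
for _ in range(3):
    label, seq = random_double_mutant(WT_L, PERMISSIVE, rng)
    print(label, seq)
```

Passing the random generator in explicitly keeps the screen reproducible, which matters when the same variant list has to be regenerated for synthesis orders.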

Step 5 — Ranking and Selection

Final ranking followed a composite score:

$$\text{Score} = w_1 \times \Delta\text{ESM2\_loglik} + w_2 \times \Delta\text{pLDDT} + w_3 \times \Delta\text{interface\_PAE\_improvement}$$

where each $\Delta$ term is the variant value relative to wild-type L, and the weights $w_1$, $w_2$, and $w_3$ were tuned to balance sequence novelty against structural confidence. The top 5 variants were taken forward for synthesis and experimental validation.
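The ranking step amounts to a weighted sum followed by a sort. The sketch below uses unit weights and made-up delta values purely for illustration; the source does not give the actual weights or any normalisation of the three terms, so `composite_score`, `rank_variants`, and the toy inputs are all assumptions.

```python
def composite_score(delta_esm2, delta_plddt, delta_pae_improvement,
                    w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of the three delta terms (each relative to wild-type)."""
    return w1 * delta_esm2 + w2 * delta_plddt + w3 * delta_pae_improvement


def rank_variants(variants, top_n=5, **weights):
    """Return the names of the top_n variants, best score first."""
    scored = [
        (composite_score(*deltas, **weights), name)
        for name, deltas in variants.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]


# Toy inputs: (delta ESM2 log-likelihood, delta pLDDT, interface PAE improvement).
toy_variants = {
    "toy_A": (-0.5, 2.0, 1.0),
    "toy_B": (-2.0, 0.5, 0.0),
    "toy_C": (0.1, -1.0, 3.0),
}
print(rank_variants(toy_variants, top_n=2))
print(rank_variants(toy_variants, top_n=2, w3=2.0))
```

Because the three terms live on different scales (log-likelihood units, pLDDT points, and angstroms of PAE), the choice of weights effectively sets how they are normalised, which is why the ordering can change when a single weight is adjusted.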

L-Protein Mutant Variants

Using the random mutagenesis function constrained by the ESM2 mutational landscape, we generated five double-mutant variants of the MS2 L protein.

Wild-Type 75-aa L Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Each variant carries two point mutations selected from permissive positions identified by the ESM2 scan.

Variant 1: S35K, Q71L

Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRKSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLLQLLT
Rationale: S35 sits immediately upstream of the transmembrane helix. Introducing a lysine at this position may strengthen membrane tethering through electrostatic interaction with negatively charged phospholipid headgroups, a mechanism observed in other membrane-inserting peptides (von Heijne, 1989). Q71L substitutes a polar glutamine with a hydrophobic leucine in the C-terminal region, potentially increasing the hydrophobic moment of the C-terminus and improving membrane association. Together, these two flanking mutations aim to enhance membrane insertion efficiency without disrupting the core TM domain.
Predicted Impact: Increased membrane affinity; potentially reduced DnaJ-dependence if membrane insertion becomes more spontaneous.

Variant 2: F47I, L44D

Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFDAIILSKFTNQLLLSLLEAVIRTVTTLQQLLT
Rationale: Both mutations fall within the TM helix (residues ~37–52). F47 is a large aromatic residue; the F47I substitution reduces steric bulk, potentially allowing tighter helix packing. L44D introduces an aspartate into the hydrophobic core of the TM helix. A charged residue in a TM segment can serve as a pore-lining residue in channel proteins (White & Wimley, 1999), and may alter pore geometry to accelerate membrane disruption. This variant was co-folded with DnaJ using AlphaFold-Multimer, and the resulting PAE map showed low predicted aligned error at the L–DnaJ interface, suggesting that the DnaJ interaction is maintained or improved despite the TM mutations.
Predicted Impact: Modified pore geometry; maintained DnaJ interaction (supported by the AF2-Multimer PAE map, not yet experimentally confirmed). This variant was prioritised for experimental follow-up.

Variant 3: V63I, V67I

Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAIIRTITTLQQLLT
Rationale: V63 and V67 are both valine residues in the C-terminal amphipathic region. Conservative isoleucine substitutions (V→I) increase side-chain volume by a single methylene group, improving van der Waals packing without introducing steric clashes. This is a classical strategy for thermal stabilisation of hydrophobic cores in membrane proteins (Pace et al., 2011). The double V→I substitution is predicted to increase the thermal melting temperature of the C-terminal region by ~1–2 °C.
Predicted Impact: Improved thermostability; useful for Stage 4 Nuclera testing where cell-free expression under variable temperature conditions is used to assess structural integrity.

Variant 4: R31K, F43P

Sequence: METRFPQQSQQTPASTNRRRPFKHEDYPCRKQQRSSTLYVLIPLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT
Rationale: R31K is a conservative charge-preserving substitution in the N-terminal region, replacing the guanidinium side chain of arginine with the lysine $\epsilon$-amino group, potentially reducing electrostatic repulsion between adjacent positive charges in the polybasic N-terminal stretch. F43P introduces a proline within the N-terminal half of the TM helix (residues ~37–52). Prolines act as helix-breakers and introduce rigid kinks that can control the angle of membrane insertion. This mutation is predicted to alter the TM helix tilt angle and potentially reduce the DnaJ interaction requirement by promoting a more autonomous membrane-insertion geometry.
Predicted Impact: Altered TM helix tilt angle; potentially reduced DnaJ-dependence for membrane insertion.

Variant 5: F5N, L60C

Sequence: METRNPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLCEAVIRTVTTLQQLLT
Rationale: F5N replaces a hydrophobic phenylalanine with polar asparagine in the extreme N-terminal region, increasing the hydrophilic character of the N-terminus and potentially improving solubility during translation and DnaJ recruitment. L60C introduces a cysteine in the post-TM region. A cysteine here could in principle form stabilising contacts, for example a disulfide with the native C29 or with the same position in a neighbouring L protomer, but this is speculative and depends on the redox environment.
Predicted Impact: Enhanced solubility and potential cysteine-mediated stabilisation of the C-terminal region; to be validated by Nuclera cell-free expression.
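The five variant sequences listed above can be checked mechanically against their mutation labels. The sketch below applies each pair of substitutions to the wild-type sequence and compares the result with the sequence given in the text; `apply_mutations` is a hypothetical helper written for this check.

```python
import re

WT_L = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"


def apply_mutations(wt_seq, mutations):
    """Apply substitutions such as 'S35K' (1-based) and return the new sequence."""
    seq = list(wt_seq)
    for mut in mutations:
        m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mut)
        if m is None:
            raise ValueError(f"unrecognised mutation: {mut}")
        wt_aa, pos, new_aa = m.group(1), int(m.group(2)), m.group(3)
        if seq[pos - 1] != wt_aa:
            raise ValueError(f"{mut}: residue {pos} is {seq[pos - 1]}, not {wt_aa}")
        seq[pos - 1] = new_aa
    return "".join(seq)


# Mutation labels and sequences exactly as listed for Variants 1 to 5.
VARIANTS = {
    "Variant 1": (["S35K", "Q71L"],
                  "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRKSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLLQLLT"),
    "Variant 2": (["F47I", "L44D"],
                  "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFDAIILSKFTNQLLLSLLEAVIRTVTTLQQLLT"),
    "Variant 3": (["V63I", "V67I"],
                  "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAIIRTITTLQQLLT"),
    "Variant 4": (["R31K", "F43P"],
                  "METRFPQQSQQTPASTNRRRPFKHEDYPCRKQQRSSTLYVLIPLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"),
    "Variant 5": (["F5N", "L60C"],
                  "METRNPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLCEAVIRTVTTLQQLLT"),
}

results = {
    name: apply_mutations(WT_L, muts) == listed
    for name, (muts, listed) in VARIANTS.items()
}
print(results)
```

Running a check like this before ordering synthesis catches off-by-one numbering errors, since the helper refuses any label whose stated wild-type residue does not match the sequence.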

AlphaFold-Multimer Analysis: Variant 2 × DnaJ

Variant 2 was selected as the priority candidate for AF2-Multimer co-folding based on its TM-domain mutations, which directly probe the interaction between the L protein’s membrane-spanning region and the DnaJ chaperone.

The predicted aligned error (PAE) matrix for the L(F47I, L44D)–DnaJ complex showed:

Low inter-chain PAE values at the predicted interface region, suggesting that DnaJ still recognises and binds Variant 2 despite the TM mutations.
The J-domain of DnaJ (residues 1–75 of DnaJ, including the conserved HPD motif) showed low PAE relative to the C-terminal region of L, which would place the predicted contact in the C-terminal part of L. This should be interpreted cautiously, because Chamakura et al. (2017) link DnaJ dependence to the basic N-terminal domain of L.

This result supports the hypothesis that mutations in the TM core do not abolish DnaJ recruitment, making Variant 2 a viable candidate for testing both modified pore geometry and maintained chaperone interaction.

Discussion: Connecting Computational Design to Experimental Validation

The five variants described above were generated by a hybrid strategy: ESM2-guided fitness landscape mapping defined the permissive mutation space, ProteinMPNN inverse folding proposed TM-stabilising sequences, and random combinatorial sampling constrained to permissive positions generated diverse double-mutants. This mirrors real-world directed evolution workflows, where computational pre-screening dramatically reduces the experimental search space before library construction.

The key open questions to be resolved in Stages 2–5 of the group pipeline are:

Stage 2 (Synthesis via Twist): The five mutant L gene sequences will be synthesised as codon-optimised synthetic genes. The codon optimisation step is non-trivial for an RNA phage: in the wild-type MS2 genome the L gene overlaps the end of the coat gene and the start of the replicase gene, so any variant that is later returned to the genomic context must be designed so that mutations affect only L and do not disrupt the overlapping coat or replicase sequences at the RNA level.
Stage 3 (Gibson Assembly): Mutant L genes will be cloned into a plasmid backbone downstream of an inducible promoter (e.g., pBAD or T7) for independent expression in E. coli, decoupled from the rest of the MS2 genome. This allows L’s toxicity to be assessed directly without confounding effects of phage replication.
Stage 4 (Nuclera Cell-Free Testing): The Nuclera eDrop system will be used to express L mutants in cell-free reactions and assess structural integrity. Variants 3 and 5, designed for improved thermostability and solubility respectively, are expected to show higher yields and more compact folding in cell-free conditions than the wild-type.
Stage 5 (E. coli Lysis Assay): The definitive test: each L variant will be expressed in E. coli (both wild-type DnaJ and DnaJ-mutant strains) and lysis will be quantified by $\mathrm{OD}_{600}$ kinetics and plaque assay. Variants 1 and 4, which are predicted to have reduced DnaJ-dependence, are expected to retain lytic activity in DnaJ-mutant E. coli, which would be a direct demonstration of engineered resistance-evasion.
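For the Stage 5 readout, lysis onset can be extracted from an OD600 time course with a simple threshold rule. The sketch below is one possible analysis, not the group's protocol: the 20% drop from the running peak is an arbitrary assumed threshold, and `lysis_onset` and both toy curves are hypothetical.

```python
def lysis_onset(times_min, od600, drop_fraction=0.2):
    """Return the first time at which OD600 has fallen by drop_fraction
    from its running peak, or None if no such drop occurs."""
    peak = od600[0]
    for t, od in zip(times_min, od600):
        peak = max(peak, od)
        if od <= peak * (1.0 - drop_fraction):
            return t
    return None


times = [0, 10, 20, 30, 40, 50, 60]  # minutes after induction (toy data)
od_lysing = [0.20, 0.25, 0.31, 0.38, 0.36, 0.27, 0.15]      # culture that lyses
od_non_lysing = [0.20, 0.25, 0.31, 0.38, 0.46, 0.55, 0.66]  # culture that keeps growing

print(lysis_onset(times, od_lysing))
print(lysis_onset(times, od_non_lysing))
```

Comparing onset times between wild-type DnaJ and DnaJ-mutant hosts for each variant would give a single number per condition, which is easier to tabulate than full growth curves.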

Integrating Emerging Phage Engineering Frameworks into MS2 L Protein Development

Three recently published phage engineering approaches inform the design strategy of this project and collectively define a computationally guided, cell-free-first development pipeline for MS2 L protein engineering.

The first is a simulation-first design paradigm, wherein AI-powered in silico modelling of phage-host interactions precedes any wet-lab execution. Translating this philosophy here, computational modelling of L protein variants (using structure prediction tools such as AlphaFold2 or ESMFold to assess transmembrane insertion geometry and membrane disruption propensity) can prioritise a ranked synthesis list before any physical construct is ordered. Given that the MS2 L protein spans only 75 amino acids and that single-residue changes can abolish or enhance lytic activity, computational pre-filtering directly reduces synthesis cost and iteration time, two practical constraints central to this project.

The second framework is PHEIGES (PHage Engineering by In vitro Gene Expression and Selection), which demonstrated that phage genome fragments expressed in E. coli cell-free transcription-translation (TXTL) systems produce functional outputs — including host-toxic products — without requiring full phage assembly or live bacterial passage. Adapting this logic, individual L protein variants can be expressed from linear DNA fragments in TXTL and screened for membrane disruption activity using OD-based lysis proxies or liposome dye-release assays. This decouples L protein functional validation from full MS2 viability, collapsing the screening cycle from days to hours and allowing higher-throughput variant assessment upstream of genome reconstruction.

The third is the High-Complexity Golden Gate Assembly (HC-GGA) system developed by Sikkema et al. (2026) for a Pseudomonas aeruginosa phiKMV-like phage, which achieved near-100% genotype recovery from 28 modular plasmid-held fragments without selectable markers. The MS2 genome at ~3.6 kb is far more tractable than the 43 kb 41S1 system, making a 4–5 fragment HC-GGA design straightforward. By isolating the L gene and its regulatory flanking sequences within a single dedicated fragment, every future variant becomes a single-fragment substitution dropped into a stable master mix — no counterselection engineering, no full re-synthesis. Together, these three frameworks define a unified funnel: computational variant design, cell-free functional screening, and modular genome assembly for high-fidelity phage rescue.