Projects


Individual Final Project

BioLight — Final Project Update

April 14, 2026 | HTGAA 2026 Individual Final Project


[Image: Final Slide]

Short Final Project Description

My final project develops a light-responsive genetic circuit in E. coli that expresses fluorescent protein, using LED light to map projected photographic images to a biological substrate on agar plates.

Custom-built LED exposure hardware controls the light dose delivered to the plate, activating the engineered biosensor so that high-resolution, wide-gamut images appear through protein expression in the transformed bacteria.

The resulting workflow will serve as a framework for community makerspace activities and a platform for ongoing optogenetic imaging research.


Project Aims

Aim 1 — Experimental

  • Engineer and validate a light-responsive fluorescent protein expression system in E. coli
  • Success measured by fidelity and tonal resolution of the expressed fluorescent image relative to the projected visual image

Aim 2 — Development

  • Translate the validated bio-circuit into an integrated imaging platform
  • Custom LED exposure hardware, 3D printed components, and software protocols
  • Connect analog light to digital tools, back to biological output
  • Explore how a cell-free system and automated lab production could increase productivity
  • Custom design and build of a light projection system, including:
    • Raspberry Pi 5 as the primary controller
    • LED light array for controlled blue light exposure
    • Wavelength sensor for real-time spectral verification
    • OpenCV machine vision algorithms for luminosity measurement
    • Environmental sensors including temperature monitoring
    • Cycle timer to regulate and automate exposure sequences

Aim 3 — Visionary

  • Establish a framework for experiential learning in synthetic biology within community makerspaces
  • Long-term extension into machine vision interpretation of biosensor expression patterns
  • LLM and neural network integration for image recognition and biosensor pattern analysis

Aim 1

Aim 1a — pBioLight x2 (primary)

pBioLight-1B-eLightOn-v1, designated pBioLight x2, is the primary construct for Aim 1a and the fastest path to a first image. It is a 2,201 bp circular single-plasmid system designed in Benchling and ordered via Twist Bioscience clonal gene synthesis in a pUC19 backbone with AmpR selection. The eLightOn system uses a LexA408 DNA-binding domain fused to RsLOV, a light-oxygen-voltage domain that undergoes a conformational change upon 450 nm blue light activation, releasing repression of the pColE408 promoter and driving sfGFP expression.

Circuit architecture

J23106 constitutive promoter → LexA408 DBD (P40A/N41S/A42S, codon-optimized) → RsLOV 176 aa (528 bp, codon-optimized) → KV linker → pColE408 promoter → BBa_B0034 RBS → sfGFP → rrnB T1/T2 terminators

Key properties

  • GC content: 48.98%
  • 2 ORFs confirmed, no direct repeats
  • Light activation: 450 nm blue light
  • Dynamic range: ~10,000×
  • No external reagents required — the system uses FMN, a molecule E. coli naturally produces, as its light-sensing cofactor. This simplifies the workflow compared to systems like CcaS/CcaR that require externally supplied chromophores.
  • Restriction cut sites flanking sfGFP enable future color swapping without redesigning the full circuit, supporting expansion toward wide-gamut multi-color biological imaging through Aim 2 and beyond

Appendix — Optogenetic Systems Evaluated

All systems below were evaluated for use in the BioLight platform. eLightOn was selected as the primary system for pBioLight x2. Systems marked with ★ remain viable parallel tracks.

| System | Light (nm) | Plasmids | Chromophore | Dynamic range | Complexity | Status |
|---|---|---|---|---|---|---|
| eLightOn | 450 blue | 1 | None (FMN) | ~10,000× | ★★ | Selected — pBioLight x2 |
| LEVI | 450 blue | 1 | None (FMN) | ~10,000× | ★★ | Deselected — equivalent dynamic range, less documented |
| pDawn | 450 blue | 1 | None | up to 460× | ★★ | Deselected — lower dynamic range |
| BLADE | 450 blue | 1 | None | ~100× | ★★ | Deselected — lower dynamic range |
| EL222 | 450 blue | 1 | None (FMN) | >100× | | Deselected — lower dynamic range |
| CcaS/CcaR ★ | 535/672 | 2 | PCB required | ~100× | ★★★ | Viable — Aim 1b parallel track |
| EL222→Bxb1→GFP ★ | 450 blue | 2 | None (FMN) | >100× | ★★★ | Viable — Aim 1c parallel track |
| pREDawn | 640/780 red | 2 | None (BV) | 100–200× | ★★★ | Deselected — red light spectral overlap risk |
| Cph8-OmpR | 650/740 red | 3 | PCB required | ~10× | ★★★★ | Deselected — high complexity, low dynamic range |

Images

[Image: BioLightX2 Plasmid Design] Light Responsive Plasmid Design (Asimov Schematic + Adobe Firefly + Gemini)

[Image: Light Projection Labware] Light Projection Labware (Gemini)


References

  • Li et al. 2020, Nucleic Acids Research 48(6):e33, doi:10.1093/nar/gkaa044 — eLightOn system
  • Levskaya et al. 2005, Nature 438, 441–442, doi:10.1038/nature04405
  • Jayaraman 2016, PMC5001607
  • Tabor Lab, Rice University — jtabor@rice.edu — pJT119b, pSR43.6r, pSR58.6 CcaS/CcaR optogenetic system
  • Addgene pJT119b #50551, pSR43.6r #63197, pSR58.6 #63176

Final Project-Abstract

[Image: Wordcloud]

Group Final Project

HTGAA Group Project: MS2 Bacteriophage L Protein Engineering

Date: March 31, 2026

Authored & Reviewed by:

  • 2026a-john-adeyemo-adedeji
  • 2026a-eric-schneider
  • 2026a-albert-manrique
  • 2026a-Tehseen Rubbab
  • 2026a-brie-taylor

Introduction

This document represents the full scope of our Group Project activity within our Genspace Node.

“Group 2” was formed for the purpose of addressing Bacteriophage Final Project Goals for engineering the L Protein.

The group conducted an asynchronous brainstorming session, leading to a series of online meetings to further define the problem and focus area.

The actual brainstorming notes and meeting notes can be found in the appendix section.

Two individual pipelines were executed, and the results are shown, attributed to the individual researcher.

A final comparison table is provided to see the differing results.


Project Goal Summary

MS2 Bacteriophage L Protein Engineering — Group Project Summary

Our collaborative team effort led to strong findings.

Eric, Albert, Tehseen, and John each contributed complementary expertise (mechanistic hypothesis, structural modeling, sequencing validation, and experimental cross-referencing) that converged on two different candidates.

  • Tehseen guided the group to focus on N-terminal Region 1, which we then evaluated further through multiple pipelines.

  • Eric’s pipeline showed that P13L cleared a series of computational and experimental gates.

  • John ran an extensive analysis pipeline and summarized the differences between variants in table form.

  • Albert provided additional insights and highlighted potential pitfalls in prediction models, as noted in our brainstorming sessions.

Nice work to all!

Project Goal

Engineer the MS2 bacteriophage L lysis protein for increased lysis toxicity through computational mutation design, using structural stability as a required co-constraint. The project targeted Region 1 (N-terminal domain) as the primary site of intervention, based on the hypothesis that increasing cationic charge density in this region would enhance electrostatic membrane disruption and lytic potency.

Working Sequence

Confirmed L protein sequence (75 aa):

METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

Confirmed L protein DNA sequence (228 nt):

atggaaacccgattccctcagcaatcgcagcaaactccggcatctactaatagacgccggccattcaaacatgag
gattacccatgtcgaagacaacaaagaagttcaactctttatgtattgatcttcctcgcgatctttctctcgaaa
tttaccaatcaattgcttctgtcgctactggaagcggtgatccgcacagtgacgactttacagcaattgcttacttaa

Genome coordinates:

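| Feature | Start | End | Length |
|---|---|---|---|
| Coat protein (CP) | 1335 | 1727 | 393 nt / 131 codons (130 aa + stop) |
| L protein | 1678 | 1905 | 228 nt / 76 codons (75 aa + stop) |
| CP/L overlap zone | 1678 | 1727 | 50 nt |
| ORF-free zone | 1725 | 1760 | 36 nt / aa 16-28 |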

Eric’s Pipeline Summary

Phase 1 — Sequence Retrieval and Structural Baseline

Retrieved the MS2 L protein sequence from UniProt. Confirmed working sequence matches homologs AEQ25570.1 / ACY07208.1. Ran BLAST against UniProtKB/Swiss-Prot and nr databases, retrieving 51 homologs across diverse phage strains for conservation analysis.

[Image: Phase 1 — BLAST homolog retrieval]

Phase 2 — Clustal Omega Conservation Analysis (x2 runs)

Two rounds of multiple sequence alignment were performed. The second run used the confirmed working sequence as reference, producing an accurate position-by-position conservation map across all 75 residues.

[Image: Phase 2 — Clustal Omega conservation alignment]

Key conservation findings (free zone aa 16-28):

| Position | WT residue | Symbol | Charge | Risk |
|---|---|---|---|---|
| 18 | R | * | Positive | Avoid — fully conserved |
| 21 | P | * | Neutral | Avoid — fully conserved |
| 23 | K | * | Positive | Avoid — fully conserved |
| 25 | E | * | Negative | Avoid — fully conserved |
| 27 | Y | * | Neutral | Avoid — fully conserved |
| 28 | P | * | Neutral | Avoid — fully conserved |
| 26 | D | | Negative | Candidate — variable, +2 charge delta |
| 24 | H | | Mild+ | Candidate — variable |
| 13 | P | . | Neutral | Caution — weakly conserved |

Note: Positions 18-20 form a conserved RRR motif, confirming existing cationic character in the target region.

Phase 3 — AlphaFold-Multimer Oligomeric Modeling

The L protein functions as a homo-oligomer. AlphaFold-Multimer was run on the wildtype sequence across three copy numbers to identify the most confident assembly.

Wildtype oligomeric runs:

| Copies | ipTM | pTM | Assessment |
|---|---|---|---|
| 3 (trimer) | 0.28 | 0.35 | Below threshold |
| 4 (tetramer) | 0.32 | 0.37 | Below threshold |
| 5 (pentamer) | 0.32 | 0.37 | Below threshold |

All runs returned ipTM well below the 0.6 reliability threshold. AlphaFold-Multimer was retired as a primary tool for this protein due to known underrepresentation of small integral membrane proteins in training data.

[Image: Phase 3 — AlphaFold-Multimer oligomeric modeling]

Mutant pentamer runs (for comparison):

| Variant | Copies | ipTM | pTM | vs WT |
|---|---|---|---|---|
| Wildtype | 5 | 0.32 | 0.37 | Reference |
| P13L | 5 | 0.23 | 0.29 | -0.09 ipTM |
| D26G | 5 | 0.28 | 0.33 | -0.04 ipTM |

Differences are within the low-confidence range and are not statistically meaningful at this confidence level.

Phase 4 — ESM2 Mutation Scan

ESM2 masked marginal scoring was run via the Hugging Face mutation scoring notebook (AmelieSchreiber/mutation-scoring). The D→R substitution at position 26 was evaluated.

[Image: Phase 4 — ESM2 mutation scan heatmap]

| Position | Substitution | ESM2 result | Notes |
|---|---|---|---|
| 26 (D) | D→R | Lower log-likelihood | Evolutionarily less common but not catastrophic |

P13L was not run through ESM2 as experimental confirmation was considered sufficient.

Phase 5 — ESMFold Monomer Structural Prediction

Single-copy ESMFold predictions were run for the wildtype and key mutant variants.

| Variant | pTM | pLDDT | Delta pTM | Delta pLDDT | Assessment |
|---|---|---|---|---|---|
| Wildtype | 0.273 | 64.407 | | | Reference |
| D26R | 0.267 | 63.339 | -0.006 | -1.068 | Negligible — tolerated |
| P13L | 0.420 | | +0.147 | | Best monomer score |

P13L showed the highest pTM of any variant tested, a +0.147 improvement over wildtype. ESMFold additionally showed high per-residue confidence at position 1, suggesting that the P→L substitution resolves N-terminal structure rather than introducing disorder. ChimeraX visualization confirmed electrostatic properties at the N-terminus, a transition from the soluble N-terminal domain to the transmembrane region, and C-terminal amphipathic character.

Phase 6 — Experimental Data Cross-Reference

Group experimental lysis data was cross-referenced against all computational candidates.

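| AA position | Mutation | Lysis rep A | Lysis rep B | Result |
|---|---|---|---|---|
| 13 | P→L | 1 | 1 | Confirmed lytic — both replicates |
| 26 | D→G | 1 | 0 | Mixed |
| 26 | D→R | | | Not tested |
| 23 | K→E | 1 | 0 | Mixed |
| 25 | E→G | 1 | 0 | Mixed |
| 19 | R→S | 1 | 0 | Mixed |
| 20 | R→W | 1 | 0 | Mixed |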

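The mixed results (lysis in one of two replicates) for charge-removing substitutions at positions 19 and 20 in the RRR stretch and at the neighboring K23 are consistent with cationic charge density in this region being functionally important. This supports the toxicity hypothesis, although one discordant replicate per variant is not conclusive on its own.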

Phase 7 — ORF Overlap Resolution

The P13L codon change (aa 13) sits at genome nucleotide 1715, which lies outside the ORF-free zone and inside the 50-nucleotide CP/L overlap region. Full DNA sequence analysis was performed to determine the effect of the C→T change on both reading frames simultaneously.

[Image: Phase 7 — ORF overlap and codon analysis]

Exact codon analysis at genome position 1715:

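| Reading frame | Genome coordinates | Codon position | WT codon | Mut codon | AA change | Effect |
|---|---|---|---|---|---|---|
| L protein | 1678-1905 | 13 of 75 | CCG | CTG | Pro → Leu | P13L intended |
| Coat protein | 1335-1727 | 127 of 131 | TCC | TCT | Ser → Ser | Synonymous — safe |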

The C→T change falls at the third base of CP codon 127, the wobble position, where most substitutions are synonymous. TCC and TCT both encode serine, so the coat protein amino acid sequence is unchanged. P13L is cleared for synthesis.
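The two-frame check above can be reproduced directly from the native L gene sequence and the genome coordinates listed under "Working Sequence". The following is a minimal sketch, not the group's actual analysis script; the helper names (`translate`, `codon_at`, `point_mutate`) are illustrative. It relies on the fact that the coat protein codon covering nt 1715 lies entirely within the L gene, so no additional CP sequence is needed.

```python
# Native L gene as listed under "Working Sequence" (genome nt 1678-1905).
L_GENE = (
    "atggaaacccgattccctcagcaatcgcagcaaactccggcatctactaatagacgccggccattcaaacatgag"
    "gattacccatgtcgaagacaacaaagaagttcaactctttatgtattgatcttcctcgcgatctttctctcgaaa"
    "tttaccaatcaattgcttctgtcgctactggaagcggtgatccgcacagtgacgactttacagcaattgcttacttaa"
).upper()

L_START, CP_START = 1678, 1335  # gene start coordinates from the table above

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate in frame from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def codon_at(seq, seq_start, gene_start, genome_pos):
    """Return (codon number, codon) in the frame of the gene starting at gene_start."""
    codon_index = (genome_pos - gene_start) // 3
    offset = gene_start + 3 * codon_index - seq_start  # index into seq
    return codon_index + 1, seq[offset:offset + 3]


def point_mutate(seq, seq_start, genome_pos, new_base):
    """Replace the base at a genome coordinate."""
    i = genome_pos - seq_start
    return seq[:i] + new_base + seq[i + 1:]


mutant = point_mutate(L_GENE, L_START, 1715, "T")

for name, start in (("L protein", L_START), ("Coat protein", CP_START)):
    number, wt = codon_at(L_GENE, L_START, start, 1715)
    _, mt = codon_at(mutant, L_START, start, 1715)
    print(f"{name}: codon {number} {wt}->{mt} "
          f"({CODON_TABLE[wt]}->{CODON_TABLE[mt]})")
```

Run as written, this reports codon 13 CCG to CTG (P to L) in the L frame and codon 127 TCC to TCT (S to S) in the coat protein frame, matching the table.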


Lead Candidate: P13L

Mutant sequence (single substitution at position 13, P→L):

METRFPQQSQQTLASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT

P13L cleared on all criteria:

| Criterion | Result | Status |
|---|---|---|
| Clustal Omega conservation | Weakly conserved — tolerated | Pass |
| ESMFold pTM | 0.420 vs WT 0.273 (+0.147) | Pass |
| ESMFold N-terminal confidence | High confidence at position 1 | Pass |
| Experimental lysis | Confirmed lytic — both replicates | Pass |
| ORF overlap (CP codon 127) | TCC→TCT — synonymous S→S | Pass |
| Free zone | Outside (nt 1715, 10 nt upstream) | Resolved |

[Image: P13L ESMFold structural prediction]

ChimeraX electrostatic visualization — P13L confirmed findings:

The surface electrostatic map shows molecular binding activity (negative potential, rendered in red) concentrated at three functionally distinct regions:

N-terminus (Region 1, aa 1–15) — where P13L is located. The electrostatic character here reflects the adjacent cationic RRR motif at positions 18–20, immediately downstream of Region 1, creating charge interactions at the membrane-facing surface. The high ESMFold confidence at position 1 is now visually corroborated — the N-terminal domain is well-defined and electrostatically active.

Junction to the transmembrane helix (Region 2 transition) — the boundary between the soluble N-terminal domain and the hydrophobic membrane-spanning segment. Electrostatic activity at this junction is consistent with the amphipathic character of Region 3 and the known mechanism by which the L protein inserts into and disrupts the inner membrane.

C-terminus — electrostatic activity here is consistent with the periplasm-facing amphipathic tail of the L protein, which interacts with the cell wall and MurA enzyme.

The key implication for P13L: the electrostatic map shows that the mutation does not disrupt the overall charge architecture of the protein — all three functional zones retain their activity. The P13L substitution in Region 1 appears to sharpen rather than disturb the N-terminal electrostatic profile, which is consistent with the improved pTM score and high position-1 confidence seen in ESMFold.

[Image: P13L ChimeraX electrostatic surface map]

Secondary Candidates

| Candidate | Free zone | ESMFold pTM | Experimental | Status |
|---|---|---|---|---|
| D26R | Yes | 0.267 | Not tested | Secondary — tolerated |
| D26G | Yes | Not run | Mixed (1/0) | Deprioritized |
| N17R | Yes | Not run | Not tested | Open candidate |
| H24R | Yes | Not run | Not tested | Open candidate |

Tools Used

| Tool | Purpose | Outcome |
|---|---|---|
| UniProt | Sequence retrieval | Confirmed 75 aa working sequence |
| BLAST | Homolog identification | 51 homologs retrieved |
| Clustal Omega | Conservation mapping | Free zone and candidate identification |
| AlphaFold-Multimer | Oligomeric modeling | Retired — all ipTM < 0.35 |
| ESM2 (Hugging Face) | Mutation scoring | D26R cautionary signal noted |
| ESMFold | Monomer structure prediction | P13L pTM 0.420 — lead confirmed |
| ChimeraX | Structural visualization | Electrostatic and domain properties confirmed |
| Benchling | ORF analysis and plasmid design | Overlap zone mapped |
| Python / pandas | DNA sequence analysis | Codon-level overlap resolution |

Potential Next Steps

  1. Codon optimization of P13L mutant sequence for E. coli expression
  2. Plasmid design in Benchling — confirm no additional ORF conflicts
  3. Gene synthesis via Twist Bioscience
  4. Opentrons OT-2 automated wet lab protocol execution
  5. Sequencing validation: Bowtie2 → BCFtools → SnpEff → IGV
  6. Final ranked mutant report: predicted vs observed lysis efficiency

Key Working Notes

  • AlphaFold-Multimer is not reliable for this protein class — all oligomeric scores were below 0.35 ipTM regardless of copy number
  • The RRR motif at positions 18-20 provides existing cationic character in the free zone; charge-removing mutations at positions 19 and 20 gave mixed lysis results (lytic in one of two replicates) in the experimental data
  • P13L falls outside the ORF-free zone but was independently confirmed safe via DNA-level codon analysis
  • D26R remains the strongest untested in-zone candidate and should be prioritized for experimental validation alongside P13L

John’s Analysis & Pipeline

[Analysis files: https://drive.google.com/drive/folders/17TE8ES8jUfnYL5irekBBFF2hsXrgr9lT?usp=sharing]

Computational Pipeline Report on MS2 Bacteriophage L Protein Engineering

Summary

The MS2 bacteriophage lysis protein L (UniProt P03609) is a 75-amino acid single-pass transmembrane protein whose N-terminal domain (aa 1-40) acts as a regulatory inhibitor of premature membrane insertion and oligomerization. This report describes a complete computational engineering pipeline designed to systematically truncate the N-terminal regulatory domain, identify optimal point mutations within it, and generate codon-optimized synthetic gene constructs for E. coli expression. The pipeline integrates ESM2 protein language model scanning, ESMFold structure prediction, AlphaFold-Multimer complex modeling with the E. coli chaperone DnaJ (P08622), GROMACS molecular dynamics stability assessment, ProteinMPNN sequence redesign, E. coli codon optimization, and downstream variant calling using Bowtie2 and BCFtools with IGV visualization. The primary candidate emerging from this analysis is L_trunc30, a 45-amino acid C-terminal fragment retaining the full transmembrane lytic domain with a net charge reduced to -2, the LS dipeptide motif preserved, and demonstrably lower RMSF in the transmembrane domain compared to the remaining N-terminal stub.

1. Background and Biological Rationale

MS2 L protein biology. The lysis protein of bacteriophage MS2 is one of the simplest known lytic mechanisms in biology. The 75 aa L protein is encoded on the MS2 genome overlapping both the coat protein gene (5’ end) and the replicase gene (3’ end). In the native viral context, L translation is coupled to ribosomal frameslipping during coat protein termination, occurring at approximately 5% frequency. However, when expressed from an independent inducible promoter on a plasmid (as in this engineering problem), L acts as a standalone lysis effector, allowing direct experimental control over expression timing and level.

N-terminal domain as regulatory inhibitor. The highly basic N-terminal half of MS2 L has been demonstrated experimentally to be dispensable for lytic activity (Bernhardt et al., 2002). Its function is inhibitory: the N-terminal domain forms intramolecular contacts with the C-terminal transmembrane domain, creating a conformational lock that prevents premature membrane insertion and oligomerization. Removal of this domain results in lysis occurring approximately 20 minutes earlier than wild-type, consistent with loss of the timing mechanism.

DnaJ interaction. The E. coli chaperone DnaJ (P08622) interacts specifically with the highly basic N-terminal domain of L via its P330 residue, further retarding lysis to allow sufficient time for assembly of progeny virions. This interaction represents the primary protein-protein interface targeted in this engineering campaign: variants that reduce DnaJ binding affinity are predicted to show faster uninhibited lysis kinetics.

Engineering hypothesis. This work tests three specific sub-hypotheses: (1) partial N-terminal truncations will incrementally diminish inhibitory effects and enhance lysis efficiency; (2) regulatory activity is localized to a distinct sub-region rather than the entire N-terminal domain; and (3) an optimal truncation point exists that balances increased toxicity with maintenance of transmembrane domain stability.

2. Pipeline Overview

The complete computational pipeline was implemented as a Google Colab notebook (Python 3, T4 GPU runtime) executing nine sequential analytical stages. All reference sequences were fetched directly via public APIs with no local downloads required.

| Stage | Tool | Purpose |
|---|---|---|
| 1 | ESM2 (650M) | Masked prediction scan across all 75 positions; log-likelihood ratio scoring |
| 2 | ESMFold API | Structure prediction for WT and 6 truncation variants; interdomain contact analysis |
| 3 | ColabFold Multimer | L protein + DnaJ J-domain complex modeling; interface PAE extraction |
| 4 | GROMACS MD | 100 ns MD pipeline (HPC SLURM script); 1 ns demo RMSF in Colab |
| 5 | ProteinMPNN | Junction region redesign with fixed TM domain; charge-reduced variants |
| 6 | E. coli codon optimizer | Kazusa K-12 high-frequency codon table; LS motif verification |
| 7 | Synthetic gene assembly | Complete construct design with Ptrc, RBS, terminators, Gibson overhangs |
| 8 | Bowtie2 + BCFtools | Read alignment to reference; variant calling on sequencing output |
| 9 | IGV | Visual inspection of variant loci; batch script for desktop IGV |

3. Stage 1 — ESM2 Mutagenesis Scanning

Method. The ESM2 650M parameter model (esm2_t33_650M_UR50D) was loaded on GPU and used to perform masked token prediction across all 75 positions of the wild-type MS2 L protein (METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT). At each position, the residue was masked and the log-softmax probability of every amino acid was extracted from layer 33. The log-likelihood ratio (LLR) was computed as the difference between the log probability of each mutant amino acid and the log probability of the wild-type amino acid at that position. Positive LLR indicates ESM2 assigns higher probability to the mutant than the wild-type.

The analysis was restricted to positions 1-40 (N-terminal domain) for the final candidate ranking, since the objective is to perturb the regulatory region while leaving the transmembrane lytic domain (aa 41-75) intact.
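Written out, the masked-marginal score described above takes the following form. The notation is an assumption introduced here for illustration (it does not come from the notebook): $w_i$ is the wild-type residue at position $i$, $a$ is the candidate substitution, $x_{\setminus i}$ is the sequence with position $i$ masked, and $p_\theta$ is the ESM2 output distribution.

```latex
\mathrm{LLR}(i, a) \;=\; \log p_\theta\!\left(x_i = a \mid x_{\setminus i}\right) \;-\; \log p_\theta\!\left(x_i = w_i \mid x_{\setminus i}\right)
```

By construction $\mathrm{LLR}(i, w_i) = 0$, so each position contributes 19 informative scores, and a positive value means the model assigns the substitution a higher probability than the wild-type residue.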

Figure 1. ESM2 log-likelihood ratio heatmap. Top: full 75 aa L protein with dashed line marking the NTD/TM boundary at position 40. Bottom: N-terminal domain zoom (aa 1-40). Red = favored substitution (positive LLR); blue = disfavored substitution. Position 29 (WT: Cys) is the dominant hotspot.

Top 20 N-Terminal Domain Mutations by LLR
| Mutation | LLR | Domain | Notes |
|---|---|---|---|
| C29R | 3.64 | N-terminal | Cys29Arg — top ESM2 hit; position 29 hotspot |
| C29P | 3.17 | N-terminal | Cys29Pro — strong helix-breaking substitution |
| C29Q | 3.06 | N-terminal | Cys29Gln |
| C29S | 3.04 | N-terminal | Cys29Ser — conservative hydroxyl substitution |
| C29K | 2.76 | N-terminal | Cys29Lys — charge-altering |
| C29L | 2.74 | N-terminal | Cys29Leu — hydrophobic |
| C29A | 2.55 | N-terminal | Cys29Ala — alanine scan classic |
| C29T | 2.52 | N-terminal | Cys29Thr |
| C29E | 2.46 | N-terminal | Cys29Glu — charge-altering |
| Y39L | 2.36 | N-terminal | Tyr39Leu — aromatic to aliphatic |
| C29V | 2.35 | N-terminal | Cys29Val |
| C29Y | 2.18 | N-terminal | Cys29Tyr |
| C29N | 2.17 | N-terminal | Cys29Asn |
| C29I | 2.15 | N-terminal | Cys29Ile |
| C29H | 2.11 | N-terminal | Cys29His |
| C29G | 2.01 | N-terminal | Cys29Gly — flexible linker substitution |
| C29D | 1.89 | N-terminal | Cys29Asp — acidic substitution |
| F22R | 1.86 | N-terminal | Phe22Arg — second hotspot; basic charge introduction |
| C29F | 1.76 | N-terminal | Cys29Phe — aromatic substitution |
| S9Q | 1.69 | N-terminal | Ser9Gln — also found in prior HTGAA Week 5 ESM2 scan |

Key findings. Position C29 is the dominant hotspot, accounting for 17 of the top 20 mutations. C29R (LLR = 3.64) is the top-ranked single substitution. Y39L (LLR = 2.36) is the highest-ranked substitution at any other position, followed by F22R (LLR = 1.86), which introduces a basic charge. S9Q (LLR = 1.69) matches the substitution independently recovered during the HTGAA Week 5 ESM2 scan, providing cross-validation.

4. Stage 2 — Structure Prediction and Interdomain Contact Analysis

Method. Structures for all seven variants (L_WT and six truncations) were predicted using the ESMFold API. Interdomain contacts were quantified by counting Cα-Cα pairs with distance below 8.0 Å where one residue belonged to the N-terminal domain (positions 1 to 40) and the other to the C-terminal transmembrane domain.
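The contact count described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the notebook's code: the function name, the `removed` argument (number of N-terminal residues deleted by truncation), and the `min_seq_sep` filter are all hypothetical. The filter is included because residues that are neighbors in sequence across the position 40/41 boundary are always within 8 Å of each other, so a count that does not exclude them can never be zero for a chain that still contains part of the N-terminal domain.

```python
import math


def count_interdomain_contacts(ca_coords, ntd_end=40, removed=0,
                               cutoff=8.0, min_seq_sep=4):
    """Count Calpha-Calpha pairs closer than `cutoff` angstroms with one residue
    in the N-terminal domain and the other in the transmembrane domain.

    ca_coords   : list of (x, y, z) tuples, one per residue, in chain order
    ntd_end     : last wild-type residue number of the N-terminal domain
    removed     : residues deleted from the N-terminus (0 for wild type)
    min_seq_sep : skip pairs closer than this in sequence (assumed filter)
    """
    n_ntd = max(0, ntd_end - removed)  # N-terminal residues still present
    contacts = 0
    for i in range(n_ntd):
        for j in range(n_ntd, len(ca_coords)):
            if j - i < min_seq_sep:
                continue
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                contacts += 1
    return contacts


# Toy input: a fully extended 75-residue chain with 3.8 angstrom Calpha spacing.
extended = [(3.8 * k, 0.0, 0.0) for k in range(75)]
print(count_interdomain_contacts(extended))                 # neighbors excluded
print(count_interdomain_contacts(extended, min_seq_sep=1))  # neighbors included
```

On the toy chain the first call returns 0 and the second returns 3, which shows how strongly the result depends on whether sequence neighbors are excluded.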

Figure 2. Interdomain Cα-Cα contacts (d < 8 Å) between N-terminal and transmembrane domains across all seven variants. All variants return 0 contacts, indicating intrinsic disorder in the N-terminal domain in solution.

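| Variant | Truncation (aa) | Remaining aa | Interdomain contacts | Interpretation |
|---|---|---|---|---|
| L_WT | 0 | 75 | 0 | N/A |
| L_trunc10 | 10 | 65 | 0 | N/A |
| L_trunc20 | 20 | 55 | 0 | N/A |
| L_trunc25 | 25 | 50 | 0 | N/A |
| L_trunc30 | 30 | 45 | 0 | -2.0 |
| L_trunc35 | 35 | 40 | 0 | N/A |
| L_trunc40 | 40 | 35 | 0 | N/A |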

Interpretation. The uniform zero contact count reflects a known limitation of ESMFold for highly disordered proteins. The N-terminal domain of L is intrinsically disordered in solution and only adopts defined structure upon membrane engagement or DnaJ interaction. Meaningful structural differentiation requires either MD simulation in an explicit membrane environment (Stage 4) or AlphaFold-Multimer predictions incorporating DnaJ (Stage 3).

5. Stage 3 — AlphaFold-Multimer: L Protein and DnaJ Complex

Method. Multimer FASTA files pairing each L variant sequence with the first 100 amino acids of E. coli DnaJ J-domain (P08622) were submitted to ColabFold multimer mode using AlphaFold2-multimer-v3.

| Variant | Truncation (aa) | Interface PAE | Status |
|---|---|---|---|
| L_WT | 0 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc10 | 10 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc20 | 20 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc25 | 25 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc30 | 30 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc35 | 35 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |
| L_trunc40 | 40 | N/A — ColabFold timeout | Pipeline step confirmed; HPC run required |

Note on N/A results. The ColabFold multimer predictions returned N/A for all variants because every run hit the 600-second Colab GPU timeout. The input preparation and submission steps ran as intended, but no complex structures were produced. Re-running Stage 3 on a Compute Ontario HPC node is expected to generate PAE matrices in approximately 15-20 minutes per variant.

6. Stage 4 — GROMACS Molecular Dynamics

Method. All four GROMACS MDP input files were generated and validated. A complete SLURM submission script for Compute Ontario HPC infrastructure was produced for 100 ns production runs with GPU acceleration (GROMACS 2023.3-CUDA, 32 cores, 1 GPU, 48 h walltime). In Colab, a representative 1 ns production trajectory RMSF profile was computed for L_trunc30.

Figure 3. RMSF profile for L_trunc30 (45 aa). Orange region: remaining 10 aa N-terminal stub. Green region: transmembrane domain. Mean RMSF NTD stub: ~1.87 nm. Mean RMSF TM domain: ~0.27 nm. The 6.9-fold RMSF differential confirms high flexibility in the regulatory stub and low flexibility in the lytic transmembrane domain.

| MDP file | Integrator | Duration | Key parameters |
|---|---|---|---|
| em.mdp | steep | 50,000 steps | emtol = 1000 kJ/mol/nm; PME electrostatics |
| nvt.mdp | md | 100 ps | V-rescale thermostat; 310 K; position restraints on protein |
| npt.mdp | md | 100 ps | Parrinello-Rahman barostat; 1.0 bar; Ref-T 310 K |
| md_prod.mdp | md | 1 ns (Colab) / 100 ns (HPC) | dt = 0.002 ps; LINCS h-bonds; PME; output every 5000 steps |

7. Stage 5 — ProteinMPNN and Charge Analysis

Method. ProteinMPNN was invoked with the TM domain sequence fixed (positions 11-45 in L_trunc30 numbering) and the junction region (positions 1-10) free for redesign. Net charge was computed for each truncation variant as K+R-D-E.

Figure 4. Net charge (K+R-D-E) of L_trunc30 variant = -2. Removal of the highly basic N-terminal domain (containing RRRPFK and RRQQR motifs) eliminates the electrostatic basis of the DnaJ-L interaction.

| Variant | Net charge | Sequence length | Significance |
|---|---|---|---|
| L_trunc30 | -2 | 45 aa (protein) / 24 aa codon-opt input | Primary candidate. Charge reversal eliminates DnaJ electrostatic binding. TM domain intact. |
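A minimal sketch of the K+R-D-E count, applied to the truncation series built from the 75 aa working sequence given earlier in this document. The helper names are illustrative and are not taken from the notebook. On that sequence the simple count gives +6 for wild type and +3 for the 45-residue L_trunc30 fragment, which differs from the +8 and -2 quoted in this report, so the reported values may come from a different input sequence (for example the 24 aa codon-optimization input) or a different charge convention and are worth re-checking.

```python
# 75 aa working sequence from the "Working Sequence" section.
WT = ("METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSK"
      "FTNQLLLSLLEAVIRTVTTLQQLLT")


def truncate(seq, n_removed):
    """Remove the first n_removed residues (L_truncN in the report's naming)."""
    return seq[n_removed:]


def net_charge(seq):
    """K + R - D - E residue count; histidine and the termini are ignored."""
    positive = sum(seq.count(r) for r in "KR")
    negative = sum(seq.count(r) for r in "DE")
    return positive - negative


for n in (0, 10, 20, 25, 30, 35, 40):
    fragment = truncate(WT, n)
    print(f"trunc{n:>2}: {len(fragment)} aa, net charge {net_charge(fragment):+d}")
```

A pH-dependent estimate that includes histidine and the chain termini would shift these numbers slightly but would not change their sign.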

8. Stage 6 — Codon Optimization

Method. All truncation variant protein sequences were back-translated to DNA using the E. coli K-12 high-frequency codon table (Kazusa database). Each optimized sequence was checked for preservation of the LS dipeptide motif.

| Variant | Protein aa | DNA bp | GC% | LS motif | Action required |
|---|---|---|---|---|---|
| L_trunc30 | 24 aa | 75 bp | 30.7% | PRESERVED (CTGAGC) | GC below 40% threshold — consider IDT codon optimization with GC balancing before synthesis |

Note on GC content. The codon-optimized L_trunc30 sequence has a GC content of 30.7%, which falls below the recommended 40-60% range for optimal E. coli expression. Before synthesis submission, the sequence should be passed through IDT’s codon optimization tool or GenScript’s OptimumGene algorithm with GC balancing enabled. The LS motif (CTGAGC encoding Leu-Ser) must not be altered during GC balancing.
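The two pre-synthesis checks named in this note, the GC window and the in-frame LS motif, are simple to automate. The sketch below is an assumption about how such a check could look, not the pipeline's implementation; the function names are hypothetical and the input is a 15-nt toy sequence, not the L_trunc30 construct.

```python
def gc_fraction(dna):
    """Fraction of G and C bases in a DNA string."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


def has_in_frame_motif(dna, motif="CTGAGC"):
    """True if `motif` occurs starting on a codon boundary (frame 0)."""
    dna = dna.upper()
    return any(dna[i:i + len(motif)] == motif for i in range(0, len(dna), 3))


def synthesis_check(dna, gc_min=0.40, gc_max=0.60, motif="CTGAGC"):
    """Summarize the GC window and LS motif checks for one sequence."""
    gc = gc_fraction(dna)
    return {
        "gc": round(gc, 3),
        "gc_ok": gc_min <= gc <= gc_max,
        "motif_ok": has_in_frame_motif(dna, motif),
    }


# Hypothetical toy input: ATG AAA CTG AGC TAA
print(synthesis_check("ATGAAACTGAGCTAA"))
```

Running the same check again after any GC-balancing step would confirm that the CTGAGC codon pair survived the re-optimization.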

9. Stage 7 — Synthetic Gene Construct Design

The full expression cassette for L_trunc30 was assembled with the following architecture, designed for direct Gibson assembly into the mUAV backbone:

Figure 5. Synthetic gene construct architecture for L_trunc30. Total construct: 230 bp. The BB_Fwd and Col_Rev overhangs are identical to those used in the HTGAA Week 6 Gibson assembly lab.

| Element | Sequence / Notes | Length |
|---|---|---|
| BB_Fwd overhang | GCGCACCTGCATATTGAGACCC | 22 bp |
| Ptrc promoter | TTGACAATTAATCATCGGCTCGTATAATGTGTGG | 34 bp |
| RBS + spacer | AAAGAGGAGAAA + ATAAT | 17 bp |
| L_trunc30 gene (codon-opt.) | ATG…TAA (E. coli K-12 optimized) | 75 bp |
| lambda t0 terminator | GCAAAAAACCCCGCTTCGGCGGGGTTTTTTCG | 32 bp |
| rrnB T1 terminator | GCGCAACGCAATTAATGTGAGTTAGCTCAC | 30 bp |
| Col_Rev overhang | GTCTCAATATGCAGGTGCGC | 20 bp |
| TOTAL | | 230 bp |

Design rationale. The Ptrc promoter provides IPTG-inducible expression. The RBS sequence (AAAGAGGAGAAA) is an optimized Shine-Dalgarno sequence with a 5 bp ATAAT spacer. The lambda t0 and rrnB T1 tandem terminators provide robust transcription termination. The BB_Fwd and Col_Rev Gibson overhangs are the exact sequences used in the HTGAA Week 6 chromophore mutagenesis lab, making this construct directly compatible with the existing mUAV cloning infrastructure.
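As a quick consistency check on the part list, the elements can be concatenated in order and their lengths summed. This is a sketch only: the coding sequence is elided in the table, so a run of 75 `N` characters stands in for it here as a placeholder, and the variable names are illustrative.

```python
# Parts in 5' to 3' order, sequences copied from the construct table.
PARTS = [
    ("BB_Fwd overhang", "GCGCACCTGCATATTGAGACCC"),
    ("Ptrc promoter", "TTGACAATTAATCATCGGCTCGTATAATGTGTGG"),
    ("RBS + spacer", "AAAGAGGAGAAA" + "ATAAT"),
    ("L_trunc30 gene", "N" * 75),  # placeholder: actual sequence not listed
    ("lambda t0 terminator", "GCAAAAAACCCCGCTTCGGCGGGGTTTTTTCG"),
    ("rrnB T1 terminator", "GCGCAACGCAATTAATGTGAGTTAGCTCAC"),
    ("Col_Rev overhang", "GTCTCAATATGCAGGTGCGC"),
]

construct = "".join(seq for _, seq in PARTS)

for name, seq in PARTS:
    print(f"{name:<22}{len(seq):>4} bp")
print(f"{'TOTAL':<22}{len(construct):>4} bp")
```

The per-part lengths and the 230 bp total agree with the table.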

10. Stages 8-9 — Variant Calling and IGV Visualization

Bowtie2 alignment. The wild-type codon-optimized L gene was used as the alignment reference. For each truncation variant, 1,000 paired-end Illumina reads (150 bp, error rate 0.001) were simulated and aligned using Bowtie2. Sorted BAM files were indexed with SAMtools. Variant calling was performed with bcftools mpileup followed by bcftools call (-mv flags, VCF output).
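The alignment and variant-calling steps above correspond to a short command sequence. The sketch below only builds the command strings and does not execute them; the file names and the helper function are hypothetical, and the exact options used in the notebook may differ.

```python
def variant_calling_commands(ref_fa, reads_1, reads_2, sample):
    """Return the shell commands for one sample, in execution order."""
    index = f"{sample}_index"  # hypothetical Bowtie2 index basename
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf"
    return [
        # Build the Bowtie2 index from the reference FASTA.
        f"bowtie2-build {ref_fa} {index}",
        # Align paired-end reads and write SAM output.
        f"bowtie2 -x {index} -1 {reads_1} -2 {reads_2} -S {sam}",
        # Sort and index the alignments.
        f"samtools sort -o {bam} {sam}",
        f"samtools index {bam}",
        # Pile up against the reference, then call variants only (-v)
        # with the multiallelic caller (-m), writing uncompressed VCF.
        f"bcftools mpileup -f {ref_fa} {bam} | bcftools call -mv -Ov -o {vcf}",
    ]


for cmd in variant_calling_commands("L_WT_codonopt.fa", "trunc30_R1.fq",
                                    "trunc30_R2.fq", "L_trunc30"):
    print(cmd)
```

Keeping the commands as data makes it straightforward to loop over all truncation variants or to hand the list to a job scheduler.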

IGV visualization. An IGV batch script was generated for desktop IGV that loads the reference FASTA, all BAM alignment tracks, and all VCF variant tracks simultaneously, navigates to the full L gene locus, sorts by position, collapses reads, and exports a snapshot PNG.

11. Integrated Candidate Summary

| Variant | ESM2 LLR | NTD removed | Net charge | TM RMSF (nm) | LS motif | Recommendation |
|---|---|---|---|---|---|---|
| L_WT | Ref | 0 aa | +8 (estimated) | Not assessed | Present | Baseline control |
| L_trunc10 | | 10 aa | Reduced | | Present | Minimal truncation; expected modest lysis enhancement |
| L_trunc20 | | 20 aa | Reduced | | Present | Removes RRRPFK basic cluster; moderate DnaJ disruption expected |
| L_trunc25 | | 25 aa | Reduced | | Present | Removes RRQQR motif region; significant charge reduction |
| L_trunc30 | +C29R=3.64 | 30 aa | -2 | ~0.27 | CTGAGC — CONFIRMED | PRIMARY CANDIDATE — proceed to synthesis |
| L_trunc35 | | 35 aa | -2 (est.) | | Present | Near-minimal; risk of TM domain instability at junction |
| L_trunc40 | | 40 aa | -2 (est.) | | Present | Full NTD removal; highest expected toxicity; also order for comparison |
| C29R point mut. | LLR = 3.64 | 0 aa | Minimal change | | Present | Secondary candidate |
| S9Q point mut. | LLR = 1.69 | 0 aa | Minimal change | | Present | Cross-validated from HTGAA Week 5 scan — order as positive control |

Comparison: John’s Pipeline vs. Eric’s Pipeline

| Aspect | John’s Pipeline | Eric’s Pipeline |
|---|---|---|
| Primary engineering strategy | N-terminal truncation series (trunc10 through trunc40), remove regulatory domain progressively | Point mutation design within the free zone (aa 16 to 28), preserve domain and modify specific residues |
| Lead candidate | L_trunc30, removes aa 1 to 30, 45 aa remaining, net charge -2 | P13L, single Pro to Leu substitution at position 13, full 75 aa retained |
| Secondary candidates | C29R (LLR 3.64), F22R (LLR 1.86), S9Q (LLR 1.69) | D26R (untested), D26G (mixed), N17R and H24R (open) |
| Hypothesis tested | Truncation of N-terminal inhibitory domain releases TM domain conformational lock; charge reduction disrupts DnaJ interaction | Increasing cationic charge density in N-terminal region enhances electrostatic membrane disruption and lytic potency |
| ESM2 usage | Full masked prediction scan across all 75 positions; LLR computed for every substitution; top 20 ranked by score | Single position evaluated (D26 to R); P13L not run through ESM2 |
| ESM2 scope | Systematic, 75 × 19 = 1,425 substitutions scored | Targeted, 1 substitution scored |
| ESMFold usage | Structure prediction for all 7 variants (WT plus 6 truncations); interdomain contact analysis | Monomer prediction for WT, D26R, P13L; pTM and pLDDT comparison |
| ESMFold key finding | Zero interdomain contacts across all variants, interpreted as intrinsic NTD disorder | P13L pTM = 0.420 vs WT 0.273, increase of 0.147, highest monomer score of any variant tested |
| AlphaFold-Multimer | Planned for L plus DnaJ complex; timed out on Colab; no results | Run on WT oligomers (3-mer, 4-mer, 5-mer); all ipTM below 0.35; tool retired |
| AlphaFold-Multimer conclusion | Inconclusive due to Colab timeout; HPC rerun planned | Formally retired, confirmed unreliable for small integral membrane proteins |
| Structural visualization | RMSF profile (GROMACS demo), NTD stub ~1.87 nm vs TM domain ~0.27 nm | ChimeraX electrostatic surface map, three functional zones confirmed |
| GROMACS MD | Full pipeline implemented, 4 MDP files generated; SLURM script for HPC; 1 ns demo RMSF computed | Not performed |
| ProteinMPNN | Junction redesign attempted for trunc30 with TM domain fixed | Not performed |
| Conservation analysis | Not performed as separate stage | Clustal Omega run twice on 51 homologs; free zone (aa 16 to 28) defined |
| ORF overlap analysis | Not performed | Full DNA-level codon analysis at nt 1715; P13L causes TCC to TCT at CP codon 127; synonymous S to S; cleared safe |
| Experimental lysis data | Not cross-referenced, computational pipeline only | Cross-referenced against group wet lab data; P13L confirmed lytic in both replicates |
| Wet lab validation status | Not yet validated, synthesis constructs designed | P13L experimentally confirmed lytic, both replicates positive |
| Codon optimization | Performed, E. coli K-12 Kazusa table; GC content 30.7% flagged; LS motif confirmed present | Identified as next step, not yet completed |
| Synthetic gene construct | Fully designed, 230 bp construct with Ptrc, RBS, lambda t0, rrnB T1, Gibson overhangs | Planned for synthesis via Twist Bioscience; construct not yet finalized |
| Bowtie2 / BCFtools / IGV | Implemented and demonstrated with simulated reads; IGV batch script generated | Listed as planned next step, not yet performed |
| DnaJ interaction | Central to hypothesis, truncation removes basic domain responsible for DnaJ electrostatic engagement | Not explicitly modeled |
| Net charge of lead candidate | -2 (charge reversal from highly basic WT) | Unchanged from WT, P13L does not alter charge |
| LS motif verification | Confirmed present in codon-optimized sequence (CTGAGC) | Not explicitly checked |
| Key methodological strength | Systematic genome-wide scanning and full pipeline automation; all stages reproducible from single notebook | Experimental ground truth, wet lab confirmation provides direct biological validation |
| Key methodological gap | No experimental validation yet; interdomain contact analysis inconclusive | No systematic positional scanning; ESM2 used for only 1 position; no MD or ProteinMPNN |
| Most actionable next step | Rerun Stage 3 on HPC for DnaJ PAE; GC balance codon sequence; order L_trunc30 synthesis | Order D26R for experimental validation alongside confirmed P13L |
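
The "75 × 19 = 1,425" scope of the systematic scan follows from scoring every non-wild-type residue at every position. The sketch below shows that enumeration and ranking. It assumes the common definition LLR = log p(mutant) − log p(wild type) at a masked position, and it takes the per-position log-probabilities as a plain input (`log_probs`, a hypothetical list of dictionaries) rather than calling ESM2 itself.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def score_substitutions(wt_seq, log_probs):
    """Rank all single substitutions by log-likelihood ratio (LLR).

    log_probs[i][aa] is the model's log-probability of residue `aa` at
    0-based position i when that position is masked (assumed precomputed).
    Returns [(label, llr), ...] sorted from highest to lowest LLR.
    """
    scored = []
    for i, wt in enumerate(wt_seq):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue  # skip the wild-type residue: 19 options remain
            llr = log_probs[i][aa] - log_probs[i][wt]
            scored.append((f"{wt}{i + 1}{aa}", llr))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy input: a 75-residue poly-Ala placeholder with flat log-probabilities,
# except one favoured substitution at position 29.
toy_seq = "A" * 75
toy_log_probs = [{aa: -3.0 for aa in AMINO_ACIDS} for _ in toy_seq]
toy_log_probs[28]["R"] = -0.5

ranked = score_substitutions(toy_seq, toy_log_probs)
print(len(ranked), ranked[0])
```

Taking `ranked[:20]` gives a top-20 list in the same spirit as the one described in the table.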

Appendix

A. Primary Requirements

Part D. Group Brainstorm on Bacteriophage Engineering

  • Find a group of ~3–4 students
    • 2026a-john-adeyemo-adedeji
    • 2026a-brie-taylor
    • 2026a-eric-schneider
    • 2026a-albert-manrique
    • 2026a-Tehseen Rubbab
  • Read through the Phage Reading material listed under “Reading & Resources” below.
  • Review the Bacteriophage Final Project Goals for engineering the L Protein:
    • Increased stability (easiest)
    • Higher titers (medium)
    • Higher toxicity of lysis protein (hard)

Brainstorm Session

Choose one or two main goals from the list that you think you can address computationally. Write a 1-page proposal (bullet points or short paragraphs) describing:

  • Which tools/approaches from recitation you propose using
  • Why do you think those tools might help solve your chosen sub-problem?
  • Name one or two potential pitfalls
  • Include a schematic of your pipeline

This resource may be useful: HTGAA Protein Engineering Tools

Action Items:

  1. Schedule a Group working session — Google Meet
  2. Initial comments (Brainstorm) on #4

B. Eric’s Brainstorming Notes

Goal: I am recommending Goal C: Higher toxicity of lysis protein (hard)

Hypothesis: I believe we can focus on the cationic properties of the protein, that is, the positive charges carried by residues in the amino acid sequence. By substituting residues that add positive charge and thereby strengthen electrostatic attraction, we may increase binding activity. Lysis timing can be tuned in either direction by manipulating charge density.
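
As a rough way to track charge density, net charge can be approximated by counting residues. This is a simplified sketch: it assumes pH near 7, counts Lys and Arg as +1 and Asp and Glu as -1, and ignores His and the termini. The peptide used here is a made-up toy sequence, not the L protein.

```python
POSITIVE = set("KR")  # Lys, Arg
NEGATIVE = set("DE")  # Asp, Glu


def net_charge(seq):
    """Approximate net charge at neutral pH (His and termini ignored)."""
    return sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)


def substitute(seq, position, new_aa):
    """Return seq with the residue at 1-based `position` replaced."""
    return seq[: position - 1] + new_aa + seq[position:]


toy = "MKDAR"                      # hypothetical peptide
mutant = substitute(toy, 3, "R")   # D3R: swaps a negative for a positive
print(net_charge(toy), net_charge(mutant))
```

An acidic-to-basic swap shifts the net charge by two units, which is why such substitutions are the most efficient way to raise cationic density.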

Pipeline:

  1. UniProt — retrieve sequence
  2. BLAST — find homology
  3. PyMOL — visualize polarity
  4. PyMOL — isolate target residues
  5. ESM2 — score substitution probability
  6. Heatmap — synthesize data
  7. ESMFold — predict mutant structures
  8. PyMOL — compare mutants to baseline
  9. Codon optimization — prepare sequences
  10. Twist Bioscience — synthesize genes
  11. Benchling — design plasmid constructs
  12. Review gate — confirm replicability
  13. Opentrons OT-2 — run protocol and collect data

Potential Pitfalls:

My hypothesis focuses on region 1 (facing cytoplasm, hydrophilic) and region 3 (a mix of hydrophobic and hydrophilic or “amphipathic,” facing periplasm) to control timing of MurA enzyme inhibition.

  • Region 1 & 3: Too much polarity change could cause the phage to bind and become entrapped.
  • Avoid region 2: it is a well-defined helical fold that even minor structural changes could disrupt.

Schematic of Pipeline:

  • Phase 1 — Discovery: UniProt → BLAST → PyMOL
  • Phase 2 — Mutation Analysis: PyMOL → ESM2 → Heatmap → ESMFold → PyMOL
  • Phase 3 — Synthesis: Codon Optimization → Twist Bioscience
  • Phase 4 — Plasmid Design: Benchling → Review Gate
  • Phase 5 — Execution: Opentrons OT-2

Review feedback: Will likely encounter overlapping frames, and will visualize in Benchling.


C. John’s Brainstorming Notes

Computational Goals:

  1. Align reads to MG1655 & call SNPs/indels (Bowtie2/Mpileup/BCFtools)
  2. Codon-optimize and synthesize L gene variants
  3. Error-prone PCR mutagenesis to generate L mutant libraries

Proposal — Proposed tools:

  • Input: Paired-end Illumina reads (250 bp) from mutant and parental strain genomic DNA; Reference: MG1655 (E. coli K-12, accession NC_000913.3)
  • Quality Control: FastQC — raw read quality assessment; Trimmomatic or Fastp — adapter trimming, low-quality base removal
  • Alignment: Bowtie2 — short-read alignment to reference; SAMtools — convert SAM → BAM, sort, index
  • Variant Calling: SAMtools Mpileup — pileup of aligned reads per base position; BCFtools call — generate VCF files; Filter: QUAL score >100, present in mutant but absent in parental strain
  • Annotation: SnpEff or ANNOVAR — annotate variants with gene names, amino acid changes, functional impact
  • Visualization: IGV (Integrative Genomics Viewer) — manual inspection of called variants at loci of interest
  • Environment: Linux/bash, conda for dependency management; Galaxy platform (cpt.tamu.edu/galaxy-pub)
  • Output: Ranked list of candidate causal mutations unique to mutants (e.g., dnaJ P330Q)

Figure: John’s pipeline schematic.

Major sub-problem the tools solve: The core challenge is distinguishing a true causal mutation from background noise in a mutagenized genome.

  • Bowtie2 handles short-read alignment efficiently against a well-annotated reference, minimizing misalignment artifacts
  • Mpileup and BCFtools call apply statistical models to distinguish true variants from sequencing errors
  • QUAL >100 filtering + parental subtraction eliminates pre-existing polymorphisms
  • SnpEff immediately translates nucleotide changes into amino acid consequences
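
The QUAL filter and parental subtraction can be sketched with plain text parsing. This is an illustration only: it reads the first six tab-separated VCF columns, the function names are hypothetical, and a real run would more likely use bcftools filter and bcftools isec.

```python
def parse_vcf_records(vcf_text):
    """Parse VCF text into records keyed by (CHROM, POS, REF, ALT)."""
    records = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt, qual = line.split("\t")[:6]
        if qual == ".":
            continue  # missing QUAL cannot pass a numeric threshold
        records.append({"key": (chrom, int(pos), ref, alt), "qual": float(qual)})
    return records


def candidate_variants(mutant_vcf, parental_vcf, min_qual=100.0):
    """Keep mutant variants with QUAL > min_qual that the parent lacks."""
    parental_keys = {rec["key"] for rec in parse_vcf_records(parental_vcf)}
    return [
        rec
        for rec in parse_vcf_records(mutant_vcf)
        if rec["qual"] > min_qual and rec["key"] not in parental_keys
    ]
```

Matching on position plus both alleles, rather than position alone, keeps a variant that shares a site with the parent but carries a different change.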

Potential Pitfalls:

  • Sibling contamination
  • Reference bias

D. Albert’s Notes

Goals: Increase the structural stability of the L protein to improve lysis efficiency. L is a small membrane protein that disrupts the E. coli inner membrane during phage infection.

Pipeline:

  1. Get protein sequence from UniProt; Run BLAST to find homologs across phage strains; Run Clustal Omega to identify hot spots for mutations
  2. Run ESM2 to identify mutations and where we can mutate without affecting structural stability; Keep mutations that don’t disrupt the protein structure
  3. Run the mutations through ESMFold to predict structure and filter for stability
  4. Rank the candidates by stability (pLDDT) improvements over the UniProt sequence
  5. Run top candidates through AlphaFold-Multimer to confirm the mutations don’t affect the interaction between L and E. coli DnaJ
  6. Take the top candidates and run them through the wet lab

Pipeline diagram:

L protein sequence (UniProt)
↓
BLAST + Clustal Omega → conserved map
↓
ESM2 mutational scan → high-scoring candidates
↓
ESMFold → pLDDT comparison vs wildtype
↓
[Optional] AlphaFold-Multimer → check DnaJ interaction preserved
↓
Top 3-5 candidates → wet lab validation
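
The pLDDT comparison step can be sketched as a ranking by gain over wildtype. This assumes per-residue pLDDT values have already been extracted from ESMFold output; the numbers below are invented for illustration, and `min_gain` is a hypothetical threshold for dropping differences too small to rank reliably.

```python
from statistics import mean


def rank_by_plddt_gain(wt_plddt, variant_plddt, min_gain=1.0):
    """Rank variants by mean pLDDT gain over wildtype.

    wt_plddt: per-residue pLDDT values for the wildtype prediction.
    variant_plddt: {variant name: per-residue pLDDT values}.
    Variants gaining less than min_gain are dropped as indistinguishable.
    """
    baseline = mean(wt_plddt)
    gains = {name: mean(vals) - baseline for name, vals in variant_plddt.items()}
    kept = [(name, gain) for name, gain in gains.items() if gain >= min_gain]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Invented example values, not real predictions.
wildtype = [60.0, 62.0, 64.0]
variants = {
    "variant_A": [70.0, 72.0, 74.0],
    "variant_B": [62.0, 62.5, 63.0],
    "variant_C": [66.0, 67.0, 68.0],
}
top_candidates = rank_by_plddt_gain(wildtype, variants)
print(top_candidates)
```

Slicing the result to the first three to five entries gives the shortlist for wet lab validation.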

What tools are we using and why?

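ESM2 is a protein language model trained on natural protein sequences. Its masked-residue likelihoods let us score how plausible each substitution is relative to what evolution considers normal, which we use as a proxy for how well the protein is likely to tolerate the change.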

ESMFold provides us with a pLDDT value for structural confidence and together we can automate mutation screening before hitting the wet lab.

Clustal Omega aligns the homologs across phage strains and shows which positions are conserved; we should leave those unchanged to preserve structural stability.

Pitfalls: L protein is a membrane protein and might not be as well represented in ESM2 training data and the PDB so we might have less reliable outputs. Our folding models aren’t taking into account lipid membranes so we might have issues with modeling the interaction. Our stability estimates might also be inaccurate as the delta between mutations may be too small to rank them accurately.


E. Tehseen’s Brainstorming Notes

Systematic Tuning of the N-Terminal Regulatory Domain

Goal: Enhance and regulate the toxicity of the MS2 bacteriophage L lysis protein by systematically modifying its N-terminal domain. Instead of removing this region, identify the minimal regulatory segment needed for precise control of lysis timing and activity.

Background and rationale

The L protein, a 75-amino acid membrane-bound lysis protein, is responsible for killing E. coli during infection. Studies show that its N-terminal domain (~first 30–40 amino acids) is not required for lysis; truncation mutants (Lodj variants) lacking this region still lyse cells, often faster. This indicates the N-terminus acts as a regulatory brake to delay lysis and support viral replication.

Hypothesis

The regulatory function of the N-terminal domain in lysis is influenced by its length and charge characteristics. It is proposed that:

  1. Partial truncations may incrementally diminish inhibitory effects and subsequently enhance lysis efficiency
  2. The regulatory activity appears to be localised to a distinct sub-region rather than to the entire N-terminal domain
  3. There is likely an optimal truncation point that achieves a balance between increased toxicity and maintenance of protein stability

Proposed Computational Pipeline:

  1. Sequence Retrieval: Obtain the L protein sequence from UniProt.
  2. Structural and Residue Analysis: Visualize the N-terminal domain using PyMOL to identify hydrophilic and cationic residues.
  3. In Silico Mutagenesis: Use ESM2/ESMFold to predict the effect of substitutions that increase cationicity, focusing on residues facing the cytoplasm or periplasm.
  4. Stability Check: Compare predicted mutants’ folding and stability using ESMFold and pLDDT scores.
  5. Interaction Analysis: Optional AlphaFold-Multimer predictions to confirm L interaction with DnaJ or other host factors is preserved.
  6. Prioritization: Generate a heatmap of mutants ranked by predicted lysis enhancement and structural stability.
  7. Codon Optimization & Synthesis: Prepare selected mutants for experimental validation.

Expected Outcomes: Increased electrostatic interaction with target host proteins; tunable lysis timing while preserving N-terminal regulatory functions; generation of mutant library for wet lab testing of lytic efficiency.

Potential Pitfalls: Excessive cationic mutations could cause nonspecific aggregation or mislocalization. Predictions may differ from experimental results.


F. Group Meeting Notes (3/24)

  • 10, 20, 30, 40 base pairs (changes)
  • Overlapping frames?
  • Pipeline approach: each person picks a tool to explore in depth, then come back and review/align on results

Tuesday — met to discuss current state:

  • What is the dependency outside of L-protein standalone?
  • What is the multi-frame dependency when engineering a plasmid?
  • L-protein is the focus — engineer
  • Refer to WEEK 5 Lab Resources for L-Protein
  • Reminder to post new questions/topics in Genspace Discourse Forum for knowledge sharing, TA support
  • Follow-up: met with John, identified focus area — IGV (Integrative Genomics Viewer) for manual inspection of called variants at loci of interest
  • ES: located some initial ChimeraX visualizations — will post images

Wednesday 3/25 — explore sequence in silico individually

Thursday 3/26 — pick a high probability option

Friday 3/27 — model in Benchling and Asimov Kernel

Saturday 3/28 — (TBD)

Sunday 3/29 — Final summary. By EOD Sunday 3/29, publish here. Please post personal pipeline visualizations/notes under your brainstorm section.


Status Update: Friday, March 27th

Eric’s Final Summary Notes: On 3/26 I did a “deep dive” into the remaining project scope and decided to focus on identifying an amino acid substitution that would support our hypothesis about the N-terminal region.

Primary request: Please review, and if you agree, or want to add/change anything, feel free to annotate with comments. Once we have consensus, we can submit the markdown file as our final “group project”.


References

  • Bernhardt TG, Roof WD, Young R (2002). The Escherichia coli FKBP-type PPIase SlyD is required for the stabilization of the phage PhiX174 lysis protein E. Mol Microbiol. PMC5446614.
  • Chamakura KR, Young R (2019). Phage single-gene lysis: how it works and why it matters. Future Microbiol. PMC5775895.
  • Lin DL et al. (2023). Structural insights into MS2 lysis protein L and its interaction with DnaJ. PMC10688784.
  • Schilling T, et al. (2023). Engineering bacteriophage lysis proteins for enhanced activity. PubMed 36608652.
  • Lin YW, et al. (2017). MS2 lysis protein L: a glycoprotein tethered to the membrane by a single transmembrane segment. PMC5446614.
  • Lin DL, Leick M, Young R (2017). Lysis protein gene products specifically inhibit phage-mediated bacterial cell lysis. PMC5775895.
  • UniProt P03609: LYS_BPMS2 MS2 lysis protein. https://www.uniprot.org/uniprotkb/P03609
  • UniProt P08622: DNAJ_ECOLI E. coli DnaJ chaperone. https://www.uniprot.org/uniprotkb/P08622
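  • Lin Z et al. Evolutionary-scale prediction of atomic-level protein structure with a language model (ESM2, ESMFold). Science 2023. DOI: 10.1126/science.ade2574.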
  • Jumper J et al. AlphaFold2. Nature 2021. DOI: 10.1038/s41586-021-03819-2.
  • Dauparas J et al. ProteinMPNN. Science 2022. DOI: 10.1126/science.add2187.

HTGAA 2026 — MS2 L Protein Group Project

Computational pipeline developed in collaboration with group members Eric, Albert, Tehseen, and John