Week 4 HW: Protein Design Part I

Important

HTGAA | Part A: Protein Analysis and Visualization

Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions:


Briefly describe the protein you selected and why you selected it

How long is it? What is the most frequent amino acid?

SOD1 Description

Characteristic      Number
AA Length           53
Most Frequent AA    Glycine (25X)
Protein Homologs    250
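
The length and most frequent amino acid reported above can be recomputed from any one-letter protein sequence. The snippet below is a minimal sketch: the function name `summarize_sequence` and the toy sequence are hypothetical and are not the SOD1 sequence, which would have to be pasted in from the UniProt P00441 entry.

```python
from collections import Counter


def summarize_sequence(seq: str) -> dict:
    """Return the length and the most frequent residue of a one-letter protein sequence."""
    # Remove whitespace and line breaks left over from FASTA-style formatting.
    seq = "".join(seq.split()).upper()
    counts = Counter(seq)
    residue, n = counts.most_common(1)[0]
    return {"length": len(seq), "most_frequent": residue, "count": n}


# Toy sequence for illustration only (hypothetical, NOT the real SOD1 sequence).
toy = "MGGAGKLGG"
print(summarize_sequence(toy))
```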

The SOD1 family

The protein Superoxide dismutase [Cu–Zn] (SOD1), UniProt P00441, belongs to the Cu/Zn superoxide dismutase family, a group of antioxidant enzymes widely conserved across species. Members of this family are characterized by a Greek-key β-barrel fold, the coordination of copper and zinc ions at the active site, and a functional homodimeric structure. Their primary role is to protect cells from oxidative stress by catalyzing the conversion of superoxide radicals into molecular oxygen and hydrogen peroxide.

Important

HTGAA | Part D. Group Brainstorm on Bacteriophage Engineering

One-page engineering proposal outlining your computational toolkit, its strategic justification, potential functional pitfalls, and a structured pipeline schematic:


L PROTEIN ENGINEERING PROPOSAL

This proposal addresses two goals: increased stability and higher lysis toxicity. They are not treated as separate objectives; they are coupled and pursued through a unified strategy.

The central hypothesis is that of conditional stability: the L protein can be engineered to be more resistant to host proteolytic degradation before membrane insertion, while at the same time more efficient at cooperative oligomerization after insertion. A more stable pre-insertion L survives longer in the bacterial cytoplasm; a better-oligomerizing post-insertion L generates larger membrane lesions more rapidly. Both goals are therefore addressed by the same logic.

Biological Background

A brief summary of what makes L protein unusual, and what makes it difficult to engineer:

L is not a typical lysis protein. Most phage lysis proteins work by blocking bacterial cell wall construction. L does not do that. Instead, it inserts into the inner membrane and assembles into clusters that physically tear it apart (Mezhyrova et al., 2023). The mechanism is closer to a membrane demolition event than a biochemical inhibition.

A bacterial helper protein is required. L-mediated lysis depends strictly on the host chaperone DnaJ. When DnaJ carries the P330Q mutation, lysis is completely blocked (Chamakura et al., 2017). The region of L that interacts with DnaJ is therefore treated as a hard design constraint: it cannot be altered.

The gene encodes three proteins simultaneously. The nucleotide sequence of L overlaps with the coat protein gene (in the +1 reading frame) and with the start of the replicase gene (in frame 0). This means that most nucleotide changes that would be desirable in L would simultaneously break one of the other two proteins. Only a small subset of mutations is experimentally accessible.

Computational Approach

The goal of this proposal is to use a series of AI-based computational tools to design improved variants of L before any experimental work is carried out.

Step 1. Structural reference model

The goal of this step is to establish a reliable 3D model of L in its membrane context. AlphaFold3 and Boltz-1 will be used to predict the structure of L in a membrane configuration. Per-residue confidence scores (pLDDT) will be evaluated to distinguish structurally reliable regions (the N-terminal cytoplasmic domain, aa ~1–35) from uncertain regions (the C-terminal transmembrane domain, aa ~36–75). Only high-confidence regions will be used as fixed design constraints. Low-confidence regions will be treated as designable positions.
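
The split into fixed and designable positions can be sketched as a simple threshold on per-residue pLDDT. This is only an illustration: the function name `split_by_plddt` is hypothetical, the cutoff of 70 is an assumed value (a commonly used confidence boundary, not one stated in this proposal), and the scores would have to be extracted from the prediction output beforehand.

```python
def split_by_plddt(plddt, threshold=70.0):
    """Partition 1-based residue positions into fixed and designable sets.

    plddt: per-residue confidence scores, in sequence order.
    threshold: assumed cutoff; residues at or above it are treated as reliable.
    """
    fixed, designable = [], []
    for position, score in enumerate(plddt, start=1):
        if score >= threshold:
            fixed.append(position)       # high confidence: used as a fixed constraint
        else:
            designable.append(position)  # low confidence: open to redesign
    return fixed, designable


# Toy scores for illustration only (hypothetical values).
fixed, designable = split_by_plddt([90, 85, 40, 72, 69.9])
print(fixed, designable)
```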

Step 2. Evolutionary landscape mining

The goal of this step is to learn which parts of L can be changed and which cannot. Foldseek will be used to identify structural homologs of L across the Leviviridae and Alloleviviridae phage families. These homologs will then be processed by SaProt, a protein language model that encodes both sequence and structural information. The result is a per-domain map of conserved positions (functionally critical, not to be mutated) versus variable positions (candidates for redesign). SaProt is preferred over sequence-only models such as ESM-2 for this task because it captures structural context, which is particularly important for transmembrane regions.
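
To make the idea of a conserved-versus-variable map concrete, the sketch below labels alignment columns by Shannon entropy. This is a deliberately simple stand-in, not SaProt and not the method proposed here: it assumes the homologs are already aligned to equal length, counts gaps as an extra symbol, and uses an assumed entropy cutoff of 0.5 bits. All names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column; gaps count as a symbol."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_map(alignment, max_entropy=0.5):
    """Label each column 'conserved' or 'variable' using an assumed entropy cutoff."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    labels = []
    for column in zip(*alignment):
        is_conserved = column_entropy(column) <= max_entropy
        labels.append("conserved" if is_conserved else "variable")
    return labels


# Toy alignment for illustration only (hypothetical sequences).
print(conservation_map(["MLSA", "MLTA", "MLGA"]))
```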

Step 3. Generative sequence design

The goal of this step is to generate thousands of candidate L variants that satisfy all design constraints. ProteinMPNN will be applied in two parallel layers:

Layer A (N-terminal domain): sequences will be designed to increase thermodynamic stability in solution, with the DnaJ-binding interface and the LS (Leu-Ser) motif masked as immutable constraints.

Layer B (transmembrane domain): sequences will be designed with symmetry constraints (C₂/C₃) to maximize cooperative oligomerization interfaces in a lipid bilayer environment, since oligomerization is the direct cause of membrane disruption and cell death.

All candidates will then be passed through a dual-codon overlap filter, i.e., a dynamic programming algorithm that retains only those nucleotide substitutions that produce the desired amino acid change in L (frame +1) while leaving the coat protein sequence unchanged (frame 0) and preserving the replicase reading frame. Candidates that fail this filter will be discarded regardless of their predicted fitness.
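
The core test inside such a filter can be sketched as a pair of translations. The code below is a minimal illustration, not the dynamic programming algorithm itself: it only checks whether a given substitution changes the protein read in frame +1 while leaving the frame 0 translation identical. The function names, the toy sequences, and the use of the DNA alphabet (T in place of U) are assumptions, and the replicase check is omitted.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt, frame):
    """Translate complete codons of nt starting at offset `frame` (0, 1 or 2)."""
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def passes_overlap_filter(wild_type, mutant, l_frame=1, cp_frame=0):
    """True if the mutant changes L (frame +1) but not the coat protein (frame 0)."""
    if len(wild_type) != len(mutant):
        return False  # substitutions only: indels would shift the reading frames
    changes_l = translate(wild_type, l_frame) != translate(mutant, l_frame)
    keeps_cp = translate(wild_type, cp_frame) == translate(mutant, cp_frame)
    return changes_l and keeps_cp


# Toy example (hypothetical sequences): T3C is silent in frame 0 but L->P in frame +1.
print(passes_overlap_filter("GCTGCA", "GCCGCA"))
```

A full filter would also have to confirm that the replicase start and reading frame are preserved, which this sketch does not attempt.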

Step 4. Multi-metric fitness scoring

The goal of this step is to rank surviving candidates by three independent fitness metrics:

ESM-2 log-likelihood: a global evolutionary plausibility score

SaProt fitness score: the primary metric for the transmembrane domain, incorporating structural context

Evo (Arc Institute, 2024): a genomic language model trained on millions of phage genomes, used to score the full MS2 ssRNA genome after each mutation and to penalize any disruption to the replicase operator, CP initiation site, ribosomal pause sites, or RNA packaging signals

Only candidates ranking in the top 5% across all three metrics simultaneously will be advanced.
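
The "top 5% across all three metrics simultaneously" rule amounts to intersecting the top-ranked sets of each metric. The sketch below assumes that every metric has already been computed, that higher scores are better for all of them, and that ties are broken arbitrarily; the function names and the toy scores are hypothetical.

```python
import math


def top_fraction_ids(scores, fraction=0.05):
    """Return the IDs of the best-scoring `fraction` of candidates (at least one)."""
    keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)  # higher is better (assumed)
    return set(ranked[:keep])


def advance(metrics, fraction=0.05):
    """Keep only candidates that fall in the top fraction of every metric."""
    survivors = None
    for scores in metrics.values():
        top = top_fraction_ids(scores, fraction)
        survivors = top if survivors is None else survivors & top
    return survivors or set()


# Toy scores for illustration only; a 50% cutoff is used so the example stays small.
metrics = {
    "esm2": {"a": 4, "b": 3, "c": 2, "d": 1},
    "saprot": {"a": 1, "b": 4, "c": 3, "d": 2},
    "evo": {"a": 3, "b": 4, "c": 1, "d": 2},
}
print(advance(metrics, fraction=0.5))
```

Because the three cutoffs are applied jointly, the surviving set can be much smaller than 5% of the pool, and may even be empty.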

Step 5. DnaJ interface validation

The goal of this step is to confirm that top candidates preserve the obligatory interaction with DnaJ. Boltz-2 will be used to model the L::DnaJ complex and predict binding affinity for each candidate. Candidates will be retained only if their predicted binding affinity does not decrease by more than 0.5 kcal/mol relative to wild-type L, and if their interface confidence score (ipTM) remains at or above 90% of the wild-type value.
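
The two retention criteria can be written as a single predicate. This sketch assumes that binding free energies are expressed in kcal/mol with more negative values meaning tighter binding, so a loss of affinity appears as a positive difference relative to wild type. It does not call any Boltz-2 interface; the function and parameter names are hypothetical, and the inputs would have to be extracted and converted from the prediction output beforehand.

```python
def retains_dnaj_binding(dg_wt, dg_mut, iptm_wt, iptm_mut,
                         max_ddg=0.5, min_iptm_ratio=0.9):
    """Apply the two retention criteria for the L::DnaJ interface.

    dg_wt, dg_mut: predicted binding free energies in kcal/mol (assumed sign
        convention: more negative means tighter binding).
    iptm_wt, iptm_mut: interface confidence scores for wild type and candidate.
    """
    ddg = dg_mut - dg_wt  # positive value: the candidate binds more weakly
    affinity_ok = ddg <= max_ddg
    confidence_ok = iptm_mut >= min_iptm_ratio * iptm_wt
    return affinity_ok and confidence_ok


# Toy values for illustration only (hypothetical predictions).
print(retains_dnaj_binding(dg_wt=-8.0, dg_mut=-7.6, iptm_wt=0.80, iptm_mut=0.75))
```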

Potential Pitfalls

Pitfall 1. Increased stability may reduce membrane insertion efficiency.

A more thermodynamically stable L protein may present a higher energy barrier for membrane insertion, potentially reducing rather than increasing lytic activity. This is a major risk because the tools used to predict folding stability (Rosetta ΔΔG, ThermoMPNN) model stability in aqueous solution, not insertion energy into a lipid bilayer. This risk will be mitigated by restricting stability design to the N-terminal soluble domain only, leaving the transmembrane domain free to insert. Molecular dynamics simulations in an explicit lipid bilayer (POPE:POPG 3:1, CHARMM-GUI/GROMACS) are planned for top candidates to validate insertion energy before any synthesis is ordered.

Pitfall 2. The genomic overlap constraint drastically reduces the accessible design space.

Because the L gene is encoded in the same stretch of the genome as the coat protein and replicase, most amino acid changes that would be desirable in L are simply not achievable without breaking another gene. In practice, the dual-codon overlap filter is expected to eliminate the vast majority of computationally generated candidates, potentially leaving only a few dozen experimentally accessible variants out of tens of thousands generated. This is an inherent limitation of the MS2 genome architecture, not a failure of the computational approach. It is addressed by applying the overlap filter as early as possible in the pipeline to avoid scoring inaccessible designs.