Week 4 HW: Protein Design

Part A. Conceptual Questions

1. How many amino acid molecules do you take with 500 g of meat?

If we assume that meat is approximately 20% protein, then 500 grams of meat contains about 100 grams of protein. The average molecular weight of an amino acid is roughly 100 Daltons (100 g/mol). Dividing 100 grams by 100 g/mol gives approximately 1 mole of amino acids, and one mole contains 6.02 × 10²³ molecules (Avogadro’s number). Therefore, consuming 500 grams of meat means ingesting roughly 6 × 10²³ amino acid molecules.
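The estimate above can be reproduced in a few lines of Python. This is only a sketch: the 20% protein fraction and the 100 g/mol average residue mass are the same rough assumptions used in the text, and the function name is illustrative.

```python
AVOGADRO = 6.022e23  # molecules per mole

def amino_acid_count(meat_g, protein_fraction=0.20, residue_mass_g_per_mol=100.0):
    """Estimate the number of amino acid molecules in a portion of meat."""
    protein_g = meat_g * protein_fraction        # grams of protein
    moles = protein_g / residue_mass_g_per_mol   # moles of amino acids
    return moles * AVOGADRO

estimate = amino_acid_count(500)
print(f"{estimate:.2e} amino acid molecules")
```

Changing the protein fraction to 25% or the residue mass to 110 g/mol shifts the result by only a few tens of percent, so the order of magnitude is robust to these assumptions.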

2. Why do humans eat beef but do not become a cow?

Humans do not become cows after eating beef because biological identity is determined not by the origin of consumed molecules but by our own genetic information and regulatory networks. Proteins from beef are digested into individual amino acids, which are then absorbed and reused by our cells to synthesize human proteins according to our own DNA instructions.

3. Why are there only 20 natural amino acids?

There are only 20 canonical amino acids because they provide sufficient chemical diversity to build functional proteins while maintaining evolutionary stability. These amino acids cover a wide range of chemical properties: hydrophobic, polar, charged, aromatic, flexible, and rigid. Early in evolution, once the translation machinery became established, expanding the genetic code would have introduced significant risk of translational errors.

4. Can you make non-natural amino acids? Design some.

Yes, non-natural amino acids can be synthesized chemically or incorporated through expanded genetic code technologies. For example, a fluorinated amino acid can be designed by replacing a methyl group with a trifluoromethyl group to increase hydrophobicity and protein stability. Another possibility is a photo-switchable amino acid containing an azobenzene group, allowing protein conformation to be controlled with light. A third design could include a redox-active moiety, such as a ferrocene group, enabling electron transfer within engineered proteins.

5. Where did amino acids come from before enzymes and life started?

Amino acids likely originated through prebiotic chemistry before the emergence of life. The Miller-Urey experiment demonstrated that amino acids can form spontaneously under simulated early Earth conditions involving simple gases and electrical discharge. Additionally, amino acids have been found in meteorites such as the Murchison meteorite, suggesting extraterrestrial delivery may have contributed to the prebiotic inventory.

6. If you make an α-helix using D-amino acids, what handedness would you expect?

Natural proteins are composed of L-amino acids and form predominantly right-handed α-helices. If a scientist constructs a helix entirely from D-amino acids, the chirality of each residue is inverted, which reverses the allowed backbone dihedral angles. As a result, the helix formed would be left-handed.
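The sign reversal of the dihedral angles can be checked numerically. The sketch below computes a dihedral angle from four points and shows that reflecting the coordinates (which is what swapping L for D residues amounts to geometrically) negates the angle. The four points are arbitrary toy coordinates, not real backbone atoms; for a real helix the same argument turns (φ, ψ) of roughly (−57°, −47°) into roughly (+57°, +47°).

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)                    # unit vector along the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))   # b0 projected off the central bond
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))   # b2 projected off the central bond
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

def mirror(point):
    """Reflect a point through the yz-plane (x -> -x)."""
    return (-point[0], point[1], point[2])

# Toy coordinates, chosen only to give a non-zero dihedral angle.
atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
original = dihedral_degrees(*atoms)
mirrored = dihedral_degrees(*(mirror(p) for p in atoms))
print(original, mirrored)  # equal magnitude, opposite sign
```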

7. Why are most molecular helices right-handed?

Most molecular helices in biological systems are right-handed because life is based almost exclusively on L-amino acids. Their stereochemistry constrains backbone geometry in a way that energetically favors right-handed helices, minimizing steric clashes and optimizing hydrogen bonding. If life had evolved using D-amino acids instead, left-handed helices would likely dominate.

8. Why do β-sheets tend to aggregate?

- What is the driving force?

β-sheets tend to aggregate because their backbone hydrogen bond donors and acceptors become exposed when proteins partially unfold. These exposed regions seek to form hydrogen bonds to stabilize themselves, often binding to similar β-strands from other molecules. The aggregation is driven by intermolecular hydrogen bonding, hydrophobic interactions, and the entropic gain associated with releasing ordered water molecules from hydrophobic surfaces.

9. Why do many amyloid diseases form β-sheets?

- Can you use amyloid β-sheets as materials?

Many amyloid diseases, including Alzheimer’s disease, involve the misfolding of proteins into stable cross-β-sheet fibrils. These proteins are particularly prone to forming highly ordered, self-templating aggregates that grow through nucleation-dependent polymerization. Although pathological in a biological context, these same structural properties make amyloid fibrils attractive as biomaterials: engineered amyloid-like peptides can form hydrogels, scaffolds for tissue engineering, and nanostructured materials.

Part B: Protein Analysis and Visualization

1. Briefly describe the protein you selected and why you selected it

The protein I selected is the ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (RbcL) with UniProt accession Q8DIS5, entry name RBL_THEVB. This protein comes from the thermophilic cyanobacterium Thermosynechococcus vestitus BP-1. RuBisCO is the central enzyme of the Calvin cycle and is responsible for fixing atmospheric CO₂ into organic carbon during photosynthesis. I selected this protein because carbon fixation underlies all plant biomass production and therefore directly impacts agriculture, food security, and global carbon cycling. Studying RuBisCO at the structural level provides insight into how photosynthetic efficiency might be improved in crops.

https://www.uniprot.org/uniprotkb/Q8DIS5/entry

2. Identify the amino acid sequence of your protein.

MAYTQSKSQKVGYQAGVKDYRLTYYTPDYTPKDTDILAAFRVTPQPGVPFEEAAAAVAAESSTGTWTTVWTDLLTDLDRYKGCCYDIEPLPGEDNQFIAYIAYPLDLFEEGSVTNMLTSIVGNVFGFKALKALRLEDLRIPVAYLKTFQGPPHGIQVERDKLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENINSQPFQRWRDRFLFVADAIHKAQAETGEIKGHYLNVTAPTCEEMLKRAEFAKELEMPIIMHDFLTAGFTANTTLSKWCRDNGMLLHIHRAMHAVMDRQKNHGIHFRVLAKCLRMSGGDHIHTGTVVGKLEGDKAVTLGFVDLLRENYIEQDRSRGIYFTQDWASMPGVMAVASGGIHVWHMPALVDIFGDDAVLQFGGGTLGHPWGNAPGATANRVALEACIQARNEGRDLMREGGDIIREAARWSPELAAACELWKEIKFEFEAQDTI

- How long is it? What is the most frequent amino acid? You can use this Colab notebook to count the frequency of amino acids.

The protein is 475 amino acids long. The most common amino acid is A (alanine), which appears 47 times.
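For reference, the length and residue counts can be obtained with the standard library alone. The sketch below is not the course Colab notebook; the function name is illustrative, and the demo runs on only the first 20 residues of the sequence to keep the example short.

```python
from collections import Counter

def residue_frequencies(sequence):
    """Return (length, [(residue, count), ...]) sorted by descending count."""
    cleaned = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    counts = Counter(cleaned)
    return len(cleaned), counts.most_common()

# First 20 residues of Q8DIS5, used here only as a short demonstration input.
fragment = "MAYTQSKSQKVGYQAGVKDY"
length, ranked = residue_frequencies(fragment)
print(length, ranked[:3])
```

Passing the full sequence instead of the fragment gives the total length and the most frequent residue for the whole protein.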

- How many protein sequence homologs are there for your protein? Hint: Use Uniprot’s BLAST tool to search for homologs.

When the amino acid sequence of Q8DIS5 was analyzed using UniProt’s BLAST tool, the search returned 250 homologous protein sequences. These homologs correspond primarily to RuBisCO large subunits from cyanobacteria and photosynthetic organisms. The presence of numerous homologs indicates that this protein is highly conserved across species, reflecting its essential role in carbon fixation and global photosynthetic metabolism.

- Does your protein belong to any protein family?

Yes, the protein Q8DIS5 (RBL_THEVB) belongs to the RuBisCO superfamily, specifically Form I RuBisCO large subunits (RbcL family). This protein family includes the catalytic large chains of ribulose-1,5-bisphosphate carboxylase/oxygenase enzymes found in cyanobacteria, plants, and algae. Members of this family share highly conserved sequence motifs that are essential for carbon fixation, including residues involved in substrate binding and magnesium coordination at the active site.

3. Identify the structure page of your protein in RCSB

- When was the structure solved? Is it a good quality structure? Good quality structure is the one with good resolution. Smaller the better (Resolution: 2.70 Å)

The structure 2YBV was solved by X-ray crystallography, with:

Deposited: March 10, 2011

Released: March 28, 2012

Resolution: 2.30 Å

A resolution of 2.30 Å is generally considered good quality for structural analysis and visualization: at this resolution, the positions of most amino acid side chains, cofactors, and active-site features are well defined.

- Are there any other molecules in the solved structure apart from protein?

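Apart from the protein chains, crystallized RuBisCO structures typically include magnesium ions (Mg²⁺), which are essential cofactors for catalysis, as well as substrate analogs and water molecules that help stabilize the active site. In the 2YBV model itself, however, the only heteroatoms are water molecules (HOH); no Mg²⁺ ion or substrate analog is modeled.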

- Does your protein belong to any structure classification family?

Structurally, the RuBisCO large subunit belongs to the alpha/beta protein class with a conserved α/β barrel-like fold characteristic of the RuBisCO superfamily, illustrating both its functional role in catalysis and its classification within established structural families.

4. Open the structure of your protein in any 3D molecule visualization software: PyMol Tutorial Here (hint: ChatGPT is good at PyMol commands)

  • Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.
  • Color the protein by secondary structure. Does it have more helices or sheets?
  • Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

  • Visualize the surface of the protein. Does it have any “holes” (aka binding pockets)?

When visualized in cartoon representation, the protein displays a mixed α/β architecture. A central beta-sheet core is surrounded by multiple alpha helices, forming the characteristic α/β fold typical of RuBisCO large subunits. The beta sheets form the structural core of the enzyme, while the helices surround and stabilize this framework. Overall, the structure contains both helices and sheets, with a prominent beta-sheet catalytic core.

When colored by secondary structure, the beta sheets appear concentrated in the central region, while alpha helices are distributed around the periphery. This confirms that the enzyme follows a classical α/β barrel-like organization commonly observed in metabolic enzymes. Coloring the structure by residue type reveals that hydrophobic residues are predominantly located in the interior of the protein, forming a stable hydrophobic core. In contrast, hydrophilic residues are mostly exposed on the surface, consistent with a soluble enzyme that functions in an aqueous cellular environment. This distribution reflects proper protein folding and stability in aqueous environments.

Surface visualization of 2YBV reveals clear cavities and clefts between structural domains, indicating the presence of binding pockets. However, this specific crystal structure represents an apo form of the enzyme, as it does not contain a modeled Mg²⁺ ion or bound substrate. The only heteroatoms present in this structure are water molecules (HOH).

To visualize the catalytic metal ion, an additional RuBisCO structure (PDB ID 4RUB) was examined. In this structure, the Mg²⁺ ion appears as a green sphere located within a deep catalytic pocket and coordinated by nearby residues, confirming the structural location of the active site. Although Mg²⁺ is essential for enzymatic activity, it is not present in the 2YBV model itself.
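The visualization steps described above can be collected into a PyMOL command script. The sketch below only builds the script text in Python so it can be saved as a .pml file and run inside PyMOL; the choice of colors, the hydrophobic residue list, and the sphere and stick sizes are assumptions made for illustration, not the exact settings used for the figures.

```python
# Residues treated as hydrophobic here (an illustrative choice, not a standard).
HYDROPHOBIC = "ALA+VAL+LEU+ILE+MET+PHE+TRP+PRO"

def build_pymol_script(pdb_id="2YBV"):
    """Return PyMOL commands for the representations discussed in the text."""
    lines = [
        f"fetch {pdb_id}",
        "hide everything",
        "# cartoon colored by secondary structure",
        "show cartoon",
        "color red, ss h",
        "color yellow, ss s",
        "color green, ss l+''",
        "# ribbon and a ball-and-stick style view",
        "show ribbon",
        "show sticks",
        "show spheres",
        "set sphere_scale, 0.25",
        "set stick_radius, 0.15",
        "# hydrophobic vs. other residues",
        f"select hydrophobic, resn {HYDROPHOBIC}",
        "color orange, hydrophobic",
        "color cyan, polymer and not hydrophobic",
        "# semi-transparent surface to look for pockets",
        "show surface",
        "set transparency, 0.3",
    ]
    return "\n".join(lines) + "\n"

script = build_pymol_script()
print(script)
```

In practice each representation is easier to inspect on its own, so the blocks would normally be run one at a time with `hide everything` in between.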

Part C. Using ML-Based Protein Design Tools

C1. Protein Language Modeling

Deep Mutational Scans

a) Use ESM2 to generate an unsupervised deep mutational scan of your protein based on language model likelihoods.

b) Can you explain any particular pattern? (choose a residue and a mutation that stands out)

Distinct vertical bands of strongly negative scores are observed at several positions, indicating that most substitutions at these sites are predicted to be unfavorable. These residues appear to be highly constrained, suggesting structural or functional importance. For example, substitution of a hydrophobic wild-type residue with a chemically dissimilar charged residue at one of these constrained positions results in a pronounced decrease in log-likelihood. This pattern is consistent with disruption of structural stability, particularly if the residue contributes to hydrophobic packing or functional integrity. In contrast, other positions exhibit relatively mild score variations across multiple substitutions, indicating higher mutational tolerance. These sites likely correspond to surface-exposed or flexible regions of the protein.
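The scores in such a scan are commonly computed as the log-likelihood of the mutant residue minus that of the wild-type residue at each position. The sketch below shows only that scoring step; the per-position logits would come from ESM2 in the real pipeline, whereas here they are hand-made toy values, and the function names are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]

def mutation_scores(sequence, logits_per_position):
    """For each position, score[aa] = log p(aa) - log p(wild type)."""
    scores = []
    for wild_type, logits in zip(sequence, logits_per_position):
        log_probs = log_softmax(logits)
        wt_log_prob = log_probs[AMINO_ACIDS.index(wild_type)]
        scores.append({aa: log_probs[i] - wt_log_prob
                       for i, aa in enumerate(AMINO_ACIDS)})
    return scores

# Toy logits (NOT model output): position 1 strongly prefers L, position 2 barely prefers K.
constrained = [0.0] * 20
constrained[AMINO_ACIDS.index("L")] = 8.0
tolerant = [0.0] * 20
tolerant[AMINO_ACIDS.index("K")] = 0.5

scores = mutation_scores("LK", [constrained, tolerant])
print(round(scores[0]["D"], 2), round(scores[1]["D"], 2))
```

A constrained position produces a column of strongly negative scores (the vertical bands in the heatmap), while a tolerant position produces values close to zero.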

Latent Space Analysis

a) Use the provided sequence dataset to embed proteins in reduced dimensionality.

b) Analyze the different formed neighborhoods: do they approximate similar proteins?

c) Place your protein in the resulting map and explain its position and similarity to its neighbors.

To evaluate the biological relevance of the learned embedding space, my protein of interest, RuBisCO (PDB ID: 2YBV), was embedded using the same ESM-2 pipeline applied to the SCOPe dataset. The resulting embedding was projected into the same reduced-dimensional space and analyzed using cosine similarity in the original high-dimensional representation.

The nearest neighbor to my protein (cosine similarity = 0.9904) corresponds to another ribulose-1,5-bisphosphate carboxylase-oxygenase from Oryza sativa. This extremely high similarity value indicates that the embedding space accurately captures functional and structural conservation. Given that RuBisCO is a highly conserved enzyme involved in carbon fixation, this result is biologically consistent and expected.

Beyond the closest homolog, additional neighboring proteins include large metabolic enzymes such as aconitase, nitrogenase, and ornithine decarboxylase. Many of these proteins belong to α/β structural classes in SCOPe, suggesting that the latent space organizes proteins not only by specific biochemical function but also by shared structural architecture.
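The neighbor ranking rests on cosine similarity between embedding vectors. The sketch below illustrates that computation with three-dimensional toy vectors; real ESM-2 embeddings have hundreds of dimensions or more, and the protein labels and values here are placeholders rather than data from the SCOPe set.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, k=3):
    """Return the k most similar entries as (name, similarity), best first."""
    ranked = sorted(((cosine_similarity(query, vec), name)
                     for name, vec in embeddings.items()), reverse=True)
    return [(name, score) for score, name in ranked[:k]]

# Placeholder embeddings, invented for illustration only.
toy_embeddings = {
    "rubisco_homolog": [0.9, 0.1, 0.0],
    "alpha_beta_enzyme": [0.6, 0.6, 0.1],
    "unrelated_fold": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]
for name, score in nearest_neighbors(query, toy_embeddings):
    print(f"{name}: {score:.4f}")
```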

C2. Protein Folding

Folding a protein

  1. Fold your protein with ESMFold. Do the predicted coordinates match your original structure?

Total sequence length: 475
Running ESMFold inference for sequence with length 475…
Prediction complete.
ptm: 0.919 plddt: 90.429
Results saved to test_d3e9a/
CPU times: user 1min 33s, sys: 8.76 s, total: 1min 41s
Wall time: 2min 15s

ExecutiveAlign: 3395 atoms aligned.
ExecutiveRMS: 132 atoms rejected during cycle 1 (RMSD=1.74).
ExecutiveRMS: 183 atoms rejected during cycle 2 (RMSD=1.04).
ExecutiveRMS: 97 atoms rejected during cycle 3 (RMSD=0.85).
ExecutiveRMS: 46 atoms rejected during cycle 4 (RMSD=0.79).
ExecutiveRMS: 22 atoms rejected during cycle 5 (RMSD=0.77).
Executive: RMSD = 0.761 (2868 to 2868 atoms)

The amino acid sequence of the protein (475 residues) was folded using ESMFold to predict its three-dimensional structure. The predicted model showed high confidence, with a pTM score of 0.919 and a mean pLDDT of approximately 90, indicating a reliable structural prediction. The predicted structure was then aligned with the experimentally determined structure PDB 2YBV using PyMOL. The alignment gave an RMSD of 0.761 Å over 2868 atoms after five cycles of outlier rejection, indicating that the predicted coordinates match the experimental structure very closely.
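The alignment log shows the RMSD dropping as poorly matching atoms are rejected over several cycles. The sketch below illustrates that idea on already superposed coordinate pairs. It is a simplification: a real alignment refits the superposition after each cycle, and the default values of five cycles and a cutoff of twice the current RMSD are assumptions chosen to mirror the log, not a confirmed description of PyMOL's internals.

```python
import math

def rmsd(pairs):
    """Root-mean-square deviation over (point_a, point_b) coordinate pairs."""
    squared = [sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs]
    return math.sqrt(sum(squared) / len(squared))

def rmsd_with_rejection(pairs, cycles=5, cutoff=2.0):
    """Iteratively drop pairs farther apart than cutoff * current RMSD."""
    kept = list(pairs)
    for _ in range(cycles):
        current = rmsd(kept)
        if current == 0:
            break
        filtered = [(p, q) for p, q in kept if math.dist(p, q) <= cutoff * current]
        if len(filtered) == len(kept):
            break  # nothing rejected, so further cycles would change nothing
        kept = filtered
    return rmsd(kept), len(kept)

# Toy data: nine atom pairs 1 Å apart plus one clear outlier 10 Å apart.
toy_pairs = [((float(i), 0.0, 0.0), (float(i), 1.0, 0.0)) for i in range(9)]
toy_pairs.append(((20.0, 0.0, 0.0), (20.0, 10.0, 0.0)))

print(round(rmsd(toy_pairs), 2))
print(rmsd_with_rejection(toy_pairs))
```

This is why the final RMSD is reported over fewer atoms than were initially aligned: the value describes the well-matching core, not every atom.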

  2. Try changing the sequence, first try some mutations, then large segments. Is your protein structure resilient to mutations?

To evaluate the effect of mutations on the protein structure, several point mutations were introduced into the amino acid sequence, including V→A, L→A, F→Y, G→A, and I→A substitutions at different positions along the 475-residue protein. The modified sequence was then folded using ESMFold (pTM 0.921, pLDDT 90.4) and compared with the experimental structure PDB 2YBV using PyMOL. Structural alignment again produced an RMSD of 0.761 Å, the same value obtained for the unmutated prediction, indicating very high structural similarity between the mutated model and the experimental structure. These results suggest that the protein fold is resilient to this small set of point substitutions, maintaining its overall three-dimensional structure.

Total sequence length: 475
Running ESMFold inference for sequence with length 475…
Prediction complete.
ptm: 0.921 plddt: 90.388
Results saved to test_cdeb6/
CPU times: user 1min 45s, sys: 8.49 s, total: 1min 53s
Wall time: 2min 23s

Match: read scoring matrix.
Match: assigning 475 x 5591 pairwise scores.
MatchAlign: aligning residues (475 vs 5591)…
MatchAlign: score 2538.000
ExecutiveAlign: 3395 atoms aligned.
ExecutiveRMS: 132 atoms rejected during cycle 1 (RMSD=1.74).
ExecutiveRMS: 183 atoms rejected during cycle 2 (RMSD=1.04).
ExecutiveRMS: 97 atoms rejected during cycle 3 (RMSD=0.85).
ExecutiveRMS: 46 atoms rejected during cycle 4 (RMSD=0.79).
ExecutiveRMS: 22 atoms rejected during cycle 5 (RMSD=0.77).
Executive: RMSD = 0.761 (2868 to 2868 atoms)

modified sequence = “MAYTQSKSQKVGYQAGVKDYRLTYYTPDYTPKDTDILAAFRVTPQPGVPFEEAAAAVAAESSTGTWTTVWTDLLTDLDRYKGCCYDIEPLPGEDNQFIAYIAYPLDLFEEGSVTNMLTSIVGNVFGFKALKALRLEDLRIPVAYLKTFQGPPHGIQVERDKLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENINSQPFQRWRDRFLFVADAIHKAQAETGEIKGHYLNVTAPTCEEMLKRAEFAKELEMPIIMHDFLTAGFTANTTLSKWCRDNGMLLHIHRAMHAVMDRQKNHGIHFRVLAKCLRMSGGDHIHTGTVVGKLEGDKAVTLGFVDLLRENYIEQDRSRGIYFTQDWASMPGVMAVASGGIHVWHMPALVDIFGDDAVLQFGGGTLGHPWGNAPGATANRVALEACIQARNEGRDLMREGGDIIREAARWSPELAAACELWKEIKFEFEAQDTI”
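A small helper makes it easier to apply point mutations reproducibly and to confirm which positions actually differ between two sequences before folding. This is a sketch with illustrative function names; the mutations in the demo (V11A, G12A on the first 18 residues) are examples only and are not necessarily the positions mutated above.

```python
import re

def apply_point_mutations(sequence, mutations):
    """Apply mutations written like 'V11A' (wild type, 1-based position, new residue)."""
    residues = list(sequence)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        wild_type, position, new_residue = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        if residues[position - 1] != wild_type:
            raise ValueError(f"expected {wild_type} at {position}, found {residues[position - 1]}")
        residues[position - 1] = new_residue
    return "".join(residues)

def list_differences(original, modified):
    """List substitutions between two equal-length sequences in 'V11A' notation."""
    if len(original) != len(modified):
        raise ValueError("sequences must have the same length")
    return [f"{a}{i}{b}" for i, (a, b) in enumerate(zip(original, modified), start=1) if a != b]

prefix = "MAYTQSKSQKVGYQAGVK"  # first 18 residues of Q8DIS5
mutated = apply_point_mutations(prefix, ["V11A", "G12A"])
print(mutated)
print(list_differences(prefix, mutated))
```

Running `list_differences` on the full original and modified sequences is a quick check that the intended substitutions are present in the input given to ESMFold.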

C3. Protein Generation

Inverse-Folding a protein: Let’s now use the backbone of your chosen PDB to propose sequence candidates via ProteinMPNN

  1. Analyze the predicted sequence probabilities and compare the predicted sequence vs the original one.

The amino acid probability heatmap generated by ProteinMPNN shows the likelihood of each amino acid at every position of the protein backbone. In the heatmap, several positions display strong probabilities for a single amino acid, which appear as bright signals for specific residues. These positions likely correspond to structurally constrained residues that are important for maintaining the stability of the protein fold. Other positions show more distributed probabilities across multiple amino acids, suggesting that these regions are more flexible and can tolerate substitutions.

When comparing the predicted sequence to the original sequence from PDB 2YBV, the sequence recovery was approximately 47.2%, meaning that about half of the residues match the native sequence while many positions were redesigned by the model. This observation is consistent with the heatmap, which indicates that while some residues are strongly constrained by the structure, other positions allow multiple amino acids. Overall, this result demonstrates that several different amino acid sequences can be compatible with the same protein backbone.

Generating sequences…

>2YBV, score=1.5070, fixed_chains=[], designed_chains=[‘A’], model_name=v_48_020 TYYTPDYTPKDTDILAAFRVTPQPGVPFEEAAAAVAAESSTGXXXXXWTDLLTDLDRYKGCCYDIEPLPGEDNQFIAYIAYPLDLFEEGSVTNMLTSIVGNVFGFKALKALRLEDLRIPVAYLKTFQGPPHGIQVERDKLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENINSQPFQRWRDRFLFVADAIHKAQAETGEIKGHYLNVTAPTCEEMLKRAEFAKELEMPIIMHDFLTAGFTANTTLSKWCRDNGMLLHIHRAMHAVMDRQKNHGIHFRVLAKCLRMSGGDHIHTGTVVGKLEGDKAVTLGFVDLLRENYIEQDRSRGIYFTQDWASMPGVMAVASGGIHVWHMPALVDIFGDDAVLQFGXXXXGHPWGNAPGATANRVALEACIQARNEGRDLMREGGDIIREAARWSPELAAAC

>T=0.1, sample=0, score=0.7574, seq_recovery=0.4720 KYLRLDYVPKDDDILAAFLVTPKPGVPPEEAAAAVARESSVGXXXXXEELFEIDLEKYRAVCYKITPLPGADNRFLAYVAYPLSLFKPGSVTDLWNTIVGRVFEHPLLAALKLLDLRIPPSFIATFPGPPLGIEAVRARLGIHGRPLLGFAIRPREGLSPAEYGERARAILEGGADFTFDHGSVRDHPWAKFEDRAAAVAEAIRAAEAATGRRKAHAVNISAPTLEEALARADLCKSLGLPMISFNYLRDGFDAAAAIAAYCREHGIILYAGTGGISKVSSDPDRGVDRRVYYKLIRLTGADAIAVGSLKGKDPETRAKILGTVRLITEDYVEADPSKGIYFSQDWAGLPGIIPVVAGGLDVTDIPALVAEWGDDAILLFGXXXXAHPLGLRAGAAARRAALDAAVAARKAGVDLAKDGAAVLAAAAATNPELAAML
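Sequence recovery is the fraction of positions where the designed residue matches the native one. The sketch below computes it for two aligned sequences, skipping positions marked X (unresolved residues in the structure). Whether ProteinMPNN treats such positions in exactly this way is an assumption, and the short sequences in the demo are toy inputs.

```python
def sequence_recovery(native, designed, skip_chars="X-"):
    """Fraction of compared positions where designed == native."""
    if len(native) != len(designed):
        raise ValueError("sequences must be aligned and of equal length")
    compared = [(n, d) for n, d in zip(native, designed)
                if n not in skip_chars and d not in skip_chars]
    if not compared:
        raise ValueError("no comparable positions")
    matches = sum(n == d for n, d in compared)
    return matches / len(compared)

# Toy example: 4 of the 5 comparable positions agree.
print(sequence_recovery("ACDXEF", "ACEXEF"))
```

Applied to the two sequences printed above, the same calculation should give a value close to the reported seq_recovery of 0.4720.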

  2. Input this sequence into ESMFold and compare the predicted structure to your original.

The sequence generated by ProteinMPNN was then folded using ESMFold to predict its three-dimensional structure. The predicted structure was subsequently compared to the original structure from PDB 2YBV.

Despite the differences between the designed sequence and the native sequence, the predicted structure maintained a similar overall fold. This suggests that the designed sequence is structurally compatible with the original backbone. These results illustrate an important principle of protein design: multiple distinct amino acid sequences can adopt very similar three-dimensional structures when the structural constraints of the protein backbone are preserved.

Generating sequences…

>2YBV, score=1.5417, fixed_chains=[], designed_chains=[‘A’], model_name=v_48_020 TYYTPDYTPKDTDILAAFRVTPQPGVPFEEAAAAVAAESSTGXXXXXWTDLLTDLDRYKGCCYDIEPLPGEDNQFIAYIAYPLDLFEEGSVTNMLTSIVGNVFGFKALKALRLEDLRIPVAYLKTFQGPPHGIQVERDKLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENINSQPFQRWRDRFLFVADAIHKAQAETGEIKGHYLNVTAPTCEEMLKRAEFAKELEMPIIMHDFLTAGFTANTTLSKWCRDNGMLLHIHRAMHAVMDRQKNHGIHFRVLAKCLRMSGGDHIHTGTVVGKLEGDKAVTLGFVDLLRENYIEQDRSRGIYFTQDWASMPGVMAVASGGIHVWHMPALVDIFGDDAVLQFGXXXXGHPWGNAPGATANRVALEACIQARNEGRDLMREGGDIIREAARWSPELAAAC

>T=0.1, sample=0, score=0.7618, seq_recovery=0.4720 KYYRPDYVPKPDDILVAFLVVPKPGVPPEEAAASVAAKSSVGXXXXXLLALLIDLEKYRAVCYEIRPLPGAENRFLAYVAYPLSRFKPGSVTDLFNTIFGNVFSHPDLEALRLLDIRIPPAFIATFPGPPLGIDAVREKLGIYGRPLLAFSIRPRLGLSAAERGERAYEILKGGADITKDPSSLTDQPFCPFADLAREVAAAIRRAEAETGRKKAHAVNISAPDLAAALARLDLCVALGLPMVKFNFLTDGFEAAAAVARRCRETGVILYADTTGISAVSSDPLRGIDKRVYYKLIRLTGADAIEVGTLKGKDPETQAKILGTIDLITKDRVEKDESKGIYFTQDWAGLPGIIPVVSGGLTVKDIPYLVDAYGDNAIIEFGXXXXAHPKGLAAGAKAYKTALDAAVAAKKAGVDLKKDGEKVLADAAKTNPELAAML

Part D. Group Brainstorm on Bacteriophage Engineering

  1. Find a group of ~3–4 students

  2. Read through the Phage Reading material listed under “Reading & Resources” below.

  3. Review the Bacteriophage Final Project Goals for engineering the L Protein:

  • Increased stability (easiest)
  • Higher titers (medium)
  • Higher toxicity of lysis protein (hard)
  4. Brainstorm Session
  • Choose one or two main goals from the list that you think you can address computationally (e.g., “We’ll try to stabilize the lysis protein,” or “We’ll attempt to disrupt its interaction with E. coli DnaJ.”).

  • Write a 1-page proposal (bullet points or short paragraphs) describing:

    • Which tools/approaches from recitation you propose using (e.g., “Use Protein Language Models to do in silico mutagenesis, then AlphaFold-Multimer to check complexes.”).

    • Why do you think those tools might help solve your chosen sub-problem?

    • Name one or two potential pitfalls (e.g., “We lack enough training data on phage–bacteria interactions.”).

    • Include a schematic of your pipeline.

  • This resource may be useful: HTGAA Protein Engineering Tools

  5. Each individually put your plan on your HTGAA website
  • Include your group’s short plan for engineering a bacteriophage

PROJECT OBJECTIVE

  • Engineer the L protein of the MS2 phage to increase structural stability.
  • Disrupt or reduce its interaction with the bacterial chaperone DnaJ.
  • Preserve the C-terminal lysis domain to maintain lytic function.
  • Avoid mutations that interfere with structurally or evolutionarily coupled residues.

Phase 1: Mapping the DnaJ Interaction Interface

Since the exact binding interface between the L protein and DnaJ is unknown, the first step is to identify it computationally rather than introducing arbitrary mutations.

  • Use AlphaFold-Multimer to model the complex between L protein and DnaJ.
  • Generate multiple structural predictions and select the top-ranked models.
  • Identify consensus interface residues that consistently appear in the predicted binding interface.
  • Perform in silico alanine scanning of the N-terminal residues in the complex to determine which residues significantly contribute to binding energy (ΔΔG).
  • Analyze whether the N-terminal region resembles known DnaJ-binding motifs, typically hydrophobic residues flanked by basic amino acids.

This phase identifies the residues that are critical for the interaction, so that later mutations target them deliberately rather than at random.
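The consensus step in this phase can be expressed as a simple vote across models. In the sketch below, each predicted model contributes the set of L-protein residue numbers it places at the interface, and a residue is kept if it appears in at least a chosen fraction of models. The residue numbers and the 80% threshold are hypothetical; extracting interface residues from AlphaFold-Multimer output is a separate step not shown here.

```python
from collections import Counter

def consensus_interface(models, min_fraction=0.8):
    """Residues found at the interface in at least min_fraction of the models."""
    if not models:
        raise ValueError("at least one model is required")
    counts = Counter(residue for residues in models for residue in set(residues))
    return sorted(r for r, n in counts.items() if n / len(models) >= min_fraction)

# Hypothetical interface residue numbers from five ranked models.
predicted_interfaces = [
    {3, 5, 8},
    {3, 5, 9},
    {3, 5, 8},
    {3, 6, 8},
    {3, 5, 8},
]
print(consensus_interface(predicted_interfaces))
```

Residues that pass this vote are the natural candidates for the in silico alanine scan.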

Phase 2: Targeted N-Terminal Redesign

Instead of deleting regions or performing extensive random substitutions, introduce controlled chemical modifications to disrupt interaction while preserving structural stability.

  • Focus on charge inversion strategies:

    • Basic residues (K, R) → Acidic residues (E, D)
    • Acidic residues (E, D) → Basic residues (K, R)
  • Disrupt hydrophobic interaction patches:

    • Hydrophobic residues (L, I, V, F) → Polar residues (S, T, N, Q)
    • Aromatic residues (F, Y, W) → Aliphatic or small residues
  • Generate a graded library of variants:

    • Minor charge modifications
    • Moderate interface perturbations
    • Strong hydrophobic disruption

This yields a graded set of variants from which a Pareto front can be identified, balancing reduced DnaJ interaction against preserved protein stability.
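The substitution rules listed in this phase can be turned into a small variant generator. The sketch below proposes single mutants at chosen interface positions according to one rule at a time. The specific replacement residues for aromatics (A, V, G) are one illustrative reading of "aliphatic or small", and the demo sequence and positions are made up; they are not the MS2 L protein.

```python
SUBSTITUTION_RULES = {
    "charge_inversion": {"K": "ED", "R": "ED", "E": "KR", "D": "KR"},
    "hydrophobic_to_polar": {"L": "STNQ", "I": "STNQ", "V": "STNQ", "F": "STNQ"},
    "aromatic_to_small": {"F": "AVG", "Y": "AVG", "W": "AVG"},  # illustrative choice
}

def propose_single_mutants(sequence, interface_positions, rule):
    """Return (label, mutant_sequence) pairs for one substitution rule.

    Positions are 1-based; positions whose residue the rule does not cover are skipped.
    """
    table = SUBSTITUTION_RULES[rule]
    variants = []
    for position in interface_positions:
        wild_type = sequence[position - 1]
        for new_residue in table.get(wild_type, ""):
            mutant = sequence[:position - 1] + new_residue + sequence[position:]
            variants.append((f"{wild_type}{position}{new_residue}", mutant))
    return variants

toy_sequence = "MKRLFDE"  # hypothetical peptide, for demonstration only
for label, mutant in propose_single_mutants(toy_sequence, [2, 3, 6], "charge_inversion"):
    print(label, mutant)
```

Combining single mutants from the different rules gives the minor, moderate, and strong perturbation tiers of the graded library.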

Phase 3: Stability and Functional Filtering

To ensure that redesigned variants remain structurally viable and functionally relevant:

  • Use Rosetta or FoldX to calculate ΔΔG and verify that mutations do not destabilize the overall protein fold.

  • Confirm that mutations in the N-terminal region do not propagate structural stress toward the C-terminal lysis domain.

  • Perform co-evolutionary analysis (e.g., EVcouplings):

    • Identify residue pairs that co-evolved between the N-terminal and C-terminal regions.
    • Avoid mutating co-evolved residues independently to prevent functional disruption.
  • Evaluate aggregation propensity using tools such as Aggrescan3D to ensure that mutations do not create exposed hydrophobic patches leading to cytoplasmic aggregation.

  • Assess sequence plausibility using protein language models such as ESM to filter out unlikely or non-natural variants.
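The filtering steps above can be combined into a short selection routine: discard variants that fail hard thresholds, then keep the Pareto-optimal ones with respect to fold stability and loss of DnaJ binding. Everything numeric in the sketch below is hypothetical, including the thresholds and the scores, which in the real pipeline would come from tools such as FoldX or Rosetta and a protein language model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    name: str
    fold_ddg: float     # predicted change in folding stability; lower is better
    binding_ddg: float  # predicted change in DnaJ binding energy; higher means weaker binding
    plm_score: float    # language-model plausibility; higher is better

def passes_filters(variant, max_fold_ddg=1.0, min_plm_score=-2.0):
    """Hard filters; both thresholds are placeholders, not recommended values."""
    return variant.fold_ddg <= max_fold_ddg and variant.plm_score >= min_plm_score

def pareto_front(variants):
    """Variants not dominated on (low fold_ddg, high binding_ddg)."""
    front = []
    for v in variants:
        dominated = any(
            o.fold_ddg <= v.fold_ddg and o.binding_ddg >= v.binding_ddg
            and (o.fold_ddg < v.fold_ddg or o.binding_ddg > v.binding_ddg)
            for o in variants
        )
        if not dominated:
            front.append(v)
    return front

# Invented scores for five hypothetical variants.
candidates = [
    Variant("variant_A", 0.2, 0.5, -0.5),
    Variant("variant_B", 0.8, 2.0, -1.0),
    Variant("variant_C", 0.9, 1.0, -1.2),
    Variant("variant_D", 3.0, 3.0, -0.3),
    Variant("variant_E", 0.5, 1.5, -4.0),
]
survivors = [v for v in candidates if passes_filters(v)]
selected = pareto_front(survivors)
print([v.name for v in selected])
```

Aggregation propensity and co-evolution constraints could be added as further hard filters in `passes_filters` without changing the Pareto step.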

Key Limitations:

  • The DnaJ binding mode may be transient or dynamic, reducing AlphaFold-Multimer accuracy.
  • Protein language model scores do not guarantee in vivo functionality.
  • Intrinsically disordered regions may not be accurately modeled.
  • Computational predictions must ultimately be validated experimentally.