Week 5 HW: Protein Design II

Part 1

1. Retrieve SOD1 sequence and introduce A4V

MATKVVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACGVIGIAQ
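
The substitution step can be sketched as follows. This is a minimal illustration, not the exact procedure used in the assignment notebook. It assumes the wild-type sequence below matches the human SOD1 entry (UniProt P00441), which should be checked against the database record, and it assumes the usual SOD1 convention in which A4V is numbered without the initiator methionine, so alanine 4 is the fifth residue of the Met-containing sequence. The helper name `substitute` and its `offset` parameter are hypothetical.

```python
# Assumed wild-type human SOD1 sequence (verify against UniProt P00441).
SOD1_WT = (
    "MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGG"
    "PKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLACG"
    "VIGIAQ"
)


def substitute(seq: str, wt: str, pos: int, mut: str, offset: int = 0) -> str:
    """Apply a point substitution such as A4V.

    pos is 1-indexed in the mutation's own numbering; offset is the number of
    leading residues (here the initiator Met) that this numbering skips.
    """
    idx = pos - 1 + offset
    if seq[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[idx]}")
    return seq[:idx] + mut + seq[idx + 1:]


# A4V in mature-protein numbering: skip the initiator Met with offset=1.
sod1_a4v = substitute(SOD1_WT, "A", 4, "V", offset=1)
print(sod1_a4v[:12])
```

Checking the wild-type residue before replacing it catches off-by-one numbering mistakes, which are easy to make when a mutation name omits the initiator methionine.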

2. PepMLM-generated 12-aa peptides + known binder

Perplexity is the PepMLM score for each generated peptide (lower = higher model confidence in the binder).

ID	Peptide (12 aa)	Source	Perplexity (PepMLM)
P1	WHSPVAAARLKE	PepMLM	11.721713
P2	KRYGAAAARHKK	PepMLM	11.211369
P3	WRYPVAGLALKE	PepMLM	13.068802
P4	WHSPPAAVALGE	PepMLM	12.159801
Ref	FLYRWLPSRRGG	known SOD1 binder	N/A

Observation (PepMLM confidence by perplexity):
Among generated candidates, P2 (KRYGAAAARHKK) has the lowest perplexity (highest PepMLM confidence), while P3 (WRYPVAGLALKE) has the highest perplexity (lowest confidence among the four).
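
For intuition on the perplexity column, here is a minimal sketch. It assumes perplexity is the exponential of the negative mean log-probability over the peptide residues. PepMLM's exact masking and scoring procedure may differ, and the log-probabilities below are toy values, not model output.

```python
import math


def perplexity(log_probs):
    """Perplexity = exp(-mean log-probability) over the scored residues."""
    if not log_probs:
        raise ValueError("need at least one log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


# Toy input (not PepMLM output): a 12-residue peptide in which every residue
# is assigned probability 1/20, i.e. no preference among the 20 amino acids.
uniform = [math.log(1 / 20)] * 12
print(round(perplexity(uniform), 3))  # 20.0
```

Under this definition, the values of about 11 to 13 in the table correspond to a model that is roughly as uncertain as a uniform choice among 11 to 13 residues per position, so the differences between the four candidates are modest.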

Part 2 – Evaluate binders with AlphaFold3

Peptide	Sequence	Binding description (1–2 sentences)	Near N-terminus / A4V?
P1	WHSPVAAARLKE	surface-bound	no
P2	KRYGAAAARHKK	surface-bound	no
P3	WRYPVAGLALKE	surface-bound	no
P4	WHSPPAAVALGE	surface-bound	yes
Ref	FLYRWLPSRRGG	surface-associated	no

Part 3 – Evaluate Properties of Generated Peptides in the PeptiVerse

Peptide	Solubility	Solubility Score	Hemolysis	Hemolysis Score	Length (aa)	Molecular Weight (Da)	Net Charge (pH 7)	Isoelectric Point (pH)	Hydrophobicity (GRAVY)
WHSPVAAARLKE	Soluble	1.000	Non-hemolytic	0.013	12	1364.6	0.85	8.76	-0.42
KRYGAAAARHKK	Soluble	1.000	Non-hemolytic	0.009	12	1356.6	4.84	11.17	-1.53
WRYPVAGLALKE	Soluble	1.000	Non-hemolytic	0.025	12	1402.6	0.77	8.59	-0.06
WHSPPAAVALGE	Soluble	1.000	Non-hemolytic	0.041	12	1234.4	-1.14	5.47	0.12
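
The GRAVY column can be cross-checked independently. The sketch below assumes PeptiVerse reports GRAVY as the mean Kyte-Doolittle hydropathy over all residues, which is the standard definition but is not confirmed by the tool output itself. The function name `gravy` is illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


for pep in ["WHSPVAAARLKE", "KRYGAAAARHKK", "WRYPVAGLALKE", "WHSPPAAVALGE"]:
    print(pep, round(gravy(pep), 2))
```

Rounded to two decimals, this gives the same values as the table (-0.42, -1.53, -0.06, 0.12), which suggests the reported hydrophobicity is a plain sequence-based average rather than a structure-dependent estimate.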

From the AlphaFold3 models, all of the peptides showed relatively low and fairly similar ipTM values, and structurally they all appeared to bind in a shallow, surface-associated manner rather than forming a clearly buried or highly specific interface on mutant SOD1. The small differences in ipTM therefore did not correspond to an obviously stronger or more convincing binding pose in the structural models. In the PeptiVerse predictions, all four generated peptides were predicted to be soluble (score 1.000) and non-hemolytic (scores 0.009 to 0.041), so none raised an immediate concern for poor solubility or red blood cell toxicity. Among them, KRYGAAAARHKK (P2) provides the best overall balance. It had the lowest PepMLM perplexity (11.21), indicating the highest generation confidence, and it also had the lowest predicted hemolysis score (0.009), full predicted solubility, and a strongly positive net charge (+4.84 at pH 7) that could support interaction with exposed regions of SOD1.

I would advance KRYGAAAARHKK. Although its AlphaFold3 model did not show a dramatically stronger binding geometry than the others, it combines the best overall computational profile across generation confidence, structural plausibility, solubility, and safety-related properties. In other words, it is the most balanced candidate for further optimization and experimental validation.
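
The selection logic can be written down as a simple filter-then-rank sketch. The perplexity, solubility, and hemolysis values are copied from the tables above. The 0.5 cutoffs and the function name `shortlist` are assumptions made for illustration, not thresholds defined by PepMLM or PeptiVerse, and ipTM is left out because the values were too similar to discriminate between candidates.

```python
candidates = [
    {"id": "P1", "seq": "WHSPVAAARLKE", "perplexity": 11.721713, "solubility": 1.000, "hemolysis": 0.013},
    {"id": "P2", "seq": "KRYGAAAARHKK", "perplexity": 11.211369, "solubility": 1.000, "hemolysis": 0.009},
    {"id": "P3", "seq": "WRYPVAGLALKE", "perplexity": 13.068802, "solubility": 1.000, "hemolysis": 0.025},
    {"id": "P4", "seq": "WHSPPAAVALGE", "perplexity": 12.159801, "solubility": 1.000, "hemolysis": 0.041},
]


def shortlist(cands, min_solubility=0.5, max_hemolysis=0.5):
    """Drop candidates failing the (assumed) property cutoffs, then rank the
    rest by perplexity, using the hemolysis score as a tie-breaker."""
    passing = [
        c for c in cands
        if c["solubility"] >= min_solubility and c["hemolysis"] <= max_hemolysis
    ]
    return sorted(passing, key=lambda c: (c["perplexity"], c["hemolysis"]))


ranked = shortlist(candidates)
print([c["id"] for c in ranked])  # ['P2', 'P1', 'P4', 'P3']
```

Because all four peptides pass the property filter, the ranking is decided by perplexity alone, which is why P2 comes out first.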

Part 4 – Generate Optimized Peptides with moPPIt

Binder	Hemolysis	Solubility	Motif
CTWVKKTKKQVT	0.979838	0.833333	0.809806
GYKQKTCNTVKW	0.96261	0.833333	0.782062
STAEFTRQTKKM	0.954297	0.75	0.839537
RGKTTTQNGKVI	0.978479	0.833333	0.820856
GFGTQKKTKCG	0.964351	0.916667	0.849918
RTDQGGVKITLE	0.969846	0.833333	0.905129
ETKKRQKFKTDF	0.974053	0.833333	0.887031
KGETTDKIQKTM	0.975992	0.833333	0.841838
RETVGKKTQTKC	0.981714	0.916667	0.81274

The moPPIt peptides differ noticeably from the PepMLM peptides in both composition and design emphasis. The PepMLM peptides I generated for SOD1 were short 12-residue candidates with relatively simple compositions, and several were enriched in alanine, valine, lysine, and histidine, which made them look like general target-conditioned binders. In contrast, the moPPIt peptides appear more motif-driven and more strongly enriched in lysine and threonine, with some cysteine, glycine, phenylalanine, and aspartate included in specific patterns. This makes them look more constrained by the optimization objectives, especially motif retention and therapeutic property balancing, rather than by sequence likelihood conditioned on the target alone. In other words, PepMLM gave plausible binder candidates, while moPPIt seems to generate peptides that are more explicitly shaped by multiple design goals.
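
The composition comparison can be made quantitative with a short residue count. This is a minimal sketch that pools the peptides from each table above and counts residues. It says nothing about position-specific patterns, and the helper name `composition` is illustrative.

```python
from collections import Counter

PEPMLM = ["WHSPVAAARLKE", "KRYGAAAARHKK", "WRYPVAGLALKE", "WHSPPAAVALGE"]
MOPPIT = [
    "CTWVKKTKKQVT", "GYKQKTCNTVKW", "STAEFTRQTKKM", "RGKTTTQNGKVI",
    "GFGTQKKTKCG", "RTDQGGVKITLE", "ETKKRQKFKTDF", "KGETTDKIQKTM",
    "RETVGKKTQTKC",
]


def composition(peptides):
    """Count every residue across a pooled set of peptides."""
    return Counter("".join(peptides))


pepmlm_counts = composition(PEPMLM)
moppit_counts = composition(MOPPIT)

print(pepmlm_counts.most_common(2))  # [('A', 12), ('K', 5)]
print(moppit_counts.most_common(2))  # [('K', 25), ('T', 23)]
```

Alanine accounts for 12 of the 48 PepMLM residues, while lysine and threonine together account for 48 of the 107 moPPIt residues, which supports the qualitative description above.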

Before advancing any of these peptides toward clinical studies, I would evaluate them in several stages. First, I would confirm binding experimentally using assays such as surface plasmon resonance, biolayer interferometry, or microscale thermophoresis to measure affinity and specificity for mutant SOD1 compared with wild-type SOD1 and unrelated proteins. Second, I would test whether the peptides actually improve a disease-relevant phenotype, for example by reducing SOD1 aggregation, toxicity, or misfolding in cell-based ALS models. Third, I would assess safety and developability, including hemolysis, cytotoxicity, serum stability, protease resistance, immunogenicity risk, and off-target effects. Fourth, I would evaluate pharmacology, including uptake, biodistribution, half-life, and whether the peptide can reach the relevant tissues, especially the central nervous system. Only peptides that show convincing binding, functional benefit, acceptable safety, and realistic delivery potential should move forward into animal studies and later clinical development.

L-Protein Engineering | Option 3: Random Mutagenesis

Goal

In this option, I generated random multi-site L-protein mutants using mutation information from prior mutational analysis experiments, then selected candidates for co-folding with DnaJ using AF2 Multimer. The goal is to explore combinations of tolerated or favorable mutations rather than relying only on single-site manual design. The L protein contains a soluble N-terminal domain followed by a transmembrane region in the last 35 residues, so both structural context and prior mutation evidence were considered when building the mutation pool.

Strategy

Instead of introducing fully random amino acid substitutions, I used a constrained random mutagenesis strategy. I first built a mutation pool from mutations that were supported by prior mutational analysis experiments or appeared tolerated or favorable in the provided mutation data. Then I randomly sampled combinations containing at least 2 mutated positions. This makes the search more biologically grounded than unconstrained random mutagenesis.

A good mutant should not simply contain many mutations. I define an effective mutant as one that satisfies several criteria:

It preserves or improves the foldability and structural plausibility of the L protein.
It does not obviously disrupt the soluble N-terminal region that is implicated in DnaJ interaction.
It avoids strongly unfavorable substitutions in conserved or functionally critical positions.
It is compatible with productive interaction behavior when co-folded with DnaJ.
Ideally, it improves properties related to lysis efficiency, DnaJ independence, or expression, which are the broader goals of this assignment.
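
To apply the criterion about the soluble N-terminal region, it helps to label each mutated position by region. The sketch below assumes the 75-residue L-protein sequence used in this homework and takes the transmembrane region to be the last 35 residues, as stated in the Goal section. The function names `region_of` and `regions_hit` are hypothetical helpers, not part of any provided notebook.

```python
L_PROTEIN_LENGTH = 75   # length of the L-protein sequence used in this homework
TM_LENGTH = 35          # transmembrane region = last 35 residues (from the text)


def region_of(pos: int, length: int = L_PROTEIN_LENGTH, tm_length: int = TM_LENGTH) -> str:
    """Classify a 1-indexed position as 'soluble' or 'transmembrane'."""
    if not 1 <= pos <= length:
        raise ValueError(f"position {pos} is outside 1..{length}")
    return "transmembrane" if pos > length - tm_length else "soluble"


def regions_hit(mutation_names: str) -> dict:
    """Map names such as 'Q8K,L41I' to the region of each mutated position."""
    result = {}
    for name in mutation_names.split(","):
        pos = int(name[1:-1])  # strip wild-type and mutant residue letters
        result[name] = region_of(pos)
    return result


print(regions_hit("Q8K,L41I"))  # {'Q8K': 'soluble', 'L41I': 'transmembrane'}
```

With these assumptions, positions 1 to 40 count as soluble and positions 41 to 75 as transmembrane, so a candidate can be flagged quickly if it places several substitutions in the region implicated in DnaJ interaction.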

Python function for constrained random mutagenesis

import random
from typing import Dict, List, Tuple

L_PROTEIN_WT = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"

# Example mutation pool format:
# key = 1-indexed residue position
# value = (wildtype_residue, [allowed_mutant_residues])
#
# You should replace these example entries with positions supported by your
# mutational analysis sheet or notebook results.

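# Each wild-type residue below matches L_PROTEIN_WT at that 1-indexed position;
# a mismatch would make generate_random_mutant raise a ValueError.
mutation_pool: Dict[int, Tuple[str, List[str]]] = {
    8:  ("Q", ["K", "R"]),
    14: ("A", ["S", "T"]),
    25: ("E", ["D", "Q"]),
    31: ("R", ["K"]),
    35: ("S", ["T", "A"]),
    41: ("L", ["I", "V"]),
    45: ("A", ["V"]),
    59: ("L", ["I", "V"]),
    62: ("A", ["V", "L"]),
    67: ("V", ["I", "L"]),
}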

def apply_mutations(sequence: str, mutations: List[Tuple[int, str]]) -> str:
    seq_list = list(sequence)
    for pos, new_res in mutations:
        seq_list[pos - 1] = new_res
    return "".join(seq_list)

def format_mutation_name(sequence: str, mutations: List[Tuple[int, str]]) -> str:
    names = []
    for pos, new_res in mutations:
        wt = sequence[pos - 1]
        names.append(f"{wt}{pos}{new_res}")
    return ",".join(names)

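def generate_random_mutant(
    sequence: str,
    pool: Dict[int, Tuple[str, List[str]]],
    min_sites: int = 2,
    max_sites: int = 4
) -> Dict[str, object]:
    # Cannot mutate more distinct positions than the pool contains.
    max_sites = min(max_sites, len(pool))
    if min_sites > max_sites:
        raise ValueError("min_sites exceeds the number of positions available in the pool")

    # Pick how many sites to mutate, then sample that many distinct positions.
    n_sites = random.randint(min_sites, max_sites)
    chosen_positions = random.sample(list(pool.keys()), n_sites)

    mutations = []
    for pos in chosen_positions:
        wt_expected, allowed = pool[pos]
        wt_actual = sequence[pos - 1]
        # Guard against a pool entry that does not match the wild-type sequence.
        if wt_actual != wt_expected:
            raise ValueError(f"WT mismatch at position {pos}: expected {wt_expected}, found {wt_actual}")
        new_res = random.choice(allowed)
        mutations.append((pos, new_res))

    # Sort by position so equivalent mutants always get the same name.
    mutations = sorted(mutations, key=lambda x: x[0])
    mutant_seq = apply_mutations(sequence, mutations)
    mutant_name = format_mutation_name(sequence, mutations)

    return {
        "mutations": mutant_name,
        "sequence": mutant_seq,
        "num_mutations": len(mutations)
    }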

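def generate_unique_mutants(
    sequence: str,
    pool: Dict[int, Tuple[str, List[str]]],
    n_mutants: int = 10,
    min_sites: int = 2,
    max_sites: int = 4,
    max_attempts: int = 10000
):
    seen = set()
    results = []
    attempts = 0

    # Cap the number of draws so a pool with fewer than n_mutants possible
    # combinations cannot cause an endless loop.
    while len(results) < n_mutants and attempts < max_attempts:
        attempts += 1
        mutant = generate_random_mutant(sequence, pool, min_sites, max_sites)
        key = mutant["mutations"]
        if key not in seen:
            seen.add(key)
            results.append(mutant)

    if len(results) < n_mutants:
        raise RuntimeError(
            f"Only {len(results)} unique mutants found after {max_attempts} attempts"
        )

    return results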

# Example usage
random.seed(42)
mutants = generate_unique_mutants(
    sequence=L_PROTEIN_WT,
    pool=mutation_pool,
    n_mutants=10,
    min_sites=2,
    max_sites=4
)

for i, m in enumerate(mutants, 1):
    print(f"Mutant {i}")
    print("Mutations:", m["mutations"])
    print("Sequence :", m["sequence"])
    print("Count    :", m["num_mutations"])
    print()

How I would use this function

I would first replace the example mutation_pool with residue positions and substitutions supported by the provided L-protein mutational analysis data. Then I would generate a small library of random double, triple, or quadruple mutants. From that library, I would select a few candidates that look structurally reasonable and co-fold them with DnaJ using AF2 Multimer.
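
Before sampling, it is useful to know how large the constrained library is, since that bounds how many unique mutants can be requested. The sketch below counts all combinations of 2 to 4 mutated positions, given the number of allowed substitutions at each pool position. The counts listed are those of the example pool (eight positions with two options and two positions with one option), and the function name `library_size` is illustrative.

```python
from itertools import combinations
from math import prod


def library_size(allowed_counts, min_sites=2, max_sites=4):
    """Number of distinct mutants with min_sites..max_sites mutated positions.

    allowed_counts[i] is the number of allowed substitutions at pool position i.
    For each subset of positions, the number of mutants is the product of the
    allowed counts at those positions.
    """
    total = 0
    upper = min(max_sites, len(allowed_counts))
    for k in range(min_sites, upper + 1):
        total += sum(prod(subset) for subset in combinations(allowed_counts, k))
    return total


# Allowed-substitution counts for the ten positions in the example pool.
example_counts = [2, 2, 2, 1, 2, 2, 1, 2, 2, 2]
print(library_size(example_counts))  # 2961
```

The example pool therefore supports 145 double, 688 triple, and 2128 quadruple mutants, so a library of 10 samples covers well under 1% of the constrained space and repeated sampling with different seeds will rarely produce duplicates.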

Co-folding setup with DnaJ

For each selected mutant, I would submit the mutant L-protein sequence together with the DnaJ sequence as separate chains in AF2 Multimer. The purpose is to compare whether different mutation combinations change the predicted interaction pattern between the L-protein soluble region and DnaJ. The assignment notes that this kind of prediction may be difficult, especially for membrane-related systems, so these structures should be interpreted cautiously.
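
Preparing the two-chain input can be sketched as below. This assumes the co-folding tool accepts a multi-record FASTA with one record per chain. Some notebook front ends expect a different layout, so the format should be checked against the tool actually used. The function name `multimer_fasta` is hypothetical, and the DnaJ string is a short stand-in, not the real DnaJ sequence, which must be supplied separately.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def multimer_fasta(chains: dict) -> str:
    """Build a multi-record FASTA string with one record per chain."""
    lines = []
    for name, seq in chains.items():
        unexpected = set(seq) - VALID_AA
        if unexpected:
            raise ValueError(f"{name} contains non-standard residues: {sorted(unexpected)}")
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# L-protein sequence from this homework; swap in a mutant sequence as needed.
L_PROTEIN = "METRFPQQSQQTPASTNRRRPFKHEDYPCRRQQRSSTLYVLIFLAIFLSKFTNQLLLSLLEAVIRTVTTLQQLLT"
# Placeholder only: replace with the full DnaJ sequence before submitting.
DNAJ_PLACEHOLDER = "ACDEFGHIK"

fasta_text = multimer_fasta({"L_protein_variant": L_PROTEIN, "DnaJ": DNAJ_PLACEHOLDER})
print(fasta_text)
```

Validating residues before writing the file catches stray characters, such as spaces or mutation labels pasted into the sequence, before a long prediction job is started.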

How I define a “good” mutant

I define a good or effective mutant as one that balances multiple factors rather than optimizing only one metric. First, it should remain structurally plausible as an L-protein variant and avoid obviously destabilizing substitutions. Second, it should preserve or improve a meaningful interaction pattern with DnaJ in the soluble region if the biological mechanism still depends on that interaction. Third, if the long-term goal is DnaJ independence, then a good mutant could also be one that remains well folded without requiring the same interaction geometry. Finally, the mutant should be supported by prior mutational evidence and should not rely on highly disruptive changes at conserved positions. In practice, the best mutant is the one that shows a reasonable structural model, uses biologically tolerated substitutions, and best aligns with the assignment goals of improved folding, lysis efficiency, or reduced dependence on bacterial chaperones.