ML-Based Protein Design

1. Protein Language Modeling

1a. Mutational Scans

The heat map was generated using the ESM-2 t6 8M UR50D model.

The rows for tryptophan (W) and cysteine (C) are noticeably darker than the others. These low model scores (< -5) across all positions indicate that the model assigns a low probability to W or C as a substitution, which suggests that these two amino acids can rarely be introduced without disturbing the spatial configuration of the protein. This may be due to the bulky side chain of W and the reactive thiol group of C, which make them difficult to accommodate at positions not specifically adapted for them.
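
As a rough illustration of where such scores can come from, the sketch below assumes the heat map shows log-likelihood ratios: the model's log-probability of a substitute amino acid minus that of the wild-type residue at the same position. The logits are made-up toy numbers for a three-letter alphabet, not ESM-2 output, and `substitution_scores` is a hypothetical helper rather than part of any library.

```python
import math

def log_softmax(logits):
    """Convert raw logits into log-probabilities (numerically stable)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def substitution_scores(logits, alphabet, wild_type):
    """Score each amino acid at one position relative to the wild-type residue.

    Returns {amino_acid: log p(amino_acid) - log p(wild_type)}. The wild-type
    residue scores 0 and disfavoured substitutions are negative.
    """
    log_probs = dict(zip(alphabet, log_softmax(logits)))
    return {aa: log_probs[aa] - log_probs[wild_type] for aa in alphabet}

# Toy example (assumed values): one position whose wild-type residue is A.
alphabet = ["A", "C", "W"]
toy_logits = [4.0, -2.0, -3.0]
scores = substitution_scores(toy_logits, alphabet, wild_type="A")
for aa, score in scores.items():
    print(f"A->{aa}: {score:.2f}")
```

Under this assumption, a score below -5 means the substitution is judged more than exp(5), or roughly 150 times, less probable than the wild-type residue at that position.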

Another observable pattern is the presence of darker columns around positions 197–208 and 338–344. Low scores for nearly every substitution suggest that these regions are highly conserved throughout evolution and that mutations introduced there are likely to disrupt the structure or function of the luciferase. Indeed, positions 338–344 correspond to residues directly involved in binding the cofactor ATP and the substrate luciferin.
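
Dark columns like these can also be located numerically instead of by eye. The sketch below averages the scores in each column and reports runs of consecutive positions whose mean falls below a cutoff. The matrix, the cutoff of -4, and the function names are illustrative assumptions, not values taken from the actual scan.

```python
def column_means(score_matrix):
    """Average score per position; score_matrix[row][col] is amino acid x position."""
    n_rows = len(score_matrix)
    return [sum(column) / n_rows for column in zip(*score_matrix)]

def low_score_windows(means, threshold, min_length=2):
    """Return 1-based (start, end) ranges of consecutive positions below threshold."""
    windows, start = [], None
    for i, value in enumerate(means):
        if value < threshold:
            if start is None:
                start = i          # a new run of low-scoring positions begins
        elif start is not None:
            if i - start >= min_length:
                windows.append((start + 1, i))
            start = None
    # Close a run that extends to the last position.
    if start is not None and len(means) - start >= min_length:
        windows.append((start + 1, len(means)))
    return windows

# Toy matrix (assumed values): 3 substitutions x 8 positions.
toy_scores = [
    [-1, -1, -6, -7, -6, -1, -1, -8],
    [-2, -1, -5, -8, -7, -2, -1, -1],
    [-1, -2, -7, -6, -8, -1, -2, -1],
]
windows = low_score_windows(column_means(toy_scores), threshold=-4)
print(windows)
```

In the toy data only positions 3 to 5 are reported: position 8 has a single very low score, but its column mean stays above the cutoff.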

Reference: A View on the Active Site of Firefly Luciferase

1b. Latent Space Analysis

In this representation, each point corresponds to a protein, and the three axes (t-SNE1, t-SNE2, and t-SNE3) result from reducing the high-dimensional matrix of protein embeddings to only three dimensions. Proteins positioned close together share similar sequence features and often exhibit related structural, functional, or evolutionary properties. They form clusters of proteins that belong to the same class or family, or that share similar structural folds.

Process: I first tried to locate a cluster of oxidoreductases (the same enzyme class as luciferase) and the four luciferase proteins in the dataset by manually navigating the map using the t-SNE3 color coding, but this proved too time-consuming (see Documentation below).

Next step: Incorporate additional code in Colab to highlight the Firefly Luciferase and related proteins on the map.
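
One possible shape for that additional code is sketched here. It selects the proteins whose name contains a search term, then the nearest other proteins around their centroid, so that both groups can be drawn in a different color. The protein names and coordinates are invented for illustration, `pick_highlights` is a hypothetical helper, and distances in a t-SNE map are only a rough guide to similarity.

```python
import math

def pick_highlights(points, query_term, k=3):
    """Select proteins to highlight on a 3-D scatter plot.

    points: list of (name, (x, y, z)) tuples in t-SNE coordinates.
    Returns (matches, neighbours): the names containing query_term, and the
    k non-matching proteins closest to the centroid of the matches.
    """
    term = query_term.lower()
    matches = [(name, xyz) for name, xyz in points if term in name.lower()]
    if not matches:
        return [], []
    centroid = tuple(sum(xyz[i] for _, xyz in matches) / len(matches) for i in range(3))
    others = [(name, xyz) for name, xyz in points if term not in name.lower()]
    others.sort(key=lambda item: math.dist(item[1], centroid))
    return [name for name, _ in matches], [name for name, _ in others[:k]]

# Toy data (assumed): names and coordinates are placeholders.
toy_points = [
    ("firefly_luciferase_1", (1.0, 1.0, 1.0)),
    ("firefly_luciferase_2", (1.2, 0.8, 1.0)),
    ("protein_A", (1.5, 1.1, 0.9)),
    ("protein_B", (-8.0, 3.0, 2.0)),
    ("protein_C", (5.0, -6.0, 0.5)),
]
matches, neighbours = pick_highlights(toy_points, "luciferase", k=2)
print(matches, neighbours)
```

The two returned lists could then be passed to the plotting code as separate traces or color groups.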

2. Protein Folding

2a. Native protein

Luciferase structure determined experimentally

Image source: RCSB PDB 1LCI

Luciferase structure predicted by ESMFold:

Result: The structure predicted by ESMFold looks similar to the one determined experimentally (RCSB PDB 1LCI).
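
The visual comparison could be backed by a number. One option, not used in this report, is the distance RMSD (dRMSD), which compares all pairwise atom distances within each structure and therefore needs no superposition. The sketch assumes that C-alpha coordinates for the same residues have already been extracted from both structures; the coordinates shown are toy values.

```python
import math
from itertools import combinations

def distance_rmsd(coords_a, coords_b):
    """Distance RMSD between two equally long lists of (x, y, z) positions.

    Compares every intra-structure pairwise distance in A with the
    corresponding distance in B, so no alignment step is required.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must contain the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("at least two atoms are required")
    squared = [
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Toy coordinates (assumed): three atoms on a line.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in reference]   # same shape, moved
stretched = [(2.0 * x, y, z) for x, y, z in reference]              # different shape

print(distance_rmsd(reference, shifted))
print(distance_rmsd(reference, stretched))
```

A value near zero means the two structures have the same internal geometry. Because only distances are compared, dRMSD cannot distinguish a structure from its mirror image, and residues missing from the experimental structure would have to be dropped from both coordinate lists first.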

2b. Mutated proteins
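
The variants in this section use the usual shorthand in which, for example, A45G means that the alanine at position 45 is replaced by glycine. A minimal sketch of how such edits could be applied and sanity-checked before folding is given here. The helper names are hypothetical, and the demonstration uses only a ten-residue toy fragment rather than the full 550-residue sequence.

```python
import re

def apply_point_mutation(sequence, mutation):
    """Apply a substitution written as <wild type><1-based position><new residue>."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild_type, new_residue = match.group(1), match.group(3)
    position = int(match.group(2))
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    found = sequence[position - 1]
    if found != wild_type:
        # Catches off-by-one errors and edits applied to the wrong template.
        raise ValueError(f"expected {wild_type} at position {position}, found {found}")
    return sequence[: position - 1] + new_residue + sequence[position:]

def replace_segment(sequence, start, old, new):
    """Replace the segment `old` that begins at 1-based position `start` with `new`."""
    end = start - 1 + len(old)
    if sequence[start - 1 : end] != old:
        raise ValueError(f"segment {old!r} not found at position {start}")
    return sequence[: start - 1] + new + sequence[end:]

toy = "MEDAKNIKKG"  # toy fragment used only for this demonstration
point_mutant = apply_point_mutation(toy, "A4G")
segment_mutant = replace_segment(toy, 6, "NIK", "WWW")
print(point_mutant)    # MEDGKNIKKG
print(segment_mutant)  # MEDAKWWWKG
```

Checking the wild-type residue before substituting makes it harder to apply a mutation to a sequence that has already been edited.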

Original Sequence

MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDGHIEVNITYAEYFEMSVRLAEAMKRYGLNTNHRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNISQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYDFVPESFDRDKTIALIMNSSGSTGLPKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSVVPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTLIDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPGAVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHSGDLAYWDEDEHFFIVGRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGELPAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKRDARKIREILIKAKKGGKSKL

Total sequence length: 550, pTM: 0.910, pLDDT: 90.645

Confidence native Firefly Luciferase:


Mutation 01: A45G

MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDGHIEVNITYAEYFEMSVRLAEAMKRYGLNTNHRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNISQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYDFVPESFDRDKTIALIMNSSGSTGLPKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSVVPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTLIDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPGAVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHSGDLAYWDEDEHFFIVGRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGELPAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKRDARKIREILIKAKKGGKSKL

Total sequence length: 550, pTM: 0.910, pLDDT: 90.645

Confidence Mutation A45G:


Mutation 02: H76D

MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDAHIEVNITYAEYFEMSVRLAEAMKRYGLNTNDRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNISQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYDFVPESFDRDKTIALIMNSSGSTGLPKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSVVPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTLIDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPGAVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHSGDLAYWDEDEHFFIVGRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGELPAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKRDARKIREILIKAKKGGKSKL

Total sequence length: 550, pTM: 0.911, pLDDT: 90.796

Confidence Mutation H76D:


Mutation 03: Substitution of residues 196–206, “MNSSGSTGLPK” > “WMHWPIGFCHK”

MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDGHIEVNITYAEYFEMSVRLAEAMKRYGLNTNHRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNISQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYDFVPESFDRDKTIALIWMHWPIGFCHKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSVVPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTLIDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPGAVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHSGDLAYWDEDEHFFIVGRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGELPAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKRDARKIREILIKAKKGGKSKL

Total sequence length: 550, pTM: 0.945, pLDDT: 93.928

Confidence Mutation 03:
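
To compare the variants at a glance, the reported scores can be expressed as differences from the native prediction. The sketch below uses only the pTM and pLDDT values listed in this section. The dictionary keys and the function name are illustrative choices.

```python
# pTM and pLDDT values as reported in this section.
metrics = {
    "native": (0.910, 90.645),
    "A45G": (0.910, 90.645),
    "H76D": (0.911, 90.796),
    "196-206 substitution": (0.945, 93.928),
}

def deltas_vs_reference(metrics, reference="native"):
    """Return {variant: (delta_pTM, delta_pLDDT)} relative to the reference entry."""
    ref_ptm, ref_plddt = metrics[reference]
    return {
        name: (round(ptm - ref_ptm, 3), round(plddt - ref_plddt, 3))
        for name, (ptm, plddt) in metrics.items()
        if name != reference
    }

deltas = deltas_vs_reference(metrics)
for name, (d_ptm, d_plddt) in deltas.items():
    print(f"{name}: pTM {d_ptm:+.3f}, pLDDT {d_plddt:+.3f}")
```

A positive difference means the variant was predicted with higher confidence than the native sequence. By itself it says nothing about whether the enzyme would remain functional.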



4. Documentation

4a. Reference Firefly Luciferase sequence

4b. Mutation Scans

4c. Latent Space Exploration