ML-Based Protein Design
1. Protein Language Modeling
1a. Mutational Scans

The heat map was generated using the ESM-2 t6 8M UR50D model.
The rows for tryptophan (W) and cysteine (C) are noticeably darker than the others. Their model scores are low (< -5) across all positions, which means the model considers W and C unlikely substitutions almost anywhere in the sequence, suggesting that they can rarely be introduced without disturbing the fold of the protein. This may be due to the large size of W and the unique chemical properties of C, which make them difficult to substitute into positions not specifically adapted for them.
A second pattern is the presence of darker columns around positions 197–208 and 338–344. The model penalizes almost any substitution there, which suggests that these regions are highly conserved throughout evolution and that mutations introduced in them may critically alter the structure and function of the luciferase. In fact, positions 338–344 correspond to residues directly involved in the binding of the enzyme to the cofactor ATP and the substrate luciferin.
Reference: A View on the Active Site of Firefly Luciferase
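The reading of rows and columns above can be made concrete with a small scoring sketch. It is not the notebook's actual code: it assumes the heat map shows log-likelihood ratios relative to the wild-type residue, a common convention for masked language models. The function names, the threshold, and the toy three-letter alphabet are hypothetical and only serve the illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def llr_matrix(log_probs, wt_seq, alphabet=AMINO_ACIDS):
    """Return scores[a][i] = log p(a at i) - log p(wild type at i).

    log_probs[i] is a dict mapping each amino acid to its log-probability
    at position i (assumed to come from a masked language model).
    """
    scores = {a: [] for a in alphabet}
    for i, wt in enumerate(wt_seq):
        for a in alphabet:
            scores[a].append(log_probs[i][a] - log_probs[i][wt])
    return scores


def mean_score_per_amino_acid(scores):
    """Row means: a very negative value marks a residue that fits almost nowhere."""
    return {a: sum(row) / len(row) for a, row in scores.items()}


def conserved_positions(scores, wt_seq, threshold):
    """1-based positions whose mean non-wild-type score is below threshold."""
    positions = []
    for i, wt in enumerate(wt_seq):
        column = [row[i] for a, row in scores.items() if a != wt]
        if sum(column) / len(column) < threshold:
            positions.append(i + 1)
    return positions


# Toy example with a three-letter alphabet (made-up probabilities).
toy_alphabet = "AGW"
toy_seq = "AG"
toy_log_probs = [
    {"A": math.log(0.90), "G": math.log(0.09), "W": math.log(0.01)},
    {"A": math.log(0.40), "G": math.log(0.40), "W": math.log(0.20)},
]
toy_scores = llr_matrix(toy_log_probs, toy_seq, toy_alphabet)
toy_row_means = mean_score_per_amino_acid(toy_scores)
toy_conserved = conserved_positions(toy_scores, toy_seq, threshold=-2.0)
```

In this toy case only position 1 is flagged as conserved, and W has the lowest row mean. With real model output, the same two summaries correspond to the dark columns and dark rows of the heat map.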
1b. Latent Space Analysis

In this representation, each point corresponds to a protein. The three axes (t-SNE1, t-SNE2, and t-SNE3) are a t-SNE projection of the high-dimensional representation of each sequence down to three dimensions. Proteins positioned close together share similar sequence features and often exhibit related structural, functional, or evolutionary properties, so proteins that belong to the same class or family, or that share similar structural folds, tend to form clusters.
Process: I first tried to locate the cluster of oxidoreductases (the enzyme class of luciferase) and the 4 luciferase proteins in the dataset by navigating the map manually using the t-SNE3 color coding, but this proved too time-consuming (see Documentation below).
Next step: Add code to the Colab notebook to highlight the Firefly Luciferase and related proteins on the map.
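A minimal sketch of this next step is given below. It assumes the notebook exposes one text description per protein and a matching list of (t-SNE1, t-SNE2, t-SNE3) coordinates. The function names, the keyword list, and the output file name are hypothetical, and the plotting part requires matplotlib.

```python
def find_matches(descriptions, keywords):
    """Return indices of entries whose description contains any keyword
    (case-insensitive substring match)."""
    lowered = [k.lower() for k in keywords]
    return [
        i for i, text in enumerate(descriptions)
        if any(k in text.lower() for k in lowered)
    ]


def plot_highlighted(coords, descriptions, keywords,
                     out_path="latent_space_highlight.png"):
    """Draw all proteins in gray and the keyword matches as red stars.

    coords is a list of (t-SNE1, t-SNE2, t-SNE3) tuples aligned with
    descriptions. Returns the indices of the highlighted proteins.
    """
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    hits = set(find_matches(descriptions, keywords))
    background = [c for i, c in enumerate(coords) if i not in hits]
    foreground = [c for i, c in enumerate(coords) if i in hits]

    fig = plt.figure(figsize=(7, 6))
    ax = fig.add_subplot(projection="3d")
    if background:
        ax.scatter(*zip(*background), s=5, c="lightgray", alpha=0.4,
                   label="other proteins")
    if foreground:
        ax.scatter(*zip(*foreground), s=60, c="red", marker="*",
                   label="keyword matches")
    ax.set_xlabel("t-SNE1")
    ax.set_ylabel("t-SNE2")
    ax.set_zlabel("t-SNE3")
    ax.legend()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return sorted(hits)


# Toy descriptions (not taken from the dataset).
toy_descriptions = [
    "Luciferin 4-monooxygenase (firefly luciferase)",
    "Hemoglobin subunit alpha",
    "Renilla LUCIFERASE",
    "Alcohol dehydrogenase",
]
toy_hits = find_matches(toy_descriptions, ["luciferase"])
```

Matching on free-text descriptions is only one option. If the dataset carries enzyme class annotations, filtering on those would be a more reliable way to pick out the oxidoreductases.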
2. Protein Folding
2a. Native protein
Luciferase structure determined experimentally:

Image source: RCSB PDB 1LCI
Luciferase structure predicted by ESMFold:

Result: On visual inspection, the structure predicted by ESMFold closely resembles the one determined experimentally (RCSB PDB 1LCI).
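The visual comparison could be complemented with a number. The sketch below computes the RMSD after optimal superposition (the Kabsch algorithm), which was not part of the original analysis. It assumes the caller supplies two equally long arrays of matched C-alpha coordinates. Experimental structures often lack some residues, so that pairing has to be established first. The function name and the toy coordinates are hypothetical, and NumPy is required.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two matched (N, 3) coordinate sets after optimal
    rigid-body superposition of P onto Q (Kabsch algorithm)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove the translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Optimal rotation from the SVD of the covariance matrix.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    P_rot = Pc @ R.T
    return float(np.sqrt(((P_rot - Qc) ** 2).sum(axis=1).mean()))


# Toy check: a rotated and shifted copy superimposes perfectly.
rng = np.random.default_rng(0)
toy_points = rng.normal(size=(50, 3))
angle = 0.7
rot_z = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
toy_moved = toy_points @ rot_z.T + np.array([10.0, -3.0, 2.5])
toy_rmsd_same = kabsch_rmsd(toy_points, toy_moved)
toy_rmsd_noisy = kabsch_rmsd(
    toy_points, toy_moved + rng.normal(scale=0.5, size=(50, 3))
)
```

A low C-alpha RMSD over the matched residues would support the impression that the two structures are similar. The threshold for "similar" depends on protein size and is left to the reader.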
2b. Mutated proteins
Original Sequence
MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDGHIEVNITYAEYFEMSVRLAEAMKRYGLNTNHRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNISQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYDFVPESFDRDKTIALIMNSSGSTGLPKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSVVPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTLIDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPGAVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHSGDLAYWDEDEHFFIVGRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGELPAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKRDARKIREILIKAKKGGKSKL
Total sequence length: 550, pTM: 0.910, pLDDT: 90.645
Confidence native Firefly Luciferase:

Mutation 01: A45G
MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDGHIEVNITYAEYFEMSVRLAEAMKRYGLNTNHRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNISQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYDFVPESFDRDKTIALIMNSSGSTGLPKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSVVPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTLIDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPGAVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHSGDLAYWDEDEHFFIVGRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGELPAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKRDARKIREILIKAKKGGKSKL
Total sequence length: 550, pTM: 0.910, pLDDT: 90.645

Confidence Mutation A45G:

Mutation 02: H76D
MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDAHIEVNITYAEYFEMSVRLAEAMKRYGLNTNDRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNISQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYDFVPESFDRDKTIALIMNSSGSTGLPKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSVVPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTLIDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPGAVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHSGDLAYWDEDEHFFIVGRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGELPAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKRDARKIREILIKAKKGGKSKL
Total sequence length: 550, pTM: 0.911, pLDDT: 90.796
Confidence Mutation H76D:

Mutation 03: Substitution of residues 196–206, “MNSSGSTGLPK” > “WMHWPIGFCHK”
MEDAKNIKKGPAPFYPLEDGTAGEQLHKAMKRYALVPGTIAFTDGHIEVNITYAEYFEMSVRLAEAMKRYGLNTNHRIVVCSENSLQFFMPVLGALFIGVAVAPANDIYNERELLNSMNISQPTVVFVSKKGLQKILNVQKKLPIIQKIIIMDSKTDYQGFQSMYTFVTSHLPPGFNEYDFVPESFDRDKTIALIWMHWPIGFCHKGVALPHRTACVRFSHARDPIFGNQIIPDTAILSVVPFHHGFGMFTTLGYLICGFRVVLMYRFEEELFLRSLQDYKIQSALLVPTLFSFFAKSTLIDKYDLSNLHEIASGGAPLSKEVGEAVAKRFHLPGIRQGYGLTETTSAILITPEGDDKPGAVGKVVPFFEAKVVDLDTGKTLGVNQRGELCVRGPMIMSGYVNNPEATNALIDKDGWLHSGDLAYWDEDEHFFIVGRLKSLIKYKGYQVAPAELESILLQHPNIFDAGVAGLPDDDAGELPAAVVVLEHGKTMTEKEIVDYVASQVTTAKKLRGGVVFVDEVPKGLTGKRDARKIREILIKAKKGGKSKL
Total sequence length: 550, pTM: 0.945, pLDDT: 93.928

Confidence Mutation 03:

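Variant sequences like the ones in this section can be generated programmatically instead of being edited by hand. The sketch below is one possible approach, with hypothetical helper names and a toy sequence. Each helper checks that the expected wild-type residues are present before changing anything, which guards against off-by-one positions and against mutating a sequence that has already been modified.

```python
import re


def apply_point_mutation(seq, mutation):
    """Apply a mutation written like 'A45G' (wild type, 1-based position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Cannot parse mutation {mutation!r}")
    wt, pos, new = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= pos <= len(seq):
        raise ValueError(f"Position {pos} is outside the sequence")
    if seq[pos - 1] != wt:
        raise ValueError(
            f"Expected {wt} at position {pos}, found {seq[pos - 1]}"
        )
    return seq[:pos - 1] + new + seq[pos:]


def substitute_range(seq, start, end, expected, replacement):
    """Replace residues start..end (1-based, inclusive) after checking
    that the segment currently reads `expected`."""
    if seq[start - 1:end] != expected:
        raise ValueError(
            f"Expected {expected} at {start}-{end}, found {seq[start - 1:end]}"
        )
    return seq[:start - 1] + replacement + seq[end:]


def diff_positions(a, b):
    """List (position, residue_in_a, residue_in_b) for every difference."""
    if len(a) != len(b):
        raise ValueError("Sequences must have the same length")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Toy sequence (not the luciferase sequence).
toy_wild_type = "MKTAYIAKQR"
toy_point = apply_point_mutation(toy_wild_type, "A4G")
toy_block = substitute_range(toy_wild_type, 5, 7, "YIA", "WCW")
toy_diff = diff_positions(toy_wild_type, toy_point)
```

Running `diff_positions` between the reference and each variant before folding is a quick way to confirm that a sequence carries exactly the intended change.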
4. Documentation
4a. Reference Firefly Luciferase sequence

4b. Mutation Scans

4c. Latent Space Exploration


