Week 4 HW: Protein Design part I
PART A — Conceptual Questions
- How many molecules of amino acids are in 500 g of meat?
Assuming all 500 g is protein and an average amino acid mass of about 100 g/mol: 500 g ÷ 100 g/mol = 5 mol
Total number of molecules: 5 mol × 6.022 × 10²³ molecules/mol
≈ 3 × 10²⁴ amino acid molecules
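The estimate above can be checked with a short script. This is a minimal sketch under the same assumptions as the answer (all 500 g is protein, about 100 g/mol per amino acid). The `protein_fraction` parameter is a hypothetical addition, not part of the original answer, that lets the estimate be rescaled if only part of the meat is protein.

```python
AVOGADRO = 6.022e23  # molecules per mole

def amino_acid_count(mass_g, protein_fraction=1.0, avg_residue_mass=100.0):
    """Estimate the number of amino acid molecules in a sample.

    protein_fraction and avg_residue_mass are assumptions, not measured values.
    """
    moles = mass_g * protein_fraction / avg_residue_mass
    return moles * AVOGADRO

estimate = amino_acid_count(500)
print(f"{estimate:.2e}")  # 3.01e+24
```

Passing a smaller `protein_fraction` scales the result linearly, so the order of magnitude stays close to 10²⁴ for any realistic protein content.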
- Why do humans eat beef but do not become a cow, and eat fish but do not become fish?
Proteins from food are not absorbed as intact proteins. Instead, they are broken down during digestion:
Dietary proteins are digested into individual amino acids.
These amino acids enter the bloodstream.
Human cells use them as building blocks to synthesize human proteins according to human DNA instructions.
- Why are there only 20 natural amino acids?
The genetic code evolved to encode 20 canonical amino acids through the translation machinery involving:
ribosomes
tRNA molecules
aminoacyl-tRNA synthetases
These 20 amino acids provide sufficient chemical diversity to build complex protein structures while maintaining translation accuracy and efficiency.
- Can we make non-natural amino acids?
Yes. Scientists can design and synthesize non-natural amino acids.
Examples include:
Fluorinated amino acids, which can improve protein stability.
- Where did amino acids come from before enzymes and life existed?
Several hypotheses explain the origin of amino acids:
Prebiotic chemistry
The Miller–Urey experiment demonstrated that amino acids can form from simple gases under conditions resembling early Earth.
Meteorites
Some meteorites contain organic molecules including amino acids.
- If an α-helix were made from D-amino acids, what handedness would it have?
Natural proteins consist of L-amino acids, which form right-handed α-helices.
If the helix were made entirely from D-amino acids, the stereochemistry would reverse, producing a left-handed α-helix.
- Why are most molecular helices right-handed?
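Most biological helices are right-handed because their building blocks are chiral. Proteins are composed almost exclusively of L-amino acids, and for L-residues the right-handed α-helix avoids steric clashes between side chains and the backbone, making it more stable than the left-handed form.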
- Why do β-sheets tend to aggregate?
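β-sheets have an extended, relatively flat structure with strong backbone hydrogen bonding, and the strands at their edges expose backbone hydrogen-bond donors and acceptors.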
Because of this:
Multiple β-sheets can stack together
Intermolecular hydrogen bonds stabilize aggregates
- Why do many amyloid diseases form β-sheets?
In diseases such as Alzheimer’s disease, proteins misfold into structures dominated by β-sheet stacking.
These structures form amyloid fibrils, which are highly stable and resistant to degradation, leading to toxic protein aggregates.
PART B — Protein Analysis and Visualization
Selected Protein
The selected protein is Green Fluorescent Protein (GFP) from the jellyfish Aequorea victoria.
This protein was chosen because:
It is widely used as a fluorescent reporter in molecular biology
Its structure is well characterized
It is commonly used in protein engineering studies
Amino Acid Sequence
GFP consists of 238 amino acids.
Frequently occurring residues include:
Glycine
Leucine
Serine
Glycine is common because it contributes flexibility to protein structures.
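Residue frequencies like those listed above can be tallied directly from a sequence. The sketch below uses a short made-up fragment, not the GFP sequence; the name `toy_sequence` and its contents are purely illustrative, and the real sequence would be substituted from a database entry.

```python
from collections import Counter

def residue_composition(sequence):
    """Return (residue, count, fraction) tuples, most frequent first."""
    seq = sequence.upper()
    counts = Counter(seq)
    total = len(seq)
    return [(aa, n, n / total) for aa, n in counts.most_common()]

toy_sequence = "GGSLLGKSGE"  # hypothetical fragment, NOT the GFP sequence
for aa, n, frac in residue_composition(toy_sequence):
    print(aa, n, f"{frac:.0%}")
```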
Bioinformatic Analysis of MS2 Phage Lysis Protein
Sequence retrieval
↓
BLAST homolog search
↓
Multiple sequence alignment (Clustal Omega)
↓
Structure prediction (ESMFold)
↓
Functional site prediction (PeSTo)
↓
Structure similarity search (Foldseek)
↓
Sequence optimization (ProteinMPNN)
- Sequence Retrieval
Objective
To obtain the amino acid sequence of the MS2 bacteriophage lysis protein that will serve as the reference sequence for downstream bioinformatic analyses.
Input
Protein sequence database entry of the MS2 phage lysis protein (FASTA format).
Output
The primary amino acid sequence of the MS2 lysis protein in FASTA format, which will be used as the query sequence for further analyses.
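Because every later step takes FASTA input, it may help to see how such a file is read. This is a minimal sketch of a FASTA parser using only the standard library; the record shown is a made-up placeholder, not the MS2 lysis protein sequence, which should be taken from the database entry.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

# Hypothetical record for illustration only
example = """>toy_protein hypothetical example
MKTAYIAK
QRQISFVK
"""
records = parse_fasta(example)
print(records)
```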
- Homology Search
Tool used: BLAST
Objective
To identify homologous proteins from other bacteriophages or organisms in order to understand evolutionary relationships and detect conserved regions within the lysis protein family.
Input
Query amino acid sequence of the MS2 lysis protein (FASTA format).
Output
A list of homologous protein sequences with similarity scores, E-values, and alignment statistics.
Selected homologous sequences for further comparative analysis.
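Selecting homologs usually means filtering hits by E-value. The sketch below assumes the BLAST results were exported in the standard 12-column tabular format; the two hit lines are invented for illustration, and the `1e-5` cutoff is a common but arbitrary choice rather than a value from this assignment.

```python
# Column order of BLAST's default tabular output
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(tabular_text, max_evalue=1e-5):
    """Return (subject id, percent identity, E-value) for hits under the cutoff."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = dict(zip(BLAST_COLUMNS, line.split("\t")))
        evalue = float(fields["evalue"])
        if evalue <= max_evalue:
            hits.append((fields["sseqid"], float(fields["pident"]), evalue))
    # Most significant hits first
    return sorted(hits, key=lambda h: h[2])

# Hypothetical hits, not real BLAST output
example = "\n".join([
    "query\thit_A\t92.0\t75\t6\t0\t1\t75\t1\t75\t1e-40\t150",
    "query\thit_B\t35.0\t40\t26\t0\t5\t44\t3\t42\t0.5\t25.0",
])
print(filter_hits(example))
```

Only `hit_A` passes the cutoff here; `hit_B` has an E-value of 0.5, which is consistent with a chance match.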
- Multiple Sequence Alignment
Tool used: Clustal Omega
Objective
To compare multiple homologous protein sequences and identify conserved residues that may be functionally or structurally important for the lysis protein.
Input
A set of homologous protein sequences obtained from the BLAST search.
Output
Multiple sequence alignment showing conserved and variable residues.
Identification of highly conserved amino acid positions that may represent functional hotspots.
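One simple way to flag conserved positions is to measure, for each alignment column, how often the most common residue appears. The sketch below is an illustration of that idea on a made-up three-sequence alignment; it is not the scoring Clustal Omega itself reports, and `toy_alignment` is hypothetical.

```python
from collections import Counter

def column_conservation(aligned_seqs):
    """For each column, return (most common residue, fraction of sequences with it)."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for column in zip(*aligned_seqs):
        residues = [c for c in column if c != "-"]  # ignore gaps when voting
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # Gaps still count in the denominator, so gappy columns score lower
        result.append((residue, count / len(column)))
    return result

toy_alignment = ["MKT-LS", "MRTALS", "MKTGLA"]  # hypothetical aligned sequences
scores = column_conservation(toy_alignment)
conserved = [i for i, (_, frac) in enumerate(scores) if frac == 1.0]
print(conserved)  # [0, 2, 4]
```

Column indices here are 0-based positions in the alignment, not residue numbers in any single sequence.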
- Protein Structure Prediction
Tool used: ESMFold
Objective
To predict the three-dimensional structure of the MS2 lysis protein, which provides structural context for understanding protein function and interaction sites.
Input
Amino acid sequence of the MS2 lysis protein in FASTA format.
Output
Predicted 3D protein structure in PDB format.
Confidence metrics indicating the reliability of the predicted structural regions.
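The confidence metric (pLDDT) is typically stored per atom in the B-factor column of the predicted PDB file. The sketch below reads it for Cα atoms using the fixed-column PDB layout. It assumes the scores are on a 0 to 100 scale (rescale first if the output uses 0 to 1), and the cutoff of 70 is a commonly used convention, not a value from this assignment. `make_ca_line` is a hypothetical helper that only builds toy input for the demonstration.

```python
def ca_plddt(pdb_text):
    """Return (residue number, B-factor value) for every CA atom in a PDB string."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append((int(line[22:26]), float(line[60:66])))
    return scores

def make_ca_line(serial, res_name, res_seq, plddt):
    """Build a toy CA record with dummy coordinates (demo input only)."""
    x = y = z = 0.0
    return (f"ATOM  {serial:>5}  CA  {res_name:>3} A{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{plddt:6.2f}           C")

# Hypothetical three-residue structure, not an ESMFold prediction
toy_pdb = "\n".join([
    make_ca_line(1, "MET", 1, 45.0),
    make_ca_line(2, "LYS", 2, 82.5),
    make_ca_line(3, "THR", 3, 91.0),
])
scores = ca_plddt(toy_pdb)
low_confidence = [res for res, s in scores if s < 70]
print(low_confidence)  # [1]
```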
- Functional Site Prediction
Tool used: PeSTo
Objective
To identify residues within the protein structure that are likely involved in molecular interactions, such as membrane association or protein-protein interactions.
Input
Predicted 3D structure of the lysis protein in PDB format.
Output
Predicted interaction residues or functional sites within the protein structure.
Structural regions potentially involved in the lysis mechanism.
- Structural Similarity Search
Tool used: Foldseek
Objective
To identify proteins with similar three-dimensional structures, even if their sequences are not closely related, which may provide insights into structural conservation and potential functional analogs.
Input
Predicted protein structure (PDB format).
Output
A list of proteins with structurally similar folds.
Structural alignment statistics and similarity scores.
- Sequence Optimization
Tool used: ProteinMPNN
Objective
To generate alternative amino acid sequences that are predicted to fold into the same backbone structure, potentially improving protein stability or functional properties.
Input
Protein backbone structure in PDB format.
Output
Optimized or alternative protein sequences predicted to maintain the same structural fold.
Candidate sequences for potential protein engineering or stability improvement.
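A quick way to compare candidate sequences with the original is to compute their sequence identity. The sketch below assumes each design has the same length as the native sequence, which holds when sequences are designed for a fixed backbone. All sequences shown are made up for illustration; they are neither the MS2 lysis protein nor ProteinMPNN output.

```python
def sequence_identity(native, designed):
    """Fraction of positions where two equal-length sequences share a residue."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

# Hypothetical sequences for illustration only
native = "MKTAYIAKQR"
designs = {"design_1": "MKTAYLAKQR", "design_2": "MRSAYLVKER"}

for name, seq in designs.items():
    print(name, sequence_identity(native, seq))
# design_1 0.9
# design_2 0.5
```

A design with low identity to the native sequence is not necessarily worse; refolding it with a structure predictor would be one way to check whether it is still predicted to adopt the same fold.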