Week 4 HW: Protein Design Part I

PART A — Conceptual Questions

How many molecules of amino acids are in 500 g of meat?

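Assuming the full 500 g is protein and an average amino acid mass of about 100 g/mol: 500 g ÷ 100 g/mol = 5 mol

Total number of molecules: 5 mol × 6.022 × 10²³ molecules/mol

≈ 3 × 10²⁴ amino acid molecules

Meat is only roughly one fifth to one quarter protein by mass, so this figure is an upper-bound estimate.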
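The arithmetic above can be reproduced with a short Python sketch. The 100 g/mol average mass is the same rough assumption used in the estimate, and `protein_fraction` is a hypothetical parameter added here only to explore the case where part of the meat is not protein.

```python
AVOGADRO = 6.022e23  # molecules per mole


def amino_acid_count(mass_g, protein_fraction=1.0, avg_mass_g_per_mol=100.0):
    """Estimate the number of amino acid molecules in a sample.

    protein_fraction and avg_mass_g_per_mol are rough assumptions,
    not measured values.
    """
    moles = mass_g * protein_fraction / avg_mass_g_per_mol
    return moles * AVOGADRO


estimate = amino_acid_count(500)
print(f"{estimate:.2e}")  # prints 3.01e+24
```

Calling `amino_acid_count(500, protein_fraction=0.2)` gives the estimate for meat that is assumed to be 20% protein by mass.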

Why do humans eat beef but do not become a cow, and eat fish but do not become fish?

Proteins from food are not absorbed as intact proteins. Instead, they are broken down during digestion:

Dietary proteins are digested into individual amino acids.
These amino acids enter the bloodstream.
Human cells use them as building blocks to synthesize human proteins according to human DNA instructions. The amino acids themselves carry no record of the organism they came from.

Why are there only 20 natural amino acids?

The genetic code evolved to encode 20 canonical amino acids through the translation machinery involving:

ribosomes
tRNA molecules
aminoacyl-tRNA synthetases

These 20 amino acids provide sufficient chemical diversity to build complex protein structures while maintaining translation accuracy and efficiency.

Can we make non-natural amino acids?

Yes. Scientists can design and synthesize non-natural amino acids.

Examples include:

Fluorinated amino acids, which can improve protein stability.

Where did amino acids come from before enzymes and life existed?

Several hypotheses explain the origin of amino acids:

Prebiotic chemistry

The Miller–Urey experiment demonstrated that amino acids can form from simple gases under conditions resembling early Earth.

Meteorites

Some meteorites contain organic molecules including amino acids.

If an α-helix were made from D-amino acids, what handedness would it have?

Natural proteins consist of L-amino acids, which form right-handed α-helices.

If the helix were made entirely from D-amino acids, the stereochemistry would reverse, producing a left-handed α-helix.

Why are most molecular helices right-handed?

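Most biological helices are right-handed because their building blocks are chiral. Proteins are composed almost exclusively of L-amino acids, and with L-residues a right-handed α-helix keeps the side chains clear of the backbone, whereas the left-handed form produces steric clashes and is less stable.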

Why do β-sheets tend to aggregate?

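β-sheets have an extended, relatively flat structure and strong backbone hydrogen bonding, and the strands at the edge of a sheet leave backbone hydrogen-bonding groups unpaired.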

Because of this:

Multiple β-sheets can stack together
Intermolecular hydrogen bonds stabilize aggregates

Why do many amyloid diseases form β-sheets?

In diseases such as Alzheimer’s disease, proteins misfold into structures dominated by β-sheet stacking.

These structures form amyloid fibrils, which are highly stable and resistant to degradation, leading to toxic protein aggregates.

PART B — Protein Analysis and Visualization

Selected Protein

The selected protein is Green Fluorescent Protein (GFP) from the jellyfish Aequorea victoria.

This protein was chosen because:

It is widely used as a fluorescent reporter in molecular biology
Its structure is well characterized
It is commonly used in protein engineering studies

Amino Acid Sequence

GFP consists of 238 amino acids.

Frequently occurring residues include:

Glycine
Leucine
Serine

Glycine is common because it contributes flexibility to protein structures.
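Residue frequencies like those listed above can be checked directly from a sequence. The following is a minimal sketch using only the standard library; the sequence shown is a hypothetical toy string, not the real GFP sequence, and should be replaced with the sequence taken from the database entry.

```python
from collections import Counter


def residue_frequencies(sequence):
    """Return (residue, count, percent) tuples sorted by descending count."""
    seq = sequence.upper()
    counts = Counter(seq)
    total = len(seq)
    return [(aa, n, 100.0 * n / total) for aa, n in counts.most_common()]


# Hypothetical toy sequence, NOT the real GFP sequence.
toy_sequence = "GGSLLGSGKG"

frequencies = residue_frequencies(toy_sequence)
for aa, n, pct in frequencies:
    print(f"{aa}: {n} ({pct:.1f}%)")
```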

Bioinformatic Analysis of MS2 Phage Lysis Protein

Sequence retrieval
↓
BLAST homolog search
↓
Multiple sequence alignment (Clustal Omega)
↓
Structure prediction (ESMFold)
↓
Functional site prediction (PeSTo)
↓
Structure similarity search (FoldSeek)
↓
Sequence optimization (ProteinMPNN)

Sequence Retrieval

Objective

To obtain the amino acid sequence of the MS2 bacteriophage lysis protein that will serve as the reference sequence for downstream bioinformatic analyses.

Input

Protein sequence database entry of the MS2 phage lysis protein (FASTA format).

Output

The primary amino acid sequence of the MS2 lysis protein in FASTA format, which will be used as the query sequence for further analyses.
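As an illustration of what the FASTA output looks like in practice, here is a minimal sketch of a FASTA parser in plain Python. The record below is a hypothetical placeholder, not the real MS2 lysis protein sequence, and `parse_fasta` is an illustrative helper rather than part of any library.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical placeholder record, NOT the real MS2 lysis protein.
example = """>toy_lysis_protein placeholder
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
for header, seq in records.items():
    print(header, len(seq))
```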

Homology Search

Tool used: BLAST

Objective

To identify homologous proteins from other bacteriophages or organisms in order to understand evolutionary relationships and detect conserved regions within the lysis protein family.

Input

Query amino acid sequence of the MS2 lysis protein (FASTA format).

Output

A list of homologous protein sequences with similarity scores, E-values, and alignment statistics.

Selected homologous sequences for further comparative analysis.
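Selecting homologs from the hit list can be sketched as a simple E-value filter. This assumes the results were exported in the standard 12-column tabular format (as produced by BLAST+ with `-outfmt 6`); the hit names, the numbers, and the `1e-5` cutoff are hypothetical and chosen only for illustration.

```python
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tabular_text, max_evalue=1e-5):
    """Keep hits at or below max_evalue, sorted from most to least significant."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h["evalue"])


# Hypothetical toy hit table (tab-separated), not real BLAST output.
toy_rows = [
    ["query", "hit_A", "95.0", "75", "3", "0", "1", "75", "1", "75", "1e-40", "150"],
    ["query", "hit_B", "40.0", "60", "30", "2", "5", "64", "3", "62", "2e-6", "48.5"],
    ["query", "hit_C", "28.0", "50", "33", "3", "10", "59", "8", "57", "0.5", "25.0"],
]
toy_table = "\n".join("\t".join(row) for row in toy_rows)

kept = filter_hits(toy_table)
for hit in kept:
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```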

Multiple Sequence Alignment

Tool used: Clustal Omega

Objective

To compare multiple homologous protein sequences and identify conserved residues that may be functionally or structurally important for the lysis protein.

Input

A set of homologous protein sequences obtained from the BLAST search.

Output

Multiple sequence alignment showing conserved and variable residues.

Identification of highly conserved amino acid positions that may represent functional hotspots.
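Identifying conserved positions from an alignment can be sketched as a per-column count. The example assumes the aligned sequences have already been exported from Clustal Omega as equal-length strings with `-` for gaps; the toy alignment and the function names are hypothetical.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """For each alignment column, return (most common symbol, its fraction)."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("All aligned sequences must have the same length")
    result = []
    for i in range(length):
        column = [s[i] for s in aligned_seqs]
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(column)))
    return result


def conserved_positions(aligned_seqs, threshold=1.0):
    """Return 1-based columns whose top residue (not a gap) meets the threshold."""
    return [
        i + 1
        for i, (symbol, frac) in enumerate(column_conservation(aligned_seqs))
        if symbol != "-" and frac >= threshold
    ]


# Hypothetical toy alignment, not real lysis protein homologs.
toy_alignment = ["MKT-LLA", "MRTGLLA", "MKSGLIA"]

print(conserved_positions(toy_alignment))  # prints [1, 5, 7]
```

Lowering `threshold` (for example to 0.6) also reports positions that are conserved in most, but not all, sequences.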

Protein Structure Prediction

Tool used: ESMFold

Objective

To predict the three-dimensional structure of the MS2 lysis protein, which provides structural context for understanding protein function and interaction sites.

Input

Amino acid sequence of the MS2 lysis protein in FASTA format.

Output

Predicted 3D protein structure in PDB format.

Confidence metrics indicating the reliability of the predicted structural regions.
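The confidence metrics can be read back from the predicted structure file. The sketch below assumes the per-residue confidence (pLDDT) is stored in the B-factor column of the PDB file and is on a 0 to 100 scale; both should be checked against the actual ESMFold output. The toy lines and the cutoff of 70 are hypothetical.

```python
def toy_atom_line(serial, name, resname, chain, resseq, bfactor):
    """Build a fixed-width PDB ATOM line (hypothetical helper for toy data)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{bfactor:6.2f}"
    )


def per_residue_confidence(pdb_text):
    """Map residue number -> B-factor value of its C-alpha atom."""
    scores = {}
    for line in pdb_text.splitlines():
        # PDB is fixed-width: atom name in columns 13-16,
        # residue number in 23-26, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


# Hypothetical toy structure: three residues with made-up confidence values.
toy_pdb = "\n".join([
    toy_atom_line(1, " N", "MET", "A", 1, 91.5),
    toy_atom_line(2, " CA", "MET", "A", 1, 91.5),
    toy_atom_line(3, " CA", "GLU", "A", 2, 62.0),
    toy_atom_line(4, " CA", "THR", "A", 3, 45.3),
])

scores = per_residue_confidence(toy_pdb)
low_confidence = [res for res, s in scores.items() if s < 70]
print(low_confidence)  # prints [2, 3]
```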

Functional Site Prediction

Tool used: PeSTo

Objective

To identify residues within the protein structure that are likely involved in molecular interactions, such as membrane association or protein-protein interactions.

Input

Predicted 3D structure of the lysis protein in PDB format.

Output

Predicted interaction residues or functional sites within the protein structure.

Structural regions potentially involved in the lysis mechanism.

Structural Similarity Search

Tool used: FoldSeek

Objective

To identify proteins with similar three-dimensional structures, even if their sequences are not closely related, which may provide insights into structural conservation and potential functional analogs.

Input

Predicted protein structure (PDB format).

Output

A list of proteins with structurally similar folds.

Structural alignment statistics and similarity scores.

Sequence Optimization

Tool used: ProteinMPNN

Objective

To generate alternative amino acid sequences that are predicted to fold into the same backbone structure, potentially improving protein stability or functional properties.

Input

Protein backbone structure in PDB format.

Output

Optimized or alternative protein sequences predicted to maintain the same structural fold.

Candidate sequences for potential protein engineering or stability improvement.
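One simple way to compare candidate sequences with the original is percent identity over the shared backbone positions. The sketch below assumes each designed sequence has the same length as the native one, which holds when the design keeps the backbone fixed; all sequences and names shown are hypothetical, not ProteinMPNN output.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with identical residues in two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must have the same length")
    if not seq_a:
        raise ValueError("Sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical toy sequences, not real designs.
native = "MKTAYIAKQR"
designs = {
    "design_1": "MKTAYIAKQR",
    "design_2": "MRTLYIEKQR",
    "design_3": "AETLLVEKAK",
}

identities = {name: percent_identity(native, seq) for name, seq in designs.items()}
for name, value in identities.items():
    print(f"{name}: {value:.0f}% identical to native")
```

Low identity is not a problem in itself, since the goal is a different sequence with the same fold; the number mainly helps when choosing a diverse set of candidates.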