Week 4 HW: Protein design part 1

Part A. Conceptual Questions
1.Why do humans eat beef but do not become a cow, eat fish but do not become fish?
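Because digestion breaks dietary proteins down into free amino acids (and short peptides), which carry no record of the cow or fish sequence they came from. Human cells then reassemble these amino acids on ribosomes in the order specified by human DNA (via mRNA), so the resulting proteins are human ones rather than those of cows or fish.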
2.Why are there only 20 natural amino acids?
The 20 amino acids already cover the basic chemical requirements, such as hydrophobicity, hydrophilicity, and charge. Adding more amino acid types would require corresponding tRNAs and aminoacyl-tRNA synthetases, so the evolutionary cost would likely be too high.
3.Can you make other non-natural amino acids? Design some new amino acids.
Yes. By expanding the genetic code, scientists have incorporated over 100 types of non-natural amino acids into proteins. One example is photosensitive amino acids: their side chains carry light-sensitive groups, allowing the protein to be controlled with ultraviolet light.
4.Where did amino acids come from before enzymes that make them, and before life started?
Miller-Urey experiment: in a simulated primitive atmosphere (electric sparks as lightning + water vapor + methane + ammonia), simple amino acids such as glycine and alanine form spontaneously.
5.If you make an α-helix using D-amino acids, what handedness (right or left) would you expect?
Left-handed
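The reasoning behind this answer is that D-amino acids are mirror images of L-amino acids, and a mirror reflection reverses the sign of every dihedral angle, so a right-handed helix maps onto a left-handed one. The sketch below checks this numerically on an idealized helical trace (one point per residue, 100° turn and 1.5 Å rise per residue, 2.3 Å radius). The function names and the simplified one-point-per-residue model are my own assumptions, not real backbone coordinates.

```python
import math

def helix_points(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized right-handed helix, one point per residue (C-alpha-like trace)."""
    pts = []
    for k in range(n):
        t = math.radians(turn_deg * k)
        pts.append((radius * math.cos(t), radius * math.sin(t), rise * k))
    return pts

def mirror(points):
    """Reflect through the yz-plane (x -> -x), i.e. take the mirror image."""
    return [(-x, y, z) for x, y, z in points]

def _sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    b1_len = math.sqrt(_dot(b1, b1))
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1) / b1_len
    return math.degrees(math.atan2(y, x))

right = helix_points(4)   # right-handed trace (L-amino acid case)
left = mirror(right)      # its mirror image (D-amino acid case)
print(f"right-handed trace: {dihedral(*right):+.1f} deg")
print(f"mirror image:       {dihedral(*left):+.1f} deg")
```

The two printed angles have equal magnitude and opposite sign, which is the geometric signature of opposite handedness.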
6.Can you discover additional helices in proteins?
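π-helix: wider than the α-helix and with a smaller rise per residue (“flatter”), held together by i → i+5 backbone hydrogen bonds; usually short and often found near functional sites.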
7.Why are most molecular helices right-handed?
Steric hindrance between the side chains of L-amino acids and the backbone makes the right-handed helix lower in energy, and therefore more stable, than the left-handed one.
8.Why do β-sheets tend to aggregate? What is the driving force for β-sheet aggregation?
Open hydrogen-bond network: the backbone peptide groups at the edges of a β-sheet have unpaired hydrogen-bond donors (N-H) and acceptors (C=O). The main driving force is inter-strand hydrogen bonding.
Physical mechanism: unlike in an α-helix, where the backbone hydrogen bonds are all satisfied within the helix itself, the edge strands of a β-sheet keep exposed bonding sites. They act like “nylon snaps”, attracting and capturing adjacent peptide chains to lower the energy of the system.
9.Why do many amyloid diseases form β-sheets? Can you use amyloid β-sheets as materials?
Causes of the disease: (1) Extremely high thermodynamic stability: through the “cross-β” structure, a β-sheet can keep extending along the fibril axis, forming stable fibrils with very low energy. (2) Misfolding cascade: a misfolded β-rich template induces normally folded copies of the protein to adopt the β-sheet conformation as well, producing a domino-like aggregation associated with diseases such as Alzheimer’s and Parkinson’s.
Material applications: Yes, this is feasible. Amyloid fibrils have remarkable mechanical strength (comparable to steel wire or silk) and good biocompatibility. They have been applied in the design of biological nanoscaffolds, conductive nanowires, and controlled-release drug carriers.
Part B: Protein Analysis and Visualization
Protein: Green fluorescent protein (GFP)
Reason:
GFP is an ideal model for studying protein “folding resilience”. Its compact, “cylindrical” β-barrel structure keeps the overall fold stable even when the surface carries significant mutations. For me, as a researcher focusing on species design and life information, GFP is not only a fluorescent labeling tool but also a symbol of information visualization in synthetic biology: it demonstrates how the genetic code can be translated directly into colors visible to the naked eye.
Sequence:
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
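As a quick sanity check on the sequence above, the following sketch computes its length and amino acid composition, and looks at residues 65 to 67, where the chromophore-forming tripeptide Ser-Tyr-Gly sits in wild-type GFP. The helper name `composition` is my own; nothing here depends on external libraries.

```python
from collections import Counter

# Sequence copied from the line above (UniProt P42212).
GFP_SEQ = (
    "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQC"
    "FSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHK"
    "LEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKD"
    "PNEKRDHMVLLEFVTAAGITHGMDELYK"
)

def composition(seq):
    """Return {amino acid: fraction of the sequence}, most frequent first."""
    counts = Counter(seq)
    total = len(seq)
    return {aa: n / total for aa, n in counts.most_common()}

print("length:", len(GFP_SEQ))
# Residues 65-67 in 1-based numbering are indices 64..66 in Python.
print("residues 65-67:", GFP_SEQ[64:67])
for aa, frac in list(composition(GFP_SEQ).items())[:5]:
    print(f"{aa}: {frac:.1%}")
```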
Family:
Using the UniProt BLAST tool, I identified 205 sequence homologs for the GFP sequence (P42212). The protein is a member of the GFP family (Pfam: PF01353), characterized by its conserved structural and functional domains.

Identify the structure page of your protein in RCSB:
The structure was published in J. Am. Chem. Soc. in 2006 and is of very good quality, with a resolution of only 1.2 Å.
- Structure Quality: This structure was solved on April 18, 2006, using X-ray diffraction. It is of exceptional quality, with a resolution of 1.20 Å, which is significantly better than 2.70 Å.
- Ligands and Molecules: Apart from the protein chain, the solved structure contains a Magnesium ion (MG) as a unique ligand, as well as several water molecules.
- Classification and Family: The protein belongs to the GFP family (Pfam: PF01353) and is classified under the GFP-like structural family. It features a distinct “β-can” fold (a β-barrel) that protects the internal chromophore.
3D molecule visualization:
Visualize the protein as “cartoon”, “ribbon” and “ball and stick”.

These representations were particularly helpful for understanding the protein from different structural perspectives.
Visualizing the protein in different modes allows for a comprehensive understanding of its multi-level organization. The cartoon mode highlights the global architecture (the β-barrel), the ribbon mode tracks the path of the polypeptide backbone, and the ball-and-stick model reveals the intricate atomic interactions and the precise positioning of side chains within the internal chromophore environment.
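The three views described above could be reproduced with a short PyMOL script along these lines. This is only a sketch: it assumes PyMOL's Python API is installed, `pdb_id` is a placeholder for the RCSB entry (the ID is not stated in this write-up), the output file names are my own choice, and "sticks" is used as an approximation of the ball-and-stick view.

```python
def view_plan(pdb_id, representations=("cartoon", "ribbon", "sticks")):
    """Return (representation, output file name) pairs, one image per mode."""
    return [(rep, f"{pdb_id}_{rep}.png") for rep in representations]

def render_views(pdb_id):
    """Fetch a structure and save one image per representation."""
    # Imported here so the planning helper above works without PyMOL installed.
    from pymol import cmd

    cmd.fetch(pdb_id)              # download the entry from the PDB
    for rep, filename in view_plan(pdb_id):
        cmd.hide("everything")     # clear the previous representation
        cmd.show(rep)              # cartoon, ribbon, or sticks
        cmd.orient()               # center and align the molecule in view
        cmd.png(filename, width=1200, height=900, dpi=300, ray=1)

# Usage (placeholder ID, replace with the actual RCSB entry):
# render_views("XXXX")
```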
Secondary Structure:

- Observation result: In the PyMOL rendering (Secondary Structure.jpg), β-sheets are colored purple and α-helices light green.
- Structural comparison: This protein clearly contains far more β-strands than α-helices.
- Detailed analysis: As a classic β-barrel, it is formed by 11 antiparallel β-strands that enclose a closed cylindrical shell. The α-helices are found only in the center of the barrel and in the short loops connecting the strands.
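The helix-versus-sheet comparison above can also be quantified. A minimal sketch, assuming a DSSP-style per-residue secondary-structure string is available (H/G/I for helices, E/B for strands, anything else counted as other); the `toy` string is a made-up example, not the actual assignment for GFP.

```python
def ss_fractions(dssp):
    """Fractions of helix, strand, and other residues in a DSSP-style string."""
    n = len(dssp)
    helix = sum(dssp.count(code) for code in "HGI")   # alpha, 3-10, pi helices
    strand = sum(dssp.count(code) for code in "EB")   # extended strand, bridge
    return {
        "helix": helix / n,
        "strand": strand / n,
        "other": (n - helix - strand) / n,
    }

# Hypothetical assignment string, for illustration only.
toy = "--EEEEEEE--HHHH--EEEEEEEE---EEEEEEE-"
for name, frac in ss_fractions(toy).items():
    print(f"{name}: {frac:.0%}")
```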
Hole:

The surface of GFP is highly compact with no functional pores or binding pockets. The cyan-colored indentations are simply the result of the 11 β-strands packing tightly together. This ‘molecular cage’ creates a shielded, hydrophobic core that protects the chromophore from external quenchers, ensuring stable fluorescence.
Part C. Using ML-Based Protein Design Tools
C1. Protein Language Modeling

One noticeable pattern in the heatmap is that mutations to cysteine (C) and tryptophan (W) show very low scores (dark purple) across many positions. This suggests that introducing these residues, cysteine in particular, is generally unfavorable for this protein. A possible explanation is that cysteine can form disulfide bonds, which may disrupt the existing protein structure if introduced at inappropriate positions.
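The column-wise pattern described above can be checked numerically by averaging the scores for each substituted residue over all positions. This is a sketch under assumptions: the score matrix is taken to have one row per sequence position and one column per amino acid in the order of `AMINO_ACIDS`, and the `toy` matrix is invented for illustration rather than taken from the actual heatmap.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mean_score_per_substitution(scores):
    """Average each amino-acid column over positions; lowest mean first."""
    n_pos = len(scores)
    means = {}
    for j, aa in enumerate(AMINO_ACIDS):
        means[aa] = sum(row[j] for row in scores) / n_pos
    return sorted(means.items(), key=lambda item: item[1])

# Invented scores: 3 positions, every substitution at -1.0 except C and W.
toy = [[-1.0] * len(AMINO_ACIDS) for _ in range(3)]
for row in toy:
    row[AMINO_ACIDS.index("C")] = -5.0
    row[AMINO_ACIDS.index("W")] = -4.0

ranked = mean_score_per_substitution(toy)
print("least favorable substitutions:", ranked[:2])
```

Applied to the real score matrix, the residues at the top of this ranking would correspond to the darkest columns of the heatmap.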
Although I was not able to clearly locate my specific sequence in the map, the overall distribution indicates that the sequences occupy a shared region of latent space, consistent with proteins that have related sequence characteristics.
C2. Protein Folding


C3. Protein Generation


