Week 10 — Advanced Imaging & Measurement Technology

Final Project

For this project, several elements will be measured across the experimental, computational, and synthetic biology stages in order to evaluate the performance of the proposed platform. Because the project is structured as a pipeline, the measurable outputs include nucleic acid quality, sequence-derived features, predicted protein properties, and candidate prioritization metrics.

1. Metagenomic DNA quality and quantity

The first elements to be measured are the concentration, purity, and integrity of extracted metagenomic DNA obtained from Andean environmental samples. These measurements are essential to ensure that the genetic material is suitable for sequencing and downstream bioinformatic analysis.

Quantity and purity will be measured using spectrophotometry or fluorometry, depending on instrument availability.
Integrity will be evaluated by agarose gel electrophoresis, which allows visualization of DNA fragmentation or degradation.

This ensures that only high-quality samples move forward into sequencing workflows.
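A minimal sketch of how spectrophotometer readings could be turned into a pass/fail decision. It relies on the standard conversion of roughly 50 ng/µL per A260 unit for double-stranded DNA. The function names and the acceptance thresholds (A260/A280 between 1.8 and 2.0, at least 10 ng/µL) are illustrative assumptions, not values fixed by this project.

```python
def dsdna_concentration(a260: float, dilution_factor: float = 1.0) -> float:
    """Concentration in ng/uL from absorbance at 260 nm (1 A260 unit ~ 50 ng/uL dsDNA)."""
    return a260 * 50.0 * dilution_factor


def passes_qc(a260: float, a280: float, dilution_factor: float = 1.0,
              min_conc: float = 10.0, ratio_range=(1.8, 2.0)) -> bool:
    """Return True if both concentration and A260/A280 purity meet the (assumed) thresholds."""
    conc = dsdna_concentration(a260, dilution_factor)
    ratio = a260 / a280
    return conc >= min_conc and ratio_range[0] <= ratio <= ratio_range[1]


# Hypothetical readings for one sample
print(dsdna_concentration(0.5), passes_qc(0.5, 0.27))
```

Fluorometric readings would replace `dsdna_concentration` with the instrument's own calibrated value; the purity ratio still requires absorbance data.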

2. Metagenomic sequence data and predicted ORFs

A central output of this project is the set of nucleotide sequences and protein-coding open reading frames (ORFs) recovered from the metagenomic datasets. At this stage, what will be measured is not a physical biomarker, but rather the presence, number, and characteristics of predicted coding sequences.

DNA sequencing will be the main technology used to generate raw sequence data, either from real samples or curated public datasets.

After sequencing, bioinformatic preprocessing will measure:

Number of reads,
Read quality,
Assembly statistics such as contig length and coverage,
Number of predicted ORFs.
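As an illustration of the assembly statistics listed above, the following sketch summarizes a list of contig lengths, including N50 (a widely used contiguity metric that the text does not name explicitly). The function names and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("contig_lengths is empty")


def assembly_summary(contig_lengths):
    """Collect basic assembly statistics in one dictionary."""
    return {
        "n_contigs": len(contig_lengths),
        "total_bp": sum(contig_lengths),
        "longest": max(contig_lengths),
        "n50": n50(contig_lengths),
    }


# Hypothetical contig lengths in base pairs
print(assembly_summary([100, 200, 300, 400]))
```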

3. Protein sequence features and functional predictions

Once protein sequences are predicted, the project will measure sequence-derived features associated with antimicrobial potential. These include properties such as sequence length, amino acid composition, charge, hydrophobicity, and similarity or divergence relative to known proteins. These measurements will be performed computationally using:

Protein language models such as ESM or ProtBERT for sequence embeddings,
Machine learning classification tools to estimate antimicrobial potential,
Clustering or dimensionality reduction methods such as PCA or UMAP to detect novelty in latent space.
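The simpler sequence-derived features (length, composition, charge, hydrophobicity) can be computed directly from the sequence before any language model is involved. The sketch below is one possible way to do that. It assumes a crude net charge at neutral pH (K and R counted as +1, D and E as -1, histidine and termini ignored) and uses the Kyte-Doolittle scale for average hydropathy (GRAVY). The function name and the example sequence are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq: str) -> dict:
    """Basic features of a protein sequence; raises KeyError on non-standard residues."""
    seq = seq.upper()
    counts = Counter(seq)
    length = len(seq)
    return {
        "length": length,
        "composition": {aa: n / length for aa, n in counts.items()},
        # Approximate net charge: basic minus acidic residues
        "net_charge": counts["K"] + counts["R"] - counts["D"] - counts["E"],
        # GRAVY: mean hydropathy over all residues
        "gravy": sum(KYTE_DOOLITTLE[aa] for aa in seq) / length,
    }


# Hypothetical short peptide
print(sequence_features("AKR"))
```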

4. Structural properties of selected protein candidates

For prioritized candidates, the project will measure predicted structural stability and the presence of functional motifs relevant to antimicrobial activity. These measurements will be obtained through:

Computational protein structure prediction, such as AlphaFold,
Structural inspection tools for identifying motifs, folds, or possible interaction surfaces.

The measurable outputs may include:

Predicted three-dimensional structure,
Confidence metrics from structural models,
Inferred features related to protein stability or function.
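One confidence metric that could be extracted from a predicted model is the mean pLDDT. The sketch below assumes the model is a PDB-format file in which the per-residue pLDDT is stored in the B-factor column, as is the case for AlphaFold PDB output. The function name is hypothetical, and only C-alpha atoms are read so that each residue is counted once.

```python
def mean_plddt(pdb_text: str) -> float:
    """Mean pLDDT over residues, read from the B-factor field of C-alpha ATOM records."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in 13-16, B-factor in 61-66 (1-indexed)
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)
```

A candidate could then be ranked or filtered by this value, with the cutoff chosen by the project rather than by this sketch.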

Technologies to be used

The main technologies used in this project will include:

DNA extraction protocols for environmental samples
Spectrophotometry or fluorometry for DNA quantification and purity assessment
Agarose gel electrophoresis for evaluating DNA integrity
DNA sequencing for generating metagenomic datasets
Bioinformatic assembly and ORF prediction tools for recovering coding sequences
Protein language models and machine learning tools for functional prediction
Dimensionality reduction and clustering methods for novelty detection
Protein structure prediction tools such as AlphaFold for evaluating candidate proteins

Waters Part I — Molecular Weight

Based on the predicted amino acid sequence of eGFP and any known modifications, what is the calculated molecular weight?

28006.60 Da

Calculate the molecular weight of the eGFP using the adjacent charge state approach described in the recitation

Selected Values

m1 = 875.4421
m2 = 903.7148

Determination of charge state z

For two adjacent charge-state peaks with m1 < m2, the charge state of the lower-m/z peak (m1) is calculated using: z = (m2 - H) / (m2 - m1)

where:

H = 1.0073 Da (mass of a proton)

Substituting values:

z = (903.7148 - 1.0073) / (903.7148 - 875.4421)
z = 902.7075 / 28.2727 ≈ 31.93 ≈ 32

Therefore:

m1 = 875.4421 → z = 32
m2 = 903.7148 → z = 31

Molecular weight calculation

The molecular weight is calculated using:

MW = z x (m/z - H)

Using m1 = 875.4421, z = 32:

MW = 32 x (875.4421 - 1.0073)
MW = 32 x 874.4348 ≈ 27981.9 Da

Using m2 = 903.7148, z = 31:

MW = 31 x (903.7148 - 1.0073)
MW = 31 x 902.7075 ≈ 27983.9 Da

Averaging the two estimates gives the final experimental molecular weight: MW ≈ 27982.9 Da ≈ 27.98 kDa
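The adjacent charge state calculation above can be reproduced with a short script. This is a sketch using the same two peaks and the same proton mass (1.0073 Da); the function name is an assumption for illustration.

```python
PROTON = 1.0073  # Da, value used in this report


def adjacent_charge_state_mass(m_low: float, m_high: float, proton: float = PROTON):
    """Charge of the lower-m/z peak and mean molecular weight from two adjacent peaks.

    m_low and m_high are neighboring charge-state peaks (m_low < m_high);
    the peak at m_low carries one more charge than the peak at m_high.
    """
    z_low = round((m_high - proton) / (m_high - m_low))
    z_high = z_low - 1
    mw_low = z_low * (m_low - proton)
    mw_high = z_high * (m_high - proton)
    return z_low, (mw_low + mw_high) / 2


z, mw = adjacent_charge_state_mass(875.4421, 903.7148)
print(z, round(mw, 1))
```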

Accuracy calculation

The theoretical molecular weight is:

MW_theory = 28006.60 Da

The relative error is calculated as:

Error = |MW_experiment - MW_theory| / MW_theory
Error = |27983 - 28006.60| / 28006.60
Error = 23.60 / 28006.60 ≈ 0.00084

Final accuracy: 0.084 % error (≈ 840 ppm)
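The same error calculation as a small sketch, reporting the result both as a percentage and in ppm. The function name is hypothetical; the input values are those used above.

```python
def relative_error(measured: float, theoretical: float) -> float:
    """Absolute deviation from the theoretical value, as a fraction of it."""
    return abs(measured - theoretical) / theoretical


err = relative_error(27983.0, 28006.60)
print(f"{err * 100:.3f} %  /  {err * 1e6:.0f} ppm")
```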

Can you observe the charge state for the zoomed-in peak in the mass spectrum for the intact eGFP? If yes, what is it? If no, why not?

No, the exact charge state cannot be confidently observed from the zoomed-in peak alone. Although its high m/z suggests a low charge state, the peak is too weak and lacks the clearly resolved neighboring charge-state or isotopic pattern needed for a definite assignment.

Waters Part II — Secondary/Tertiary structure

Based on learnings in the lab, please explain the difference between native and denatured protein conformations. For example, what happens when a protein unfolds? How is that determined with a mass spectrometer? What changes do you see in the mass spectrum between the native and denatured protein analyses?

Native proteins are compact and have fewer accessible ionizable sites, resulting in lower charge states and peaks at higher m/z values. When proteins denature, they unfold and expose more residues, allowing them to acquire more charges during ionization. This leads to a broader charge distribution with peaks at lower m/z values. Thus, the shift in charge state distribution in the mass spectrum reflects protein unfolding.
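To make the relationship between charge state and peak position concrete, the sketch below computes the expected m/z of a protonated ion, m/z = (M + zH) / z, for the theoretical eGFP mass. The function name and the particular charge states looped over are illustrative assumptions; they are not read from the spectra.

```python
PROTON = 1.0073  # Da


def mz(mass: float, z: int, proton: float = PROTON) -> float:
    """Expected m/z of an ion of neutral mass `mass` carrying z protons."""
    return (mass + z * proton) / z


# Fewer charges place the same protein at higher m/z
for z in (10, 20, 32):
    print(z, round(mz(28006.60, z), 2))
```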

Zooming into the native mass spectrum of eGFP from the Waters Xevo G3 QTof MS (see Figure 3), can you discern the charge state of the peak at ~2800 m/z? What is the charge state? How can you tell?

The charge state of the peak at approximately 2800 m/z can be determined by analyzing the spacing between the isotopic peaks in the zoomed-in spectrum.

In mass spectrometry, the spacing between isotopic peaks is equal to:

Δ(m/z) = 1 / z

From the zoomed-in region, the distance between adjacent peaks is approximately 0.33 m/z.

Using this relationship:

z = 1 / Δ(m/z)
z ≈ 1 / 0.33 ≈ 3

Therefore, the charge state of the peak is:

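z = 3

As a cross-check against the intact mass, a ~28 kDa protein carrying three charges would appear near 9,300 m/z, whereas a peak near 2,800 m/z corresponds to z ≈ 10 (isotope spacing ≈ 0.1 m/z). The measured spacing should therefore be re-checked against Figure 3 before this value is taken as the charge state of intact eGFP.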
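The isotope-spacing rule can be written as a one-line function. This sketch uses the 1 / z approximation from this report by default; the optional `isotope_delta` parameter is an assumption added here so that the slightly more precise 13C-12C mass difference (about 1.00335 Da) can be used instead.

```python
def charge_from_isotope_spacing(spacing: float, isotope_delta: float = 1.0) -> int:
    """Charge state from the m/z distance between adjacent isotopic peaks."""
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    return round(isotope_delta / spacing)


for spacing in (0.5, 0.33, 0.1):
    print(spacing, charge_from_isotope_spacing(spacing))
```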

Peptide Mapping - primary structure

How many Lysines (K) and Arginines (R) are in eGFP? Please circle or highlight them in the eGFP sequence given in Waters Part I question 1 above. (Note: adding the sequence to Benchling as an amino acid file and clicking biochemical properties tab will show you a count for each amino acid).
How many peptides will be generated from tryptic digestion of eGFP?

Navigate to https://web.expasy.org/peptide_mass/
Copy/paste the sequence above into the input box in the PeptideMass tool to generate expected list of peptides.
Use Figure 4 below as a guide for the relevant parameters to predict peptides from eGFP.
Click “Perform the Cleavage” button in the PeptideMass tool and report the number of peptides generated when using trypsin to perform the digest.
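The digest performed by the PeptideMass tool can be approximated locally for a quick sanity check. The sketch below assumes the common trypsin rule (cleave after K or R, except when the next residue is P) with no missed cleavages. The function names are hypothetical and the sequence is a toy example, not eGFP. Tool settings such as a minimum peptide mass filter can make the reported count differ from this simple rule.

```python
import re


def count_k_r(sequence: str) -> dict:
    """Number of lysines and arginines in the sequence."""
    seq = sequence.upper()
    return {"K": seq.count("K"), "R": seq.count("R")}


def tryptic_peptides(sequence: str) -> list:
    """Peptides from cleavage after K/R (not before P), no missed cleavages."""
    # Zero-width split: position preceded by K or R and not followed by P
    pieces = re.split(r"(?<=[KR])(?!P)", sequence.upper())
    return [p for p in pieces if p]


toy = "AKRPGKDER"  # toy sequence for illustration only
print(count_k_r(toy), tryptic_peptides(toy))
```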

Based on the LC-MS data for the Peptide Map data generated in lab (please use Figure 5a as a reference) how many chromatographic peaks do you see in the eGFP peptide map between 0.5 and 6 minutes? You may count all peaks that are >10% relative abundance.

Between 0.5 and 6 minutes, approximately 18 to 19 chromatographic peaks above 10% relative abundance can be observed in the eGFP peptide map. Only prominent peaks were counted, while smaller signals near the baseline were excluded. The exact number varies slightly depending on how the threshold is interpreted.

Assuming all the peaks are peptides, does the number of peaks match the number of peptides predicted from question 2 above? Are there more peaks in the chromatogram or fewer?

The predicted number of peptides was 19, which is approximately consistent with the number of chromatographic peaks observed above the selected threshold. Therefore, the chromatogram shows about the same number of peaks as the predicted peptides. Any small difference would likely be due to co-elution, low-abundance peptides, or non-peptide signals.

Identify the mass to charge (m/z) of the peptide shown in Figure 5b. What is the charge (z) of the most abundant charge state of the peptide (use the separation of the isotopes to determine the charge state)? Calculate the mass of the singly charged form of the peptide ([M+H]+) based on its m/z and z.

The most abundant peak of the peptide is at m/z 525.76712. The isotope spacing is approximately 0.5 m/z, which indicates a charge state of z = 2 because Δ(m/z) = 1 / z. The singly charged form follows from [M+H]+ = z x (m/z) - (z - 1) x H, so for z = 2: [M+H]+ = 2 x 525.76712 - 1.0073 = 1050.5269 ≈ 1050.53 Da.
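The conversion from a multiply charged peak to the singly protonated mass can be sketched as follows, using the m/z and charge from this answer and the same proton mass (1.0073 Da). The function names are hypothetical.

```python
PROTON = 1.0073  # Da


def neutral_mass(mz: float, z: int, proton: float = PROTON) -> float:
    """Neutral monoisotopic mass M from an [M + zH]z+ peak."""
    return z * (mz - proton)


def singly_protonated(mz: float, z: int, proton: float = PROTON) -> float:
    """[M+H]+ value corresponding to an observed m/z at charge z."""
    return neutral_mass(mz, z, proton) + proton


print(round(singly_protonated(525.76712, 2), 4))
```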

Identify the peptide based on comparison to expected masses in the PeptideMass tool. What is mass accuracy of measurement? Please calculate the error in ppm.

The experimental mass (~1050.53 Da) matches the theoretical peptide mass of 1050.5214 Da corresponding to the sequence FEGDTLVNR. The mass error is calculated as:

error = ((1050.52694 - 1050.5214) / 1050.5214) x 10^6 ≈ 5.27 ppm
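As a check on the peptide assignment, the theoretical [M+H]+ of FEGDTLVNR can be recomputed from standard monoisotopic residue masses and compared with the measured value. This is a sketch, not the PeptideMass implementation; the function names are hypothetical, and only the residues needed for this peptide are included in the table.

```python
# Monoisotopic residue masses (Da) for the residues in FEGDTLVNR
RESIDUE_MASS = {
    "F": 147.06841, "E": 129.04259, "G": 57.02146, "D": 115.02694,
    "T": 101.04768, "L": 113.08406, "V": 99.06841, "N": 114.04293,
    "R": 156.10111,
}
WATER = 18.01056   # Da, added once per peptide
PROTON = 1.00728   # Da


def mh_plus(peptide: str) -> float:
    """Theoretical monoisotopic [M+H]+ of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def ppm_error(measured: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


print(round(mh_plus("FEGDTLVNR"), 4))
print(round(ppm_error(1050.52694, 1050.5214), 2))
```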

What is the percentage of the sequence that is confirmed by peptide mapping?

The percentage of the sequence confirmed by peptide mapping is 88%, as indicated by the coverage map in Figure 6.

Oligomers

The requested oligomeric species are located approximately at:

7FU Decamer: 3.4 MDa
8FU Didecamer: 8.33 MDa
8FU 3-Decamer: 12.67 MDa
8FU 4-Decamer: 16 - 17 MDa

The peaks do not fall exactly on the theoretical masses, but they align closely enough to assign those oligomeric states.

| | Theoretical | Observed (LC-MS) | PPM Error |
|---|---|---|---|
| Molecular weight (kDa) | 28.0066 | 27.983 | ~840 ppm |