Week 10 HW: Advanced Imaging

Homework: Final Project

Q1. Please identify at least one (ideally many) aspect(s) of your project that you will measure.

This project has four distinct measurable outputs that span computational filtering, antimicrobial activity, drug synergy, and Gram selectivity:

Peptide physicochemical properties (computational, pre-synthesis): During AI candidate selection, I’ll measure charge, amphipathicity, and hydrophobic moment of ~2,000 AMP-Diffusion candidates, as well as CLIP binding scores for PepPrCLIP candidates against E. coli FtsZ and LpxC targets. PeptiVerse provides predicted hemolysis probability, solubility, and toxicity scores for all final candidates.
Bacterial growth inhibition ($\text{OD}_{600}$): This is the core experimental measurement. After expressing each peptide via cell-free protein synthesis, I’ll read optical density at 600 nm on both E. coli ATCC 25922 and B. subtilis ATCC 6633 plates after overnight incubation. Each peptide’s $\text{OD}_{600}$ is compared to the scrambled-peptide negative control to calculate percent growth inhibition, producing a 2D activity matrix (peptide $\times$ organism).
Fractional Inhibitory Concentration Index (FICI) for synergy: For the top 5–6 active peptides, I’ll measure $\text{OD}_{600}$ of co-expressed pairs (both DNA templates at half-dose in one CFPS reaction) vs. each peptide expressed alone at half-dose. FICI classifies each pair as synergistic ($\leq 0.5$), additive ($0.5$–$1.0$), or indifferent/antagonistic ($> 1.0$). This is the measurement that directly answers the central hypothesis about whether cross-method AMP pairs are more synergistic than within-method pairs.
Gram-selectivity profiles: Running every peptide against both organisms generates a selectivity ratio (% inhibition on E. coli vs. B. subtilis). This is especially important for Group C constructs; if MadSBM becomes available, the 25/50/75% interpolants between magainin-2 (Gram-negative) and HNP-1 (Gram-positive) should show a measurable shift in this ratio.

Q2. Please describe all of the elements you would like to measure, and furthermore describe how you will perform these measurements.

Computational measurements are performed before any wet lab work. AMP-Diffusion generates ~2,000 candidate sequences at lengths 20/25/30/35 amino acids; I filter these programmatically by physicochemical properties (removing sequences with unfavorable charge, low amphipathicity, or homopolymer runs) and select the top 6 diverse candidates plus 3 fallbacks. PepPrCLIP ranks ~100K candidates per target by CLIP binding score, and I take the top 2 per target. PeptiVerse runs as a HuggingFace web app and returns developability predictions per peptide.
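A minimal sketch of the kind of programmatic prefilter described above. The cutoffs (`min_charge=2`, `max_run=3`) and the function names are illustrative assumptions, not values from the project. The charge estimate simply counts K/R as +1 and D/E as -1, and amphipathicity or hydrophobic-moment scoring is left out.

```python
import re

def net_charge(seq: str) -> int:
    """Rough net charge near pH 7: +1 per K/R, -1 per D/E (His ignored)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative

def longest_run(seq: str) -> int:
    """Length of the longest homopolymer run, e.g. 'AKKKL' -> 3."""
    runs = (len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))
    return max(runs, default=0)

def passes_filter(seq: str, min_charge: int = 2, max_run: int = 3) -> bool:
    # Both cutoffs are hypothetical placeholders that would need tuning.
    return net_charge(seq) >= min_charge and longest_run(seq) <= max_run

MAGAININ_2 = "GIGKFLHSAKKFGKAFVGEIMNS"  # known AMP, used here as a sanity check
candidates = [
    MAGAININ_2,
    "AAAAAKKKKKAAAAAKKKKK",  # cationic but full of homopolymer runs
    "GDEELLDSAEEFGDAFVGEI",  # strongly anionic
]
kept = [s for s in candidates if passes_filter(s)]
print(kept)
```

In practice the surviving sequences would then be ranked and clustered for diversity before picking the final six plus fallbacks.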

The $\text{OD}_{600}$ growth inhibition assay follows a standard broth microdilution format. I dilute overnight cultures of each organism to ~$5 \times 10^5$ CFU/mL in Mueller-Hinton broth, dispense 100 $\mu\text{L}$ per well of a 96-well flat-bottom plate, then add 5 $\mu\text{L}$ of crude CFPS reaction to each well. After overnight incubation at 37 °C, I read absorbance at 600 nm using a plate reader. Three biological replicates per construct (45 reactions for 15 constructs, plus 9 control reactions) enable statistical comparison. The same CFPS reactions are split across two plates (one per organism), so expression variability is controlled between the two bacterial targets.

Synergy measurement uses the same $\text{OD}_{600}$ readout but with modified CFPS input: two DNA templates at half-dose (25–50 ng each) in a single 20 $\mu\text{L}$ reaction, alongside single-agent half-dose controls. I then calculate FICI from the resulting inhibition values, separately for each organism. Cross-method pairings (e.g., AMP-Diffusion generalist + PepPrCLIP targeted binder) are prioritized because they test the central synergy hypothesis most directly.

Gram-selectivity measurement is a derived metric: no separate experiment is needed. By reading the same CFPS reactions against both organisms in parallel, every peptide’s selectivity ratio drops out of the primary screen data automatically.
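The derived metrics above could be computed along these lines. This is a sketch under stated assumptions: the function names are made up for illustration, the optional blank subtraction is an assumption not specified in the plan, and the FICI value itself is taken as an input because its formula is not spelled out here. Only the classification thresholds come from Q1.

```python
def percent_inhibition(od_peptide: float, od_control: float, od_blank: float = 0.0) -> float:
    """Percent growth inhibition relative to the scrambled-peptide control."""
    growth = (od_peptide - od_blank) / (od_control - od_blank)
    return 100.0 * (1.0 - growth)

def selectivity_ratio(inhib_ecoli: float, inhib_bsub: float) -> float:
    """% inhibition on E. coli divided by % inhibition on B. subtilis."""
    if inhib_bsub == 0:
        return float("inf")  # no measurable B. subtilis inhibition
    return inhib_ecoli / inhib_bsub

def classify_fici(fici: float) -> str:
    """Thresholds as stated in Q1: <= 0.5, 0.5-1.0, > 1.0."""
    if fici <= 0.5:
        return "synergistic"
    if fici <= 1.0:
        return "additive"
    return "indifferent/antagonistic"

# Hypothetical plate readings for one peptide (not real data).
ecoli = percent_inhibition(od_peptide=0.20, od_control=0.80)
bsub = percent_inhibition(od_peptide=0.60, od_control=0.80)
print(ecoli, bsub, selectivity_ratio(ecoli, bsub), classify_fici(0.4))
```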

Q3. What are the technologies you will use? Describe in detail.

Cell-free protein synthesis (CFPS): The Ginkgo Bioworks E. coli cell-free kit (BL21 Star DE3 lysate, T7 RNA polymerase-driven) is the expression platform. Linear Twist gene fragments serve directly as templates, with no cloning required. Each construct carries a T7 promoter, strong RBS, the codon-optimized peptide ORF, and a T7 terminator. NEBExpress GamS Nuclease Inhibitor (NEB #P0774S) is added at ~0.6 $\mu\text{g}$ per 20 $\mu\text{L}$ reaction to protect linear DNA from RecBCD exonuclease degradation in the crude lysate. Reactions run at 30 $^\circ\text{C}$ for 4 hours.
Synthetic gene fragments (Twist Bioscience): 15 linear DNA constructs ($\geq 300$ bp each, padded with inert flanking sequence to meet Twist’s minimum) are ordered as gene fragments. This is DNA synthesis, not cloning; the fragments arrive ready for direct use in CFPS.
$\text{OD}_{600}$ plate reader (spectrophotometry): A standard microplate reader measuring optical density at 600 nm is the primary analytical instrument. It quantifies bacterial growth in 96-well format, enabling high-throughput comparison of all peptides and combinations across both organisms in a single read.
AI/ML peptide design tools: AMP-Diffusion (diffusion-based generative model for antimicrobial peptide sequences), PepPrCLIP (CLIP-based peptide design using the 650M-parameter ESM-2 protein language model, run on Google Colab with GPU), and potentially MadSBM (latent-space interpolation between known AMPs). These are the computational “technologies” that generate the candidate peptides before any synthesis.
Codon optimization: Selected peptide sequences are reverse-translated and codon-optimized for E. coli expression (likely using IDT or Benchling codon optimization tools) to maximize translational efficiency in the BL21-derived CFPS lysate.
Standard microbiology (Mueller-Hinton broth microdilution): This CLSI-standard antimicrobial susceptibility testing method uses the two reference strains E. coli ATCC 25922 and B. subtilis ATCC 6633, both standard quality-control organisms for susceptibility testing, in 96-well format.

Part I — Molecular Weight

Q1. Calculate the theoretical molecular weight of eGFP from the amino acid sequence using ExPASy ProtParam.

The full eGFP construct (247 amino acids, including the C-terminal LE linker and $\text{His}_6$ tag) was submitted to ExPASy ProtParam:

MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
VTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVN
RIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYKLE
HHHHHH

ProtParam reports:

Property	Value
Number of amino acids	247
Average molecular weight	28,006.60 Da
Monoisotopic molecular weight	27,988.96 Da
Theoretical pI	~6.2

The average molecular weight (28,006.60 Da) is the reference value used below for accuracy calculations. Note that this theoretical value does not account for eGFP chromophore maturation, which removes approximately 20 Da (one water loss + one oxidation) via autocatalytic cyclization of residues Thr65–Tyr66–Gly67. The expected mass of the mature protein would therefore be closer to $28{,}006.60 - 20.03 \approx 27{,}986.57$ Da.

Q2. Determine the molecular weight from the LC-MS charge state envelope using the adjacent-charge-state method.

In electrospray ionization, a protein of mass $M$ carrying $z$ protons (each of mass $H = 1.00728$ Da) appears at:

$$\frac{m}{z} = \frac{M + z \cdot H}{z}$$

For two adjacent peaks where Peak A has charge $z$ and Peak B has charge $z - 1$:

$$z = \frac{(m/z)_B - H}{(m/z)_B - (m/z)_A}$$

and then:

$$M = z \left[\left(\frac{m}{z}\right)_A - H\right]$$

Worked example: using the peaks at $m/z$ = 903.7148 (Peak A) and 933.8044 (Peak B):

$$z_A = \frac{933.8044 - 1.00728}{933.8044 - 903.7148} = \frac{932.7971}{30.0896} = 31.001 \approx 31$$

$$M = 31 \times (903.7148 - 1.00728) = 31 \times 902.708 = 27{,}983.9 \;\text{Da}$$

Cross-check table (five adjacent pairs from the denatured LC-MS spectrum):

$(m/z)_A$	$(m/z)_B$	$z_A$ (calc → round)	$M$ (Da)
757.3019	778.2277	37.14 → 37	27,982.9
903.7148	933.8044	31.00 → 31	27,983.9
933.8044	965.9584	30.01 → 30	27,983.9
965.9584	1000.5021	28.93 → 29	27,983.6
1000.5021	1037.4423	28.06 → 28	27,985.9

$$M_{\text{experiment}} = \text{mean of five values} \approx \mathbf{27{,}984.0 \;\text{Da}}$$
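The cross-check table can be reproduced with a few lines of Python. This is a sketch of the adjacent-charge-state formulas above, using the same proton mass; the function and variable names are illustrative.

```python
PROTON = 1.00728  # Da, the value of H used in the formulas above

def charge_and_mass(mz_a: float, mz_b: float):
    """Peak A carries charge z; peak B (higher m/z) carries charge z - 1."""
    z_raw = (mz_b - PROTON) / (mz_b - mz_a)
    z = round(z_raw)
    mass = z * (mz_a - PROTON)
    return z_raw, z, mass

# Adjacent peak pairs from the denatured LC-MS spectrum (same as the table).
pairs = [
    (757.3019, 778.2277),
    (903.7148, 933.8044),
    (933.8044, 965.9584),
    (965.9584, 1000.5021),
    (1000.5021, 1037.4423),
]

results = [charge_and_mass(a, b) for a, b in pairs]
charges = [z for _, z, _ in results]
mean_mass = sum(mass for _, _, mass in results) / len(results)

for (a, b), (z_raw, z, mass) in zip(pairs, results):
    print(f"{a:10.4f} {b:10.4f}  z = {z_raw:6.3f} -> {z:2d}  M = {mass:9.1f} Da")
print(f"mean M = {mean_mass:.1f} Da")
```

Rounding the raw charge to the nearest integer before computing the mass is what makes the estimate robust to small peak-picking errors.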

Q2 (cont.). Calculate the accuracy of the measurement.

$$\text{Accuracy} = \frac{|M_{\text{exp}} - M_{\text{theory}}|}{M_{\text{theory}}} = \frac{|27{,}984.0 - 28{,}006.6|}{28{,}006.6} = \frac{22.6}{28{,}006.6} = 0.081\% \approx \mathbf{810 \;\text{ppm}}$$

This is relative to the average theoretical mass from ProtParam. Comparing instead to the monoisotopic mass (27,988.96 Da) gives $|27{,}984.0 - 27{,}988.96|/27{,}988.96 \approx 177$ ppm, but that is not a like-for-like comparison: for a protein of this size, a mass derived from charge-state peak positions tracks the average mass much more closely than the monoisotopic mass. The more meaningful reference is the average mass corrected for the ~20 Da chromophore maturation ($M_\text{theory,mature} \approx 27{,}986.6$ Da), against which the error is roughly 90 ppm. The remaining discrepancy is well within the expected accuracy of intact-protein ESI-MS deconvolution.
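To make the three comparisons explicit, the short sketch below recomputes each ppm error from the masses quoted above. The variable names are illustrative, and the 20.03 Da maturation loss is the value used in Q1.

```python
def ppm_error(observed: float, reference: float) -> float:
    """Relative mass error in parts per million."""
    return abs(observed - reference) / reference * 1e6

M_EXP = 27984.0          # Da, mean from the adjacent-charge-state method
M_AVG = 28006.60         # Da, ProtParam average mass
M_MONO = 27988.96        # Da, ProtParam monoisotopic mass
MATURATION_LOSS = 20.03  # Da, water loss plus oxidation in the chromophore

errors = {
    "average": ppm_error(M_EXP, M_AVG),
    "monoisotopic": ppm_error(M_EXP, M_MONO),
    "average, mature chromophore": ppm_error(M_EXP, M_AVG - MATURATION_LOSS),
}

for label, value in errors.items():
    print(f"{label:30s} {value:6.1f} ppm")
```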

Q3. Can you observe the charge state from the zoomed-in peak? If yes, what is it? If no, why not?

Whether the charge state can be read directly from a single peak depends on mass resolving power. For eGFP at $z \approx 30$, adjacent isotope peaks in the isotopic envelope are separated by:

$$\Delta(m/z) = \frac{1.003}{z} \approx \frac{1.003}{30} \approx 0.033 \;\text{Da}$$

Resolving this requires $R = m/z / \Delta \approx 1{,}000 / 0.033 \approx 30{,}000$. If the instrument (e.g., Orbitrap) achieves this resolution, the isotope peaks are resolved and the charge state can be determined by:

$$z = \frac{1.003}{\text{spacing between adjacent isotope peaks}}$$

If the zoomed-in inset shows resolved isotope peaks with spacing $\sim$0.033 Da, then $z = 1.003/0.033 \approx 30$, confirming the charge state directly.

If the instrument resolution is insufficient (e.g., a low-resolution QTOF), the isotope peaks merge into a single broad hump and the charge state cannot be determined from that peak alone, so the adjacent-charge-state method (Q2) must be used instead.
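The relationship between charge state, isotope spacing, and required resolving power can be sketched as follows. The helper names are illustrative, and 1.003 Da is the isotope spacing used in the equations above.

```python
ISOTOPE_DELTA = 1.003  # Da, approximate mass difference between adjacent isotopologues

def isotope_spacing(z: int) -> float:
    """Spacing between adjacent isotope peaks on the m/z axis."""
    return ISOTOPE_DELTA / z

def required_resolving_power(mz: float, z: int) -> float:
    """R = (m/z) / delta(m/z) needed to separate adjacent isotope peaks."""
    return mz / isotope_spacing(z)

def charge_from_spacing(spacing: float) -> int:
    """Invert the spacing relation to read the charge from a resolved envelope."""
    return round(ISOTOPE_DELTA / spacing)

print(f"spacing at z = 30: {isotope_spacing(30):.4f}")
print(f"R needed at m/z 1000, z = 30: {required_resolving_power(1000, 30):,.0f}")
print(f"charge for 0.033 spacing: {charge_from_spacing(0.033)}")
```

The same helpers apply to the native spectrum, where the lower charge gives a wider spacing at a higher m/z.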

Part II — Secondary and Tertiary Structure

Q1. Explain the difference between native and denatured protein conformations as seen in mass spectrometry.

In denatured ESI-MS (Figure 2, top panel), the protein is unfolded by organic solvent and acid. The extended chain exposes many basic residues (Lys, Arg, His) to solution, each of which can accept a proton. This produces a broad charge state distribution at high charge states ($z \approx 27$–$37$ for eGFP), so the peaks appear at relatively low $m/z$ values (~750–1050). The wide, multi-peak envelope is a hallmark of a disordered, extended conformation.

In native ESI-MS (Figure 2, bottom panel), the protein is sprayed from a near-physiological buffer (typically ammonium acetate, pH ~7). The protein remains compactly folded, so fewer basic sites are accessible for protonation and the smaller solvent-exposed surface limits how much charge the ion can carry. This results in fewer, lower charge states ($z \approx 9$–$11$ for eGFP), so the peaks appear at high $m/z$ values (~2500–3100). The narrow charge state distribution (often only two or three peaks) directly reflects the compact, globular conformation.

Key insight: the charge state distribution is a proxy for protein conformation. Compact → fewer charges → higher $m/z$; unfolded → more charges → lower $m/z$.

Q2. Zooming into the native mass spectrum at ~2800 m/z (Figure 3), can you discern the charge state? What is it?

Yes. At $m/z \approx 2800$ for a protein of mass ~28,000 Da, the charge state is:

$$z = \frac{M}{m/z} \approx \frac{28{,}000}{2{,}800} = \mathbf{10}$$

This can be confirmed from the isotopic fine structure. If the inset shows resolved isotope peaks, the spacing between adjacent isotopic peaks is:

$$\Delta(m/z) = \frac{1.003}{z} = \frac{1.003}{10} = 0.1003 \;\text{Da}$$

Counting approximately 10 isotope peaks per 1 Da interval, or measuring the spacing directly and computing $z = 1.003 / \Delta$, confirms $z = 10$. Resolving this spacing requires $R = 2800 / 0.10 = 28{,}000$, which is achievable on modern Orbitrap and FT-ICR instruments.

As a consistency check: $(28{,}006.6 + 10 \times 1.007)/10 = 2{,}801.7$ $m/z$, which matches the observed peak position.
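The same consistency check can be run for the neighboring native charge states. This is a small sketch using the ProtParam average mass, with illustrative function names.

```python
PROTON = 1.00728   # Da
M_THEORY = 28006.6  # Da, ProtParam average mass

def mz_for_charge(mass: float, z: int) -> float:
    """Expected m/z of the [M + zH]z+ ion."""
    return (mass + z * PROTON) / z

def nearest_charge(mass: float, mz: float) -> int:
    """Charge state whose predicted m/z lies closest to an observed peak."""
    return round(mass / (mz - PROTON))

predicted = {z: mz_for_charge(M_THEORY, z) for z in (9, 10, 11)}
for z, mz in predicted.items():
    print(f"z = {z:2d}: m/z = {mz:8.2f}")
print("charge at m/z 2800:", nearest_charge(M_THEORY, 2800.0))
```

The three predicted values span roughly 2550 to 3110, which matches the ~2500–3100 window quoted for the native spectrum.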

Part III — Peptide Mapping

Q1. How many Lysines (K) and Arginines (R) are in the eGFP sequence?

The eGFP construct contains 20 Lysines (K) and 6 Arginines (R), for a total of 26 tryptic cleavage sites.

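Sequence, with the positions of the K and R cleavage sites listed below:

MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
VTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVN
RIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYKLE
HHHHHH

K positions (20): 4, 27, 42, 46, 53, 80, 86, 102, 108, 114, 127, 132, 141, 157, 159, 163, 167, 210, 215, 239
R positions (6): 74, 97, 110, 123, 169, 216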

Q2. How many peptides are expected from a tryptic digest? How many have $[\text{M+H}]^+$ > 500 Da?

Using ExPASy PeptideMass with trypsin (cleaves after K and R, no missed cleavages), the 26 cleavage sites produce 27 tryptic peptides.

Of these 27, 19 peptides have a monoisotopic $[\text{M+H}]^+ > 500$ Da and are therefore likely to be detected by LC-MS. The remaining 8 are very small (1–4 residues) and typically fall below the practical detection or retention limit.
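The counts above can be cross-checked with a short in-silico digest. This sketch is independent of PeptideMass: it cleaves after K or R (skipping any site followed by P), allows no missed cleavages, and uses standard monoisotopic residue masses. The function names are illustrative.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

EGFP = (
    "MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL"
    "VTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVN"
    "RIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD"
    "HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYKLE"
    "HHHHHH"
)

def tryptic_peptides(seq: str) -> list:
    """Cleave C-terminal to K/R unless the next residue is P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide
    return peptides

def mh_plus(peptide: str) -> float:
    """Monoisotopic [M+H]+ of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON

peptides = tryptic_peptides(EGFP)
detectable = [p for p in peptides if mh_plus(p) > 500.0]

print(EGFP.count("K"), "K,", EGFP.count("R"), "R")
print(len(peptides), "peptides,", len(detectable), "with [M+H]+ > 500")
for p in detectable:
    print(f"{mh_plus(p):9.2f}  {p}")
```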

Representative predicted peptides (monoisotopic $[\text{M+H}]^+$):

#	Residues	Sequence	$[\text{M+H}]^+$ (Da)
1	1–4	MVSK	464.25
2	5–27	GEELFTGVVPILVELDGDVNGHK	2437.26
3	28–42	FSVSGEGEGDATYGK	1503.66
5	47–53	FICTTGK	769.39
6	54–74	LPVPWPTLVTTLTYGVQCFSR	2378.26
9	87–97	SAMPEGYVQER	1266.58
14	115–123	FEGDTLVNR	1050.52
17	133–141	EDGNILGHK	982.50
23	170–210	HNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSK	4472.18
26	217–239	DHMVLLEFVTAAGITLGMDELYK	2566.29
27	240–247	LEHHHHHH	1083.50

Q3. How many chromatographic peaks are visible in the peptide map between 0.5 and 6.0 minutes (Figure 5a)?

Counting the labeled peaks in Figure 5a with retention times between 0.5 and 6.0 minutes and relative intensity above ~10%:

0.61, 0.79, 1.20, 1.43, 1.80, 1.85, 1.93, 2.17, 2.26, 2.54, 2.78, 3.27, 3.53, 3.59, 3.70, 4.30, 4.48, 4.64, 4.87, 5.06, 5.43

Approximately 19–21 peaks, depending on the intensity threshold used and whether closely spaced doublets (e.g., 1.80/1.85 and 3.53/3.59) are counted as one or two.

Q4. Does the number of chromatographic peaks match the predicted number of tryptic peptides?

The counts roughly agree but are not identical. Of the 27 predicted tryptic peptides, 19 have $[\text{M+H}]^+ > 500$ Da, and we observe ~19–21 chromatographic peaks. Several factors shape the observed count:

Very small peptides not detected: R (175 Da), QK (275 Da), TR (276 Da), IR (288 Da), and the other fragments below 500 Da elute in the void volume or fall below the detection limit. This is why the comparison is made against the 19 peptides above 500 Da rather than all 27.
Co-elution: Some peptides with similar hydrophobicity may co-elute and appear as a single peak, further reducing the count.
Modifications or partial cleavages: Oxidized or miscleaved forms of some peptides can produce extra peaks.

Overall, the observed ~19–21 peaks are consistent with the predicted 19 detectable tryptic peptides.

Q5. Identify the peptide in Figure 5b. What is the charge state, and what is the $[\text{M+H}]^+$ mass?

The dominant peak in Figure 5b is at $m/z = 525.76712$. A second peak is visible at $m/z = 1050.52438$.

The relationship between these two peaks reveals the charge:

$$2 \times 525.767 = 1051.534 \approx 1050.524 + 1.007$$

The 525.767 peak is the doubly charged $[\text{M+2H}]^{2+}$ ion, and the 1050.524 peak is the singly charged $[\text{M+H}]^{+}$ ion. Therefore $z = 2$.

$$[\text{M+H}]^+ = z \times (m/z) - (z-1) \times H = 2 \times 525.76712 - 1 \times 1.00728 = \mathbf{1050.527 \;\text{Da}}$$

This can be confirmed by the direct singly charged peak at $m/z = 1050.524$ Da.

Q6. Identify the peptide by comparison with PeptideMass, and calculate the mass accuracy in ppm.

Comparing the observed $[\text{M+H}]^+ = 1050.527$ Da against the PeptideMass output, the match is peptide FEGDTLVNR (residues 115–123), with a predicted monoisotopic $[\text{M+H}]^+ = 1050.5214$ Da.

$$\text{ppm error} = \frac{|1050.527 - 1050.5214|}{1050.5214} \times 10^{6} = \frac{0.0056}{1050.5214} \times 10^{6} \approx \mathbf{5.3 \;\text{ppm}}$$

Using the singly charged peak directly ($m/z = 1050.52438$):

$$\text{ppm error} = \frac{|1050.52438 - 1050.5214|}{1050.5214} \times 10^6 \approx \mathbf{2.8 \;\text{ppm}}$$

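Both values indicate high mass accuracy. The 2.8 ppm error from the singly charged ion is within the range typical of Orbitrap instruments (specification $\leq 5$ ppm), and the 5.3 ppm error from the doubly charged ion is only marginally above it.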
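The charge assignment and both ppm values from Q5 and Q6 can be reproduced with the sketch below. The function names, the charge search limit, and the 0.01 Da matching tolerance are illustrative assumptions.

```python
PROTON = 1.00728  # Da

def mh_from_mz(mz: float, z: int) -> float:
    """Convert an [M + zH]z+ peak to the singly protonated [M+H]+ mass."""
    return z * mz - (z - 1) * PROTON

def infer_charge(mz_low: float, mz_singly: float, max_z: int = 6, tol: float = 0.01):
    """Find z for which the low-m/z peak maps onto the singly charged peak."""
    for z in range(2, max_z + 1):
        if abs(mh_from_mz(mz_low, z) - mz_singly) < tol:
            return z
    return None  # no consistent charge state within the tolerance

def ppm_error(observed: float, reference: float) -> float:
    return abs(observed - reference) / reference * 1e6

MZ_DOUBLY = 525.76712
MZ_SINGLY = 1050.52438
MH_PREDICTED = 1050.5214  # FEGDTLVNR, monoisotopic [M+H]+ from PeptideMass

z = infer_charge(MZ_DOUBLY, MZ_SINGLY)
mh_observed = mh_from_mz(MZ_DOUBLY, z)
ppm_from_doubly = ppm_error(mh_observed, MH_PREDICTED)
ppm_from_singly = ppm_error(MZ_SINGLY, MH_PREDICTED)

print(f"z = {z}, [M+H]+ = {mh_observed:.4f} Da")
print(f"ppm (from 2+ ion): {ppm_from_doubly:.1f}")
print(f"ppm (from 1+ ion): {ppm_from_singly:.1f}")
```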

Q7. What percentage of the eGFP sequence is confirmed by peptide mapping?

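From Figure 6, 88% of the eGFP sequence was identified with high confidence by peptide mapping. The unconfirmed 12% (about 30 of 247 residues) corresponds primarily to the very small tryptic fragments (R, QK, TR, IR, and similar), which are too small to be retained or detected; the eight predicted peptides below 500 Da together cover 23 residues (~9%). The large 41-residue peptide (HNIEDGSVQLAD…SALSK) alone spans ~17% of the sequence, so it cannot be entirely missing at 88% coverage, although poor chromatographic recovery could still make it a lower-confidence identification.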

Part IV — KLH Oligomers by CDMS

Identify the KLH oligomeric species on the CDMS spectrum (Figure 7).

Keyhole limpet hemocyanin (KLH) is built from two subunit types: a 7-functional-unit (7FU) monomer of 340 kDa and an 8-functional-unit (8FU) monomer of 400 kDa. These assemble into decamers and higher-order multimers.

CDMS Peak (MDa)	Assignment	Expected Mass	Calculation	Match
3.4	7FU Decamer	3.40 MDa	$10 \times 340\;\text{kDa}$	exact
4.01	8FU Decamer	4.00 MDa	$10 \times 400\;\text{kDa}$	0.3%
8.33	8FU Didecamer	8.00 MDa	$20 \times 400\;\text{kDa}$	4.1%
12.67	8FU 3-Decamer	12.00 MDa	$30 \times 400\;\text{kDa}$	5.6%
—	8FU 4-Decamer	16.00 MDa	$40 \times 400\;\text{kDa}$	not visible

The 7FU Decamer ($10 \times 340 = 3{,}400$ kDa) matches the 3.4 MDa peak precisely. The 8FU Didecamer ($20 \times 400 = 8{,}000$ kDa) corresponds to the ~8.33 MDa peak, and the 8FU 3-Decamer ($30 \times 400 = 12{,}000$ kDa) corresponds to the ~12.67 MDa peak. The slight upward mass shifts in the didecamer and 3-decamer peaks likely reflect associated solvent, salt, or lipid.

The 8FU 4-Decamer ($40 \times 400 = 16{,}000$ kDa = 16.0 MDa) is not clearly visible on the spectrum, suggesting it is either absent from this preparation, present at very low abundance, or beyond the measured mass range.

Additional peaks visible in Figure 7 at ~0.79 and ~1.52 MDa likely correspond to sub-decameric fragments (dimers and tetramers of 7FU or 8FU subunits).
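The expected masses and percent deviations in the assignment table follow from simple arithmetic, sketched below. The subunit masses are the 340 kDa and 400 kDa values given above, and the names are illustrative.

```python
SUBUNIT_KDA = {"7FU": 340.0, "8FU": 400.0}

def expected_mda(subunit: str, copies: int) -> float:
    """Expected oligomer mass in MDa for a given subunit type and copy number."""
    return SUBUNIT_KDA[subunit] * copies / 1000.0

def percent_deviation(observed: float, expected: float) -> float:
    return 100.0 * (observed - expected) / expected

# (observed CDMS peak in MDa, subunit type, copy number), as assigned in the table.
assignments = [
    (3.40, "7FU", 10),
    (4.01, "8FU", 10),
    (8.33, "8FU", 20),
    (12.67, "8FU", 30),
]

deviations = []
for observed, subunit, copies in assignments:
    expected = expected_mda(subunit, copies)
    deviation = percent_deviation(observed, expected)
    deviations.append(deviation)
    print(f"{observed:6.2f} MDa  {copies:2d} x {subunit}  expected {expected:5.2f} MDa  ({deviation:+.1f}%)")
```

All deviations are zero or positive, in line with the upward shifts noted for the didecamer and 3-decamer peaks.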

Part V — Did I Make GFP?

Property	Theoretical	Observed (Intact LC-MS)	PPM Error
Molecular weight (kDa)	28.007	~27.984	~810
Peptide mapping coverage	100%	88%	—
Peptide FEGDTLVNR $[\text{M+H}]^+$ (Da)	1050.5214	1050.5270	~5

Conclusion: Yes. The intact mass agrees with the theoretical eGFP mass to within ~810 ppm (largely explained by GFP chromophore maturation, which removes ~20 Da and is not reflected in the ProtParam theoretical value). The tryptic peptide map confirms 88% of the amino acid sequence, with peptide mass accuracy of roughly 3–5 ppm. Together, the intact mass and sequence-level peptide coverage provide strong orthogonal confirmation that the expressed protein is eGFP.