
B-dereplicate-bruker-maldi-biotyper-spectra
Source: vignettes/b-dereplicate-bruker-maldi-biotyper-spectra.Rmd
Bacterial colony identification with the Bruker MALDI Biotyper and its built-in tools is a high-throughput method, provided that the selected bacteria belong to the internal database.
Scientific projects where the number of unknown bacteria is expected to be high need reference-free methods to reduce the redundancy of isolated bacterial colonies, a process called dereplication.
Strejcek et al. (2018) proposed such a method: they process the spectra and suggest similarity thresholds above which spectra, and therefore the measured bacterial colonies, can be considered identical at a given taxonomic rank. Their processing procedure is implemented in the {maldipickr} package and illustrated in the following vignette.
In addition, we provide functions to dereplicate different batches of Bruker MALDI Biotyper runs and to combine the results, so that clusters can be delineated from a common similarity matrix.
More importantly, we provide a function to select the spectrum to be picked in each cluster, a process called cherry-picking, depending on external metadata and on potential out-groups to be excluded from the current cherry-picking step.
Process Bruker MALDI Biotyper spectra
Starting from the raw data imported from the Bruker MALDI Biotyper, the processing of the spectra is based on the original implementation and runs the following tasks:
- Square-root transformation
- Mass range trimming to 4-10 kDa as they were deemed most determinant by Strejcek et al. (2018)
- Signal smoothing using the Savitzky-Golay method and a half window size of 20
- Baseline correction with the SNIP procedure
- Normalisation by Total Ion Current
- Peak detection using the SuperSmoother procedure and with a signal-to-noise ratio above 3
- Peak filtering. This step has been added to discard peaks with a negative signal-to-noise ratio probably due to being on the edge of the mass range.
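To make the simplest of these steps concrete, here is a minimal base-R sketch on a synthetic spectrum. It is not the {maldipickr} implementation: the `mass` and `intensity` vectors are invented for illustration, and the Total Ion Current normalisation is approximated as a division by the sum of the intensities. Smoothing, baseline correction and peak detection are deliberately left out.

```r
# Synthetic spectrum (hypothetical values, for illustration only)
mass <- seq(2000, 12000, by = 500)
intensity <- c(
  4, 9, 16, 25, 36, 49, 64, 81, 100, 81, 64,
  49, 36, 25, 16, 9, 4, 1, 0, 1, 4
)

# 1. Square-root transformation of the intensities
intensity_sqrt <- sqrt(intensity)

# 2. Trim to the 4-10 kDa mass range (4000 to 10000 Da)
keep <- mass >= 4000 & mass <= 10000
mass_trimmed <- mass[keep]
intensity_trimmed <- intensity_sqrt[keep]

# 3. Total Ion Current normalisation, approximated here by the sum
#    of the intensities (an assumption of this sketch)
intensity_tic <- intensity_trimmed / sum(intensity_trimmed)

range(mass_trimmed)
sum(intensity_tic)
```

After these three steps the spectrum only covers the trimmed mass range and its intensities sum to one, which makes spectra of different overall intensity comparable.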
The full procedure is illustrated in the example below. In this case, all the resulting processed spectra, peaks and final spectra metadata are stored in memory, but the process_spectra() function can also store these files locally for scalable high-throughput analyses.
# Get an example directory of six Bruker MALDI Biotyper spectra
directory_biotyper_spectra <- system.file(
"toy-species-spectra",
package = "maldipickr"
)
# Import the six spectra
spectra_list <- import_biotyper_spectra(directory_biotyper_spectra)
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- process_spectra(spectra_list)
# Overview of the list architecture that is returned
# with the list of processed spectra, peaks identified and the
# metadata table
str(processed, max.level = 2)
#> List of 3
#> $ spectra :List of 6
#> ..$ species1_G2 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species2_E11:Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species2_E12:Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species3_F7 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species3_F8 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> ..$ species3_F9 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#> $ peaks :List of 6
#> ..$ species1_G2 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species2_E11:Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species2_E12:Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species3_F7 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species3_F8 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> ..$ species3_F9 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#> $ metadata: tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
# A detailed view of the metadata with the median signal-to-noise
# ratio (SNR) and the number of peaks
processed$metadata
#> # A tibble: 6 × 3
#> name SNR peaks
#> <chr> <dbl> <dbl>
#> 1 species1_G2 5.09 21
#> 2 species2_E11 5.54 22
#> 3 species2_E12 5.63 23
#> 4 species3_F7 4.89 26
#> 5 species3_F8 5.56 25
#> 6 species3_F9 5.40 25
Merge multiple processed spectra
During high-throughput analyses, multiple runs of the Bruker MALDI Biotyper are expected, resulting in several batches of spectra to be processed and compared. While the batches are processed independently, and could therefore be processed in parallel, integrating them for comparison needs an additional step.
The merge_processed_spectra() function aggregates the processed spectra and bins together the detected peaks, with a tolerance of \(0.002\) between the average peak values in the bin (see MALDIquant::binPeaks), which translates to a tolerance of 2000 ppm. This binning step results in a \(n\times p\) feature matrix (or intensity matrix), with \(n\) rows for \(n\) processed spectra (peakless spectra are discarded) and \(p\) columns for the \(p\) peak masses.
By default, as in the Strejcek et al. (2018) procedure, the intensity values for spectra with missing peaks are interpolated from the processed spectra signal. The current function lets the analyst decide whether to interpolate the values or to leave missing peaks as NA, which are then converted to a null intensity value.
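As a worked illustration, a relative tolerance of \(0.002\) corresponds to \(0.002 \times 10^6 = 2000\) ppm, that is about 10 Da around a peak at 5000 Da. The base-R sketch below shows how peak lists could be turned into a feature matrix without interpolation. It uses a simplified greedy binning written for this example, not the algorithm of MALDIquant::binPeaks, and the peak lists in `peaks` are invented.

```r
# Hypothetical peak lists (mass in Da, intensity) for three spectra
peaks <- list(
  s1 = data.frame(mass = c(5000, 6200), intensity = c(0.8, 0.3)),
  s2 = data.frame(mass = c(5004, 7500), intensity = c(0.7, 0.5)),
  s3 = data.frame(mass = c(6195, 7510), intensity = c(0.4, 0.6))
)
tolerance <- 0.002 # relative tolerance, i.e. 2000 ppm

# Stack all peaks in one table and sort them by mass
all_peaks <- do.call(rbind, Map(
  function(p, id) data.frame(spectrum = id, p, stringsAsFactors = FALSE),
  peaks, names(peaks)
))
all_peaks <- all_peaks[order(all_peaks$mass), ]

# Simplified greedy binning: open a new bin when a peak deviates
# from the mean mass of the current bin by more than the tolerance
bin <- integer(nrow(all_peaks))
current <- 1L
start <- 1L
bin[1] <- current
for (i in seq_len(nrow(all_peaks))[-1]) {
  bin_mean <- mean(all_peaks$mass[start:(i - 1)])
  if (abs(all_peaks$mass[i] - bin_mean) / bin_mean > tolerance) {
    current <- current + 1L
    start <- i
  }
  bin[i] <- current
}
all_peaks$bin <- bin

# Feature matrix: spectra as rows, binned peak masses as columns
bin_mass <- as.vector(tapply(all_peaks$mass, all_peaks$bin, mean))
fm <- matrix(
  NA_real_,
  nrow = length(peaks), ncol = length(bin_mass),
  dimnames = list(names(peaks), as.character(round(bin_mass, 1)))
)
fm[cbind(match(all_peaks$spectrum, names(peaks)), all_peaks$bin)] <-
  all_peaks$intensity

# Without interpolation, missing peaks (NA) become null intensities
fm_zero <- fm
fm_zero[is.na(fm_zero)] <- 0
fm_zero
```

In this toy case the six peaks fall into three bins, and each spectrum lacks one of them, so the matrix without interpolation contains three zeros.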
# Get an example directory of six Bruker MALDI Biotyper spectra
directory_biotyper_spectra <- system.file(
"toy-species-spectra",
package = "maldipickr"
)
# Import the six spectra
spectra_list <- import_biotyper_spectra(directory_biotyper_spectra)
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- process_spectra(spectra_list)
# Merge the spectra to produce the feature matrix
fm <- merge_processed_spectra(list(processed))
# The feature matrix has 6 spectra as rows and
# 35 peaks as columns
dim(fm)
#> [1] 6 35
# Notice the difference when the interpolation is turned off
fm_no_interpolation <- merge_processed_spectra(
list(processed),
interpolate_missing = FALSE
)
sum(fm == 0) # 0
#> [1] 0
sum(fm_no_interpolation == 0) # 68
#> [1] 68
# Multiple runs can be aggregated using list()
# Merge the spectra to produce the feature matrix
fm_all <- merge_processed_spectra(list(processed, processed, processed))
# The feature matrix has 3×6=18 spectra as rows and
# 35 peaks as columns
dim(fm_all)
#> [1] 18 35
Compute a similarity matrix between all processed spectra (not included)
Once all the batches of spectra have been processed and merged, we can use a similarity metric to evaluate how close the spectra are to one another. Strejcek et al. (2018) recommend the cosine metric to compare the spectra, and they use the fast implementation in the {coop} package.
While we do not provide specific functions to generate the similarity matrix, we illustrate below how it can be computed. Note that the feature matrix from merge_processed_spectra() has spectra as rows and peak values as columns. So to get a similarity matrix between spectra, the feature matrix must be transposed before computing the cosine (or the correlation).
# A. Compute the similarity matrix on the transposed feature matrix
# using the Pearson correlation coefficient
sim_matrix <- stats::cor(t(fm), method = "pearson")
# B.1 Install the coop package
# install.packages("coop")
# B.2 Compute the similarity matrix on the transposed feature matrix
sim_matrix <- coop::cosine( t(fm) )
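If {coop} is not available, the cosine similarity between spectra can also be sketched in base R. The example below is a minimal illustration on an invented matrix `fm_toy`, laid out like the output of merge_processed_spectra() with spectra as rows and peaks as columns. Because it works on rows directly, no transposition is needed here.

```r
# Hypothetical feature matrix: 3 spectra (rows) x 4 peaks (columns)
fm_toy <- matrix(
  c(
    1, 0, 2, 1,
    2, 0, 4, 2,
    0, 3, 0, 1
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), paste0("peak", 1:4))
)

# Cosine similarity between rows: dot products of all pairs of
# spectra, scaled by the product of their Euclidean norms
row_norms <- sqrt(rowSums(fm_toy^2))
sim_toy <- tcrossprod(fm_toy) / tcrossprod(row_norms)
round(sim_toy, 2)
```

Spectra `s1` and `s2` have proportional intensities, so their cosine similarity is one even though their absolute intensities differ, whereas `s3` shares only one peak with them and scores much lower.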
Delineate clusters from a similarity matrix
When the similarity matrix is computed between all pairs of the studied spectra, the next step is to delineate clusters of spectra to dereplicate the measured bacterial colonies.
The similarity_to_clusters() function is agnostic of the similarity metric used, whether it is the cosine metric or the Pearson product-moment correlation, provided that a numeric threshold relevant to the metric is given, at or above which two spectra are considered similar.
Indeed, the matrix is transformed into a network without loops, where nodes are spectra and links exist between two spectra only if their similarity is above (or equal to) the threshold. The clusters are then inferred from this representation. A table summarises, for each spectrum, the cluster number it was assigned to and the size of that cluster, which is the total number of spectra in the cluster.
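The principle can be sketched in base R, assuming that clusters correspond to the connected components of the thresholded network. This is an illustration only and not the {maldipickr} implementation: the helper `clusters_from_similarity()` and the matrix `sim` are hypothetical names introduced for this example.

```r
clusters_from_similarity <- function(sim, threshold) {
  # Links exist where similarity >= threshold; self-links are dropped
  adjacency <- sim >= threshold
  diag(adjacency) <- FALSE
  n <- nrow(sim)
  membership <- rep(NA_integer_, n)
  cluster_id <- 0L
  for (start in seq_len(n)) {
    if (!is.na(membership[start])) next
    # Breadth-first search from every spectrum not yet assigned
    cluster_id <- cluster_id + 1L
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      if (!is.na(membership[node])) next
      membership[node] <- cluster_id
      neighbours <- unname(which(adjacency[node, ] & is.na(membership)))
      queue <- c(queue, neighbours)
    }
  }
  data.frame(
    name = rownames(sim),
    membership = membership,
    cluster_size = as.integer(table(membership)[as.character(membership)]),
    stringsAsFactors = FALSE
  )
}

# Hypothetical similarity matrix between four spectra
sim <- matrix(
  c(
    1.00, 0.95, 0.30, 0.20,
    0.95, 1.00, 0.93, 0.25,
    0.30, 0.93, 1.00, 0.10,
    0.20, 0.25, 0.10, 1.00
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), paste0("s", 1:4))
)
clusters_from_similarity(sim, threshold = 0.92)
```

Under this assumption, `s1` and `s3` end up in the same cluster although their direct similarity (0.30) is below the threshold, because both are linked to `s2`. This chaining effect is worth keeping in mind when choosing a threshold.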
# Toy similarity matrix between the six example spectra of
# three species. The cosine metric is used and a value of
# zero indicates dissimilar spectra and a value of one
# indicates identical spectra.
cosine_similarity <- matrix(
c(
1, 0.79, 0.77, 0.99, 0.98, 0.98,
0.79, 1, 0.98, 0.79, 0.8, 0.8,
0.77, 0.98, 1, 0.77, 0.77, 0.77,
0.99, 0.79, 0.77, 1, 1, 0.99,
0.98, 0.8, 0.77, 1, 1, 1,
0.98, 0.8, 0.77, 0.99, 1, 1
),
nrow = 6,
dimnames = list(
c(
"species1_G2", "species2_E11", "species2_E12",
"species3_F7", "species3_F8", "species3_F9"
),
c(
"species1_G2", "species2_E11", "species2_E12",
"species3_F7", "species3_F8", "species3_F9"
)
)
)
# Delineate clusters based on a 0.92 threshold applied
# to the similarity matrix
similarity_to_clusters(cosine_similarity, threshold = 0.92)
#> # A tibble: 6 × 3
#> name membership cluster_size
#> <chr> <int> <int>
#> 1 species1_G2 1 4
#> 2 species2_E11 2 2
#> 3 species2_E12 2 2
#> 4 species3_F7 1 4
#> 5 species3_F8 1 4
#> 6 species3_F9 1 4
Set a reference spectrum for each cluster
Once the table of clusters is generated from the similarity matrix, a reference spectrum can be assigned to each cluster.
We choose to define high-quality spectra as representative spectra of the clusters using internal information. That is, representative spectra have, within their cluster, the highest median signal-to-noise ratio and then the highest number of detected peaks.
The function set_reference_spectra()
does not change the order of the cluster table but merely adds an
additional column is_reference to indicate whether the
corresponding spectrum is representative of the cluster.
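The selection rule can be sketched in base R as a ranking within each cluster, first by decreasing signal-to-noise ratio and then by decreasing number of peaks. The sketch below is an illustration of that rule and not the code of set_reference_spectra(); the table `clusters_toy` reuses the names, memberships, SNR and peak counts of the toy example in this vignette.

```r
# Cluster table and quality metrics from the toy example
clusters_toy <- data.frame(
  name = c(
    "species1_G2", "species2_E11", "species2_E12",
    "species3_F7", "species3_F8", "species3_F9"
  ),
  membership = c(1L, 2L, 2L, 1L, 1L, 1L),
  SNR = c(5.09, 5.54, 5.63, 4.89, 5.56, 5.40),
  peaks = c(21, 22, 23, 26, 25, 25),
  stringsAsFactors = FALSE
)

# Rank rows by cluster, then decreasing SNR, then decreasing peaks
ranking <- order(
  clusters_toy$membership, -clusters_toy$SNR, -clusters_toy$peaks
)

# The first row of each cluster in that ranking is the reference
first_in_cluster <- ranking[!duplicated(clusters_toy$membership[ranking])]

# Flag the references without reordering the original table
clusters_toy$is_reference <-
  seq_len(nrow(clusters_toy)) %in% first_in_cluster
clusters_toy
```

With these values the flagged rows are `species3_F8` for cluster 1 and `species2_E12` for cluster 2, and the row order of the table is left untouched.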
# Get an example directory of six Bruker MALDI Biotyper spectra
# Import the six spectra and
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- system.file(
"toy-species-spectra",
package = "maldipickr"
) %>%
import_biotyper_spectra() %>%
suppressMessages() %>%
process_spectra()
# Toy similarity matrix between the six example spectra of
# three species. The cosine metric is used and a value of
# zero indicates dissimilar spectra and a value of one
# indicates identical spectra.
cosine_similarity <- matrix(
c(
1, 0.79, 0.77, 0.99, 0.98, 0.98,
0.79, 1, 0.98, 0.79, 0.8, 0.8,
0.77, 0.98, 1, 0.77, 0.77, 0.77,
0.99, 0.79, 0.77, 1, 1, 0.99,
0.98, 0.8, 0.77, 1, 1, 1,
0.98, 0.8, 0.77, 0.99, 1, 1
),
nrow = 6,
dimnames = list(
c(
"species1_G2", "species2_E11", "species2_E12",
"species3_F7", "species3_F8", "species3_F9"
),
c(
"species1_G2", "species2_E11", "species2_E12",
"species3_F7", "species3_F8", "species3_F9"
)
)
)
# Delineate clusters based on a 0.92 threshold applied
# to the similarity matrix
clusters <- similarity_to_clusters(
cosine_similarity,
threshold = 0.92
)
# Set reference spectra with the toy example
set_reference_spectra(clusters, processed$metadata)
#> # A tibble: 6 × 6
#> name membership cluster_size SNR peaks is_reference
#> <chr> <int> <int> <dbl> <dbl> <lgl>
#> 1 species1_G2 1 4 5.09 21 FALSE
#> 2 species2_E11 2 2 5.54 22 FALSE
#> 3 species2_E12 2 2 5.63 23 TRUE
#> 4 species3_F7 1 4 4.89 26 FALSE
#> 5 species3_F8 1 4 5.56 25 TRUE
#> 6 species3_F9 1 4 5.40 25 FALSE
Import clusters results generated by SPeDE
Raw spectra can also be processed and clustered by another approach, named SPeDE, developed by Dumolin et al. (2019). The resulting dereplication step produces a comma-separated table. The example below illustrates how to import this table into R so that it is consistent with the dereplication table generated within the {maldipickr} package.
# Reformat the output from SPeDE table
# https://github.com/LM-UGent/SPeDE
import_spede_clusters(
system.file("spede.csv", package = "maldipickr")
)
#> # A tibble: 6 × 5
#> name membership cluster_size quality is_reference
#> <chr> <dbl> <int> <chr> <lgl>
#> 1 species1_G2 1 1 GREEN TRUE
#> 2 species2_E11 2 2 ORANGE FALSE
#> 3 species2_E12 2 2 GREEN TRUE
#> 4 species3_F7 3 1 GREEN TRUE
#> 5 species3_F8 4 2 ORANGE FALSE
#> 6 species3_F9 4 2 GREEN TRUE
Cherry-pick Bruker MALDI Biotyper spectra
When isolating bacteria from an environment, experimenters want to be
thorough but also work-, time- and cost-savvy. One approach is to reduce
the redundancy of the bacterial isolates by analyzing their MALDI-TOF
spectra from the Bruker Biotyper. All the steps previously described in
this vignette consisted of processing the spectra to be able to pick
only non-redundant spectra, using the pick_spectra()
function.
The function, as illustrated in the examples below, can pick spectra using different types of inputs:
- the reference spectra information that is present in the cluster table (after using the similarity_to_clusters() or import_spede_clusters() functions; see example 1)
- an external metadata table containing a variable (e.g., optical density, fluorescence) to be maximized (default) or minimized per cluster (see example 2)
Spectra, and clusters, can also be excluded from the cherry-picking
decision, a procedure termed masking here. We distinguish two
types of mask that are implemented in the pick_spectra()
function:
- soft mask that discards only the spectra, for instance if they correspond to low-quality samples or negative control samples (see example 3)
- hard mask that discards the spectra and their clusters (see example 4). This is particularly useful if some spectra have been previously picked, for instance to exclude colonies grown and picked 24h after streaking when comparing with colonies grown for 72h.
Advanced users can also directly provide a cluster table with a custom sort by cluster to accommodate complex designs.
Ultimately, the function delivers a table with as many rows as the
cluster table with an additional logical column named
to_pick to indicate whether the colony associated with the
spectra should be picked (TRUE) or not picked
(FALSE).
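The difference between the two masks can be sketched in base R. The helper `pick_max()` below is hypothetical and only illustrates the logic described above (maximize a variable per cluster, after soft or hard masking); it is not the code of pick_spectra(), whose behaviour may differ in details. The table `picks_toy` reuses values from the toy example of this vignette.

```r
picks_toy <- data.frame(
  name = c(
    "species1_G2", "species2_E11", "species2_E12",
    "species3_F7", "species3_F8", "species3_F9"
  ),
  membership = c(1L, 2L, 2L, 1L, 1L, 1L),
  OD600 = c(0.364, 0.772, 0.735, 0.973, 0.740, 0.201),
  is_edge = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  picked_before = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

pick_max <- function(df, column, soft_mask = NULL, hard_mask = NULL) {
  eligible <- rep(TRUE, nrow(df))
  # Soft mask: only the flagged spectra are set aside
  if (!is.null(soft_mask)) {
    eligible <- eligible & !df[[soft_mask]]
  }
  # Hard mask: the whole cluster of any flagged spectrum is set aside
  if (!is.null(hard_mask)) {
    masked_clusters <- unique(df$membership[df[[hard_mask]]])
    eligible <- eligible & !(df$membership %in% masked_clusters)
  }
  # In each remaining cluster, pick the eligible spectrum with the
  # highest value of the chosen column
  to_pick <- rep(FALSE, nrow(df))
  for (cluster in unique(df$membership[eligible])) {
    candidates <- which(eligible & df$membership == cluster)
    to_pick[candidates[which.max(df[[column]][candidates])]] <- TRUE
  }
  to_pick
}

picks_toy$name[pick_max(picks_toy, "OD600")]
picks_toy$name[pick_max(picks_toy, "OD600", soft_mask = "is_edge")]
picks_toy$name[pick_max(picks_toy, "OD600", hard_mask = "picked_before")]
```

In this sketch the soft mask leaves the picks unchanged, because the masked spectrum was not the best of its cluster, while the hard mask removes cluster 1 entirely so that only a spectrum from cluster 2 is picked.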
# 0. Load a toy example of a tibble of clusters created by
# the `similarity_to_clusters` function.
clusters <- readRDS(
system.file("clusters_tibble.RDS",
package = "maldipickr"
)
)
# 1. By default and if no other metadata are provided,
# the function picks reference spectra for each cluster.
pick_spectra(clusters)
#> # A tibble: 6 × 7
#> name membership cluster_size SNR peaks is_reference to_pick
#> <chr> <int> <int> <dbl> <dbl> <lgl> <lgl>
#> 1 species1_G2 1 4 5.09 21 FALSE FALSE
#> 2 species2_E11 2 2 5.54 22 FALSE FALSE
#> 3 species2_E12 2 2 5.63 23 TRUE TRUE
#> 4 species3_F7 1 4 4.89 26 FALSE FALSE
#> 5 species3_F8 1 4 5.56 25 TRUE TRUE
#> 6 species3_F9 1 4 5.40 25 FALSE FALSE
# 2.1 Simulate OD600 values with uniform distribution
# for each of the colonies we measured with
# the Bruker MALDI Biotyper
set.seed(104)
metadata <- dplyr::transmute(
clusters,
name = name, OD600 = runif(n = nrow(clusters))
)
metadata
#> # A tibble: 6 × 2
#> name OD600
#> <chr> <dbl>
#> 1 species1_G2 0.364
#> 2 species2_E11 0.772
#> 3 species2_E12 0.735
#> 4 species3_F7 0.973
#> 5 species3_F8 0.740
#> 6 species3_F9 0.201
# 2.2 Pick the spectra based on the highest
# OD600 value per cluster
pick_spectra(clusters, metadata, "OD600")
#> # A tibble: 6 × 8
#> name membership cluster_size SNR peaks is_reference OD600 to_pick
#> <chr> <int> <int> <dbl> <dbl> <lgl> <dbl> <lgl>
#> 1 species1_G2 1 4 5.09 21 FALSE 0.364 FALSE
#> 2 species2_E11 2 2 5.54 22 FALSE 0.772 TRUE
#> 3 species2_E12 2 2 5.63 23 TRUE 0.735 FALSE
#> 4 species3_F7 1 4 4.89 26 FALSE 0.973 TRUE
#> 5 species3_F8 1 4 5.56 25 TRUE 0.740 FALSE
#> 6 species3_F9 1 4 5.40 25 FALSE 0.201 FALSE
# 3.1 Say that the wells on the right side of the plate are
# used for negative controls and should not be picked.
metadata <- metadata %>% dplyr::mutate(
well = gsub(".*[A-Z]([0-9]{1,2}$)", "\\1", name) %>%
strtoi(),
is_edge = is_well_on_edge(
well_number = well, plate_layout = 96, edges = "right"
)
)
# 3.2 Pick the spectra after discarding (or soft masking)
# the spectra indicated by the `is_edge` column.
pick_spectra(clusters, metadata, "OD600",
soft_mask_column = "is_edge"
)
#> # A tibble: 6 × 10
#> name membership cluster_size SNR peaks is_reference OD600 well is_edge
#> <chr> <int> <int> <dbl> <dbl> <lgl> <dbl> <int> <lgl>
#> 1 species1… 1 4 5.09 21 FALSE 0.364 2 FALSE
#> 2 species2… 2 2 5.54 22 FALSE 0.772 11 FALSE
#> 3 species2… 2 2 5.63 23 TRUE 0.735 12 TRUE
#> 4 species3… 1 4 4.89 26 FALSE 0.973 7 FALSE
#> 5 species3… 1 4 5.56 25 TRUE 0.740 8 FALSE
#> 6 species3… 1 4 5.40 25 FALSE 0.201 9 FALSE
#> # ℹ 1 more variable: to_pick <lgl>
# 4.1 Say that some spectra were picked before
# (e.g., in the column F) in a previous experiment.
# We do not want to pick clusters with those spectra
# included to limit redundancy.
metadata <- metadata %>% dplyr::mutate(
picked_before = grepl("_F", name)
)
# 4.2 Pick the spectra from clusters without spectra
# labelled as `picked_before` (hard masking).
pick_spectra(clusters, metadata, "OD600",
hard_mask_column = "picked_before"
)
#> # A tibble: 6 × 11
#> name membership cluster_size SNR peaks is_reference OD600 well is_edge
#> <chr> <int> <int> <dbl> <dbl> <lgl> <dbl> <int> <lgl>
#> 1 species1… 1 4 5.09 21 FALSE 0.364 2 FALSE
#> 2 species2… 2 2 5.54 22 FALSE 0.772 11 FALSE
#> 3 species2… 2 2 5.63 23 TRUE 0.735 12 TRUE
#> 4 species3… 1 4 4.89 26 FALSE 0.973 7 FALSE
#> 5 species3… 1 4 5.56 25 TRUE 0.740 8 FALSE
#> 6 species3… 1 4 5.40 25 FALSE 0.201 9 FALSE
#> # ℹ 2 more variables: picked_before <lgl>, to_pick <lgl>
References
- Dumolin C, Aerts M, Verheyde B, Schellaert S, Vandamme T, Van Der Jeugt F, De Canck E, Cnockaert M, Wieme AD, Cleenwerck I, Peiren J, Dawyndt P, Vandamme P & Carlier A (2019). “Introducing SPeDE: High-Throughput Dereplication and Accurate Determination of Microbial Diversity from Matrix-Assisted Laser Desorption–Ionization Time of Flight Mass Spectrometry Data.” mSystems 4(5). doi:10.1128/msystems.00437-19.
- Strejcek M, Smrhova T, Junkova P & Uhlik O (2018). “Whole-Cell MALDI-TOF MS versus 16S rRNA Gene Analysis for Identification and Dereplication of Recurrent Bacterial Isolates.” Frontiers in Microbiology 9. doi:10.3389/fmicb.2018.01294.