B-dereplicate-bruker-maldi-biotyper-spectra • maldipickr

library(maldipickr)

Bacterial colony identification with the Bruker MALDI Biotyper and its built-in tools is a high-throughput method, provided that the selected bacteria are represented in the internal database.

Scientific projects where the number of unknown bacteria is expected to be high need reference-free methods to reduce the redundancy of isolated bacterial colonies, a process called dereplication.

Strejcek et al. (2018) proposed such a method: they process the spectra and suggest similarity thresholds above which spectra, and therefore the measured bacterial colonies, can be considered identical at a given taxonomic rank. Their processing procedure is implemented in the {maldipickr} package and illustrated in this vignette.

In addition, we provide functions to dereplicate different batches of Bruker MALDI Biotyper runs and combine the results, so that clusters can be delineated from a common similarity matrix.

More importantly, we provide a function to select the spectra to be picked in each cluster, a process called cherry-picking, based on external metadata and on potential out-groups to be excluded from the current cherry-picking step.

Process Bruker MALDI Biotyper spectra

Starting from the raw data imported from the Bruker MALDI Biotyper, the spectra are processed as in the original implementation, with the following tasks:

Square-root transformation
Mass range trimming to 4-10 kDa, as this range was deemed the most determinant by Strejcek et al. (2018)
Signal smoothing using the Savitzky-Golay method and a half window size of 20
Baseline correction with the SNIP procedure
Normalisation by Total Ion Current
Peak detection using the SuperSmoother procedure and with a signal-to-noise ratio above 3
Peak filtering. This step was added to discard peaks with a negative signal-to-noise ratio, probably caused by their position on the edge of the mass range.

The full procedure is illustrated in the example below. In this case, all the resulting processed spectra, peaks and final spectra metadata are stored in memory, but the process_spectra() function also enables storing these files locally for scalable high-throughput analyses.

# Get an example directory of six Bruker MALDI Biotyper spectra
directory_biotyper_spectra <- system.file(
  "toy-species-spectra",
  package = "maldipickr"
)
# Import the six spectra
spectra_list <- import_biotyper_spectra(directory_biotyper_spectra)
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- process_spectra(spectra_list)
# Overview of the list architecture that is returned
#  with the list of processed spectra, peaks identified and the
#  metadata table
str(processed, max.level = 2)
#> List of 3
#>  $ spectra :List of 6
#>   ..$ species1_G2 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#>   ..$ species2_E11:Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#>   ..$ species2_E12:Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#>   ..$ species3_F7 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#>   ..$ species3_F8 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#>   ..$ species3_F9 :Formal class 'MassSpectrum' [package "MALDIquant"] with 3 slots
#>  $ peaks   :List of 6
#>   ..$ species1_G2 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#>   ..$ species2_E11:Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#>   ..$ species2_E12:Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#>   ..$ species3_F7 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#>   ..$ species3_F8 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#>   ..$ species3_F9 :Formal class 'MassPeaks' [package "MALDIquant"] with 4 slots
#>  $ metadata: tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
# A detailed view of the metadata with the median signal-to-noise
#  ratio (SNR) and the number of peaks
processed$metadata
#> # A tibble: 6 × 3
#>   name           SNR peaks
#>   <chr>        <dbl> <dbl>
#> 1 species1_G2   5.09    21
#> 2 species2_E11  5.54    22
#> 3 species2_E12  5.63    23
#> 4 species3_F7   4.89    26
#> 5 species3_F8   5.56    25
#> 6 species3_F9   5.40    25
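
To make the listed processing tasks more tangible, the sketch below chains the corresponding MALDIquant calls on a simulated spectrum. It is an illustration under assumptions and not the internals of process_spectra(): the simulated signal and the variable names are hypothetical, parameters not mentioned in the list above are left at their MALDIquant defaults, and the final peak filtering step is omitted.

```r
library(MALDIquant)

# Simulated spectrum (hypothetical data): four Gaussian peaks on a
# decaying baseline with a little positive noise
set.seed(42)
mz <- seq(2000, 12000, by = 2)
centres <- c(4500, 6200, 7800, 9100)
signal <- rowSums(sapply(centres, function(m) {
  500 * exp(-((mz - m)^2) / (2 * 15^2))
}))
intensity <- signal + 50 * exp(-mz / 5000) +
  abs(rnorm(length(mz), sd = 2))
spectrum <- createMassSpectrum(mass = mz, intensity = intensity)

# 1. Square-root transformation
spectrum <- transformIntensity(spectrum, method = "sqrt")
# 2. Mass range trimming to 4-10 kDa
spectrum <- trim(spectrum, range = c(4000, 10000))
# 3. Savitzky-Golay smoothing with a half window size of 20
spectrum <- smoothIntensity(
  spectrum,
  method = "SavitzkyGolay", halfWindowSize = 20
)
# 4. Baseline correction with the SNIP procedure
spectrum <- removeBaseline(spectrum, method = "SNIP")
# 5. Normalisation by Total Ion Current
spectrum <- calibrateIntensity(spectrum, method = "TIC")
# 6. Peak detection with a signal-to-noise ratio above 3
peaks <- detectPeaks(spectrum, method = "SuperSmoother", SNR = 3)
```

The order of the calls follows the order of the task list; swapping steps (for instance trimming after smoothing) would change the values near the edges of the mass range.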

Merge multiple processed spectra

During high-throughput analyses, multiple runs of the Bruker MALDI Biotyper are expected, resulting in several batches of spectra to be processed and compared. While the batches are processed independently, and could be processed in parallel, integrating them for comparison needs an additional step.

The merge_processed_spectra() function aggregates the processed spectra and bins together the detected peaks, with a relative tolerance of \(0.002\) around the average peak mass in the bin (see MALDIquant::binPeaks), which translates to a tolerance of 2000 ppm. This binning step results in an \(n\times p\) feature matrix (or intensity matrix), with \(n\) rows for the \(n\) processed spectra (peakless spectra are discarded) and \(p\) columns for the \(p\) peak masses.

By default, as in the Strejcek et al. (2018) procedure, the intensity values for spectra with missing peaks are interpolated from the processed spectra signal. The current function lets the analyst decide whether to interpolate the values or to leave missing peaks as NA, which are then converted to a null intensity value.

# Get an example directory of six Bruker MALDI Biotyper spectra
directory_biotyper_spectra <- system.file(
  "toy-species-spectra",
  package = "maldipickr"
)
# Import the six spectra
spectra_list <- import_biotyper_spectra(directory_biotyper_spectra)
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- process_spectra(spectra_list)
# Merge the spectra to produce the feature matrix
fm <- merge_processed_spectra(list(processed))
# The feature matrix has 6 spectra as rows and
#  35 peaks as columns
dim(fm)
#> [1]  6 35
# Notice the difference when the interpolation is turned off
fm_no_interpolation <- merge_processed_spectra(
  list(processed),
  interpolate_missing = FALSE
)
sum(fm == 0) # 0
#> [1] 0
sum(fm_no_interpolation == 0) # 68
#> [1] 68

# Multiple runs can be aggregated using list()
# Merge the spectra to produce the feature matrix
fm_all <- merge_processed_spectra(list(processed, processed, processed))
# The feature matrix has 3×6=18 spectra as rows and
#  35 peaks as columns
dim(fm_all)
#> [1] 18 35
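
As a rough illustration of the binning tolerance, a relative tolerance of 0.002 allows a deviation of 8 Da around a bin centred at 4000 Da and of 20 Da at 10000 Da. The sketch below applies MALDIquant::binPeaks to two hand-made peak lists (hypothetical values and variable names) and converts missing peaks to zero, as in the non-interpolated case. It only illustrates the principle and is not the code of merge_processed_spectra().

```r
library(MALDIquant)

tolerance <- 0.002
tolerance * 1e6             # the same tolerance expressed in ppm
tolerance * c(4000, 10000)  # maximal deviation in Da at 4 and 10 kDa

# Two hypothetical peak lists: the peaks at 5000 and 5004 Da are
#  close enough to fall in the same bin, the other two are not shared
toy_peaks <- list(
  createMassPeaks(mass = c(5000, 7000), intensity = c(1, 2)),
  createMassPeaks(mass = c(5004, 9000), intensity = c(3, 4))
)
binned <- binPeaks(toy_peaks, tolerance = tolerance)

# Without the spectra as second argument, missing peaks stay NA
toy_fm <- intensityMatrix(binned)
toy_fm[is.na(toy_fm)] <- 0
dim(toy_fm)
```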

Compute a similarity matrix between all processed spectra (not included)

Once all the batches of spectra have been processed and merged, we can use a similarity metric to evaluate how close the spectra are to one another. Strejcek et al. (2018) recommend the cosine metric to compare the spectra, and they use the fast implementation in the {coop} package.

While we do not provide specific functions to generate the similarity matrix, we illustrate below how it can be easily computed. Note that the feature matrix from merge_processed_spectra() has spectra as rows and peak values as columns. To get a similarity matrix between spectra, the feature matrix must therefore be transposed before the cosine computation.


# A. Compute the similarity matrix on the transposed feature matrix
#   using Pearson correlation coefficient
sim_matrix <- stats::cor(t(fm), method = "pearson")

# B.1 Install the coop package
# install.packages("coop")
# B.2 Compute the similarity matrix on the transposed feature matrix
sim_matrix <- coop::cosine(t(fm))
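
For readers who want to see what the cosine metric computes, the base R sketch below derives it by hand on a small, hypothetical feature matrix (the values and the names toy_fm, unit_rows and toy_sim are ours). Because it works directly on the rows, no transposition is needed here, whereas the column-wise functions above require it.

```r
# Hypothetical feature matrix: 3 spectra (rows) and 4 peaks (columns)
toy_fm <- matrix(
  c(
    1, 0, 2, 1,
    2, 0, 4, 2,
    0, 3, 0, 1
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), NULL)
)
# Scale each row (spectrum) to unit Euclidean length
unit_rows <- toy_fm / sqrt(rowSums(toy_fm^2))
# Cosine similarity is the dot product between unit-length rows
toy_sim <- tcrossprod(unit_rows)
round(toy_sim, 2)
```

In this toy matrix, the second spectrum is the first one with doubled intensities, so their cosine similarity is one: the metric compares the shape of the peak profiles, not their overall intensity.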

Delineate clusters from a similarity matrix

When the similarity matrix is computed between all pairs of the studied spectra, the next step is to delineate clusters of spectra to dereplicate the measured bacterial colonies.

The similarity_to_clusters() function is agnostic of the similarity metric used, whether it is the cosine metric or the Pearson product-moment correlation, provided that a numeric threshold relevant to the metric is given, above which two spectra are considered similar.

Specifically, the matrix is transformed into a network without loops, where nodes are spectra and links exist between two spectra only if their similarity is above (or equal to) the threshold. The clusters are inferred from this representation. A table summarises, for each spectrum, the cluster number it was assigned to and the size of the cluster, that is, the total number of spectra in the cluster.

# Toy similarity matrix between the six example spectra of
#  three species. The cosine metric is used and a value of
#  zero indicates dissimilar spectra and a value of one
#  indicates identical spectra.
cosine_similarity <- matrix(
  c(
    1, 0.79, 0.77, 0.99, 0.98, 0.98,
    0.79, 1, 0.98, 0.79, 0.8, 0.8,
    0.77, 0.98, 1, 0.77, 0.77, 0.77,
    0.99, 0.79, 0.77, 1, 1, 0.99,
    0.98, 0.8, 0.77, 1, 1, 1,
    0.98, 0.8, 0.77, 0.99, 1, 1
  ),
  nrow = 6,
  dimnames = list(
    c(
      "species1_G2", "species2_E11", "species2_E12",
      "species3_F7", "species3_F8", "species3_F9"
    ),
    c(
      "species1_G2", "species2_E11", "species2_E12",
      "species3_F7", "species3_F8", "species3_F9"
    )
  )
)
# Delineate clusters based on a 0.92 threshold applied
#  to the similarity matrix
similarity_to_clusters(cosine_similarity, threshold = 0.92)
#> # A tibble: 6 × 3
#>   name         membership cluster_size
#>   <chr>             <int>        <int>
#> 1 species1_G2           1            4
#> 2 species2_E11          2            2
#> 3 species2_E12          2            2
#> 4 species3_F7           1            4
#> 5 species3_F8           1            4
#> 6 species3_F9           1            4
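
The network idea can be sketched in base R, under the assumption that clusters correspond to the connected components of the thresholded network. The code below is not the implementation of similarity_to_clusters(); the traversal and all variable names are ours, and it reuses the toy similarity values from the example above.

```r
spectra_names <- c(
  "species1_G2", "species2_E11", "species2_E12",
  "species3_F7", "species3_F8", "species3_F9"
)
sim <- matrix(
  c(
    1, 0.79, 0.77, 0.99, 0.98, 0.98,
    0.79, 1, 0.98, 0.79, 0.8, 0.8,
    0.77, 0.98, 1, 0.77, 0.77, 0.77,
    0.99, 0.79, 0.77, 1, 1, 0.99,
    0.98, 0.8, 0.77, 1, 1, 1,
    0.98, 0.8, 0.77, 0.99, 1, 1
  ),
  nrow = 6, dimnames = list(spectra_names, spectra_names)
)
# Links exist when the similarity reaches the threshold;
#  the diagonal is dropped so that the network has no loops
adjacency <- sim >= 0.92
diag(adjacency) <- FALSE

# Breadth-first traversal: every spectrum reachable from an
#  unassigned spectrum receives the same cluster number
membership <- rep(NA_integer_, nrow(sim))
current <- 0L
for (start in seq_len(nrow(sim))) {
  if (!is.na(membership[start])) next
  current <- current + 1L
  queue <- start
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    if (!is.na(membership[node])) next
    membership[node] <- current
    queue <- c(queue, which(adjacency[node, ] & is.na(membership)))
  }
}
toy_clusters <- data.frame(
  name = spectra_names,
  membership = membership,
  cluster_size = as.integer(table(membership)[as.character(membership)])
)
toy_clusters
```

With connected components, two spectra can share a cluster without being directly similar, as long as a chain of similar spectra links them.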

Set a reference spectrum for each cluster

Once the table of clusters is generated from the similarity matrix, a reference spectrum can be assigned to each cluster.

We choose to define high-quality spectra as representative spectra of the clusters using internal information. That is, representative spectra have, within their cluster, the highest median signal-to-noise ratio and then the highest number of detected peaks.

The function set_reference_spectra() does not change the order of the cluster table but merely adds an additional column is_reference to indicate whether the corresponding spectrum is representative of the cluster.

# Get an example directory of six Bruker MALDI Biotyper spectra
# Import the six spectra and
# Transform the spectra signals according to Strejcek et al. (2018)
processed <- system.file(
  "toy-species-spectra",
  package = "maldipickr"
) %>%
  import_biotyper_spectra() %>%
  suppressMessages() %>%
  process_spectra()

# Toy similarity matrix between the six example spectra of
#  three species. The cosine metric is used and a value of
#  zero indicates dissimilar spectra and a value of one
#  indicates identical spectra.
cosine_similarity <- matrix(
  c(
    1, 0.79, 0.77, 0.99, 0.98, 0.98,
    0.79, 1, 0.98, 0.79, 0.8, 0.8,
    0.77, 0.98, 1, 0.77, 0.77, 0.77,
    0.99, 0.79, 0.77, 1, 1, 0.99,
    0.98, 0.8, 0.77, 1, 1, 1,
    0.98, 0.8, 0.77, 0.99, 1, 1
  ),
  nrow = 6,
  dimnames = list(
    c(
      "species1_G2", "species2_E11", "species2_E12",
      "species3_F7", "species3_F8", "species3_F9"
    ),
    c(
      "species1_G2", "species2_E11", "species2_E12",
      "species3_F7", "species3_F8", "species3_F9"
    )
  )
)
# Delineate clusters based on a 0.92 threshold applied
#  to the similarity matrix
clusters <- similarity_to_clusters(
  cosine_similarity,
  threshold = 0.92
)

# Set reference spectra with the toy example
set_reference_spectra(clusters, processed$metadata)
#> # A tibble: 6 × 6
#>   name         membership cluster_size   SNR peaks is_reference
#>   <chr>             <int>        <int> <dbl> <dbl> <lgl>       
#> 1 species1_G2           1            4  5.09    21 FALSE       
#> 2 species2_E11          2            2  5.54    22 FALSE       
#> 3 species2_E12          2            2  5.63    23 TRUE        
#> 4 species3_F7           1            4  4.89    26 FALSE       
#> 5 species3_F8           1            4  5.56    25 TRUE        
#> 6 species3_F9           1            4  5.40    25 FALSE
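
The selection rule described above (highest median signal-to-noise ratio first, then highest number of peaks, within each cluster) can be sketched in base R as follows. This is a simplified illustration with hypothetical variable names, not the code of set_reference_spectra(); the values are copied from the table above.

```r
toy_clusters <- data.frame(
  name = c(
    "species1_G2", "species2_E11", "species2_E12",
    "species3_F7", "species3_F8", "species3_F9"
  ),
  membership = c(1, 2, 2, 1, 1, 1),
  SNR = c(5.09, 5.54, 5.63, 4.89, 5.56, 5.40),
  peaks = c(21, 22, 23, 26, 25, 25)
)
# Rank the rows by cluster, then by decreasing SNR,
#  then by decreasing number of peaks
ranked <- order(
  toy_clusters$membership, -toy_clusters$SNR, -toy_clusters$peaks
)
# The first row of each cluster in that ranking is the reference
best_rows <- ranked[!duplicated(toy_clusters$membership[ranked])]
# Flag the references without changing the order of the table
toy_clusters$is_reference <- seq_len(nrow(toy_clusters)) %in% best_rows
toy_clusters
```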

Import clusters results generated by SPeDE

Raw spectra can also be processed and clustered by another approach, named SPeDE, developed by Dumolin et al. (2019). Its dereplication step produces a comma-separated table. The example below illustrates how to import this table into R in a format consistent with the dereplication table generated within the {maldipickr} package.

# Reformat the output from SPeDE table
# https://github.com/LM-UGent/SPeDE
import_spede_clusters(
  system.file("spede.csv", package = "maldipickr")
)
#> # A tibble: 6 × 5
#>   name         membership cluster_size quality is_reference
#>   <chr>             <dbl>        <int> <chr>   <lgl>       
#> 1 species1_G2           1            1 GREEN   TRUE        
#> 2 species2_E11          2            2 ORANGE  FALSE       
#> 3 species2_E12          2            2 GREEN   TRUE        
#> 4 species3_F7           3            1 GREEN   TRUE        
#> 5 species3_F8           4            2 ORANGE  FALSE       
#> 6 species3_F9           4            2 GREEN   TRUE

Cherry-pick Bruker MALDI Biotyper spectra

When isolating bacteria from an environment, experimenters want to be thorough while saving work, time and cost. One approach is to reduce the redundancy of the bacterial isolates by analyzing their MALDI-TOF spectra from the Bruker Biotyper. All the steps previously described in this vignette process the spectra so that only non-redundant spectra are picked, using the pick_spectra() function.

The function, as illustrated in the examples below, can pick spectra using different types of inputs:

the reference spectra information that is present in the cluster table (after using similarity_to_clusters() or import_spede_clusters() functions; see example 1)
an external metadata table containing a variable (e.g., optical density, fluorescence) to be maximized (default) or minimized per cluster (see example 2)

Spectra, and clusters, can also be excluded from the cherry-picking decision, a procedure termed masking here. We distinguish two types of masks that are implemented in the pick_spectra() function:

soft mask that discards only the spectra, for instance if they correspond to low-quality samples or negative control samples (see example 3)
hard mask that discards the spectra and their clusters (see example 4). This is particularly useful if some spectra have been picked previously, for instance to exclude colonies grown and picked 24h after streaking when comparing with colonies grown for 72h.

Advanced users can also directly provide a cluster table with a custom sort by cluster to accommodate complex designs.

Ultimately, the function delivers a table with as many rows as the cluster table with an additional logical column named to_pick to indicate whether the colony associated with the spectra should be picked (TRUE) or not picked (FALSE).

# 0. Load a toy example of a tibble of clusters created by
#   the `similarity_to_clusters` function.
clusters <- readRDS(
  system.file("clusters_tibble.RDS",
    package = "maldipickr"
  )
)
# 1. By default and if no other metadata are provided,
#   the function picks the reference spectrum of each cluster.
pick_spectra(clusters)
#> # A tibble: 6 × 7
#>   name         membership cluster_size   SNR peaks is_reference to_pick
#>   <chr>             <int>        <int> <dbl> <dbl> <lgl>        <lgl>  
#> 1 species1_G2           1            4  5.09    21 FALSE        FALSE  
#> 2 species2_E11          2            2  5.54    22 FALSE        FALSE  
#> 3 species2_E12          2            2  5.63    23 TRUE         TRUE   
#> 4 species3_F7           1            4  4.89    26 FALSE        FALSE  
#> 5 species3_F8           1            4  5.56    25 TRUE         TRUE   
#> 6 species3_F9           1            4  5.40    25 FALSE        FALSE

# 2.1 Simulate OD600 values with uniform distribution
#  for each of the colonies we measured with
#  the Bruker MALDI Biotyper
set.seed(104)
metadata <- dplyr::transmute(
  clusters,
  name = name, OD600 = runif(n = nrow(clusters))
)
metadata
#> # A tibble: 6 × 2
#>   name         OD600
#>   <chr>        <dbl>
#> 1 species1_G2  0.364
#> 2 species2_E11 0.772
#> 3 species2_E12 0.735
#> 4 species3_F7  0.973
#> 5 species3_F8  0.740
#> 6 species3_F9  0.201

# 2.2 Pick the spectra based on the highest
#   OD600 value per cluster
pick_spectra(clusters, metadata, "OD600")
#> # A tibble: 6 × 8
#>   name         membership cluster_size   SNR peaks is_reference OD600 to_pick
#>   <chr>             <int>        <int> <dbl> <dbl> <lgl>        <dbl> <lgl>  
#> 1 species1_G2           1            4  5.09    21 FALSE        0.364 FALSE  
#> 2 species2_E11          2            2  5.54    22 FALSE        0.772 TRUE   
#> 3 species2_E12          2            2  5.63    23 TRUE         0.735 FALSE  
#> 4 species3_F7           1            4  4.89    26 FALSE        0.973 TRUE   
#> 5 species3_F8           1            4  5.56    25 TRUE         0.740 FALSE  
#> 6 species3_F9           1            4  5.40    25 FALSE        0.201 FALSE

# 3.1 Say that the wells on the right side of the plate are
#   used for negative controls and should not be picked.
metadata <- metadata %>% dplyr::mutate(
  well = gsub(".*[A-Z]([0-9]{1,2}$)", "\\1", name) %>%
    strtoi(),
  is_edge = is_well_on_edge(
    well_number = well, plate_layout = 96, edges = "right"
  )
)

# 3.2 Pick the spectra after discarding (or soft masking)
#   the spectra indicated by the `is_edge` column.
pick_spectra(clusters, metadata, "OD600",
  soft_mask_column = "is_edge"
)
#> # A tibble: 6 × 10
#>   name      membership cluster_size   SNR peaks is_reference OD600  well is_edge
#>   <chr>          <int>        <int> <dbl> <dbl> <lgl>        <dbl> <int> <lgl>  
#> 1 species1…          1            4  5.09    21 FALSE        0.364     2 FALSE  
#> 2 species2…          2            2  5.54    22 FALSE        0.772    11 FALSE  
#> 3 species2…          2            2  5.63    23 TRUE         0.735    12 TRUE   
#> 4 species3…          1            4  4.89    26 FALSE        0.973     7 FALSE  
#> 5 species3…          1            4  5.56    25 TRUE         0.740     8 FALSE  
#> 6 species3…          1            4  5.40    25 FALSE        0.201     9 FALSE  
#> # ℹ 1 more variable: to_pick <lgl>

# 4.1 Say that some spectra were picked before
#   (e.g., in row F) in a previous experiment.
# We do not want to pick clusters with those spectra
#   included to limit redundancy.
metadata <- metadata %>% dplyr::mutate(
  picked_before = grepl("_F", name)
)
# 4.2 Pick the spectra from clusters without spectra
#   labelled as `picked_before` (hard masking).
pick_spectra(clusters, metadata, "OD600",
  hard_mask_column = "picked_before"
)
#> # A tibble: 6 × 11
#>   name      membership cluster_size   SNR peaks is_reference OD600  well is_edge
#>   <chr>          <int>        <int> <dbl> <dbl> <lgl>        <dbl> <int> <lgl>  
#> 1 species1…          1            4  5.09    21 FALSE        0.364     2 FALSE  
#> 2 species2…          2            2  5.54    22 FALSE        0.772    11 FALSE  
#> 3 species2…          2            2  5.63    23 TRUE         0.735    12 TRUE   
#> 4 species3…          1            4  4.89    26 FALSE        0.973     7 FALSE  
#> 5 species3…          1            4  5.56    25 TRUE         0.740     8 FALSE  
#> 6 species3…          1            4  5.40    25 FALSE        0.201     9 FALSE  
#> # ℹ 2 more variables: picked_before <lgl>, to_pick <lgl>
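
The picking and masking logic described in this section can be sketched in base R. The function pick_sketch() below is hypothetical and much simpler than pick_spectra(): it only maximizes one metadata column per cluster, and it assumes that a soft mask removes individual spectra while a hard mask removes every cluster containing a masked spectrum. The values are copied from the tables above.

```r
toy <- data.frame(
  name = c(
    "species1_G2", "species2_E11", "species2_E12",
    "species3_F7", "species3_F8", "species3_F9"
  ),
  membership = c(1, 2, 2, 1, 1, 1),
  OD600 = c(0.364, 0.772, 0.735, 0.973, 0.740, 0.201),
  is_edge = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
toy$picked_before <- grepl("_F", toy$name)

# Hypothetical helper: returns a logical vector, TRUE for picked rows
pick_sketch <- function(tbl, criteria,
                        soft_mask = rep(FALSE, nrow(tbl)),
                        hard_mask = rep(FALSE, nrow(tbl))) {
  # Hard mask: exclude whole clusters with at least one masked spectrum
  masked_clusters <- unique(tbl$membership[hard_mask])
  # Soft mask: exclude only the masked spectra themselves
  eligible <- !soft_mask & !(tbl$membership %in% masked_clusters)
  to_pick <- rep(FALSE, nrow(tbl))
  for (cl in unique(tbl$membership[eligible])) {
    candidates <- which(eligible & tbl$membership == cl)
    best <- candidates[which.max(tbl[[criteria]][candidates])]
    to_pick[best] <- TRUE
  }
  to_pick
}

picked_default <- pick_sketch(toy, "OD600")
picked_soft <- pick_sketch(toy, "OD600", soft_mask = toy$is_edge)
picked_hard <- pick_sketch(toy, "OD600", hard_mask = toy$picked_before)
toy$name[picked_hard]
```

In this sketch, the soft mask leaves the selection unchanged because the masked spectrum did not have the highest OD600 of its cluster, whereas the hard mask removes the whole cluster that contains the spectra from row F.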

References

Dumolin C, Aerts M, Verheyde B, Schellaert S, Vandamme T, Van Der Jeugt F, De Canck E, Cnockaert M, Wieme AD, Cleenwerck I, Peiren J, Dawyndt P, Vandamme P & Carlier A (2019). “Introducing SPeDE: High-Throughput Dereplication and Accurate Determination of Microbial Diversity from Matrix-Assisted Laser Desorption–Ionization Time of Flight Mass Spectrometry Data.” mSystems 4(5). doi:10.1128/msystems.00437-19.
Strejcek M, Smrhova T, Junkova P & Uhlik O (2018). “Whole-Cell MALDI-TOF MS versus 16S rRNA Gene Analysis for Identification and Dereplication of Recurrent Bacterial Isolates.” Frontiers in Microbiology 9. doi:10.3389/fmicb.2018.01294.