Filtering • tiledbsc

Setup

library(tiledbsc)
#> The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
#> which was just loaded, will retire in October 2023.
#> Please refer to R-spatial evolution reports for details, especially
#> https://r-spatial.org/r/2023/05/15/evolution4.html.
#> It may be desirable to make the sf package available;
#> package maintainers should consider adding sf to Suggests:.
#> The sp package is now running under evolution status 2
#>      (status 2 uses the sf package in place of rgdal)
library(SeuratObject)
#> Loading required package: sp

Overview

This vignette demonstrates how to leverage tiledbsc’s features for subsetting and filtering data stored as a SOMA.

As described in ?SOMA, there are two primary approaches for subsetting: dimension slicing and attribute filtering, both of which are described in this vignette and are performed using the set_query() method, provided by all tiledbsc objects.

Dimension slicing

Low-level classes

tiledbsc objects within a SOMA are indexed by a set of common cell- or feature-identifiers, which are efficiently sliced by providing a list of dimension identifiers.

As an example, let’s ingest cell-level metadata from the pbmc_small dataset into an AnnotationDataframe indexed by the cell identifiers.

data("pbmc_small", package = "SeuratObject")

adf <- AnnotationDataframe$new(uri = file.path(tempdir(), "pbmc_small_obs"))
#> No AnnotationDataframe found at '/tmp/RtmpwLzq6u/pbmc_small_obs'
adf$from_dataframe(pbmc_small[[]], index_col = "obs_id")
#> Creating new AnnotationDataframe array with index [obs_id] at '/tmp/RtmpwLzq6u/pbmc_small_obs'
#> Adding 3 metadata keys to array
#> Checking legacy validity mode for array: '/tmp/RtmpwLzq6u/pbmc_small_obs'
#> Ingesting AnnotationDataframe data into: /tmp/RtmpwLzq6u/pbmc_small_obs
adf
#> <AnnotationDataframe>
#>   uri: /tmp/RtmpwLzq6u/pbmc_small_obs 
#>   dimensions: obs_id 
#>   attributes: orig.ident, nCount_RNA, nFeature_RNA, RNA_snn_res.0.8, letter.idents, groups,...

As you’d expect, reading the data back into memory recreates the same data frame with all 80 cells.

dim(adf$to_dataframe())
#> Reading AnnotationDataframe into memory from '/tmp/RtmpwLzq6u/pbmc_small_obs'
#> Checking legacy validity mode for array: '/tmp/RtmpwLzq6u/pbmc_small_obs'
#> [1] 80  7

We can use the set_query() method to read in a specific subset of cells by providing a named list of ranges for each dimension.

adf$set_query(
  dims = list(obs_id = c("AAATTCGAATCACG", "AAGCAAGAGCTTAG", "AAGCGACTTTGACG"))
)
#> NULL

Now, the next time we call to_dataframe() the resulting data frame will only contain the cells we specified.

adf$to_dataframe()
#> Reading AnnotationDataframe into memory from '/tmp/RtmpwLzq6u/pbmc_small_obs'
#> Checking legacy validity mode for array: '/tmp/RtmpwLzq6u/pbmc_small_obs'
#>                   orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8
#> AAATTCGAATCACG SeuratProject        327           62               1
#> AAGCAAGAGCTTAG SeuratProject        126           48               0
#> AAGCGACTTTGACG SeuratProject        443           77               1
#>                letter.idents groups RNA_snn_res.1
#> AAATTCGAATCACG             B     g2             1
#> AAGCAAGAGCTTAG             A     g1             0
#> AAGCGACTTTGACG             B     g1             1

To clear this query and again read in all cells, we can use the reset_query() method.

adf$reset_query()

This functionality is also available for AnnotationGroup-based classes, which ostensibly contain a group of arrays that all share the same dimensions. In this case, the dims argument is automatically applied to all child arrays.

High-level classes

The SOMA class also provides a set_query() method for the same purpose. However, unlike the lower-level classes which are meant to be general-purpose, the SOMA class has well-defined requirements for the dimensions of its member arrays. As such, SOMA$set_query() provides arguments, obs_ids and var_ids, rather than the more general dims argument.

To see this in action, let’s create a SOMA from pbmc_small’s RNA assay.

soma <- SOMA$new(uri = file.path(tempdir(), "pbmc_small_rna"))
#> No SOMA currently exists at '/tmp/RtmpwLzq6u/pbmc_small_rna'
#> Creating new SOMA at '/tmp/RtmpwLzq6u/pbmc_small_rna'
#> No AnnotationDataframe found at '/tmp/RtmpwLzq6u/pbmc_small_rna/obs'
#> No AnnotationDataframe found at '/tmp/RtmpwLzq6u/pbmc_small_rna/var'
#> No AssayMatrixGroup currently exists at '/tmp/RtmpwLzq6u/pbmc_small_rna/X'
#> Creating new AssayMatrixGroup at '/tmp/RtmpwLzq6u/pbmc_small_rna/X'
#> No AnnotationMatrixGroup currently exists at '/tmp/RtmpwLzq6u/pbmc_small_rna/obsm'
#> Creating new AnnotationMatrixGroup at '/tmp/RtmpwLzq6u/pbmc_small_rna/obsm'
#> No AnnotationMatrixGroup currently exists at '/tmp/RtmpwLzq6u/pbmc_small_rna/varm'
#> Creating new AnnotationMatrixGroup at '/tmp/RtmpwLzq6u/pbmc_small_rna/varm'
#> No AnnotationPairwiseMatrixGroup currently exists at '/tmp/RtmpwLzq6u/pbmc_small_rna/obsp'
#> Creating new AnnotationPairwiseMatrixGroup at '/tmp/RtmpwLzq6u/pbmc_small_rna/obsp'
#> No AnnotationPairwiseMatrixGroup currently exists at '/tmp/RtmpwLzq6u/pbmc_small_rna/varp'
#> Creating new AnnotationPairwiseMatrixGroup at '/tmp/RtmpwLzq6u/pbmc_small_rna/varp'
#> No TileDBGroup currently exists at '/tmp/RtmpwLzq6u/pbmc_small_rna/uns'
#> Creating new TileDBGroup at '/tmp/RtmpwLzq6u/pbmc_small_rna/uns'
soma$from_seurat_assay(pbmc_small[["RNA"]])
#> Creating new AnnotationDataframe array with index [obs_id] at '/tmp/RtmpwLzq6u/pbmc_small_rna/obs'
#> Adding 3 metadata keys to array
#> Checking legacy validity mode for array: '/tmp/RtmpwLzq6u/pbmc_small_rna/obs'
#> Ingesting AnnotationDataframe data into: /tmp/RtmpwLzq6u/pbmc_small_rna/obs
#> Creating new AnnotationDataframe array with index [var_id] at '/tmp/RtmpwLzq6u/pbmc_small_rna/var'
#> Adding 3 metadata keys to array
#> Checking legacy validity mode for array: '/tmp/RtmpwLzq6u/pbmc_small_rna/var'
#> Ingesting AnnotationDataframe data into: /tmp/RtmpwLzq6u/pbmc_small_rna/var
#> No AssayMatrix found at '/tmp/RtmpwLzq6u/pbmc_small_rna/X/counts'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/tmp/RtmpwLzq6u/pbmc_small_rna/X/counts'
#> Adding 3 metadata keys to array
#> Ingesting AssayMatrix data into: /tmp/RtmpwLzq6u/pbmc_small_rna/X/counts
#> No AssayMatrix found at '/tmp/RtmpwLzq6u/pbmc_small_rna/X/data'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/tmp/RtmpwLzq6u/pbmc_small_rna/X/data'
#> Adding 3 metadata keys to array
#> Ingesting AssayMatrix data into: /tmp/RtmpwLzq6u/pbmc_small_rna/X/data
#> No AssayMatrix found at '/tmp/RtmpwLzq6u/pbmc_small_rna/X/scale.data'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/tmp/RtmpwLzq6u/pbmc_small_rna/X/scale.data'
#> Adding 3 metadata keys to array
#> Ingesting AssayMatrix data into: /tmp/RtmpwLzq6u/pbmc_small_rna/X/scale.data
#> Finished converting Seurat Assay with key [rna_] to SOMA

Now let’s use set_query() to slice three features of interest.

soma$set_query(var_ids = c("PPBP", "PF4", "GNLY"))

Any of the individual SOMA components will respect the new query. So if we read in the feature-level metadata from var, the resulting data frame contains only the features we specified.

soma$var$to_dataframe()
#> Reading AnnotationDataframe into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/var'
#> Checking legacy validity mode for array: '/tmp/RtmpwLzq6u/pbmc_small_rna/var'
#>      vst.mean vst.variance vst.variance.expected vst.variance.standardized
#> GNLY    2.400     43.76203              24.07857                  1.817468
#> PF4     2.150     65.24304              21.48275                  1.939028
#> PPBP    5.325    231.81709              93.14856                  2.488681
#>      vst.variable highly_variable
#> GNLY         TRUE            TRUE
#> PF4          TRUE            TRUE
#> PPBP         TRUE            TRUE

And when we recreate the entire Seurat Assay from the SOMA, the slice is carried through to all relevant components,

pbmc_small_rna <- soma$to_seurat_assay()
#> Reading AssayMatrix into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/X/counts'
#> Reading AssayMatrix into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/X/data'
#> Reading AssayMatrix into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/X/scale.data'
#> Reading AnnotationDataframe into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/var'
#> Checking legacy validity mode for array: '/tmp/RtmpwLzq6u/pbmc_small_rna/var'

including the raw counts,

GetAssayData(pbmc_small_rna, slot = "counts")
#> 3 x 80 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 80 column names 'AAGCAAGAGCTTAG', 'ACAGGTACTGGTGT', 'AGATATACCCGTAA' ... ]]
#>                                                                               
#> GNLY 4 15 3 1 1 1 18 1 35 15 3 11 1 18 3 22 29 10 1  .  .  .  .  .  .  .  .  .
#> PF4  .  . . . 1 .  . .  .  . .  . .  . .  .  .  . . 14 14 14  9 62  6 11 18 23
#> PPBP .  . . . . .  . .  .  . .  . .  . .  .  .  . . 34 36 43 66 54 30 41 55 58
#>                                                                               
#> GNLY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#> PF4  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#> PPBP 1 1 6 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#>                                   
#> GNLY . . . . . . . . . . . . . . .
#> PF4  . . . . . . . . . . . . . . .
#> PPBP . . . . . . . . . . . . . . .

and normalized/scaled data:

GetAssayData(pbmc_small_rna, slot = "data")
#> 3 x 80 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 80 column names 'AAGCAAGAGCTTAG', 'ACAGGTACTGGTGT', 'AGATATACCCGTAA' ... ]]
#>                                                                             
#> GNLY 5.763498 6.902117 5.084058 3.040721 3.582547 3.160117 6.346724 4.676402
#> PF4  .        .        .        .        3.582547 .        .        .       
#> PPBP .        .        .        .        .        .        .        .       
#>                                                                          
#> GNLY 7.26106 6.677903 6.596746 6.509859 2.657641 7.26513 4.99761 6.904224
#> PF4  .       .        .        .        .        .       .       .       
#> PPBP .       .        .        .        .        .       .       .       
#>                                                                             
#> GNLY 7.528318 6.464628 4.378773 .        .        .        .        .       
#> PF4  .        .        .        6.794273 6.726633 6.494325 6.101539 7.071124
#> PPBP .        .        .        7.680917 7.670362 7.615447 8.092033 6.933099
#>                                                                               
#> GNLY .        .        .        .        .        .        .        .        .
#> PF4  5.993961 6.462304 6.754771 6.917480 .        .        .        .        .
#> PPBP 7.601402 7.776837 7.870948 7.841831 3.976987 3.117873 5.554937 4.753095 .
#>                                                                               
#> GNLY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#> PF4  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#> PPBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#>                         
#> GNLY . . . . . . . . . .
#> PF4  . . . . . . . . . .
#> PPBP . . . . . . . . . .

Again, to retrieve all of the original data we need to reset the query:

soma$reset_query()

Attribute Filtering

In addition to dimension slicing, set_query() allows you to filter query results using attribute filters. As a reminder, in TileDB parlance attributes are the non-dimensions (i.e., unindexed) components of a dataset.

This is particularly useful for filtering SOMAs based on cell- and feature-level metadata stored in the obs and var arrays. For example, let’s filter out genes with a vst.mean <= 5.

Unlink dimension slicing, attribute filters are applied immediately to the obs or var arrays, and the identifiers that passed the filter are used to slice the data at read time.

soma$set_query(var_attr_filter = vst.mean > 5)
#> Querying var with attribute filter
#> ...retrieved 12 var IDs

With verbose=TRUE, running the above command will print a message indicating how many identifiers were selected. Now, when we read the data back into memory the resulting Assay will only include the 12 selected features.

soma$to_seurat_assay()
#> Reading AssayMatrix into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/X/counts'
#> Reading AssayMatrix into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/X/data'
#> Reading AssayMatrix into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/X/scale.data'
#> Reading AnnotationDataframe into memory from '/tmp/RtmpwLzq6u/pbmc_small_rna/var'
#> Checking legacy validity mode for array: '/tmp/RtmpwLzq6u/pbmc_small_rna/var'
#> Assay data with 12 features for 80 cells
#> Top 3 variable features:
#>  HLA-DPB1, PPBP, S100A9

These two filtering methods can also be combined. For example, let’s apply the same attribute filter as above but only on the subset of features previously identified as highly-variable.

vfs <- pbmc_small_rna@var.features
soma$set_query(var_ids = vfs, var_attr_filter = vst.mean > 5)
#> Querying var with attribute filter
#> ...retrieved 1 var IDs

This approach can be more performant because the filter is selectively applied to the subset of indexed identifiers.

Session

Session Info

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] RcppSpdlog_0.0.14   SeuratObject_4.1.3  sp_2.0-0           
#> [4] tiledbsc_0.1.5.9002
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.7          future_1.33.0       spdl_0.0.5         
#>  [4] stringi_1.7.12      lattice_0.21-8      listenv_0.9.0      
#>  [7] digest_0.6.33       magrittr_2.0.3      tiledb_0.20.3      
#> [10] evaluate_0.21       grid_4.3.1          fastmap_1.1.1      
#> [13] rprojroot_2.0.3     jsonlite_1.8.7      Matrix_1.5-4.1     
#> [16] urltools_1.7.3      purrr_1.0.1         codetools_0.2-19   
#> [19] textshaping_0.3.6   jquerylib_0.1.4     cli_3.6.1          
#> [22] rlang_1.1.1         triebeard_0.4.1     parallelly_1.36.0  
#> [25] future.apply_1.11.0 bit64_4.0.5         cachem_1.0.8       
#> [28] yaml_2.3.7          tools_4.3.1         parallel_4.3.1     
#> [31] nanotime_0.3.7      memoise_2.0.1       globals_0.16.2     
#> [34] vctrs_0.6.3         R6_2.5.1            zoo_1.8-12         
#> [37] lifecycle_1.0.3     stringr_1.5.0       fs_1.6.3           
#> [40] bit_4.0.5           ragg_1.2.5          desc_1.4.2         
#> [43] pkgdown_2.0.7       progressr_0.13.0    bslib_0.5.0        
#> [46] glue_1.6.2          Rcpp_1.0.11         systemfonts_1.0.4  
#> [49] xfun_0.40           knitr_1.43          htmltools_0.5.5    
#> [52] rmarkdown_2.23      compiler_4.3.1      RcppCCTZ_0.2.12