Overview

This vignette provides an overview of the tiledbsc R package which uses the open source SOMA data model to store single-cell data in a collection of TileDB arrays.

With tiledbsc you can:

  • convert data from Seurat and other popular single-cell formats into SOMA
  • access SOMA data using any of TileDB’s APIs
  • store your datasets on S3 (or other remote object stores) and slice directly without having to download anything first

In this vignette you’ll learn how to:

  • ingest Seurat data into TileDB-backed SOMA Collections
  • access various components of a dataset
  • slice by cell-/feature-identifiers
  • query using cell-/feature-level annotations
  • update a SOMA with analysis results
  • load data from SOMA into R as a Seurat or Bioconductor (i.e., SummarizedExperiment) object
  • store your data on S3 and register with TileDB Cloud

Setup

library(tiledbsc)
library(SeuratObject)
#> Loading required package: sp

For this tutorial we’ll use a dataset from 10X genomics containing 2,700 peripheral blood mononuclear cells (PBMC) from a healthy donor. The dataset is available on 10X Genomics’ website.

Use the provided helper function to download the filtered gene/cell matrix from 10X and create a Seurat object.

pbmc3k <- dataset_seurat_pbmc3k()
pbmc3k
#> An object of class Seurat 
#> 32738 features across 2700 samples within 1 assay 
#> Active assay: RNA (32738 features, 0 variable features)

Populate a SOMACollection

Our first step is to create a new SOMACollection at a specific URI. The URI could be a local filepath, an S3 URI, or as discussed below, a TileDB Cloud URI that automatically registers the dataset on TileDB Cloud.

soco_uri <- file.path(tempdir(), "soco-pbmc3k")

soco <- SOMACollection$new(uri = soco_uri)
#> No SOMACollection currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k'
#> Creating new SOMACollection at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k'
#> No TileDBGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/uns'
#> Creating new TileDBGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/uns'
soco
#> <SOMACollection>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k 
#>   groups: uns

Now we use tiledbsc to automatically ingest the various components of the pbmc3k object into the new SOMACollection.

soco$from_seurat(pbmc3k)
#> No SOMA currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA'
#> Creating new SOMA at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA'
#> No AnnotationDataframe found at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#> No AnnotationDataframe found at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var'
#> No AssayMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X'
#> Creating new AssayMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X'
#> No AnnotationMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsm'
#> Creating new AnnotationMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsm'
#> No AnnotationMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varm'
#> Creating new AnnotationMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varm'
#> No AnnotationPairwiseMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsp'
#> Creating new AnnotationPairwiseMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsp'
#> No AnnotationPairwiseMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varp'
#> Creating new AnnotationPairwiseMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varp'
#> No TileDBGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/uns'
#> Creating new TileDBGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/uns'
#> Creating new AnnotationDataframe array with index [obs_id] at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#> Ingesting AnnotationDataframe data into: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs
#> Creating new AnnotationDataframe array with index [var_id] at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var'
#> Ingesting AnnotationDataframe data into: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var
#> Skipping ingestion of 'data' because it is identical to 'counts'
#> No AssayMatrix found at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
#> Ingesting AssayMatrix data into: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts
#> Finished converting Seurat Assay with key [rna_] to SOMA
#> Finished converting Seurat object to SOMACollection

Printing the soco object shows it comprises 2 members: RNA, which contains the RNA assay data, and uns, which is a special group for miscellaneous (i.e., unstructured) data.

soco
#> <SOMACollection>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k 
#>   groups: RNA, uns

Access SOMA Components

As this dataset is unimodal, we’ll extract the SOMA containing the RNA data.

soma <- soco$get_member("RNA")
soma
#> <SOMA>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA 
#>   arrays: obs, var 
#>   groups: obsm, obsp, uns, varm, varp, X

Access the [AnnotationDataframe] representing the obs array containing cell-level metadata stored in obs.

soma$obs
#> <AnnotationDataframe>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs 
#>   dimensions: obs_id 
#>   attributes: orig.ident, nCount_RNA, nFeature_RNA

Read it into memory as a data.frame.

obs <- soma$obs$to_dataframe()
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
summary(obs)
#>   orig.ident          nCount_RNA     nFeature_RNA   
#>  Length:2700        Min.   :  548   Min.   : 212.0  
#>  Class :character   1st Qu.: 1758   1st Qu.: 690.0  
#>  Mode  :character   Median : 2197   Median : 817.0  
#>                     Mean   : 2367   Mean   : 847.0  
#>                     3rd Qu.: 2763   3rd Qu.: 953.2  
#>                     Max.   :15844   Max.   :3422.0

Access the [AssayMatrix] representing the X array containing the raw RNA counts.

soma$X$members$counts
#> <AssayMatrix>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts 
#>   dimensions: var_id, obs_id 
#>   attributes: value

Read it into memory as a [dgTMatrix][Matrix::TsparseMatrix-class].

mat <- soma$X$members$counts$to_matrix()
#> Reading AssayMatrix into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
dim(mat)
#> [1] 16634  2700

Dimension Slicing and Attribute Filtering

SOMACollection and SOMA objects can be sliced by a combination of obs and var identifiers.

soma$set_query(obs_ids = head(colnames(pbmc3k), 10))

This slice is automatically applied to the entire SOMA, so any component read into memory will now only contain the filtered results.

soma$obs$to_dataframe()
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#>                     orig.ident nCount_RNA nFeature_RNA
#> AAACATACAACCAC-1 SeuratProject       2421          781
#> AAACATTGAGCTAC-1 SeuratProject       4903         1352
#> AAACATTGATCAGC-1 SeuratProject       3149         1131
#> AAACCGTGCTTCCG-1 SeuratProject       2639          960
#> AAACCGTGTATGCG-1 SeuratProject        981          522
#> AAACGCACTGGTAC-1 SeuratProject       2164          782
#> AAACGCTGACCAGT-1 SeuratProject       2176          783
#> AAACGCTGGTTCTT-1 SeuratProject       2260          790
#> AAACGCTGTAGCCA-1 SeuratProject       1276          533
#> AAACGCTGTTTCTG-1 SeuratProject       1103          550

To remove the filter simply reset the query.

soma$reset_query()
summary(soma$obs$to_dataframe())
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#>   orig.ident          nCount_RNA     nFeature_RNA   
#>  Length:2700        Min.   :  548   Min.   : 212.0  
#>  Class :character   1st Qu.: 1758   1st Qu.: 690.0  
#>  Mode  :character   Median : 2197   Median : 817.0  
#>                     Mean   : 2367   Mean   : 847.0  
#>                     3rd Qu.: 2763   3rd Qu.: 953.2  
#>                     Max.   :15844   Max.   :3422.0

In addition to dimension slicing we can also apply filters using the cell- and feature-level metadata stored in the obs and var arrays.

soma$set_query(
    obs_attr_filter = seurat_annotations == "B" && nCount_RNA > 1000
)
#> Querying obs with attribute filter
#> Error: No attibute 'seurat_annotations' present.

We can load the SOMA slice into memory as a Seurat Assay

soma$to_seurat_assay()
#> Reading AssayMatrix into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var'
#> Assay data with 16634 features for 2700 cells
#> First 10 features:
#>  ENSG00000000003, ENSG00000000419, ENSG00000000457, ENSG00000000460,
#> ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084,
#> ENSG00000001167, ENSG00000001460

or as a Bioconductor [SummarizedExperiment].

if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  soma$to_summarized_experiment()
}
#> Reading AssayMatrix into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var'
#> class: SummarizedExperiment 
#> dim: 16634 2700 
#> metadata(0):
#> assays(1): counts
#> rownames(16634): ENSG00000000003 ENSG00000000419 ... ENSG00000273486
#>   ENSG00000273489
#> rowData names(1): gene_name
#> colnames(2700): ATAGATACCATGGT-1 AAACATTGATCAGC-1 ... ACTTGTACCTGTCC-1
#>   GCAACCCTCCTCGT-1
#> colData names(3): orig.ident nCount_RNA nFeature_RNA

See vignette("filtering", package = "tiledbsc") for more details.

Adding Analysis Results to a SOMA

The original pbmc3k object contains only the raw counts data. Here we’ll use Seurat to generate new results and add them to the soma.

Add Layer for Normalized Counts

First we normalize the raw counts, which creates a new matrix stored in the Assay object.

pbmc3k <- Seurat::NormalizeData(pbmc3k, normalization.method = "LogNormalize")

Currently the RNA SOMA’s X group contains only a single layer, counts.

soma$X
#> <AssayMatrixGroup>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X 
#>   arrays: counts

Add a new X layer containing log-normalized counts stored in the Seurat Assay’s data slot.

soma$from_seurat_assay(
  object = pbmc3k[["RNA"]],
  layers = "data",
  var = FALSE
)
#> No AssayMatrix found at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/data'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/data'
#> Ingesting AssayMatrix data into: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/data
#> Finished converting Seurat Assay with key [rna_] to SOMA

Now we can see the SOMA contains 2 X layers: counts and data.

soma$X
#> <AssayMatrixGroup>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X 
#>   arrays: counts, data

Add Layer for Scaled counts

First, we identify highly variable subset of genes to scale.

var_genes <- VariableFeatures(
  Seurat::FindVariableFeatures(pbmc3k, nfeatures = 2000)
)

Then generate the scaled data.

pbmc3k <- Seurat::ScaleData(pbmc3k, features = var_genes)

And ingest the scaled data into a new X layer called "scale.data".

soma$from_seurat_assay(
    object = pbmc3k[["RNA"]],
    layers = "scale.data",
    var = FALSE
)

Again, verify the new layer was added.

soma$X
#> <AssayMatrixGroup>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X 
#>   arrays: counts, data

Add Dimensional Reduction Results

We can use a similar workflow for dimensional-reduction results, which are stored in either the obsm or varm slots of a SOMA.

dimreduc_pca <- Reductions(
  Seurat::RunPCA(pbmc3k, features = var_genes),
  slot = "pca"
)

Add the Seurat DimReduc object to the SOMA.

soma$add_seurat_dimreduction(
    dimreduc_pca,
    technique = "pca"
)

The SOMA now contains 2 new arrays, one within the obsm group storing the cell embeddings

soma$obsm
#> <AnnotationMatrixGroup>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsm

and a second within the varm group storing the feature loadings.

soma$varm
#> <AnnotationMatrixGroup>
#>   uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varm

Either of these can be accessed and queried directly, which is often useful for visualization applications.

obsm_pca <- soma$obsm$get_member("dimreduction_pca")
obsm_pca

Here we’ll load the cell embeddings as a matrix and visualize the first 5 PCs.

cell_pcs <- obsm_pca$to_matrix()
pairs(cell_pcs[, 1:5])

Populate a SOMACollection in TileDB Cloud

You can ingest a Seurat object into an S3 SOMACollection that’s automatically registered with TileDB Cloud using a special URI composed of:

  • your TileDB namespace
  • an S3 bucket where the dataset will be created

and running the same initialization/ingestion steps we performed earlier:

soco <- SOMACollection$new("tiledb://<namespace>/s3://<bucket>/pbmc3k-soco")
soco$from_seurat(pbmc3k)

Note this assumes you:

You can then view the new SOMACollection directly on TileDB Cloud where you can inspect the various dataset components, share securely, view activity logs, and more. For example, here’s the TileDB Cloud view of the pbmc3k dataset we just ingested: aaronwolen/pbmc3k-soco.

You can also securely access the dataset directly via R (or Python) using its TileDB Cloud URI, tiledb://aaronwolen/pbmc3k-soco.

Session

Session Info
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] RcppSpdlog_0.0.12  SeuratObject_4.1.3 sp_1.6-0           tiledbsc_0.1.5    
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.10                 XVector_0.38.0             
#>  [3] GenomeInfoDb_1.34.9         compiler_4.2.2             
#>  [5] progressr_0.13.0            zlibbioc_1.44.0            
#>  [7] bitops_1.0-7                MatrixGenerics_1.10.0      
#>  [9] tools_4.2.2                 digest_0.6.31              
#> [11] bit_4.0.5                   evaluate_0.20              
#> [13] lattice_0.20-45             nanotime_0.3.7             
#> [15] rlang_1.0.6                 Matrix_1.5-3               
#> [17] DelayedArray_0.24.0         cli_3.6.0                  
#> [19] RcppCCTZ_0.2.12             spdl_0.0.4                 
#> [21] parallel_4.2.2              xfun_0.37                  
#> [23] GenomeInfoDbData_1.2.9      knitr_1.42                 
#> [25] IRanges_2.32.0              S4Vectors_0.36.2           
#> [27] fs_1.6.1                    vctrs_0.5.2                
#> [29] globals_0.16.2              stats4_4.2.2               
#> [31] triebeard_0.3.0             bit64_4.0.5                
#> [33] grid_4.2.2                  Biobase_2.58.0             
#> [35] glue_1.6.2                  data.table_1.14.8          
#> [37] listenv_0.9.0               R6_2.5.1                   
#> [39] future.apply_1.10.0         parallelly_1.34.0          
#> [41] tiledb_0.18.0.1             GenomicRanges_1.50.2       
#> [43] matrixStats_0.63.0          urltools_1.7.3             
#> [45] codetools_0.2-18            BiocGenerics_0.44.0        
#> [47] SummarizedExperiment_1.28.0 future_1.32.0              
#> [49] RCurl_1.98-1.10             zoo_1.8-11