This vignette provides an overview of the tiledbsc R package which uses the open source SOMA data model to store single-cell data in a collection of TileDB arrays.
With tiledbsc you can:
In this vignette you’ll learn how to:
SummarizedExperiment
) object
library(tiledbsc)
library(SeuratObject)
#> Loading required package: sp
For this tutorial we’ll use a dataset from 10X genomics containing 2,700 peripheral blood mononuclear cells (PBMC) from a healthy donor. The dataset is available on 10X Genomics’ website.
Use the provided helper function to download the filtered gene/cell
matrix from 10X and create a Seurat
object.
pbmc3k <- dataset_seurat_pbmc3k()
pbmc3k
#> An object of class Seurat
#> 32738 features across 2700 samples within 1 assay
#> Active assay: RNA (32738 features, 0 variable features)
Our first step is to create a new SOMACollection
at a
specific URI. The URI could be a local filepath, an S3 URI, or as
discussed below, a TileDB Cloud URI that automatically registers the
dataset on TileDB Cloud.
soco_uri <- file.path(tempdir(), "soco-pbmc3k")
soco <- SOMACollection$new(uri = soco_uri)
#> No SOMACollection currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k'
#> Creating new SOMACollection at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k'
#> No TileDBGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/uns'
#> Creating new TileDBGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/uns'
soco
#> <SOMACollection>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k
#> groups: uns
Now we use tiledbsc to automatically ingest the various components of
the pbmc3k
object into the new
SOMACollection
.
soco$from_seurat(pbmc3k)
#> No SOMA currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA'
#> Creating new SOMA at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA'
#> No AnnotationDataframe found at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#> No AnnotationDataframe found at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var'
#> No AssayMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X'
#> Creating new AssayMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X'
#> No AnnotationMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsm'
#> Creating new AnnotationMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsm'
#> No AnnotationMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varm'
#> Creating new AnnotationMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varm'
#> No AnnotationPairwiseMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsp'
#> Creating new AnnotationPairwiseMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsp'
#> No AnnotationPairwiseMatrixGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varp'
#> Creating new AnnotationPairwiseMatrixGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varp'
#> No TileDBGroup currently exists at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/uns'
#> Creating new TileDBGroup at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/uns'
#> Creating new AnnotationDataframe array with index [obs_id] at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#> Ingesting AnnotationDataframe data into: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs
#> Creating new AnnotationDataframe array with index [var_id] at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var'
#> Ingesting AnnotationDataframe data into: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var
#> Skipping ingestion of 'data' because it is identical to 'counts'
#> No AssayMatrix found at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
#> Ingesting AssayMatrix data into: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts
#> Finished converting Seurat Assay with key [rna_] to SOMA
#> Finished converting Seurat object to SOMACollection
Printing the soco
object shows it comprises 2 members:
RNA
, which contains the RNA assay data, and
uns
, which is a special group for miscellaneous (i.e.,
unstructured) data.
soco
#> <SOMACollection>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k
#> groups: RNA, uns
As this dataset is unimodal, we’ll extract the SOMA containing the RNA data.
soma <- soco$get_member("RNA")
soma
#> <SOMA>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA
#> arrays: obs, var
#> groups: obsm, obsp, uns, varm, varp, X
Access the [AnnotationDataframe
] representing the
obs
array containing cell-level metadata stored in
obs
.
soma$obs
#> <AnnotationDataframe>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs
#> dimensions: obs_id
#> attributes: orig.ident, nCount_RNA, nFeature_RNA
Read it into memory as a data.frame
.
obs <- soma$obs$to_dataframe()
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
summary(obs)
#> orig.ident nCount_RNA nFeature_RNA
#> Length:2700 Min. : 548 Min. : 212.0
#> Class :character 1st Qu.: 1758 1st Qu.: 690.0
#> Mode :character Median : 2197 Median : 817.0
#> Mean : 2367 Mean : 847.0
#> 3rd Qu.: 2763 3rd Qu.: 953.2
#> Max. :15844 Max. :3422.0
Access the [AssayMatrix
] representing the X
array containing the raw RNA counts.
soma$X$members$counts
#> <AssayMatrix>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts
#> dimensions: var_id, obs_id
#> attributes: value
Read it into memory as a
[dgTMatrix
][Matrix::TsparseMatrix-class
].
mat <- soma$X$members$counts$to_matrix()
#> Reading AssayMatrix into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
dim(mat)
#> [1] 16634 2700
SOMACollection
and SOMA
objects can be
sliced by a combination of obs and var identifiers.
This slice is automatically applied to the entire SOMA, so any component read into memory will now only contain the filtered results.
soma$obs$to_dataframe()
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#> orig.ident nCount_RNA nFeature_RNA
#> AAACATACAACCAC-1 SeuratProject 2421 781
#> AAACATTGAGCTAC-1 SeuratProject 4903 1352
#> AAACATTGATCAGC-1 SeuratProject 3149 1131
#> AAACCGTGCTTCCG-1 SeuratProject 2639 960
#> AAACCGTGTATGCG-1 SeuratProject 981 522
#> AAACGCACTGGTAC-1 SeuratProject 2164 782
#> AAACGCTGACCAGT-1 SeuratProject 2176 783
#> AAACGCTGGTTCTT-1 SeuratProject 2260 790
#> AAACGCTGTAGCCA-1 SeuratProject 1276 533
#> AAACGCTGTTTCTG-1 SeuratProject 1103 550
To remove the filter simply reset the query.
soma$reset_query()
summary(soma$obs$to_dataframe())
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#> orig.ident nCount_RNA nFeature_RNA
#> Length:2700 Min. : 548 Min. : 212.0
#> Class :character 1st Qu.: 1758 1st Qu.: 690.0
#> Mode :character Median : 2197 Median : 817.0
#> Mean : 2367 Mean : 847.0
#> 3rd Qu.: 2763 3rd Qu.: 953.2
#> Max. :15844 Max. :3422.0
In addition to dimension slicing we can also apply filters using the
cell- and feature-level metadata stored in the obs
and
var
arrays.
soma$set_query(
obs_attr_filter = seurat_annotations == "B" && nCount_RNA > 1000
)
#> Querying obs with attribute filter
#> Error: No attibute 'seurat_annotations' present.
We can load the SOMA slice into memory as a Seurat
Assay
…
soma$to_seurat_assay()
#> Reading AssayMatrix into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var'
#> Assay data with 16634 features for 2700 cells
#> First 10 features:
#> ENSG00000000003, ENSG00000000419, ENSG00000000457, ENSG00000000460,
#> ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084,
#> ENSG00000001167, ENSG00000001460
or as a Bioconductor [SummarizedExperiment
].
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
soma$to_summarized_experiment()
}
#> Reading AssayMatrix into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/counts'
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obs'
#> Reading AnnotationDataframe into memory from '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/var'
#> class: SummarizedExperiment
#> dim: 16634 2700
#> metadata(0):
#> assays(1): counts
#> rownames(16634): ENSG00000000003 ENSG00000000419 ... ENSG00000273486
#> ENSG00000273489
#> rowData names(1): gene_name
#> colnames(2700): ATAGATACCATGGT-1 AAACATTGATCAGC-1 ... ACTTGTACCTGTCC-1
#> GCAACCCTCCTCGT-1
#> colData names(3): orig.ident nCount_RNA nFeature_RNA
See vignette("filtering", package = "tiledbsc")
for more
details.
The original pbmc3k
object contains only the raw counts
data. Here we’ll use Seurat
to generate new results and add
them to the soma
.
First we normalize the raw counts, which creates a new matrix stored
in the Assay
object.
pbmc3k <- Seurat::NormalizeData(pbmc3k, normalization.method = "LogNormalize")
Currently the RNA SOMA’s X
group contains only a single
layer, counts
.
soma$X
#> <AssayMatrixGroup>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X
#> arrays: counts
Add a new X
layer containing log-normalized counts
stored in the Seurat Assay
’s data
slot.
soma$from_seurat_assay(
object = pbmc3k[["RNA"]],
layers = "data",
var = FALSE
)
#> No AssayMatrix found at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/data'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/data'
#> Ingesting AssayMatrix data into: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X/data
#> Finished converting Seurat Assay with key [rna_] to SOMA
Now we can see the SOMA contains 2 X
layers:
counts
and data
.
soma$X
#> <AssayMatrixGroup>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X
#> arrays: counts, data
First, we identify highly variable subset of genes to scale.
var_genes <- VariableFeatures(
Seurat::FindVariableFeatures(pbmc3k, nfeatures = 2000)
)
Then generate the scaled data.
pbmc3k <- Seurat::ScaleData(pbmc3k, features = var_genes)
And ingest the scaled data into a new X
layer called
"scale.data"
.
soma$from_seurat_assay(
object = pbmc3k[["RNA"]],
layers = "scale.data",
var = FALSE
)
Again, verify the new layer was added.
soma$X
#> <AssayMatrixGroup>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/X
#> arrays: counts, data
We can use a similar workflow for dimensional-reduction results,
which are stored in either the obsm
or varm
slots of a SOMA.
dimreduc_pca <- Reductions(
Seurat::RunPCA(pbmc3k, features = var_genes),
slot = "pca"
)
Add the Seurat DimReduc
object to the SOMA.
soma$add_seurat_dimreduction(
dimreduc_pca,
technique = "pca"
)
The SOMA now contains 2 new arrays, one within the obsm
group storing the cell embeddings
soma$obsm
#> <AnnotationMatrixGroup>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/obsm
and a second within the varm
group storing the feature
loadings.
soma$varm
#> <AnnotationMatrixGroup>
#> uri: /var/folders/mb/hbl3p_1d4kv4_plbcqb6wd8w0000gn/T//RtmpOA4uOo/soco-pbmc3k/soma_RNA/varm
Either of these can be accessed and queried directly, which is often useful for visualization applications.
obsm_pca <- soma$obsm$get_member("dimreduction_pca")
obsm_pca
Here we’ll load the cell embeddings as a matrix
and
visualize the first 5 PCs.
cell_pcs <- obsm_pca$to_matrix()
pairs(cell_pcs[, 1:5])
You can ingest a Seurat object into an S3 SOMACollection
that’s automatically registered with TileDB Cloud using a special URI
composed of:
and running the same initialization/ingestion steps we performed earlier:
soco <- SOMACollection$new("tiledb://<namespace>/s3://<bucket>/pbmc3k-soco")
soco$from_seurat(pbmc3k)
Note this assumes you:
You can then view the new SOMACollection directly on TileDB Cloud
where you can inspect the various dataset components, share securely,
view activity logs, and more. For example, here’s the TileDB Cloud view
of the pbmc3k
dataset we just ingested: aaronwolen/pbmc3k-soco
.
You can also securely access the dataset directly via R (or Python)
using its TileDB Cloud URI,
tiledb://aaronwolen/pbmc3k-soco
.
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] RcppSpdlog_0.0.12 SeuratObject_4.1.3 sp_1.6-0 tiledbsc_0.1.5
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.10 XVector_0.38.0
#> [3] GenomeInfoDb_1.34.9 compiler_4.2.2
#> [5] progressr_0.13.0 zlibbioc_1.44.0
#> [7] bitops_1.0-7 MatrixGenerics_1.10.0
#> [9] tools_4.2.2 digest_0.6.31
#> [11] bit_4.0.5 evaluate_0.20
#> [13] lattice_0.20-45 nanotime_0.3.7
#> [15] rlang_1.0.6 Matrix_1.5-3
#> [17] DelayedArray_0.24.0 cli_3.6.0
#> [19] RcppCCTZ_0.2.12 spdl_0.0.4
#> [21] parallel_4.2.2 xfun_0.37
#> [23] GenomeInfoDbData_1.2.9 knitr_1.42
#> [25] IRanges_2.32.0 S4Vectors_0.36.2
#> [27] fs_1.6.1 vctrs_0.5.2
#> [29] globals_0.16.2 stats4_4.2.2
#> [31] triebeard_0.3.0 bit64_4.0.5
#> [33] grid_4.2.2 Biobase_2.58.0
#> [35] glue_1.6.2 data.table_1.14.8
#> [37] listenv_0.9.0 R6_2.5.1
#> [39] future.apply_1.10.0 parallelly_1.34.0
#> [41] tiledb_0.18.0.1 GenomicRanges_1.50.2
#> [43] matrixStats_0.63.0 urltools_1.7.3
#> [45] codetools_0.2-18 BiocGenerics_0.44.0
#> [47] SummarizedExperiment_1.28.0 future_1.32.0
#> [49] RCurl_1.98-1.10 zoo_1.8-11