Setup

library(tiledbsc)
#> The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
#> which was just loaded, will retire in October 2023.
#> Please refer to R-spatial evolution reports for details, especially
#> https://r-spatial.org/r/2023/05/15/evolution4.html.
#> It may be desirable to make the sf package available;
#> package maintainers should consider adding sf to Suggests:.
#> The sp package is now running under evolution status 2
#>      (status 2 uses the sf package in place of rgdal)
library(SeuratObject)
#> Loading required package: sp

Overview

While tiledbsc facilitates the storage and retrieval of complete Seurat/Bioconductor datasets, it’s also possible to selectively update specific components of an existing SOMA.

Create the initial dataset

As in the Introduction we’ll use the pbmc_small dataset from the SeuratObject package.

data("pbmc_small", package = "SeuratObject")
pbmc_small
#> An object of class Seurat 
#> 230 features across 80 samples within 1 assay 
#> Active assay: RNA (230 features, 20 variable features)
#>  2 dimensional reductions calculated: pca, tsne

We’ll start by creating a Seurat Assay object that contains only raw RNA counts.

pbmc_small_rna <- CreateAssayObject(
  counts = GetAssayData(pbmc_small[["RNA"]], "counts")
)

Key(pbmc_small_rna) <- Key(pbmc_small[["RNA"]])

and then ingest this into a new SOMA on disk:

This performed the following operations:

  • created groups X, obsm, varm, obsp, varp, and misc
  • created array, obs, with a single dimension containing all cell names and 0 attributes (because Seurat Assay objects do not contain cell-level metadata)
  • created array, var, with a single dimension containing all feature names and 0 attributes (because the feature-level metadata was empty)
  • created array counts within the X group and ingested all data from pbmc_small_rna’s counts slot

Printing out the X group, we can see it contains only a single member, counts.

soma$X
#> <AssayMatrixGroup>
#>   uri: /tmp/RtmpO41gFl/pbmc_small_rna/X 
#>   arrays: counts

Note: A Seurat Assay instantiated with data passed to the counts argument will store a reference to the same matrix in both the counts and data slots (i.e., GetAssayData(pbmc_small_rna, "counts") and GetAssayData(pbmc_small_rna, "data")). To avoid redundantly storing data on disk, tiledbsc* only ingests data from data when it is not identical to counts.*

Add new layers for normalized and scaled data

Now let’s populate pbmc_small_rna’s counts slot with normalized counts.

pbmc_small_rna <- SetAssayData(
  pbmc_small_rna,
  slot = "data",
  new.data = GetAssayData(pbmc_small[["RNA"]], "data")
)

pbmc_small_rna
#> Assay data with 230 features for 80 cells
#> First 10 features:
#>  MS4A1, CD79B, CD79A, HLA-DRA, TCL1A, HLA-DQB1, HVCN1, HLA-DMB, LTB,
#> LINC00926

At this point, counts and data contain different transformations of the same matrix but only the counts data has been written to disk.

We could simply rerun soma$from_seurat_assay(pbmc_small_rna) to update the SOMA and ingest the normalized counts matrix into X/data, however, this would perform another full conversion of the entire Assay object. As a result, data from the counts slot would be re-ingested into X/counts and feature-level metadata would be re-ingested into var.

To avoid these redudndant ingestions, use the layers argument to specify which of the assay slots (counts, data, or scale.data) should be ingested. We also set var = FALSE to prevent re-ingestion of the feature-level metadata since it hasn’t changed.

soma$from_seurat_assay(pbmc_small_rna, layers = "data", var = FALSE)
#> No AssayMatrix found at '/tmp/RtmpO41gFl/pbmc_small_rna/X/data'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/tmp/RtmpO41gFl/pbmc_small_rna/X/data'
#> Adding 3 metadata keys to array
#> Ingesting AssayMatrix data into: /tmp/RtmpO41gFl/pbmc_small_rna/X/data
#> Finished converting Seurat Assay with key [rna_] to SOMA

Printing out the X group again reveals that it now contains 2 members: counts and data.

soma$X
#> <AssayMatrixGroup>
#>   uri: /tmp/RtmpO41gFl/pbmc_small_rna/X 
#>   arrays: counts, data

Let’s repeat this process for the scaled/centered version of the data that’s stored in the scale.data slot.

pbmc_small_rna <- SetAssayData(
  pbmc_small_rna,
  slot = "scale.data",
  new.data = GetAssayData(pbmc_small[["RNA"]], "scale.data")
)

soma$from_seurat_assay(pbmc_small_rna, layers = "scale.data", var = FALSE)
#> Skipping ingestion of 'data' because it is identical to 'counts'
#> No AssayMatrix found at '/tmp/RtmpO41gFl/pbmc_small_rna/X/scale.data'
#> Creating new AssayMatrix array with index [var_id,obs_id] at '/tmp/RtmpO41gFl/pbmc_small_rna/X/scale.data'
#> Adding 3 metadata keys to array
#> Ingesting AssayMatrix data into: /tmp/RtmpO41gFl/pbmc_small_rna/X/scale.data
#> Finished converting Seurat Assay with key [rna_] to SOMA
soma$X
#> <AssayMatrixGroup>
#>   uri: /tmp/RtmpO41gFl/pbmc_small_rna/X 
#>   arrays: counts, data, scale.data

The X group now contains 3 arrays, each of which has only been written to a single time. We can verify this by counting the number of fragments contained within any of individual arrays:

soma$X$members$count$fragment_count()
#> [1] 1

TileDB creates a new fragment each time a write is performed, so 1 fragment confirms the array has been written to only once.

Update an existing layer

If the data in your Seurat Assay has changed since it was ingested into TileDB, you can follow the same basic process to update the array.

Here, we’ll aritifically shift all of the values in the scale.data slot by 1 and then re-ingest the updated data into the scale.data layer on disk:

pbmc_small_rna <- SetAssayData(
  pbmc_small_rna,
  slot = "scale.data",
  new.data = GetAssayData(pbmc_small[["RNA"]], "scale.data") + 1
)

soma$from_seurat_assay(pbmc_small_rna, layers = "scale.data", var = FALSE)
#> Skipping ingestion of 'data' because it is identical to 'counts'
#> Checking legacy validity mode for array: '/tmp/RtmpO41gFl/pbmc_small_rna/X/scale.data'
#> Found existing AssayMatrix at '/tmp/RtmpO41gFl/pbmc_small_rna/X/scale.data'
#> Updating existing AssayMatrix at '/tmp/RtmpO41gFl/pbmc_small_rna/X/scale.data'
#> Ingesting AssayMatrix data into: /tmp/RtmpO41gFl/pbmc_small_rna/X/scale.data
#> Finished converting Seurat Assay with key [rna_] to SOMA

Now let’s verify that number of fragments in each X’s layers:

sapply(
  names(soma$X$members),
  function(x) soma$X$members[[x]]$fragment_count()
)
#>     counts       data scale.data 
#>          1          1          2

Session

Session Info
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] RcppSpdlog_0.0.14   SeuratObject_4.1.3  sp_2.0-0           
#> [4] tiledbsc_0.1.5.9002
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.7          future_1.33.0       spdl_0.0.5         
#>  [4] stringi_1.7.12      lattice_0.21-8      listenv_0.9.0      
#>  [7] digest_0.6.33       magrittr_2.0.3      tiledb_0.20.3      
#> [10] evaluate_0.21       grid_4.3.1          fastmap_1.1.1      
#> [13] rprojroot_2.0.3     jsonlite_1.8.7      Matrix_1.5-4.1     
#> [16] urltools_1.7.3      purrr_1.0.1         codetools_0.2-19   
#> [19] textshaping_0.3.6   jquerylib_0.1.4     cli_3.6.1          
#> [22] rlang_1.1.1         triebeard_0.4.1     parallelly_1.36.0  
#> [25] future.apply_1.11.0 bit64_4.0.5         cachem_1.0.8       
#> [28] yaml_2.3.7          tools_4.3.1         parallel_4.3.1     
#> [31] nanotime_0.3.7      memoise_2.0.1       globals_0.16.2     
#> [34] vctrs_0.6.3         R6_2.5.1            zoo_1.8-12         
#> [37] lifecycle_1.0.3     stringr_1.5.0       fs_1.6.3           
#> [40] bit_4.0.5           ragg_1.2.5          desc_1.4.2         
#> [43] pkgdown_2.0.7       progressr_0.13.0    bslib_0.5.0        
#> [46] glue_1.6.2          Rcpp_1.0.11         systemfonts_1.0.4  
#> [49] xfun_0.40           knitr_1.43          htmltools_0.5.5    
#> [52] rmarkdown_2.23      compiler_4.3.1      RcppCCTZ_0.2.12