TileDB and (R)MariaDB Examples • tiledb

Introduction

TileDB provides the Universal Data Engine that can be accessed in a variety of ways. The C/C++ library offered by TileDB Embedded is one approach, and the R package, as well as the Python package and other language bindings use it. Another interface is provided by the MariaDB Integration via the MyTile storage plugin.

This provides TileDB integration with any frontend that interfaces with MariaDB—such as the RMariaDB package for R. So this vignette illustrates the use of TileDB via MariaDB using R and the RMariaDB packages.

Installation or Using Docker

In order to use the MyTile storage plugin, one has to compile both the storage plugin itself and the MariaDB server with consistent compiler flags. As the build also requires the TileDB Embedded library and headers, using a Docker container may be easiest. The Dockerfile also provides a concrete example of the build setup.

So here we will use the tiledb/tiledb-mariadb-r container.

Launch Container in Background

We launch the container ‘tiledb/tiledb-mariadb-r’ as a daemon, allowing MariaDB to accept an empty password (using an older MySQL variable setting), and name the running image ‘tiledb-mariadb-r’:

docker run --name tiledb-mariadb-r -it -d --rm \
       -e MYSQL_ALLOW_EMPTY_PASSWORD=1 tiledb/tiledb-mariadb-r

Access Container Once to Write via TileDB

Using the name given to the running instance, we start an R session to write data with TileDB:

docker exec -it -u root tiledb-mariadb-r R

Examples

Palmer Penguins

With the sessions started as in the previous section, we start an R session.

> library(tiledb)
> library(palmerpenguins)
> praw <- penguins_raw
> fromDataFrame(praw, "/tmp/penguinsraw")

Access Container Again to Read via RMariaDB

In another shell, we can access the container once more and launch another R process:

docker exec -it -u root tiledb-mariadb-r R

where we use RMariaDB to access the data via a tibble object and magrittr pipe

> library(RMariaDB)
> library(dplyr, warn.conflicts=FALSE)
> con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname="test")
> tbl(con, "/tmp/penguinsraw") %>%
+        dplyr::select(contains("Length"))
# Source:   lazy query [?? x 2]
# Database: mysql [@localhost:NA/test]
   `Culmen Length (mm)` `Flipper Length (mm)`
                  <dbl>                 <dbl>
 1                 39.1                   181
 2                 39.5                   186
 3                 40.3                   195
 4                 NA                      NA
 5                 36.7                   193
 6                 39.3                   190
 7                 38.9                   181
 8                 39.2                   195
 9                 34.1                   193
10                 42                     190
# ... with more rows

Note that this R session uses TileDB only via the MyTile plugin which is activated implicitly via the ‘table location’ (here /tmp/penguinsraw) of the test database enabling plugins.

Also note that the query is still ‘lazy’: only column names and the first ten observations have been retrieved, and the total size is still unknown as indicated by lazy query [?? x 2]. Adding a collect() verb materialized the full subset.

> tbl(con, "/tmp/penguinsraw")  %>%
+     dplyr::select(contains("Length")) %>%
+     collect()
# A tibble: 344 x 2
   `Culmen Length (mm)` `Flipper Length (mm)`
                  <dbl>                 <dbl>
 1                 39.1                   181
 2                 39.5                   186
 3                 40.3                   195
 4                 NA                      NA
 5                 36.7                   193
 6                 39.3                   190
 7                 38.9                   181
 8                 39.2                   195
 9                 34.1                   193
10                 42                     190
# ... with 334 more rows
>

We now see that all 344 rows of this particular result set have been accessed.

Similarly, we can also run summaries on selected column:

> tbl(con, "/tmp/penguinsraw") %>%
+     group_by(Species) %>%
+     summarise(across(starts_with("Flipper"),
+                      list(~mean(.x, na.rm=TRUE), ~sd(.x, na.rm=TRUE))))
# Source:   lazy query [?? x 3]
# Database: mysql [@localhost:NA/test]
  Species              `\`Flipper Length (mm)\`_~me~ `\`Flipper Length (mm)\`_~~
  <chr>                                        <dbl>                       <dbl>
1 Adelie Penguin (Pyg~                          190.                        6.54
2 Chinstrap penguin (~                          196.                        7.13
3 Gentoo penguin (Pyg~                          217.                        6.48
>

DBI

We can connect directly using the DBI package along with the RMariaDB package and its MyTile bindings:

library(DBI)
con <- dbConnect(RMariaDB::MariaDB(), dbname="test")
res <- dbSendQuery(con, "select * from `/work/penguins` as `q91` limit 10")
df <- dbFetch(res, n = 10)
dbClearResult(res)
print(df)
dbDisconnect(con)

S3

When using Docker, the container setup needs to be modified to include the required AWS access environment variables:

docker run --name tiledb-mariadb-r \
       -it -d --rm \
       -v $PWD:/work -w /work \
       -e MYSQL_ALLOW_EMPTY_PASSWORD=1 \
       -e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
       -e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
       tiledb/tiledb-mariadb-r

This exports the current working directory (accessed as $PWD in the shell) as /work inside the container (via the -v switch), and instructs the container to start in directory /work (via the -w switch).

In the following example we also set a TileDB configuration option to switch AWS regions.

> library(dplyr, warn.conflicts=FALSE)
> # connect as usual
> con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname="test")
> # use connection to update config in order to switch s3 regions
> res1 <- DBI::dbSendQuery(con, "set mytile_tiledb_config=\"vfs.s3.region=us-west-1\";")
> DBI::dbClearResult(res1)
> # run simple query
> tbl(con, "s3://tiledb-public-us-west-1/test-array-4x4") %>% collect()
# A tibble: 16 x 3
   `__dim_0` `__dim_1` `__attr`
     <int64>   <int64>    <dbl>
 1         0         0   0.397
 2         0         1   0.432
 3         0         2   0.617
 4         0         3   0.403
 5         1         0   0.837
 6         1         1   0.670
 7         1         2   0.235
 8         1         3   0.841
 9         2         0   0.222
10         2         1   0.468
11         2         2   0.970
12         2         3   0.551
13         3         0   0.628
14         3         1   0.0149
15         3         2   0.0540
16         3         3   0.445
>
> DBI::dbDisconnect(con)  # close connection

NYC Taxis

We can write the New York taxi data into TileDB (in the current directory) after reading it as a csv:

> nyc <- data.table::fread("trip_data_1.csv")
> tiledb::fromDataFrame(nyc, "trip_data_1")

Then we can access it via the path /work/trip_data_1 via RMariaDB.

> library(dplyr, warn.conflicts=FALSE)
> con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname="test")
> # list column names
> tbl(con, "/work/trip_data_1") %>% colnames()
> tbl(con, "/work/trip_data_1") %>% colnames()
 [1] "__tiledb_rows"      "dropoff_latitude"   "medallion"
 [4] "hack_license"       "dropoff_longitude"  "vendor_id"
 [7] "rate_code"          "store_and_fwd_flag" "pickup_datetime"
[10] "dropoff_datetime"   "trip_time_in_secs"  "pickup_latitude"
[13] "passenger_count"    "trip_distance"      "pickup_longitude"
> # extract one column (here passenger_count) completely and tabulate in R
> tbl(con, "/work/trip_data_1") %>% select(passenger_count) %>% collect() %>% table()
.
       0        1        2        3        4        5        6        9
     166 10471701  1986196   597485   280992   920006   520066        1
     208      255
       1        1
> # alternatively, run a lazy query
> tbl(con, "/work/trip_data_1") %>% group_by(passenger_count) %>% summarize(nobs = n())
# Source:   lazy query [?? x 2]
# Database: mysql [@localhost:NA/test]
   passenger_count     nobs
             <int>  <int64>
 1               0      166
 2               1 10471701
 3               2  1986196
 4               3   597485
 5               4   280992
 6               5   920006
 7               6   520066
 8               9        1
 9             208        1
10             255        1
> # here the request is actually sent as SQL and the computation is done in the SQL layer
> tbl(con, "/work/trip_data_1")  %>%
+      group_by(passenger_count) %>%
+      summarize(nobs = n())     %>%
+      show_query()
<SQL>
SELECT `passenger_count`, COUNT(*) AS `nobs`
FROM `/work/trip_data_1`
GROUP BY `passenger_count`
> # naturally we can also send the query directly to the SQL backend
> sql <- "SELECT passenger_count, count(*) AS n FROM `/work/trip_data_1` GROUP BY passenger_count;"
> DBI::dbGetQuery(con, sql)
   passenger_count        n
1                0      166
2                1 10471701
3                2  1986196
4                3   597485
5                4   280992
6                5   920006
7                6   520066
8                9        1
9              208        1
10             255        1
>
> DBI::dbDisconnect(con)  # close connection

Summary

This short vignette demonstrates the use of TileDB-stored data from R using RMariaDB via the MyTile storage extension to MariaBD.