TileDB and (R)MariaDB Examples
Source:vignettes/tiledb-mariadb-examples.Rmd
tiledb-mariadb-examples.Rmd
Introduction
TileDB provides the Universal Data Engine that can be accessed in a variety of ways. The C/C++ library offered by TileDB Embedded is one approach, and the R package, as well as the Python package and other language bindings use it. Another interface is provided by the MariaDB Integration via the MyTile storage plugin.
This provides TileDB integration with any frontend that interfaces with MariaDB—such as the RMariaDB package for R. So this vignette illustrates the use of TileDB via MariaDB using R and the RMariaDB packages.
Installation or Using Docker
In order to use the MyTile storage plugin, one has to compile both the storage plugin itself and the MariaDB server with consistent compiler flags. As the build also requires the TileDB Embedded library and headers, using a Docker container may be easiest. The Dockerfile also provides a concrete example of the build setup.
So here we will use the tiledb/tiledb-mariadb-r
container.
Launch Container in Background
We launch the container ‘tiledb/tiledb-mariadb-r’ as a daemon, allowing MariaDB to accept an empty password (using an older MySQL variable setting), and name the running image ‘tiledb-mariadb-r’:
docker run --name tiledb-mariadb-r -it -d --rm \
-e MYSQL_ALLOW_EMPTY_PASSWORD=1 tiledb/tiledb-mariadb-r
Examples
Palmer Penguins
With the sessions started as in the previous section, we start an R session.
> library(tiledb)
> library(palmerpenguins)
> praw <- penguins_raw
> fromDataFrame(praw, "/tmp/penguinsraw")
Access Container Again to Read via RMariaDB
In another shell, we can access the container once more and launch another R process:
docker exec -it -u root tiledb-mariadb-r R
where we use RMariaDB to access the data via a tibble
object and magrittr
pipe
> library(RMariaDB)
> library(dplyr, warn.conflicts=FALSE)
> con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname="test")
> tbl(con, "/tmp/penguinsraw") %>%
+ dplyr::select(contains("Length"))
# Source: lazy query [?? x 2]
# Database: mysql [@localhost:NA/test]
`Culmen Length (mm)` `Flipper Length (mm)`
<dbl> <dbl>
1 39.1 181
2 39.5 186
3 40.3 195
4 NA NA
5 36.7 193
6 39.3 190
7 38.9 181
8 39.2 195
9 34.1 193
10 42 190
# ... with more rows
Note that this R session uses TileDB only via the MyTile plugin which is activated implicitly via the ‘table location’ (here /tmp/penguinsraw
) of the test database enabling plugins.
Also note that the query is still ‘lazy’: only column names and the first ten observations have been retrieved, and the total size is still unknown as indicated by lazy query [?? x 2]
. Adding a collect()
verb materialized the full subset.
> tbl(con, "/tmp/penguinsraw") %>%
+ dplyr::select(contains("Length")) %>%
+ collect()
# A tibble: 344 x 2
`Culmen Length (mm)` `Flipper Length (mm)`
<dbl> <dbl>
1 39.1 181
2 39.5 186
3 40.3 195
4 NA NA
5 36.7 193
6 39.3 190
7 38.9 181
8 39.2 195
9 34.1 193
10 42 190
# ... with 334 more rows
>
We now see that all 344 rows of this particular result set have been accessed.
Similarly, we can also run summaries on selected column:
> tbl(con, "/tmp/penguinsraw") %>%
+ group_by(Species) %>%
+ summarise(across(starts_with("Flipper"),
+ list(~mean(.x, na.rm=TRUE), ~sd(.x, na.rm=TRUE))))
# Source: lazy query [?? x 3]
# Database: mysql [@localhost:NA/test]
`\`Flipper Length (mm)\`_~me~ `\`Flipper Length (mm)\`_~~
Species <chr> <dbl> <dbl>
1 Adelie Penguin (Pyg~ 190. 6.54
2 Chinstrap penguin (~ 196. 7.13
3 Gentoo penguin (Pyg~ 217. 6.48
>
DBI
We can connect directly using the DBI
package along with the RMariaDB
package and its MyTile bindings:
library(DBI)
con <- dbConnect(RMariaDB::MariaDB(), dbname="test")
res <- dbSendQuery(con, "select * from `/work/penguins` as `q91` limit 10")
df <- dbFetch(res, n = 10)
dbClearResult(res)
print(df)
dbDisconnect(con)
S3
When using Docker, the container setup needs to be modified to include the required AWS access environment variables:
docker run --name tiledb-mariadb-r \
-it -d --rm \
-v $PWD:/work -w /work \
-e MYSQL_ALLOW_EMPTY_PASSWORD=1 \
-e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
tiledb/tiledb-mariadb-r
This exports the current working directory (accessed as $PWD
in the shell) as /work
inside the container (via the -v
switch), and instructs the container to start in directory /work
(via the -w
switch).
In the following example we also set a TileDB configuration option to switch AWS regions.
> library(dplyr, warn.conflicts=FALSE)
> # connect as usual
> con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname="test")
> # use connection to update config in order to switch s3 regions
> res1 <- DBI::dbSendQuery(con, "set mytile_tiledb_config=\"vfs.s3.region=us-west-1\";")
> DBI::dbClearResult(res1)
> # run simple query
> tbl(con, "s3://tiledb-public-us-west-1/test-array-4x4") %>% collect()
# A tibble: 16 x 3
`__dim_0` `__dim_1` `__attr`
<int64> <int64> <dbl>
1 0 0 0.397
2 0 1 0.432
3 0 2 0.617
4 0 3 0.403
5 1 0 0.837
6 1 1 0.670
7 1 2 0.235
8 1 3 0.841
9 2 0 0.222
10 2 1 0.468
11 2 2 0.970
12 2 3 0.551
13 3 0 0.628
14 3 1 0.0149
15 3 2 0.0540
16 3 3 0.445
>
> DBI::dbDisconnect(con) # close connection
NYC Taxis
We can write the New York taxi data into TileDB (in the current directory) after reading it as a csv:
> nyc <- data.table::fread("trip_data_1.csv")
> tiledb::fromDataFrame(nyc, "trip_data_1")
Then we can access it via the path /work/trip_data_1
via RMariaDB.
> library(dplyr, warn.conflicts=FALSE)
> con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname="test")
> # list column names
> tbl(con, "/work/trip_data_1") %>% colnames()
> tbl(con, "/work/trip_data_1") %>% colnames()
1] "__tiledb_rows" "dropoff_latitude" "medallion"
[4] "hack_license" "dropoff_longitude" "vendor_id"
[7] "rate_code" "store_and_fwd_flag" "pickup_datetime"
[10] "dropoff_datetime" "trip_time_in_secs" "pickup_latitude"
[13] "passenger_count" "trip_distance" "pickup_longitude"
[> # extract one column (here passenger_count) completely and tabulate in R
> tbl(con, "/work/trip_data_1") %>% select(passenger_count) %>% collect() %>% table()
.0 1 2 3 4 5 6 9
166 10471701 1986196 597485 280992 920006 520066 1
208 255
1 1
> # alternatively, run a lazy query
> tbl(con, "/work/trip_data_1") %>% group_by(passenger_count) %>% summarize(nobs = n())
# Source: lazy query [?? x 2]
# Database: mysql [@localhost:NA/test]
passenger_count nobs<int> <int64>
1 0 166
2 1 10471701
3 2 1986196
4 3 597485
5 4 280992
6 5 920006
7 6 520066
8 9 1
9 208 1
10 255 1
> # here the request is actually sent as SQL and the computation is done in the SQL layer
> tbl(con, "/work/trip_data_1") %>%
+ group_by(passenger_count) %>%
+ summarize(nobs = n()) %>%
+ show_query()
<SQL>
`passenger_count`, COUNT(*) AS `nobs`
SELECT `/work/trip_data_1`
FROM `passenger_count`
GROUP BY > # naturally we can also send the query directly to the SQL backend
> sql <- "SELECT passenger_count, count(*) AS n FROM `/work/trip_data_1` GROUP BY passenger_count;"
> DBI::dbGetQuery(con, sql)
passenger_count n1 0 166
2 1 10471701
3 2 1986196
4 3 597485
5 4 280992
6 5 920006
7 6 520066
8 9 1
9 208 1
10 255 1
>
> DBI::dbDisconnect(con) # close connection