TileDB-CF Xarray Engine
Reading from TileDB with Xarray
Xarray uses a plugin infrastructure that allows third-party developers to create their own backend engines for reading data into xarray. TileDB-CF contains one such backend. To use the backend, make sure tiledb-cf
is installed in your current Python environment, and use the tiledb
engine:
import xarray as xr
="tiledb") xr.open_dataset(tiledb_uri, engine
The TileDB engine can be used to open either a TileDB array or a TileDB group. See the requirements on the arrays below.
The backend engine will open the group or array as a dataset with TileDB dimensions mapping to dataset dimensions, TileDB attributes mapping to dataset variables/DataArrays, and TileDB metadata mapping to dataset attributes.
For a TileDB array to be readable by xarray, the following must be satisfied:
- The array must be dense.
- All dimensions on the array must be either signed or unsigned integers.
- Add dimensions must have a domain that starts at
0
.
For a TileDB group to be readable by xarray, the following must be satisfied:
- All arrays in the group satisfy the above requirements for the array to be readable.
- Each attribute has a unique “variable name”.
The TileDB backend engine can be used with the standard xarray keyword arguments. It supports the additional TileDB-specific arguments:
config
: An optional TileDB configuration object to use in arrays and groups.ctx
: An optional TileDB context object to use for all TileDB operations.timestamp
: An optional timestamp to open the TileDB array at (not supported on groups).
Writing from Xarray to TileDB
The xarray writer is stricter than the xarray backend engine (reader). While the reader will attempt to open arrays with multiple attributes, the xarray writer only creates arrays with one attribute per name.
There are two sets of functions for writing to xarray:
Single dataset ingestion.
- Functions used:
from_xarray
- Useful when copying an entire xarray dataset to a TileDB group in a single function call.
- Creates the group and copies all data and metadata to the new group in a single function call.
- Functions used:
Multi-dataset ingestion.
- Main functions:
create_group_from_xarray
andcopy_data_from_xarray
. - Additional helper function:
copy_metadata_from_xarray
. - Useful when copying multiple xarray datasets to a single TileDB group.
- Creates the group and copies data to the group in separate API calls.
- Main functions:
The xarray to TileDB writer will copy the dataset in the following way:
- One group is created for the dataset.
- Dataset “attributes” are copied to group level metadata.
- Each xarray variable is copied to its own dense TileDB array with a single TileDB attribute.
The array schema for an xarray variable is generated as follows:
TileDB array properties:
- The TileDB array is dense.
TileDB Domain:
All dimensions have the same datatype determined by the
dim_dtype
encoding.The dimension names in the TileDB array match the dimension names in the xarray variable.
The dimension tiles are determined by the
tiles
encoding.The domain of each dimension is set to
[0, max_size - 1]
wheremax_size
is computed as follows:Use the corresponding element of the
max_shape
encoding if provided.If the
max_shape
encoding is not provided and the xarray dimension is “unlimited”, use the largest possible size for this integer type.If the
max_shape
encoding is not provided and the xarray dimension is not “unlimited”, use the size of the xarray dimension.
TileDB Attribute:
The attribute datatype is the same as the variable datatype (after applying xarray encodings).
The attribute name is set using the following:
Use the name provided by
attr_name
encoding.If the
attr_name
encoding is not provided and there is no dimension on this variable with the same name as the variable, use the name of the variable.If the
attr_name
encoding is provided and there is a dimension on this variable with the same name as the variable, use the variable name appended with_
.
The attribute filters are determined by the
filters
encoding.
TileDB Encoding
The writer takes a dictionary from dataset variable names to a dictionary of encodings for setting TileDB properties. The possible encoding keywords are provided in the table below.
Encoding Keyword | Details | Type |
---|---|---|
attr_name |
Name to use for the TileDB attribute. | str |
filters |
Filter list to apply to the TileDB attribute. | tiledb.FilterList |
tiles |
Tile sizes to apply to the TileDB dimensions. | tuple of ints |
max_shape |
Maximum possible size of the TileDB array. | tuple of ints |
dim_dtype |
Datatype to use for the TileDB dimensions. | str or numpy.dtype |
Region to Write
If the creating TileDB array’s with either unlimited dimensions or with encoded max_shape
larger than the current size of the xarray variable, then the region to write the data to needs to be provided. This is input as a dictionary from dimension names to slices. The slice uses xarray/numpy conventions and will write to a region that does not include the upper bound of the slice.
Creating Multiple Fragments
When copying data with either the from_xarray
or copy_data_from_xarray
functions, the copy routine will use Xarray chunks for separate writes - creating multiple fragments.