Converting a simple NetCDF file to a TileDB array

About this Example

What it Shows

The purpose of this example is to show the basics of converting a NetCDF file to TileDB arrays.

This includes:

  1. Options for auto-generating a converter from a NetCDF file.
  2. Changing the TileDB schema settings before conversion.
  3. Creating the TileDB group and copying data from the NetCDF file to the TileDB arrays.

Example dataset

This example shows converting a small NetCDF file with 2 dimensions and 4 variables:

  • Dimensions:
    • x: size=100
    • y: size=100
  • Variables:
    • x(x)
      • description: evenly spaced values from -5 to 5
      • data type: 64-bit floating point
    • y(y)
      • description: evenly spaced values from -5 to 5
      • data type: 64-bit floating point
    • A1(x, y)
      • description: x + y
      • data type: 64-bit floating point
    • A2(x, y)
      • description: sin((x/2)^2 + y^2)
      • data type: 64-bit floating point

Set-up Requirements

This example requires the following Python packages to be installed: netCDF4, numpy, tiledb, tiledb-cf, and matplotlib.

import netCDF4
import numpy as np
import tiledb
import tiledb.cf
import matplotlib.pyplot as plt
# Set names for the output generated by the example.
output_dir = "output/netcdf-to-tiledb-basics"
netcdf_file = f"{output_dir}/simple1.nc"
group_uri = f"{output_dir}/simple_netcdf_to_group_1"
array_uri = f"{output_dir}/simple_netcdf_to_array_1"
# Reset output folder
import os
import shutil

shutil.rmtree(output_dir, ignore_errors=True)
# makedirs also creates the parent "output" folder if it is missing.
os.makedirs(output_dir)

Create an example NetCDF file

The output folder was reset above, so we create a small NetCDF file for this example.

x_data = np.linspace(-5.0, 5.0, 100)
y_data = np.linspace(-5.0, 5.0, 100)
xv, yv = np.meshgrid(x_data, y_data, sparse=True)
with netCDF4.Dataset(netcdf_file, mode="w") as dataset:
    dataset.setncatts({"title": "Simple dataset for examples"})
    dataset.createDimension("x", 100)
    dataset.createDimension("y", 100)
    A1 = dataset.createVariable("A1", np.float64, ("x", "y"))
    A1.setncattr("full_name", "Example matrix A1")
    A1.setncattr("description", "x + y")
    A1[:, :] = xv + yv
    A2 = dataset.createVariable("A2", np.float64, ("x", "y"))
    A2[:, :] = np.sin((xv / 2.0) ** 2 + yv**2)
    A2.setncattr("full_name", "Example matrix A2")
    A2.setncattr("description", "sin((x/2)^2 + y^2)")
    x1 = dataset.createVariable("x", np.float64, ("x",))
    x1[:] = x_data
    y = dataset.createVariable("y", np.float64, ("y",))
    y[:] = y_data
print(f"Created example NetCDF file `{netcdf_file}`.")
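
The sparse mesh grid used above relies on NumPy broadcasting. The following sketch is not part of the original example and uses much smaller arrays (the names and sizes are chosen for illustration only). It shows the shapes involved and how the sum expands to a full matrix.

```python
import numpy as np

# Smaller stand-ins for x_data and y_data (sizes chosen for illustration only).
x_small = np.linspace(-5.0, 5.0, 3)
y_small = np.linspace(-5.0, 5.0, 4)

# With sparse=True (and the default "xy" indexing) meshgrid returns
# a row vector for x and a column vector for y instead of full grids.
xv_small, yv_small = np.meshgrid(x_small, y_small, sparse=True)
print(xv_small.shape)  # (1, 3)
print(yv_small.shape)  # (4, 1)

# Broadcasting expands the sum to a full (len(y), len(x)) matrix,
# where entry [i, j] equals y_small[i] + x_small[j].
total = xv_small + yv_small
print(total.shape)  # (4, 3)
```

In the example itself both axes have 100 points over the same range, so the broadcast result is a 100 x 100 matrix.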

Convert NetCDF file to a TileDB Group

In this section we convert a NetCDF file to TileDB in a way that:

  • maps NetCDF dimensions to TileDB dimensions,
  • maps NetCDF variables to TileDB attributes.

The functions NetCDF4ConverterEngine.from_file and NetCDF4ConverterEngine.from_group auto-generate a NetCDF4ConverterEngine for an existing NetCDF file. The properties of the NetCDF4ConverterEngine can be modified after the converter is generated.

Parameters:

  • Set the location of the NetCDF group to be converted.

    • In from_file:
      • input_file: The input NetCDF file to generate the converter engine from.
      • group_path: The path to the NetCDF group to copy data from. Use '/' for the root group.
    • In from_group:
      • input_netcdf_group: The NetCDF group to generate the converter engine from. (Must be a netCDF4.Dataset or netCDF4.Group.)
  • Set the array grouping. Each NetCDF variable maps to a TileDB attribute. The collect_attrs parameter determines whether each NetCDF variable is stored in a separate array, or whether all NetCDF variables with the same underlying dimensions are stored in the same TileDB array. Scalar variables are always grouped together.

    • collect_attrs: If True, store all attributes with the same dimensions in the same array. Otherwise, store each attribute in a separate array.
  • Set default properties for TileDB dimensions.

    • unlimited_dim_size: The default size of the domain for TileDB dimensions created from unlimited NetCDF dimensions. If None, the current size of the NetCDF dimension will be used.
    • dim_dtype: The default numpy dtype to use when converting a NetCDF dimension to a TileDB dimension.
  • Set tile sizes for TileDB dimensions. Dimensions in different arrays of the TileDB group may share the same name, domain, and type, but have different tiles and compression filters. The tiles_by_var and tiles_by_dims parameters set the tiles for the dimensions of individual arrays.

    • tiles_by_var: A map from the name of a single NetCDF variable to the tiles of the dimensions of the TileDB array that contains the data from that variable.
    • tiles_by_dims: A map from the name of NetCDF dimensions defining a variable to the tiles of the dimensions of any TileDB array that contains data from a variable defined on those dimensions.
  • Convert 1D variables with the same name and dimension to a TileDB dimension instead of a TileDB attribute. This is advanced usage and moves the data away from a NetCDF-like data model.

    • coords_to_dims: If True, convert the NetCDF coordinate variable into a TileDB dimension for sparse arrays. Otherwise, convert the coordinate dimension into a TileDB dimension and the coordinate variable into a TileDB attribute.
  • Unpack NetCDF data that uses add_offset and scale_factor.

    • unpack_vars: If True, for any variable that has the NetCDF attributes add_offset or scale_factor, apply the linear transformation x -> scale_factor * x + add_offset to the data before conversion. The data type will be set to the data type of the scaled data, and the attributes add_offset and scale_factor will be dropped.
  • Set filters for TileDB dimensions, attributes, and offsets.

    • offsets_filters: Default TileDB filters for all offsets for variable length TileDB attributes and TileDB dimensions.
    • attrs_filters: Default TileDB filters for all attributes.
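
As an illustration of the unpacking described for unpack_vars, the sketch below applies the NetCDF packing convention (multiply by scale_factor, then add add_offset) with plain NumPy. The packed values, scale_factor, and add_offset are made-up numbers, and the code does not call tiledb-cf.

```python
import numpy as np

# Hypothetical packed data as it might be stored in a NetCDF variable.
packed = np.array([0, 10, 20, 30], dtype=np.int16)
scale_factor = 0.5   # assumed value of the NetCDF attribute
add_offset = 100.0   # assumed value of the NetCDF attribute

# NetCDF convention: multiply by scale_factor, then add add_offset.
unpacked = packed * scale_factor + add_offset

print(unpacked.tolist())  # [100.0, 105.0, 110.0, 115.0]
print(unpacked.dtype)     # float64
```

The unpacked data has a floating-point type even though the packed data is stored as 16-bit integers, which is why the converted attribute takes the data type of the scaled data.
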
# Auto-generate NetCDF to TileDB conversion from a NetCDF file.
converter = tiledb.cf.NetCDF4ConverterEngine.from_file(
    netcdf_file,
    dim_dtype=np.uint32,
    attrs_filters=[tiledb.ZstdFilter(level=7)],
)
converter
# Update properties manually by modifying the array creators
# 1. Update properties for x
x_array = converter.get_array_creator_by_attr("x.data")
x_array.name = "x"
x_array.domain_creator.tiles = (20,)
# 2. Update properties for y
y_array = converter.get_array_creator_by_attr("y.data")
y_array.name = "y"
y_array.domain_creator.tiles = (20,)
# 3. Update properties for array of matrices
data_array = converter.get_array_creator_by_attr("A1")
data_array.name = "data"
data_array.domain_creator.tiles = (20, 20)
converter
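
To see why the converter above exposes three array creators, the sketch below groups the example's variables by their dimensions in plain Python, mirroring the grouping described for collect_attrs. It is a simplified model rather than tiledb-cf's implementation, and the dictionary of variables is written out by hand.

```python
from collections import defaultdict

# Variables of the example file and the dimensions they are defined on.
# Attribute names follow the ones used above ("x.data", "y.data", "A1", "A2").
variable_dims = {
    "x.data": ("x",),
    "y.data": ("y",),
    "A1": ("x", "y"),
    "A2": ("x", "y"),
}

# Group attributes that share the same dimensions into one array.
arrays = defaultdict(list)
for attr_name, dims in variable_dims.items():
    arrays[dims].append(attr_name)

for dims, attrs in arrays.items():
    print(dims, attrs)
```

A1 and A2 end up together because both are defined on (x, y), which matches retrieving their shared array creator through the attribute name "A1".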

Run the conversion to create three dense TileDB arrays (x, y, and data):

converter.convert_to_group(group_uri)
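
The tile extents set on the array creators control how each dense array is split into tiles. As a rough check, the sketch below counts the tiles for a 100 x 100 domain with 20 x 20 tiles. It is simple arithmetic under those assumed sizes, not a call into TileDB.

```python
import math

domain_shape = (100, 100)  # sizes of the x and y dimensions in the example
tile_extents = (20, 20)    # tiles set on the "data" array creator

# Number of tiles along each dimension, rounding up for partial tiles.
tiles_per_dim = [math.ceil(n / t) for n, t in zip(domain_shape, tile_extents)]
total_tiles = math.prod(tiles_per_dim)

print(tiles_per_dim)  # [5, 5]
print(total_tiles)    # 25
```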

Examine the data in the arrays

Open the attributes from the generated TileDB group:

with tiledb.Group(group_uri) as group:
    with (
        tiledb.cf.open_group_array(group, attr="x.data") as x_array,
        tiledb.cf.open_group_array(group, attr="y.data") as y_array,
        tiledb.cf.open_group_array(group, array="data") as data_array,
    ):
        x = x_array[:]
        y = y_array[:]
        data = data_array[...]
        A1 = data["A1"]
        A2 = data["A2"]
        a1_description = tiledb.cf.AttrMetadata(data_array.meta, "A1")["description"]
        a2_description = tiledb.cf.AttrMetadata(data_array.meta, "A2")["description"]
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].contourf(x, y, A1)
axes[0].set_title(a1_description)
axes[1].contourf(x, y, A2)
axes[1].set_title(a2_description);

Convert to a Single Sparse TileDB Array

In this section we convert a NetCDF file to TileDB in a way that:

  • maps NetCDF ‘coordinate’ variables to TileDB dimensions,
  • maps NetCDF standard variables to TileDB attributes.
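
One way to picture this mapping: every cell of the dense grid becomes a sparse cell whose coordinates are the values of the coordinate variables rather than integer indices. The NumPy sketch below illustrates the idea on a tiny grid. It does not use tiledb-cf, and the array names and sizes are chosen for illustration.

```python
import numpy as np

# Tiny stand-ins for the coordinate variables x(x) and y(y).
x_coord = np.array([-5.0, 0.0, 5.0])
y_coord = np.array([-5.0, 5.0])

# A variable defined on (x, y), here x + y with shape (3, 2).
values = x_coord[:, np.newaxis] + y_coord[np.newaxis, :]

# Flatten to one record per cell: (x value, y value, attribute value).
x_cells, y_cells = np.meshgrid(x_coord, y_coord, indexing="ij")
cells = np.column_stack([x_cells.ravel(), y_cells.ravel(), values.ravel()])

print(cells.shape)        # (6, 3)
print(cells[0].tolist())  # [-5.0, -5.0, -10.0]
```

Because the coordinates are floating-point values, the resulting TileDB dimensions need a real-valued domain, which is why the domain is set to (-5.0, 5.0) for both x and y.
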
converter2 = tiledb.cf.NetCDF4ConverterEngine.from_file(
    netcdf_file, coords_to_dims=True, attrs_filters=[tiledb.ZstdFilter(level=7)]
)
converter2
# Update properties for the array
converter2.get_shared_dim("x").domain = (-5.0, 5.0)
converter2.get_shared_dim("y").domain = (-5.0, 5.0)
data_array = converter2.get_array_creator("array0")
data_array.domain_creator.tiles = (1.0, 1.0)
data_array.capacity = 400
converter2
converter2.convert_to_array(array_uri)
tiledb.ArraySchema.load(array_uri)