import netCDF4
import numpy as np
import tiledb
import tiledb.cf
import matplotlib.pyplot as plt
Converting a simple NetCDF file to a TileDB array
About this Example
What it Shows
The purpose of this example is to show the basics of converting a NetCDF file to TileDB arrays.
This includes:
- Options for auto-generating a converter from a NetCDF file.
- Changing the TileDB schema settings before conversion.
- Creating the TileDB group and copying data from the NetCDF file to the TileDB arrays.
Example dataset
This example shows converting a small NetCDF file with 2 dimensions and 4 variables:
- Dimensions:
  - x: size=100
  - y: size=100
- Variables:
  - x(x)
    - description: evenly spaced values from -5 to 5
    - data type: 64-bit floating point
  - y(y)
    - description: evenly spaced values from -5 to 5
    - data type: 64-bit floating point
  - A1(x, y)
    - description: x + y
    - data type: 64-bit floating point
  - A2(x, y)
    - description: sin((x/2)^2 + y^2)
    - data type: 64-bit floating point
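As a quick check on the description above, the following sketch rebuilds the listed values with NumPy alone, without writing a NetCDF file. The array names are illustrative, and indexing A1 and A2 as [x, y] is an assumption made for this sketch.

```python
import numpy as np

# Coordinate values: 100 evenly spaced points from -5 to 5 on each axis.
x = np.linspace(-5.0, 5.0, 100)
y = np.linspace(-5.0, 5.0, 100)

# Broadcast to a 100 x 100 grid, indexed as [x, y] (an assumption of this sketch).
A1 = x[:, np.newaxis] + y[np.newaxis, :]
A2 = np.sin((x[:, np.newaxis] / 2.0) ** 2 + y[np.newaxis, :] ** 2)

print(A1.shape, A1.dtype)
print(A2.shape, A2.dtype)
```

Both matrices have one row per x value and one column per y value, matching the (x, y) dimension order given in the list.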
Set-up Requirements
This example requires the following Python packages to be installed: netCDF4, numpy, tiledb, tiledb-cf, and matplotlib.
# Set names for the output generated by the example.
output_dir = "output/netcdf-to-tiledb-basics"
netcdf_file = f"{output_dir}/simple1.nc"
group_uri = f"{output_dir}/simple_netcdf_to_group_1"
array_uri = f"{output_dir}/simple_netcdf_to_array_1"
# Reset output folder
import os
import shutil
shutil.rmtree(output_dir, ignore_errors=True)
# makedirs also creates the parent "output" directory if it is missing.
os.makedirs(output_dir)
Create an example NetCDF file
Because the output folder was just reset, we create a small NetCDF file for this example.
x_data = np.linspace(-5.0, 5.0, 100)
y_data = np.linspace(-5.0, 5.0, 100)
xv, yv = np.meshgrid(x_data, y_data, sparse=True)
with netCDF4.Dataset(netcdf_file, mode="w") as dataset:
    dataset.setncatts({"title": "Simple dataset for examples"})
    dataset.createDimension("x", 100)
    dataset.createDimension("y", 100)
    A1 = dataset.createVariable("A1", np.float64, ("x", "y"))
    A1.setncattr("full_name", "Example matrix A1")
    A1.setncattr("description", "x + y")
    A1[:, :] = xv + yv
    A2 = dataset.createVariable("A2", np.float64, ("x", "y"))
    A2[:, :] = np.sin((xv / 2.0) ** 2 + yv**2)
    A2.setncattr("full_name", "Example matrix A2")
    A2.setncattr("description", "sin((x/2)^2 + y^2)")
    x1 = dataset.createVariable("x", np.float64, ("x",))
    x1[:] = x_data
    y = dataset.createVariable("y", np.float64, ("y",))
    y[:] = y_data
print(f"Created example NetCDF file `{netcdf_file}`.")
Convert NetCDF file to a TileDB Group
In this section we convert a NetCDF file to TileDB in a way that:
- maps NetCDF dimensions to TileDB dimensions,
- maps NetCDF variables to TileDB attributes.
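To make this mapping concrete, the sketch below groups the example's variables by the dimensions they are defined on. Variables that share dimensions can be stored in one TileDB array, each as a separate attribute. This is a plain-Python illustration of the idea, not TileDB-CF's implementation, and group_by_dimensions is a hypothetical helper.

```python
from collections import defaultdict

# NetCDF variables from this example, keyed by name, with their dimension tuples.
variables = {
    "x": ("x",),
    "y": ("y",),
    "A1": ("x", "y"),
    "A2": ("x", "y"),
}


def group_by_dimensions(variables):
    """Group variable names by the tuple of dimensions they are defined on."""
    groups = defaultdict(list)
    for name, dims in variables.items():
        groups[dims].append(name)
    return dict(groups)


# Each key stands for one array; each listed name stands for one attribute in it.
arrays = group_by_dimensions(variables)
for dims, names in arrays.items():
    print(dims, "->", names)
```

For this dataset the grouping yields three arrays: one for x, one for y, and one holding both A1 and A2.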
The functions NetCDF4ConverterEngine.from_file and NetCDF4ConverterEngine.from_group auto-generate a NetCDF4ConverterEngine for an existing NetCDF file. The properties of the NetCDF4ConverterEngine can be modified after the converter is generated.
Parameters:
- Set the location of the NetCDF group to be converted.
  - In from_file:
    - input_file: The input NetCDF file to generate the converter engine from.
    - group_path: The path to the NetCDF group to copy data from. Use '/' for the root group.
  - In from_group:
    - input_netcdf_group: The NetCDF group to generate the converter engine from. (Must be a netCDF4.Dataset or netCDF4.Group.)
- Set the array grouping. A NetCDF variable maps to a TileDB attribute. The collect_attrs parameter determines whether each NetCDF variable is stored in a separate array, or whether all NetCDF variables with the same underlying dimensions are stored in the same TileDB array. Scalar variables are always grouped together.
  - collect_attrs: If True, store all attributes with the same dimensions in the same array. Otherwise, store each attribute in a separate array.
- Set default properties for TileDB dimensions.
  - unlimited_dim_size: The default size of the domain for TileDB dimensions created from unlimited NetCDF dimensions. If None, the current size of the NetCDF dimension will be used.
  - dim_dtype: The default numpy dtype to use when converting a NetCDF dimension to a TileDB dimension.
- Set tile sizes for TileDB dimensions. Dimensions in different arrays of the TileDB group may share the same name, domain, and type, but have different tiles and compression filters. The tiles_by_var and tiles_by_dims parameters provide a way of setting the tiles for the dimensions in different arrays.
  - tiles_by_var: A map from the name of a single NetCDF variable to the tiles of the dimensions of the TileDB array that contains the data from that variable.
  - tiles_by_dims: A map from the names of the NetCDF dimensions defining a variable to the tiles of the dimensions of any TileDB array that contains data from a variable defined on those dimensions.
- Convert 1D variables with the same name and dimension to a TileDB dimension instead of a TileDB attribute. This is an advanced usage, and will move the data away from a NetCDF-like data model.
  - coords_to_dims: If True, convert the NetCDF coordinate variable into a TileDB dimension for sparse arrays. Otherwise, convert the coordinate dimension into a TileDB dimension and the coordinate variable into a TileDB attribute.
- Unpack NetCDF data that uses add_offset and scale_factor.
  - unpack_vars: If True, for any variable that has the NetCDF attributes add_offset or scale_factor, apply the linear transformation x -> scale_factor * x + add_offset to the data before conversion. The data type will be set to the data type of the scaled data, and the attributes add_offset and scale_factor will be dropped.
- Set filters for TileDB dimensions, attributes, and offsets.
  - offsets_filters: Default TileDB filters for all offsets for variable length TileDB attributes and TileDB dimensions.
  - attrs_filters: Default TileDB filters for all attributes.
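For readers unfamiliar with NetCDF packing, the following sketch shows the arithmetic behind unpacking with NumPy alone. Under the NetCDF convention, the stored value is multiplied by scale_factor and then add_offset is added. The packed values and the two attribute values here are hypothetical, chosen only for illustration; the sketch makes no TileDB-CF calls.

```python
import numpy as np

# Hypothetical packed values and packing attributes, chosen for illustration.
packed = np.array([0, 100, 200, 300], dtype=np.int16)
scale_factor = 0.01
add_offset = 20.0

# NetCDF packing convention: unpacked = packed * scale_factor + add_offset.
unpacked = packed * scale_factor + add_offset

# The result takes the floating-point type of the scaled data, not int16.
print(packed.dtype, "->", unpacked.dtype)
print(unpacked)
```

The change of data type in the result is why the converted attribute takes the type of the scaled data rather than the packed type.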
# Auto-generate NetCDF to TileDB conversion from a NetCDF file.
converter = tiledb.cf.NetCDF4ConverterEngine.from_file(
netcdf_file,
dim_dtype=np.uint32,
attrs_filters=[tiledb.ZstdFilter(level=7)],
)
converter
# Update properties manually by modifying the array creators
# 1. Update properties for x
x_array = converter.get_array_creator_by_attr("x.data")
x_array.name = "x"
x_array.domain_creator.tiles = (20,)
# 2. Update properties for y
y_array = converter.get_array_creator_by_attr("y.data")
y_array.name = "y"
y_array.domain_creator.tiles = (20,)
# 3. Update properties for array of matrices
data_array = converter.get_array_creator_by_attr("A1")
data_array.name = "data"
data_array.domain_creator.tiles = (20, 20)
converter
Run the conversions to create the dense TileDB arrays (x, y, and data):
converter.convert_to_group(group_uri)
Examine the data in the arrays
Open the attributes from the generated TileDB group:
with tiledb.Group(group_uri) as group:
    with (
        tiledb.cf.open_group_array(group, attr="x.data") as x_array,
        tiledb.cf.open_group_array(group, attr="y.data") as y_array,
        tiledb.cf.open_group_array(group, array="data") as data_array,
    ):
        x = x_array[:]
        y = y_array[:]
        data = data_array[...]
        A1 = data["A1"]
        A2 = data["A2"]
        a1_description = tiledb.cf.AttrMetadata(data_array.meta, "A1")["description"]
        a2_description = tiledb.cf.AttrMetadata(data_array.meta, "A2")["description"]
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].contourf(x, y, A1)
axes[0].set_title(a1_description)
axes[1].contourf(x, y, A2)
axes[1].set_title(a2_description);
Convert to a Single Sparse TileDB Array
In this section we convert a NetCDF file to TileDB in a way that:
- maps NetCDF ‘coordinate’ variables to TileDB dimensions,
- maps NetCDF standard variables to TileDB attributes.
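In a sparse array, each stored cell is addressed by its coordinate values rather than by an integer index. The following NumPy sketch shows the example grid in that form. It illustrates the data model only and makes no TileDB calls; the variable names and the [x, y] index order are assumptions of the sketch.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 100)
y = np.linspace(-5.0, 5.0, 100)

# Dense view: values on a 100 x 100 grid, indexed as [x, y] in this sketch.
xx, yy = np.meshgrid(x, y, indexing="ij")
A1_dense = xx + yy

# Sparse view: one (x, y) coordinate pair per stored cell, plus its value.
x_coords = xx.ravel()
y_coords = yy.ravel()
A1_values = A1_dense.ravel()

print(x_coords.shape, y_coords.shape, A1_values.shape)
```

Each of the 10,000 cells is now described by a floating-point (x, y) pair, which is why the dimensions of the sparse array take the domain of the coordinate values themselves.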
converter2 = tiledb.cf.NetCDF4ConverterEngine.from_file(
netcdf_file, coords_to_dims=True, attrs_filters=[tiledb.ZstdFilter(level=7)]
)
converter2
# Update properties for the array
converter2.get_shared_dim("x").domain = (-5.0, 5.0)
converter2.get_shared_dim("y").domain = (-5.0, 5.0)
data_array = converter2.get_array_creator("array0")
data_array.domain_creator.tiles = (1.0, 1.0)
data_array.capacity = 400
converter2
converter2.convert_to_array(array_uri)
tiledb.ArraySchema.load(array_uri)