geospatial.ingestion

cloud.geospatial.ingestion

Functions

Name Description
build_file_list_udf Build a list of sources
build_inputs_udf Groups input URIs into batches.
consolidate_meta Consolidate arrays in the dataset.
ingest_datasets Ingest samples into a dataset.
ingest_datasets_dag Ingests geospatial point clouds, geometries and images into TileDB arrays
ingest_geometry_udf Internal UDF that ingests a batch of geometry files server-side
ingest_point_cloud_udf Internal UDF that ingests a batch of point cloud files server-side
ingest_raster_udf Internal UDF that ingests a batch of raster files server-side
load_geometry_metadata Return geospatial metadata for a sequence of input geometry data files
load_pointcloud_metadata Return geospatial metadata for a sequence of input point cloud data files
load_raster_metadata Return geospatial metadata for a sequence of input raster data files
read_uris Read a list of URIs from a URI.
register_dataset_udf Register the dataset on TileDB Cloud.
remove_dataset_type_from_array_meta Removes dataset_type meta if the ingested result is an array.

build_file_list_udf

cloud.geospatial.ingestion.build_file_list_udf(
    dataset_type
    config=None
    search_uri=None
    pattern=None
    ignore=None
    dataset_list_uri=None
    max_files=None
    verbose=False
    trace=False
    log_uri=None
)

Build a list of sources

Parameters

Name Type Description Default
dataset_type DatasetType dataset type, one of pointcloud, raster or geometry required
config Optional[Mapping[str, object]] config dictionary, defaults to None None
search_uri Optional[str] URI to search for geospatial dataset files, defaults to None None
pattern Optional[str] Unix shell style pattern to match when searching for files, defaults to None None
ignore Optional[str] Unix shell style pattern to ignore when searching for files, defaults to None None
dataset_list_uri Optional[str] URI with a list of dataset URIs, defaults to None None
max_files Optional[int] maximum number of URIs to read/find, defaults to None (no limit) None
verbose bool verbose logging, defaults to False False
trace bool bool, enabling log tracing, defaults to False False
log_uri Optional[str] log array URI None

Returns

Name Type Description
Sequence[str] A sequence of source files grouped into batches
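
The `pattern`, `ignore` and `max_files` parameters above can be illustrated with a small standalone sketch. It assumes the Unix shell style matching behaves like Python's `fnmatch` module; `filter_sources` and the example URIs are hypothetical and are not part of this module, which may search and filter differently.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Sequence


def filter_sources(
    uris: Sequence[str],
    pattern: Optional[str] = None,
    ignore: Optional[str] = None,
    max_files: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: keep URIs matching `pattern` but not `ignore`."""
    selected: List[str] = []
    for uri in uris:
        # Stop once the optional limit has been reached.
        if max_files is not None and len(selected) >= max_files:
            break
        # Skip URIs that do not match the inclusion pattern.
        if pattern is not None and not fnmatchcase(uri, pattern):
            continue
        # Skip URIs that match the exclusion pattern.
        if ignore is not None and fnmatchcase(uri, ignore):
            continue
        selected.append(uri)
    return selected


candidates = [
    "s3://example-bucket/tiles/a.tif",
    "s3://example-bucket/tiles/b.tif",
    "s3://example-bucket/tiles/b_preview.tif",
    "s3://example-bucket/tiles/readme.txt",
]
matched = filter_sources(candidates, pattern="*.tif", ignore="*_preview*")
limited = filter_sources(candidates, pattern="*.tif", max_files=1)
```

In this sketch `matched` keeps the two non-preview `.tif` files and `limited` stops after the first match.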

build_inputs_udf

cloud.geospatial.ingestion.build_inputs_udf(
    dataset_type
    sources
    config=None
    compression_filter=None
    tile_size=RASTER_TILE_SIZE
    pixels_per_fragment=PIXELS_PER_FRAGMENT
    chunk_size=POINT_CLOUD_CHUNK_SIZE
    nodata=None
    resampling='bilinear'
    res=None
    verbose=False
    trace=False
    log_uri=None
)

Groups input URIs into batches.

Parameters

Name Type Description Default
dataset_type DatasetType dataset type, one of pointcloud, raster or geometry required
sources Sequence[str] URIs to process required
config Optional[Mapping[str, object]] config dictionary, defaults to None None
compression_filter Optional[dict] serialized tiledb filter, defaults to None None
tile_size int for rasters this is the tile (block) size for the merged destination array, defaults to 1024 RASTER_TILE_SIZE
pixels_per_fragment int This is the number of pixels that will be written per fragment. Ideally aim to align as a factor of tile_size PIXELS_PER_FRAGMENT
chunk_size int for point cloud this is the PDAL chunk size, defaults to 1000000 POINT_CLOUD_CHUNK_SIZE
nodata Optional[float] NODATA value for raster merging None
resampling Optional[str] string, resampling method, one of None, bilinear, cubic, nearest and average 'bilinear'
res Tuple[float, float] Tuple[float, float], output resolution in x/y None
verbose bool verbose logging, defaults to False False
trace bool bool, enabling log tracing, defaults to False False
log_uri Optional[str] log array URI None

Returns

Name Type Description
dict[str, object] A dict containing the kwargs needed for the next function call
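
As a rough illustration of grouping input URIs into batches, the sketch below splits a list into fixed-size chunks. The `batch_uris` helper and its `batch_size` argument are hypothetical; the actual UDF may group sources differently depending on the dataset type.

```python
from typing import Iterator, List, Sequence


def batch_uris(sources: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive batches of at most `batch_size` URIs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sources), batch_size):
        yield list(sources[start:start + batch_size])


# Example URIs are placeholders.
sources = [f"s3://example-bucket/clouds/part_{i}.laz" for i in range(5)]
batches = list(batch_uris(sources, batch_size=2))
```

With five sources and a batch size of two, the last batch holds the single remaining URI.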

consolidate_meta

cloud.geospatial.ingestion.consolidate_meta(
    dataset_uri
    *
    config=None
    id='consolidate'
    verbose=False
    trace=False
    log_uri=None
)

Consolidate arrays in the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, object]] config dictionary, defaults to None None
id str profiler event id, defaults to “consolidate” 'consolidate'
verbose bool verbose logging, defaults to False False

ingest_datasets

cloud.geospatial.ingestion.ingest_datasets(
    dataset_uri
    *
    dataset_type
    acn=None
    config=None
    namespace=None
    register_name=None
    search_uri=None
    pattern=None
    ignore=None
    dataset_list_uri=None
    max_files=None
    compression_filter=None
    workers=MAX_WORKERS
    batch_size=BATCH_SIZE
    tile_size=RASTER_TILE_SIZE
    pixels_per_fragment=PIXELS_PER_FRAGMENT
    chunk_size=POINT_CLOUD_CHUNK_SIZE
    nodata=None
    res=None
    stats=False
    verbose=False
    trace=False
    log_uri=None
)

Ingest samples into a dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
dataset_type DatasetType dataset type, one of pointcloud, raster or geometry required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, object]] config dictionary, defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
register_name Optional[str] name to register the dataset with on TileDB Cloud, defaults to None None
search_uri Optional[str] URI to search for geospatial dataset files, defaults to None None
pattern Optional[str] Unix shell style pattern to match when searching for files, defaults to None None
ignore Optional[str] Unix shell style pattern to ignore when searching for files, defaults to None None
dataset_list_uri Optional[str] URI with a list of dataset URIs, defaults to None None
max_files Optional[int] maximum number of URIs to read/find, defaults to None (no limit) None
compression_filter Optional[dict] serialized tiledb filter, defaults to None None
workers int number of workers for dataset ingestion, defaults to MAX_WORKERS MAX_WORKERS
batch_size int batch size for dataset ingestion, defaults to BATCH_SIZE BATCH_SIZE
tile_size int for rasters this is the tile (block) size for the merged destination array, defaults to 1024 RASTER_TILE_SIZE
pixels_per_fragment int This is the number of pixels that will be written per fragment. Ideally aim to align as a factor of tile_size PIXELS_PER_FRAGMENT
chunk_size int for point cloud this is the PDAL chunk size, defaults to 1000000 POINT_CLOUD_CHUNK_SIZE
nodata Optional[float] NODATA value for rasters None
res Tuple[float, float] Tuple[float, float], output resolution in x/y None
stats bool bool, print TileDB stats to stdout False
verbose bool verbose logging, defaults to False False
trace bool bool, enable trace for logging, defaults to False False
log_uri Optional[str] log array URI None
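
The signature above takes `dataset_uri` positionally and every other argument by keyword. The sketch below shows that calling convention with a stand-in function instead of the real one, so it runs without TileDB Cloud. `start_ingestion`, `record_call`, the URIs, and the check that at least one of `search_uri` or `dataset_list_uri` is given are all assumptions of this sketch. The string `"raster"` is a placeholder for a `DatasetType` value.

```python
from typing import Any, Callable, Dict, Optional


def record_call(dataset_uri: str, **kwargs: Any) -> Dict[str, Any]:
    """Stand-in for the real ingestion function; it only echoes its arguments."""
    return {"dataset_uri": dataset_uri, **kwargs}


def start_ingestion(
    ingest_fn: Callable[..., Any],
    dataset_uri: str,
    *,
    dataset_type: Any,
    search_uri: Optional[str] = None,
    dataset_list_uri: Optional[str] = None,
    **options: Any,
) -> Any:
    """Hypothetical wrapper that forwards keyword-only arguments to `ingest_fn`."""
    # Assumption of this sketch: some source of input files must be named.
    if search_uri is None and dataset_list_uri is None:
        raise ValueError("provide search_uri or dataset_list_uri")
    return ingest_fn(
        dataset_uri,
        dataset_type=dataset_type,
        search_uri=search_uri,
        dataset_list_uri=dataset_list_uri,
        **options,
    )


result = start_ingestion(
    record_call,
    "s3://example-bucket/arrays/elevation",
    dataset_type="raster",  # placeholder for a DatasetType value
    search_uri="s3://example-bucket/geotiffs/",
    pattern="*.tif",
    max_files=100,
    verbose=True,
)
```

Passing the real ingestion function in place of `record_call` would forward the same keyword arguments.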

ingest_datasets_dag

cloud.geospatial.ingestion.ingest_datasets_dag(
    dataset_uri
    *
    dataset_type
    acn=None
    config=None
    namespace=None
    register_name=None
    search_uri=None
    pattern=None
    ignore=None
    dataset_list_uri=None
    max_files=None
    compression_filter=None
    workers=MAX_WORKERS
    batch_size=BATCH_SIZE
    tile_size=RASTER_TILE_SIZE
    pixels_per_fragment=PIXELS_PER_FRAGMENT
    chunk_size=POINT_CLOUD_CHUNK_SIZE
    nodata=None
    resampling='bilinear'
    res=None
    stats=False
    verbose=False
    trace=False
    log_uri=None
)

Ingests geospatial point clouds, geometries and images into TileDB arrays

Parameters

Name Type Description Default
dataset_uri str dataset URI required
dataset_type DatasetType dataset type, one of pointcloud, raster or geometry required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, object]] config dictionary, defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
register_name Optional[str] name to register the dataset with on TileDB Cloud, defaults to None and the destination array is not registered None
search_uri Optional[str] URI to search for geospatial dataset files, defaults to None None
pattern Optional[str] Unix shell style pattern to match when searching for files, defaults to None None
ignore Optional[str] Unix shell style pattern to ignore when searching for files, defaults to None None
dataset_list_uri Optional[str] URI with a list of dataset URIs, defaults to None None
max_files Optional[int] maximum number of URIs to read/find, defaults to None (no limit) None
compression_filter Optional[dict] serialized tiledb filter, defaults to None None
workers int number of workers for dataset ingestion, defaults to MAX_WORKERS MAX_WORKERS
batch_size int batch size for dataset ingestion, defaults to BATCH_SIZE BATCH_SIZE
tile_size int for rasters this is the tile (block) size for the merged destination array, defaults to 1024 RASTER_TILE_SIZE
pixels_per_fragment int This is the number of pixels that will be written per fragment. Ideally aim to align as a factor of tile_size PIXELS_PER_FRAGMENT
chunk_size int for point cloud this is the PDAL chunk size, defaults to 1000000 POINT_CLOUD_CHUNK_SIZE
nodata Optional[float] NODATA value for raster merging None
resampling Optional[str] string, resampling method, one of None, bilinear, cubic, nearest and average 'bilinear'
res Tuple[float, float] Tuple[float, float], output resolution in x/y None
stats bool bool, print TileDB stats to stdout False
verbose bool verbose logging, defaults to False False
trace bool bool, enabling log tracing, defaults to False False
log_uri Optional[str] log array URI None
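
Before submitting a raster ingestion it can help to check `resampling` and `res` locally. The sketch below validates `resampling` against the methods listed in the table above. The `check_raster_options` helper is hypothetical, and the requirement that `res` holds two positive numbers is an assumption of this sketch rather than documented behavior.

```python
from typing import Optional, Tuple

# Resampling methods as listed in the parameter table.
RESAMPLING_METHODS = (None, "bilinear", "cubic", "nearest", "average")


def check_raster_options(
    resampling: Optional[str] = "bilinear",
    res: Optional[Tuple[float, float]] = None,
) -> None:
    """Hypothetical pre-flight check for raster ingestion options."""
    if resampling not in RESAMPLING_METHODS:
        raise ValueError(f"unsupported resampling method: {resampling!r}")
    if res is not None:
        # Assumption: the output resolution is an (x, y) pair of positive numbers.
        if len(res) != 2 or any(value <= 0 for value in res):
            raise ValueError("res must be a pair of positive x/y resolutions")


check_raster_options("cubic", (10.0, 10.0))  # passes silently
check_raster_options(None)  # None is one of the listed methods
```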

ingest_geometry_udf

cloud.geospatial.ingestion.ingest_geometry_udf(
    dataset_uri
    args={}
    sources=None
    schema=None
    extents=None
    crs=None
    chunk_size=GEOMETRY_CHUNK_SIZE
    batch_size=BATCH_SIZE
    compressor=None
    append=False
    verbose=False
    stats=False
    config=None
    id='geometry'
    trace=False
    log_uri=None
)

Internal UDF that ingests a batch of geometry files server-side into TileDB arrays using the Fiona API.

Parameters

Name Type Description Default
dataset_uri str str, output TileDB array name required
args Union[Dict, List] dict, input key value arguments as a dictionary {}
sources Sequence[str] Sequence of input geometry file names None
schema dict dict, dictionary of schema attributes and geometries None
extents Optional[XYBoundsTuple] Extents of the destination geometry array None
crs Optional[str] str, CRS for the destination dataset None
chunk_size Optional[int] int, sets tile capacity and the number of geometries written at once GEOMETRY_CHUNK_SIZE
batch_size Optional[int] batch size for dataset ingestion, defaults to BATCH_SIZE BATCH_SIZE
compressor Optional[dict] dict, serialized compression filter None
append bool bool, whether to append to the array False
verbose bool verbose logging, defaults to False False
stats bool bool, print TileDB stats to stdout False
config Optional[Mapping[str, object]] dict, configuration to pass on tiledb.VFS None
id str str, ID for logging 'geometry'
log_uri Optional[str] log array URI None

Returns

Name Type Description
Union[Sequence[os.PathLike], None] if not appending then the function returns a tuple of file paths

ingest_point_cloud_udf

cloud.geospatial.ingestion.ingest_point_cloud_udf(
    args={}
    dataset_uri
    sources=None
    append=False
    chunk_size=POINT_CLOUD_CHUNK_SIZE
    batch_size=BATCH_SIZE
    verbose=False
    stats=False
    config=None
    id='pointcloud'
    trace=False
    log_uri=None
)

Internal UDF that ingests a batch of point cloud files server-side into TileDB arrays using the PDAL API. Compression uses the default profile built into PDAL.

Parameters

Name Type Description Default
args Union[Dict, List] dict or list, input key value arguments as a dictionary {}
dataset_uri str str, output TileDB array name required
sources Sequence[GeoMetadata] Sequence of GeoMetadata objects None
append bool bool, whether to append to the array False
chunk_size Optional[int] PDAL configuration for chunking fragments POINT_CLOUD_CHUNK_SIZE
batch_size Optional[int] batch size for dataset ingestion, defaults to BATCH_SIZE BATCH_SIZE
verbose bool verbose logging, defaults to False False
stats bool bool, print TileDB stats to stdout False
config Optional[Mapping[str, object]] dict, configuration to pass on tiledb.VFS None
id str str, ID for logging 'pointcloud'
log_uri Optional[str] log array URI None

Returns

Name Type Description
Union[Sequence[os.PathLike], None] if not appending then a sequence of file paths

ingest_raster_udf

cloud.geospatial.ingestion.ingest_raster_udf(
    args={}
    dataset_uri
    sources=None
    extents=None
    band_count=None
    dtype=None
    nodata=None
    pixels_per_fragment=PIXELS_PER_FRAGMENT
    tile_size=RASTER_TILE_SIZE
    resampling=DEFAULT_RASTER_SAMPLING
    append=False
    batch_size=BATCH_SIZE
    stats=False
    verbose=False
    config=None
    compressor=None
    id='raster'
    trace=False
    log_uri=None
)

Internal UDF that ingests a batch of raster files server-side into TileDB arrays using the Rasterio API.

Parameters

Name Type Description Default
args Union[Dict, List] dict, input key value arguments as a dictionary {}
dataset_uri str str, output TileDB array name required
sources Tuple[GeoBlockMetadata] tuple, sequence of GeoBlockMetadata objects containing the destination raster window and the input files that contribute to this window None
extents Optional[BoundingBox] Extents of the destination raster None
band_count Optional[int] int, number of bands in destination array None
dtype Optional[str] str, dtype of destination array None
nodata Optional[float] float, NODATA value for destination raster None
tile_size int for rasters this is the tile (block) size for the merged destination array, defaults to 1024 RASTER_TILE_SIZE
pixels_per_fragment int This is the number of pixels that will be written per fragment. Ideally aim to align as a factor of tile_size PIXELS_PER_FRAGMENT
resampling str string, resampling method, one of None, bilinear, cubic, nearest and average DEFAULT_RASTER_SAMPLING
append bool bool, whether to append to the array False
batch_size int batch size for dataset ingestion, defaults to BATCH_SIZE BATCH_SIZE
stats bool bool, print TileDB stats to stdout False
verbose bool verbose logging, defaults to False False
config Optional[Mapping[str, object]] dict, configuration to pass on tiledb.VFS None
compressor Optional[dict] dict, serialized compression filter None
id str str, ID for logging 'raster'
log_uri Optional[str] log array URI None

Returns

Name Type Description
Union[Sequence[GeoBlockMetadata], None] if not appending then a sequence of populated GeoBlockMetadata objects
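
One way to read the advice about aligning `pixels_per_fragment` with `tile_size` is to check how many whole tiles fit in a fragment. The sketch below assumes square tiles of `tile_size` by `tile_size` pixels, which is an interpretation and not confirmed by this reference. `tiles_per_fragment` is a hypothetical helper.

```python
from typing import Tuple


def tiles_per_fragment(pixels_per_fragment: int, tile_size: int = 1024) -> Tuple[int, int]:
    """Hypothetical helper: return (whole tiles, leftover pixels) per fragment."""
    if tile_size < 1:
        raise ValueError("tile_size must be at least 1")
    # Assumption: each tile is square, so it holds tile_size * tile_size pixels.
    pixels_per_tile = tile_size * tile_size
    return divmod(pixels_per_fragment, pixels_per_tile)


aligned = tiles_per_fragment(4 * 1024 * 1024, tile_size=1024)
misaligned = tiles_per_fragment(1_500_000, tile_size=1024)
```

A leftover of zero, as in `aligned`, means the fragment size is a whole multiple of the tile size under this assumption.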

load_geometry_metadata

cloud.geospatial.ingestion.load_geometry_metadata(
    sources
    *
    config=None
    verbose=False
    id='pointcloud_metadata'
    trace=False
    log_uri=None
)

Return geospatial metadata for a sequence of input geometry data files

Returns list[GeoMetadata], a list of populated GeoMetadata objects.

Parameters

Name Type Description Default
sources Iterable[os.PathLike] A sequence of paths or path to input required
config Optional[Mapping[str, object]] dict configuration to pass on tiledb.VFS None
verbose bool bool, enable verbose logging, default is False False
trace bool bool, enable trace logging, default is False False
log_uri Optional[str] log array URI None

load_pointcloud_metadata

cloud.geospatial.ingestion.load_pointcloud_metadata(
    sources
    *
    config=None
    verbose=False
    id='pointcloud_metadata'
    trace=False
    log_uri=None
)

Return geospatial metadata for a sequence of input point cloud data files

Returns list[GeoMetadata], a list of populated GeoMetadata objects.

Parameters

Name Type Description Default
sources Iterable[os.PathLike] iterator, paths or path to process required
config Optional[Mapping[str, object]] dict, configuration to pass on tiledb.VFS None
verbose bool bool, enable verbose logging, default is False False
trace bool bool, enable trace logging, default is False False
log_uri Optional[str] log array URI None

load_raster_metadata

cloud.geospatial.ingestion.load_raster_metadata(
    sources
    *
    config=None
    verbose=False
    id='raster_metadata'
    trace=False
    log_uri=None
)

Return geospatial metadata for a sequence of input raster data files

Returns list[GeoMetadata], a list of populated GeoMetadata objects.

Parameters

Name Type Description Default
sources Iterable[os.PathLike] iterator, paths or path to process required
config Optional[Mapping[str, object]] dict, configuration to pass on tiledb.VFS None
verbose bool bool, enable verbose logging, default is False False
trace bool bool, enable trace logging, default is False False
id str str, ID for logging 'raster_metadata'
log_uri Optional[str] log array URI None

read_uris

cloud.geospatial.ingestion.read_uris(
    list_uri
    dataset_type
    *
    log_uri=None
    config=None
    max_files=None
    verbose=False
)

Read a list of URIs from a URI.

Parameters

Name Type Description Default
list_uri str URI of the list of URIs required
dataset_type DatasetType dataset type, one of pointcloud, raster or geometry required
log_uri Optional[str] log array URI None
config Optional[Mapping[str, object]] config dictionary, defaults to None None
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs
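
The effect of `max_files` on a URI list can be sketched with an in-memory file. This example assumes the list holds one URI per line and that blank lines are skipped; the actual file format is not specified here. `read_uri_lines` is a hypothetical helper, and the real function reads from a URI rather than from a local handle.

```python
import io
from typing import List, Optional, TextIO


def read_uri_lines(handle: TextIO, max_files: Optional[int] = None) -> List[str]:
    """Hypothetical helper: read up to `max_files` non-empty lines as URIs."""
    uris: List[str] = []
    for line in handle:
        uri = line.strip()
        if not uri:
            continue  # skip blank lines
        if max_files is not None and len(uris) >= max_files:
            break
        uris.append(uri)
    return uris


# Placeholder list contents standing in for a remote list file.
listing = io.StringIO(
    "s3://example-bucket/a.shp\n"
    "\n"
    "s3://example-bucket/b.shp\n"
    "s3://example-bucket/c.shp\n"
)
first_two = read_uri_lines(listing, max_files=2)
```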

register_dataset_udf

cloud.geospatial.ingestion.register_dataset_udf(
    dataset_uri
    *
    register_name
    namespace=None
    acn=None
    config=None
    verbose=False
)

Register the dataset on TileDB Cloud.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
register_name str name to register the dataset with on TileDB Cloud required
namespace Optional[str] TileDB Cloud namespace, defaults to the user’s default namespace None
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, object]] config dictionary, defaults to None None
verbose bool verbose logging, defaults to False False

remove_dataset_type_from_array_meta

cloud.geospatial.ingestion.remove_dataset_type_from_array_meta(
    dataset_uri
    *
    verbose=False
)

Removes dataset_type meta if the ingested result is an array. This is a workaround for an internal UI issue until that issue is formally fixed (related ticket: sc-48098).

Parameters

Name Type Description Default
dataset_uri str dataset URI required
verbose bool verbose logging, defaults to False False