soma.ingest

cloud.soma.ingest

Functions

Name Description
ingest_h5ad Performs the actual work of ingesting H5AD data into TileDB.
register_dataset_udf Register the dataset on TileDB Cloud.
run_ingest_workflow Starts a workflow to ingest H5AD data into SOMA.
run_ingest_workflow_udf This is the highest-level ingestor component that runs on-node. Only here can we do VFS with access_credentials_name.

ingest_h5ad

cloud.soma.ingest.ingest_h5ad(
    output_uri,
    input_uri,
    measurement_name,
    extra_tiledb_config=None,
    platform_config=None,
    ingest_mode='write',
    logging_level=logging.INFO,
    dry_run=False
)

Performs the actual work of ingesting H5AD data into TileDB.

Parameters

Name Type Description Default
output_uri str The output URI to write to. This will probably look like tiledb://namespace/some://storage/uri. required
input_uri str The URI of the H5AD file to read from. This file is read using TileDB VFS, so any path supported (and accessible) will work. required
measurement_name str The name of the Measurement within the Experiment to store the data. required
extra_tiledb_config Optional[Dict[str, object]] Extra configuration for TileDB. None
platform_config Optional[Dict[str, object]] The SOMA platform_config value to pass in, if any. None
ingest_mode str One of the ingest modes supported by tiledbsoma.io.read_h5ad. 'write'
dry_run bool If provided and set to True, does the input-path traversals without ingesting data. False

register_dataset_udf

cloud.soma.ingest.register_dataset_udf(
    dataset_uri,
    *,
    register_name,
    acn,
    namespace=None,
    config=None,
    verbose=False
)

Register the dataset on TileDB Cloud.

Parameters

Name Type Description Default
dataset_uri str The dataset URI. required
register_name str The name to register the dataset with on TileDB Cloud. required
namespace Optional[str] The TileDB Cloud namespace. Defaults to the user’s default namespace. None
config Optional[Mapping[str, Any]] A config dictionary. None
verbose bool Whether to enable verbose logging. False
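
The bare * in the signature above is Python's keyword-only marker: every parameter after it has to be passed by name. The sketch below illustrates this with a hypothetical stand-in, register_dataset_stub, that only mirrors the documented parameter layout. It is not the real implementation and does not contact TileDB Cloud; the str type for acn and the example values are assumptions.

```python
from typing import Any, Dict, Mapping, Optional


def register_dataset_stub(
    dataset_uri: str,
    *,
    register_name: str,
    acn: str,  # type assumed; the parameter table does not list acn
    namespace: Optional[str] = None,
    config: Optional[Mapping[str, Any]] = None,
    verbose: bool = False,
) -> Dict[str, Any]:
    """Hypothetical stand-in: echoes its arguments, performs no registration."""
    return {
        "dataset_uri": dataset_uri,
        "register_name": register_name,
        "acn": acn,
        "namespace": namespace,
        "verbose": verbose,
    }


# Everything after dataset_uri is passed by keyword, so this call is accepted.
ok = register_dataset_stub(
    "tiledb://namespace/s3://bucket/somas/a",
    register_name="my-soma",  # illustrative value
    acn="my-credentials",  # illustrative value
)

# Passing register_name and acn positionally raises TypeError.
try:
    register_dataset_stub(
        "tiledb://namespace/s3://bucket/somas/a", "my-soma", "my-credentials"
    )
    positional_rejected = False
except TypeError:
    positional_rejected = True
```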

run_ingest_workflow

cloud.soma.ingest.run_ingest_workflow(
    output_uri,
    input_uri,
    measurement_name,
    pattern=None,
    extra_tiledb_config=None,
    platform_config=None,
    ingest_mode='write',
    ingest_resources=None,
    namespace=None,
    register_name=None,
    acn=None,
    logging_level=logging.INFO,
    dry_run=False,
    **kwargs
)

Starts a workflow to ingest H5AD data into SOMA.

Parameters

Name Type Description Default
output_uri str The output URI to write to. This will probably look like tiledb://namespace/some://storage/uri. required
input_uri str The URI of the H5AD file(s) to read from. These are read using TileDB VFS, so any path supported (and accessible) will work. If the input_uri passes vfs.is_file, it’s ingested. If the input_uri passes vfs.is_dir, then all first-level entries are ingested. In the latter, directory case, an input file is skipped if pattern is provided and doesn’t match the input file. As well, in the directory case, each entry’s basename is appended to the output_uri to form the entry’s output URI. For example, if a.h5ad and b.h5ad are present within input_uri of s3://bucket/h5ads/ and output_uri is tiledb://namespace/s3://bucket/somas, then tiledb://namespace/s3://bucket/somas/a and tiledb://namespace/s3://bucket/somas/b are written. required
measurement_name str The name of the Measurement within the Experiment to store the data. required
pattern Optional[str] As described for input_uri. None
extra_tiledb_config Optional[Dict[str, object]] Extra configuration for TileDB. None
platform_config Optional[Dict[str, object]] The SOMA platform_config value to pass in, if any. None
ingest_mode str One of the ingest modes supported by tiledbsoma.io.read_h5ad. 'write'
ingest_resources Optional[Dict[str, object]] A specification for the amount of resources to provide to the UDF executing the ingestion process, to override the default. None
namespace Optional[str] An alternate namespace to run the ingestion process under. None
register_name Optional[str] name to register the dataset with on TileDB Cloud. None
acn Optional[str] The name of the credentials to pass to the executing UDF. None
dry_run bool If provided and set to True, does the input-path traversals without ingesting data. False
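
The directory-case rule for input_uri can be sketched in plain Python. The helper below, plan_output_uris, is hypothetical and does no I/O: it takes an already-listed set of entries instead of calling VFS. Two points are assumptions, not statements from this reference: that pattern is glob-style and matched against each entry's basename, and that the file extension is dropped when forming the output URI (as the a.h5ad to .../somas/a example suggests).

```python
import fnmatch
import posixpath
from typing import Dict, Iterable, Optional


def plan_output_uris(
    output_uri: str,
    entry_uris: Iterable[str],
    pattern: Optional[str] = None,
) -> Dict[str, str]:
    """Map each first-level input entry to the output URI it would be written to."""
    plan: Dict[str, str] = {}
    for entry in entry_uris:
        base = posixpath.basename(entry.rstrip("/"))
        # Assumption: glob-style pattern, compared with the entry's basename.
        if pattern is not None and not fnmatch.fnmatchcase(base, pattern):
            continue  # skipped: a pattern was provided and does not match
        # Assumption: the extension is dropped, so a.h5ad becomes .../a.
        stem, _ext = posixpath.splitext(base)
        plan[entry] = output_uri.rstrip("/") + "/" + stem
    return plan


entries = [
    "s3://bucket/h5ads/a.h5ad",
    "s3://bucket/h5ads/b.h5ad",
    "s3://bucket/h5ads/notes.txt",  # illustrative non-matching entry
]
plan = plan_output_uris(
    "tiledb://namespace/s3://bucket/somas", entries, pattern="*.h5ad"
)
for src, dst in plan.items():
    print(src, "->", dst)
```

Running with dry_run=True on the real workflow is the documented way to check the input-path traversal without ingesting data; this sketch only helps reason about the resulting URIs beforehand.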

Returns

Name Type Description
Dict[str, str] A dictionary of {"status": "started", "graph_id": ...}, with the UUID of the graph on the server side, which can be used to manage execution and monitor progress.
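
A minimal sketch of starting the workflow and reading the returned graph ID. The helper names (build_ingest_kwargs, graph_id_from_result, start_ingest) are hypothetical. The import path tiledb.cloud.soma.ingest and an already-authenticated client are assumptions, since this reference gives only the cloud.soma.ingest suffix. The measurement name "RNA" is an illustrative value.

```python
from typing import Any, Dict, Optional


def build_ingest_kwargs(
    output_uri: str,
    input_uri: str,
    measurement_name: str,
    pattern: Optional[str] = None,
    dry_run: bool = False,
) -> Dict[str, Any]:
    """Collect arguments using only parameter names from the table above."""
    kwargs: Dict[str, Any] = {
        "output_uri": output_uri,
        "input_uri": input_uri,
        "measurement_name": measurement_name,
        "dry_run": dry_run,
    }
    if pattern is not None:
        kwargs["pattern"] = pattern
    return kwargs


def graph_id_from_result(result: Dict[str, str]) -> str:
    """Return the graph UUID from a {"status": "started", "graph_id": ...} dict."""
    if result.get("status") != "started":
        raise RuntimeError(f"unexpected workflow status: {result.get('status')!r}")
    return result["graph_id"]


def start_ingest(**kwargs: Any) -> str:
    """Start the workflow and return its graph ID (requires the TileDB Cloud client)."""
    # Assumed import path; imported lazily so the helpers above work without it.
    from tiledb.cloud.soma.ingest import run_ingest_workflow

    return graph_id_from_result(run_ingest_workflow(**kwargs))


kwargs = build_ingest_kwargs(
    output_uri="tiledb://namespace/s3://bucket/somas",
    input_uri="s3://bucket/h5ads/",
    measurement_name="RNA",  # illustrative value
    dry_run=True,
)
# graph_id = start_ingest(**kwargs)  # uncomment once logged in to TileDB Cloud
```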

run_ingest_workflow_udf

cloud.soma.ingest.run_ingest_workflow_udf(
    output_uri,
    input_uri,
    measurement_name,
    pattern=None,
    extra_tiledb_config=None,
    platform_config=None,
    ingest_mode='write',
    ingest_resources=None,
    namespace=None,
    register_name=None,
    acn=None,
    logging_level=logging.INFO,
    dry_run=False,
    **kwargs
)

This is the highest-level ingestor component that runs on-node. VFS access with access_credentials_name is only possible here; it does not work correctly on the client.