soma.ingest

client.soma.ingest

Functions

| Name | Description |
| --- | --- |
| ingest_h5ad | Ingests H5AD data by calling tiledbsoma.io.from_anndata. |
| run_ingest_workflow | Starts a workflow to ingest H5AD data into SOMA. |
| run_ingest_workflow_udf | The highest-level ingestor component that runs on-node. Only here can VFS be used with access_credentials_name. |

ingest_h5ad

client.soma.ingest.ingest_h5ad(
    output_uri,
    input_uri,
    measurement_name,
    extra_tiledb_config=None,
    platform_config=None,
    ingest_mode='write',
    logging_level=logging.INFO,
    dry_run=False,
    **kwargs,
)

Ingests H5AD data by calling tiledbsoma.io.from_anndata.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| output_uri | str | The output URI to write to. This will typically look like “tiledb://workspace/teamspace/path/to/soma”. | required |
| input_uri | str | The URI of the H5AD file to read from. The file is read using TileDB VFS, so any supported and accessible path will work. | required |
| measurement_name | str | The name of the Measurement within the Experiment to store the data. | required |
| extra_tiledb_config | dict | Extra configuration for TileDB. | None |
| platform_config | dict | The SOMA platform_config value to pass in, if any. | None |
| ingest_mode | str | One of the ingest modes supported by tiledbsoma.io.read_h5ad. | 'write' |
| logging_level | int | The logging level for this function. | logging.INFO |
| dry_run | bool | If True, traverses the input path without ingesting data. | False |
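
The dry_run parameter makes it possible to check the input path before committing to a write. The following is a minimal sketch of that pattern, not library code: the helper name dry_run_then_ingest is hypothetical, and the ingest argument is a placeholder for whichever callable you use (for instance client.soma.ingest.ingest_h5ad). A stub stands in for the real function so the sketch runs without any TileDB dependency.

```python
from typing import Any, Callable, List


def dry_run_then_ingest(ingest: Callable[..., Any], **kwargs: Any) -> Any:
    """Call `ingest` once with dry_run=True, then again with dry_run=False.

    `ingest` is assumed to accept the keyword arguments documented above.
    If the dry run raises, the real ingestion is never attempted.
    """
    kwargs.pop("dry_run", None)  # this helper controls dry_run itself
    ingest(dry_run=True, **kwargs)
    return ingest(dry_run=False, **kwargs)


# Stub standing in for the real ingest function (hypothetical, for illustration).
recorded_dry_run_flags: List[bool] = []


def fake_ingest(output_uri, input_uri, measurement_name, dry_run=False, **kwargs):
    recorded_dry_run_flags.append(dry_run)
    return {"output_uri": output_uri, "dry_run": dry_run}


result = dry_run_then_ingest(
    fake_ingest,
    output_uri="tiledb://workspace/teamspace/path/to/soma",
    input_uri="s3://bucket/h5ads/a.h5ad",
    measurement_name="RNA",
)
```

Passing the callable in as an argument keeps the sketch independent of how the client package is imported in your environment.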

run_ingest_workflow

client.soma.ingest.run_ingest_workflow(
    output_uri,
    input_uri,
    measurement_name,
    pattern=None,
    extra_tiledb_config=None,
    platform_config=None,
    ingest_mode='write',
    ingest_resources=None,
    acn=None,
    logging_level=logging.INFO,
    dry_run=False,
    dag_factory=None,
    dag_kwargs=None,
    soma_image_name='genomics',
    wait_for_inner=False,
    **kwargs,
)

Starts a workflow to ingest H5AD data into SOMA.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| output_uri | str | The output URI to write to. When input_uri is a directory, this is the parent URI under which each entry's output is written. | required |
| input_uri | str | The URI of the H5AD file(s) to read from. These are read using TileDB VFS, so any supported and accessible path will work. If input_uri passes vfs.is_file, that file is ingested. If input_uri passes vfs.is_dir, all first-level entries are ingested. In the directory case, an input file is skipped if pattern is provided and does not match it, and each entry's basename is appended to output_uri to form that entry's output URI. For example, if “a.h5ad” and “b.h5ad” are present within an input_uri of “s3://bucket/h5ads/” and output_uri is “tiledb://workspace/teamspace/somas”, then “tiledb://workspace/teamspace/somas/a” and “tiledb://workspace/teamspace/somas/b” are written. | required |
| measurement_name | str | The name of the Measurement within the Experiment to store the data. | required |
| pattern | str | When input_uri is a directory, input files that do not match this pattern are skipped, as described for input_uri. | None |
| extra_tiledb_config | Optional[Dict[str, object]] | Extra configuration for TileDB. | None |
| platform_config | Optional[Dict[str, object]] | The SOMA platform_config value to pass in, if any. | None |
| ingest_mode | Optional[str] | One of the ingest modes supported by tiledbsoma.io.read_h5ad. | 'write' |
| ingest_resources | dict | A specification of the resources to provide to the UDF executing the ingestion process, overriding the default. | None |
| acn | str | The name of the credentials to pass to the executing UDF. | None |
| dry_run | bool | If True, traverses the input path without ingesting data. | False |
| dag_factory | callable | Allows custom DAG classes to be used in tests. Defaults to dag.DAG. | None |
| dag_kwargs | dict | Keyword arguments for the dag_factory. | None |
| wait_for_inner | bool | Whether the inner task graph that computes run_ingest_workflow_udf() should wait for completion. | False |
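
The directory-case naming rule described for input_uri can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function name plan_entry_outputs is hypothetical, pattern is assumed to be a regular expression searched against the full entry URI (the text above does not specify the matching semantics), and the file extension is dropped from the basename, as in the “a.h5ad” example.

```python
import posixpath
import re
from typing import Dict, Iterable, Optional


def plan_entry_outputs(
    entry_uris: Iterable[str],
    output_uri: str,
    pattern: Optional[str] = None,
) -> Dict[str, str]:
    """Map each first-level input entry to the output URI it would be written to."""
    plan: Dict[str, str] = {}
    for entry_uri in entry_uris:
        # Assumption: pattern is a regex searched against the whole entry URI.
        if pattern is not None and not re.search(pattern, entry_uri):
            continue
        base = posixpath.basename(entry_uri.rstrip("/"))
        stem, _extension = posixpath.splitext(base)  # "a.h5ad" -> "a"
        plan[entry_uri] = output_uri.rstrip("/") + "/" + stem
    return plan


entries = ["s3://bucket/h5ads/a.h5ad", "s3://bucket/h5ads/b.h5ad"]
plan = plan_entry_outputs(entries, "tiledb://workspace/teamspace/somas")
for source, destination in plan.items():
    print(source, "->", destination)
```

With the two entries from the example in the table, the plan maps them to “tiledb://workspace/teamspace/somas/a” and “tiledb://workspace/teamspace/somas/b”.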

Returns

| Name | Type | Description |
| --- | --- | --- |
|  | dict | A dictionary of the form {"status": "started", "graph_id": ...}, where graph_id is the UUID of the graph on the server side. It can be used to manage execution and monitor progress. |
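
A caller will usually want to pull the graph ID out of the returned dictionary and fail early if the workflow did not start. The sketch below assumes only the two keys shown above; the helper name extract_graph_id is hypothetical, and the UUID in the example is a placeholder rather than real server output.

```python
from typing import Any, Mapping


def extract_graph_id(result: Mapping[str, Any]) -> str:
    """Return the graph ID from a run_ingest_workflow-style result dictionary."""
    status = result.get("status")
    if status != "started":
        raise RuntimeError(f"unexpected workflow status: {status!r}")
    return str(result["graph_id"])


# Placeholder result in the documented shape (not real server output).
example_result = {
    "status": "started",
    "graph_id": "00000000-0000-0000-0000-000000000000",
}
graph_id = extract_graph_id(example_result)
print(graph_id)
```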

run_ingest_workflow_udf

client.soma.ingest.run_ingest_workflow_udf(
    output_uri,
    input_uri,
    measurement_name,
    pattern=None,
    extra_tiledb_config=None,
    platform_config=None,
    ingest_mode='write',
    ingest_resources=None,
    acn=None,
    logging_level=logging.INFO,
    dry_run=False,
    dag_factory=None,
    dag_kwargs=None,
    soma_image_name='genomics',
    wait=False,
    **kwargs,
)

This is the highest-level ingestor component that runs on-node. Only here can VFS be used with access_credentials_name, which does not work correctly on the client.
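
Because the VFS calls happen on-node, this is also where the file-versus-directory decision described for input_uri would be made. The following sketch shows that dispatch against any object exposing is_file, is_dir, and ls, which are the method names used in the parameter description. The names VFSLike, FakeVFS, and enumerate_inputs are hypothetical, the in-memory FakeVFS exists only so the example runs without credentials, and raising ValueError for a path that is neither a file nor a directory is an assumption.

```python
from typing import Dict, List, Protocol, Set


class VFSLike(Protocol):
    def is_file(self, uri: str) -> bool: ...
    def is_dir(self, uri: str) -> bool: ...
    def ls(self, uri: str) -> List[str]: ...


def enumerate_inputs(vfs: VFSLike, input_uri: str) -> List[str]:
    """Return the input URIs to ingest: the file itself, or a directory's first-level entries."""
    if vfs.is_file(input_uri):
        return [input_uri]
    if vfs.is_dir(input_uri):
        return list(vfs.ls(input_uri))
    # Assumption: a path that is neither a file nor a directory is an error.
    raise ValueError(f"input_uri is neither a file nor a directory: {input_uri}")


class FakeVFS:
    """In-memory stand-in for a real VFS object, for illustration only."""

    def __init__(self, files: Set[str], dirs: Dict[str, List[str]]) -> None:
        self.files = files
        self.dirs = dirs

    def is_file(self, uri: str) -> bool:
        return uri in self.files

    def is_dir(self, uri: str) -> bool:
        return uri in self.dirs

    def ls(self, uri: str) -> List[str]:
        return self.dirs[uri]


fake_vfs = FakeVFS(
    files={"s3://bucket/one.h5ad"},
    dirs={"s3://bucket/h5ads/": ["s3://bucket/h5ads/a.h5ad", "s3://bucket/h5ads/b.h5ad"]},
)
print(enumerate_inputs(fake_vfs, "s3://bucket/h5ads/"))
```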