vcf.query

client.vcf.query

Functions

Name Description
build_read_dag Build the DAG for a distributed read on a TileDB-VCF dataset.
concat_tables_udf Concatenate a list of Arrow tables.
read Run a distributed read on a TileDB-VCF dataset.
read_samples Reads sample IDs from a TileDB-VCF dataset.
setup Set the default TileDB context, OS environment variables for AWS, and return a logger instance.
vcf_query_udf Run a query on a TileDB-VCF dataset.

build_read_dag

client.vcf.query.build_read_dag(
    dataset_uri,
    teamspace,
    *,
    config=None,
    attrs=None,
    regions=None,
    bed_file=None,
    num_region_partitions=1,
    dag_name='VCF-Distributed-Query',
    max_workers=MAX_WORKERS,
    samples=None,
    samples_task_id=None,
    memory_budget_mb=1024,
    af_filter=None,
    transform_result=None,
    promote_null=False,
    max_sample_batch_size=MAX_SAMPLE_BATCH_SIZE,
    log_uri=None,
    workspace=None,
    resource_class=None,
    verbose=False,
    batch_mode=False,
    image_name='genomics',
)

Build the DAG for a distributed read on a TileDB-VCF dataset.

:param dataset_uri: dataset URI
:param teamspace: teamspace in which to execute the task graph
:param config: config dictionary, defaults to None
:param attrs: attribute names to read, defaults to None
:param regions: genomic regions to read, defaults to None
:param bed_file: URI of a BED file containing genomic regions to read, defaults to None
:param num_region_partitions: number of region partitions, defaults to 1
:param dag_name: name of the built DAG, defaults to "VCF-Distributed-Query"
:param max_workers: maximum number of workers, defaults to MAX_WORKERS (40)
:param samples: sample names to read ('' for a sample-less query; None for all), defaults to None
:param samples_task_id: ID of a task that fetches sample names, defaults to None
:param memory_budget_mb: VCF memory budget in MiB, defaults to 1024
:param af_filter: allele frequency filter, defaults to None
:param transform_result: function to apply to each partition; by default, the result is not transformed
:param promote_null: for every column with a null dtype, cast it to the dtype of the column it is joined with when the dtypes differ, defaults to False
:param max_sample_batch_size: maximum number of samples to read in a single node, defaults to MAX_SAMPLE_BATCH_SIZE (500)
:param log_uri: log array URI for profiling, defaults to None
:param workspace: TileDB-Cloud workspace, defaults to None
:param resource_class: TileDB-Cloud resource class for UDFs, defaults to None
:param verbose: verbose logging, defaults to False
:param batch_mode: run the query with batch UDFs, defaults to False
:param image_name: UDF image name to use (useful for testing beta features), defaults to 'genomics'
:return: DAG and result Node
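
The interaction between num_region_partitions and max_sample_batch_size can be pictured as a grid of query tasks. The sketch below is an illustration only, not the library's implementation: the helper name plan_query_tasks is hypothetical, and it assumes one task per (region partition, sample batch) pair, using the (0-based index, num_partitions) tuple convention documented for vcf_query_udf.

```python
from math import ceil
from typing import List, Tuple

Partition = Tuple[int, int]  # (0-based index, total number of partitions)


def plan_query_tasks(
    num_samples: int,
    num_region_partitions: int = 1,
    max_sample_batch_size: int = 500,
) -> List[Tuple[Partition, Partition]]:
    """Enumerate hypothetical (region_partition, sample_partition) pairs."""
    if num_region_partitions < 1 or max_sample_batch_size < 1:
        raise ValueError("partition count and batch size must be positive")
    # Use enough sample partitions that none exceeds max_sample_batch_size;
    # keep at least one so a query without a sample list still yields tasks.
    num_sample_partitions = max(1, ceil(num_samples / max_sample_batch_size))
    return [
        ((r, num_region_partitions), (s, num_sample_partitions))
        for r in range(num_region_partitions)
        for s in range(num_sample_partitions)
    ]


tasks = plan_query_tasks(num_samples=1200, num_region_partitions=2)
print(len(tasks))  # 6: 2 region partitions x 3 sample batches
print(tasks[0])    # ((0, 2), (0, 3))
```

Under this assumption, raising num_region_partitions or lowering max_sample_batch_size increases the number of tasks, while max_workers would bound how many of them run at once.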

concat_tables_udf

client.vcf.query.concat_tables_udf(
    tables,
    *,
    config=None,
    promote_null=False,
    log_uri=None,
    verbose=False,
)

Concatenate a list of Arrow tables.

:param tables: Arrow tables
:param config: config dictionary, defaults to None
:param promote_null: for every column with a null dtype, cast it to the dtype of the column it is joined with when the dtypes differ, defaults to False
:param log_uri: log URI for profiling, defaults to None
:param verbose: verbose logging, defaults to False
:return: concatenated Arrow table
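
The promote_null description is terse, so here is a schema-level sketch of the idea in plain Python. It is an assumption about the intent, not the library's code: the helper promote_null_schema, the dtype strings, and the column names are all hypothetical, and real Arrow tables are replaced by dictionaries mapping column names to dtype names.

```python
from typing import Dict, List

NULL = "null"  # stand-in for Arrow's null dtype in this sketch


def promote_null_schema(schemas: List[Dict[str, str]]) -> Dict[str, str]:
    """Unify column dtypes, letting a null-typed column adopt a concrete dtype."""
    unified: Dict[str, str] = {}
    for schema in schemas:
        for name, dtype in schema.items():
            current = unified.get(name, NULL)
            if current == NULL:
                # Nothing concrete seen yet: take whatever this table offers.
                unified[name] = dtype
            elif dtype != NULL and dtype != current:
                # Two different concrete dtypes are a genuine conflict.
                raise TypeError(f"column {name!r}: {current} vs {dtype}")
    return unified


# Hypothetical case: the first table has a null dtype for "info_AF",
# while the second table has a concrete dtype for the same column.
schemas = [
    {"sample_name": "string", "info_AF": NULL},
    {"sample_name": "string", "info_AF": "float"},
]
print(promote_null_schema(schemas))
# {'sample_name': 'string', 'info_AF': 'float'}
```

Once the null-typed column has been cast to the concrete dtype, the tables share one schema and can be concatenated.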

read

client.vcf.query.read(
    dataset_uri,
    teamspace,
    *,
    config=None,
    attrs=None,
    regions=None,
    bed_file=None,
    num_region_partitions=1,
    dag_name='VCF-Distributed-Query',
    max_workers=MAX_WORKERS,
    samples=None,
    samples_task_id=None,
    memory_budget_mb=1024,
    af_filter=None,
    transform_result=None,
    promote_null=False,
    max_sample_batch_size=MAX_SAMPLE_BATCH_SIZE,
    log_uri=None,
    workspace=None,
    resource_class=None,
    verbose=False,
    batch_mode=False,
    image_name='genomics',
)

Run a distributed read on a TileDB-VCF dataset.

:param dataset_uri: dataset URI
:param teamspace: teamspace in which to execute the task graph
:param config: config dictionary, defaults to None
:param attrs: attribute names to read, defaults to None
:param regions: genomic regions to read, defaults to None
:param bed_file: URI of a BED file containing genomic regions to read, defaults to None
:param num_region_partitions: number of region partitions, defaults to 1
:param dag_name: name of the read DAG, defaults to "VCF-Distributed-Query"
:param max_workers: maximum number of workers, defaults to MAX_WORKERS (40)
:param samples: sample names to read ('' for a sample-less query; None for all), defaults to None
:param samples_task_id: ID of a task that fetches sample names, defaults to None
:param memory_budget_mb: VCF memory budget in MiB, defaults to 1024
:param af_filter: allele frequency filter, defaults to None
:param transform_result: function to apply to each partition; by default, the result is not transformed
:param promote_null: for every column with a null dtype, cast it to the dtype of the column it is joined with when the dtypes differ, defaults to False
:param max_sample_batch_size: maximum number of samples to read in a single node, defaults to MAX_SAMPLE_BATCH_SIZE (500)
:param log_uri: log array URI for profiling, defaults to None
:param workspace: TileDB-Cloud workspace, defaults to None
:param resource_class: TileDB-Cloud resource class for UDFs, defaults to None
:param verbose: verbose logging, defaults to False
:param batch_mode: run the query with batch UDFs, defaults to False
:param image_name: UDF image name to use (useful for testing beta features), defaults to 'genomics'
:return: Arrow table containing the query results

read_samples

client.vcf.query.read_samples(
    dataset_uri,
    config=None,
    verbose=False,
    image_name='genomics',
)

Reads sample IDs from a TileDB-VCF dataset.

:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False
:param image_name: UDF image name to use (useful for testing beta features), defaults to 'genomics'
:return: list of sample IDs as strings

setup

client.vcf.query.setup(config=None, verbose=False)

Set the default TileDB context, OS environment variables for AWS, and return a logger instance.

:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False
:return: logger instance

vcf_query_udf

client.vcf.query.vcf_query_udf(
    dataset_uri,
    *,
    config=None,
    attrs=None,
    regions=None,
    bed_file=None,
    samples=None,
    samples_task_id=None,
    region_partition=None,
    sample_partition=None,
    memory_budget_mb=1024,
    af_filter=None,
    transform_result=None,
    promote_null=False,
    log_uri=None,
    log_id='query',
    verbose=False,
    image_name='genomics',
)

Run a query on a TileDB-VCF dataset.

:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param attrs: attribute names to read, defaults to None
:param regions: genomic regions to read, defaults to None
:param bed_file: URI of a BED file containing genomic regions to read, defaults to None
:param samples: sample names to read ('' for a sample-less query; None for all), defaults to None
:param samples_task_id: ID of a task that fetches sample names, defaults to None
:param region_partition: region partition tuple (0-based index, num_partitions), defaults to None
:param sample_partition: sample partition tuple (0-based index, num_partitions), defaults to None
:param memory_budget_mb: VCF memory budget in MiB, defaults to 1024
:param af_filter: allele frequency filter, defaults to None
:param transform_result: function to apply to the result table; by default, the result is not transformed
:param promote_null: for every column with a null dtype, cast it to the dtype of the column it is joined with when the dtypes differ, defaults to False
:param log_uri: log array URI for profiling, defaults to None
:param log_id: profiler event ID, defaults to "query"
:param verbose: verbose logging, defaults to False
:param image_name: UDF image name to use (useful for testing beta features), defaults to 'genomics'
:return: Arrow table containing the query results
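
To make the (0-based index, num_partitions) tuple convention of region_partition and sample_partition concrete, the following sketch shows one way such a tuple could select a share of a list. It is purely illustrative: the helper take_partition and the region strings are hypothetical, and the contiguous split used here is an assumption, not necessarily how TileDB-VCF divides the work internally.

```python
from typing import List, Sequence, Tuple


def take_partition(items: Sequence[str], partition: Tuple[int, int]) -> List[str]:
    """Return the slice of items addressed by (index, num_partitions)."""
    index, num_partitions = partition
    if not 0 <= index < num_partitions:
        raise ValueError("index must satisfy 0 <= index < num_partitions")
    # Contiguous split: the first (len(items) % num_partitions) partitions
    # each receive one extra item.
    base, extra = divmod(len(items), num_partitions)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return list(items[start:stop])


# Hypothetical region strings, used only as sample data.
regions = ["chr1:1-1000", "chr1:5000-6000", "chr2:1-500", "chr3:1-200", "chr4:1-100"]
parts = [take_partition(regions, (i, 2)) for i in range(2)]
print(parts[0])  # ['chr1:1-1000', 'chr1:5000-6000', 'chr2:1-500']
print(parts[1])  # ['chr3:1-200', 'chr4:1-100']
```

Whatever the actual splitting rule, the useful property is the same: the partitions with indices 0 through num_partitions - 1 together cover every item exactly once.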