vcf.query
client.vcf.query
Functions
| Name | Description |
|---|---|
| build_read_dag | Build the DAG for a distributed read on a TileDB-VCF dataset. |
| concat_tables_udf | Concatenate a list of Arrow tables. |
| read | Run a distributed read on a TileDB-VCF dataset. |
| read_samples | Read sample IDs from a TileDB-VCF dataset. |
| setup | Set the default TileDB context, OS environment variables for AWS, and return a logger instance. |
| vcf_query_udf | Run a query on a TileDB-VCF dataset. |
build_read_dag
client.vcf.query.build_read_dag(
dataset_uri,
teamspace,
*,
config=None,
attrs=None,
regions=None,
bed_file=None,
num_region_partitions=1,
dag_name='VCF-Distributed-Query',
max_workers=MAX_WORKERS,
samples=None,
samples_task_id=None,
memory_budget_mb=1024,
af_filter=None,
transform_result=None,
promote_null=False,
max_sample_batch_size=MAX_SAMPLE_BATCH_SIZE,
log_uri=None,
workspace=None,
resource_class=None,
verbose=False,
batch_mode=False,
image_name='genomics',
)
Build the DAG for a distributed read on a TileDB-VCF dataset.
:param dataset_uri: dataset URI
:param teamspace: teamspace in which to execute the task graph
:param config: config dictionary, defaults to None
:param attrs: attribute names to read, defaults to None
:param regions: genomic regions to read, defaults to None
:param bed_file: URI of a BED file containing genomic regions to read, defaults to None
:param num_region_partitions: number of region partitions, defaults to 1
:param dag_name: name of the built DAG, defaults to "VCF-Distributed-Query"
:param max_workers: maximum number of workers, defaults to MAX_WORKERS (40)
:param samples: sample names to read ('' for a sample-less query; None for all), defaults to None
:param samples_task_id: ID of a task that fetches sample names, defaults to None
:param memory_budget_mb: VCF memory budget in MiB, defaults to 1024
:param af_filter: allele frequency filter, defaults to None
:param transform_result: function to apply to each partition; by default, the result is not transformed
:param promote_null: for every column with a null dtype, cast it to the dtype of the joining column when the dtypes differ, defaults to False
:param max_sample_batch_size: maximum number of samples to read in a single node, defaults to MAX_SAMPLE_BATCH_SIZE (500)
:param log_uri: log array URI for profiling, defaults to None
:param workspace: TileDB-Cloud workspace, defaults to None
:param resource_class: TileDB-Cloud resource class for UDFs, defaults to None
:param verbose: verbose logging, defaults to False
:param batch_mode: run the query with batch UDFs, defaults to False
:param image_name: UDF image name to use, useful for testing beta features, defaults to "genomics"
:return: the DAG and its result Node
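
As a minimal sketch of how the parameters above fit together, the helper below builds a DAG for a single region without running it. The `query` argument stands for whatever object exposes `build_read_dag` (for example the imported `client.vcf.query` module; the import path is not shown in this reference, so it is left to the caller). The URI, teamspace name, and the `contig:start-end` region string are hypothetical, and the stand-in module exists only so that the sketch runs without a TileDB account.

```python
from types import SimpleNamespace


def build_region_query_dag(query, dataset_uri, teamspace, regions, samples=None):
    """Build (but do not run) a distributed read DAG for the given regions.

    `query` is any object exposing build_read_dag with the signature
    documented above.
    """
    dag, result_node = query.build_read_dag(
        dataset_uri,
        teamspace,
        regions=regions,
        samples=samples,          # None reads all samples
        num_region_partitions=2,  # split the regions across two partitions
        memory_budget_mb=1024,    # documented default, spelled out for clarity
    )
    return dag, result_node


# Stand-in for the real module (assumption: used for illustration only).
calls = []


def _fake_build_read_dag(dataset_uri, teamspace, **kwargs):
    calls.append((dataset_uri, teamspace, kwargs))
    return "dag", "node"


fake_query = SimpleNamespace(build_read_dag=_fake_build_read_dag)

dag, node = build_region_query_dag(
    fake_query,
    "tiledb://example/vcf-dataset",  # hypothetical dataset URI
    "example-teamspace",             # hypothetical teamspace
    regions=["chr1:1-100000"],       # assumed region format
)
```

Passing the module in as an argument keeps the helper easy to test; with the real module, the returned pair is the DAG and result Node described under `:return:`.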
concat_tables_udf
client.vcf.query.concat_tables_udf(
tables,
*,
config=None,
promote_null=False,
log_uri=None,
verbose=False,
)
Concatenate a list of Arrow tables.
:param tables: Arrow tables
:param config: config dictionary, defaults to None
:param promote_null: for every column with a null dtype, cast it to the dtype of the joining column when the dtypes differ, defaults to False
:param log_uri: log URI for profiling, defaults to None
:param verbose: verbose logging, defaults to False
:return: concatenated Arrow table
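
The `promote_null` rule can be illustrated without Arrow. The sketch below is not the library's implementation; it only models the described behaviour on plain dictionaries that map column names to dtype names, where `"null"` marks an all-null column. The column names are hypothetical.

```python
def unify_column_dtypes(schemas):
    """Resolve one dtype per column across several table schemas.

    A column whose dtype is "null" in one schema takes the concrete dtype
    of the matching column in another schema. Two different concrete
    dtypes for the same column are treated as an error.
    """
    unified = {}
    for schema in schemas:
        for column, dtype in schema.items():
            current = unified.get(column)
            if current is None or current == "null":
                # First sighting, or only nulls so far: adopt this dtype.
                unified[column] = dtype
            elif dtype != "null" and dtype != current:
                raise TypeError(f"{column}: cannot join {current} with {dtype}")
    return unified


schemas = [
    {"sample_name": "string", "info_AF": "null"},   # partition with no AF values
    {"sample_name": "string", "info_AF": "float"},  # partition with AF values
]
unified = unify_column_dtypes(schemas)
print(unified)
```

Here the all-null `info_AF` column of the first schema is promoted to `float`, so both partitions can be concatenated under one schema.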
read
client.vcf.query.read(
dataset_uri,
teamspace,
*,
config=None,
attrs=None,
regions=None,
bed_file=None,
num_region_partitions=1,
dag_name='VCF-Distributed-Query',
max_workers=MAX_WORKERS,
samples=None,
samples_task_id=None,
memory_budget_mb=1024,
af_filter=None,
transform_result=None,
promote_null=False,
max_sample_batch_size=MAX_SAMPLE_BATCH_SIZE,
log_uri=None,
workspace=None,
resource_class=None,
verbose=False,
batch_mode=False,
image_name='genomics',
)
Run a distributed read on a TileDB-VCF dataset.
:param dataset_uri: dataset URI
:param teamspace: teamspace in which to execute the task graph
:param config: config dictionary, defaults to None
:param attrs: attribute names to read, defaults to None
:param regions: genomic regions to read, defaults to None
:param bed_file: URI of a BED file containing genomic regions to read, defaults to None
:param num_region_partitions: number of region partitions, defaults to 1
:param dag_name: name of the read DAG, defaults to "VCF-Distributed-Query"
:param max_workers: maximum number of workers, defaults to MAX_WORKERS (40)
:param samples: sample names to read ('' for a sample-less query; None for all), defaults to None
:param samples_task_id: ID of a task that fetches sample names, defaults to None
:param memory_budget_mb: VCF memory budget in MiB, defaults to 1024
:param af_filter: allele frequency filter, defaults to None
:param transform_result: function to apply to each partition; by default, the result is not transformed
:param promote_null: for every column with a null dtype, cast it to the dtype of the joining column when the dtypes differ, defaults to False
:param max_sample_batch_size: maximum number of samples to read in a single node, defaults to MAX_SAMPLE_BATCH_SIZE (500)
:param log_uri: log array URI for profiling, defaults to None
:param workspace: TileDB-Cloud workspace, defaults to None
:param resource_class: TileDB-Cloud resource class for UDFs, defaults to None
:param verbose: verbose logging, defaults to False
:param batch_mode: run the query with batch UDFs, defaults to False
:param image_name: UDF image name to use, useful for testing beta features, defaults to "genomics"
:return: Arrow table containing the query results
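
The `transform_result` hook is easiest to understand with a small example. The sketch below assumes that the function receives each partition's result as an Arrow table (so `Table.select` is available) and that the dataset exposes the columns named in `CORE_COLUMNS`; both are assumptions, as is passing the `query` module in as an argument.

```python
# Assumed column names; adjust to the attributes present in the dataset.
CORE_COLUMNS = ["sample_name", "contig", "pos_start"]


def keep_core_columns(table):
    """Hypothetical transform: keep only a few columns of a partition's table."""
    return table.select(CORE_COLUMNS)


def read_core_columns(query, dataset_uri, teamspace, regions):
    """Run a distributed read and trim every partition before concatenation.

    `query` is any object exposing read with the signature documented above.
    """
    return query.read(
        dataset_uri,
        teamspace,
        regions=regions,
        transform_result=keep_core_columns,  # applied to each partition
    )
```

Trimming inside the transform means less data is carried from each partition into the final concatenated table.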
read_samples
client.vcf.query.read_samples(
dataset_uri,
config=None,
verbose=False,
image_name='genomics',
)
Read sample IDs from a TileDB-VCF dataset.
:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False
:param image_name: UDF image name to use, useful for testing beta features, defaults to "genomics"
:return: list of sample IDs as strings
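
Because `read_samples` returns a plain list of strings, the list can be split into batches before it is passed on as `samples`. The helper below is an illustrative sketch, not the library's code; it shows one way a list might be cut so that no batch exceeds a limit such as `max_sample_batch_size` (500 by default). The sample IDs are made up.

```python
def batch_samples(sample_ids, max_batch_size=500):
    """Split a list of sample IDs into consecutive batches of bounded size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        sample_ids[start:start + max_batch_size]
        for start in range(0, len(sample_ids), max_batch_size)
    ]


# Hypothetical sample IDs standing in for the result of read_samples.
sample_ids = [f"sample_{i:04d}" for i in range(1201)]
batches = batch_samples(sample_ids, max_batch_size=500)
print([len(batch) for batch in batches])
```

With 1201 samples and a limit of 500, the helper yields three batches, the last one holding the remaining 201 IDs.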
setup
client.vcf.query.setup(config=None, verbose=False)
Set the default TileDB context, OS environment variables for AWS, and return a logger instance.
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False
:return: logger instance
vcf_query_udf
client.vcf.query.vcf_query_udf(
dataset_uri,
*,
config=None,
attrs=None,
regions=None,
bed_file=None,
samples=None,
samples_task_id=None,
region_partition=None,
sample_partition=None,
memory_budget_mb=1024,
af_filter=None,
transform_result=None,
promote_null=False,
log_uri=None,
log_id='query',
verbose=False,
image_name='genomics',
)
Run a query on a TileDB-VCF dataset.
:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param attrs: attribute names to read, defaults to None
:param regions: genomic regions to read, defaults to None
:param bed_file: URI of a BED file containing genomic regions to read, defaults to None
:param samples: sample names to read ('' for a sample-less query; None for all), defaults to None
:param samples_task_id: ID of a task that fetches sample names, defaults to None
:param region_partition: region partition tuple (0-based index, num_partitions), defaults to None
:param sample_partition: sample partition tuple (0-based index, num_partitions), defaults to None
:param memory_budget_mb: VCF memory budget in MiB, defaults to 1024
:param af_filter: allele frequency filter, defaults to None
:param transform_result: function to apply to the result table; by default, the result is not transformed
:param promote_null: for every column with a null dtype, cast it to the dtype of the joining column when the dtypes differ, defaults to False
:param log_uri: log array URI for profiling, defaults to None
:param log_id: profiler event ID, defaults to "query"
:param verbose: verbose logging, defaults to False
:param image_name: UDF image name to use, useful for testing beta features, defaults to "genomics"
:return: Arrow table containing the query results
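
The `region_partition` and `sample_partition` arguments are both `(index, num_partitions)` tuples with a 0-based index. The sketch below enumerates every combination for a given number of region and sample partitions. Whether `build_read_dag` schedules one `vcf_query_udf` call per combination in exactly this order is an assumption; the helper only shows the shape of the tuples.

```python
from itertools import product


def partition_grid(num_region_partitions, num_sample_partitions):
    """List every (region_partition, sample_partition) pair.

    Each element of a pair is an (index, num_partitions) tuple, with the
    index counted from 0 as described in the parameter list above.
    """
    return [
        ((r, num_region_partitions), (s, num_sample_partitions))
        for r, s in product(
            range(num_region_partitions), range(num_sample_partitions)
        )
    ]


grid = partition_grid(2, 2)
for region_partition, sample_partition in grid:
    print(region_partition, sample_partition)
```

Each pair could then be passed as the `region_partition` and `sample_partition` keyword arguments of a single `vcf_query_udf` call.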