vcf.query
cloud.vcf.query
Functions
Name | Description |
---|---|
build_read_dag | Build the DAG for a distributed read on a TileDB-VCF dataset. |
concat_tables_udf | Concatenate a list of Arrow tables. |
read | Run a distributed read on a TileDB-VCF dataset. |
setup | Set the default TileDB context, OS environment variables for AWS, and return a logger instance. |
vcf_query_udf | Run a query on a TileDB-VCF dataset. |
build_read_dag
cloud.vcf.query.build_read_dag(
    dataset_uri,
    *,
    config=None,
    attrs=None,
    regions=None,
    bed_file=None,
    num_region_partitions=1,
    dag_name='VCF-Distributed-Query',
    max_workers=MAX_WORKERS,
    samples=None,
    memory_budget_mb=1024,
    af_filter=None,
    transform_result=None,
    promote_null=False,
    max_sample_batch_size=MAX_SAMPLE_BATCH_SIZE,
    log_uri=None,
    namespace=None,
    resource_class=None,
    verbose=False,
    batch_mode=False,
)
Build the DAG for a distributed read on a TileDB-VCF dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
attrs | Optional[Union[Sequence[str], str]] | attribute names to read, defaults to None | None |
regions | Optional[Union[Sequence[str], str, Delayed, DelayedArrayUDF, DelayedMultiArrayUDF, DelayedSQL]] | genomic regions to read, defaults to None | None |
bed_file | Optional[str] | URI of a BED file containing genomic regions to read, defaults to None | None |
num_region_partitions | int | number of region partitions, defaults to 1 | 1 |
dag_name | str | the name of the built DAG, defaults to “VCF-Distributed-Query” | 'VCF-Distributed-Query' |
max_workers | int | maximum number of workers, defaults to 40 | MAX_WORKERS |
samples | Optional[Union[Sequence[str], str, Delayed, DelayedArrayUDF, DelayedMultiArrayUDF, DelayedSQL]] | sample names to read, defaults to None | None |
memory_budget_mb | int | VCF memory budget in MiB, defaults to 1024 | 1024 |
af_filter | Optional[str] | allele frequency filter, defaults to None | None |
transform_result | Optional[Callable[[pa.Table], pa.Table]] | function to apply to each partition; by default, does not transform the result | None |
promote_null | bool | For all cols with null dtype, cast each as dtype of joining col when dtypes are different | False |
max_sample_batch_size | int | maximum number of samples to read in a single node, defaults to 500 | MAX_SAMPLE_BATCH_SIZE |
log_uri | Optional[str] | log array URI for profiling, defaults to None | None |
namespace | Optional[str] | TileDB-Cloud namespace, defaults to None | None |
resource_class | Optional[str] | TileDB-Cloud resource class for UDFs, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
batch_mode | bool | run the query with batch UDFs, defaults to False | False |
Returns
Type | Description |
---|---|
Tuple[tiledb.cloud.dag.DAG, tiledb.cloud.dag.Node] | DAG and result Node |
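The following is a minimal sketch of how the returned `(DAG, Node)` pair might be used. It assumes the module is importable as `tiledb.cloud.vcf.query` and that the DAG and node expose `compute()`, `wait()`, and `result()` as in `tiledb.cloud.dag`; none of this is stated on this page. The helper names `read_dag_kwargs` and `run_read_dag` and the example region string are hypothetical.

```python
from typing import Any, Dict, Optional, Sequence


def read_dag_kwargs(
    regions: Sequence[str],
    samples: Optional[Sequence[str]] = None,
    num_region_partitions: int = 1,
    max_workers: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments, omitting unset ones so documented defaults apply."""
    kwargs: Dict[str, Any] = {
        "regions": list(regions),
        "num_region_partitions": num_region_partitions,
    }
    if samples is not None:
        kwargs["samples"] = list(samples)
    if max_workers is not None:
        kwargs["max_workers"] = max_workers
    return kwargs


def run_read_dag(dataset_uri: str, **kwargs: Any):
    """Build the DAG, run it, and return the value held by the result node."""
    # Deferred import (assumed module path): only needed when a DAG is built.
    from tiledb.cloud.vcf import query

    dag, result_node = query.build_read_dag(dataset_uri, **kwargs)
    dag.compute()  # assumed DAG method: submit the graph for execution
    dag.wait()  # assumed DAG method: block until the graph finishes
    return result_node.result()  # assumed Node method: fetch the result
```

Building the DAG separately, rather than calling `read` directly, leaves room to inspect or extend the graph before it runs.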
concat_tables_udf
cloud.vcf.query.concat_tables_udf(
    tables,
    *,
    config=None,
    promote_null=False,
    log_uri=None,
    verbose=False,
)
Concatenate a list of Arrow tables.
Parameters
Name | Type | Description | Default |
---|---|---|---|
tables | List[pa.Table] | Arrow tables | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
promote_null | bool | For all cols with null dtype, cast each as dtype of joining col when dtypes are different | False |
log_uri | Optional[str] | log URI for profiling, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
pa.Table | concatenated Arrow table |
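The `promote_null` description is terse, so here is a small pure-Python sketch of one reading of it: a column whose type is null in one table takes the type that the same column has in another table. This is an illustration only, not the library's implementation; schemas are modeled as plain dictionaries of type names, and `promote_null_types` is a hypothetical helper.

```python
from typing import Dict, List

NULL = "null"  # stand-in for Arrow's null type in this sketch


def promote_null_types(schemas: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Replace null column types with the first non-null type seen for that column."""
    resolved: Dict[str, str] = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if dtype != NULL and name not in resolved:
                resolved[name] = dtype

    promoted = []
    for schema in schemas:
        promoted.append(
            {
                # A column that is null in every table stays null.
                name: resolved.get(name, NULL) if dtype == NULL else dtype
                for name, dtype in schema.items()
            }
        )
    return promoted


schemas = [
    {"pos_start": "int32", "info_AF": NULL},
    {"pos_start": "int32", "info_AF": "float"},
]
print(promote_null_types(schemas))
```

Once every table agrees on column types, the tables can be concatenated without a schema mismatch.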
read
cloud.vcf.query.read(
    dataset_uri,
    *,
    config=None,
    attrs=None,
    regions=None,
    bed_file=None,
    num_region_partitions=1,
    dag_name='VCF-Distributed-Query',
    max_workers=MAX_WORKERS,
    samples=None,
    memory_budget_mb=1024,
    af_filter=None,
    transform_result=None,
    promote_null=False,
    max_sample_batch_size=MAX_SAMPLE_BATCH_SIZE,
    log_uri=None,
    namespace=None,
    resource_class=None,
    verbose=False,
    batch_mode=False,
)
Run a distributed read on a TileDB-VCF dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
attrs | Optional[Union[Sequence[str], str]] | attribute names to read, defaults to None | None |
regions | Optional[Union[Sequence[str], str, Delayed, DelayedArrayUDF, DelayedMultiArrayUDF, DelayedSQL]] | genomic regions to read, defaults to None | None |
bed_file | Optional[str] | URI of a BED file containing genomic regions to read, defaults to None | None |
num_region_partitions | int | number of region partitions, defaults to 1 | 1 |
dag_name | str | the name of the read DAG, defaults to “VCF-Distributed-Query” | 'VCF-Distributed-Query' |
max_workers | int | maximum number of workers, defaults to 40 | MAX_WORKERS |
samples | Optional[Union[Sequence[str], str, Delayed, DelayedArrayUDF, DelayedMultiArrayUDF, DelayedSQL]] | sample names to read, defaults to None | None |
memory_budget_mb | int | VCF memory budget in MiB, defaults to 1024 | 1024 |
af_filter | Optional[str] | allele frequency filter, defaults to None | None |
transform_result | Optional[Callable[[pa.Table], pa.Table]] | function to apply to each partition; by default, does not transform the result | None |
promote_null | bool | For all cols with null dtype, cast each as dtype of joining col when dtypes are different | False |
max_sample_batch_size | int | maximum number of samples to read in a single node, defaults to 500 | MAX_SAMPLE_BATCH_SIZE |
log_uri | Optional[str] | log array URI for profiling, defaults to None | None |
namespace | Optional[str] | TileDB-Cloud namespace, defaults to None | None |
resource_class | Optional[str] | TileDB-Cloud resource class for UDFs, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
batch_mode | bool | run the query with batch UDFs, defaults to False | False |
Returns
Type | Description |
---|---|
pa.Table | Arrow table containing the query results |
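A call to `read` might look like the sketch below. It assumes the module path `tiledb.cloud.vcf.query` and a `contig:start-end` region string format; neither is confirmed on this page. `format_region` and `read_intervals` are hypothetical helpers.

```python
from typing import Any, Iterable, Tuple


def format_region(contig: str, start: int, end: int) -> str:
    """Format one interval as 'contig:start-end' (assumed region syntax)."""
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return f"{contig}:{start}-{end}"


def read_intervals(
    dataset_uri: str,
    intervals: Iterable[Tuple[str, int, int]],
    **kwargs: Any,
):
    """Run a distributed read over the given intervals and return the result table."""
    # Deferred import (assumed module path): only needed when a read is issued.
    from tiledb.cloud.vcf import query

    regions = [format_region(*interval) for interval in intervals]
    # Extra keyword arguments (samples, attrs, max_workers, ...) pass through.
    return query.read(dataset_uri, regions=regions, **kwargs)
```

Keeping region formatting in its own function makes it easy to validate intervals before any cloud resources are used.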
setup
cloud.vcf.query.setup(config=None, verbose=False)
Set the default TileDB context, OS environment variables for AWS, and return a logger instance.
Parameters
Name | Type | Description | Default |
---|---|---|---|
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
logging.Logger | logger instance |
vcf_query_udf
cloud.vcf.query.vcf_query_udf(
    dataset_uri,
    *,
    config=None,
    attrs=None,
    regions=None,
    bed_file=None,
    samples=None,
    region_partition=None,
    sample_partition=None,
    memory_budget_mb=1024,
    af_filter=None,
    transform_result=None,
    promote_null=False,
    log_uri=None,
    log_id='query',
    verbose=False,
)
Run a query on a TileDB-VCF dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
attrs | Optional[Union[Sequence[str], str]] | attribute names to read, defaults to None | None |
regions | Optional[Union[Sequence[str], str, pd.DataFrame]] | genomic regions to read, defaults to None | None |
bed_file | Optional[str] | URI of a BED file containing genomic regions to read, defaults to None | None |
samples | Optional[Union[Sequence[str], str]] | sample names to read, defaults to None | None |
region_partition | Optional[Tuple[int, int]] | region partition tuple (0-based index, num_partitions), defaults to None | None |
sample_partition | Optional[Tuple[int, int]] | sample partition tuple (0-based index, num_partitions), defaults to None | None |
memory_budget_mb | int | VCF memory budget in MiB, defaults to 1024 | 1024 |
af_filter | Optional[str] | allele frequency filter, defaults to None | None |
transform_result | Optional[Callable[[pa.Table], pa.Table]] | function to apply to the result table; by default, does not transform the result | None |
promote_null | bool | For all cols with null dtype, cast each as dtype of joining col when dtypes are different | False |
log_uri | Optional[str] | log array URI for profiling, defaults to None | None |
log_id | str | profiler event ID, defaults to “query” | 'query' |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
pa.Table | Arrow table containing the query results |
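The `region_partition` and `sample_partition` tuples follow the `(0-based index, num_partitions)` convention from the table above. The sketch below enumerates every pair of tuples needed to cover a grid of partitions. How `read` actually distributes work across UDF calls is not described on this page, so this only illustrates the tuple convention; `partition_grid` is a hypothetical helper.

```python
from itertools import product
from typing import List, Tuple

Partition = Tuple[int, int]  # (0-based index, num_partitions)


def partition_grid(
    num_region_partitions: int,
    num_sample_partitions: int,
) -> List[Tuple[Partition, Partition]]:
    """Enumerate every (region_partition, sample_partition) pair in the grid."""
    if num_region_partitions < 1 or num_sample_partitions < 1:
        raise ValueError("partition counts must be at least 1")
    return [
        ((r, num_region_partitions), (s, num_sample_partitions))
        for r, s in product(
            range(num_region_partitions), range(num_sample_partitions)
        )
    ]


# Each pair could be passed as region_partition=... and sample_partition=...
for region_partition, sample_partition in partition_grid(2, 2):
    print(region_partition, sample_partition)
```

Each query then covers one cell of the grid, and the resulting tables can be combined with `concat_tables_udf`.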