vcf.ingestion
cloud.vcf.ingestion
Classes
Name | Description |
---|---|
Contigs | The contigs to ingest. |
Contigs
cloud.vcf.ingestion.Contigs()
The contigs to ingest.
Name | Description |
---|---|
ALL | all contigs |
CHROMOSOMES | all human chromosomes |
OTHER | all contigs other than the human chromosomes |
ALL_DISABLE_MERGE | all contigs with merging disabled, for non-human datasets |
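The four modes can be pictured as a partition of the contigs found in the input. The sketch below is illustrative only: `ContigMode`, `HUMAN_CHROMOSOMES`, and `select_contigs` are hypothetical stand-ins, and the chromosome naming (`chr1` to `chr22`, `chrX`, `chrY`, `chrM`) is an assumption, not the library's definition.

```python
import enum
from typing import List, Sequence


class ContigMode(enum.Enum):
    """Hypothetical stand-in that mirrors the documented Contigs members."""

    ALL = enum.auto()
    CHROMOSOMES = enum.auto()
    OTHER = enum.auto()
    ALL_DISABLE_MERGE = enum.auto()


# Assumption: "human chromosomes" means chr1..chr22, chrX, chrY, and chrM.
HUMAN_CHROMOSOMES = frozenset(
    [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]
)


def select_contigs(mode: ContigMode, contigs: Sequence[str]) -> List[str]:
    """Return the contigs that the given mode would cover."""
    if mode is ContigMode.CHROMOSOMES:
        return [c for c in contigs if c in HUMAN_CHROMOSOMES]
    if mode is ContigMode.OTHER:
        return [c for c in contigs if c not in HUMAN_CHROMOSOMES]
    # ALL and ALL_DISABLE_MERGE both cover every contig; per the table above,
    # they differ only in whether merging is enabled.
    return list(contigs)


found = ["chr1", "chrX", "chrUn_KI270302v1", "scaffold_12"]
chromosomes = select_contigs(ContigMode.CHROMOSOMES, found)
other = select_contigs(ContigMode.OTHER, found)
```

Under these assumptions, CHROMOSOMES and OTHER together cover exactly what ALL covers.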
Functions
Name | Description |
---|---|
consolidate_dataset_udf | Consolidate arrays in the dataset. |
create_dataset_udf | Create a TileDB-VCF dataset. |
create_manifest | Create a manifest array in the dataset. |
filter_samples_udf | Return URIs for samples not already in the dataset. |
filter_uris_udf | Return URIs from sample_uris that are not in the manifest. |
find_uris_aws_udf | Find URIs matching a pattern in the search_uri path with an efficient implementation for S3. |
find_uris_udf | Find URIs matching a pattern in the search_uri path. |
get_logger_wrapper | Get a logger instance and log version information. |
ingest_manifest_dag | Create a DAG to load the manifest array. |
ingest_manifest_udf | Ingest sample URIs into the manifest array. |
ingest_samples_dag | Create a DAG to ingest samples into the dataset. |
ingest_samples_udf | Ingest samples into the dataset. |
ingest_vcf | Ingest samples into a dataset. |
ingest_vcf_annotations | Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF. |
read_metadata_uris_udf | Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument. |
read_uris_udf | Read a list of URIs from a URI. |
register_dataset_udf | Register the dataset on TileDB Cloud. |
consolidate_dataset_udf
cloud.vcf.ingestion.consolidate_dataset_udf(
    dataset_uri,
    *,
    config=None,
    exclude=MANIFEST_ARRAY,
    include=None,
    id='consolidate',
    verbose=False,
)
Consolidate arrays in the dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
exclude | Optional[Union[Sequence[str], str]] | group members to exclude, defaults to MANIFEST_ARRAY | MANIFEST_ARRAY |
include | Optional[Union[Sequence[str], str]] | group members to include, defaults to None | None |
id | str | profiler event id, defaults to “consolidate” | 'consolidate' |
verbose | bool | verbose logging, defaults to False | False |
create_dataset_udf
cloud.vcf.ingestion.create_dataset_udf(
    dataset_uri,
    *,
    config=None,
    extra_attrs=None,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    annotation_dataset=False,
    verbose=False,
)
Create a TileDB-VCF dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
extra_attrs | Optional[Union[Sequence[str], str]] | INFO/FORMAT fields to materialize, defaults to None | None |
vcf_attrs | Optional[str] | VCF with all INFO/FORMAT fields to materialize, defaults to None | None |
anchor_gap | Optional[int] | anchor gap for VCF dataset, defaults to None | None |
compression_level | Optional[int] | zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) | None |
annotation_dataset | bool | create an annotation dataset, defaults to False | False |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
str | dataset URI |
create_manifest
cloud.vcf.ingestion.create_manifest(dataset_uri, group)
Create a manifest array in the dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
group | tiledb.Group | dataset group | required |
filter_samples_udf
cloud.vcf.ingestion.filter_samples_udf(dataset_uri, *, config=None, verbose=False)
Return URIs for samples not already in the dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
Sequence[str] | sample URIs |
filter_uris_udf
cloud.vcf.ingestion.filter_uris_udf(dataset_uri, sample_uris, *, config=None, verbose=False)
Return URIs from sample_uris that are not in the manifest.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
sample_uris | Sequence[str] | sample URIs | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
Sequence[str] | filtered sample URIs |
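The filtering step amounts to an order-preserving set difference. The sketch below illustrates that idea on plain Python lists; `filter_new_uris` is a hypothetical helper, and comparing by exact URI string is an assumption (the UDF reads the manifest array and may compare differently).

```python
from typing import Iterable, List, Sequence


def filter_new_uris(
    sample_uris: Sequence[str], manifest_uris: Iterable[str]
) -> List[str]:
    """Keep the URIs that are not already recorded, preserving input order."""
    known = set(manifest_uris)  # set lookup keeps the filter linear in size
    return [uri for uri in sample_uris if uri not in known]


# Placeholder URIs for illustration only.
manifest = ["s3://bucket/a.vcf.gz", "s3://bucket/b.vcf.gz"]
candidates = [
    "s3://bucket/a.vcf.gz",
    "s3://bucket/c.vcf.gz",
    "s3://bucket/d.vcf.gz",
]
new_uris = filter_new_uris(candidates, manifest)
```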
find_uris_aws_udf
cloud.vcf.ingestion.find_uris_aws_udf(
    dataset_uri,
    search_uri,
    *,
    config=None,
    include=None,
    exclude=None,
    max_files=None,
    verbose=False,
)
Find URIs matching a pattern in the search_uri path with an efficient implementation for S3.
The include and exclude patterns are Unix shell style (see the fnmatch module).
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
search_uri | str | URI to search for VCF files | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
include | Optional[str] | include pattern used in the search, defaults to None | None |
exclude | Optional[str] | exclude pattern applied to the search results, defaults to None | None |
max_files | Optional[int] | maximum number of URIs returned, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
Sequence[str] | list of URIs |
find_uris_udf
cloud.vcf.ingestion.find_uris_udf(
    dataset_uri,
    search_uri,
    *,
    config=None,
    include=None,
    exclude=None,
    max_files=None,
    verbose=False,
)
Find URIs matching a pattern in the search_uri path.
The include and exclude patterns are Unix shell style (see the fnmatch module).
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
search_uri | str | URI to search for VCF files | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
include | Optional[str] | include pattern used in the search, defaults to None | None |
exclude | Optional[str] | exclude pattern applied to the search results, defaults to None | None |
max_files | Optional[int] | maximum number of URIs returned, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
Sequence[str] | list of URIs |
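To show how Unix shell style patterns behave, the sketch below applies include, exclude, and max_files to an in-memory list with the standard fnmatch module. `match_uris` is a hypothetical helper; matching against the full URI string (rather than the file name) and applying max_files last are assumptions, not the UDF's documented behavior.

```python
import fnmatch
from typing import List, Optional, Sequence


def match_uris(
    uris: Sequence[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
    max_files: Optional[int] = None,
) -> List[str]:
    """Filter URIs with shell-style patterns, then cap the result length."""
    result = []
    for uri in uris:
        # fnmatchcase is case-sensitive on every platform.
        if include is not None and not fnmatch.fnmatchcase(uri, include):
            continue
        if exclude is not None and fnmatch.fnmatchcase(uri, exclude):
            continue
        result.append(uri)
    return result if max_files is None else result[:max_files]


# Placeholder URIs for illustration only.
listing = [
    "s3://bucket/a.vcf.gz",
    "s3://bucket/a.vcf.gz.tbi",
    "s3://bucket/b.vcf.gz",
    "s3://bucket/test_c.vcf.gz",
]
matched = match_uris(listing, include="*.vcf.gz", exclude="*test_*")
first_only = match_uris(listing, include="*.vcf.gz", max_files=1)
```

Note that in fnmatch patterns `*` also matches `/`, so `*.vcf.gz` matches a file at any depth.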
get_logger_wrapper
cloud.vcf.ingestion.get_logger_wrapper(verbose=False)
Get a logger instance and log version information.
Parameters
Name | Type | Description | Default |
---|---|---|---|
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
logging.Logger | logger instance |
ingest_manifest_dag
cloud.vcf.ingestion.ingest_manifest_dag(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    sample_list_uri=None,
    metadata_uri=None,
    metadata_attr='uri',
    max_files=None,
    batch_size=MANIFEST_BATCH_SIZE,
    workers=MANIFEST_WORKERS,
    extra_attrs=None,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    verbose=False,
    aws_find_mode=False,
    disable_manifest=False,
)
Create a DAG to load the manifest array.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
acn | Optional[str] | Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None | None |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
namespace | Optional[str] | TileDB-Cloud namespace, defaults to None | None |
search_uri | Optional[str] | URI to search for VCF files, defaults to None | None |
pattern | Optional[str] | pattern to match when searching for VCF files, defaults to None | None |
ignore | Optional[str] | pattern to ignore when searching for VCF files, defaults to None | None |
sample_list_uri | Optional[str] | URI with a list of VCF URIs, defaults to None | None |
metadata_uri | Optional[str] | URI of metadata array holding VCF URIs, defaults to None | None |
metadata_attr | str | name of metadata attribute containing URIs, defaults to “uri” | 'uri' |
max_files | Optional[int] | maximum number of URIs to ingest, defaults to None | None |
batch_size | int | manifest batch size, defaults to MANIFEST_BATCH_SIZE | MANIFEST_BATCH_SIZE |
workers | int | maximum number of parallel workers, defaults to MANIFEST_WORKERS | MANIFEST_WORKERS |
extra_attrs | Optional[Union[Sequence[str], str]] | INFO/FORMAT fields to materialize, defaults to None | None |
vcf_attrs | Optional[str] | VCF with all INFO/FORMAT fields to materialize, defaults to None | None |
anchor_gap | Optional[int] | anchor gap for VCF dataset, defaults to None | None |
compression_level | Optional[int] | zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) | None |
verbose | bool | verbose logging, defaults to False | False |
aws_find_mode | bool | use AWS CLI to find VCFs, defaults to False | False |
disable_manifest | bool | disable manifest creation, defaults to False | False |
ingest_manifest_udf
cloud.vcf.ingestion.ingest_manifest_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    id='manifest',
    verbose=False,
)
Ingest sample URIs into the manifest array.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
sample_uris | Sequence[str] | sample URIs | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
id | str | profiler event id, defaults to “manifest” | 'manifest' |
verbose | bool | verbose logging, defaults to False | False |
ingest_samples_dag
cloud.vcf.ingestion.ingest_samples_dag(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    contigs=Contigs.ALL,
    threads=VCF_THREADS,
    batch_size=VCF_BATCH_SIZE,
    workers=VCF_WORKERS,
    max_samples=None,
    resume=True,
    ingest_resources=None,
    verbose=False,
    create_index=True,
    trace_id=None,
    consolidate_stats=False,
    use_remote_tmp=False,
    sample_list_uri=None,
)
Create a DAG to ingest samples into the dataset.
Note: If sample_list_uri is provided, the manifest is not checked for existing samples.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
acn | Optional[str] | Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None | None |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
namespace | Optional[str] | TileDB-Cloud namespace, defaults to None | None |
contigs | Optional[Union[Sequence[str], Contigs]] | contig mode (Contigs.ALL, Contigs.CHROMOSOMES, Contigs.OTHER, or Contigs.ALL_DISABLE_MERGE) or a list of contigs to ingest, defaults to Contigs.ALL | Contigs.ALL |
threads | int | number of threads to use per ingestion task, defaults to VCF_THREADS | VCF_THREADS |
batch_size | int | sample batch size, defaults to VCF_BATCH_SIZE | VCF_BATCH_SIZE |
workers | int | maximum number of parallel workers, defaults to VCF_WORKERS | VCF_WORKERS |
max_samples | Optional[int] | maximum number of samples to ingest, defaults to None (no limit) | None |
resume | bool | enable resume ingestion mode, defaults to True | True |
ingest_resources | Optional[Mapping[str, str]] | manual override for ingest UDF resources, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
create_index | bool | force creation of a local index file, defaults to True | True |
trace_id | Optional[str] | trace ID for logging, defaults to None | None |
consolidate_stats | bool | consolidate the stats arrays, defaults to False | False |
use_remote_tmp | bool | use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs) | False |
sample_list_uri | Optional[str] | URI with a list of VCF URIs, defaults to None | None |
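As a rough illustration of how max_samples and batch_size interact, the sketch below caps a sample list and splits it into fixed-size batches. `make_batches` is a hypothetical helper; how the DAG actually schedules batches across workers is not described here, so treat this as a mental model only.

```python
from typing import List, Optional, Sequence


def make_batches(
    sample_uris: Sequence[str],
    batch_size: int,
    max_samples: Optional[int] = None,
) -> List[List[str]]:
    """Cap the sample list, then split it into batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    selected = list(sample_uris)
    if max_samples is not None:
        selected = selected[:max_samples]
    return [
        selected[start : start + batch_size]
        for start in range(0, len(selected), batch_size)
    ]


# Placeholder URIs for illustration only.
samples = [f"s3://bucket/sample_{i}.vcf.gz" for i in range(7)]
batches = make_batches(samples, batch_size=3)
capped = make_batches(samples, batch_size=3, max_samples=4)
```

With seven samples and a batch size of three, the last batch holds the single remaining sample.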
ingest_samples_udf
cloud.vcf.ingestion.ingest_samples_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    threads,
    memory_mb,
    sample_batch_size,
    contig_mode='all',
    contigs_to_keep_separate=None,
    contig_fragment_merging=True,
    resume=True,
    create_index=True,
    id='samples',
    verbose=False,
    trace_id=None,
    use_remote_tmp=False,
)
Ingest samples into the dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
sample_uris | Sequence[str] | sample URIs | required |
threads | int | number of threads to use for ingestion | required |
memory_mb | int | memory to use for ingestion in MiB | required |
sample_batch_size | int | sample batch size to use for ingestion | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
contig_mode | str | ingestion mode, defaults to “all” | 'all' |
contigs_to_keep_separate | Optional[Sequence[str]] | list of contigs to keep separate, defaults to None | None |
contig_fragment_merging | bool | enable contig fragment merging, defaults to True | True |
resume | bool | enable resume ingestion mode, defaults to True | True |
create_index | bool | force creation of a local index file, defaults to True | True |
id | str | profiler event id, defaults to “samples” | 'samples' |
verbose | bool | verbose logging, defaults to False | False |
trace_id | Optional[str] | trace ID for logging, defaults to None | None |
use_remote_tmp | bool | use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs) | False |
ingest_vcf
cloud.vcf.ingestion.ingest_vcf(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    register_name=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    sample_list_uri=None,
    metadata_uri=None,
    metadata_attr='uri',
    max_files=None,
    max_samples=None,
    contigs=Contigs.ALL,
    resume=True,
    extra_attrs=DEFAULT_ATTRIBUTES,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    manifest_batch_size=MANIFEST_BATCH_SIZE,
    manifest_workers=MANIFEST_WORKERS,
    vcf_batch_size=VCF_BATCH_SIZE,
    vcf_workers=VCF_WORKERS,
    vcf_threads=VCF_THREADS,
    ingest_resources=None,
    verbose=False,
    create_index=True,
    trace_id=None,
    consolidate_stats=True,
    aws_find_mode=False,
    use_remote_tmp=False,
    disable_manifest=False,
)
Ingest samples into a dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
acn | Optional[str] | Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None | None |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
namespace | Optional[str] | TileDB-Cloud namespace, defaults to None | None |
register_name | Optional[str] | name to register the dataset with on TileDB Cloud, defaults to None | None |
search_uri | Optional[str] | URI to search for VCF files, defaults to None | None |
pattern | Optional[str] | Unix shell style pattern to match when searching for VCF files, defaults to None | None |
ignore | Optional[str] | Unix shell style pattern to ignore when searching for VCF files, defaults to None | None |
sample_list_uri | Optional[str] | URI with a list of VCF URIs, defaults to None | None |
metadata_uri | Optional[str] | URI of metadata array holding VCF URIs, defaults to None | None |
metadata_attr | str | name of metadata attribute containing URIs, defaults to “uri” | 'uri' |
max_files | Optional[int] | maximum number of VCF URIs to read/find, defaults to None (no limit) | None |
max_samples | Optional[int] | maximum number of samples to ingest, defaults to None (no limit) | None |
contigs | Optional[Union[Sequence[str], Contigs]] | contig mode (Contigs.ALL, Contigs.CHROMOSOMES, Contigs.OTHER, or Contigs.ALL_DISABLE_MERGE) or a list of contigs to ingest, defaults to Contigs.ALL | Contigs.ALL |
resume | bool | enable resume ingestion mode, defaults to True | True |
extra_attrs | Optional[Union[Sequence[str], str]] | INFO/FORMAT fields to materialize, defaults to DEFAULT_ATTRIBUTES | DEFAULT_ATTRIBUTES |
vcf_attrs | Optional[str] | VCF with all INFO/FORMAT fields to materialize, defaults to None | None |
anchor_gap | Optional[int] | anchor gap for VCF dataset, defaults to None | None |
compression_level | Optional[int] | zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) | None |
manifest_batch_size | int | batch size for manifest ingestion, defaults to MANIFEST_BATCH_SIZE | MANIFEST_BATCH_SIZE |
manifest_workers | int | number of workers for manifest ingestion, defaults to MANIFEST_WORKERS | MANIFEST_WORKERS |
vcf_batch_size | int | batch size for VCF ingestion, defaults to VCF_BATCH_SIZE | VCF_BATCH_SIZE |
vcf_workers | int | number of workers for VCF ingestion, defaults to VCF_WORKERS | VCF_WORKERS |
vcf_threads | int | number of threads for VCF ingestion, defaults to VCF_THREADS | VCF_THREADS |
ingest_resources | Optional[Mapping[str, str]] | manual override for ingest UDF resources, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
create_index | bool | force creation of a local index file, defaults to True | True |
trace_id | Optional[str] | trace ID for logging, defaults to None | None |
consolidate_stats | bool | consolidate the stats arrays, defaults to True | True |
aws_find_mode | bool | use AWS CLI to find VCFs, defaults to False | False |
use_remote_tmp | bool | use remote tmp space if VCFs need to be sorted and bgzipped, defaults to False (preferred for small VCFs) | False |
disable_manifest | bool | disable manifest creation, defaults to False | False |
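A call might be assembled as in the sketch below, which uses only parameter names from the table above. The import path `tiledb.cloud.vcf.ingestion` is an assumption (this page shows only `cloud.vcf.ingestion`), and the URIs, namespace, credential name, and the helpers `build_ingest_kwargs` and `run_ingest` are hypothetical placeholders.

```python
from typing import Any, Dict


def build_ingest_kwargs(search_uri: str, namespace: str, acn: str) -> Dict[str, Any]:
    """Collect keyword arguments for a small, resumable trial ingestion."""
    return {
        "search_uri": search_uri,  # where to look for VCF files
        "pattern": "*.vcf.gz",     # Unix shell style pattern to match
        "namespace": namespace,
        "acn": acn,
        "max_files": 10,           # keep the first run small
        "resume": True,
        "verbose": True,
    }


def run_ingest(dataset_uri: str, **kwargs: Any) -> Any:
    """Call ingest_vcf; needs the TileDB Cloud client and valid credentials."""
    # Assumption: the module is importable under the tiledb package.
    from tiledb.cloud.vcf.ingestion import ingest_vcf

    return ingest_vcf(dataset_uri, **kwargs)


kwargs = build_ingest_kwargs(
    search_uri="s3://my-bucket/vcfs/",  # hypothetical location
    namespace="my-namespace",           # hypothetical namespace
    acn="my-credentials",               # hypothetical credential name
)
# run_ingest("s3://my-bucket/vcf-dataset", **kwargs)  # not run here
```

Every parameter other than dataset_uri is keyword-only, so passing them through a dictionary keeps the call readable.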
ingest_vcf_annotations
cloud.vcf.ingestion.ingest_vcf_annotations(
    dataset_uri,
    *,
    vcf_uri=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    create_index=True,
    config=None,
    acn=None,
    namespace=None,
    register_name=None,
    ingest_resources=None,
    verbose=False,
)
Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
vcf_uri | Optional[str] | VCF URI, defaults to None | None |
search_uri | Optional[str] | URI to search for VCF files, defaults to None | None |
pattern | Optional[str] | Unix shell style pattern to match when searching for VCF files, defaults to None | None |
ignore | Optional[str] | Unix shell style pattern to ignore when searching for VCF files, defaults to None | None |
create_index | bool | force creation of a local index file, defaults to True | True |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
acn | Optional[str] | Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None | None |
namespace | Optional[str] | TileDB-Cloud namespace, defaults to None | None |
register_name | Optional[str] | name to register the dataset with on TileDB Cloud, defaults to None | None |
ingest_resources | Optional[Mapping[str, str]] | manual override for ingest UDF resources, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
read_metadata_uris_udf
cloud.vcf.ingestion.read_metadata_uris_udf(
    dataset_uri,
    *,
    config=None,
    metadata_uri,
    metadata_attr='uri',
    max_files=None,
    verbose=False,
)
Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
config | Optional[Mapping[str, Any]] | TileDB config, defaults to None | None |
metadata_uri | str | metadata array URI | required |
metadata_attr | str | name of metadata attribute containing URIs, defaults to “uri” | 'uri' |
max_files | Optional[int] | maximum number of URIs returned, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
Sequence[str] | list of URIs |
read_uris_udf
cloud.vcf.ingestion.read_uris_udf(
    dataset_uri,
    list_uri,
    *,
    config=None,
    max_files=None,
    verbose=False,
)
Read a list of URIs from a URI.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
list_uri | str | URI of the list of URIs | required |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
max_files | Optional[int] | maximum number of URIs returned, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |
Returns
Type | Description |
---|---|
Sequence[str] | list of URIs |
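The format of the list at list_uri is not stated here. Assuming a plain text file with one URI per line, the parsing step could look like the sketch below; `parse_uri_list` is a hypothetical helper that works on text already read from storage.

```python
from typing import List, Optional


def parse_uri_list(text: str, max_files: Optional[int] = None) -> List[str]:
    """Split text into URIs, skipping blank lines, and cap at max_files."""
    # Assumption: one URI per line; surrounding whitespace is ignored.
    uris = [line.strip() for line in text.splitlines() if line.strip()]
    return uris if max_files is None else uris[:max_files]


# Placeholder content for illustration only.
listing_text = """s3://bucket/a.vcf.gz
s3://bucket/b.vcf.gz

s3://bucket/c.vcf.gz
"""
all_uris = parse_uri_list(listing_text)
first_two = parse_uri_list(listing_text, max_files=2)
```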
register_dataset_udf
cloud.vcf.ingestion.register_dataset_udf(
    dataset_uri,
    *,
    register_name,
    acn,
    namespace=None,
    config=None,
    verbose=False,
)
Register the dataset on TileDB Cloud.
Parameters
Name | Type | Description | Default |
---|---|---|---|
dataset_uri | str | dataset URI | required |
register_name | str | name to register the dataset with on TileDB Cloud | required |
acn | str | Access Credentials Name (ACN) registered in TileDB Cloud (ARN type) | required |
namespace | Optional[str] | TileDB Cloud namespace, defaults to the user’s default namespace | None |
config | Optional[Mapping[str, Any]] | config dictionary, defaults to None | None |
verbose | bool | verbose logging, defaults to False | False |