vcf.ingestion
client.vcf.ingestion
Classes
| Name | Description |
|---|---|
| Contigs | The contigs to ingest. |
Contigs
client.vcf.ingestion.Contigs()
The contigs to ingest.
| Value | Description |
|---|---|
| ALL | all contigs |
| CHROMOSOMES | all human chromosomes |
| OTHER | all contigs other than the human chromosomes |
| ALL_DISABLE_MERGE | all contigs with merging disabled, for non-human datasets |
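To make the four modes concrete, the sketch below partitions a list of contig names in the way the descriptions above suggest. It is illustrative only: the `HUMAN_CHROMOSOMES` list (chr1 to chr22, chrX, chrY, chrM) and the `select_contigs` helper are assumptions and not part of the library, the modes are passed as plain strings rather than `Contigs` members, and the merging behavior that distinguishes `ALL_DISABLE_MERGE` from `ALL` is not modeled.

```python
# Hypothetical helper: the chromosome naming below is an assumption,
# not the list used by the library.
HUMAN_CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def select_contigs(mode, contigs):
    """Return the contigs that a given mode would select from `contigs`."""
    human = set(HUMAN_CHROMOSOMES)
    if mode in ("ALL", "ALL_DISABLE_MERGE"):
        # Both modes select every contig; they differ only in merging.
        return list(contigs)
    if mode == "CHROMOSOMES":
        return [c for c in contigs if c in human]
    if mode == "OTHER":
        return [c for c in contigs if c not in human]
    raise ValueError(f"unknown contig mode: {mode}")


contigs = ["chr1", "chrX", "chrUn_KI270742v1", "HLA-A*01:01"]
chromosomes = select_contigs("CHROMOSOMES", contigs)
other = select_contigs("OTHER", contigs)
```

Under these assumptions, `CHROMOSOMES` and `OTHER` split the input into two disjoint groups that together cover everything `ALL` selects.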
Functions
| Name | Description |
|---|---|
| consolidate_dataset_udf | Consolidate arrays in the dataset. |
| create_dataset_udf | Create a TileDB-VCF dataset. |
| create_manifest | Create a manifest array in the dataset. |
| filter_samples_udf | Return URIs for samples not already in the dataset. |
| filter_uris_udf | Return URIs from sample_uris that are not in the manifest. |
| find_uris_aws_udf | Find URIs matching a pattern in the search_uri path with an efficient implementation for S3. |
| find_uris_udf | Find URIs matching a pattern in the search_uri path. |
| get_logger_wrapper | Get a logger instance and log version information. |
| ingest_manifest_dag | Create a DAG to load the manifest array. |
| ingest_manifest_udf | Ingest sample URIs into the manifest array. |
| ingest_samples_dag | Create a DAG to ingest samples into the dataset. |
| ingest_samples_udf | Ingest samples into the dataset. |
| ingest_vcf | Ingest samples into a dataset. |
| ingest_vcf_annotations | Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF. |
| read_metadata_uris_udf | Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument. |
| read_uris_udf | Read a list of URIs from a URI. |
| register_dataset_udf | Register the dataset on TileDB Cloud. |
consolidate_dataset_udf
client.vcf.ingestion.consolidate_dataset_udf(
dataset_uri,
*,
config=None,
exclude=MANIFEST_ARRAY,
include=None,
id='consolidate',
verbose=False,
)
Consolidate arrays in the dataset.
:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param exclude: group members to exclude, defaults to MANIFEST_ARRAY
:param include: group members to include, defaults to None
:param id: profiler event id, defaults to “consolidate”
:param verbose: verbose logging, defaults to False
create_dataset_udf
client.vcf.ingestion.create_dataset_udf(
dataset_uri,
*,
config=None,
extra_attrs=None,
vcf_attrs=None,
anchor_gap=None,
compression_level=None,
annotation_dataset=False,
verbose=False,
)
Create a TileDB-VCF dataset.
:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param extra_attrs: INFO/FORMAT fields to materialize, defaults to None
:param vcf_attrs: VCF with all INFO/FORMAT fields to materialize, defaults to None
:param anchor_gap: anchor gap for VCF dataset, defaults to None
:param compression_level: zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF)
:param annotation_dataset: create an annotation dataset, defaults to False
:param verbose: verbose logging, defaults to False
:return: dataset URI
create_manifest
client.vcf.ingestion.create_manifest(dataset_uri, group)
Create a manifest array in the dataset.
:param dataset_uri: dataset URI
:param group: dataset group
filter_samples_udf
client.vcf.ingestion.filter_samples_udf(
dataset_uri,
*,
config=None,
verbose=False,
)
Return URIs for samples not already in the dataset.
:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False
:return: sample URIs
filter_uris_udf
client.vcf.ingestion.filter_uris_udf(
dataset_uri,
sample_uris,
*,
config=None,
verbose=False,
)
Return URIs from sample_uris that are not in the manifest.
:param dataset_uri: dataset URI
:param sample_uris: sample URIs
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False
:return: filtered sample URIs
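The filtering described above amounts to a set difference that preserves the input order. The sketch below shows that idea on plain Python lists. The helper name `filter_new_uris`, the in-memory `manifest_uris` list, and the example bucket are hypothetical; the actual UDF reads the manifest from the dataset, which is not reproduced here.

```python
def filter_new_uris(sample_uris, manifest_uris):
    """Keep the URIs from `sample_uris` that the manifest does not list yet."""
    known = set(manifest_uris)  # set lookup keeps the filter linear in size
    return [uri for uri in sample_uris if uri not in known]


# Hypothetical URIs for illustration only.
manifest_uris = ["s3://example-bucket/vcf/s1.vcf.gz"]
sample_uris = [
    "s3://example-bucket/vcf/s1.vcf.gz",
    "s3://example-bucket/vcf/s2.vcf.gz",
    "s3://example-bucket/vcf/s3.vcf.gz",
]
new_uris = filter_new_uris(sample_uris, manifest_uris)
```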
find_uris_aws_udf
client.vcf.ingestion.find_uris_aws_udf(
dataset_uri,
search_uri,
*,
config=None,
include=None,
exclude=None,
max_files=None,
verbose=False,
)
Find URIs matching a pattern in the search_uri path with an efficient implementation for S3.
include and exclude patterns are Unix shell style (see fnmatch module).
:param dataset_uri: dataset URI
:param search_uri: URI to search for VCF files
:param config: config dictionary, defaults to None
:param include: include pattern used in the search, defaults to None
:param exclude: exclude pattern applied to the search results, defaults to None
:param max_files: maximum number of URIs returned, defaults to None
:param verbose: verbose logging, defaults to False
:return: list of URIs
find_uris_udf
client.vcf.ingestion.find_uris_udf(
dataset_uri,
search_uri,
*,
config=None,
include=None,
exclude=None,
max_files=None,
verbose=False,
)
Find URIs matching a pattern in the search_uri path.
include and exclude patterns are Unix shell style (see fnmatch module).
:param dataset_uri: dataset URI
:param search_uri: URI to search for VCF files
:param config: config dictionary, defaults to None
:param include: include pattern used in the search, defaults to None
:param exclude: exclude pattern applied to the search results, defaults to None
:param max_files: maximum number of URIs returned, defaults to None
:param verbose: verbose logging, defaults to False
:return: list of URIs
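Because the include and exclude patterns follow the standard library's `fnmatch` rules, their effect can be previewed locally before launching a search. The sketch below is a rough approximation, not the library's implementation: `match_uris` is a hypothetical helper, it matches against the full URI (whether the library matches the full URI or only the file name is an assumption), and the example URIs are invented.

```python
import fnmatch


def match_uris(uris, include=None, exclude=None, max_files=None):
    """Apply Unix shell style include/exclude patterns, then cap the result."""
    selected = []
    for uri in uris:
        # fnmatchcase avoids platform-dependent case normalization.
        if include and not fnmatch.fnmatchcase(uri, include):
            continue
        if exclude and fnmatch.fnmatchcase(uri, exclude):
            continue
        selected.append(uri)
    # Slicing with None keeps everything, mirroring "defaults to None".
    return selected[:max_files]


uris = [
    "s3://example-bucket/vcf/a.vcf.gz",
    "s3://example-bucket/vcf/a.vcf.gz.tbi",
    "s3://example-bucket/vcf/b.bcf",
    "s3://example-bucket/vcf/tmp/c.vcf.gz",
]
matched = match_uris(uris, include="*.vcf.gz", exclude="*/tmp/*")
```

Note that in `fnmatch` a `*` also matches `/`, so `*.vcf.gz` matches files at any depth, and the index file `a.vcf.gz.tbi` is skipped because the pattern must match the whole string.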
get_logger_wrapper
client.vcf.ingestion.get_logger_wrapper(verbose=False)
Get a logger instance and log version information.
:param verbose: verbose logging, defaults to False
:return: logger instance
ingest_manifest_dag
client.vcf.ingestion.ingest_manifest_dag(
dataset_uri,
*,
acn=None,
config=None,
search_uri=None,
pattern=None,
ignore=None,
sample_list_uri=None,
metadata_uri=None,
metadata_attr='uri',
max_files=None,
batch_size=MANIFEST_BATCH_SIZE,
workers=MANIFEST_WORKERS,
extra_attrs=None,
vcf_attrs=None,
anchor_gap=None,
compression_level=None,
verbose=False,
aws_find_mode=False,
disable_manifest=False,
consolidate_resources=CONSOLIDATE_RESOURCES,
manifest_resources=MANIFEST_RESOURCES,
image_name='genomics',
)
Create a DAG to load the manifest array.
:param dataset_uri: dataset URI
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None
:param config: config dictionary, defaults to None
:param search_uri: URI to search for VCF files, defaults to None
:param pattern: pattern to match when searching for VCF files, defaults to None
:param ignore: pattern to ignore when searching for VCF files, defaults to None
:param sample_list_uri: URI with a list of VCF URIs, defaults to None
:param metadata_uri: URI of metadata array holding VCF URIs, defaults to None
:param metadata_attr: name of metadata attribute containing URIs, defaults to “uri”
:param max_files: maximum number of URIs to ingest, defaults to None
:param batch_size: manifest batch size, defaults to MANIFEST_BATCH_SIZE
:param workers: maximum number of parallel workers, defaults to MANIFEST_WORKERS
:param extra_attrs: INFO/FORMAT fields to materialize, defaults to None
:param vcf_attrs: VCF with all INFO/FORMAT fields to materialize, defaults to None
:param anchor_gap: anchor gap for VCF dataset, defaults to None
:param compression_level: zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF)
:param verbose: verbose logging, defaults to False
:param aws_find_mode: use AWS CLI to find VCFs, defaults to False
:param disable_manifest: disable manifest creation, defaults to False
:param consolidate_resources: manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES
:param manifest_resources: manual override for manifest UDF resources, defaults to MANIFEST_RESOURCES
:param image_name: UDF image name to use, useful for testing beta features, defaults to “genomics”
ingest_manifest_udf
client.vcf.ingestion.ingest_manifest_udf(
dataset_uri,
sample_uris,
*,
config=None,
id='manifest',
verbose=False,
)
Ingest sample URIs into the manifest array.
:param dataset_uri: dataset URI
:param sample_uris: sample URIs
:param config: config dictionary, defaults to None
:param id: profiler event id, defaults to “manifest”
:param verbose: verbose logging, defaults to False
ingest_samples_dag
client.vcf.ingestion.ingest_samples_dag(
dataset_uri,
*,
acn=None,
config=None,
contigs=Contigs.ALL,
threads=VCF_THREADS,
batch_size=VCF_BATCH_SIZE,
workers=VCF_WORKERS,
max_samples=None,
resume=True,
ingest_resources=None,
verbose=False,
create_index=True,
trace_id=None,
consolidate_stats=False,
use_remote_tmp=False,
sample_list_uri=None,
consolidate_resources=CONSOLIDATE_RESOURCES,
image_name='genomics',
)
Create a DAG to ingest samples into the dataset.
Note: If sample_list_uri is provided, the manifest is not checked for existing samples.
:param dataset_uri: dataset URI
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None
:param config: config dictionary, defaults to None
:param contigs: contig mode (Contigs.ALL | Contigs.CHROMOSOMES | Contigs.OTHER | Contigs.ALL_DISABLE_MERGE) or list of contigs to ingest, defaults to Contigs.ALL
:param threads: number of threads to use per ingestion task, defaults to VCF_THREADS
:param batch_size: sample batch size, defaults to VCF_BATCH_SIZE
:param workers: maximum number of parallel workers, defaults to VCF_WORKERS
:param max_samples: maximum number of samples to ingest, defaults to None (no limit)
:param resume: enable resume ingestion mode, defaults to True
:param ingest_resources: manual override for ingest UDF resources, defaults to None
:param verbose: verbose logging, defaults to False
:param create_index: force creation of a local index file, defaults to True
:param trace_id: trace ID for logging, defaults to None
:param consolidate_stats: consolidate the stats arrays, defaults to False
:param use_remote_tmp: use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs)
:param sample_list_uri: URI with a list of VCF URIs, defaults to None
:param consolidate_resources: manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES
:param image_name: UDF image name to use, useful for testing beta features, defaults to “genomics”
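The `max_samples` and `batch_size` parameters can be pictured as a cap followed by a split into fixed-size batches, each of which becomes one ingestion task. The sketch below illustrates that reading only: `make_batches` is a hypothetical helper, the numbers are arbitrary, and how the library actually schedules batches across `workers` is not described in this reference and is not modeled.

```python
def make_batches(sample_uris, batch_size, max_samples=None):
    """Cap the sample list, then split it into batches of `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    capped = sample_uris[:max_samples]  # None means no limit
    return [
        capped[start:start + batch_size]
        for start in range(0, len(capped), batch_size)
    ]


# Hypothetical sample URIs for illustration only.
sample_uris = [f"s3://example-bucket/vcf/s{i}.vcf.gz" for i in range(10)]
batches = make_batches(sample_uris, batch_size=4)
batch_sizes = [len(batch) for batch in batches]
```

With ten samples and a batch size of four, the last batch is smaller than the others; no sample is dropped or duplicated.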
ingest_samples_udf
client.vcf.ingestion.ingest_samples_udf(
dataset_uri,
sample_uris,
*,
config=None,
threads,
memory_mb,
sample_batch_size,
contig_mode='all',
contigs_to_keep_separate=None,
contig_fragment_merging=True,
resume=True,
create_index=True,
id='samples',
verbose=False,
trace_id=None,
use_remote_tmp=False,
)
Ingest samples into the dataset.
:param dataset_uri: dataset URI
:param sample_uris: sample URIs
:param threads: number of threads to use for ingestion
:param memory_mb: memory to use for ingestion in MiB
:param sample_batch_size: sample batch size to use for ingestion
:param config: config dictionary, defaults to None
:param contig_mode: ingestion mode, defaults to “all”
:param contigs_to_keep_separate: list of contigs to keep separate, defaults to None
:param contig_fragment_merging: enable contig fragment merging, defaults to True
:param resume: enable resume ingestion mode, defaults to True
:param create_index: force creation of a local index file, defaults to True
:param id: profiler event id, defaults to “samples”
:param verbose: verbose logging, defaults to False
:param trace_id: trace ID for logging, defaults to None
:param use_remote_tmp: use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs)
ingest_vcf
client.vcf.ingestion.ingest_vcf(
dataset_uri,
*,
acn=None,
config=None,
teamspace=None,
register_name=None,
search_uri=None,
pattern=None,
ignore=None,
sample_list_uri=None,
metadata_uri=None,
metadata_attr='uri',
max_files=None,
max_samples=None,
contigs=Contigs.ALL,
resume=True,
extra_attrs=DEFAULT_ATTRIBUTES,
vcf_attrs=None,
anchor_gap=None,
compression_level=None,
manifest_batch_size=MANIFEST_BATCH_SIZE,
manifest_workers=MANIFEST_WORKERS,
vcf_batch_size=VCF_BATCH_SIZE,
vcf_workers=VCF_WORKERS,
vcf_threads=VCF_THREADS,
ingest_resources=None,
verbose=False,
create_index=True,
trace_id=None,
consolidate_stats=True,
aws_find_mode=False,
use_remote_tmp=False,
disable_manifest=False,
consolidate_resources=CONSOLIDATE_RESOURCES,
manifest_resources=MANIFEST_RESOURCES,
vcf_image_name='genomics',
)
Ingest samples into a dataset.
:param dataset_uri: dataset URI
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None
:param config: config dictionary, defaults to None
:param teamspace: TileDB-Cloud teamspace, defaults to None
:param register_name: name to register the dataset with on TileDB Cloud, defaults to None
:param search_uri: URI to search for VCF files, defaults to None
:param pattern: Unix shell style pattern to match when searching for VCF files, defaults to None
:param ignore: Unix shell style pattern to ignore when searching for VCF files, defaults to None
:param sample_list_uri: URI with a list of VCF URIs, defaults to None
:param metadata_uri: URI of metadata array holding VCF URIs, defaults to None
:param metadata_attr: name of metadata attribute containing URIs, defaults to “uri”
:param max_files: maximum number of VCF URIs to read/find, defaults to None (no limit)
:param max_samples: maximum number of samples to ingest, defaults to None (no limit)
:param contigs: contig mode (Contigs.ALL | Contigs.CHROMOSOMES | Contigs.OTHER | Contigs.ALL_DISABLE_MERGE) or list of contigs to ingest, defaults to Contigs.ALL
:param resume: enable resume ingestion mode, defaults to True
:param extra_attrs: INFO/FORMAT fields to materialize, defaults to DEFAULT_ATTRIBUTES
:param vcf_attrs: VCF with all INFO/FORMAT fields to materialize, defaults to None
:param anchor_gap: anchor gap for VCF dataset, defaults to None
:param compression_level: zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF)
:param manifest_batch_size: batch size for manifest ingestion, defaults to MANIFEST_BATCH_SIZE
:param manifest_workers: number of workers for manifest ingestion, defaults to MANIFEST_WORKERS
:param vcf_batch_size: batch size for VCF ingestion, defaults to VCF_BATCH_SIZE
:param vcf_workers: number of workers for VCF ingestion, defaults to VCF_WORKERS
:param vcf_threads: number of threads for VCF ingestion, defaults to VCF_THREADS
:param ingest_resources: manual override for ingest UDF resources, defaults to None
:param verbose: verbose logging, defaults to False
:param create_index: force creation of a local index file, defaults to True
:param trace_id: trace ID for logging, defaults to None
:param consolidate_stats: consolidate the stats arrays, defaults to True
:param aws_find_mode: use AWS CLI to find VCFs, defaults to False
:param use_remote_tmp: use remote tmp space if VCFs need to be sorted and bgzipped, defaults to False (preferred for small VCFs)
:param disable_manifest: disable manifest creation, defaults to False
:param consolidate_resources: manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES
:param manifest_resources: manual override for manifest UDF resources, defaults to MANIFEST_RESOURCES
:param vcf_image_name: UDF image name to use, useful for testing beta features, defaults to “genomics”
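Since every argument after `dataset_uri` is keyword-only, one convenient pattern is to assemble the keyword arguments in a dictionary and pass them with `**`. The sketch below shows this for a search-based ingestion. The helper `build_ingest_kwargs`, the bucket, and the dataset URI are hypothetical, and the call itself is left as a comment because the import path of the package is not shown in this reference.

```python
def build_ingest_kwargs(search_uri, *, pattern="*.vcf.gz", max_files=None, acn=None):
    """Collect keyword arguments for a search-based ingest_vcf call."""
    kwargs = {
        "search_uri": search_uri,
        "pattern": pattern,  # Unix shell style, as documented above
        "resume": True,      # matches the documented default
    }
    # Only pass optional values that were actually set, so the
    # documented defaults apply to everything else.
    if max_files is not None:
        kwargs["max_files"] = max_files
    if acn is not None:
        kwargs["acn"] = acn
    return kwargs


kwargs = build_ingest_kwargs("s3://example-bucket/vcfs/", max_files=100)

# Assumed call shape, with a hypothetical dataset URI:
# client.vcf.ingestion.ingest_vcf("tiledb://example-teamspace/example-dataset", **kwargs)
```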
ingest_vcf_annotations
client.vcf.ingestion.ingest_vcf_annotations(
dataset_uri,
*,
vcf_uri=None,
search_uri=None,
pattern=None,
ignore=None,
create_index=True,
config=None,
acn=None,
teamspace=None,
register_name=None,
ingest_resources=None,
verbose=False,
consolidate_resources=CONSOLIDATE_RESOURCES,
vcf_image_name='genomics',
)
Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF.
:param dataset_uri: dataset URI
:param vcf_uri: VCF URI, defaults to None
:param search_uri: URI to search for VCF files, defaults to None
:param pattern: Unix shell style pattern to match when searching for VCF files, defaults to None
:param ignore: Unix shell style pattern to ignore when searching for VCF files, defaults to None
:param create_index: force creation of a local index file, defaults to True
:param config: config dictionary, defaults to None
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None
:param teamspace: TileDB-Cloud teamspace, defaults to None
:param register_name: name to register the dataset with on TileDB Cloud, defaults to None
:param ingest_resources: manual override for ingest UDF resources, defaults to None
:param verbose: verbose logging, defaults to False
:param consolidate_resources: manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES
:param vcf_image_name: UDF image name to use, useful for testing beta features, defaults to “genomics”
read_metadata_uris_udf
client.vcf.ingestion.read_metadata_uris_udf(
dataset_uri,
*,
config=None,
metadata_uri,
metadata_attr='uri',
max_files=None,
verbose=False,
)
Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument.
:param dataset_uri: dataset URI
:param config: TileDB config, defaults to None
:param metadata_uri: metadata array URI
:param metadata_attr: name of metadata attribute containing URIs, defaults to “uri”
:param max_files: maximum number of URIs returned, defaults to None
:param verbose: verbose logging, defaults to False
:return: list of URIs
read_uris_udf
client.vcf.ingestion.read_uris_udf(
dataset_uri,
list_uri,
*,
config=None,
max_files=None,
verbose=False,
)
Read a list of URIs from a URI.
:param dataset_uri: dataset URI
:param list_uri: URI of the list of URIs
:param config: config dictionary, defaults to None
:param max_files: maximum number of URIs returned, defaults to None
:param verbose: verbose logging, defaults to False
:return: list of URIs
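The reference does not state the format of the list behind `list_uri`. Assuming it is plain text with one URI per line, the parsing and the `max_files` cap could look like the sketch below. The helper `read_uri_list` is hypothetical and works on an in-memory string instead of fetching the list from a URI.

```python
def read_uri_list(text, max_files=None):
    """Parse one URI per line, skipping blank lines, and cap the result."""
    uris = [line.strip() for line in text.splitlines() if line.strip()]
    return uris[:max_files]  # None means no limit


# Hypothetical list contents for illustration only.
text = """
s3://example-bucket/vcf/s1.vcf.gz
s3://example-bucket/vcf/s2.vcf.gz

s3://example-bucket/vcf/s3.vcf.gz
"""
all_uris = read_uri_list(text)
first_two = read_uri_list(text, max_files=2)
```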
register_dataset_udf
client.vcf.ingestion.register_dataset_udf(
dataset_uri,
*,
register_name,
acn,
teamspace,
config=None,
verbose=False,
)
Register the dataset on TileDB Cloud.
:param dataset_uri: dataset URI
:param register_name: name to register the dataset with on TileDB Cloud
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type)
:param teamspace: TileDB-Cloud teamspace
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False