vcf.ingestion

client.vcf.ingestion

Classes

Name Description
Contigs The contigs to ingest.

Contigs

client.vcf.ingestion.Contigs()

The contigs to ingest.

ALL = all contigs
CHROMOSOMES = all human chromosomes
OTHER = all contigs other than the human chromosomes
ALL_DISABLE_MERGE = all contigs with merging disabled, for non-human datasets

Functions

Name Description
consolidate_dataset_udf Consolidate arrays in the dataset.
create_dataset_udf Create a TileDB-VCF dataset.
create_manifest Create a manifest array in the dataset.
filter_samples_udf Return URIs for samples not already in the dataset.
filter_uris_udf Return URIs from sample_uris that are not in the manifest.
find_uris_aws_udf Find URIs matching a pattern in the search_uri path with an efficient implementation for S3.
find_uris_udf Find URIs matching a pattern in the search_uri path.
get_logger_wrapper Get a logger instance and log version information.
ingest_manifest_dag Create a DAG to load the manifest array.
ingest_manifest_udf Ingest sample URIs into the manifest array.
ingest_samples_dag Create a DAG to ingest samples into the dataset.
ingest_samples_udf Ingest samples into the dataset.
ingest_vcf Ingest samples into a dataset.
ingest_vcf_annotations Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF.
read_metadata_uris_udf Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument.
read_uris_udf Read a list of URIs from a URI.
register_dataset_udf Register the dataset on TileDB Cloud.

consolidate_dataset_udf

client.vcf.ingestion.consolidate_dataset_udf(
    dataset_uri,
    *,
    config=None,
    exclude=MANIFEST_ARRAY,
    include=None,
    id='consolidate',
    verbose=False,
)

Consolidate arrays in the dataset.

:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param exclude: group members to exclude, defaults to MANIFEST_ARRAY
:param include: group members to include, defaults to None
:param id: profiler event id, defaults to “consolidate”
:param verbose: verbose logging, defaults to False

create_dataset_udf

client.vcf.ingestion.create_dataset_udf(
    dataset_uri,
    *,
    config=None,
    extra_attrs=None,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    annotation_dataset=False,
    verbose=False,
)

Create a TileDB-VCF dataset.

:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param extra_attrs: INFO/FORMAT fields to materialize, defaults to None
:param vcf_attrs: VCF with all INFO/FORMAT fields to materialize, defaults to None
:param anchor_gap: anchor gap for VCF dataset, defaults to None
:param compression_level: zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF)
:param annotation_dataset: create an annotation dataset, defaults to False
:param verbose: verbose logging, defaults to False
:return: dataset URI

create_manifest

client.vcf.ingestion.create_manifest(dataset_uri, group)

Create a manifest array in the dataset.

:param dataset_uri: dataset URI
:param group: dataset group

filter_samples_udf

client.vcf.ingestion.filter_samples_udf(
    dataset_uri,
    *,
    config=None,
    verbose=False,
)

Return URIs for samples not already in the dataset.

:param dataset_uri: dataset URI
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False
:return: sample URIs

filter_uris_udf

client.vcf.ingestion.filter_uris_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    verbose=False,
)

Return URIs from sample_uris that are not in the manifest.

:param dataset_uri: dataset URI
:param sample_uris: sample URIs
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False
:return: filtered sample URIs
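
Conceptually, returning the URIs from sample_uris that are not in the manifest is a set difference. The sketch below illustrates that idea with plain Python collections only. The helper name filter_new_uris and the in-memory manifest_uris set are hypothetical stand-ins, not part of this module; the real UDF reads the manifest array from the dataset, and how it compares entries is not specified here.

```python
def filter_new_uris(sample_uris, manifest_uris):
    """Return the URIs in sample_uris that are absent from manifest_uris.

    Hypothetical helper for illustration; input order is preserved.
    """
    known = set(manifest_uris)  # set membership tests are O(1) on average
    return [uri for uri in sample_uris if uri not in known]


# Placeholder URIs, assumed for the example.
candidates = [
    "s3://bucket/a.vcf.gz",
    "s3://bucket/b.vcf.gz",
    "s3://bucket/c.vcf.gz",
]
already_in_manifest = {"s3://bucket/b.vcf.gz"}

new_uris = filter_new_uris(candidates, already_in_manifest)
```

Only the URIs that the manifest does not already list remain in new_uris, so re-running an ingestion over the same candidate list would skip previously recorded samples.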

find_uris_aws_udf

client.vcf.ingestion.find_uris_aws_udf(
    dataset_uri,
    search_uri,
    *,
    config=None,
    include=None,
    exclude=None,
    max_files=None,
    verbose=False,
)

Find URIs matching a pattern in the search_uri path with an efficient implementation for S3.

include and exclude patterns are Unix shell style (see fnmatch module).

:param dataset_uri: dataset URI
:param search_uri: URI to search for VCF files
:param config: config dictionary, defaults to None
:param include: include pattern used in the search, defaults to None
:param exclude: exclude pattern applied to the search results, defaults to None
:param max_files: maximum number of URIs returned, defaults to None
:param verbose: verbose logging, defaults to False
:return: list of URIs
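
To make the Unix shell style patterns concrete, the following sketch applies an include pattern, then an exclude pattern, then a max_files cap to a list of URIs using the standard fnmatch module. The helper select_uris and the sample URIs are hypothetical. Whether the UDF matches against the full URI or only the file name is not stated here; this sketch assumes the full URI.

```python
from fnmatch import fnmatchcase


def select_uris(uris, include=None, exclude=None, max_files=None):
    """Filter uris with shell-style patterns (hypothetical helper).

    fnmatchcase is used so matching is case-sensitive on every platform.
    """
    selected = []
    for uri in uris:
        if include is not None and not fnmatchcase(uri, include):
            continue  # does not match the include pattern
        if exclude is not None and fnmatchcase(uri, exclude):
            continue  # matches the exclude pattern
        selected.append(uri)
    return selected if max_files is None else selected[:max_files]


# Placeholder URIs, assumed for the example.
found = [
    "s3://bucket/vcfs/sample1.vcf.gz",
    "s3://bucket/vcfs/sample1.vcf.gz.tbi",
    "s3://bucket/vcfs/test/sample2.vcf.gz",
    "s3://bucket/vcfs/sample3.vcf.gz",
]

matched = select_uris(found, include="*.vcf.gz", exclude="*/test/*")
first_only = select_uris(found, include="*.vcf.gz", exclude="*/test/*", max_files=1)
```

Note that in fnmatch patterns `*` also matches `/`, so `*.vcf.gz` matches files at any depth under the search path.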

find_uris_udf

client.vcf.ingestion.find_uris_udf(
    dataset_uri,
    search_uri,
    *,
    config=None,
    include=None,
    exclude=None,
    max_files=None,
    verbose=False,
)

Find URIs matching a pattern in the search_uri path.

include and exclude patterns are Unix shell style (see fnmatch module).

:param dataset_uri: dataset URI
:param search_uri: URI to search for VCF files
:param config: config dictionary, defaults to None
:param include: include pattern used in the search, defaults to None
:param exclude: exclude pattern applied to the search results, defaults to None
:param max_files: maximum number of URIs returned, defaults to None
:param verbose: verbose logging, defaults to False
:return: list of URIs

get_logger_wrapper

client.vcf.ingestion.get_logger_wrapper(verbose=False)

Get a logger instance and log version information.

:param verbose: verbose logging, defaults to False
:return: logger instance

ingest_manifest_dag

client.vcf.ingestion.ingest_manifest_dag(
    dataset_uri,
    *,
    acn=None,
    config=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    sample_list_uri=None,
    metadata_uri=None,
    metadata_attr='uri',
    max_files=None,
    batch_size=MANIFEST_BATCH_SIZE,
    workers=MANIFEST_WORKERS,
    extra_attrs=None,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    verbose=False,
    aws_find_mode=False,
    disable_manifest=False,
    consolidate_resources=CONSOLIDATE_RESOURCES,
    manifest_resources=MANIFEST_RESOURCES,
    image_name='genomics',
)

Create a DAG to load the manifest array.

:param dataset_uri: dataset URI
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None
:param config: config dictionary, defaults to None
:param search_uri: URI to search for VCF files, defaults to None
:param pattern: pattern to match when searching for VCF files, defaults to None
:param ignore: pattern to ignore when searching for VCF files, defaults to None
:param sample_list_uri: URI with a list of VCF URIs, defaults to None
:param metadata_uri: URI of metadata array holding VCF URIs, defaults to None
:param metadata_attr: name of metadata attribute containing URIs, defaults to “uri”
:param max_files: maximum number of URIs to ingest, defaults to None
:param batch_size: manifest batch size, defaults to MANIFEST_BATCH_SIZE
:param workers: maximum number of parallel workers, defaults to MANIFEST_WORKERS
:param extra_attrs: INFO/FORMAT fields to materialize, defaults to None
:param vcf_attrs: VCF with all INFO/FORMAT fields to materialize, defaults to None
:param anchor_gap: anchor gap for VCF dataset, defaults to None
:param compression_level: zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF)
:param verbose: verbose logging, defaults to False
:param aws_find_mode: use AWS CLI to find VCFs, defaults to False
:param disable_manifest: disable manifest creation, defaults to False
:param consolidate_resources: manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES
:param manifest_resources: manual override for manifest UDF resources, defaults to MANIFEST_RESOURCES
:param image_name: udf image name to use, useful for testing beta features
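
The batch_size and workers parameters suggest that the sample URIs are split into batches that are processed in parallel. How the DAG actually partitions the work is not described here, so the sketch below only illustrates one plausible scheme, contiguous fixed-size batches. The helper make_batches and the URIs are hypothetical.

```python
def make_batches(items, batch_size):
    """Split items into contiguous batches of at most batch_size elements.

    Hypothetical helper; the last batch may be smaller than batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Seven placeholder URIs, assumed for the example.
uris = [f"s3://bucket/sample{i}.vcf.gz" for i in range(7)]
batches = make_batches(uris, batch_size=3)
batch_sizes = [len(batch) for batch in batches]
```

Under this assumption, a larger batch_size produces fewer, longer-running tasks, while workers bounds how many of those tasks run at the same time.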

ingest_manifest_udf

client.vcf.ingestion.ingest_manifest_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    id='manifest',
    verbose=False,
)

Ingest sample URIs into the manifest array.

:param dataset_uri: dataset URI
:param sample_uris: sample URIs
:param config: config dictionary, defaults to None
:param id: profiler event id, defaults to “manifest”
:param verbose: verbose logging, defaults to False

ingest_samples_dag

client.vcf.ingestion.ingest_samples_dag(
    dataset_uri,
    *,
    acn=None,
    config=None,
    contigs=Contigs.ALL,
    threads=VCF_THREADS,
    batch_size=VCF_BATCH_SIZE,
    workers=VCF_WORKERS,
    max_samples=None,
    resume=True,
    ingest_resources=None,
    verbose=False,
    create_index=True,
    trace_id=None,
    consolidate_stats=False,
    use_remote_tmp=False,
    sample_list_uri=None,
    consolidate_resources=CONSOLIDATE_RESOURCES,
    image_name='genomics',
)

Create a DAG to ingest samples into the dataset.

Note: If sample_list_uri is provided, the manifest is not checked for existing samples.

:param dataset_uri: dataset URI
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None
:param config: config dictionary, defaults to None
:param contigs: contig mode (Contigs.ALL | Contigs.CHROMOSOMES | Contigs.OTHER | Contigs.ALL_DISABLE_MERGE) or list of contigs to ingest, defaults to Contigs.ALL
:param threads: number of threads to use per ingestion task, defaults to VCF_THREADS
:param batch_size: sample batch size, defaults to VCF_BATCH_SIZE
:param workers: maximum number of parallel workers, defaults to VCF_WORKERS
:param max_samples: maximum number of samples to ingest, defaults to None (no limit)
:param resume: enable resume ingestion mode, defaults to True
:param ingest_resources: manual override for ingest UDF resources, defaults to None
:param verbose: verbose logging, defaults to False
:param create_index: force creation of a local index file, defaults to True
:param trace_id: trace ID for logging, defaults to None
:param consolidate_stats: consolidate the stats arrays, defaults to False
:param use_remote_tmp: use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs)
:param sample_list_uri: URI with a list of VCF URIs, defaults to None
:param consolidate_resources: manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES
:param image_name: udf image name to use, useful for testing beta features

ingest_samples_udf

client.vcf.ingestion.ingest_samples_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    threads,
    memory_mb,
    sample_batch_size,
    contig_mode='all',
    contigs_to_keep_separate=None,
    contig_fragment_merging=True,
    resume=True,
    create_index=True,
    id='samples',
    verbose=False,
    trace_id=None,
    use_remote_tmp=False,
)

Ingest samples into the dataset.

:param dataset_uri: dataset URI
:param sample_uris: sample URIs
:param threads: number of threads to use for ingestion
:param memory_mb: memory to use for ingestion in MiB
:param sample_batch_size: sample batch size to use for ingestion
:param config: config dictionary, defaults to None
:param contig_mode: ingestion mode, defaults to “all”
:param contigs_to_keep_separate: list of contigs to keep separate, defaults to None
:param contig_fragment_merging: enable contig fragment merging, defaults to True
:param resume: enable resume ingestion mode, defaults to True
:param create_index: force creation of a local index file, defaults to True
:param id: profiler event id, defaults to “samples”
:param verbose: verbose logging, defaults to False
:param trace_id: trace ID for logging, defaults to None
:param use_remote_tmp: use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs)

ingest_vcf

client.vcf.ingestion.ingest_vcf(
    dataset_uri,
    *,
    acn=None,
    config=None,
    teamspace=None,
    register_name=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    sample_list_uri=None,
    metadata_uri=None,
    metadata_attr='uri',
    max_files=None,
    max_samples=None,
    contigs=Contigs.ALL,
    resume=True,
    extra_attrs=DEFAULT_ATTRIBUTES,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    manifest_batch_size=MANIFEST_BATCH_SIZE,
    manifest_workers=MANIFEST_WORKERS,
    vcf_batch_size=VCF_BATCH_SIZE,
    vcf_workers=VCF_WORKERS,
    vcf_threads=VCF_THREADS,
    ingest_resources=None,
    verbose=False,
    create_index=True,
    trace_id=None,
    consolidate_stats=True,
    aws_find_mode=False,
    use_remote_tmp=False,
    disable_manifest=False,
    consolidate_resources=CONSOLIDATE_RESOURCES,
    manifest_resources=MANIFEST_RESOURCES,
    vcf_image_name='genomics',
)

Ingest samples into a dataset.

:param dataset_uri: dataset URI
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None
:param config: config dictionary, defaults to None
:param teamspace: TileDB-Cloud teamspace, defaults to None
:param register_name: name to register the dataset with on TileDB Cloud, defaults to None
:param search_uri: URI to search for VCF files, defaults to None
:param pattern: Unix shell style pattern to match when searching for VCF files, defaults to None
:param ignore: Unix shell style pattern to ignore when searching for VCF files, defaults to None
:param sample_list_uri: URI with a list of VCF URIs, defaults to None
:param metadata_uri: URI of metadata array holding VCF URIs, defaults to None
:param metadata_attr: name of metadata attribute containing URIs, defaults to “uri”
:param max_files: maximum number of VCF URIs to read/find, defaults to None (no limit)
:param max_samples: maximum number of samples to ingest, defaults to None (no limit)
:param contigs: contig mode (Contigs.ALL | Contigs.CHROMOSOMES | Contigs.OTHER | Contigs.ALL_DISABLE_MERGE) or list of contigs to ingest, defaults to Contigs.ALL
:param resume: enable resume ingestion mode, defaults to True
:param extra_attrs: INFO/FORMAT fields to materialize, defaults to DEFAULT_ATTRIBUTES
:param vcf_attrs: VCF with all INFO/FORMAT fields to materialize, defaults to None
:param anchor_gap: anchor gap for VCF dataset, defaults to None
:param compression_level: zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF)
:param manifest_batch_size: batch size for manifest ingestion, defaults to MANIFEST_BATCH_SIZE
:param manifest_workers: number of workers for manifest ingestion, defaults to MANIFEST_WORKERS
:param vcf_batch_size: batch size for VCF ingestion, defaults to VCF_BATCH_SIZE
:param vcf_workers: number of workers for VCF ingestion, defaults to VCF_WORKERS
:param vcf_threads: number of threads for VCF ingestion, defaults to VCF_THREADS
:param ingest_resources: manual override for ingest UDF resources, defaults to None
:param verbose: verbose logging, defaults to False
:param create_index: force creation of a local index file, defaults to True
:param trace_id: trace ID for logging, defaults to None
:param consolidate_stats: consolidate the stats arrays, defaults to True
:param aws_find_mode: use AWS CLI to find VCFs, defaults to False
:param use_remote_tmp: use remote tmp space if VCFs need to be sorted and bgzipped, defaults to False (preferred for small VCFs)
:param disable_manifest: disable manifest creation, defaults to False
:param consolidate_resources: manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES
:param manifest_resources: manual override for manifest UDF resources, defaults to MANIFEST_RESOURCES
:param vcf_image_name: udf image name to use, useful for testing beta features
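
As a rough illustration of how these keyword arguments fit together, the sketch below assembles a small set of them for a search-based ingestion. Only parameter names documented above are used. The bucket paths and limits are hypothetical placeholders, and the final call is left as a comment because it requires an installed and authenticated client.

```python
# Hypothetical dataset location; replace with your own.
dataset_uri = "s3://my-bucket/vcf-dataset"

# Keyword arguments taken from the documented ingest_vcf signature.
ingest_kwargs = dict(
    search_uri="s3://my-bucket/vcfs/",  # hypothetical folder holding the VCF files
    pattern="*.vcf.gz",                 # Unix shell style pattern to match
    ignore="*test*",                    # Unix shell style pattern to skip
    max_files=100,                      # cap on VCF URIs to read/find
    max_samples=50,                     # cap on samples to ingest
    resume=True,                        # documented default, shown for clarity
    verbose=True,
)

# With the client package available, the call would presumably look like:
# client.vcf.ingestion.ingest_vcf(dataset_uri, **ingest_kwargs)
```

Keeping the arguments in a dictionary makes it easy to log or review the configuration before launching the ingestion.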

ingest_vcf_annotations

client.vcf.ingestion.ingest_vcf_annotations(
    dataset_uri,
    *,
    vcf_uri=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    create_index=True,
    config=None,
    acn=None,
    teamspace=None,
    register_name=None,
    ingest_resources=None,
    verbose=False,
    consolidate_resources=CONSOLIDATE_RESOURCES,
    vcf_image_name='genomics',
)

Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF.

:param dataset_uri: dataset URI
:param vcf_uri: VCF URI, defaults to None
:param search_uri: URI to search for VCF files, defaults to None
:param pattern: Unix shell style pattern to match when searching for VCF files, defaults to None
:param ignore: Unix shell style pattern to ignore when searching for VCF files, defaults to None
:param create_index: force creation of a local index file, defaults to True
:param config: config dictionary, defaults to None
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None
:param teamspace: TileDB-Cloud teamspace, defaults to None
:param register_name: name to register the dataset with on TileDB Cloud, defaults to None
:param ingest_resources: manual override for ingest UDF resources, defaults to None
:param verbose: verbose logging, defaults to False
:param consolidate_resources: manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES
:param vcf_image_name: udf image name to use, useful for testing beta features

read_metadata_uris_udf

client.vcf.ingestion.read_metadata_uris_udf(
    dataset_uri,
    *,
    config=None,
    metadata_uri,
    metadata_attr='uri',
    max_files=None,
    verbose=False,
)

Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument.

:param dataset_uri: dataset URI
:param config: TileDB config, defaults to None
:param metadata_uri: metadata array URI
:param metadata_attr: name of metadata attribute containing URIs, defaults to “uri”
:param max_files: maximum number of URIs returned, defaults to None
:param verbose: verbose logging, defaults to False
:return: list of URIs

read_uris_udf

client.vcf.ingestion.read_uris_udf(
    dataset_uri,
    list_uri,
    *,
    config=None,
    max_files=None,
    verbose=False,
)

Read a list of URIs from a URI.

:param dataset_uri: dataset URI
:param list_uri: URI of the list of URIs
:param config: config dictionary, defaults to None
:param max_files: maximum number of URIs returned, defaults to None
:param verbose: verbose logging, defaults to False
:return: list of URIs
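
The format of the list is not specified in this reference. Assuming a plain text file with one URI per line, reading it with a max_files cap could be sketched as follows. The helper parse_uri_list and the sample content are hypothetical, and the sketch works on an in-memory string rather than fetching list_uri.

```python
def parse_uri_list(text, max_files=None):
    """Parse one URI per line, skipping blank lines (assumed file format).

    Hypothetical helper; surrounding whitespace on each line is stripped.
    """
    uris = [line.strip() for line in text.splitlines() if line.strip()]
    return uris if max_files is None else uris[:max_files]


# Placeholder file content, assumed for the example.
listing = (
    "s3://bucket/a.vcf.gz\n"
    "\n"
    "s3://bucket/b.vcf.gz\n"
    "  s3://bucket/c.vcf.gz  \n"
)

all_uris = parse_uri_list(listing)
first_two = parse_uri_list(listing, max_files=2)
```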

register_dataset_udf

client.vcf.ingestion.register_dataset_udf(
    dataset_uri,
    *,
    register_name,
    acn,
    teamspace,
    config=None,
    verbose=False,
)

Register the dataset on TileDB Cloud.

:param dataset_uri: dataset URI
:param register_name: name to register the dataset with on TileDB Cloud
:param acn: Access Credentials Name (ACN) registered in TileDB Cloud (ARN type)
:param teamspace: TileDB-Cloud teamspace
:param config: config dictionary, defaults to None
:param verbose: verbose logging, defaults to False