vcf.ingestion

cloud.vcf.ingestion

Classes

Name Description
Contigs The contigs to ingest.

Contigs

cloud.vcf.ingestion.Contigs()

The contigs to ingest.

ALL = all contigs
CHROMOSOMES = all human chromosomes
OTHER = all contigs other than the human chromosomes
ALL_DISABLE_MERGE = all contigs with merging disabled, for non-human datasets
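As a rough illustration of how these modes partition a list of contigs, the sketch below uses a stand-in enum and a hypothetical `select_contigs` helper. The `ContigMode` class, the helper, and the chromosome naming (`chr1` to `chr22`, `chrX`, `chrY`, `chrM`) are assumptions made for this example, not the library's implementation.

```python
from enum import Enum


class ContigMode(Enum):
    """Stand-in for the documented Contigs modes (hypothetical, illustration only)."""

    ALL = "all"
    CHROMOSOMES = "chromosomes"
    OTHER = "other"
    ALL_DISABLE_MERGE = "all_disable_merge"


# Assumed naming convention for the human chromosomes.
HUMAN_CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def select_contigs(available, mode):
    """Return the contigs from `available` that the given mode would cover."""
    human = set(HUMAN_CHROMOSOMES)
    if mode is ContigMode.CHROMOSOMES:
        return [c for c in available if c in human]
    if mode is ContigMode.OTHER:
        return [c for c in available if c not in human]
    # ALL and ALL_DISABLE_MERGE both cover every contig; they differ only in merging.
    return list(available)


contigs = ["chr1", "chrX", "chrUn_KI270302v1", "HLA-A*01:01"]
chromosomes = select_contigs(contigs, ContigMode.CHROMOSOMES)
others = select_contigs(contigs, ContigMode.OTHER)
```

Under these assumptions, CHROMOSOMES and OTHER are complementary: every contig lands in exactly one of the two selections.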

Functions

Name Description
consolidate_dataset_udf Consolidate arrays in the dataset.
create_dataset_udf Create a TileDB-VCF dataset.
create_manifest Create a manifest array in the dataset.
filter_samples_udf Return URIs for samples not already in the dataset.
filter_uris_udf Return URIs from sample_uris that are not in the manifest.
find_uris_aws_udf Find URIs matching a pattern in the search_uri path with an efficient implementation for S3.
find_uris_udf Find URIs matching a pattern in the search_uri path.
get_logger_wrapper Get a logger instance and log version information.
ingest_manifest_dag Create a DAG to load the manifest array.
ingest_manifest_udf Ingest sample URIs into the manifest array.
ingest_samples_dag Create a DAG to ingest samples into the dataset.
ingest_samples_udf Ingest samples into the dataset.
ingest_vcf Ingest samples into a dataset.
ingest_vcf_annotations Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF.
read_metadata_uris_udf Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument.
read_uris_udf Read a list of URIs from a URI.
register_dataset_udf Register the dataset on TileDB Cloud.

consolidate_dataset_udf

cloud.vcf.ingestion.consolidate_dataset_udf(
    dataset_uri,
    *,
    config=None,
    exclude=MANIFEST_ARRAY,
    include=None,
    id='consolidate',
    verbose=False,
)

Consolidate arrays in the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
exclude Optional[Union[Sequence[str], str]] group members to exclude, defaults to MANIFEST_ARRAY MANIFEST_ARRAY
include Optional[Union[Sequence[str], str]] group members to include, defaults to None None
id str profiler event id, defaults to “consolidate” 'consolidate'
verbose bool verbose logging, defaults to False False
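The include and exclude parameters each accept either a single member name or a sequence of names. The sketch below shows one plausible way such a selection could work. The `select_members` helper, the member names, and the rule that exclude is applied after include are all assumptions for illustration; the reference above does not spell out the exact semantics.

```python
def select_members(members, include=None, exclude=None):
    """Pick group member names using optional include and exclude filters."""
    # Accept a single name or a sequence of names, as the parameter types allow.
    if isinstance(include, str):
        include = [include]
    if isinstance(exclude, str):
        exclude = [exclude]

    selected = []
    for name in members:
        if include is not None and name not in include:
            continue  # not requested
        if exclude is not None and name in exclude:
            continue  # explicitly skipped
        selected.append(name)
    return selected


# Hypothetical member names, chosen only for this example.
members = ["manifest", "data", "vcf_headers"]
without_manifest = select_members(members, exclude="manifest")
only_data = select_members(members, include=["data"])
```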

create_dataset_udf

cloud.vcf.ingestion.create_dataset_udf(
    dataset_uri,
    *,
    config=None,
    extra_attrs=None,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    annotation_dataset=False,
    verbose=False,
)

Create a TileDB-VCF dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
extra_attrs Optional[Union[Sequence[str], str]] INFO/FORMAT fields to materialize, defaults to None None
vcf_attrs Optional[str] VCF with all INFO/FORMAT fields to materialize, defaults to None None
anchor_gap Optional[int] anchor gap for VCF dataset, defaults to None None
compression_level Optional[int] zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) None
annotation_dataset bool create an annotation dataset, defaults to False False
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
str dataset URI

create_manifest

cloud.vcf.ingestion.create_manifest(dataset_uri, group)

Create a manifest array in the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
group tiledb.Group dataset group required

filter_samples_udf

cloud.vcf.ingestion.filter_samples_udf(
    dataset_uri,
    *,
    config=None,
    verbose=False,
)

Return URIs for samples not already in the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] sample URIs

filter_uris_udf

cloud.vcf.ingestion.filter_uris_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    verbose=False,
)

Return URIs from sample_uris that are not in the manifest.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
sample_uris Sequence[str] sample URIs required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] filtered sample URIs
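Conceptually, this filter is a set difference between the candidate URIs and what the manifest already holds. The following minimal sketch assumes the manifest contents are available as a plain collection of URIs; how the real manifest array keys its entries is not stated here, and `filter_new_uris` is a hypothetical name.

```python
def filter_new_uris(sample_uris, manifest_uris):
    """Return the URIs in sample_uris that are absent from manifest_uris, in order."""
    known = set(manifest_uris)  # set lookup keeps the filter linear in the input size
    return [uri for uri in sample_uris if uri not in known]


manifest = ["s3://example-bucket/a.vcf.gz"]
candidates = [
    "s3://example-bucket/a.vcf.gz",
    "s3://example-bucket/b.vcf.gz",
    "s3://example-bucket/c.vcf.gz",
]
new_uris = filter_new_uris(candidates, manifest)
```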

find_uris_aws_udf

cloud.vcf.ingestion.find_uris_aws_udf(
    dataset_uri,
    search_uri,
    *,
    config=None,
    include=None,
    exclude=None,
    max_files=None,
    verbose=False,
)

Find URIs matching a pattern in the search_uri path with an efficient implementation for S3.

include and exclude patterns are Unix shell style (see fnmatch module).

Parameters

Name Type Description Default
dataset_uri str dataset URI required
search_uri str URI to search for VCF files required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
include Optional[str] include pattern used in the search, defaults to None None
exclude Optional[str] exclude pattern applied to the search results, defaults to None None
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs

find_uris_udf

cloud.vcf.ingestion.find_uris_udf(
    dataset_uri,
    search_uri,
    *,
    config=None,
    include=None,
    exclude=None,
    max_files=None,
    verbose=False,
)

Find URIs matching a pattern in the search_uri path.

include and exclude patterns are Unix shell style (see fnmatch module).

Parameters

Name Type Description Default
dataset_uri str dataset URI required
search_uri str URI to search for VCF files required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
include Optional[str] include pattern used in the search, defaults to None None
exclude Optional[str] exclude pattern applied to the search results, defaults to None None
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs
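To make the Unix shell style include and exclude patterns concrete, here is a small sketch built on the standard `fnmatch` module. The `match_uris` helper is hypothetical, and matching against the full URI (rather than only the file name) is an assumption. It uses `fnmatchcase` so the result does not depend on the platform's case rules.

```python
from fnmatch import fnmatchcase


def match_uris(uris, include=None, exclude=None, max_files=None):
    """Keep URIs that match `include`, drop those that match `exclude`."""
    matched = []
    for uri in uris:
        if include is not None and not fnmatchcase(uri, include):
            continue
        if exclude is not None and fnmatchcase(uri, exclude):
            continue
        matched.append(uri)
    # Slicing with None as the bound keeps the whole list.
    return matched[:max_files]


uris = [
    "s3://example-bucket/vcfs/a.vcf.gz",
    "s3://example-bucket/vcfs/b.g.vcf.gz",
    "s3://example-bucket/vcfs/readme.txt",
    "s3://example-bucket/vcfs/c.vcf.gz",
]
found = match_uris(uris, include="*.vcf.gz", exclude="*.g.vcf.gz")
first_only = match_uris(uris, include="*.vcf.gz", max_files=1)
```

Note that `*` in these patterns also matches `/`, so `*.vcf.gz` matches files at any depth below the search path.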

get_logger_wrapper

cloud.vcf.ingestion.get_logger_wrapper(verbose=False)

Get a logger instance and log version information.

Parameters

Name Type Description Default
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
logging.Logger logger instance

ingest_manifest_dag

cloud.vcf.ingestion.ingest_manifest_dag(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    sample_list_uri=None,
    metadata_uri=None,
    metadata_attr='uri',
    max_files=None,
    batch_size=MANIFEST_BATCH_SIZE,
    workers=MANIFEST_WORKERS,
    extra_attrs=None,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    verbose=False,
    aws_find_mode=False,
    disable_manifest=False,
    consolidate_resources=CONSOLIDATE_RESOURCES,
    manifest_resources=MANIFEST_RESOURCES,
    create_resources=None,
    read_vcf_uris_resources=None,
    filter_uri_resources=None,
)

Create a DAG to load the manifest array.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
search_uri Optional[str] URI to search for VCF files, defaults to None None
pattern Optional[str] pattern to match when searching for VCF files, defaults to None None
ignore Optional[str] pattern to ignore when searching for VCF files, defaults to None None
sample_list_uri Optional[str] URI with a list of VCF URIs, defaults to None None
metadata_uri Optional[str] URI of metadata array holding VCF URIs, defaults to None None
metadata_attr str name of metadata attribute containing URIs, defaults to “uri” 'uri'
max_files Optional[int] maximum number of URIs to ingest, defaults to None None
batch_size int manifest batch size, defaults to MANIFEST_BATCH_SIZE MANIFEST_BATCH_SIZE
workers int maximum number of parallel workers, defaults to MANIFEST_WORKERS MANIFEST_WORKERS
extra_attrs Optional[Union[Sequence[str], str]] INFO/FORMAT fields to materialize, defaults to None None
vcf_attrs Optional[str] VCF with all INFO/FORMAT fields to materialize, defaults to None None
anchor_gap Optional[int] anchor gap for VCF dataset, defaults to None None
compression_level Optional[int] zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) None
verbose bool verbose logging, defaults to False False
aws_find_mode bool use AWS CLI to find VCFs, defaults to False False
disable_manifest bool disable manifest creation, defaults to False False
consolidate_resources Optional[Mapping[str, str]] manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES CONSOLIDATE_RESOURCES
manifest_resources Optional[Mapping[str, str]] manual override for manifest UDF resources, defaults to MANIFEST_RESOURCES MANIFEST_RESOURCES
create_resources Optional[Mapping[str, str]] manual override for create UDF resources, defaults to None None
read_vcf_uris_resources Optional[Mapping[str, str]] manual override for read VCF UDF resources, defaults to None None
filter_uri_resources Optional[Mapping[str, str]] manual override for filter VCF UDF resources, defaults to None None
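One way to picture batch_size is as the number of sample URIs handed to a single task, with workers capping how many tasks run at once. The sketch below only shows the batching step and assumes that interpretation. `make_batches` is a hypothetical helper, and the numbers are arbitrary rather than the values of MANIFEST_BATCH_SIZE or MANIFEST_WORKERS.

```python
def make_batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


sample_uris = [f"s3://example-bucket/vcfs/sample{i}.vcf.gz" for i in range(5)]
batches = make_batches(sample_uris, batch_size=2)
batch_lengths = [len(batch) for batch in batches]
```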

ingest_manifest_udf

cloud.vcf.ingestion.ingest_manifest_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    id='manifest',
    verbose=False,
)

Ingest sample URIs into the manifest array.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
sample_uris Sequence[str] sample URIs required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
id str profiler event id, defaults to “manifest” 'manifest'
verbose bool verbose logging, defaults to False False

ingest_samples_dag

cloud.vcf.ingestion.ingest_samples_dag(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    contigs=Contigs.ALL,
    threads=VCF_THREADS,
    batch_size=VCF_BATCH_SIZE,
    workers=VCF_WORKERS,
    max_samples=None,
    resume=True,
    verbose=False,
    create_index=True,
    trace_id=None,
    consolidate_stats=False,
    use_remote_tmp=False,
    sample_list_uri=None,
    ingest_resources=None,
    consolidate_resources=CONSOLIDATE_RESOURCES,
    filter_samples_resources=FILTER_SAMPLES_RESOURCES,
)

Create a DAG to ingest samples into the dataset.

Note: If sample_list_uri is provided, the manifest is not checked for existing samples.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
contigs Optional[Union[Sequence[str], Contigs]] contig mode (Contigs.ALL | Contigs.CHROMOSOMES | Contigs.OTHER | Contigs.ALL_DISABLE_MERGE) or list of contigs to ingest, defaults to Contigs.ALL Contigs.ALL
threads int number of threads to use per ingestion task, defaults to VCF_THREADS VCF_THREADS
batch_size int sample batch size, defaults to VCF_BATCH_SIZE VCF_BATCH_SIZE
workers int maximum number of parallel workers, defaults to VCF_WORKERS VCF_WORKERS
max_samples Optional[int] maximum number of samples to ingest, defaults to None (no limit) None
resume bool enable resume ingestion mode, defaults to True True
verbose bool verbose logging, defaults to False False
create_index bool force creation of a local index file, defaults to True True
trace_id Optional[str] trace ID for logging, defaults to None None
consolidate_stats bool consolidate the stats arrays, defaults to False False
use_remote_tmp bool use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs) False
sample_list_uri Optional[str] URI with a list of VCF URIs, defaults to None None
ingest_resources Optional[Mapping[str, str]] manual override for ingest UDF resources, defaults to None None
consolidate_resources Optional[Mapping[str, str]] manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES CONSOLIDATE_RESOURCES
filter_samples_resources Optional[Mapping[str, str]] manual override for filter samples UDF resources, defaults to FILTER_SAMPLES_RESOURCES FILTER_SAMPLES_RESOURCES

ingest_samples_udf

cloud.vcf.ingestion.ingest_samples_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    threads,
    memory_mb,
    sample_batch_size,
    contig_mode='all',
    contigs_to_keep_separate=None,
    contig_fragment_merging=True,
    resume=True,
    create_index=True,
    id='samples',
    verbose=False,
    trace_id=None,
    use_remote_tmp=False,
)

Ingest samples into the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
sample_uris Sequence[str] sample URIs required
threads int number of threads to use for ingestion required
memory_mb int memory to use for ingestion in MiB required
sample_batch_size int sample batch size to use for ingestion required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
contig_mode str ingestion mode, defaults to “all” 'all'
contigs_to_keep_separate Optional[Sequence[str]] list of contigs to keep separate, defaults to None None
contig_fragment_merging bool enable contig fragment merging, defaults to True True
resume bool enable resume ingestion mode, defaults to True True
create_index bool force creation of a local index file, defaults to True True
id str profiler event id, defaults to “samples” 'samples'
verbose bool verbose logging, defaults to False False
trace_id Optional[str] trace ID for logging, defaults to None None
use_remote_tmp bool use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs) False

ingest_vcf

cloud.vcf.ingestion.ingest_vcf(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    register_name=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    sample_list_uri=None,
    metadata_uri=None,
    metadata_attr='uri',
    max_files=None,
    max_samples=None,
    contigs=Contigs.ALL,
    resume=True,
    extra_attrs=DEFAULT_ATTRIBUTES,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    manifest_batch_size=MANIFEST_BATCH_SIZE,
    manifest_workers=MANIFEST_WORKERS,
    vcf_batch_size=VCF_BATCH_SIZE,
    vcf_workers=VCF_WORKERS,
    vcf_threads=VCF_THREADS,
    verbose=False,
    create_index=True,
    trace_id=None,
    consolidate_stats=True,
    aws_find_mode=False,
    use_remote_tmp=False,
    disable_manifest=False,
    ingest_resources=None,
    consolidate_resources=CONSOLIDATE_RESOURCES,
    manifest_resources=MANIFEST_RESOURCES,
    create_resources=None,
    read_vcf_uris_resources=None,
    filter_uri_resources=None,
    filter_samples_resources=FILTER_SAMPLES_RESOURCES,
)

Ingest samples into a dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
register_name Optional[str] name to register the dataset with on TileDB Cloud, defaults to None None
search_uri Optional[str] URI to search for VCF files, defaults to None None
pattern Optional[str] Unix shell style pattern to match when searching for VCF files, defaults to None None
ignore Optional[str] Unix shell style pattern to ignore when searching for VCF files, defaults to None None
sample_list_uri Optional[str] URI with a list of VCF URIs, defaults to None None
metadata_uri Optional[str] URI of metadata array holding VCF URIs, defaults to None None
metadata_attr str name of metadata attribute containing URIs, defaults to “uri” 'uri'
max_files Optional[int] maximum number of VCF URIs to read/find, defaults to None (no limit) None
max_samples Optional[int] maximum number of samples to ingest, defaults to None (no limit) None
contigs Optional[Union[Sequence[str], Contigs]] contig mode (Contigs.ALL | Contigs.CHROMOSOMES | Contigs.OTHER | Contigs.ALL_DISABLE_MERGE) or list of contigs to ingest, defaults to Contigs.ALL Contigs.ALL
resume bool enable resume ingestion mode, defaults to True True
extra_attrs Optional[Union[Sequence[str], str]] INFO/FORMAT fields to materialize, defaults to DEFAULT_ATTRIBUTES DEFAULT_ATTRIBUTES
vcf_attrs Optional[str] VCF with all INFO/FORMAT fields to materialize, defaults to None None
anchor_gap Optional[int] anchor gap for VCF dataset, defaults to None None
compression_level Optional[int] zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) None
manifest_batch_size int batch size for manifest ingestion, defaults to MANIFEST_BATCH_SIZE MANIFEST_BATCH_SIZE
manifest_workers int number of workers for manifest ingestion, defaults to MANIFEST_WORKERS MANIFEST_WORKERS
vcf_batch_size int batch size for VCF ingestion, defaults to VCF_BATCH_SIZE VCF_BATCH_SIZE
vcf_workers int number of workers for VCF ingestion, defaults to VCF_WORKERS VCF_WORKERS
vcf_threads int number of threads for VCF ingestion, defaults to VCF_THREADS VCF_THREADS
verbose bool verbose logging, defaults to False False
create_index bool force creation of a local index file, defaults to True True
trace_id Optional[str] trace ID for logging, defaults to None None
consolidate_stats bool consolidate the stats arrays, defaults to True True
aws_find_mode bool use AWS CLI to find VCFs, defaults to False False
use_remote_tmp bool use remote tmp space if VCFs need to be sorted and bgzipped, defaults to False (preferred for small VCFs) False
disable_manifest bool disable manifest creation, defaults to False False
ingest_resources Optional[Mapping[str, str]] manual override for ingest UDF resources, defaults to None None
consolidate_resources Optional[Mapping[str, str]] manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES CONSOLIDATE_RESOURCES
manifest_resources Optional[Mapping[str, str]] manual override for manifest UDF resources, defaults to MANIFEST_RESOURCES MANIFEST_RESOURCES
create_resources Optional[Mapping[str, str]] manual override for create UDF resources, defaults to None None
read_vcf_uris_resources Optional[Mapping[str, str]] manual override for read VCF UDF resources, defaults to None None
filter_uri_resources Optional[Mapping[str, str]] manual override for filter VCF UDF resources, defaults to None None
filter_samples_resources Optional[Mapping[str, str]] manual override for filter samples UDF resources, defaults to FILTER_SAMPLES_RESOURCES FILTER_SAMPLES_RESOURCES
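A call to ingest_vcf might be assembled as in the sketch below. The parameter names come from the signature above; everything else is an assumption for illustration: the bucket, namespace, and dataset URI are placeholders, `build_ingest_kwargs` and `run_ingest` are hypothetical helpers, and the import path `tiledb.cloud.vcf.ingestion` is a guess at where the module documented here as cloud.vcf.ingestion lives in an installed client.

```python
def build_ingest_kwargs(search_uri, namespace, max_files=None):
    """Collect keyword arguments for ingest_vcf using names from its signature."""
    return {
        "search_uri": search_uri,      # where to look for VCF files
        "pattern": "*.vcf.gz",         # Unix shell style pattern to match
        "ignore": "*.g.vcf.gz",        # Unix shell style pattern to skip
        "namespace": namespace,
        "max_files": max_files,        # None means no limit
        "resume": True,
        "verbose": True,
    }


def run_ingest(dataset_uri, **kwargs):
    """Submit the ingestion; requires the TileDB Cloud client and credentials."""
    # Assumption: the module is importable under this path.
    from tiledb.cloud.vcf import ingestion

    return ingestion.ingest_vcf(
        dataset_uri,
        contigs=ingestion.Contigs.CHROMOSOMES,
        **kwargs,
    )


kwargs = build_ingest_kwargs(
    "s3://example-bucket/vcfs/", "example-namespace", max_files=10
)
# Uncomment to submit against a real account (placeholder dataset URI):
# run_ingest("s3://example-bucket/datasets/demo", **kwargs)
```

Keeping the argument dictionary separate from the call makes it easy to inspect or log the settings before any cloud resources are used.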

ingest_vcf_annotations

cloud.vcf.ingestion.ingest_vcf_annotations(
    dataset_uri,
    *,
    vcf_uri=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    create_index=True,
    config=None,
    acn=None,
    namespace=None,
    register_name=None,
    verbose=False,
    ingest_resources=None,
    consolidate_resources=CONSOLIDATE_RESOURCES,
    find_uris_resources=None,
    create_resources=None,
    register_resources=None,
)

Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
vcf_uri Optional[str] VCF URI, defaults to None None
search_uri Optional[str] URI to search for VCF files, defaults to None None
pattern Optional[str] Unix shell style pattern to match when searching for VCF files, defaults to None None
ignore Optional[str] Unix shell style pattern to ignore when searching for VCF files, defaults to None None
create_index bool force creation of a local index file, defaults to True True
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
register_name Optional[str] name to register the dataset with on TileDB Cloud, defaults to None None
verbose bool verbose logging, defaults to False False
ingest_resources Optional[Mapping[str, str]] manual override for ingest UDF resources, defaults to None None
consolidate_resources Optional[Mapping[str, str]] manual override for consolidate UDF resources, defaults to CONSOLIDATE_RESOURCES CONSOLIDATE_RESOURCES
find_uris_resources Optional[Mapping[str, str]] manual override for find VCF UDF resources, defaults to None None
create_resources Optional[Mapping[str, str]] manual override for create UDF resources, defaults to None None
register_resources Optional[Mapping[str, str]] manual override for register UDF resources, defaults to None None

read_metadata_uris_udf

cloud.vcf.ingestion.read_metadata_uris_udf(
    dataset_uri,
    *,
    config=None,
    metadata_uri,
    metadata_attr='uri',
    max_files=None,
    verbose=False,
)

Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, Any]] TileDB config, defaults to None None
metadata_uri str metadata array URI required
metadata_attr str name of metadata attribute containing URIs, defaults to “uri” 'uri'
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs

read_uris_udf

cloud.vcf.ingestion.read_uris_udf(
    dataset_uri,
    list_uri,
    *,
    config=None,
    max_files=None,
    verbose=False,
)

Read a list of URIs from a URI.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
list_uri str URI of the list of URIs required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs
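The format of the list behind list_uri is not described here. Assuming a plain text file with one URI per line, reading it and applying max_files could look like the sketch below. `read_uri_list` is a hypothetical helper, and an in-memory stream stands in for the remote file.

```python
import io


def read_uri_list(stream, max_files=None):
    """Read one URI per line from a text stream, skipping blank lines."""
    uris = [line.strip() for line in stream if line.strip()]
    # Slicing with None as the bound keeps the whole list.
    return uris[:max_files]


text = "s3://example-bucket/a.vcf.gz\n\ns3://example-bucket/b.vcf.gz\ns3://example-bucket/c.vcf.gz\n"
all_uris = read_uri_list(io.StringIO(text))
first_two = read_uri_list(io.StringIO(text), max_files=2)
```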

register_dataset_udf

cloud.vcf.ingestion.register_dataset_udf(
    dataset_uri,
    *,
    register_name,
    acn,
    namespace=None,
    config=None,
    verbose=False,
)

Register the dataset on TileDB Cloud.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
register_name str name to register the dataset with on TileDB Cloud required
acn str Access Credentials Name (ACN) registered in TileDB Cloud (ARN type) required
namespace Optional[str] TileDB Cloud namespace, defaults to the user’s default namespace None
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
verbose bool verbose logging, defaults to False False