vcf.ingestion

cloud.vcf.ingestion

Classes

Name Description
Contigs The contigs to ingest.

Contigs

cloud.vcf.ingestion.Contigs()

The contigs to ingest.

ALL = all contigs
CHROMOSOMES = all human chromosomes
OTHER = all contigs other than the human chromosomes
ALL_DISABLE_MERGE = all contigs with merging disabled, for non-human datasets
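The four modes can be pictured with a small stand-in enum. This is only a sketch: `ContigMode`, `HUMAN_CHROMOSOMES`, and `select_contigs` are hypothetical names, and the chromosome naming (`chr1` to `chr22`, `chrX`, `chrY`, `chrM`) is an assumption, not the library's definition.

```python
from enum import Enum
from typing import List, Sequence


class ContigMode(Enum):
    """Hypothetical stand-in for the documented Contigs modes."""

    ALL = "all"
    CHROMOSOMES = "chromosomes"
    OTHER = "other"
    ALL_DISABLE_MERGE = "all_disable_merge"


# Assumed naming for the human chromosomes; the real list may differ.
HUMAN_CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def select_contigs(available: Sequence[str], mode: ContigMode) -> List[str]:
    """Return the contigs from `available` that the given mode would cover."""
    human = set(HUMAN_CHROMOSOMES)
    if mode is ContigMode.CHROMOSOMES:
        return [c for c in available if c in human]
    if mode is ContigMode.OTHER:
        return [c for c in available if c not in human]
    # ALL and ALL_DISABLE_MERGE both cover every contig; they differ only in
    # whether merging is enabled, which this sketch does not model.
    return list(available)
```

Under these assumptions, CHROMOSOMES and OTHER partition the available contigs, so running one mode and then the other would cover the same contigs as ALL.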

Functions

Name Description
consolidate_dataset_udf Consolidate arrays in the dataset.
create_dataset_udf Create a TileDB-VCF dataset.
create_manifest Create a manifest array in the dataset.
filter_samples_udf Return URIs for samples not already in the dataset.
filter_uris_udf Return URIs from sample_uris that are not in the manifest.
find_uris_aws_udf Find URIs matching a pattern in the search_uri path with an efficient implementation for S3.
find_uris_udf Find URIs matching a pattern in the search_uri path.
get_logger_wrapper Get a logger instance and log version information.
ingest_manifest_dag Create a DAG to load the manifest array.
ingest_manifest_udf Ingest sample URIs into the manifest array.
ingest_samples_dag Create a DAG to ingest samples into the dataset.
ingest_samples_udf Ingest samples into the dataset.
ingest_vcf Ingest samples into a dataset.
ingest_vcf_annotations Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF.
read_metadata_uris_udf Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument.
read_uris_udf Read a list of URIs from a URI.
register_dataset_udf Register the dataset on TileDB Cloud.

consolidate_dataset_udf

cloud.vcf.ingestion.consolidate_dataset_udf(
    dataset_uri,
    *,
    config=None,
    exclude=MANIFEST_ARRAY,
    include=None,
    id='consolidate',
    verbose=False,
)

Consolidate arrays in the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
exclude Optional[Union[Sequence[str], str]] group members to exclude, defaults to MANIFEST_ARRAY MANIFEST_ARRAY
include Optional[Union[Sequence[str], str]] group members to include, defaults to None None
id str profiler event id, defaults to “consolidate” 'consolidate'
verbose bool verbose logging, defaults to False False
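Both include and exclude accept either a single name or a sequence of names. The sketch below shows one plausible way such arguments could be normalized and applied to a list of group members. How the real function combines the two is not stated in this reference, so `select_members` and the member names used with it are hypothetical.

```python
from typing import List, Optional, Sequence, Union

Names = Optional[Union[Sequence[str], str]]


def _as_list(value: Names) -> Optional[List[str]]:
    """Normalize None, a single string, or a sequence of strings."""
    if value is None:
        return None
    if isinstance(value, str):
        return [value]
    return list(value)


def select_members(
    members: Sequence[str],
    include: Names = None,
    exclude: Names = None,
) -> List[str]:
    """Keep members named in `include` (all if None), then drop `exclude`."""
    included = _as_list(include)
    excluded = _as_list(exclude) or []
    return [
        m
        for m in members
        if (included is None or m in included) and m not in excluded
    ]
```

For example, `select_members(["data", "manifest"], exclude="manifest")` keeps only `"data"`, which mirrors the documented default of skipping the manifest array.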

create_dataset_udf

cloud.vcf.ingestion.create_dataset_udf(
    dataset_uri,
    *,
    config=None,
    extra_attrs=None,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    annotation_dataset=False,
    verbose=False,
)

Create a TileDB-VCF dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
extra_attrs Optional[Union[Sequence[str], str]] INFO/FORMAT fields to materialize, defaults to None None
vcf_attrs Optional[str] VCF with all INFO/FORMAT fields to materialize, defaults to None None
anchor_gap Optional[int] anchor gap for VCF dataset, defaults to None None
compression_level Optional[int] zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) None
annotation_dataset bool create an annotation dataset, defaults to False False
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
str dataset URI

create_manifest

cloud.vcf.ingestion.create_manifest(dataset_uri, group)

Create a manifest array in the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
group tiledb.Group dataset group required

filter_samples_udf

cloud.vcf.ingestion.filter_samples_udf(
    dataset_uri,
    *,
    config=None,
    verbose=False,
)

Return URIs for samples not already in the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] sample URIs

filter_uris_udf

cloud.vcf.ingestion.filter_uris_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    verbose=False,
)

Return URIs from sample_uris that are not in the manifest.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
sample_uris Sequence[str] sample URIs required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] filtered sample URIs
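The filtering step described for filter_uris_udf amounts to an order-preserving set difference. The sketch below assumes the manifest contents are already available as a plain iterable of URIs; `filter_new_uris` is a hypothetical helper, and dropping duplicate inputs is an extra choice made here, not documented behavior.

```python
from typing import Iterable, List, Sequence


def filter_new_uris(
    sample_uris: Sequence[str],
    manifest_uris: Iterable[str],
) -> List[str]:
    """Return URIs from `sample_uris` that are absent from the manifest."""
    known = set(manifest_uris)  # assumed: manifest entries are keyed by URI
    seen = set()
    result = []
    for uri in sample_uris:
        # Skip URIs already recorded, and repeats within the input itself.
        if uri in known or uri in seen:
            continue
        seen.add(uri)
        result.append(uri)
    return result
```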

find_uris_aws_udf

cloud.vcf.ingestion.find_uris_aws_udf(
    dataset_uri,
    search_uri,
    *,
    config=None,
    include=None,
    exclude=None,
    max_files=None,
    verbose=False,
)

Find URIs matching a pattern in the search_uri path with an efficient implementation for S3.

include and exclude patterns are Unix shell style (see fnmatch module).

Parameters

Name Type Description Default
dataset_uri str dataset URI required
search_uri str URI to search for VCF files required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
include Optional[str] include pattern used in the search, defaults to None None
exclude Optional[str] exclude pattern applied to the search results, defaults to None None
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs

find_uris_udf

cloud.vcf.ingestion.find_uris_udf(
    dataset_uri,
    search_uri,
    *,
    config=None,
    include=None,
    exclude=None,
    max_files=None,
    verbose=False,
)

Find URIs matching a pattern in the search_uri path.

include and exclude patterns are Unix shell style (see fnmatch module).

Parameters

Name Type Description Default
dataset_uri str dataset URI required
search_uri str URI to search for VCF files required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
include Optional[str] include pattern used in the search, defaults to None None
exclude Optional[str] exclude pattern applied to the search results, defaults to None None
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs
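To illustrate how Unix shell style include and exclude patterns behave, the sketch below applies them with the standard library's fnmatch module to a list of URIs that has already been collected. `match_uris` is a hypothetical helper; matching against the full URI rather than the file name, and the use of the case-sensitive `fnmatchcase`, are assumptions.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Sequence


def match_uris(
    uris: Sequence[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
    max_files: Optional[int] = None,
) -> List[str]:
    """Filter URIs with shell-style patterns, stopping at `max_files`."""
    matched: List[str] = []
    for uri in uris:
        if max_files is not None and len(matched) >= max_files:
            break
        # Keep only URIs that match the include pattern, if one is given.
        if include is not None and not fnmatchcase(uri, include):
            continue
        # Drop URIs that match the exclude pattern, if one is given.
        if exclude is not None and fnmatchcase(uri, exclude):
            continue
        matched.append(uri)
    return matched
```

Note that in fnmatch patterns `*` also matches `/`, so `*.vcf.gz` matches files at any depth, while index files such as `a.vcf.gz.tbi` are left out because the pattern must match the whole string.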

get_logger_wrapper

cloud.vcf.ingestion.get_logger_wrapper(verbose=False)

Get a logger instance and log version information.

Parameters

Name Type Description Default
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
logging.Logger logger instance

ingest_manifest_dag

cloud.vcf.ingestion.ingest_manifest_dag(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    sample_list_uri=None,
    metadata_uri=None,
    metadata_attr='uri',
    max_files=None,
    batch_size=MANIFEST_BATCH_SIZE,
    workers=MANIFEST_WORKERS,
    extra_attrs=None,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    verbose=False,
    aws_find_mode=False,
    disable_manifest=False,
)

Create a DAG to load the manifest array.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
search_uri Optional[str] URI to search for VCF files, defaults to None None
pattern Optional[str] pattern to match when searching for VCF files, defaults to None None
ignore Optional[str] pattern to ignore when searching for VCF files, defaults to None None
sample_list_uri Optional[str] URI with a list of VCF URIs, defaults to None None
metadata_uri Optional[str] URI of metadata array holding VCF URIs, defaults to None None
metadata_attr str name of metadata attribute containing URIs, defaults to “uri” 'uri'
max_files Optional[int] maximum number of URIs to ingest, defaults to None None
batch_size int manifest batch size, defaults to MANIFEST_BATCH_SIZE MANIFEST_BATCH_SIZE
workers int maximum number of parallel workers, defaults to MANIFEST_WORKERS MANIFEST_WORKERS
extra_attrs Optional[Union[Sequence[str], str]] INFO/FORMAT fields to materialize, defaults to None None
vcf_attrs Optional[str] VCF with all INFO/FORMAT fields to materialize, defaults to None None
anchor_gap Optional[int] anchor gap for VCF dataset, defaults to None None
compression_level Optional[int] zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) None
verbose bool verbose logging, defaults to False False
aws_find_mode bool use AWS CLI to find VCFs, defaults to False False
disable_manifest bool disable manifest creation, defaults to False False

ingest_manifest_udf

cloud.vcf.ingestion.ingest_manifest_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    id='manifest',
    verbose=False,
)

Ingest sample URIs into the manifest array.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
sample_uris Sequence[str] sample URIs required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
id str profiler event id, defaults to “manifest” 'manifest'
verbose bool verbose logging, defaults to False False

ingest_samples_dag

cloud.vcf.ingestion.ingest_samples_dag(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    contigs=Contigs.ALL,
    threads=VCF_THREADS,
    batch_size=VCF_BATCH_SIZE,
    workers=VCF_WORKERS,
    max_samples=None,
    resume=True,
    ingest_resources=None,
    verbose=False,
    create_index=True,
    trace_id=None,
    consolidate_stats=False,
    use_remote_tmp=False,
    sample_list_uri=None,
)

Create a DAG to ingest samples into the dataset.

Note: If sample_list_uri is provided, the manifest is not checked for existing samples.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
contigs Optional[Union[Sequence[str], Contigs]] contig mode (Contigs.ALL | Contigs.CHROMOSOMES | Contigs.OTHER | Contigs.ALL_DISABLE_MERGE) or list of contigs to ingest, defaults to Contigs.ALL Contigs.ALL
threads int number of threads to use per ingestion task, defaults to VCF_THREADS VCF_THREADS
batch_size int sample batch size, defaults to VCF_BATCH_SIZE VCF_BATCH_SIZE
workers int maximum number of parallel workers, defaults to VCF_WORKERS VCF_WORKERS
max_samples Optional[int] maximum number of samples to ingest, defaults to None (no limit) None
resume bool enable resume ingestion mode, defaults to True True
ingest_resources Optional[Mapping[str, str]] manual override for ingest UDF resources, defaults to None None
verbose bool verbose logging, defaults to False False
create_index bool force creation of a local index file, defaults to True True
trace_id Optional[str] trace ID for logging, defaults to None None
consolidate_stats bool consolidate the stats arrays, defaults to False False
use_remote_tmp bool use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs) False
sample_list_uri Optional[str] URI with a list of VCF URIs, defaults to None None
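The batch_size, workers, and max_samples parameters interact roughly as follows. This is a minimal sketch of the general idea, not the DAG's actual scheduling logic: `make_batches` and `plan_batches` are hypothetical helpers, and the assumption is that each batch becomes one task, with at most `workers` tasks running at once.

```python
from typing import Any, Dict, List, Optional, Sequence


def make_batches(sample_uris: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the URIs into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(sample_uris[i : i + batch_size])
        for i in range(0, len(sample_uris), batch_size)
    ]


def plan_batches(
    sample_uris: Sequence[str],
    batch_size: int,
    workers: int,
    max_samples: Optional[int] = None,
) -> Dict[str, Any]:
    """Summarize how many batches and parallel tasks a run would need."""
    uris = list(sample_uris)
    if max_samples is not None:
        uris = uris[:max_samples]  # cap the number of samples first
    batches = make_batches(uris, batch_size)
    return {
        "batches": batches,
        "num_batches": len(batches),
        # No point in more parallel tasks than there are batches.
        "parallel_tasks": min(workers, len(batches)),
    }
```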

ingest_samples_udf

cloud.vcf.ingestion.ingest_samples_udf(
    dataset_uri,
    sample_uris,
    *,
    config=None,
    threads,
    memory_mb,
    sample_batch_size,
    contig_mode='all',
    contigs_to_keep_separate=None,
    contig_fragment_merging=True,
    resume=True,
    create_index=True,
    id='samples',
    verbose=False,
    trace_id=None,
    use_remote_tmp=False,
)

Ingest samples into the dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
sample_uris Sequence[str] sample URIs required
threads int number of threads to use for ingestion required
memory_mb int memory to use for ingestion in MiB required
sample_batch_size int sample batch size to use for ingestion required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
contig_mode str ingestion mode, defaults to “all” 'all'
contigs_to_keep_separate Optional[Sequence[str]] list of contigs to keep separate, defaults to None None
contig_fragment_merging bool enable contig fragment merging, defaults to True True
resume bool enable resume ingestion mode, defaults to True True
create_index bool force creation of a local index file, defaults to True True
id str profiler event id, defaults to “samples” 'samples'
verbose bool verbose logging, defaults to False False
trace_id Optional[str] trace ID for logging, defaults to None None
use_remote_tmp bool use remote tmp space if VCFs need to be bgzipped, defaults to False (preferred for small VCFs) False

ingest_vcf

cloud.vcf.ingestion.ingest_vcf(
    dataset_uri,
    *,
    acn=None,
    config=None,
    namespace=None,
    register_name=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    sample_list_uri=None,
    metadata_uri=None,
    metadata_attr='uri',
    max_files=None,
    max_samples=None,
    contigs=Contigs.ALL,
    resume=True,
    extra_attrs=DEFAULT_ATTRIBUTES,
    vcf_attrs=None,
    anchor_gap=None,
    compression_level=None,
    manifest_batch_size=MANIFEST_BATCH_SIZE,
    manifest_workers=MANIFEST_WORKERS,
    vcf_batch_size=VCF_BATCH_SIZE,
    vcf_workers=VCF_WORKERS,
    vcf_threads=VCF_THREADS,
    ingest_resources=None,
    verbose=False,
    create_index=True,
    trace_id=None,
    consolidate_stats=True,
    aws_find_mode=False,
    use_remote_tmp=False,
    disable_manifest=False,
)

Ingest samples into a dataset.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
register_name Optional[str] name to register the dataset with on TileDB Cloud, defaults to None None
search_uri Optional[str] URI to search for VCF files, defaults to None None
pattern Optional[str] Unix shell style pattern to match when searching for VCF files, defaults to None None
ignore Optional[str] Unix shell style pattern to ignore when searching for VCF files, defaults to None None
sample_list_uri Optional[str] URI with a list of VCF URIs, defaults to None None
metadata_uri Optional[str] URI of metadata array holding VCF URIs, defaults to None None
metadata_attr str name of metadata attribute containing URIs, defaults to “uri” 'uri'
max_files Optional[int] maximum number of VCF URIs to read/find, defaults to None (no limit) None
max_samples Optional[int] maximum number of samples to ingest, defaults to None (no limit) None
contigs Optional[Union[Sequence[str], Contigs]] contig mode (Contigs.ALL | Contigs.CHROMOSOMES | Contigs.OTHER | Contigs.ALL_DISABLE_MERGE) or list of contigs to ingest, defaults to Contigs.ALL Contigs.ALL
resume bool enable resume ingestion mode, defaults to True True
extra_attrs Optional[Union[Sequence[str], str]] INFO/FORMAT fields to materialize, defaults to DEFAULT_ATTRIBUTES DEFAULT_ATTRIBUTES
vcf_attrs Optional[str] VCF with all INFO/FORMAT fields to materialize, defaults to None None
anchor_gap Optional[int] anchor gap for VCF dataset, defaults to None None
compression_level Optional[int] zstd compression level for the VCF dataset, defaults to None (uses the default level in TileDB-VCF) None
manifest_batch_size int batch size for manifest ingestion, defaults to MANIFEST_BATCH_SIZE MANIFEST_BATCH_SIZE
manifest_workers int number of workers for manifest ingestion, defaults to MANIFEST_WORKERS MANIFEST_WORKERS
vcf_batch_size int batch size for VCF ingestion, defaults to VCF_BATCH_SIZE VCF_BATCH_SIZE
vcf_workers int number of workers for VCF ingestion, defaults to VCF_WORKERS VCF_WORKERS
vcf_threads int number of threads for VCF ingestion, defaults to VCF_THREADS VCF_THREADS
ingest_resources Optional[Mapping[str, str]] manual override for ingest UDF resources, defaults to None None
verbose bool verbose logging, defaults to False False
create_index bool force creation of a local index file, defaults to True True
trace_id Optional[str] trace ID for logging, defaults to None None
consolidate_stats bool consolidate the stats arrays, defaults to True True
aws_find_mode bool use AWS CLI to find VCFs, defaults to False False
use_remote_tmp bool use remote tmp space if VCFs need to be sorted and bgzipped, defaults to False (preferred for small VCFs) False
disable_manifest bool disable manifest creation, defaults to False False
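A typical call passes only a handful of these keyword arguments. The sketch below assembles them into a dictionary and defers the actual call to a separate function. The helper names are hypothetical, every URI and pattern is a placeholder, and the import path `tiledb.cloud.vcf.ingestion` is an assumption based on the module name shown in this reference.

```python
from typing import Any, Dict, Optional


def build_ingest_vcf_kwargs(
    dataset_uri: str,
    *,
    search_uri: Optional[str] = None,
    pattern: Optional[str] = None,
    ignore: Optional[str] = None,
    max_files: Optional[int] = None,
    namespace: Optional[str] = None,
    acn: Optional[str] = None,
    register_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect ingest_vcf arguments, omitting those left at None."""
    kwargs = {
        "dataset_uri": dataset_uri,
        "search_uri": search_uri,
        "pattern": pattern,
        "ignore": ignore,
        "max_files": max_files,
        "namespace": namespace,
        "acn": acn,
        "register_name": register_name,
    }
    # Unset options fall back to the documented defaults.
    return {key: value for key, value in kwargs.items() if value is not None}


def run_ingest(kwargs: Dict[str, Any]) -> Any:
    """Call ingest_vcf; requires the TileDB Cloud client and credentials."""
    # Assumed import path; adjust if the package layout differs.
    from tiledb.cloud.vcf.ingestion import ingest_vcf

    return ingest_vcf(**kwargs)
```

Building the arguments separately makes it easy to log or review them before submitting a potentially long-running ingestion.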

ingest_vcf_annotations

cloud.vcf.ingestion.ingest_vcf_annotations(
    dataset_uri,
    *,
    vcf_uri=None,
    search_uri=None,
    pattern=None,
    ignore=None,
    create_index=True,
    config=None,
    acn=None,
    namespace=None,
    register_name=None,
    ingest_resources=None,
    verbose=False,
)

Ingest annotation VCF into a dataset. For example, a ClinVar or gnomAD VCF.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
vcf_uri Optional[str] VCF URI, defaults to None None
search_uri Optional[str] URI to search for VCF files, defaults to None None
pattern Optional[str] Unix shell style pattern to match when searching for VCF files, defaults to None None
ignore Optional[str] Unix shell style pattern to ignore when searching for VCF files, defaults to None None
create_index bool force creation of a local index file, defaults to True True
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None None
namespace Optional[str] TileDB-Cloud namespace, defaults to None None
register_name Optional[str] name to register the dataset with on TileDB Cloud, defaults to None None
ingest_resources Optional[Mapping[str, str]] manual override for ingest UDF resources, defaults to None None
verbose bool verbose logging, defaults to False False

read_metadata_uris_udf

cloud.vcf.ingestion.read_metadata_uris_udf(
    dataset_uri,
    *,
    config=None,
    metadata_uri,
    metadata_attr='uri',
    max_files=None,
    verbose=False,
)

Read a list of URIs from a TileDB array. The URIs will be read from the attribute specified in the metadata_attr argument.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
config Optional[Mapping[str, Any]] TileDB config, defaults to None None
metadata_uri str metadata array URI required
metadata_attr str name of metadata attribute containing URIs, defaults to “uri” 'uri'
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs

read_uris_udf

cloud.vcf.ingestion.read_uris_udf(
    dataset_uri,
    list_uri,
    *,
    config=None,
    max_files=None,
    verbose=False,
)

Read a list of URIs from a URI.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
list_uri str URI of the list of URIs required
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
max_files Optional[int] maximum number of URIs returned, defaults to None None
verbose bool verbose logging, defaults to False False

Returns

Name Type Description
Sequence[str] list of URIs
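The format of the list behind list_uri is not described in this reference. Assuming a plain text file with one URI per line, the parsing step might look like the sketch below; `parse_uri_list` is a hypothetical helper that works on text already read from the file.

```python
from typing import List, Optional


def parse_uri_list(text: str, max_files: Optional[int] = None) -> List[str]:
    """Parse one URI per line, skipping blank lines, up to `max_files`."""
    uris = [line.strip() for line in text.splitlines() if line.strip()]
    if max_files is None:
        return uris
    return uris[:max_files]
```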

register_dataset_udf

cloud.vcf.ingestion.register_dataset_udf(
    dataset_uri,
    *,
    register_name,
    acn,
    namespace=None,
    config=None,
    verbose=False,
)

Register the dataset on TileDB Cloud.

Parameters

Name Type Description Default
dataset_uri str dataset URI required
register_name str name to register the dataset with on TileDB Cloud required
acn str Access Credentials Name (ACN) registered in TileDB Cloud (ARN type) required
namespace Optional[str] TileDB Cloud namespace, defaults to the user’s default namespace None
config Optional[Mapping[str, Any]] config dictionary, defaults to None None
verbose bool verbose logging, defaults to False False