Dataset

Dataset(self, uri, mode='r', cfg=None, stats=False, verbose=False, tiledb_config=None)

A class that provides read/write access to a TileDB-VCF dataset.

Parameters

Name Type Description Default
uri str URI of the dataset. required
mode str Mode of operation (‘r’|‘w’) 'r'
cfg ReadConfig TileDB-VCF configuration. None
stats bool Enable internal TileDB statistics. False
verbose bool Enable verbose output. False
tiledb_config dict TileDB configuration, alternative to cfg.tiledb_config. None
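
As a minimal sketch of the constructor parameters above, the helper below opens a dataset after checking the mode. The helper name `open_dataset` is hypothetical, and the URI and configuration key in the closing comment are placeholders, not values taken from this reference.

```python
def open_dataset(uri, mode="r", tiledb_config=None):
    """Open a TileDB-VCF dataset, rejecting modes other than 'r' and 'w'."""
    if mode not in ("r", "w"):
        raise ValueError(f"mode must be 'r' or 'w', got {mode!r}")
    # Imported lazily so the helper can be defined without tiledbvcf installed.
    import tiledbvcf

    return tiledbvcf.Dataset(uri, mode=mode, tiledb_config=tiledb_config)


# Hypothetical usage (URI and config values are placeholders):
# ds = open_dataset("s3://my-bucket/my-dataset",
#                   tiledb_config={"vfs.s3.region": "us-east-1"})
```

Validating the mode up front is a design choice of this sketch; the library may perform its own checks.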

Methods

Name Description
attributes Return a list of queryable attributes available in the VCF dataset.
close Close the dataset and release resources.
continue_read Continue an incomplete read, returning the next batch as a Pandas DataFrame.
continue_read_arrow Continue an incomplete read, returning the next batch as a PyArrow Table.
count Count records in the dataset.
create_dataset Create a new dataset.
delete Delete the dataset.
export Exports data to multiple VCF files or a combined VCF file.
ingest_samples Ingest VCF files into the dataset.
read Read data from the dataset into a Pandas DataFrame.
read_allele_count Read allele count from the dataset into a Pandas DataFrame.
read_arrow Read data from the dataset into a PyArrow Table.
read_completed Returns true if the previous read operation was complete.
read_iter Iterator version of read().
read_variant_stats Read variant stats from the dataset into a Pandas DataFrame.
sample_count Get the number of samples in the dataset.
samples Get the list of samples in the dataset.
schema_version Get the VCF schema version of the dataset.
tiledb_stats Get TileDB stats as a string.
version Return the TileDB-VCF version used to create the dataset.

attributes

Dataset.attributes(attr_type='all')

Return a list of queryable attributes available in the VCF dataset.

Parameters

Name Type Description Default
attr_type str The subset of attributes to retrieve; “info” or “fmt” will only retrieve attributes ingested from the VCF INFO and FORMAT fields, respectively, “builtin” retrieves the static attributes defined in TileDB-VCF’s schema, “all” (the default) returns all queryable attributes. 'all'

Returns

Type Description
list A list of attribute names.
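
The `attr_type` values described above can be used to group attributes by origin. The following is a small sketch, assuming `ds` is an already opened `Dataset`; the function name `attributes_by_type` is hypothetical.

```python
def attributes_by_type(ds):
    """Return a dict mapping each attr_type subset to its attribute names."""
    return {
        attr_type: list(ds.attributes(attr_type=attr_type))
        for attr_type in ("info", "fmt", "builtin")
    }
```

Calling `ds.attributes()` with no argument would instead return every queryable attribute in a single list.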

close

Dataset.close()

Close the dataset and release resources.
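
One way to make sure `close()` is always called is to wrap the dataset in `contextlib.closing` from the Python standard library. The sketch below assumes `ds` is an open `Dataset`; the function name `summarize_dataset` is hypothetical.

```python
from contextlib import closing


def summarize_dataset(ds):
    """Collect basic metadata, closing the dataset even if a call fails."""
    with closing(ds):  # calls ds.close() when the block exits
        return {
            "sample_count": ds.sample_count(),
            "samples": ds.samples(),
            "schema_version": ds.schema_version(),
            "version": ds.version(),
        }
```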

continue_read

Dataset.continue_read(release_buffers=True)

Continue an incomplete read.

Parameters

Name Type Description Default
release_buffers bool Release the buffers after reading. True

Returns

Type Description
pd.DataFrame The next batch of data as a Pandas DataFrame.

continue_read_arrow

Dataset.continue_read_arrow(release_buffers=True)

Continue an incomplete read.

Parameters

Name Type Description Default
release_buffers bool Release the buffers after reading. True

Returns

Type Description
pa.Table The next batch of data as a PyArrow Table.

count

Dataset.count(samples=None, regions=None)

Count records in the dataset.

Parameters

Name Type Description Default
samples (str, List[str]) Sample names to include in the count. None
regions (str, List[str]) Genomic regions to include in the count. None

Returns

Type Description
int Number of intersecting records in the dataset.
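
As an illustration of the `regions` parameter, the sketch below calls `count()` once per region to build a per-region tally. It assumes `ds` is an open `Dataset`; the function name `count_by_region` and the region strings in the comment are placeholders.

```python
def count_by_region(ds, regions, samples=None):
    """Return a dict mapping each region string to its record count."""
    return {
        region: ds.count(samples=samples, regions=[region])
        for region in regions
    }


# Hypothetical usage (region strings are placeholders):
# counts = count_by_region(ds, ["chr1:1-1000000", "chr2:1-1000000"])
```

This issues one query per region, which keeps the sketch simple but may be slower than a single call with all regions when only the total is needed.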

create_dataset

Dataset.create_dataset(extra_attrs=None, vcf_attrs=None, tile_capacity=10000, anchor_gap=1000, checksum_type='sha256', allow_duplicates=True, enable_allele_count=True, enable_variant_stats=True, enable_sample_stats=True, compress_sample_dim=True, compression_level=4, variant_stats_version=2)

Create a new dataset.

Parameters

Name Type Description Default
extra_attrs str CSV list of extra attributes to materialize from fmt and info fields. None
vcf_attrs str URI of VCF file with all fmt and info fields to materialize in the dataset. None
tile_capacity int Tile capacity to use for the array schema. 10000
anchor_gap int Length of gaps between inserted anchor records in bases. 1000
checksum_type str Optional checksum type for the dataset, “sha256” or “md5”. 'sha256'
allow_duplicates bool Allow records with duplicate start positions to be written to the array. True
enable_allele_count bool Enable the allele count ingestion task. True
enable_variant_stats bool Enable the variant stats ingestion task. True
enable_sample_stats bool Enable the sample stats ingestion task. True
compress_sample_dim bool Enable compression on the sample dimension. True
compression_level int Compression level for zstd compression. 4
variant_stats_version int Version of the variant stats array. 2
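
Because `extra_attrs` is documented as a CSV string, a caller holding a Python list needs to join it first. The sketch below shows one way to do that, assuming `ds` was opened with `mode='w'`; the function name `create_with_extra_attrs` and the attribute names in the comment are placeholders.

```python
def create_with_extra_attrs(ds, attrs, **options):
    """Create the dataset, passing a list of attribute names as a CSV string."""
    extra_attrs = ",".join(attrs) if attrs else None
    # Remaining keyword options (e.g. tile_capacity) are forwarded unchanged.
    ds.create_dataset(extra_attrs=extra_attrs, **options)
    return extra_attrs


# Hypothetical usage (attribute names are placeholders):
# create_with_extra_attrs(ds, ["fmt_GT", "info_DP"], tile_capacity=10000)
```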

delete

Dataset.delete(uri, *, config=None)

Delete the dataset.

Parameters

Name Type Description Default
uri str URI of the dataset. required
config dict TileDB configuration. None

export

Dataset.export(samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, enable_progress_estimation=False, merge=False, output_format='z', output_path='', output_dir='.')

Exports data to multiple VCF files or a combined VCF file.

Parameters

Name Type Description Default
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. required
scan_all_samples Scan all samples when computing internal allele frequency. required
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False
merge bool Merge samples to create a combined VCF file. False
output_format str Export file format: ‘b’: bcf (compressed), ‘u’: bcf, ‘z’: vcf.gz, ‘v’: vcf. 'z'
output_path str Combined VCF output file. ''
output_dir str Directory used for local output of exported samples. '.'
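
The sketch below shows how the `merge`, `output_format`, and `output_path` parameters might be combined to write a single combined file. It assumes `ds` is an open `Dataset`; the function name `export_merged` and the up-front format check are choices of this example, not library behavior.

```python
VALID_OUTPUT_FORMATS = {"b", "u", "z", "v"}


def export_merged(ds, output_path, samples=None, regions=None, output_format="z"):
    """Export the selected samples and regions into one combined file."""
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    ds.export(
        samples=samples,
        regions=regions,
        merge=True,
        output_format=output_format,
        output_path=output_path,
    )


# Hypothetical usage (path and sample names are placeholders):
# export_merged(ds, "combined.vcf.gz", samples=["sample1", "sample2"])
```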

ingest_samples

Dataset.ingest_samples(sample_uris=None, threads=None, total_memory_budget_mb=None, total_memory_percentage=None, ratio_tiledb_memory=None, max_tiledb_memory_mb=None, input_record_buffer_mb=None, avg_vcf_record_size=None, ratio_task_size=None, ratio_output_flush=None, scratch_space_path=None, scratch_space_size=None, sample_batch_size=None, resume=False, contig_fragment_merging=True, contigs_to_keep_separate=None, contigs_to_allow_merging=None, contig_mode='all', thread_task_size=None, memory_budget_mb=None, record_limit=None)

Ingest VCF files into the dataset.

Parameters

Name Type Description Default
sample_uris List[str] List of sample URIs to ingest. None
threads int Set the number of threads used for ingestion. None
total_memory_budget_mb int Total memory budget for ingestion (MiB). None
total_memory_percentage float Percentage of total system memory used for ingestion (overrides ‘total_memory_budget_mb’). None
ratio_tiledb_memory float Ratio of memory budget allocated to TileDB::sm.mem.total_budget. None
max_tiledb_memory_mb int Maximum memory allocated to TileDB::sm.mem.total_budget (MiB). None
input_record_buffer_mb int Size of input record buffer for each sample file (MiB). None
avg_vcf_record_size int Average VCF record size (bytes). None
ratio_task_size float Ratio of worker task size to computed task size. None
ratio_output_flush float Ratio of output buffer capacity that triggers a flush to TileDB. None
scratch_space_path str Directory used for local storage of downloaded remote samples. None
scratch_space_size int Amount of local storage that can be used for downloading remote samples (MB). None
sample_batch_size int Number of samples per batch for ingestion (default 10). None
resume bool Whether to check and attempt to resume a partial completed ingestion. False
contig_fragment_merging bool Whether to enable merging of contigs into fragments. This overrides the contigs-to-keep-separate/contigs-to-allow-merging options. Contig fragment merging is generally beneficial: it is a performance optimization that reduces the number of prefixes on an S3/Azure/GCS bucket when there are many small pseudo-contigs. True
contigs_to_keep_separate List[str] List of contigs that should not be merged into combined fragments. The default list includes all standard human chromosomes in both UCSC (e.g., chr1) and Ensembl (e.g., 1) formats. None
contigs_to_allow_merging List[str] List of contigs that should be allowed to be merged into combined fragments. None
contig_mode str Select which contigs are ingested: ‘all’, ‘separate’, or ‘merged’. 'all'
thread_task_size int DEPRECATED - This parameter will be removed in a future release. None
memory_budget_mb int DEPRECATED - This parameter will be removed in a future release. None
record_limit int DEPRECATED - This parameter will be removed in a future release. None
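
Since three of the parameters above are marked DEPRECATED, a thin wrapper can keep them out of new code. The following is a sketch only, assuming `ds` was opened with `mode='w'`; the function name `ingest` and the decision to raise an error are choices of this example.

```python
DEPRECATED_INGEST_PARAMS = {"thread_task_size", "memory_budget_mb", "record_limit"}


def ingest(ds, sample_uris, **options):
    """Call ingest_samples(), refusing parameters documented as deprecated."""
    used = DEPRECATED_INGEST_PARAMS.intersection(options)
    if used:
        raise ValueError(f"deprecated ingest_samples parameters: {sorted(used)}")
    ds.ingest_samples(sample_uris=list(sample_uris), **options)


# Hypothetical usage (URIs and values are placeholders):
# ingest(ds, ["sample1.vcf.gz", "sample2.vcf.gz"], threads=4, resume=True)
```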

read

Dataset.read(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)

Read data from the dataset into a Pandas DataFrame.

For large datasets, a single call to read() may not be able to fit all results in memory. In that case, the returned DataFrame contains as many results as possible; call continue_read() to retrieve the remaining results.

You can also use the Python generator version, read_iter().

Parameters

Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter str Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. ''
scan_all_samples bool Scan all samples when computing internal allele frequency. False
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False

Returns

Type Description
pd.DataFrame Query results as a Pandas DataFrame.
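
The incomplete-read behavior described above can be handled with a loop over `read_completed()` and `continue_read()`. The generator below is a minimal sketch, assuming `ds` is an open `Dataset`; the name `read_all_batches` is hypothetical.

```python
def read_all_batches(ds, **read_kwargs):
    """Yield every DataFrame batch produced by one read() query."""
    yield ds.read(**read_kwargs)
    # Keep fetching until the dataset reports that all results were returned.
    while not ds.read_completed():
        yield ds.continue_read()


# Hypothetical usage (the region string is a placeholder):
# for df in read_all_batches(ds, regions=["chr1:1-1000000"]):
#     print(len(df))
```

If the full result is known to fit in memory, the batches can be combined with `pandas.concat(list(read_all_batches(ds)), ignore_index=True)`.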

read_allele_count

Dataset.read_allele_count(region=None)

Read allele count from the dataset into a Pandas DataFrame.

Parameters

Name Type Description Default
region str Genomic region to be queried. None

read_arrow

Dataset.read_arrow(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)

Read data from the dataset into a PyArrow Table.

For large queries, a single call to read_arrow() may not be able to fit all results in memory. In that case, the returned table contains as many results as possible; call continue_read_arrow() to retrieve the remaining results as PyArrow Tables.

Parameters

Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
skip_check_samples bool Skip checking if the samples in samples_file exist in the dataset. False
set_af_filter str Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. ''
scan_all_samples bool Scan all samples when computing internal allele frequency. False
enable_progress_estimation bool DEPRECATED - This parameter will be removed in a future release. False

Returns

Type Description
pa.Table Query results as a PyArrow Table.
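
For PyArrow output, the same batching pattern can be sketched with `continue_read_arrow()`. This example assumes `ds` is an open `Dataset` and that `read_completed()` also reflects the state of Arrow reads; the name `read_arrow_all` is hypothetical.

```python
def read_arrow_all(ds, **read_kwargs):
    """Return a list with every PyArrow Table batch of one read_arrow() query."""
    tables = [ds.read_arrow(**read_kwargs)]
    # Assumption: read_completed() reports on the preceding Arrow read as well.
    while not ds.read_completed():
        tables.append(ds.continue_read_arrow())
    return tables
```

The resulting list can be merged into one table with `pyarrow.concat_tables(tables)` when the combined result fits in memory.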

read_completed

Dataset.read_completed()

Returns True if the previous read operation was complete. A read is considered complete if the resulting DataFrame contained all results.

Returns

Type Description
bool True if the previous read operation was complete.

read_iter

Dataset.read_iter(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None)

Iterator version of read().

Parameters

Name Type Description Default
attrs List[str] List of attribute names to be read. DEFAULT_ATTRS
samples (str, List[str]) Sample names to be read. None
regions (str, List[str]) Genomic regions to be read. None
samples_file str URI of file containing sample names to be read, one per line. None
bed_file str URI of a BED file of genomic regions to be read. None
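
Because `read_iter()` is the iterator version of `read()`, results can be processed one batch at a time without holding them all in memory. The sketch below assumes `ds` is an open `Dataset` and that each yielded batch supports `len()`, as a Pandas DataFrame does; the name `count_rows` is hypothetical.

```python
def count_rows(ds, **query):
    """Sum the number of rows across all batches yielded by read_iter()."""
    return sum(len(batch) for batch in ds.read_iter(**query))


# Hypothetical usage (sample name and region are placeholders):
# total = count_rows(ds, samples=["sample1"], regions=["chr1:1-1000000"])
```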

read_variant_stats

Dataset.read_variant_stats(region=None)

Read variant stats from the dataset into a Pandas DataFrame.

Parameters

Name Type Description Default
region str Genomic region to be queried. None
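
Both `read_variant_stats()` and `read_allele_count()` take a single `region`, so querying several regions means calling them in a loop. The sketch below assumes `ds` is an open `Dataset` that was created with the variant stats and allele count tasks enabled; the function name `stats_by_region` is hypothetical.

```python
def stats_by_region(ds, regions):
    """Map each region to its (variant stats, allele count) DataFrames."""
    return {
        region: (
            ds.read_variant_stats(region=region),
            ds.read_allele_count(region=region),
        )
        for region in regions
    }


# Hypothetical usage (region strings are placeholders):
# results = stats_by_region(ds, ["chr1:1-1000000", "chr2:1-1000000"])
```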

sample_count

Dataset.sample_count()

Get the number of samples in the dataset.

Returns

Type Description
int Number of samples in the dataset.

samples

Dataset.samples()

Get the list of samples in the dataset.

Returns

Type Description
list List of samples in the dataset.

schema_version

Dataset.schema_version()

Get the VCF schema version of the dataset.

Returns

Type Description
int VCF schema version of the dataset.

tiledb_stats

Dataset.tiledb_stats()

Get TileDB stats as a string.

Returns

Type Description
str TileDB stats as a string.

version

Dataset.version()

Return the TileDB-VCF version used to create the dataset.

Returns

Type Description
str The TileDB-VCF version.