Dataset

Dataset(self, uri, mode='r', cfg=None, stats=False, verbose=False, tiledb_config=None)

A class that provides read/write access to a TileDB-VCF dataset.

Parameters

Name	Type	Description	Default
`uri`	str	URI of the dataset.	required
`mode`	str	Mode of operation: ‘r’ (read) or ‘w’ (write).	`'r'`
`cfg`	ReadConfig	TileDB-VCF configuration.	`None`
`stats`	bool	Enable internal TileDB statistics.	`False`
`verbose`	bool	Enable verbose output.	`False`
`tiledb_config`	dict	TileDB configuration, alternative to `cfg.tiledb_config`.	`None`

Methods

Name	Description
attributes	Return a list of queryable attributes available in the VCF dataset.
close	Close the dataset and release resources.
continue_read	Continue an incomplete read.
continue_read_arrow	Continue an incomplete read, returning a PyArrow Table.
count	Count records in the dataset.
create_dataset	Create a new dataset.
delete	Delete the dataset.
export	Exports data to multiple VCF files or a combined VCF file.
ingest_samples	Ingest VCF files into the dataset.
read	Read data from the dataset into a Pandas DataFrame.
read_allele_count	Read allele count from the dataset into a Pandas DataFrame.
read_arrow	Read data from the dataset into a PyArrow Table.
read_completed	Returns true if the previous read operation was complete.
read_iter	Iterator version of `read()`.
read_variant_stats	Read variant stats from the dataset into a Pandas DataFrame.
sample_count	Get the number of samples in the dataset.
samples	Get the list of samples in the dataset.
schema_version	Get the VCF schema version of the dataset.
tiledb_stats	Get TileDB stats as a string.
version	Return the TileDB-VCF version used to create the dataset.

attributes

Dataset.attributes(attr_type='all')

Return a list of queryable attributes available in the VCF dataset.

Parameters

Name	Type	Description	Default
`attr_type`	str	The subset of attributes to retrieve; “info” or “fmt” will only retrieve attributes ingested from the VCF INFO and FORMAT fields, respectively, “builtin” retrieves the static attributes defined in TileDB-VCF’s schema, “all” (the default) returns all queryable attributes.	`'all'`

Returns

Type	Description
list	A list of attribute names.
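
As a sketch of how `attributes()` might be used to check a request before reading, the helper below splits a list of requested names into those the dataset reports as queryable and those it does not. The function name `select_available_attrs` is hypothetical, and `ds` is assumed to be an open `Dataset`.

```python
def select_available_attrs(ds, wanted, attr_type="all"):
    """Split `wanted` into names the dataset can serve and names it cannot.

    `ds` is assumed to be an open Dataset; only ds.attributes() is called.
    The original order of `wanted` is preserved in both result lists.
    """
    available = set(ds.attributes(attr_type=attr_type))
    present = [name for name in wanted if name in available]
    missing = [name for name in wanted if name not in available]
    return present, missing
```

The `present` list can then be passed as `attrs` to `read()`; how to handle `missing` is left to the caller.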

close

Dataset.close()

Close the dataset and release resources.
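
One way to make sure `close()` is always called, even when an error occurs, is to wrap the dataset in the standard library's `contextlib.closing`. This is a minimal sketch: `with_dataset` is a hypothetical helper, and `open_dataset` is assumed to be the `Dataset` class (or any callable accepting `uri` and `mode`), passed in so the sketch does not import the library itself.

```python
from contextlib import closing


def with_dataset(open_dataset, uri, action, mode="r"):
    """Open a dataset, run `action(ds)`, and always call ds.close().

    `open_dataset` is assumed to accept `uri` and a `mode` keyword, as the
    Dataset constructor does. closing() calls ds.close() when the block exits,
    whether it exits normally or through an exception.
    """
    with closing(open_dataset(uri, mode=mode)) as ds:
        return action(ds)
```

For example, assuming the class is available as `tiledbvcf.Dataset` and `"my_dataset"` is a placeholder URI, `with_dataset(tiledbvcf.Dataset, "my_dataset", lambda ds: ds.sample_count())` would return the sample count and then close the dataset.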

continue_read

Dataset.continue_read(release_buffers=True)

Continue an incomplete read.

Parameters

Name	Type	Description	Default
`release_buffers`	bool	Release the buffers after reading.	`True`

Returns

Type	Description
pd.DataFrame	The next batch of data as a Pandas DataFrame.

continue_read_arrow

Dataset.continue_read_arrow(release_buffers=True)

Continue an incomplete read.

Parameters

Name	Type	Description	Default
`release_buffers`	bool	Release the buffers after reading.	`True`

Returns

Type	Description
pa.Table	The next batch of data as a PyArrow Table.

count

Dataset.count(samples=None, regions=None)

Count records in the dataset.

Parameters

Name	Type	Description	Default
`samples`	(str, List[str])	Sample names to include in the count.	`None`
`regions`	(str, List[str])	Genomic regions to include in the count.	`None`

Returns

Type	Description
int	Number of intersecting records in the dataset.
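
The sketch below shows one possible way to call `count()` once per region and collect the results. Both helper names are hypothetical, and the `contig:start-end` region string format is an assumption that this reference does not spell out.

```python
def format_region(contig, start, end):
    """Build a region string. The 'contig:start-end' form is an assumption."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{contig}:{start}-{end}"


def count_per_region(ds, regions, samples=None):
    """Return {region: record count}, calling ds.count() once per region.

    `ds` is assumed to be a Dataset opened in read mode.
    """
    return {
        region: ds.count(samples=samples, regions=[region])
        for region in regions
    }
```

Calling `count()` separately for each region gives a per-region breakdown at the cost of one query per region; passing the full list in a single call returns only the combined total.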

create_dataset

Dataset.create_dataset(extra_attrs=None, vcf_attrs=None, tile_capacity=10000, anchor_gap=1000, checksum_type='sha256', allow_duplicates=True, enable_allele_count=True, enable_variant_stats=True, enable_sample_stats=True, compress_sample_dim=True, compression_level=4, variant_stats_version=2)

Create a new dataset.

Parameters

Name	Type	Description	Default
`extra_attrs`	str	CSV list of extra attributes to materialize from fmt and info fields.	`None`
`vcf_attrs`	str	URI of VCF file with all fmt and info fields to materialize in the dataset.	`None`
`tile_capacity`	int	Tile capacity to use for the array schema.	`10000`
`anchor_gap`	int	Length of gaps between inserted anchor records in bases.	`1000`
`checksum_type`	str	Optional checksum type for the dataset, “sha256” or “md5”.	`'sha256'`
`allow_duplicates`	bool	Allow records with duplicate start positions to be written to the array.	`True`
`enable_allele_count`	bool	Enable the allele count ingestion task.	`True`
`enable_variant_stats`	bool	Enable the variant stats ingestion task.	`True`
`enable_sample_stats`	bool	Enable the sample stats ingestion task.	`True`
`compress_sample_dim`	bool	Enable compression on the sample dimension.	`True`
`compression_level`	int	Compression level for zstd compression.	`4`
`variant_stats_version`	int	Version of the variant stats array.	`2`
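
Because `extra_attrs` takes a single CSV string rather than a list, a small wrapper can do the joining. The helper below is a hypothetical sketch, assuming `ds` is a `Dataset` opened with `mode='w'`; any attribute names passed to it are the caller's own.

```python
def create_with_attrs(ds, attr_names=None, **options):
    """Call ds.create_dataset() with a list of names joined into CSV form.

    `attr_names` is a list such as ["fmt_GT", "info_DP"] (placeholder names).
    Other keyword options, e.g. checksum_type or tile_capacity, are forwarded
    unchanged. Returns the CSV string that was passed, or None.
    """
    extra_attrs = ",".join(attr_names) if attr_names else None
    ds.create_dataset(extra_attrs=extra_attrs, **options)
    return extra_attrs
```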

delete

Dataset.delete(uri, *, config=None)

Delete the dataset.

Parameters

Name	Type	Description	Default
`uri`	str	URI of the dataset.	required
`config`	dict	TileDB configuration.	`None`

export

Dataset.export(samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, enable_progress_estimation=False, merge=False, output_format='z', output_path='', output_dir='.')

Exports data to multiple VCF files or a combined VCF file.

Parameters

Name	Type	Description	Default
`samples`	(str, List[str])	Sample names to be read.	`None`
`regions`	(str, List[str])	Genomic regions to be read.	`None`
`samples_file`	str	URI of file containing sample names to be read, one per line.	`None`
`bed_file`	str	URI of a BED file of genomic regions to be read.	`None`
`skip_check_samples`	bool	Skip checking if the samples in `samples_file` exist in the dataset.	`False`
`set_af_filter`	str	Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”.	required
`scan_all_samples`	bool	Scan all samples when computing internal allele frequency.	required
`enable_progress_estimation`	bool	DEPRECATED - This parameter will be removed in a future release.	`False`
`merge`	bool	Merge samples to create a combined VCF file.	`False`
`output_format`	str	Export file format: ‘b’: bcf (compressed), ‘u’: bcf (uncompressed), ‘z’: vcf.gz, ‘v’: vcf.	`'z'`
`output_path`	str	Combined VCF output file.	`''`
`output_dir`	str	Directory used for local output of exported samples.	`'.'`
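
As an illustration of the `merge`, `output_format`, and `output_path` parameters working together, the following sketch wraps `export()` to produce one combined file. The helper name and its two validation checks are this sketch's own additions, not library behavior.

```python
OUTPUT_FORMATS = {"b", "u", "z", "v"}  # format codes listed in the table above


def export_merged(ds, output_path, samples=None, regions=None, output_format="z"):
    """Export the selected samples into one combined file via ds.export().

    `ds` is assumed to be a Dataset opened in read mode. The checks below are
    conveniences of this sketch; the library may validate differently.
    """
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    if not output_path:
        raise ValueError("output_path must be set for a combined export")
    ds.export(
        samples=samples,
        regions=regions,
        merge=True,
        output_format=output_format,
        output_path=output_path,
    )
```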

ingest_samples

Dataset.ingest_samples(sample_uris=None, threads=None, total_memory_budget_mb=None, total_memory_percentage=None, ratio_tiledb_memory=None, max_tiledb_memory_mb=None, input_record_buffer_mb=None, avg_vcf_record_size=None, ratio_task_size=None, ratio_output_flush=None, scratch_space_path=None, scratch_space_size=None, sample_batch_size=None, resume=False, contig_fragment_merging=True, contigs_to_keep_separate=None, contigs_to_allow_merging=None, contig_mode='all', thread_task_size=None, memory_budget_mb=None, record_limit=None)

Ingest VCF files into the dataset.

Parameters

Name	Type	Description	Default
`sample_uris`	List[str]	List of sample URIs to ingest.	`None`
`threads`	int	Set the number of threads used for ingestion.	`None`
`total_memory_budget_mb`	int	Total memory budget for ingestion (MiB).	`None`
`total_memory_percentage`	float	Percentage of total system memory used for ingestion (overrides ‘total_memory_budget_mb’).	`None`
`ratio_tiledb_memory`	float	Ratio of memory budget allocated to `TileDB::sm.mem.total_budget`.	`None`
`max_tiledb_memory_mb`	int	Maximum memory allocated to TileDB::sm.mem.total_budget (MiB).	`None`
`input_record_buffer_mb`	int	Size of input record buffer for each sample file (MiB).	`None`
`avg_vcf_record_size`	int	Average VCF record size (bytes).	`None`
`ratio_task_size`	float	Ratio of worker task size to computed task size.	`None`
`ratio_output_flush`	float	Ratio of output buffer capacity that triggers a flush to TileDB.	`None`
`scratch_space_path`	str	Directory used for local storage of downloaded remote samples.	`None`
`scratch_space_size`	int	Amount of local storage that can be used for downloading remote samples (MB).	`None`
`sample_batch_size`	int	Number of samples per batch for ingestion (default 10).	`None`
`resume`	bool	Whether to check for and attempt to resume a partially completed ingestion.	`False`
`contig_fragment_merging`	bool	Whether to enable merging of contigs into fragments. This overrides the contigs-to-keep-separate/contigs-to-allow-merging options. Contig fragment merging is generally beneficial: it is a performance optimization that reduces the number of prefixes on an S3/Azure/GCS bucket when there are many small pseudo-contigs.	`True`
`contigs_to_keep_separate`	List[str]	List of contigs that should not be merged into combined fragments. The default list includes all standard human chromosomes in both UCSC (e.g., chr1) and Ensembl (e.g., 1) formats.	`None`
`contigs_to_allow_merging`	List[str]	List of contigs that should be allowed to be merged into combined fragments.	`None`
`contig_mode`	str	Select which contigs are ingested: ‘all’, ‘separate’, or ‘merged’.	`'all'`
`thread_task_size`	int	DEPRECATED - This parameter will be removed in a future release.	`None`
`memory_budget_mb`	int	DEPRECATED - This parameter will be removed in a future release.	`None`
`record_limit`	int	DEPRECATED - This parameter will be removed in a future release.	`None`
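
Most `ingest_samples()` options default to `None`, which leaves the choice to the library. A minimal sketch of a wrapper that only forwards the options a caller actually set is shown below; both helper names are hypothetical, and `ds` is assumed to be a `Dataset` opened with `mode='w'`.

```python
def ingest_options(**candidates):
    """Drop options left as None so the library defaults apply."""
    return {key: value for key, value in candidates.items() if value is not None}


def ingest(ds, sample_uris, threads=None, total_memory_budget_mb=None, resume=False):
    """Call ds.ingest_samples() with only the explicitly chosen options.

    Returns the dict of optional settings that were forwarded.
    """
    options = ingest_options(
        threads=threads,
        total_memory_budget_mb=total_memory_budget_mb,
    )
    ds.ingest_samples(sample_uris=list(sample_uris), resume=resume, **options)
    return options
```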

read

Dataset.read(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)

Read data from the dataset into a Pandas DataFrame.

For large datasets, the results of a call to read() may not fit in memory. In that case, the returned DataFrame contains as many results as possible; use continue_read() to retrieve the rest.

You can also use the Python generator version, read_iter().

Parameters

Name	Type	Description	Default
`attrs`	List[str]	List of attribute names to be read.	`DEFAULT_ATTRS`
`samples`	(str, List[str])	Sample names to be read.	`None`
`regions`	(str, List[str])	Genomic regions to be read.	`None`
`samples_file`	str	URI of file containing sample names to be read, one per line.	`None`
`bed_file`	str	URI of a BED file of genomic regions to be read.	`None`
`skip_check_samples`	bool	Skip checking if the samples in `samples_file` exist in the dataset.	`False`
`set_af_filter`	str	Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”.	`''`
`scan_all_samples`	bool	Scan all samples when computing internal allele frequency.	`False`
`enable_progress_estimation`	bool	DEPRECATED - This parameter will be removed in a future release.	`False`

Returns

Type	Description
pd.DataFrame	Query results as a Pandas DataFrame.
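
The incomplete-read workflow described above, using `read()`, `read_completed()`, and `continue_read()`, can be sketched as a simple loop. The helper name `read_all_batches` is hypothetical, and `ds` is assumed to be a `Dataset` opened in read mode.

```python
def read_all_batches(ds, **read_kwargs):
    """Return every batch of one query as a list of DataFrames.

    The first batch comes from ds.read(); further batches are fetched with
    ds.continue_read() until ds.read_completed() reports True.
    """
    batches = [ds.read(**read_kwargs)]
    while not ds.read_completed():
        batches.append(ds.continue_read())
    return batches
```

If the combined result fits in memory, the batches could be joined with `pandas.concat(batches, ignore_index=True)`; otherwise each batch is better processed and discarded in turn.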

read_allele_count

Dataset.read_allele_count(region=None)

Read allele count from the dataset into a Pandas DataFrame.

Parameters

Name	Type	Description	Default
`region`	str	Genomic region to be queried.	`None`

read_arrow

Dataset.read_arrow(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)

Read data from the dataset into a PyArrow Table.

For large queries, the results of a call to read_arrow() may not fit in memory. In that case, the returned table contains as many results as possible; use continue_read_arrow() to retrieve the rest as PyArrow Tables.

Parameters

Name	Type	Description	Default
`attrs`	List[str]	List of attribute names to be read.	`DEFAULT_ATTRS`
`samples`	(str, List[str])	Sample names to be read.	`None`
`regions`	(str, List[str])	Genomic regions to be read.	`None`
`samples_file`	str	URI of file containing sample names to be read, one per line.	`None`
`bed_file`	str	URI of a BED file of genomic regions to be read.	`None`
`skip_check_samples`	bool	Skip checking if the samples in `samples_file` exist in the dataset.	`False`
`set_af_filter`	str	Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”.	`''`
`scan_all_samples`	bool	Scan all samples when computing internal allele frequency.	`False`
`enable_progress_estimation`	bool	DEPRECATED - This parameter will be removed in a future release.	`False`

Returns

Type	Description
pa.Table	Query results as a PyArrow Table.
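
For Arrow output, the same batching pattern can be written as a generator so that only one table is held at a time. This is a sketch under the assumption that `continue_read_arrow()` is the Arrow counterpart to call after `read_arrow()`; the helper name is hypothetical.

```python
def iter_arrow_batches(ds, **read_kwargs):
    """Yield each PyArrow Table of one query, one batch at a time.

    Nothing is read until the generator is first advanced. `ds` is assumed to
    be a Dataset opened in read mode.
    """
    yield ds.read_arrow(**read_kwargs)
    while not ds.read_completed():
        yield ds.continue_read_arrow()
```

A caller would typically consume it with `for table in iter_arrow_batches(ds, regions=[...]):` and process each table before the next is fetched.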

read_completed

Dataset.read_completed()

Returns True if the previous read operation was complete. A read is considered complete if the resulting DataFrame or Table contained all results.

Returns

Type	Description
bool	True if the previous read operation was complete.

read_iter

Dataset.read_iter(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None)

Iterator version of read().

Parameters

Name	Type	Description	Default
`attrs`	List[str]	List of attribute names to be read.	`DEFAULT_ATTRS`
`samples`	(str, List[str])	Sample names to be read.	`None`
`regions`	(str, List[str])	Genomic regions to be read.	`None`
`samples_file`	str	URI of file containing sample names to be read, one per line.	`None`
`bed_file`	str	URI of a BED file of genomic regions to be read.	`None`
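
A short sketch of consuming `read_iter()` follows: it totals the rows across all batches without keeping any batch in memory. It assumes each yielded item is a Pandas DataFrame, as with `read()`; the helper name `count_rows` is hypothetical.

```python
def count_rows(ds, **read_kwargs):
    """Total the number of rows across all batches yielded by ds.read_iter().

    Each batch is assumed to be a DataFrame, whose len() is its row count.
    """
    total = 0
    for batch in ds.read_iter(**read_kwargs):
        total += len(batch)
    return total
```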

read_variant_stats

Dataset.read_variant_stats(region=None)

Read variant stats from the dataset into a Pandas DataFrame.

Parameters

Name	Type	Description	Default
`region`	str	Genomic region to be queried.	`None`

sample_count

Dataset.sample_count()

Get the number of samples in the dataset.

Returns

Type	Description
int	Number of samples in the dataset.

samples

Dataset.samples()

Get the list of samples in the dataset.

Returns

Type	Description
list	List of samples in the dataset.

schema_version

Dataset.schema_version()

Get the VCF schema version of the dataset.

Returns

Type	Description
int	VCF schema version of the dataset.

tiledb_stats

Dataset.tiledb_stats()

Get TileDB stats as a string.

Returns

Type	Description
str	TileDB stats as a string.

version

Dataset.version()

Return the TileDB-VCF version used to create the dataset.

Returns

Type	Description
str	The TileDB-VCF version.