Dataset
Dataset(self, uri, mode='r', cfg=None, stats=False, verbose=False, tiledb_config=None)
A class that provides read/write access to a TileDB-VCF dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
uri | str | URI of the dataset. | required |
mode | str | Mode of operation (‘r’ or ‘w’). | 'r' |
cfg | ReadConfig | TileDB-VCF configuration. | None |
stats | bool | Enable internal TileDB statistics. | False |
verbose | bool | Enable verbose output. | False |
tiledb_config | dict | TileDB configuration, alternative to cfg.tiledb_config. | None |
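The constructor and close() can be paired so that resources are always released. The following is a minimal sketch, not part of the library: the helper name `opened` and its `factory` argument are hypothetical, and it assumes only the documented `Dataset(uri, mode=...)` constructor and `close()` method.

```python
from contextlib import contextmanager


@contextmanager
def opened(uri, mode="r", factory=None, **kwargs):
    """Open a dataset, yield it, and always call close() afterwards.

    `factory` is a hypothetical hook used for illustration and testing;
    when omitted, tiledbvcf.Dataset is used.
    """
    if mode not in ("r", "w"):
        raise ValueError(f"mode must be 'r' or 'w', got {mode!r}")
    if factory is None:
        import tiledbvcf  # imported lazily, only when a real dataset is opened

        factory = tiledbvcf.Dataset
    ds = factory(uri, mode=mode, **kwargs)
    try:
        yield ds
    finally:
        # Documented method: close the dataset and release resources.
        ds.close()
```

A typical call would be `with opened("my_dataset_uri") as ds: ...`, where the URI is a placeholder.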
Methods
Name | Description |
---|---|
attributes | Return a list of queryable attributes available in the VCF dataset. |
close | Close the dataset and release resources. |
continue_read | Continue an incomplete read and return the next batch as a Pandas DataFrame. |
continue_read_arrow | Continue an incomplete read and return the next batch as a PyArrow Table. |
count | Count records in the dataset. |
create_dataset | Create a new dataset. |
delete | Delete the dataset. |
export | Exports data to multiple VCF files or a combined VCF file. |
ingest_samples | Ingest VCF files into the dataset. |
read | Read data from the dataset into a Pandas DataFrame. |
read_allele_count | Read allele count from the dataset into a Pandas DataFrame. |
read_arrow | Read data from the dataset into a PyArrow Table. |
read_completed | Returns true if the previous read operation was complete. |
read_iter | Iterator version of read(). |
read_variant_stats | Read variant stats from the dataset into a Pandas DataFrame. |
sample_count | Get the number of samples in the dataset. |
samples | Get the list of samples in the dataset. |
schema_version | Get the VCF schema version of the dataset. |
tiledb_stats | Get TileDB stats as a string. |
version | Return the TileDB-VCF version used to create the dataset. |
attributes
Dataset.attributes(attr_type='all')
Return a list of queryable attributes available in the VCF dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
attr_type | str | The subset of attributes to retrieve: “info” or “fmt” retrieves only the attributes ingested from the VCF INFO or FORMAT fields, respectively; “builtin” retrieves the static attributes defined in TileDB-VCF’s schema; “all” (the default) returns all queryable attributes. | 'all' |
Returns
Type | Description |
---|---|
list | A list of attribute names. |
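As an illustration of the attr_type values, the sketch below groups attribute names by subset. The helper name `attributes_by_type` is hypothetical; it assumes only that `ds` exposes the documented `attributes(attr_type=...)` method.

```python
def attributes_by_type(ds):
    """Group queryable attribute names by the documented attr_type subsets."""
    groups = {}
    for attr_type in ("builtin", "info", "fmt"):
        # Each call returns a list of attribute names for that subset.
        groups[attr_type] = sorted(ds.attributes(attr_type=attr_type))
    return groups
```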
close
Dataset.close()
Close the dataset and release resources.
continue_read
Dataset.continue_read(release_buffers=True)
Continue an incomplete read, returning the next batch as a Pandas DataFrame.
Parameters
Name | Type | Description | Default |
---|---|---|---|
release_buffers | bool | Release the buffers after reading. | True |
Returns
Type | Description |
---|---|
pd.DataFrame | The next batch of data as a Pandas DataFrame. |
continue_read_arrow
Dataset.continue_read_arrow(release_buffers=True)
Continue an incomplete read, returning the next batch as a PyArrow Table.
Parameters
Name | Type | Description | Default |
---|---|---|---|
release_buffers | bool | Release the buffers after reading. | True |
Returns
Type | Description |
---|---|
pa.Table | The next batch of data as a PyArrow Table. |
count
Dataset.count(samples=None, regions=None)
Count records in the dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
samples | (str, List[str]) | Sample names to include in the count. | None |
regions | (str, List[str]) | Genomic regions to include in the count. | None |
Returns
Type | Description |
---|---|
int | Number of intersecting records in the dataset. |
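One way to use count() is to tally records per region. This is a sketch under assumptions: the helper name `count_per_region` is hypothetical, and the region string format (for example `contig:start-end`) is assumed rather than stated in this reference.

```python
def count_per_region(ds, regions, samples=None):
    """Return {region: record count}, calling Dataset.count once per region."""
    counts = {}
    for region in regions:
        # Region strings are passed through unchanged; their format is an
        # assumption of this sketch.
        counts[region] = ds.count(samples=samples, regions=[region])
    return counts
```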
create_dataset
Dataset.create_dataset(extra_attrs=None, vcf_attrs=None, tile_capacity=10000, anchor_gap=1000, checksum_type='sha256', allow_duplicates=True, enable_allele_count=True, enable_variant_stats=True, enable_sample_stats=True, compress_sample_dim=True, compression_level=4, variant_stats_version=2)
Create a new dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
extra_attrs | str | CSV list of extra attributes to materialize from fmt and info fields. | None |
vcf_attrs | str | URI of VCF file with all fmt and info fields to materialize in the dataset. | None |
tile_capacity | int | Tile capacity to use for the array schema. | 10000 |
anchor_gap | int | Length of gaps between inserted anchor records in bases. | 1000 |
checksum_type | str | Optional checksum type for the dataset, “sha256” or “md5”. | 'sha256' |
allow_duplicates | bool | Allow records with duplicate start positions to be written to the array. | True |
enable_allele_count | bool | Enable the allele count ingestion task. | True |
enable_variant_stats | bool | Enable the variant stats ingestion task. | True |
enable_sample_stats | bool | Enable the sample stats ingestion task. | True |
compress_sample_dim | bool | Enable compression on the sample dimension. | True |
compression_level | int | Compression level for zstd compression. | 4 |
variant_stats_version | int | Version of the variant stats array. | 2 |
delete
Dataset.delete(uri, *, config=None)
Delete the dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
uri | str | URI of the dataset. | required |
config | dict | TileDB configuration. | None |
export
Dataset.export(samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, enable_progress_estimation=False, merge=False, output_format='z', output_path='', output_dir='.')
Exports data to multiple VCF files or a combined VCF file.
Parameters
Name | Type | Description | Default |
---|---|---|---|
samples | (str, List[str]) | Sample names to be read. | None |
regions | (str, List[str]) | Genomic regions to be read. | None |
samples_file | str | URI of file containing sample names to be read, one per line. | None |
bed_file | str | URI of a BED file of genomic regions to be read. | None |
skip_check_samples | bool | Skip checking if the samples in samples_file exist in the dataset. | False |
set_af_filter | | Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. | required |
scan_all_samples | | Scan all samples when computing internal allele frequency. | required |
enable_progress_estimation | bool | DEPRECATED - This parameter will be removed in a future release. | False |
merge | bool | Merge samples to create a combined VCF file. | False |
output_format | str | Export file format: ‘b’: bcf (compressed), ‘u’: bcf, ‘z’: vcf.gz, ‘v’: vcf. | 'z' |
output_path | str | Combined VCF output file. | '' |
output_dir | str | Directory used for local output of exported samples. | '.' |
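The merge, output_format, and output_path parameters work together when producing a single combined file. A minimal sketch, assuming only the documented export() keyword arguments; the wrapper name `export_merged` is hypothetical.

```python
def export_merged(ds, output_path, samples=None, regions=None, output_format="z"):
    """Export the selected samples and regions into one combined VCF/BCF file."""
    # The four format codes are the ones listed in the parameter table.
    if output_format not in ("b", "u", "z", "v"):
        raise ValueError(f"unsupported output_format: {output_format!r}")
    ds.export(
        samples=samples,
        regions=regions,
        merge=True,  # combine samples instead of writing one file per sample
        output_format=output_format,
        output_path=output_path,
    )
    return output_path
```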
ingest_samples
Dataset.ingest_samples(sample_uris=None, threads=None, total_memory_budget_mb=None, total_memory_percentage=None, ratio_tiledb_memory=None, max_tiledb_memory_mb=None, input_record_buffer_mb=None, avg_vcf_record_size=None, ratio_task_size=None, ratio_output_flush=None, scratch_space_path=None, scratch_space_size=None, sample_batch_size=None, resume=False, contig_fragment_merging=True, contigs_to_keep_separate=None, contigs_to_allow_merging=None, contig_mode='all', thread_task_size=None, memory_budget_mb=None, record_limit=None)
Ingest VCF files into the dataset.
Parameters
Name | Type | Description | Default |
---|---|---|---|
sample_uris | List[str] | List of sample URIs to ingest. | None |
threads | int | Set the number of threads used for ingestion. | None |
total_memory_budget_mb | int | Total memory budget for ingestion (MiB). | None |
total_memory_percentage | float | Percentage of total system memory used for ingestion (overrides ‘total_memory_budget_mb’). | None |
ratio_tiledb_memory | float | Ratio of memory budget allocated to TileDB::sm.mem.total_budget. | None |
max_tiledb_memory_mb | int | Maximum memory allocated to TileDB::sm.mem.total_budget (MiB). | None |
input_record_buffer_mb | int | Size of input record buffer for each sample file (MiB). | None |
avg_vcf_record_size | int | Average VCF record size (bytes). | None |
ratio_task_size | float | Ratio of worker task size to computed task size. | None |
ratio_output_flush | float | Ratio of output buffer capacity that triggers a flush to TileDB. | None |
scratch_space_path | str | Directory used for local storage of downloaded remote samples. | None |
scratch_space_size | int | Amount of local storage that can be used for downloading remote samples (MB). | None |
sample_batch_size | int | Number of samples per batch for ingestion (default 10). | None |
resume | bool | Whether to check for and attempt to resume a partially completed ingestion. | False |
contig_fragment_merging | bool | Whether to enable merging of contigs into fragments. This overrides the contigs_to_keep_separate and contigs_to_allow_merging options. Contig fragment merging is generally beneficial: it is a performance optimization that reduces the number of prefixes on an S3/Azure/GCS bucket when there are many small pseudo-contigs. | True |
contigs_to_keep_separate | List[str] | List of contigs that should not be merged into combined fragments. The default list includes all standard human chromosomes in both UCSC (e.g., chr1) and Ensembl (e.g., 1) formats. | None |
contigs_to_allow_merging | List[str] | List of contigs that should be allowed to be merged into combined fragments. | None |
contig_mode | str | Select which contigs are ingested: ‘all’, ‘separate’, or ‘merged’. | 'all' |
thread_task_size | int | DEPRECATED - This parameter will be removed in a future release. | None |
memory_budget_mb | int | DEPRECATED - This parameter will be removed in a future release. | None |
record_limit | int | DEPRECATED - This parameter will be removed in a future release. | None |
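A write workflow usually combines create_dataset() and ingest_samples(). The sketch below assumes `ds` was opened with mode 'w' and relies only on the two documented methods; the helper name `create_and_ingest` and its `create` flag are hypothetical.

```python
def create_and_ingest(ds, sample_uris, create=True, **ingest_options):
    """Optionally create the dataset, then ingest the given VCF URIs.

    `ingest_options` is forwarded unchanged to ingest_samples, e.g.
    threads=4 or resume=True from the parameter table.
    """
    uris = list(sample_uris)
    if not uris:
        raise ValueError("sample_uris must not be empty")
    if create:
        ds.create_dataset()  # uses the documented defaults
    ds.ingest_samples(sample_uris=uris, **ingest_options)
    return len(uris)
```

Passing `create=False` covers the case where further samples are added to a dataset that already exists.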
read
Dataset.read(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)
Read data from the dataset into a Pandas DataFrame.
For large datasets, a call to read() may not be able to fit all results in memory. In that case, the returned DataFrame contains as many results as possible; use continue_read() to retrieve the rest.
You can also use the Python generator version, read_iter().
Parameters
Name | Type | Description | Default |
---|---|---|---|
attrs | List[str] | List of attribute names to be read. | DEFAULT_ATTRS |
samples | (str, List[str]) | Sample names to be read. | None |
regions | (str, List[str]) | Genomic regions to be read. | None |
samples_file | str | URI of file containing sample names to be read, one per line. | None |
bed_file | str | URI of a BED file of genomic regions to be read. | None |
skip_check_samples | bool | Skip checking if the samples in samples_file exist in the dataset. | False |
set_af_filter | str | Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. | '' |
scan_all_samples | bool | Scan all samples when computing internal allele frequency. | False |
enable_progress_estimation | bool | DEPRECATED - This parameter will be removed in a future release. | False |
Returns
Type | Description |
---|---|
pd.DataFrame | Query results as a Pandas DataFrame. |
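The incomplete-read pattern described for read() can be wrapped in a generator. This is a sketch, not library code: the helper name `iter_batches` and its `arrow` flag are hypothetical, and it assumes only the documented read(), read_arrow(), continue_read(), continue_read_arrow(), and read_completed() methods.

```python
def iter_batches(ds, arrow=False, **query):
    """Yield every batch of a possibly incomplete read, in order.

    `query` is forwarded to read() or read_arrow(), e.g. attrs=..., regions=...
    """
    first = ds.read_arrow if arrow else ds.read
    more = ds.continue_read_arrow if arrow else ds.continue_read
    yield first(**query)
    # Keep fetching until the previous read returned all remaining results.
    while not ds.read_completed():
        yield more()
```

Each yielded batch is a Pandas DataFrame, or a PyArrow Table when `arrow=True`; callers decide whether to process batches one at a time or concatenate them.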
read_allele_count
Dataset.read_allele_count(region=None)
Read allele count from the dataset into a Pandas DataFrame.
Parameters
Name | Type | Description | Default |
---|---|---|---|
region | str | Genomic region to be queried. | None |
read_arrow
Dataset.read_arrow(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None, skip_check_samples=False, set_af_filter='', scan_all_samples=False, enable_progress_estimation=False)
Read data from the dataset into a PyArrow Table.
For large queries, a call to read_arrow() may not be able to fit all results in memory. In that case, the returned table contains as many results as possible; use continue_read_arrow() to retrieve the rest as PyArrow Tables.
Parameters
Name | Type | Description | Default |
---|---|---|---|
attrs | List[str] | List of attribute names to be read. | DEFAULT_ATTRS |
samples | (str, List[str]) | Sample names to be read. | None |
regions | (str, List[str]) | Genomic regions to be read. | None |
samples_file | str | URI of file containing sample names to be read, one per line. | None |
bed_file | str | URI of a BED file of genomic regions to be read. | None |
skip_check_samples | bool | Skip checking if the samples in samples_file exist in the dataset. | False |
set_af_filter | str | Filter variants by internal allele frequency. For example, to include variants with AF > 0.1, set this to “>0.1”. | '' |
scan_all_samples | bool | Scan all samples when computing internal allele frequency. | False |
enable_progress_estimation | bool | DEPRECATED - This parameter will be removed in a future release. | False |
Returns
Type | Description |
---|---|
pa.Table | Query results as a PyArrow Table. |
read_completed
Dataset.read_completed()
Returns True if the previous read operation was complete. A read is considered complete if the resulting DataFrame contained all results.
Returns
Type | Description |
---|---|
bool | True if the previous read operation was complete. |
read_iter
Dataset.read_iter(attrs=DEFAULT_ATTRS, samples=None, regions=None, samples_file=None, bed_file=None)
Iterator version of read().
Parameters
Name | Type | Description | Default |
---|---|---|---|
attrs | List[str] | List of attribute names to be read. | DEFAULT_ATTRS |
samples | (str, List[str]) | Sample names to be read. | None |
regions | (str, List[str]) | Genomic regions to be read. | None |
samples_file | str | URI of file containing sample names to be read, one per line. | None |
bed_file | str | URI of a BED file of genomic regions to be read. | None |
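Because read_iter() yields results batch by batch, aggregates can be computed without holding the full result in memory. A minimal sketch, assuming each yielded batch supports `len()` (as a Pandas DataFrame does); the helper name `summarize_stream` is hypothetical.

```python
def summarize_stream(ds, **query):
    """Return (number of batches, total rows) for a read_iter() query."""
    batches = 0
    rows = 0
    for batch in ds.read_iter(**query):
        batches += 1
        rows += len(batch)  # len() of a DataFrame is its row count
    return batches, rows
```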
read_variant_stats
Dataset.read_variant_stats(region=None)
Read variant stats from the dataset into a Pandas DataFrame.
Parameters
Name | Type | Description | Default |
---|---|---|---|
region | str | Genomic region to be queried. | None |
sample_count
Dataset.sample_count()
Get the number of samples in the dataset.
Returns
Type | Description |
---|---|
int | Number of samples in the dataset. |
samples
Dataset.samples()
Get the list of samples in the dataset.
Returns
Type | Description |
---|---|
list | List of samples in the dataset. |
schema_version
Dataset.schema_version()
Get the VCF schema version of the dataset.
Returns
Type | Description |
---|---|
int | VCF schema version of the dataset. |
tiledb_stats
Dataset.tiledb_stats()
Get TileDB stats as a string.
Returns
Type | Description |
---|---|
str | TileDB stats as a string. |
version
Dataset.version()
Return the TileDB-VCF version used to create the dataset.
Returns
Type | Description |
---|---|
str | The TileDB-VCF version. |
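The metadata accessors above can be combined into a quick summary of an open dataset. This sketch assumes only the documented sample_count(), schema_version(), and version() methods; the helper name `describe` and the dictionary keys are hypothetical.

```python
def describe(ds):
    """Collect basic metadata about an open dataset into a dict."""
    return {
        "sample_count": ds.sample_count(),      # int
        "schema_version": ds.schema_version(),  # int
        "tiledbvcf_version": ds.version(),      # str
    }
```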