CLI

Create a new dataset

Note

The files used in the following examples can be obtained here.

The first step is to create an empty dataset. Let’s save the dataset in a new array called small_dataset:

tiledbvcf create --uri small_dataset
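
By default, no INFO or FORMAT fields are extracted as standalone attributes (the stat output later in this guide reports "Extracted attributes: none"). If you already know which fields you will query most often, you can ask for them to be extracted when the dataset is created. The sketch below is illustrative rather than prescriptive: the dataset name small_dataset_extracted and the field names fmt_GT and info_DP are placeholders, and you should confirm the exact attribute option syntax with tiledbvcf create --help for your version.

tiledbvcf create \
  --uri small_dataset_extracted \
  --attributes fmt_GT,info_DP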

Store samples

Ingest from local storage

We’ll start with a small example using 3 synthetic VCF files, assuming they are locally available in a folder data/vcfs:

tree data
## data
## ├── gene-promoters-hg38.bed
## ├── s3-bcf-samples.txt
## └── vcfs
##     ├── G1.vcf.gz
##     ├── G1.vcf.gz.csi
##     ├── G2.vcf.gz
##     ├── G2.vcf.gz.csi
##     ├── G3.vcf.gz
##     └── G3.vcf.gz.csi
##
## 1 directory, 8 files

Index files are required for ingestion. If your VCF/BCF files have not been indexed you can use bcftools to do so:

for f in data/vcfs/*.vcf.gz; do bcftools index -c $f; done

We can ingest these files into small_dataset as follows:

tiledbvcf store --uri small_dataset data/vcfs/G*.vcf.gz

That’s it! Let’s verify everything went okay using the stat command, which prints high-level statistics about the dataset, including the number of registered samples and any extracted variant attributes.

tiledbvcf stat --uri small_dataset
## Statistics for dataset 'small_dataset':
## - Version: 4
## - Tile capacity: 10,000
## - Anchor gap: 1,000
## - Number of registered samples: 3
## - Extracted attributes: none
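
To see which sample names were actually registered, you can also run the list command, which prints one sample name per line; at this point it should report G1, G2, and G3.

tiledbvcf list --uri small_dataset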

At this point you have successfully created and populated a TileDB VCF dataset using data stored locally on your machine. Next we’ll look at the more common scenario of working with files stored on a cloud object store.

Ingest from S3

TileDB Embedded’s native cloud features make it possible to ingest samples directly from remote locations. Here, we’ll ingest the following samples located on AWS S3:

cat data/s3-bcf-samples.txt
## s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G4.bcf
## s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G5.bcf
## s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G6.bcf
## s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G7.bcf
## s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G8.bcf
## s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G9.bcf
## s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G10.bcf

Note

Samples in this second batch are stored as BCF files, which are also supported by TileDB-VCF.

This process is identical to the steps performed above. The only changes to our command are setting --scratch-mb to allocate some temporary space (in MB) for downloading the files and providing the URIs of the remote files. In this case, we’ll simply pass the s3-bcf-samples.txt file, which lists the BCF files we want to ingest.

Note

When ingesting samples from S3, you must configure enough scratch space to hold at least 2 × the sample_batch_size used during ingestion (10 by default), i.e., at least 20 samples. You can read more about the data model here.

You can add the --verbose flag to print out more information during the store phase.

tiledbvcf store \
  --uri small_dataset \
  --samples-file data/s3-bcf-samples.txt \
  --scratch-mb 10240 \
  --verbose
  
## Initialization completed in 3.17565 sec.
## ...
## Done. Ingested 1,391 records (+ 69,548 anchors) from 7 samples in 10.6751
## seconds.
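
Per the note above, the scratch space requirement scales with the ingestion batch size. As a rough, illustrative sketch only: if you were ingesting a hypothetical larger batch of remote samples listed in large-batch.txt, you could raise the batch size and the scratch allocation together via the --sample-batch-size option (check tiledbvcf store --help for your version); the values below are placeholders, not tuned recommendations.

tiledbvcf store \
  --uri small_dataset \
  --samples-file large-batch.txt \
  --sample-batch-size 20 \
  --scratch-mb 20480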

Consolidating and vacuuming fragment metadata and commits are recommended after creating a new dataset or adding several new samples to an existing dataset.

tiledbvcf utils consolidate fragment_meta --uri small_dataset
tiledbvcf utils consolidate commits --uri small_dataset
tiledbvcf utils vacuum fragment_meta --uri small_dataset
tiledbvcf utils vacuum commits --uri small_dataset

Incremental Updates

A key advantage of using TileDB as a data store for genomic variant data is the ability to efficiently add new samples as they become available. The dataset creation command needs to be run only once; after that, you can invoke the store command as many times as needed.

Note

The store command is thread- and process-safe, even on the cloud. That means it can be invoked in parallel, arbitrarily scaling out the ingestion of massive datasets.
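
As a minimal sketch of parallel ingestion, assume the sample URIs have been split into two hypothetical files, batch-1.txt and batch-2.txt; each store process then ingests its own batch into the same dataset:

tiledbvcf store --uri small_dataset --samples-file batch-1.txt &
tiledbvcf store --uri small_dataset --samples-file batch-2.txt &
wait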

Suppose we run the store command for the first 3 local samples, followed by another store command for the 7 samples on S3, as we did above. Running the stat command confirms that our dataset now includes 10 samples.

tiledbvcf stat --uri small_dataset
## Statistics for dataset 'small_dataset':
## - Version: 4
## - Tile capacity: 10,000
## - Anchor gap: 1,000
## - Number of registered samples: 10
## - Extracted attributes: none

Because TileDB is designed to be updatable, each store operation runs efficiently without ever touching previously or concurrently ingested data, avoiding computationally expensive operations like regenerating combined VCF files.