Work with Cloud Object Stores

TileDB Embedded provides native support for reading from and writing to cloud object stores like AWS S3, Google Cloud Storage, and Microsoft Azure Blob Store. This guide will cover some considerations for using TileDB-VCF with these services. The examples will focus exclusively on S3, which is the most widely used, but note any of the aforementioned services can be substituted, as well as on-premise services like MinIO that provide S3-compatible APIs.

Remote Datasets

The process of creating a TileDB-VCF dataset on S3 is nearly identical to creating a local dataset. The only difference being an s3:// address is passed to the --uri argument rather than a local file path.

tiledbvcf create --uri s3://my-bucket/my_dataset

This also works when querying a TileDB-VCF dataset located on S3.

tiledbvcf export \
  --uri s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20 \
  --sample-names v2-tJjMfKyL,v2-eBAdKwID \
  -Ot --tsv-fields "CHR,POS,REF,S:GT" \
  --regions "chr7:144000320-144008793,chr11:56490349-56491395"

Remote VCF Files

VCF files located on S3 can be ingested directly into a TileDBVCF dataset using 1 of 2 different possible approaches.

Direct Ingestion

The first approach is the easiest, you simply pass the tiledbvcf store command a list of S3 URIs and TileDB-VCF takes care of the rest:

tiledbvcf store \
    --uri my_dataset \
    s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G4.bcf

In this approach, remote VCF index files (which are relatively tiny) are downloaded locally, allowing TileDB-VCF to retrieve chunks of variant data from the remote VCF files without having to download them in full. By default, index files are downloaded to your current working directory, however, you can choose to store them in different location (e.g., a temporary directory) using the --scratch-dir argument.

Batched Downloading

The second approach is to download batches of VCF files in their entirety before ingestion, which may slightly improve ingestion performance. This approach requires allocating TileDB-VCF with scratch disk space using the --scratch-mb and --scratch-dir arguments.

tiledbvcf store \
    --uri my_dataset \
    --scratch-dir "$TMPDIR" \
    --scratch-mb 4096 \
    s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G4.bcf

The number of VCF files that are downloaded at a time is determined by the --sample-batch-size parameter, which defaults to 10. Downloading and ingestion happens asynchronously, so, for example, batch 3 will be downloaded as batch 2 is being ingestion. As a result, you must configure enough scratch space to store at least 20 samples, assuming a batch size of 10.

Authentication

For TileDB to access a remote storage bucket you must be properly authenticated on the machine running TileDB. For S3, this means having access to the appropriate AWS access key ID and secret access key. This typically happens in one of three ways:

1. Using the AWS CLI

If the AWS Command Line Interface (CLI) is installed on your machine, running aws configure will store your credentials in a local profile that TileDB can access. You can verify the CLI has been previously configured by running:

aws s3 ls

If properly configured, this will output a list of the S3 buckets you (and thus TileDB) can access.

2. Using Configuration Parameters

You can pass your AWS access key ID and secret access key to TileDB-VCF directly via the --tiledb-config argument, which expects a comma-separated string:

tiledbvcf store \
    --uri my_dataset \
    --tiledb-config vfs.s3.aws_access_key_id=<id>,vfs.s3.aws_secret_access_key=<secret> \
    s3://tiledb-inc-demo-data/examples/notebooks/vcfs/G4.bcf

3. Using Environment Variables

Your AWS credentials can also be passed to TileDB by defining the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.