Distributed Ingestion
TileDB has built in support for scalable distributed ingestion.
ingest is a simply python command that will dispatch and run a task graph to load VCF samples in parallel across a number of machines.
Example
In order to run this example please make sure to have install TileDB-Cloud-Py with pip install --user tiledb-cloud.
import tiledb.cloud
from tiledb.cloud.vcf import ingest
s3_storage_uri = "s3://my_bucket/my_array"
vcf_location = "s3://1000genomes-dragen-v3.7.6/data/individuals/hg38-graph-based"
pattern = "*.hard-filtered.vcf.gz"
max_files = 75
name = f"dragen-v3.7.6-example-{max_files}"
namespace = "my-organization"
tiledb_uri = f"tiledb://{namespace}/{name}"
# Define which contigs we want to ingest
contigs = Contigs.CHROMOSOMES
ingest(
s3_storage_uri,
config=config,
search_uri=vcf_location,
pattern=pattern,
max_files=max_files,
contigs=contigs,
)Contigs
TileDB support specifying the contigs you wish to ingest. The default behavior is to ingestion all contigs present in a VCF file. However you can specify if you’d like to restrict to a specific list or a predefined list
Options for contigs:
ALLCHROMOSOMESOTHER