Distributed Ingestion
TileDB has built-in support for scalable, distributed ingestion.
ingest is a simple Python function that dispatches and runs a task graph to load VCF samples in parallel across a number of machines.
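As a rough local illustration of the fan-out idea (not TileDB's actual implementation), the sketch below splits a list of sample URIs into batches and processes the batches concurrently. The names make_batches, load_batch, and ingest_in_parallel, as well as the batch size and worker count, are hypothetical, and local threads stand in for the machines a task graph would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def make_batches(uris: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split sample URIs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(uris[i:i + batch_size]) for i in range(0, len(uris), batch_size)]


def load_batch(batch: Sequence[str]) -> int:
    """Hypothetical stand-in for ingesting one batch; returns the sample count."""
    return len(batch)


def ingest_in_parallel(uris: Sequence[str], batch_size: int = 10, workers: int = 4) -> int:
    """Process every batch on a worker thread and return the total samples handled."""
    batches = make_batches(uris, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(load_batch, batches))
```

Each batch is independent of the others, which is what makes this kind of work easy to spread across workers.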
Example
To run this example, first install TileDB-Cloud-Py with pip install --user tiledb-cloud.
import tiledb.cloud
from tiledb.cloud.vcf import Contigs, ingest
s3_storage_uri = "s3://my_bucket/my_array"
vcf_location = "s3://1000genomes-dragen-v3.7.6/data/individuals/hg38-graph-based"
pattern = "*.hard-filtered.vcf.gz"
max_files = 75
name = f"dragen-v3.7.6-example-{max_files}"
namespace = "my-organization"
tiledb_uri = f"tiledb://{namespace}/{name}"
# Define which contigs we want to ingest
contigs = Contigs.CHROMOSOMES
# Optional TileDB configuration; None is assumed here to fall back to the defaults
config = None
ingest(
    s3_storage_uri,
    config=config,
    search_uri=vcf_location,
    pattern=pattern,
    max_files=max_files,
    contigs=contigs,
)
Contigs
TileDB supports specifying which contigs to ingest. The default behavior is to ingest all contigs present in a VCF file. However, you can restrict ingestion to a specific list of contigs or to one of the predefined sets.
Options for contigs:
- ALL
- CHROMOSOMES
- OTHER
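To make the three options concrete, here is a minimal, self-contained sketch of how they might partition the contigs found in a VCF file. It does not use the TileDB library: ContigMode, CHROMOSOME_NAMES, and select_contigs are hypothetical names, and the sketch assumes that "chromosomes" means chr1 through chr22 plus chrX, chrY, and chrM, with OTHER covering everything else.

```python
from enum import Enum
from typing import List, Sequence


class ContigMode(Enum):
    """Hypothetical local mirror of the options listed above."""
    ALL = "all"
    CHROMOSOMES = "chromosomes"
    OTHER = "other"


# Assumption: the chromosome set is chr1..chr22 plus chrX, chrY, and chrM.
CHROMOSOME_NAMES = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def select_contigs(available: Sequence[str], mode: ContigMode) -> List[str]:
    """Return the contigs from `available` that the given mode would keep."""
    if mode is ContigMode.ALL:
        return list(available)
    if mode is ContigMode.CHROMOSOMES:
        return [c for c in available if c in CHROMOSOME_NAMES]
    # ContigMode.OTHER: everything that is not in the chromosome set
    return [c for c in available if c not in CHROMOSOME_NAMES]
```

Under these assumptions, CHROMOSOMES and OTHER are complementary: together they cover exactly what ALL selects.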