Distributed Ingestion
TileDB has built-in support for scalable, distributed ingestion. ingest is a simple Python function that dispatches and runs a task graph to load VCF samples in parallel across a number of machines.
Example
To run this example, first install TileDB-Cloud-Py with pip install --user tiledb-cloud.
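import tiledb.cloud
# Contigs is assumed to be importable from the same module as ingest
from tiledb.cloud.vcf import Contigs, ingest

# Storage location for the ingested dataset
s3_storage_uri = "s3://my_bucket/my_array"

# Location to search for VCF files, and the filename pattern to match
vcf_location = "s3://1000genomes-dragen-v3.7.6/data/individuals/hg38-graph-based"
pattern = "*.hard-filtered.vcf.gz"

# Maximum number of VCF files to ingest
max_files = 75

name = f"dragen-v3.7.6-example-{max_files}"
namespace = "my-organization"
tiledb_uri = f"tiledb://{namespace}/{name}"

# Define which contigs we want to ingest
contigs = Contigs.CHROMOSOMES

# Placeholder: the original example does not define config.
# None is assumed to select the default configuration; substitute your own settings.
config = None

ingest(
    s3_storage_uri,
    config=config,
    search_uri=vcf_location,
    pattern=pattern,
    max_files=max_files,
    contigs=contigs,
)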
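The pattern and max_files arguments in the example together decide which files under the search location are ingested. The sketch below illustrates that idea locally, under stated assumptions: select_files is a hypothetical helper (not part of tiledb.cloud), matching is assumed to be shell-style globbing on the file name, and the object keys are invented for illustration. The real service may order or match files differently.

```python
from fnmatch import fnmatchcase


def select_files(keys, pattern, max_files=None):
    """Hypothetical helper: keep keys whose file name matches a glob pattern.

    Assumes shell-style matching on the final path component and a
    deterministic (sorted) order before the max_files cap is applied.
    """
    matched = sorted(
        key for key in keys if fnmatchcase(key.rsplit("/", 1)[-1], pattern)
    )
    return matched if max_files is None else matched[:max_files]


# Invented object keys, for illustration only
keys = [
    "hg38/HG00096.hard-filtered.vcf.gz",
    "hg38/HG00096.hard-filtered.vcf.gz.tbi",
    "hg38/HG00097.hard-filtered.vcf.gz",
    "hg38/HG00099.cnv.vcf.gz",
]

print(select_files(keys, "*.hard-filtered.vcf.gz", max_files=1))
```

Index files (.tbi) and VCFs of other types do not match the pattern, so only the hard-filtered VCFs are candidates, and max_files caps how many of them are taken.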
Contigs
TileDB supports specifying the contigs you wish to ingest. The default behavior is to ingest all contigs present in a VCF file. However, you can restrict ingestion to a specific list of contigs or to one of the predefined lists.
Options for contigs:
ALL
CHROMOSOMES
OTHER
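To make the three options concrete, here is a minimal local sketch of how such a filter could partition the contigs found in a VCF header. It does not call TileDB: ContigMode and select_contigs are hypothetical names, and the sketch assumes that CHROMOSOMES means chr1 to chr22 plus chrX, chrY and chrM, and that OTHER means every remaining contig. Check the library documentation for the exact definitions.

```python
import enum
import re


class ContigMode(enum.Enum):
    """Hypothetical stand-in for the contig options listed above."""

    ALL = enum.auto()
    CHROMOSOMES = enum.auto()
    OTHER = enum.auto()


# Assumption: primary chromosomes are 1-22, X, Y and M/MT,
# with or without a "chr" prefix.
_CHROMOSOME_RE = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$")


def select_contigs(contigs, mode):
    """Return the contigs that the given mode would keep, in input order."""
    if mode is ContigMode.ALL:
        return list(contigs)
    want_chromosome = mode is ContigMode.CHROMOSOMES
    return [
        contig
        for contig in contigs
        if bool(_CHROMOSOME_RE.match(contig)) == want_chromosome
    ]


# Invented header contents, for illustration only
header_contigs = [
    "chr1",
    "chr2",
    "chrX",
    "chrM",
    "chrUn_KI270302v1",
    "chr1_KI270706v1_random",
]

print(select_contigs(header_contigs, ContigMode.CHROMOSOMES))
print(select_contigs(header_contigs, ContigMode.OTHER))
```

Under these assumptions, CHROMOSOMES and OTHER are complementary: together they cover exactly the set that ALL selects.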