TileDB-VCF inherits TileDB’s native support for working with remote object stores, so we can ingest samples directly from remote services like AWS S3, Google Cloud, and Azure. Here, we’ll use single-sample BCF files with chromosome 1 data for phase 3 samples from the 1000 Genomes Project (1KG), stored in a publicly readable S3 bucket.
For demonstration purposes we’ll ingest individual batches of samples one at a time. To do this, we’ll create a list of S3 URIs pointing to 5 sample BCF files.
batch1_samples = ["HG00096.bcf", "HG00097.bcf", "HG00099.bcf", "HG00100.bcf", "HG00101.bcf"]
batch1_uris = [f"{vcf_bucket}/{s}" for s in batch1_samples]
batch1_uris
Before we can ingest VCF data into TileDB we must create the new dataset. To start, we’ll create a TileDB-VCF Dataset instance that opens a connection to our desired array_uri in write mode.
ds = tiledbvcf.Dataset(uri=array_uri, mode="w")
ds
<tiledbvcf.dataset.Dataset at 0x7efc1c203350>
Next, we’ll create the empty TileDB-VCF dataset.
Note: We can optionally pass a VCF file to the vcf_attrs argument to automatically materialize all of the INFO and FORMAT fields as separate attributes in the array (rather than htslib-encoded blobs), which can improve query performance.
ds.create_dataset(vcf_attrs=batch1_uris[0], enable_allele_count=True, enable_variant_stats=True)

# verify the array exists
os.listdir(array_uri)
CPU times: user 10.2 s, sys: 1.55 s, total: 11.8 s
Wall time: 44.8 s
This should take approximately 30s when ingesting from S3. Note that most of this time is spent streaming the data; ingesting local copies of these files takes about 12s.
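The timed ingestion cell that produced the output above isn’t reproduced in this excerpt. A minimal sketch, assuming the batch1_uris list defined earlier and the Dataset handle still open in write mode:

# Ingest the first batch of BCFs into the newly created dataset
ds.ingest_samples(sample_uris=batch1_uris)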
We can verify the samples were ingested by listing the sample IDs contained within the new array:
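The verification cell isn’t shown here; a minimal sketch would reopen the dataset in read mode and call samples():

# Reopen in read mode and list the sample IDs now stored in the array
ds = tiledbvcf.Dataset(array_uri, mode="r")
ds.samples()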
The process is identical: we need to reopen the Dataset in write mode and then pass another list of VCF URIs to ingest_samples().
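The second batch isn’t defined in this excerpt. Based on the sample names that appear in the query results later on, it would look roughly like this (the exact sample list is an assumption):

# Hypothetical second batch, inferred from the samples seen in later query results
batch2_samples = ["HG00102.bcf", "HG00103.bcf", "HG00105.bcf", "HG00106.bcf", "HG00107.bcf"]
batch2_uris = [f"{vcf_bucket}/{s}" for s in batch2_samples]
batch2_uris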
%%time
ds = tiledbvcf.Dataset(uri=array_uri, mode="w")

# Incremental update to the array; previous data is not touched
ds.ingest_samples(sample_uris=batch2_uris)
CPU times: user 10.2 s, sys: 1.57 s, total: 11.8 s
Wall time: 46.9 s
# Open the array in read mode and print out the samples we just ingested to verify
ds = tiledbvcf.Dataset(array_uri, mode="r")
ds.samples()
Key takeaways:
- new samples are easily added to existing datasets
- incremental updates are extremely efficient
3. Basic TileDB-VCF Query
Open the array in read mode, and display attributes
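That cell isn’t reproduced here; a minimal sketch, assuming the same array_uri used above:

ds = tiledbvcf.Dataset(array_uri, mode="r")
ds.attributes()  # list the queryable attributes materialized in the array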
%%time
regions = ["1:11106535-11262551"]
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.54 µs
Query for all variants located within MTOR for a particular sample and return the results as a pandas DataFrame, where each row is a variant indexed by chromosome, chromosome position, and sample.
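The query cell isn’t shown in this excerpt. A minimal sketch, assuming sample HG00096 as the "particular sample" and the attribute list seen in later results (both are assumptions):

%%time
df = ds.read(
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
    regions=regions,
    samples=["HG00096"],  # hypothetical choice of sample
)
df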
CPU times: user 32.4 ms, sys: 84.5 ms, total: 117 ms
Wall time: 110 ms
Again, we can use bcftools to examine the combined VCF that the query results were exported to:
!bcftools view --no-header /tmp/combined.vcf | head -10
1 10177 rs367896724 A AC 100 PASS AC=4;VT=INDEL;AN=8;DP=412608;AA=|||unknown(NO_COVERAGE);END=10177 GT 1|0 0|1 0|1 1|0 ./.
1 10352 rs555500075 T TA 100 PASS AC=5;VT=INDEL;AN=10;DP=444575;AA=|||unknown(NO_COVERAGE);END=10352 GT 1|0 1|0 0|1 0|1 1|0
1 10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C 100 PASS AC=10;VT=INDEL;AN=10;DP=11825;END=10637 GT 1|1 1|1 1|1 1|1 1|1
1 13110 rs540538026 G A 100 PASS AC=1;VT=SNP;AN=2;DP=23422;AA=g|||;END=13110 GT ./. 1|0 ./. ./. ./.
1 13116 rs62635286 T G 100 PASS AC=2;VT=SNP;AN=4;DP=44680;AA=t|||;END=13116 GT ./. 1|0 ./. ./. 1|0
1 13118 rs200579949 A G 100 PASS AC=2;VT=SNP;AN=4;DP=42790;AA=a|||;END=13118 GT ./. 1|0 ./. ./. 1|0
1 14464 rs546169444 A T 100 PASS AC=3;VT=SNP;AN=4;DP=53522;AA=a|||;END=14464 GT 1|1 ./. 1|0 ./. ./.
1 14599 rs531646671 T A 100 PASS AC=2;VT=SNP;AN=4;DP=64162;AA=t|||;END=14599 GT ./. 0|1 1|0 ./. ./.
1 14604 rs541940975 A G 100 PASS AC=2;VT=SNP;AN=4;DP=58462;AA=a|||;END=14604 GT ./. 0|1 1|0 ./. ./.
1 14930 rs75454623 A G 100 PASS AC=5;VT=SNP;AN=10;DP=211155;AA=a|||;END=14930 GT 1|0 0|1 0|1 1|0 1|0
Key takeaways:
- slice efficiently by genomic region and/or sample
- selectively retrieve specific VCF attributes
- queries are returned as pyarrow Tables or pandas DataFrames
- query and export results to single and combined VCFs
5. Partitioned Queries
Large queries can be split into partitions of regions, samples, or both by setting the appropriate partitioning parameter, which consists of two integers:
1. The partition index
2. The number of partitions
Query Sample Partition 1
Open the array in read mode with the sample_partition parameter set to query the 1st of 2 sample partitions.
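The corresponding cell isn’t shown here. A minimal sketch, where the query region is a hypothetical range inferred from the positions in the output below (the exact region used in the notebook isn’t given):

%%time
# Query the 1st of 2 sample partitions (partition indices are 0-based)
cfg = tiledbvcf.ReadConfig(sample_partition=(0, 2))
ds = tiledbvcf.Dataset(array_uri, mode="r", cfg=cfg)
ds.read(
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
    regions=["1:43337000-43353000"],  # hypothetical region inferred from the results below
)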
CPU times: user 8.24 ms, sys: 72 ms, total: 80.2 ms
Wall time: 67.1 ms
    sample_name contig  pos_start   pos_end alleles  fmt_GT
0       HG00096      1   43337897  43337897  [A, G]  [1, 0]
1       HG00096      1   43339092  43339092  [C, T]  [1, 1]
2       HG00097      1   43339092  43339092  [C, T]  [1, 1]
3       HG00099      1   43339092  43339092  [C, T]  [1, 1]
4       HG00100      1   43339092  43339092  [C, T]  [1, 1]
...         ...    ...        ...       ...     ...     ...
105     HG00099      1   43352685  43352685  [A, G]  [1, 1]
106     HG00100      1   43352685  43352685  [A, G]  [1, 1]
107     HG00101      1   43352685  43352685  [A, G]  [1, 1]
108     HG00099      1   43352723  43352723  [C, G]  [1, 0]
109     HG00100      1   43352723  43352723  [C, G]  [1, 0]

110 rows × 6 columns
Query Sample Partition 2
Re-open the array with the sample_partition parameter set to query the 2nd sample partition (i.e., samples 6–10).
# This will return the second subset of the samples
cfg = tiledbvcf.ReadConfig(sample_partition=(1, 2))
ds = tiledbvcf.Dataset(array_uri, mode="r", cfg=cfg)
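The read call that produced the timing and table below isn’t shown; it would mirror the partition-1 query, for example:

%%time
ds.read(
    attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
    regions=["1:43337000-43353000"],  # same hypothetical region as before
)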
CPU times: user 14.2 ms, sys: 66.8 ms, total: 81 ms
Wall time: 67.7 ms
    sample_name contig  pos_start   pos_end alleles  fmt_GT
0       HG00102      1   43337978  43337978  [T, C]  [0, 1]
1       HG00102      1   43339092  43339092  [C, T]  [1, 1]
2       HG00103      1   43339092  43339092  [C, T]  [1, 1]
3       HG00105      1   43339092  43339092  [C, T]  [1, 1]
4       HG00106      1   43339092  43339092  [C, T]  [1, 1]
...         ...    ...        ...       ...     ...     ...
112     HG00106      1   43352685  43352685  [A, G]  [1, 1]
113     HG00107      1   43352685  43352685  [A, G]  [1, 1]
114     HG00102      1   43352723  43352723  [C, G]  [1, 0]
115     HG00103      1   43352723  43352723  [C, G]  [1, 0]
116     HG00107      1   43352723  43352723  [C, G]  [0, 1]

117 rows × 6 columns
Key takeaways:
- TileDB supports fast, parallel slicing
- TileDB-VCF provides a convenient API to partition queries across the sample and genomic dimensions
6. Parallel Queries with TileDB
Setup
Import the Delayed module from tiledb.cloud. This allows us to convert a normal Python function into one that can be delayed and executed serverlessly.
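Neither the import nor the vcf_query UDF used later appears in this excerpt. A minimal sketch, assuming a helper that slices the dataset with a per-task sample partition and that query_attrs, query_regions, and query_samples are defined elsewhere in the notebook:

%%time
import tiledb.cloud
from tiledb.cloud.compute import Delayed

# Hypothetical UDF: slice a TileDB-VCF dataset and return the result as a pyarrow Table
def vcf_query(uri, attrs, regions, samples, sample_partition=(0, 1)):
    import tiledbvcf
    cfg = tiledbvcf.ReadConfig(sample_partition=sample_partition)
    ds = tiledbvcf.Dataset(uri, mode="r", cfg=cfg)
    return ds.read_arrow(attrs=attrs, regions=regions, samples=samples)

# Execute a single query serverlessly and pull the result back as a DataFrame
res = Delayed(func_exec=vcf_query, result_format=tiledb.cloud.UDFResultType.ARROW)(
    uri=array_uri,
    attrs=query_attrs,
    regions=query_regions,
    samples=query_samples,
)
res.compute().to_pandas()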
CPU times: user 15.1 ms, sys: 3.82 ms, total: 19 ms
Wall time: 8.21 s
      sample_name contig  pos_start   pos_end  fmt_GT
0         HG00099   chr1   43337743  43338411  [0, 0]
1         HG00106   chr1   43337749  43338055  [0, 0]
2         HG00102   chr1   43337756  43338088  [0, 0]
3         HG00097   chr1   43337791  43338053  [0, 0]
4         HG00100   chr1   43337812  43337850  [0, 0]
...           ...    ...        ...       ...     ...
22442     HG00105   chr1   43352768  43352779  [0, 0]
22443     HG00106   chr1   43352769  43352769  [0, 0]
22444     HG00106   chr1   43352770  43352791  [0, 0]
22445     HG00100   chr1   43352771  43352771  [0, 0]
22446     HG00100   chr1   43352772  43352781  [0, 0]

22447 rows × 5 columns
Key takeaways:
- TileDB-VCF Datasets can be sliced directly from remote object stores
- TileDB provides a flexible platform for building serverless data processing pipelines
Run Multiple Batches of Queries in Parallel
Add another UDF to combine the outputs into a single result.
def combine_results(dfs):
    # Take the partial results and combine them into one table
    import pyarrow as pa

    out = pa.concat_tables([x for x in dfs if x is not None])
    return out
Assemble and visualize the task graph.
nparts = 10
delayed_results = []

# Split the 100 samples into 10 partitions that will run in parallel;
# the only thing that changes between tasks is the sample partition index
for i in range(nparts):
    delayed_results.append(
        Delayed(
            func_exec=vcf_query,
            result_format=tiledb.cloud.UDFResultType.ARROW,
        )(
            uri=array_uri,
            attrs=query_attrs,
            regions=query_regions,
            samples=query_samples,
            sample_partition=(i, nparts),
        )
    )

# Combine all the partial results into a single table via the `combine_results` function
delayed_results = Delayed(combine_results, local=True)(delayed_results)
delayed_results.visualize()
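The final step would be to execute the task graph and fetch the combined table; a minimal sketch of that, assuming the Delayed object built above:

# Trigger serverless execution of every partition and collect the merged pyarrow Table
combined = delayed_results.compute()
combined.to_pandas()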