Read from the Dataset
Basic Utils
Before slicing the data, you may wish to get some information about your dataset, such as the sample names, the attributes you can query, etc.
You can get the sample names as follows:
import tiledbvcf
uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode = "r") # open in "Read" mode
ds.samples()
tiledbvcf list -u my_vcf_dataset
You can get the attributes as follows:
import tiledbvcf
uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode = "r") # open in "Read" mode
ds.attributes() # will print all queryable attributes
ds.attributes(attr_type = "builtin") # will print all materialized attributes
tiledbvcf stat -u my_vcf_dataset
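One use for these listings is to check a planned query before running it. The following is a minimal sketch, not part of the TileDB-VCF API: `find_unknown` and `FakeDataset` are hypothetical names introduced here for illustration, and the sketch assumes that `samples()` and `attributes()` each return a list of strings.

```python
def find_unknown(ds, samples=(), attrs=()):
    """Return (unknown_samples, unknown_attrs) for a planned query.

    `ds` is any object exposing samples() and attributes(),
    for example an open tiledbvcf.Dataset (assumed to return lists of str).
    """
    known_samples = set(ds.samples())
    known_attrs = set(ds.attributes())
    # Keep the caller's order so the report is easy to read.
    unknown_samples = [s for s in samples if s not in known_samples]
    unknown_attrs = [a for a in attrs if a not in known_attrs]
    return unknown_samples, unknown_attrs


class FakeDataset:
    """Hypothetical stand-in so the sketch runs without a real dataset."""

    def samples(self):
        return ["HG0099", "HG00100"]

    def attributes(self):
        return ["alleles", "pos_start", "pos_end"]


missing_samples, missing_attrs = find_unknown(
    FakeDataset(),
    samples=["HG0099", "NA12878"],
    attrs=["alleles", "qual"],
)
print(missing_samples, missing_attrs)
```

With a real dataset, you would pass the object returned by tiledbvcf.Dataset(uri, mode = "r") in place of `FakeDataset()`.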
Reading
You can rapidly read from a TileDB-VCF dataset by providing three main parameters (all optional):
- A subset of the samples
- A subset of the attributes
- One or more genomic ranges
  - Either as strings in the format chr:pos_range
  - Or via a BED file
import tiledbvcf
uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri, mode = "r") # open in "Read" mode
ds.read(
    attrs = ["alleles", "pos_start", "pos_end"],
    regions = ["1:113409605-113475691", "1:113500000-113600000"],
    # or pass regions as follows:
    # bed_file = <bed_filename>
    samples = ['HG0099', 'HG00100']
)
tiledbvcf export \
  --uri my_vcf_dataset \
  --output-format t \
  --tsv-fields ALT,Q:POS,Q:END \
  --sample-names HG0099,HG00100 \
  --regions 1:113409605-113475691,1:113500000-113600000
# or pass the regions in a BED file as follows:
# --regions-file <bed_filename>
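If your intervals live in Python rather than in a file, both forms of the region input can be generated from the same data. The sketch below is illustrative only: `region_string`, `bed_line` and `write_bed` are hypothetical helpers, not TileDB-VCF functions. It assumes that region strings use 1-based, inclusive coordinates, which you should confirm for your version. BED files, by definition, use a 0-based start and an exclusive end, so the start is shifted by one.

```python
def region_string(contig, start, end):
    """Format a 1-based, inclusive interval as 'contig:start-end'."""
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {contig}:{start}-{end}")
    return f"{contig}:{start}-{end}"


def bed_line(contig, start, end):
    """Format the same interval as a BED record (0-based start, exclusive end)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {contig}:{start}-{end}")
    return f"{contig}\t{start - 1}\t{end}"


def write_bed(path, intervals):
    """Write one BED record per (contig, start, end) tuple."""
    with open(path, "w") as fh:
        for contig, start, end in intervals:
            fh.write(bed_line(contig, start, end) + "\n")


# The two ranges used in the examples above.
intervals = [("1", 113409605, 113475691), ("1", 113500000, 113600000)]

# Strings suitable for the `regions` argument or the --regions flag.
regions = [region_string(*iv) for iv in intervals]
print(regions)
```

Calling write_bed("regions.bed", intervals) would produce a file that can be passed through bed_file or --regions-file; the filename here is only a placeholder.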