vcf.split

cloud.vcf.split

Split samples from multi-sample VCF.

Functions

Name Description
ls_samples List samples in an aggregate VCF.
split_one_sample Split one sample from multi-sample VCF.
split_vcf Split individual sample VCFs from an aggregate VCF.

ls_samples

cloud.vcf.split.ls_samples(vcf_uri, config=None)

List samples in an aggregate VCF.

Parameters

Name Type Description Default
vcf_uri str S3 path to aggregate VCF. required
config Optional[Mapping[str, str]] TileDB config params. None

Returns

Name Type Description
list[str] Samples included in VCF.
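
A minimal usage sketch, assuming the module is importable as `tiledb.cloud.vcf.split` (this page shows only `cloud.vcf.split`) and that TileDB Cloud and storage credentials are already configured. The `filter_by_prefix` and `list_matching_samples` helpers and the S3 path are hypothetical and not part of the library.

```python
from typing import List, Mapping, Optional, Sequence


def filter_by_prefix(samples: Sequence[str], prefix: str) -> List[str]:
    """Keep only the sample names that start with prefix, sorted."""
    return sorted(s for s in samples if s.startswith(prefix))


def list_matching_samples(
    vcf_uri: str,
    prefix: str,
    config: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """List the samples in an aggregate VCF whose names start with prefix."""
    # Assumed import path; imported lazily so this file loads
    # even where tiledb-cloud is not installed.
    from tiledb.cloud.vcf.split import ls_samples

    return filter_by_prefix(ls_samples(vcf_uri, config=config), prefix)


# Hypothetical call; replace the URI with a real aggregate VCF:
# list_matching_samples("s3://my-bucket/cohort.vcf.gz", "HG")
```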

split_one_sample

cloud.vcf.split.split_one_sample(vcf_uri, sample, output_uri, config=None)

Split one sample from multi-sample VCF.

Parameters

Name Type Description Default
vcf_uri str URI of VCF to isolate from. required
sample str Sample name to isolate. required
output_uri str URI to deposit isolated VCF. required
config Optional[Mapping[str, str]] TileDB config object. None

Returns

Name Type Description
str URI of isolated sample.
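
The sketch below shows one way to guard a call to `split_one_sample` by first checking the sample name against `ls_samples`. It assumes the `tiledb.cloud.vcf.split` import path; `require_sample` and `isolate_sample` are hypothetical helpers. Because this page does not say how the output file is named, the sketch relies on the returned URI instead of constructing one.

```python
from typing import Mapping, Optional, Sequence


def require_sample(sample: str, available: Sequence[str]) -> str:
    """Return sample unchanged, or raise if it is not in the aggregate VCF."""
    if sample not in available:
        raise ValueError(f"sample {sample!r} not found in aggregate VCF")
    return sample


def isolate_sample(
    vcf_uri: str,
    sample: str,
    output_uri: str,
    config: Optional[Mapping[str, str]] = None,
) -> str:
    """Validate the sample name, then split it out and return its URI."""
    # Assumed import path; imported lazily so this file loads
    # even where tiledb-cloud is not installed.
    from tiledb.cloud.vcf.split import ls_samples, split_one_sample

    require_sample(sample, ls_samples(vcf_uri, config=config))
    return split_one_sample(vcf_uri, sample, output_uri, config=config)
```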

split_vcf

cloud.vcf.split.split_vcf(
    vcf_uri,
    output_uri,
    namespace,
    acn,
    resources={'cpu': '2', 'memory': '30Gi'},
    compute=True,
    verbose=False,
    samples=None,
    retry_count=1,
    max_workers=100,
    config=None,
)

Split individual sample VCFs from an aggregate VCF.

Given an aggregate VCF file containing multiple samples, split it into isolated VCFs, one per sample. To split out only some of the samples, pass their names in samples.

Parameters

Name Type Description Default
vcf_uri str Aggregate VCF URI. required
output_uri str Output URI to write isolated VCFs. required
namespace str TileDB Cloud namespace in which to run the task graph. required
acn str Friendly name of the access credential used to authenticate storage I/O. required
resources Mapping[str, str] Resources allocated to the splitting UDF; start with the default. {'cpu': '2', 'memory': '30Gi'}
compute bool Whether to execute DAG. True
verbose bool Logging verbosity. False
samples Optional[Sequence[str]] Sample names within vcf_uri to isolate. If None (default), all samples are isolated. None
retry_count int Number of Node retries. 1
max_workers int Max workers to engage simultaneously. 100
config Optional[Mapping[str, int]] TileDB configuration parameters used to configure virtual filesystem handler. None

Returns

Name Type Description
DAG DAG instantiated as specified.
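
As a rough sketch of how the parameters above fit together, the code below builds the keyword arguments for a subset split and submits the graph. It assumes the `tiledb.cloud.vcf.split` import path, that `compute=False` returns the graph without starting it, and that the returned object exposes `compute()` and `wait()` as `tiledb.cloud.dag.DAG` does. `build_split_kwargs` and `submit_split` are hypothetical helpers, and every URI, namespace, and credential name is a placeholder.

```python
from typing import Any, Dict, Optional, Sequence


def build_split_kwargs(
    vcf_uri: str,
    output_uri: str,
    namespace: str,
    acn: str,
    samples: Optional[Sequence[str]] = None,
    memory: str = "30Gi",
) -> Dict[str, Any]:
    """Assemble split_vcf keyword arguments with the graph left unstarted."""
    return {
        "vcf_uri": vcf_uri,
        "output_uri": output_uri,
        "namespace": namespace,
        "acn": acn,
        # Documented default CPU; memory can be raised for large inputs.
        "resources": {"cpu": "2", "memory": memory},
        # None keeps the documented default of isolating every sample.
        "samples": list(samples) if samples is not None else None,
        # Build the DAG only; it is started explicitly in submit_split.
        "compute": False,
    }


def submit_split(**kwargs: Any):
    """Create the DAG, start it, and block until it finishes."""
    # Assumed import path; imported lazily so this file loads
    # even where tiledb-cloud is not installed.
    from tiledb.cloud.vcf.split import split_vcf

    dag = split_vcf(**kwargs)
    dag.compute()  # assumed DAG method: start execution
    dag.wait()     # assumed DAG method: block until completion
    return dag
```

Leaving `compute=True` (the default) should make the explicit `dag.compute()` call unnecessary; building the graph first is shown here only so it can be inspected before it runs.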