GWAS with the 1000 Genomes TileDB-VCF Dataset

In this notebook we’ll perform a rudimentary genome-wide association study using the 1000 Genomes (1KG) dataset. The goal of this tutorial is to demonstrate the mechanics of performing genome-wide analyses using variant call data stored with TileDB-VCF and how such analyses can be easily scaled using TileDB’s serverless computation platform.

To get started we’ll load a few required packages and define several variables that we’ll refer to throughout the notebook.

Setup

Packages

import tiledb
import tiledbvcf
import tiledb.cloud
from tiledb.cloud.compute import Delayed

import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

print(
    f"tileDB-vcf v{tiledbvcf.version}\n"
    f"tileDB-cloud v{tiledb.cloud.version.version}"
)
tileDB-vcf v0.22.1.dev27
tileDB-cloud v0.10.0

Variables

# variables
genome = "hg19"
array_uri = "tiledb://TileDB-Inc/vcf-1kg-phase3"
sample_array = "tiledb://TileDB-Inc/vcf-1kg-sample-metadata"

# vcf attributes to include
attrs = [
    "sample_name", 
    "contig",
    "pos_start",
    "pos_end", 
    "fmt_GT"
]
!tiledbvcf stat --uri tiledb://TileDB-Inc/vcf-1kg-phase3
/bin/bash: line 1: tiledbvcf: command not found

Sample Phenotypes

Because no phenotypic data was collected for the 1KG dataset, we’ll create a dummy trait indicating whether or not a sample identified as female.

with tiledb.open(sample_array) as A:
    df_samples = A.df[:]
    
# dummy trait
df_samples["is_female"] = df_samples.gender == "female"
df_samples
pop super_pop gender is_female
sampleuid
HG00096 GBR EUR male False
HG00097 GBR EUR female True
HG00099 GBR EUR female True
HG00100 GBR EUR female True
HG00101 GBR EUR male False
... ... ... ... ...
NA21137 GIH SAS female True
NA21141 GIH SAS female True
NA21142 GIH SAS female True
NA21143 GIH SAS female True
NA21144 GIH SAS female True

2504 rows × 4 columns

Scaling GWAS with Serverless UDFs

We can parallelize the association analysis by dividing the genome into disjoint partitions that are queried, filtered, and analyzed by separate serverless nodes.

Binned Genomic Regions

To get started we’ll use the genome_windows UDF to divide a genome into windows (i.e., bins) and return a list of regions in the chr:start-end format that tiledbvcf expects.

bed_regions = tiledb.cloud.udf.exec(
    func = "aaronwolen/genome_windows", 
    genome = "hg19", 
    window = 100000,
    style = "NCBI",
)

For now we’ll filter this list to include only regions from chromosome 20.

bed_regions = [r for r in bed_regions if r.startswith("20:")]
print(f"{len(bed_regions)} total regions")
bed_regions[:5]
631 total regions
['20:1-100000',
 '20:100001-200000',
 '20:200001-300000',
 '20:300001-400000',
 '20:400001-500000']

GWAS Steps

Each of the individual tasks we performed above must be wrapped in functions so they can be packaged as UDFs and shipped to the serverless compute infrastructure. We’ll define two separate functions:

1. vcf_snp_query()

  • queries a specified window from the genome
  • filters the sites based on the defined criteria
%%time
res = tiledb.cloud.udf.exec(
    func = "aaronwolen/vcf_snp_query",
    uri = array_uri,
    attrs = attrs,
    regions = bed_regions[:2],
)

res.to_pandas()
CPU times: user 94.3 ms, sys: 62.6 ms, total: 157 ms
Wall time: 9.48 s
sample_name contig pos_start pos_end fmt_GT
0 HG00097 20 60828 60828 [0, 1]
1 HG00101 20 61098 61098 [0, 1]
2 HG00103 20 61098 61098 [1, 0]
3 HG00109 20 61098 61098 [0, 1]
4 HG00111 20 61098 61098 [0, 1]
... ... ... ... ... ...
577809 NA21142 20 195696 195696 [1, 1]
577810 NA21143 20 195696 195696 [1, 1]
577811 NA21144 20 195696 195696 [1, 1]
577812 NA21142 20 196604 196604 [0, 1]
577813 NA21144 20 196604 196604 [1, 0]

551155 rows × 5 columns

2. filter_variants()

  • calculates allele count and minor allele frequencies
  • optinally filters for variants using the maf threshold
%%time
res2 = tiledb.cloud.udf.exec(
    func = "aaronwolen/filter_variants", 
    df = res,
    maf = (0.05, 0.95),
)

res2.to_pandas()
CPU times: user 183 ms, sys: 108 ms, total: 292 ms
Wall time: 7.69 s
sample_name pos_end fmt_GT dose
contig pos_start
20 61098 HG00101 61098 [0, 1] 1
61098 HG00103 61098 [1, 0] 1
61098 HG00109 61098 [0, 1] 1
61098 HG00111 61098 [0, 1] 1
61098 HG00115 61098 [1, 0] 1
... ... ... ... ...
195696 NA21142 195696 [1, 1] 2
195696 NA21143 195696 [1, 1] 2
195696 NA21144 195696 [1, 1] 2
196604 NA21142 196604 [0, 1] 1
196604 NA21144 196604 [1, 0] 1

488449 rows × 4 columns

3. calc_gwas()

Reshapes the Dataframe of variants into a variant by sample matrix and performs a chi-squared analysis on site for the included trait

%%time
res3 = tiledb.cloud.udf.exec(
    func = "aaronwolen/calc_gwas",
    df = res2,
    trait = df_samples.is_female,
)

res3.to_pandas()
CPU times: user 90.5 ms, sys: 23 ms, total: 113 ms
Wall time: 9.66 s
contig pos_start oddsratio pvalue
0 20 61098 0.494083 0.482112
1 20 61138 0.289018 0.590851
2 20 61795 0.068686 0.793260
3 20 63231 0.851720 0.356066
4 20 63244 0.004420 0.946994
... ... ... ... ...
423 20 198965 0.150694 0.697873
424 20 198977 0.025577 0.872937
425 20 199078 0.000250 0.987379
426 20 199114 0.000250 0.987379
427 20 199986 0.000250 0.987379

428 rows × 4 columns

4. combine_results()

  • creates a single dataframe containing individual results from each partition
def combine_results(dfs):
    """
    Combine GWAS Results

    :param list of dataframes: Region-specific GWAS results to combine
    """
    import pyarrow as pa
    print(f"Input list contains {len(dfs)} Arrow tables")
    out = pa.concat_tables([x for x in dfs if x is not None])
    return out

Construct UDF-based Task Graph

To run the complete pipeline serverlessly we need to create Delayed versions of these functions with the appropriate parameters. For query_region() we need to create one delayed instance for each of the genomic regions to be queried in parallel.

nregions = len(bed_regions[:100])
nparts = nregions // 5
print(f"Analyzing {nregions} regions across {nparts} partitions")

delayed_queries = []
for i in range(nparts):
    delayed_queries.append(
        Delayed(
            "aaronwolen/vcf_snp_query",
            result_format=tiledb.cloud.UDFResultType.ARROW,
            name=f"VCF SNP Query {i}",
        )(
            uri=array_uri,
            attrs=attrs,
            regions=bed_regions[:nregions],
            region_partition=(i, nparts),
            memory_budget_mb=1024,
        )
    )

delayed_filters = [
    Delayed("aaronwolen/filter_variants", name=f"VCF Filter {i[0]}")(
        df=i[1], maf=(0.05, 0.95)
    )
    for i in enumerate(delayed_queries)
]

delayed_gwas = [
    Delayed("aaronwolen/calc_gwas", name=f"VCF GWAS {i[0]}")(
        df=i[1], trait=df_samples.is_female
    )
    for i in enumerate(delayed_filters)
]

delayed_results = Delayed(combine_results, local=True)(delayed_gwas)
delayed_results.visualize()
Analyzing 100 regions across 20 partitions

Finally, we can execute our distributed GWAS and collect the results.

%%time
results = delayed_results.compute(name="GWAS Tutorial")
Input list contains 20 Arrow tables
CPU times: user 18.4 s, sys: 11.7 s, total: 30.1 s
Wall time: 3min 11s
results.to_pandas()
contig pos_start oddsratio pvalue
0 20 61098 0.494083 0.482112
1 20 61138 0.289018 0.590851
2 20 61795 0.068686 0.793260
3 20 63231 0.851720 0.356066
4 20 63244 0.004420 0.946994
... ... ... ... ...
30836 20 9999302 1.173945 0.278592
30837 20 9999622 1.173945 0.278592
30838 20 9999661 1.173945 0.278592
30839 20 9999900 0.999755 0.317370
30840 20 9999996 0.261170 0.609317

30841 rows × 4 columns

And no association test would be complete without a manhattan plot:

results = results.to_pandas()
results["log_pvalue"] = -np.log10(results.pvalue)

plt.figure(figsize = (16, 3))

ax = sns.scatterplot(
    x = "pos_start", 
    y = "log_pvalue", 
    data = results
)

ax.set(xlabel = "Chr 1 Position", ylabel = "-log10 p-value")

Summary

Here we’ve demonstrated how genome-wide pipelines can be encapsulated as distinct tasks, distributed as UDFs, and easily deployed in parallel at scale.