SparkBeagle: Scalable Genotype Imputation from Distributed Whole-Genome Reference Panels in the Cloud




Permalink

http://hdl.handle.net/10138/330360

Citation

Maarala, A I, Pärn, K, Nunez Fontarnau, J & Heljanko, K 2020, SparkBeagle: Scalable Genotype Imputation from Distributed Whole-Genome Reference Panels in the Cloud. in BCB '20: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 97, ACM, ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 21/09/2019. https://doi.org/10.1145/3388440.3414860

Title: SparkBeagle: Scalable Genotype Imputation from Distributed Whole-Genome Reference Panels in the Cloud
Author: Maarala, Altti Ilari; Pärn, Kalle; Nunez Fontarnau, Javier; Heljanko, Keijo
Contributor: University of Helsinki, Department of Computer Science
University of Helsinki, Institute for Molecular Medicine Finland
University of Helsinki, Genomics of Neurological and Neuropsychiatric Disorders
University of Helsinki, Helsinki Institute for Information Technology
Publisher: ACM
Date: 2020-09
Language: eng
Number of pages: 8
Belongs to series: BCB '20: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
ISBN: 978-1-4503-7964-9
URI: http://hdl.handle.net/10138/330360
Abstract: Massive whole-genome genotype reference panels now provide accurate and fast genotyping by imputation for high-resolution genome-wide association (GWA) studies. Imputation-assisted genotyping can increase the genomic coverage of genotypes and thus satisfy the resolution required in comprehensive GWA studies in a cost-effective manner. However, the imputation of missing genotypes from large reference panels is a compute-intensive process that requires high-performance computing (HPC). Although HPC relies on highly distributed and parallel computing, current imputation tools and existing algorithms have not been developed to fully exploit the power of distributed computing. To this end, we have developed SparkBeagle, a scalable, fast, and accurate distributed genotype imputation tool based on the popular Beagle software. SparkBeagle is designed for HPC and cloud computing environments, and it is implemented on top of the Apache Spark distributed computing framework. We have carried out scalability experiments by imputing 64,976,316 variants of 2504 samples from the 1000 Genomes reference panel in the cloud. SparkBeagle shows near-linear scalability as the number of computing nodes increases. A speedup of 30x was achieved with 40 nodes: the imputation time for the whole data set decreased from 565 minutes with single-node parallel execution to 18 minutes. Near-identical imputation accuracy was measured in the concordance analysis between the original Beagle and the distributed SparkBeagle tool.
Subject: 113 Computer and information sciences
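
The scalability figures quoted in the abstract can be checked with a little arithmetic. The sketch below is only an illustration: it uses the three numbers stated in the abstract (565 minutes, 18 minutes, 40 nodes), and the function and variable names are hypothetical, not taken from SparkBeagle. Parallel efficiency here is measured relative to the single-node parallel run, since that is the baseline the abstract reports.

```python
# Figures quoted in the abstract (not measured here).
single_node_minutes = 565   # single-node parallel execution
cluster_minutes = 18        # 40-node execution
nodes = 40


def speedup(baseline_minutes: float, parallel_minutes: float) -> float:
    """Ratio of baseline run time to parallel run time."""
    return baseline_minutes / parallel_minutes


def parallel_efficiency(speedup_value: float, node_count: int) -> float:
    """Fraction of ideal linear speedup achieved on node_count nodes."""
    return speedup_value / node_count


s = speedup(single_node_minutes, cluster_minutes)
e = parallel_efficiency(s, nodes)
print(f"speedup: {s:.1f}x, efficiency: {e:.0%}")
# prints "speedup: 31.4x, efficiency: 78%"
```

The computed ratio of about 31x agrees with the rounded 30x speedup stated in the abstract.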


