Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

06/03/2018
by Nicholas Tucci, et al.

Scalable and efficient processing of genome sequence data, for example for variant discovery, is key to the mainstream adoption of high-throughput technology for disease prevention and clinical use. Achieving scalability, however, requires significant effort to enable parallel execution of the analysis tools that make up the pipelines. This is facilitated by the new Spark versions of the well-known GATK toolkit, which offer a black-box approach by transparently exploiting the underlying MapReduce architecture. In this paper we report on our experience implementing a standard variant discovery pipeline using GATK 4.0 with Docker-based deployment over a cluster. We provide a preliminary performance analysis, comparing processing times and cost to those of the new Microsoft Genomics Services.
