The Parallelism Motifs of Genomic Data Analysis

01/20/2020
by Katherine Yelick, et al.

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store this data and share it with the research community, but some genomic data analysis problems require large-scale computational platforms to meet both their memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today, and they place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns, or motifs, that help inform parallelization strategies, and we compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing from those lists.

