Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

05/30/2018
by   Alex Gittens, et al.

Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations (in particular, many linear algebra computations that are the basis for solving common machine learning problems) are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message Passing Interface (MPI). To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation. We also compare the performance of pure Spark implementations with that of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use two data science case studies: a large-scale application of the conjugate gradient method to solve very large linear systems arising in a speech classification problem, where we see an improvement of an order of magnitude; and the truncated singular value decomposition (SVD) of a 400 GB three-dimensional ocean temperature data set, where we see a speedup of up to 7.9x. We also illustrate that the truncated SVD computation is easily scalable to terabyte-sized data by applying it to data sets of sizes up to 17.6 TB.
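For readers unfamiliar with the first case study, the following is a minimal single-machine sketch of the conjugate gradient method for a symmetric positive definite system. It is not Alchemist's API and not the authors' implementation; the names `conjugate_gradient` and `dense_matvec` and the 2x2 example system are hypothetical, chosen for illustration only. The solver touches the matrix only through a matrix-vector product, which is the step a distributed implementation (whether in Spark or in an MPI-based library) would parallelize.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    `matvec` is any callable computing v -> A v; the matrix itself is
    never accessed directly (hypothetical interface for illustration).
    """
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the start x = 0
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # stop once the residual norm is small
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        # New direction: current residual plus a multiple of the old direction.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def dense_matvec(A):
    """Wrap a small dense matrix (list of rows) as a matvec callable."""
    return lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]


# Tiny illustrative system, unrelated to the speech data in the paper.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(dense_matvec(A), b)
```

Because each iteration is dominated by one matrix-vector product, the per-iteration cost of that product in the underlying framework largely determines the overall running time.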


