Alchemist: An Apache Spark <=> MPI Interface

06/03/2018
by Alex Gittens, et al.

The Apache Spark framework for distributed computation is popular in the data analytics community due to its ease of use, but its MapReduce-style programming model can incur significant overheads when performing computations that do not map directly onto this model. One way to mitigate these costs is to off-load computations onto MPI codes. In recent work, we introduced Alchemist, a system for the analysis of large-scale data sets. Alchemist calls MPI-based libraries from within Spark applications, and it has minimal coding, communication, and memory overheads. In particular, Alchemist allows users to retain the productivity benefits of working within the Spark software ecosystem without sacrificing performance in linear algebra, machine learning, and other related computations. In this paper, we discuss the motivation behind the development of Alchemist, and we provide a detailed overview of its design and usage. We also demonstrate the efficiency of our approach on medium-to-large data sets, using two standard linear algebra operations, namely matrix multiplication and the truncated singular value decomposition of a dense matrix, and we compare the performance of Spark with that of Spark+Alchemist. These computations are run on the NERSC supercomputer Cori Phase 1, a Cray XC40.
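As a rough illustration of the kind of overhead the abstract refers to, the sketch below expresses dense matrix multiplication as naive map, shuffle, and reduce steps in plain Python. It is not Spark or Alchemist code; the function name `mapreduce_matmul` and the key layout are assumptions made purely for illustration. For an m x n by n x p product, the map step in this formulation emits m*n*p intermediate key-value pairs, which is intermediate data that a direct library routine does not need to materialize.

```python
from collections import defaultdict


def mapreduce_matmul(A, B):
    """Multiply dense A (m x n) by B (n x p) in a naive map/shuffle/reduce style.

    Hypothetical helper for illustration only.
    Returns the product and the number of intermediate pairs emitted.
    """
    m, n, p = len(A), len(B), len(B[0])

    # Map: emit one ((i, j), partial_product) pair per (i, k, j) triple.
    pairs = [((i, j), A[i][k] * B[k][j])
             for i in range(m) for k in range(n) for j in range(p)]

    # Shuffle: group partial products by output coordinate.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: sum each group to obtain one output entry.
    C = [[sum(groups[(i, j)]) for j in range(p)] for i in range(m)]
    return C, len(pairs)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, emitted = mapreduce_matmul(A, B)
print(C)        # [[19, 22], [43, 50]]
print(emitted)  # 8
```

Even for this 2 x 2 example, eight intermediate pairs are created and grouped before any output entry is produced; the count grows as m*n*p for larger matrices.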


research
05/30/2018

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

Apache Spark is a popular system aimed at the analysis of large data set...
research
10/23/2018

numpywren: serverless linear algebra

Linear algebra operations are widely used in scientific computing and ma...
research
01/18/2019

TuckerMPI: A Parallel C++/MPI Software Package for Large-scale Data Compression via the Tucker Tensor Decomposition

Our goal is compression of massive-scale grid-structured data, such as t...
research
10/24/2016

Large Scale Parallel Computations in R through Elemental

Even though in recent years the scale of statistical analysis problems h...
research
06/20/2023

A C++20 Interface for MPI 4.0

We present a modern C++20 interface for MPI 4.0. The interface utilizes ...
research
07/27/2020

HeAT – a Distributed and GPU-accelerated Tensor Framework for Data Analytics

To cope with the rapid growth in available data, the efficiency of data ...
research
05/30/2014

Online and Adaptive Pseudoinverse Solutions for ELM Weights

The ELM method has become widely used for classification and regressions...
