Blaze: Simplified High Performance Cluster Computing

02/04/2019
by Junhao Li, et al.

MapReduce and its variants have significantly simplified and accelerated the development of parallel programs. However, most MapReduce implementations focus on data-intensive tasks, whereas many real-world tasks are compute-intensive and their data fits in memory once distributed across the nodes of a cluster. For these tasks, MapReduce programs can be much slower than hand-optimized ones. We present Blaze, a C++ library that makes it easy to develop high-performance parallel programs for such compute-intensive tasks. At the core of Blaze is a highly optimized in-memory MapReduce function, which has three main improvements over conventional MapReduce implementations: eager reduction, fast serialization, and special treatment for a small fixed key range. We also offer additional conveniences that make developing parallel programs similar to developing serial programs. These improvements make Blaze an easy-to-use cluster computing library that approaches the speed of hand-optimized parallel code. We apply Blaze to some common data mining tasks, including word frequency count, PageRank, k-means, expectation maximization (Gaussian mixture model), and k-nearest neighbors. Blaze outperforms Apache Spark by more than 10 times on average for these tasks, and the speed of Blaze scales almost linearly with the number of nodes. In addition, the Blaze implementations of these tasks use only the MapReduce function and 3 utility functions, while the official Spark implementations use almost 30 different parallel primitives.
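To make the idea of eager reduction concrete, here is a minimal C++ sketch of a word frequency count. It does not use Blaze's API: `Counts` and `eager_word_count` are hypothetical names introduced only for illustration, and the "workers" are simulated one after another in a single process, whereas on a real cluster they would run in parallel on separate threads or nodes. The point it tries to show is that each worker folds every emitted key into a local table as soon as the mapper produces it, so no list of intermediate (word, 1) pairs is ever materialized before the reduce step.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type alias, not part of Blaze: word -> number of occurrences.
using Counts = std::unordered_map<std::string, std::size_t>;

// Hypothetical helper, not part of Blaze: counts words with eager reduction.
// Workers are simulated sequentially here; this is an assumption made to keep
// the sketch self-contained.
Counts eager_word_count(const std::vector<std::string>& lines,
                        std::size_t num_workers) {
    assert(num_workers > 0);

    // One partial result per worker, so workers never share a table.
    std::vector<Counts> local(num_workers);

    for (std::size_t w = 0; w < num_workers; ++w) {
        Counts& mine = local[w];
        // Worker w handles lines w, w + num_workers, w + 2 * num_workers, ...
        for (std::size_t i = w; i < lines.size(); i += num_workers) {
            std::istringstream words(lines[i]);
            std::string word;
            while (words >> word) {
                // Map emits (word, 1); the reduction (+) is applied
                // immediately instead of buffering the pair.
                ++mine[word];
            }
        }
    }

    // Final step: combine the already-reduced partial tables.
    Counts total;
    for (const Counts& part : local) {
        for (const auto& kv : part) {
            total[kv.first] += kv.second;
        }
    }
    return total;
}
```

Calling `eager_word_count` with the three lines `"a b a"`, `"b c"`, and `"a"` and two workers yields a table with three keys. Because each worker's table is already reduced, only one entry per distinct word per worker has to be combined at the end, rather than one entry per word occurrence.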

