Exoshuffle: Large-Scale Shuffle at the Application Level

03/09/2022
by Frank Sifei Luan, et al.

Shuffle is a key primitive in large-scale data processing applications. The difficulty of large-scale shuffle has inspired a myriad of implementations. While these have greatly improved shuffle performance and reliability over time, the improvement has come at a cost: flexibility. First, each shuffle system is essentially built from scratch, which requires significant developer effort. Second, because each shuffle system is monolithic, it cannot flexibly support other applications, such as online aggregation of shuffle results. We show that shuffle can be implemented with high performance and reliability on a general-purpose abstraction for distributed computing: distributed futures. While previous systems have implemented shuffle on top of distributed futures, we are the first to identify and build the common components necessary to support a large-scale shuffle. We show that it is possible to: (1) express optimizations from previous shuffle systems in a few hundred lines of purely application-level Python code, and (2) achieve interoperability with other data processing applications without modifying the shuffle system. Thus, we present Exoshuffle, an application-level shuffle system that outperforms Spark and achieves 82
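
To make the idea of an application-level shuffle concrete, the following is a minimal sketch of a map/reduce shuffle written purely against a futures interface. It is not Exoshuffle's implementation. As an assumption for illustration, it uses Python's standard-library `concurrent.futures` thread pool as a single-process stand-in for a distributed futures system, and the names `map_task`, `reduce_task`, and `simple_shuffle` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def map_task(records, num_reducers):
    """Partition one input block into one bucket per reducer, by key hash."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets


def reduce_task(map_futures, reducer_id):
    """Pull this reducer's bucket from every map output and sum values per key."""
    totals = {}
    for fut in map_futures:
        # Resolving the future stands in for fetching a remote map output.
        for key, value in fut.result()[reducer_id]:
            totals[key] = totals.get(key, 0) + value
    return totals


def simple_shuffle(blocks, num_reducers, max_workers=4):
    """Run an all-to-all shuffle: every reducer reads from every mapper."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map tasks are submitted first, so the pool's FIFO queue starts
        # all of them before any reduce task begins waiting on their results.
        map_futures = [pool.submit(map_task, b, num_reducers) for b in blocks]
        reduce_futures = [
            pool.submit(reduce_task, map_futures, r) for r in range(num_reducers)
        ]
        merged = {}
        for fut in reduce_futures:
            # Each key hashes to exactly one reducer, so outputs are disjoint.
            merged.update(fut.result())
    return merged
```

In this sketch the shuffle logic lives entirely in ordinary application code: swapping the partitioning or merge strategy means editing these functions, not the underlying task executor. A real distributed futures system would additionally handle data transfer between nodes and failure recovery, which the thread pool here does not model.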
