Reshape: Adaptive Result-aware Skew Handling for Exploratory Analysis on Big Data

by   Avinash Kumar, et al.

The process of data analysis, especially in GUI-based analytics systems, is highly exploratory. The user iteratively refines a workflow multiple times before arriving at the final workflow. In such an exploratory setting, it is valuable to the user if the initial results of the workflow are representative of the final answers so that the user can refine the workflow without waiting for the completion of its execution. Partitioning skew may lead to the production of misleading initial results during the execution. In this paper, we explore skew and its mitigation strategies from the perspective of the results shown to the user. We present a novel framework called Reshape that can adaptively handle partitioning skew in pipelined execution. Reshape employs a two-phase approach that transfers load in a fine-tuned manner to mitigate skew iteratively during execution, thus enabling it to handle changes in input-data distribution. Reshape has the ability to adaptively adjust skew-handling parameters, which reduces the technical burden on the users. Reshape supports a variety of operators such as HashJoin, Group-by, and Sort. We implemented Reshape on top of two big data engines, namely Amber and Flink, to demonstrate its generality and efficiency, and report an experimental evaluation using real and synthetic datasets.


page 1

page 2

page 3

page 4


Towards Interactive, Adaptive and Result-aware Big Data Analytics

As data volumes grow across applications, analytics of large amounts of ...

StreamFlow: cross-breeding cloud with HPC

Workflows are among the most commonly used tools in a variety of executi...

Scheduling Algorithms for Efficient Execution of Stream Workflow Applications in Multicloud Environments

Big data processing applications are becoming more and more complex. The...

Adaptive Scheduling for Efficient Execution of Dynamic Stream Workflows

Stream workflow application such as online anomaly detection or online t...

Towards a Unified User Interface for Visual Analysis of Retinal Data in Ophthalmology

The visual analysis of retinal data contributes to the understanding of ...

SWARM: Adaptive Load Balancing in Distributed Streaming Systems for Big Spatial Data

The proliferation of GPS-enabled devices has led to the development of n...

A Framework for Online Investment Algorithms

The artificial segmentation of an investment management process into a w...

Please sign up or login with your details

Forgot password? Click here to reset