Towards Interactive, Adaptive and Result-aware Big Data Analytics

by Avinash Kumar, et al.

As data volumes grow across applications, analyzing large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A common objective of these traditional cluster-based frameworks is high performance, which often means low end-to-end execution time or latency. The widespread adoption of data analytics has led to calls to improve the traditional ways of big data processing. There have been demands to make the analytics process more interactive and adaptive, especially for long-running jobs. In addition, the importance of initial results in the iterative process of data wrangling has motivated a result-aware approach to big data analytics. This dissertation is motivated by these calls for improvement and by experience gained over the past few years while working on Texera, a collaborative data analytics service being developed at UC Irvine. The dissertation consists of three main parts. The first part presents the design of the Amber engine, which serves as the backend data processing framework for the Texera service. The second part presents Reshape, an adaptive and result-aware skew-handling framework. Reshape uses fast control messages to implement iterative skew mitigation techniques for a wide variety of operators, and its mitigation techniques are also analyzed in terms of their effects on the results shown to the user. The last part presents Maestro, a result-aware workflow scheduling framework. It describes how to schedule a workflow for execution on computing clusters and how to make result-aware decisions while doing so. Together, these contributions improve the data analytics process by bringing interactivity, adaptivity, and result-awareness into it.




