Communication Efficient Checking of Big Data Operations

We propose fast probabilistic algorithms with low (i.e., sublinear in the input size) communication volume to check the correctness of operations in Big Data processing frameworks and distributed databases. Our checkers cover many of the commonly used operations, including sum, average, median, and minimum aggregation, as well as sorting, union, merge, and zip. An experimental evaluation of our implementation in Thrill (Bingmann et al., 2016) confirms the low overhead and high failure detection rate predicted by theoretical analysis.
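
To make the idea of a low-communication probabilistic checker concrete, the following is a minimal sketch, not the paper's exact construction, of a standard multiset-fingerprint check over a prime field (Schwartz-Zippel style). It illustrates how permutation-invariant operations such as sorting, union, and merge can be checked by having each worker communicate only a single field element; all names and the assumption of integer items are illustrative.

```python
import random

P = (1 << 61) - 1            # large prime used as the field modulus
R = random.randrange(1, P)   # random evaluation point, drawn after the data is fixed

def local_fingerprint(partition) -> int:
    """Evaluate prod_{a in partition} (R - a) mod P on one worker.
    Only this single field element needs to be communicated."""
    fp = 1
    for item in partition:
        fp = fp * (R - item) % P
    return fp

def check_multiset_equal(input_partitions, output_partitions) -> bool:
    """Checker: combine one field element per worker and compare both sides.
    Two differing multisets of size n collide with probability at most n / P."""
    fp_in = fp_out = 1
    for part in input_partitions:
        fp_in = fp_in * local_fingerprint(part) % P
    for part in output_partitions:
        fp_out = fp_out * local_fingerprint(part) % P
    return fp_in == fp_out

# Example: a distributed sort must preserve its input as a multiset.
inp = [[5, 3, 9], [1, 3]]    # input spread over two workers
out = [[1, 3], [3, 5, 9]]    # claimed sorted output, repartitioned
assert check_multiset_equal(inp, out)
assert not check_multiset_equal(inp, [[1, 3], [3, 5, 8]])
```

The communication volume here is one machine word per worker regardless of input size, and an incorrect result slips through only if the random point R happens to be a root of the difference polynomial, which occurs with negligible probability over a field of size about 2^61. The checkers in the paper cover further operations (e.g., sum, average, median, minimum, zip); this sketch only conveys the general flavor of probabilistic checking with sublinear communication.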
