Communication Efficient Checking of Big Data Operations

We propose fast probabilistic algorithms with low (i.e., sublinear in the input size) communication volume to check the correctness of operations in Big Data processing frameworks and distributed databases. Our checkers cover many of the commonly used operations, including sum, average, median, and minimum aggregation, as well as sorting, union, merge, and zip. An experimental evaluation of our implementation in Thrill (Bingmann et al., 2016) confirms the low overhead and high failure detection rate predicted by theoretical analysis.
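To make the idea of a low-communication probabilistic checker concrete, the following is a minimal sketch of a sort checker. It is an illustration only, not the authors' algorithm and not Thrill's API. It uses a standard polynomial multiset fingerprint: each worker reduces its partition to a single number, so only a constant amount of data per worker needs to be exchanged. The names `multiset_fingerprint` and `check_sorted`, the modulus `P`, and the assumption that all elements are integers in the range [0, P) are hypothetical choices made for this example.

```python
import random

# Assumption: a fixed Mersenne prime as modulus; inputs are integers in [0, P).
P = (1 << 61) - 1


def multiset_fingerprint(values, r):
    """Evaluate prod (r - x) mod P over all x in values.

    Two different multisets of n elements yield different polynomials of
    degree n, which agree on at most n of the P possible points r.
    """
    fp = 1
    for x in values:
        fp = fp * ((r - x) % P) % P
    return fp


def check_sorted(input_parts, output_parts, r=None):
    """Probabilistically check that output_parts is a sorted permutation
    of input_parts. Each inner list stands for one worker's partition.

    In a distributed setting, only one fingerprint and the first and last
    element per partition would be communicated (simulated here locally).
    """
    if r is None:
        r = random.randrange(P)  # shared random evaluation point

    fp_in = 1
    for part in input_parts:
        fp_in = fp_in * multiset_fingerprint(part, r) % P

    fp_out = 1
    prev_last = None
    for part in output_parts:
        fp_out = fp_out * multiset_fingerprint(part, r) % P
        # Local check: the partition itself must be non-decreasing.
        if any(a > b for a, b in zip(part, part[1:])):
            return False
        # Boundary check between consecutive non-empty partitions.
        if part:
            if prev_last is not None and prev_last > part[0]:
                return False
            prev_last = part[-1]

    # Equal fingerprints: same multiset, up to a small error probability.
    return fp_in == fp_out
```

A correct output is always accepted. An output of n elements that is not a permutation of the input is wrongly accepted with probability at most n / P over the random choice of `r`, which is the kind of one-sided error typical of such checkers.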




A Survey on Geographically Distributed Big-Data Processing using MapReduce

Hadoop and Spark are widely used distributed processing frameworks for l...

An Abstract View of Big Data Processing Programs

This paper proposes a model for specifying data flow based parallel data...

Understanding the Challenges and Assisting Developers with Developing Spark Applications

To process data more efficiently, big data frameworks provide data abstr...

Distributed Dependency Discovery

We analyze the problem of discovering dependencies from distributed big ...

Similarità per la ricerca del dominio di una frase

English. This document aims to study the best algorithms to verify the b...

Benchmarking and Performance Modelling of MapReduce Communication Pattern

Understanding and predicting the performance of big data applications ru...

Addestramento con Dataset Sbilanciati

English. The following document pursues the objective of comparing some ...
