Approximation with Error Bounds in Spark

by Guangyan Hu, et al.

We introduce a sampling framework to support approximate computing with estimated error bounds in Spark. Our framework allows sampling to be performed at the beginning of a sequence of multiple transformations ending in an aggregation operation. The framework constructs a data provenance graph as the computation proceeds, then combines the graph with multi-stage sampling and population estimation theories to compute error bounds for the aggregation. When information about output keys is available early, the framework can also use adaptive stratified reservoir sampling to avoid (or reduce) key losses in the final output and to achieve more consistent error bounds across popular and rare keys. Finally, the framework includes an algorithm that dynamically chooses sampling rates to meet user-specified constraints on the CDF of error bounds in the outputs. We have implemented a prototype of our framework, called ApproxSpark, and used it to implement five approximate applications from different domains. Evaluation results show that ApproxSpark can (a) significantly reduce execution time if users can tolerate small amounts of uncertainty and, in many cases, the loss of rare keys, and (b) automatically find sampling rates that meet user-specified constraints on error bounds. We also extensively explore and discuss the trade-offs among sampling rates, execution time, accuracy, and key loss.
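
To make the idea of an aggregation with an estimated error bound concrete, here is a minimal sketch in plain Python. It is not the ApproxSpark API, and it does not implement the multi-stage estimator or the provenance graph described above. It assumes a single-stage simple random sample drawn without replacement and applies the textbook population-total estimator with a finite population correction. The function name `estimate_sum` and its parameters are hypothetical, chosen only for illustration.

```python
import math
import random


def estimate_sum(population, rate, seed=0, z=1.96):
    """Estimate sum(population) from a simple random sample.

    Returns (estimated_total, margin), where margin is the half-width of an
    approximate confidence interval (z=1.96 is roughly 95% under a normal
    approximation). Illustrative only: single-stage sampling, not the
    multi-stage estimator from the paper.
    """
    N = len(population)
    n = min(N, max(2, round(N * rate)))  # sample size, at least 2
    if n < 2:
        raise ValueError("need at least two items to estimate variance")

    rng = random.Random(seed)
    sample = rng.sample(population, n)  # sampling without replacement

    mean = sum(sample) / n
    # Unbiased sample variance.
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)

    # Scale the sample mean up to the population size.
    total = N * mean
    # Standard error of the total, with finite population correction (1 - n/N).
    std_err = N * math.sqrt((1 - n / N) * var / n)
    return total, z * std_err


if __name__ == "__main__":
    data = list(range(1, 1001))
    for rate in (0.1, 0.5, 1.0):
        total, margin = estimate_sum(data, rate)
        print(f"rate={rate}: estimate={total:.0f} +/- {margin:.0f}")
```

Raising `rate` shrinks the margin at the cost of reading more data, which is the trade-off between sampling rate, execution time, and accuracy that the abstract refers to. At `rate=1.0` the correction factor is zero, so the estimate is exact and the margin vanishes.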


EntropyDB: A Probabilistic Approach to Approximate Query Processing

We present EntropyDB, an interactive data exploration system that uses a...

Approximate Distributed Joins in Apache Spark

The join operation is a fundamental building block of parallel data proc...

Approximate Edge Analytics for the IoT Ecosystem

IoT-enabled devices continue to generate a massive amount of data. Trans...

The Adaptive sampling revisited

The problem of estimating the number n of distinct keys of a large colle...

Tight bounds for popping algorithms

We sharpen run-time analysis for algorithms under the partial rejection ...

A Novel Approach for Fast and Accurate Mean Error Distance Computation in Approximate Adders

In error-tolerant applications, approximate adders have been exploited e...

Bayesian Design of Sampling Set for Bandlimited Graph Signals

The design of sampling set (DoS) for bandlimited graph signals (GS) has ...