Approximate Distributed Joins in Apache Spark

05/15/2018
by Do Le Quoc, et al.

The join operation is a fundamental building block of parallel data processing. Unfortunately, computing an equi-join across massive datasets is very resource-intensive. The approximate computing paradigm lets users trade accuracy for lower latency in expensive data processing operations, which makes the equi-join operator a natural candidate for optimization using approximation techniques. Although sampling-based approaches are widely used for approximation, sampling over joins is compelling but challenging with respect to output quality: naive approaches, which perform joins over samples of the input datasets, do not preserve the statistical properties of the join output. To realize this potential, we interweave Bloom filter sketching and stratified sampling with the join computation in a new operator, ApproxJoin, that preserves the statistical properties of the join output. ApproxJoin uses a Bloom filter to avoid shuffling non-joinable data items across the network and then applies stratified sampling to obtain a representative sample of the join output. Our analysis shows that ApproxJoin scales well and significantly reduces data movement without sacrificing tight error bounds on the accuracy of the final results. We implemented ApproxJoin in Apache Spark and evaluated it using microbenchmarks and real-world case studies. The evaluation shows that ApproxJoin achieves a speedup of 6-9x over unmodified Spark-based joins at the same sampling rate. The speedup is accompanied by a significant reduction in shuffled data volume, which is 5-82x smaller than with unmodified Spark-based joins.
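The two stages described in the abstract (filter out non-joinable items with a Bloom filter, then take a stratified sample of the join output) can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: the `BloomFilter` class, the helper names `build_filter` and `approx_join_sum`, the filter sizes, the choice of one stratum per join key, and sampling pairs with replacement are all assumptions made here for illustration.

```python
import hashlib
import random
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter over a fixed-size bit list (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND never drops a key present in both inputs,
        # though it may still admit some false positives.
        out = BloomFilter(self.num_bits, self.num_hashes)
        out.bits = [a and b for a, b in zip(self.bits, other.bits)]
        return out


def build_filter(rows):
    """Build a Bloom filter over the join keys of (key, value) rows."""
    bf = BloomFilter()
    for key, _ in rows:
        bf.add(key)
    return bf


def approx_join_sum(left, right, fraction, seed=0):
    """Estimate SUM(l_val + r_val) over the equi-join of left and right."""
    rng = random.Random(seed)
    join_filter = build_filter(left).intersect(build_filter(right))

    # Stage 1: drop rows whose key cannot join, before any data movement.
    left_kept = [row for row in left if row[0] in join_filter]
    right_kept = [row for row in right if row[0] in join_filter]

    # Group values by join key; each key is treated as one stratum.
    l_groups, r_groups = defaultdict(list), defaultdict(list)
    for key, value in left_kept:
        l_groups[key].append(value)
    for key, value in right_kept:
        r_groups[key].append(value)

    # Stage 2: sample joined pairs inside each stratum and scale back up.
    estimate = 0.0
    for key in l_groups.keys() & r_groups.keys():
        stratum_size = len(l_groups[key]) * len(r_groups[key])
        n = max(1, round(fraction * stratum_size))
        sample = [
            rng.choice(l_groups[key]) + rng.choice(r_groups[key])
            for _ in range(n)
        ]
        estimate += stratum_size * sum(sample) / n
    return estimate, len(left_kept), len(right_kept)


left = [("a", 1), ("a", 1), ("b", 2), ("c", 5)]
right = [("a", 10), ("b", 20), ("b", 20), ("d", 7)]
estimate, kept_left, kept_right = approx_join_sum(left, right, fraction=0.5)
print(estimate, kept_left, kept_right)
```

In this toy input the keys `c` and `d` have no join partner, so they never contribute to the estimate, and unless the filter reports a false positive they are dropped before grouping. Sampling per key rather than over the whole input is what keeps every join key represented in the sample, which is the property the abstract says naive input sampling loses.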

research
05/15/2018

Approximate Edge Analytics for the IoT Ecosystem

IoT-enabled devices continue to generate a massive amount of data. Trans...
research
07/28/2023

Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries

This paper presents predicate transfer, a novel method that optimizes jo...
research
12/05/2018

Approximation with Error Bounds in Spark

We introduce a sampling framework to support approximate computing with ...
research
10/23/2020

The Case for Distance-Bounded Spatial Approximations

Spatial approximations have been traditionally used in spatial databases...
research
12/07/2019

Joins on Samples: A Theoretical Guide for Practitioners

Despite decades of research on approximate query processing (AQP), our u...
research
12/18/2017

Error-Tolerant Big Data Processing

Real-world data contains various kinds of errors. Before analyzing data,...
research
01/07/2022

Weighted Random Sampling over Joins

Joining records with all other records that meet a linkage condition can...
