Weighted Random Sampling over Joins

01/07/2022
by Michael Shekelyan, et al.

Joining records with all other records that meet a linkage condition can result in an astronomically large number of combinations due to many-to-many relationships. For such challenging (acyclic) joins, a random sample over the join result is a practical alternative to working with the oversized join result. Whereas prior works are limited to uniform join sampling where each join row is assigned the same probability, the scope is extended in this work to weighted sampling to support emerging applications such as scientific discovery in observational data and privacy-preserving query answering. Notwithstanding some naive methods, this work presents the first approach for weighted random sampling from join results. Due to a lack of baselines, experiments over various join types and real-world data sets are conducted to show substantial memory savings and competitive performance with main-memory index-based approaches in the equal-probability setting. In contrast to existing uniform sampling approaches that require prepared structures that occupy contested resources to squeeze out slightly faster query-times, the proposed approaches exhibit qualities that are urgently needed in practice, namely reduced memory footprint, streaming operation, support for selections, outer joins, semi joins and anti joins and unequal-probability sampling. All pertinent code and data can be found at: https://github.com/shekelyan/weightedjoinsampling


Stream Join Sampler

Introduction

The basic idea of join sampling approaches is to extend a tuple table-by-table. For a uniform sample, each extension must be drawn with probability proportional to the size of the (sub)join of the tuple with the downstream tables [DBLP:conf/sigmod/ChaudhuriMN99, DBLP:conf/sigmod/ZhaoC0HY18]. This insight carries over to weighted join sampling, where the probability must be proportional to the total weight of the subjoin. In [DBLP:conf/sigmod/ZhaoC0HY18] a tuple-oriented approach is taken to satisfy this constraint: tuples are individually extended table-by-table, which relies on random access and index structures. Note that, despite the use of indices, [DBLP:conf/sigmod/ZhaoC0HY18] still requires time linear in the size of the tables to collect the full sample. Hence, this work instead proposes a table-oriented approach, i.e., tables are scanned sequentially one-by-one. Intuitively, all extensions by a single table are processed in one go, akin to breadth-first search, rather than extending each tuple by all tables, akin to depth-first search. The foundation of such an approach is a proposed multipartite graph formulation that adds support for weighted sampling, non-equi joins, outer joins, selections, semi-joins and anti-joins (none of which are supported in prior work [shanghooshabad2021pgmjoins, DBLP:conf/sigmod/ChaudhuriMN99, olken1993random, DBLP:conf/sigmod/ZhaoC0HY18]).

Notation. Throughout this section, table names match their columns, e.g., a table with columns A and B is called AB. If two tables share a column name and are joined together, the join condition enforces equality across all columns with the same name, as in a natural join. For attribute values of table rows, lower-case letters are used instead, i.e., lower-case letters denote various independent values of the column with the corresponding upper-case name.
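To make the guiding principle concrete, the following minimal Python sketch (not from the paper; two hypothetical tables AB and BC with a made-up weight column "w") extends a sampled tuple by one table, drawing each continuation with probability proportional to the weight of its downstream subjoin:

    import random

    # Hypothetical tables AB and BC; "w" is an assumed weight column.
    AB = [{"a": 1, "b": 10, "w": 2.0}, {"a": 2, "b": 10, "w": 1.0}, {"a": 3, "b": 20, "w": 4.0}]
    BC = [{"b": 10, "c": 7, "w": 3.0}, {"b": 10, "c": 8, "w": 1.0}, {"b": 20, "c": 9, "w": 5.0}]

    def downstream_weight(t, table):
        # Total weight of the subjoin of tuple t with the downstream table.
        return sum(s["w"] for s in table if s["b"] == t["b"])

    # Pick an AB row with probability proportional to w(r) times its downstream weight,
    # then extend it by a matching BC row proportional to that row's weight.
    r = random.choices(AB, weights=[t["w"] * downstream_weight(t, BC) for t in AB], k=1)[0]
    matches = [x for x in BC if x["b"] == r["b"]]
    s = random.choices(matches, weights=[x["w"] for x in matches], k=1)[0]
    print(r, s)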

Multipartite Graph Formulation

[Figure omitted: graph formulation for a two-way join, drawn with join nodes, row nodes and null nodes. Left panel: each join node is the root of a sub-tree (read left to right). Right panel: the weights of each sub-tree can be transferred into the join node.]
Figure: Graph formulation for two-way join
[Figure omitted: graph formulation for the running example (tables FA, AB, BC, CD), shown in four panels: the initial graph, after 1 iteration, after 2 iterations, and completed.]
Figure: The algorithm's iterations preserve the weight of result trees, but simplify the graph by removing partitions

Graph formulation. In order to reason more easily not just about inner joins but also about outer joins, it is useful to formulate the join operation through a graph: rows and join attribute values form a multipartite graph with one partition of nodes per table and per join column. An example of such a multipartite graph for a two-way join is shown in the first figure above. The nodes of the graph are:

  • Row nodes: Each table contributes a set of nodes, one node for each of its rows, e.g., one node per row of table AB.

  • Null nodes: A "NULL" value for a table is treated as an additional row of that table, represented by its own null node.

  • Row node labels: Each row node is labelled with its row weight; null nodes carry a weight as well.

  • Join nodes: Each join column contributes a set of nodes, one node for each attribute value, e.g., one join node per distinct value of column B.

  • Join node labels: Each join node is initially labelled with the value 1.

As join operators are not necessarily symmetric, there is a "left" and a "right" side. The edges of the graph potentially connect join nodes with row nodes that have a matching join attribute value; whether they do depends on the join operator:

  • Each join node of an inner or outer join is connected to each matching row node.

  • Each join node of a right outer or full outer join that is not connected to any row node on the left side is connected to the left null node.

  • Each join node of a left outer or full outer join that is not connected to any row node on the right side is connected to the right null node.

  • Each join node of a semijoin is assigned the label 0 if there is no match on the right side.

  • Each join node of an antijoin is assigned the label 0 if there is a match on the right side.

This simply means that outer joins allow null values on the side where a match may be missing: left outer joins pad the right side with nulls when a left row has no match, right outer joins pad the left side, and full outer joins pad either side. Semijoins and antijoins act as filters that change the weight of a join node depending on whether there is a match on the right side. The main motivation for this formulation is that, in such a graph, the trees spanning all partitions correspond to join rows:

  • Result trees: Each tree that contains exactly one node from each reachable partition corresponds to a join row (right side partitions of semijoins or antijoins are unreachable).

  • Trivial rule: A result tree cannot contain join nodes that are only connected to null nodes.

The weighted sampling problem can then be posed via the multipartite graph formulation:

Weighted Sampling over Joins. Select a random result tree (corresponding to a join row), comprised of nodes together with their labels, with probability proportional to the tree's total weight, i.e., the product of its node labels.

As there are as many result trees as join rows, some method is needed to reduce the complexity noticeably below the number of join rows. As each tree is sampled with probability proportional to its weight, any group of trees is sampled with probability proportional to the sum of its weights. Thus, one can sample the result tree step by step rather than in one go, which, as shown in the following, allows for sub-linear complexity.

Efficient solution to the graph problem. The basic idea is to group all result trees that share the same root and then sample such a root-group. The challenge left to solve is how to compute the sum of weights of each root-group. The right side of the first figure above shows the most crucial primitive for this purpose: the operation sums the weights of all sub-trees of each join node and adds the sum as a new label of the join node. Such an operation can be implemented through a linear-time sequential scan of table BC that builds a hash table over the B-values contained in BC, where each entry holds the sum of the observed weights for that B-value. In SQL terms, one can think of this operation as something along the lines of SELECT SUM(W) FROM BC GROUP BY B, where W is presumed to be a column holding the weights of BC. As a second step, one can scan AB, which holds the root nodes. For each row of AB, the sum of sub-tree weights is looked up in the hash table entry of its B-value; the total weight of all groups that contain the row is then the row's own weight multiplied by the looked-up sum. If the scanned table is not a leaf of the join tree, one simply continues this procedure recursively until the leaf nodes are reached. Once a group is sampled with probability proportional to its total weight, one can continue in a similar fashion with the children nodes (adjacent nodes in the graph facing away from the main table) until the full tree is obtained. Semi-joins, anti-joins and selections can be supported through weights, which is detailed later.

Group weights algorithm. In outline, the algorithm operates on the multipartite graph corresponding to the join query, where each node carries a label, and on the row node partition of the join query's main table. For every other row node partition, its parent is the neighbouring join node partition closer to the main table and its children are the neighbouring join node partitions further away from it. As long as row node partitions other than the main table's remain, the algorithm repeats the following steps:

  • Pick a row node partition that is furthest away from the main table (a leaf of the remaining join tree).

  • For each of its row nodes, in a single stream pass, multiply the row's weight by the labels of the connected join nodes in the children partitions and add the result to the label of the connected join node in the parent partition.

  • Remove the processed row node partition and its children join node partitions from the graph.

The second figure above applies the algorithm to the running example. The main table serves as the root, such that parent partitions lie closer to it and children partitions further away, while leaf tables are on the outskirts furthest away from the main table. The algorithm processes a new table in each iteration: it makes a single stream pass over the rows of the new table, computes the total weight of the sub-trees rooted at each row, and adds this sub-tree weight, multiplied by the row's own weight, to the parent join node. After the new table has been processed, it is removed from further consideration, and the algorithm terminates when no more tables are left for consideration. After termination, each row node of the main table is linked to multiple join nodes, and the product of the join node labels multiplied by the row's own weight yields the total weight of all result trees / join rows containing that row.

The algorithm can be implemented with a hash table for each join node partition that allows an efficient lookup of the linked join node for a row node based on the join attribute value. Note that if all join nodes except a few have the same label, one only keeps entries for the exceptions and maintains a default value for the rest. The hash table entries for a table can be computed in one scan that skips any rows that do not satisfy the selection predicates. In the case of a semi-join, the default value is 0 and entries are kept only for join values that appear in the semi-join result. While an anti-join result can be as large as a table, for sampling it can be supported via the corresponding semi-join: for anti-joins the default value is 1 and entries (with value 0) are kept only for join values that appear in the corresponding semi-join.

Theta/non-equi joins can also be supported easily. If the link condition is inequality, then in addition to the hash table one maintains the total weight of all hash-table entries; the equi-join weight of the queried value can then simply be subtracted from the total weight to obtain the inequality-join weight. For theta joins whose link condition is an order comparison, one can first obtain the equi-join hash table and then replace its values with cumulative sums, so that the entry for each join attribute value holds the sum of weights of all equi-join entries that satisfy the comparison. Additionally, a binary search tree is constructed to efficiently find, for a queried value from the joining table, the last stored value that satisfies the comparison; the queried value is then rewritten as that last hash-table value. It is also possible to obtain some limited support for GROUP BYs, as they can be performed after the sample is collected; to obtain a better coverage of the groups, the weights should ideally be chosen inversely proportional to the group sizes.
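As an illustration of the hash-table implementation described above, the following minimal Python sketch (not from the paper; table contents and the weight column "w" are made-up assumptions) computes the group weights for a chain join FA, AB, BC with AB as the main table:

    from collections import defaultdict

    # Hypothetical chain join FA - AB - BC with AB as the main table.
    FA = [{"f": 1, "a": 1, "w": 1.0}, {"f": 2, "a": 1, "w": 2.0}, {"f": 3, "a": 2, "w": 1.0}]
    AB = [{"a": 1, "b": 10, "w": 1.0}, {"a": 2, "b": 10, "w": 3.0}, {"a": 2, "b": 20, "w": 1.0}]
    BC = [{"b": 10, "c": 5, "w": 2.0}, {"b": 10, "c": 6, "w": 1.0}, {"b": 20, "c": 7, "w": 4.0}]

    def join_node_labels(table, key):
        # One sequential scan: sum sub-tree weights per join value,
        # roughly SELECT SUM(W) FROM table GROUP BY key.
        labels = defaultdict(float)
        for row in table:
            labels[row[key]] += row["w"]   # leaf table: sub-tree weight is the row weight
        return labels

    # Transfer the sub-tree weights of the leaf tables into their join nodes.
    a_labels = join_node_labels(FA, "a")   # join column A (between FA and AB)
    b_labels = join_node_labels(BC, "b")   # join column B (between AB and BC)

    # Group weight of a main-table row = its own weight times the labels of the
    # join nodes it is linked to (0 if a join value has no match).
    group_weights = [r["w"] * a_labels.get(r["a"], 0.0) * b_labels.get(r["b"], 0.0) for r in AB]
    print(group_weights)   # total weight of all join rows containing each AB row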

Multistage Multinomial Sampling

In the previous section, weighted join sampling was mapped to the problem of sampling result trees with probabilities proportional to their weight (in simple cases the trees are paths). In a typical implementation, each join node partition is implemented as a hash map, such that by going through each row of the main table one can look up the matching entries in the hash maps and compute the product of the looked-up values and the row's own weight to obtain the total weight of all result trees that contain that row. After executing the group-weights algorithm, which brings the graph into the state shown in the figure above, the total weight of each main-table row can therefore be computed using at most one hash-table look-up per table. As the task is to sample with probabilities proportional to these total weights, one can in a first stage perform a stream pass over the main table and collect a sample using the proposed online multinomial sampler from the earlier section. After collecting the main table sample, this yields the sampled groups of result trees grouped by main-table row, and what remains is to sample within the groups. In the graph formulation, each row in the main table sample is on the "left" side, linked to a join node in the "middle" that links to multiple row nodes on the "right" side. The task is then to sample, for each sampled main-table row, a row node on the "right" side with probability proportional to the weight of the rest of the result tree. Conveniently, the total weights of the sub-trees have been computed previously and are readily available in the hash maps of the join nodes. As the total weight of the right side is known, one can use inversion sampling here, i.e., draw a random number between zero and that total weight and go through the right-side table until the cumulative weight of the observed rows reaches the drawn number, then pick the row at which it is reached. This allows all sample continuations of the main table sample to be collected in one stream pass. Thus, in each stage the rows in the sample are extended by a row of another table until all tables that participate in the join have been reached. Thanks to the online multinomial sampler, the main table is only scanned once, while all other tables are scanned twice. In the general case, the second scans of the other tables cannot be avoided; hence, this approach is optimal with regard to the number of sequential scans.
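The second stage can be illustrated with the following sketch (again hypothetical data; the per-value totals stand in for the precomputed join node labels). Each sampled main-table row is extended by one BC row in a single sequential pass over BC via inversion sampling; a real implementation would bucket the sampled rows by join value instead of checking every sampled row per scanned row.

    import random

    # Hypothetical stage-two input: a stage-one sample of AB rows, table BC and
    # the per-B-value totals (join node labels) computed in the previous pass.
    BC = [{"b": 10, "c": 5, "w": 2.0}, {"b": 10, "c": 6, "w": 1.0}, {"b": 20, "c": 7, "w": 4.0}]
    b_labels = {10: 3.0, 20: 4.0}
    sampled_ab = [{"a": 1, "b": 10, "w": 1.0}, {"a": 2, "b": 20, "w": 3.0}]

    # Inversion sampling: one random threshold per sampled row, one pass over BC.
    thresholds = [random.uniform(0.0, b_labels[r["b"]]) for r in sampled_ab]
    running = {}                                   # cumulative weight seen per B-value
    extensions = [None] * len(sampled_ab)
    for s in BC:                                   # single sequential scan of BC
        running[s["b"]] = running.get(s["b"], 0.0) + s["w"]
        for i, r in enumerate(sampled_ab):         # (real implementations bucket by B-value)
            if extensions[i] is None and r["b"] == s["b"] and running[s["b"]] >= thresholds[i]:
                extensions[i] = s                  # threshold reached: s extends sample i
    print(list(zip(sampled_ab, extensions)))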

Handling Cyclic Joins

While the main focus of this paper is the more common acyclic joins, this section discusses the more exotic cyclic joins that occasionally receive academic interest. As in the literature, cyclic joins are treated as a selection predicate over the join rows of an acyclic join. For a join graph where each node is a table and each edge is a join condition, a join query is called cyclic if the join graph features at least one cyclic path that passes through at least two edges and three nodes.

Rewriting as a selection over an acyclic join query. Any cyclic join query can be rewritten into a selection over an acyclic join. If a join query has multiple cycles, each cycle adds one equality predicate to the selection operator. Note that a natural-join formalisation is used here and join conditions are removed by renaming attributes. If the selection has high selectivity, i.e., selects most of the records, then one can first sample from the acyclic join and then drop any samples that do not satisfy the selection predicate. The acceptance rate is then in expectation equal to the selectivity. As the acceptance rate can vary wildly depending on which join conditions are outsourced as selections, this leads to an interesting problem of finding an optimal rewriting, i.e., one that maximises the selectivity. A simple heuristic for choosing which edge to remove from a cycle is to pick the join link that maximises the probability that two independently drawn random records from the two tables are linked. This is in principle very similar to the Chow-Liu algorithm [DBLP:journals/tit/ChowL68], which aims to break links where the mutual information is lowest.

The rewriting of a cyclic join query into a selection over an acyclic join proceeds as follows. One goes through each node (table) and follows each of its adjacent edges (join conditions) to a neighbouring node. For each adjacent edge one looks for the shortest path between its two end nodes that is not allowed to pass through the adjacent edge. If no such path can be found, the node does not participate in a cycle. If such a path can be found, the node participates in at least one cycle and one can pick an edge in that cycle to outsource as a selection predicate. After the removal of an edge, one checks for cycles again until the node no longer participates in a cycle. Such a procedure performs a number of shortest-path searches that grows with the number of edges and cycles; as join graphs are very small and have few cycles, the searches only take a couple of milliseconds to perform. A more challenging problem is how to pick which edge to remove: while one could estimate the selectivity for each candidate, that can be quite expensive and take more time than the subsequent sampling step.
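The rejection step itself is straightforward; the following sketch (hypothetical rows and predicate, not the paper's code) draws from the acyclic rewriting and drops join rows that violate the outsourced equality predicate, so the acceptance rate matches the selectivity in expectation:

    import random

    # Toy stand-ins: rows of the acyclic rewriting and a cycle-closing predicate.
    rows = [{"x": 1, "y": 1}, {"x": 1, "y": 2}, {"x": 2, "y": 2}]
    draw = lambda: random.choice(rows)             # stand-in for the acyclic join sampler

    def sample_cyclic(draw_acyclic, predicate, k):
        accepted = []
        while len(accepted) < k:                   # expected number of draws: k / selectivity
            row = draw_acyclic()
            if predicate(row):                     # outsourced cycle-closing predicate
                accepted.append(row)
        return accepted

    print(sample_cyclic(draw, lambda r: r["x"] == r["y"], 3))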

Economical Join Sampler

The previous sections presented the fundamentals of the stream sampler. It is now considered how one can obtain an economical sampler with a reduced memory footprint for various types of joins.

Exploiting Foreign Key Joins

Many-to-one relations allow one to first sample from the "many" table and then look up the "one" entity in the other table(s). If the weights are all equal, this reduces to sampling from one table and then joining the sample with the other tables. If the weights are not equal, one can either proceed as if they were equal and afterwards employ rejection sampling to rectify the inclusion probabilities (as sketched below), or one has to compute the group weights and treat the foreign-key join like a many-to-many join. The former is more memory-efficient, whereas the latter is reliably fast. The economical sampler picks the more memory-efficient variant.
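The rejection variant might look as follows; this is only a sketch that assumes the weight of a join row is the product of its row weights, with made-up table names and columns:

    import random

    # Hypothetical foreign-key join: the many-side MANY references the one-side ONE.
    MANY = [{"id": i, "fk": i % 2, "w": 1.0 + i} for i in range(6)]
    ONE = {0: {"key": 0, "w": 2.0}, 1: {"key": 1, "w": 0.5}}

    # Upper bound on the weight of any join row (assumed: product of row weights).
    w_max = max(m["w"] for m in MANY) * max(o["w"] for o in ONE.values())

    def fk_sample():
        # Sample as if weights were equal, then reject to rectify the probabilities.
        while True:
            m = random.choice(MANY)               # uniform draw from the "many" table
            o = ONE[m["fk"]]                      # look up the single matching "one" row
            if random.random() < (m["w"] * o["w"]) / w_max:
                return m, o

    print(fk_sample())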

Simplifying Cyclic Joins

While some join queries can look like challenging cyclic joins at first glance, it is sometimes possible to reduce them to much simpler, e.g., acyclic, queries. If a subgraph of the join graph is a foreign-key join, one can efficiently join the subgraph together, and the join result can only have as many rows as the largest of the joined tables. To automate and generalise this process without knowledge of foreign-key constraints, one can simply join together all tables for which the join result is at most slightly larger than the tables themselves. For tables with up to n rows, this simplification step requires O(n) expected time and space using hash joins, or O(n log n) worst-case time and O(n) space using sort-merge joins.
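A minimal sketch of such a pre-join using a hash join (hypothetical tables and columns, not the paper's code) could look as follows; the pre-joined result has at most as many rows as the probing table:

    # Build a hash table on the key side, then probe with the referencing side.
    ORDERS = [{"oid": 1, "cid": 10}, {"oid": 2, "cid": 10}, {"oid": 3, "cid": 11}]
    CUSTOMERS = [{"cid": 10, "name": "ann"}, {"cid": 11, "name": "bob"}]

    built = {c["cid"]: c for c in CUSTOMERS}                                      # expected O(n) build
    prejoined = [{**o, **built[o["cid"]]} for o in ORDERS if o["cid"] in built]   # O(n) probe
    print(prejoined)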

Novel Hashed Join Method: Handling High-Cardinality Join Attributes without Approximations

In case there are many distinct attribute values in the join columns of very large base tables, additional techniques are needed to reduce the memory footprint. A good way to think about join rows is as the subset of cross-product rows that satisfy the join conditions. By relaxing the equi-join condition to an equi-hash-join condition, one obtains a superset:

Equi-Hash Join. The equi-hash join along some join column between two tables is the subset of the cross product of their rows in which the hash values of the join column match under a shared hash function.

[Figure omitted: nested regions showing the equi join contained in the equi-hash join, which in turn is contained in the cross join; superfluous equi-hash-join elements are marked in red.]

Figure: Hierarchy of join results. Samples from any equi-hash join with superfluous elements purged (depicted in red) are ordinary samples from the equi join

Note that any result in the equi-join must also be in the equi-hash join, because if two attribute values are equal, so are their hashes (see also the figure above). Thus, equi-hash joins reduce the number of join attribute values in a controlled way (controlled by the chosen hash function) by allowing the sampler to operate over the smaller number of hash values. While the reduced domain leads to collisions, those collisions only produce some extraneous results that can be purged later on without any negative impact on precision. More specifically, the collisions cause additional rows from the cross product that violate the join conditions, but these are easy to identify during the post-processing of the sample. As the number of purged rows is not fully predictable, it can become necessary to generate more samples than expected; to avoid this where possible, one can choose a sufficiently large sample size up front. A common source of a large number of distinct values are join columns that are database keys. In that particular case the overhead can be predicted: for tables with unique values along the join columns, the hash-relaxed equi-join using a universal hash function over a universe of a given size is expected to have only a bounded number of superfluous results that are not present in the exact equi-join. Thus, as a heuristic, it is proposed to collect a correspondingly larger sample for the hash-relaxed superset if that does not exceed the memory limit, because the join size is expected to be at least as large as the tables. As the bound depends on the size of the hash universe, different universe sizes can be tried out to find the best choice within the formula before the sampling commences. When the memory limit does not allow the needed sample size, or the join has many distinct values in columns that are not keys, the hashing algorithm can be run multiple times with different random seeds to collect the targeted amount of samples.
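To illustrate the idea, the following sketch (hypothetical data; hash universe size and sample size chosen arbitrarily) maps join keys into a small hash universe, samples from the hash-relaxed join, and purges the superfluous rows whose original keys do not actually match:

    import random

    UNIVERSE = 8                                        # assumed (small) hash universe size
    h = lambda v: hash(v) % UNIVERSE                    # shared hash function

    R = [{"k": f"key{i}", "w": 1.0} for i in range(20)]
    S = [{"k": f"key{i % 10}", "v": i, "w": 1.0} for i in range(30)]

    # Join node labels over hash values instead of raw keys (bounded memory).
    labels = {}
    for s in S:
        labels[h(s["k"])] = labels.get(h(s["k"]), 0.0) + s["w"]

    # Sample R rows proportional to their hash-relaxed group weights, extend each
    # by a matching S row within the same hash bucket, then purge false matches.
    weights = [r["w"] * labels.get(h(r["k"]), 0.0) for r in R]
    sample = random.choices(range(len(R)), weights=weights, k=10)
    pairs = [(R[i], random.choice([s for s in S if h(s["k"]) == h(R[i]["k"])])) for i in sample]
    purged = [(r, s) for r, s in pairs if r["k"] == s["k"]]   # keep only true equi-join rows
    print(len(pairs), len(purged))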