Stream Join Sampler
Introduction
The basic idea of join sampling approaches is to extend a tuple table by table. For a uniform sample, the extension must be drawn with probability proportional to the size of the (sub)join of a tuple with the tables downstream [DBLP:conf/sigmod/ChaudhuriMN99, DBLP:conf/sigmod/ZhaoC0HY18]. This insight carries over to weighted join sampling, where the probability must be proportional to the total weight of the subjoin. In [DBLP:conf/sigmod/ZhaoC0HY18] a tuple-oriented approach is taken to satisfy this constraint, where tuples are individually extended table by table, which relies on random access and index structures. Note that despite the use of indices, [DBLP:conf/sigmod/ZhaoC0HY18] still requires time linear in the size of the tables to collect the full sample. Hence, this work instead proposes a table-oriented approach, i.e., tables are scanned sequentially one by one. Intuitively, this is achieved by processing all extensions by one single table in one go, akin to breadth-first search, rather than extending each tuple by all tables, akin to depth-first search. The foundation of such an approach is a proposed multipartite graph formulation that adds support for weighted sampling, non-equi joins, outer joins, selections, and semi- and anti-joins (none of which are supported in prior work [shanghooshabad2021pgmjoins, DBLP:conf/sigmod/ChaudhuriMN99, olken1993random, DBLP:conf/sigmod/ZhaoC0HY18]). Notation. Throughout this section, table names are used that match the columns, e.g., a table with columns A and B will be called AB. If two tables have the same column name and are joined together, the join conditions enforce equality across all columns with the same name, as in a natural join. For attribute values of table rows, the lower case b is used to denote various independent values of a column B.
Multipartite Graph Formulation
Graph formulation. In order to reason more easily not just about inner joins but also about outer joins, it is useful to formulate the join operation through a graph, where rows and join attribute values form a multipartite graph with one partition of nodes per table and per join column. An example of such a multipartite graph for a two-way join can be found in Figure Document. The nodes of the graph are:

Row nodes: Each table contributes a set of nodes, one node for each row.

Null nodes: “NULL” values for a table are represented as an additional row of the table.

Row node labels: Each row node is labelled with its row weight; this includes null nodes.

Join nodes: Each join column contributes a set of nodes, one node for each attribute value.

Join node labels: Each join node is initially labelled with the value 1.
As join operators are not necessarily symmetrical, there is a “left” and a “right” side. The edges of the graph connect join nodes with row nodes that have a matching join attribute value, but whether they do depends on the join operator:

Each join node of an inner or outer join (left, right, or full) is connected to each matching row node.

Each join node of a left or full outer join that is not connected to any row node on the left side is connected to the left null node.

Each join node of a right or full outer join that is not connected to any row node on the right side is connected to the right null node.

Each join node of a semi-join has label 0, and each join node of an anti-join has label 1, if there is no match on the right side.

Each join node of a semi-join has label 1, and each join node of an anti-join has label 0, if there is a match on the right side.
This simply means left outer joins allow null values on the left side and right outer joins on the right side, while full outer joins allow them on either side. Semi-joins and anti-joins are filters that change the weight of a join node depending on whether there is a match on the right side. The main motivation for this formulation is that in such a graph, the trees spanning all partitions correspond to join rows:

Result trees: Each tree that contains exactly one node from each reachable partition corresponds to a join row (right-side partitions of semi-joins or anti-joins are unreachable).

Trivial rule: A result tree cannot contain join nodes that are only connected to null nodes.
The weighted sampling problem can then be posed via the multipartite graph formulation: [Weighted Sampling over Joins] Select a random result tree (corresponding to a join row), comprised of labelled nodes, with probability proportional to the tree's total weight, i.e., the product of its node labels. As there are as many result trees as join rows, some method is needed to reduce the complexity noticeably below the number of join rows. As each tree is sampled with probability proportional to its weight, any group of trees will be sampled with probability proportional to its sum of weights. Thus, one can sample the result tree step by step rather than in one go, which will be shown in the following to allow for sublinear complexity.

Efficient solution to the graph problem. The basic idea is to group all result trees that have the same root and then sample such a root-group. The challenge left to solve is how to compute the sum of weights of each root-group. The right side of Figure Document shows the most crucial primitive for this purpose. The operation sums the weights of all subtrees of each join node and adds the sum as a new label of the join node. Such an operation can be implemented through a linear-time sequential scan of BC, generating a hash table of the values contained in column B where each entry holds the sum of observed weights for that value. In SQL terms, one can think of such an operation as something along the lines of SELECT SUM(W) FROM BC GROUP BY B, where W is presumed to be some column holding the weights of BC. As a second step, one can scan the table holding the root nodes. Now, for each row one can look up the sum of subtree weights in the hash table entry of the key b. The total weight of all groups that contain the row is then the row weight multiplied by this sum. If the row is not a leaf node, one simply continues this procedure recursively to compute the weights until the leaf nodes are reached.
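The summing primitive above can be sketched in a few lines of Python. The dictionary-based hash table, row encoding, and column names ("b", "w") are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def subtree_weights(rows, join_col, weight_col):
    # One linear scan: sum row weights per join attribute value,
    # akin to SELECT SUM(W) FROM BC GROUP BY B.
    sums = defaultdict(float)
    for row in rows:
        sums[row[join_col]] += row[weight_col]
    return sums

# Hypothetical toy table BC with join column "b" and weight column "w".
bc = [{"b": 1, "w": 2.0}, {"b": 1, "w": 3.0}, {"b": 2, "w": 1.0}]
sums = subtree_weights(bc, "b", "w")

# Root-group weight for each row of AB: own weight times subtree sum.
ab = [{"a": 10, "b": 1, "w": 1.0}, {"a": 11, "b": 2, "w": 4.0}]
group_weights = [r["w"] * sums.get(r["b"], 0.0) for r in ab]
```

A second scan over AB then only needs one hash lookup per row, so the cost stays linear in the table sizes rather than in the join size.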
Once a group with a given root node is sampled with probability proportional to its total weight, one can continue in a similar fashion with the children nodes (adjacent nodes in the graph facing away from the main table), until the full tree is obtained. Semi-joins, anti-joins and selections can be supported through weights, which is detailed later.

Let G be a multipartite graph corresponding to the join query and let l(v) be a node v's label. Let R be the row node partition of the join query's main table. For a row node partition T, the neighbouring join node partitions in G closer to R than T are its parents, the neighbouring join node partitions further away from R than T are its children, and the row node partitions furthest away from R, excluding R, are the leaves. The algorithm repeatedly picks a leaf row node partition T, makes one pass over its row nodes, multiplies each row node's label by the labels of its children join nodes, adds the resulting product to the label of its parent join node, and then removes T together with its children join node partitions from G, terminating when only R is left.

Algorithm. The procedure above describes how the group weights can be obtained for general multiway joins; Figure Document applies the algorithm to the running example. One table serves as the main table, such that the parent partitions move closer to it and the children partitions move further away from it, while leaves are on the outskirts furthest away from it. The algorithm processes a new table in each iteration. It makes a single stream pass over the rows of the new table, computes the total weight of the subtrees rooted at each row, and adds this subtree weight, multiplied by the row's own weight, to the parent join node. After the new table has been processed, it is removed from further consideration, and the algorithm terminates when no more tables are left for consideration.
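For a chain of joins, the leaf-to-root weight propagation can be sketched as below. The table/column encoding, the weight column "w", and the leaf-first ordering of the joins list are assumptions made for illustration:

```python
from collections import defaultdict

def propagate(tables, joins, main):
    # tables: name -> list of row dicts, each with its weight under "w".
    # joins: (child, parent, col) triples ordered leaf-first, so every
    # child table is processed before the table it feeds into.
    # Returns, per row of the main table, the total weight of all
    # result trees containing that row.
    labels = {}  # (table, col) -> summed subtree weights pushed onto it
    for child, parent, col in joins:
        sums = defaultdict(float)
        for row in tables[child]:  # single stream pass over the table
            w = row["w"]
            for (t, c), tbl in labels.items():  # children join nodes
                if t == child:
                    w *= tbl.get(row[c], 0.0)
            sums[row[col]] += w  # push product to the parent join node
        labels[(parent, col)] = sums
    totals = []
    for row in tables[main]:
        w = row["w"]
        for (t, c), tbl in labels.items():
            if t == main:
                w *= tbl.get(row[c], 0.0)
        totals.append(w)
    return totals

# Chain AB, BC, CD joined on B and C, with AB as main table.
tables = {
    "AB": [{"a": 1, "b": 1, "w": 2.0}],
    "BC": [{"b": 1, "c": 1, "w": 1.0}],
    "CD": [{"c": 1, "d": 1, "w": 2.0}, {"c": 1, "d": 2, "w": 3.0}],
}
totals = propagate(tables, [("CD", "BC", "c"), ("BC", "AB", "b")], "AB")
```

Each table is scanned exactly once here, matching the single stream pass per iteration described above.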
After termination, each row node of the main table is linked to multiple join nodes, and the product of the join node labels multiplied by the row's own weight yields the total weight of all result trees / join rows containing that row. The algorithm can be implemented with a hash table for each join node partition that allows an efficient lookup of the linked join node for a row node based on the join attribute value. Note that if all join nodes except a few have the same label, one only keeps entries for the exceptions and maintains a default value for the rest. The hash table entries for a table can be computed in one scan that skips any rows that do not satisfy the selection predicates. In the case of a semi-join, the default value is 0 and entries exist only for results of the semi-join. While an anti-join can be as large as a table, for sampling it can be supported via the semi-join: for anti-joins the default value is 1 and entries of 0 exist only for results of the semi-join. Theta/non-equi joins can also be supported easily. If the link condition is inequality, then in addition to the hash table one needs to maintain the total weight of all hash table entries. Then the weight for the equi-join can simply be subtracted from the total weight to obtain the join weight. For theta joins with a binary comparison operator as the link condition, one can first obtain the equi-join hash table and then replace the values with cumulative sums. This means that the entry for each join attribute value holds the sum of weights of every equi-join entry that satisfies the comparison. Additionally, a binary search tree needs to be constructed to efficiently find the last attribute value that satisfies the comparison with the queried value from a joining table. Then the queried value can be rewritten as this last hash table value. It is also possible to obtain some limited support for GROUP BYs, as they can be performed after the sample is collected. In order to obtain a better coverage of groups, the weights should ideally be chosen inversely proportional to the size of the groups.
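The cumulative hash-table idea for a less-than link condition can be sketched as follows; a sorted key list with binary search stands in for the binary search tree, and all names are illustrative:

```python
import bisect

def cumulative_table(eq_weights):
    # Turn an equi-join weight table {value: weight} into sorted keys
    # plus running (cumulative) weight sums.
    keys = sorted(eq_weights)
    cum, total = [], 0.0
    for k in keys:
        total += eq_weights[k]
        cum.append(total)
    return keys, cum

def weight_below(keys, cum, query):
    # Total weight of all entries with value < query: one binary
    # search finds the last key below the queried value.
    i = bisect.bisect_left(keys, query)
    return cum[i - 1] if i > 0 else 0.0

keys, cum = cumulative_table({1: 2.0, 3: 1.0, 5: 4.0})
```

With the cumulative sums in place, each theta-join weight lookup costs a single O(log n) search instead of a scan over all matching values.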
Multi-Stage Multinomial Sampling
In the previous section, weighted join sampling has been mapped to the problem of sampling result trees with probabilities proportional to their weight (in simple cases they are paths). In a usual implementation, each join node partition is implemented as a hash map, such that by going through each row of the main table one can look up matching entries in the hash maps and compute the product of the looked-up values to obtain the total weight of all result trees that contain the row. After executing the algorithm, which brings the graph into the state of Figure Document, the total weight of each row can therefore be computed using at most one hash table lookup per table. As the task is to sample with probabilities proportional to these weights, one can in a first stage perform a stream pass over the main table and collect a sample using the proposed online multinomial sampler from Section LABEL:sec:general. After collecting the main table sample, this yields the sampled groups of result trees grouped by the main table row, and what remains is sampling within the groups. In the graph formulation (see Figure Document), each row in the main table sample is on the “left” side, linked to a join node in the “middle” that links to multiple row nodes on the “right” side. The task is then to sample, for each sampled main table row, a row node on the “right” side with probabilities proportional to the rest of the result tree weight. Luckily, the total weights of the subtrees have been previously computed and are readily available in the hash maps for the join nodes. As the total weight of the right side is known, one can use inversion sampling here (see Figure LABEL:fig:inverse), i.e., draw a random number between 0 and the total weight and go through the right-side table until the cumulative weight of observed rows exceeds the drawn number, then pick the row at which this threshold is crossed. This allows collecting all sample continuations of the main table sample in one stream pass.
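The inversion-sampling step can be sketched as follows, assuming the group's total weight is already known from the hash maps; function and parameter names are illustrative:

```python
import random

def inversion_sample(rows, weight_of, total_weight, rng):
    # Draw u in [0, total_weight) and scan sequentially until the
    # cumulative weight of observed rows exceeds u; the row crossing
    # the threshold is the sample, chosen with probability w/total.
    u = rng.uniform(0.0, total_weight)
    seen = 0.0
    for row in rows:
        seen += weight_of(row)
        if seen > u:
            return row
    return row  # guard against floating-point rounding at the boundary

rng = random.Random(42)
rows = [("a", 1.0), ("b", 0.0), ("c", 3.0)]
counts = {"a": 0, "b": 0, "c": 0}
for _ in range(1000):
    counts[inversion_sample(rows, lambda r: r[1], 4.0, rng)[0]] += 1
```

Note that zero-weight rows can never cross the threshold, so they are sampled with probability zero, as required.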
Thus, in each stage the rows in the sample are extended by a row of another table until all tables that participate in the join have been reached. Thanks to the online multinomial sampler from Section LABEL:sec:general, the main table is only scanned once, while all other tables are scanned twice. In the general case, the second scans of the other tables cannot be avoided. Hence, this approach is optimal with regard to the number of sequential scans.
Handling Cyclic Joins
While the main focus of this paper is the more common acyclic joins, this section discusses the more exotic cyclic joins that occasionally receive academic interest. As in the literature, cyclic joins are treated as a selection predicate over the join rows of an acyclic join. For a join graph where each node is a table and each edge is a join condition, a join query is called cyclic if the join graph features at least one cyclic path that passes through at least two edges and three nodes. Rewriting as Selection over Acyclic Join Query. Any cyclic join query can be rewritten into a selection over an acyclic join. If a join query has multiple cycles, each cycle adds one equality predicate to the selection operator. Note that here a natural join formalisation is used and join conditions are removed by renaming attributes. If the selection has high selectivity, i.e., selects most of the records, then one can first sample from the acyclic join and then drop any samples that do not satisfy the selection predicate. The acceptance rate is then in expectation equal to the selectivity.
As the acceptance rate can vary wildly depending on which join conditions are outsourced as selections, this leads to an interesting problem of finding an optimal rewriting, i.e., one that maximises the selectivity. A simple heuristic for selecting which edge to remove from the cycle is to pick the join link
that maximises the probability that two independently drawn random records from both tables are linked. This is in principle very similar to the Chow-Liu algorithm [DBLP:journals/tit/ChowL68], which aims to break links where the mutual information is lowest. The rewriting of a cyclic join query into a selection over an acyclic join is performed as follows. One goes through each node (table) and follows each of its adjacent edges (join conditions) to a neighbouring node. For each adjacent edge one looks for the shortest path between its two end nodes that is not allowed to pass through the adjacent edge itself. If no such path can be found, the node does not participate in a cycle. If such a path can be found, the node participates in at least one cycle and one can pick an edge in that cycle to outsource as a selection predicate. After the removal of an edge, one checks for cycles again until the node no longer participates in a cycle. For a join graph, such a procedure performs at most one shortest path search per edge plus additional searches proportional to the number of cycles. As join graphs are very small and contain few cycles, the searches only take a couple of milliseconds to perform. A more challenging problem is how to pick which edge to remove. While one could estimate the selectivity for each candidate, that can be quite expensive and take more time than the subsequent sampling step.
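The cycle test described above (a shortest path between an edge's endpoints that avoids the edge itself) can be sketched with a breadth-first search; the adjacency-list encoding and table names are assumptions:

```python
from collections import deque

def closes_cycle(adj, u, v):
    # BFS for an alternative path u -> v that does not use the direct
    # edge (u, v); if one exists, the edge (u, v) lies on a cycle.
    queue, seen = deque([u]), {u}
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if x == u and y == v:
                continue  # skip the edge under test
            if y == v:
                return True
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# Triangle join graph (cyclic) vs. chain join graph (acyclic).
triangle = {"R": ["S", "T"], "S": ["R", "T"], "T": ["R", "S"]}
chain = {"R": ["S"], "S": ["R", "T"], "T": ["S"]}
```

Replacing the plain BFS with a weighted shortest-path search would let the same routine prefer breaking the least selective link.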
Economical Join Sampler
The previous section presented the fundamentals of the stream sampler. It will now be considered how one can obtain an economical sampler with a reduced memory footprint for various types of joins.
Exploiting Foreign Key Joins
Many-to-one relations allow one to first sample from the “many” table and then look up the “one” entity in the other table(s). If the weights are all equal, then this reduces to sampling from one table and then joining the sample with the other tables. If the weights are not equal, one can either first proceed as if they were equal and afterwards employ rejection sampling to rectify the inclusion probabilities, or one needs to find the group weights and treat the foreign-key join like a many-to-many join. The former is more memory efficient, whereas the latter is reliably fast. For the economical sampler, one picks the more memory-efficient variant.
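The rejection-sampling variant for weighted foreign-key joins might look as follows; it assumes a known upper bound on the joined-row weights, and the column name "fk" and all other names are illustrative:

```python
import random

def fk_sample(many_rows, one_lookup, weight_of, w_max, rng):
    # Sample uniformly from the "many" table, join via lookup, then
    # accept with probability w / w_max so that the final inclusion
    # probabilities become proportional to the weights.
    while True:
        row = rng.choice(many_rows)
        joined = {**row, **one_lookup[row["fk"]]}
        if rng.random() < weight_of(joined) / w_max:
            return joined

rng = random.Random(7)
many = [{"id": 1, "fk": "x"}, {"id": 2, "fk": "y"}]
one = {"x": {"payload": 10, "w": 1.0}, "y": {"payload": 20, "w": 0.5}}
sample = fk_sample(many, one, lambda r: r["w"], 1.0, rng)
```

The loop's expected number of iterations grows as w_max over the average weight, which is the memory-versus-speed trade-off mentioned above.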
Simplifying Cyclic Joins
While some join queries can look like a challenging cyclic join at first glance, it is sometimes possible to reduce them to a much simpler query, e.g., an acyclic one. If a subgraph of the join graph is a foreign-key join, one can efficiently join the subgraph together, and the join result can only have as many rows as the largest of the joined tables. To automate and generalise this process without knowledge of foreign-key constraints, one can simply join together all tables where the join result is at most slightly larger than the tables themselves. For table sizes up to n, this simplification step requires expected O(n) time and space using hash joins or worst-case O(n log n) time and O(n) space using sort-merge joins.
Novel Hashed Join Method: Handling High-Cardinality Join Attributes without Approximations
In case there are many distinct attribute values in the join attributes of very large base tables, additional techniques are needed to reduce the memory footprint. A good way to think about join rows is as the subset of cross-product rows that satisfy the join conditions. By relaxing the equi-join condition to an equi-hash join condition, one obtains a superset: [Equi-Hash Join] The equi-hash join along some join column between two tables is the subset of the cross-product of rows where the hash value in the join column matches for a shared hash function.
Note that any result in the equi-join must also be in the equi-hash join, because if two attribute values are equal, so are their hashes (see also Figure Document). Thus, equi-hash joins reduce the number of join attribute values in a controlled way (controlled by the chosen hash function) by allowing to operate over the smaller number of hash values. While the reduced domain leads to collisions, those collisions only lead to some extraneous results that can be purged later on without any negative impact on precision. More specifically, the collisions cause additional rows from the cross-product that violate the join conditions, but these are easy to identify during the post-processing of the sample. As the number of purged rows is not fully predictable, it can be necessary to generate more samples than expected. To avoid this when possible, one can choose a sufficiently large sample size. A common source of a large number of distinct values are join columns that are database keys. In that particular case, one can predict: for tables with unique values along the join columns, the hash-relaxed equi-join using a universal hash function is expected to have at most a bounded number of superfluous results that are not present in the exact join, where the bound depends on the table sizes and the size of the hash universe. Thus, as a heuristic, it is proposed to collect a correspondingly larger sample for the hash-relaxed superset if that does not exceed the memory limit, because the join size is expected to be at least as large as the tables. As the bound counts the tables that are much smaller than the hash universe, it is dependent on the choice of the universe size. Thus, different choices of the universe size can be tried out to find the best choice within the formula before the sampling commences. When the memory limit does not allow the needed sample size, or the join has many distinct values without joining keys, then the hashing algorithm can be run multiple times with different random seeds to collect the targeted amount of samples.
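The equi-hash relaxation and the purge step can be sketched as follows; the bucket count and the use of Python's built-in hash are illustrative stand-ins for the shared hash function:

```python
from collections import defaultdict

def equi_hash_join(left, right, col, buckets):
    # Match rows whose hashed join values agree modulo `buckets`.
    # This yields a superset of the exact equi-join: equal values
    # always collide, but some unequal values may collide too.
    index = defaultdict(list)
    for r in right:
        index[hash(r[col]) % buckets].append(r)
    return [(l, r) for l in left for r in index[hash(l[col]) % buckets]]

def purge(pairs, col):
    # Post-processing: drop collision-induced rows that violate
    # the real join condition.
    return [(l, r) for l, r in pairs if l[col] == r[col]]

left = [{"b": 1}, {"b": 2}]
right = [{"b": 1}, {"b": 3}]
relaxed = equi_hash_join(left, right, "b", 2)  # values 1 and 3 collide
exact = purge(relaxed, "b")
```

Choosing a larger bucket count trades memory for fewer collisions, which is exactly the knob the heuristic above tunes.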