1. Introduction
When running queries over data distributed across multiple nodes, node failures and stragglers occur with high probability; in such scenarios, systems may return errors or degraded performance to the user even though most of the data has been processed or is available for processing. In this paper, we propose ideas to continue providing insightful results to users in the event of failures/stragglers, using the available data.
Data warehouses store unprecedented amounts of data, and most modern parallel data processing systems extensively use partitioning for high throughput (snappy; F1; greenplum; shark). However, the sheer magnitude of the data, coupled with the large number of users, introduces varied challenges. With data distributed across a large number of nodes, the probability that some nodes are slow (stragglers) or unavailable increases drastically; in such scenarios the query either slows down or the system may simply fail it. Moreover, resource contention for nodes increases with the number of queries and users, which adds further delay.
A prime requirement of data analysts who use such systems is obtaining interactive-time responses to their queries. Studies (HCI1; HCI2) have shown that sub-second response times are crucial for their productivity, and that failures after queries have run for a long time are very frustrating for users. There has been significant research on query progress monitoring (progress1; progress2; progress3). However, users still have little control over a failed query execution, despite being able to monitor its progress.
There has been a lot of work on redundancy and fault-tolerance techniques to mitigate the effects of failures/stragglers. Deploying and maintaining replicas can be very expensive, and the overhead of switching over to replicas can be large (redundancy). In this work, we explore an orthogonal approach: instead of using redundancy to reduce the probability of failure, we allow approximation errors and use probabilistic techniques to provide answers despite failures.
We exploit a key aspect of efficiency in distributed databases, partitioning, and in particular, cohash partitioning, and propose a novel way to exploit the statistical properties of cohash partitioned data to provide quick approximate answers with confidence bounds in failure/straggler scenarios. We consider aggregate queries that involve joins, Group By and Having clauses along with a subclass of nested subqueries.
We focus on use cases where approximate answers to queries are tolerable, such as gaining insights from data, discovering trends, etc. We note that there has been no discussion of providing approximate answers for queries in the presence of straggler nodes and node failures. Systems such as Dremel (dremel) provide an answer based on available data, but without any estimation or attached error bounds.
Cohash partitioning, a well-known partitioning strategy, has recently become a popular choice in several systems and has been shown to improve the efficiency of distributed query processing for both OLAP and OLTP workloads (oracleRef; snappy; F1; jecb; locationAware). It is an extension of hash-partitioning in which tuples from multiple tables that join are colocated by partitioning on join predicates.
Example 1.1.
Consider the Customer, Orders, and Lineitem tables from the TPCH schema. Under a cohash partitioning scheme, the Customer table can be hash-partitioned on the c_custkey attribute, with all orders and all lineitems of a customer being colocated with the customer tuple.
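As a minimal sketch of this placement rule (the shard count and hash function are hypothetical; real systems use their own partitioning machinery), every tuple is routed by hashing the customer key it transitively belongs to:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_of(custkey) -> int:
    # Hash the root table's partitioning key to pick a shard.
    digest = hashlib.sha256(str(custkey).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Customer is hash-partitioned on c_custkey; Orders and Lineitem tuples
# inherit the shard of the customer they (transitively) join with.
def place_customer(c_custkey):
    return shard_of(c_custkey)

def place_order(o_custkey):
    return shard_of(o_custkey)        # colocated with its customer

def place_lineitem(orders_custkey):
    return shard_of(orders_custkey)   # colocated via the order's customer
```

Because all three placement functions hash the same key, a customer's orders and lineitems always land on that customer's shard, which is what makes local joins possible.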
Our key insight is that, with cohash partitioning, available data from a hierarchy corresponds to obtaining a uniform random cluster sample (cluster) of the hierarchy. Using this key insight, we propose that a given query be executed to “completion” using data that is available or has already been processed and provide an estimate of the query aggregate result along with confidence bounds.
Users. They obtain approximate answers with confidence bounds instead of errors due to failures. They are less frustrated, and if the accuracy of the answer is satisfactory, they will not restart the query, thus saving resources and time.
DB Vendors. Since databases are cohash partitioned to aid query performance, our solution requires minimal changes to existing database engines or data layout strategies.
Example 1.2.
Consider an aggregate query that joins three tables from the TPCH benchmark — Customer, Orders and Lineitem. Continuing Example 1.1, if all three tables are cohash partitioned into multiple shards, data in a single shard will itself comprise a cluster sample of the join result that can be used to provide approximate answers.
Contributions.
Key idea. We propose that a widely-employed, efficient partitioning technique, cohash partitioning, be exploited to quickly provide approximate answers along with confidence bounds in failure/straggler scenarios.
Feasibility and statistical properties. For a given query and cohash partitioning strategy, we enumerate necessary and sufficient conditions to provide approximate answers in failure/straggler scenarios (approximable query). If a query is approximable, we determine the resulting sampling design for the available data.
Error bounds.
We discuss providing error bounds for approximate answers. Computing error bounds for queries that involve multiple cohash hierarchies can be challenging due to correlation between the resulting tuples. We propose a simple heuristic for easing the variance computation; we validate the heuristic experimentally.
Approximating complex queries. We formally describe the class of nested subqueries and queries with Having clauses that can be approximated using our proposed idea. As an example, consider the Orders and Lineitem tables being cohash partitioned. A query such as “count the number of orders that have at least 10 lineitems” can be easily approximated, since all lineitems of an order will be colocated.
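A sketch of how such a query could be approximated from a cluster sample (function and argument names are ours; p is the fraction of clusters available):

```python
from collections import Counter

def estimate_orders_with_min_lineitems(sampled_lineitem_orderkeys, p, k=10):
    # One entry per lineitem tuple in the sample; cohashing guarantees that
    # if an order is present, *all* of its lineitems are present, so the
    # per-order counts are exact and the Having predicate can be evaluated
    # locally. The qualifying count is then inflated by 1/p.
    per_order = Counter(sampled_lineitem_orderkeys)
    qualifying = sum(1 for c in per_order.values() if c >= k)
    return qualifying / p
```

For instance, with half the clusters available (p = 0.5), each qualifying order observed in the sample counts as two orders in the estimate.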
Experimental Validation.
We conduct extensive experiments to validate our proposed approach on uniform and skewed TPCH datasets, using previously validated cohash partitioning designs (locationAware).

1.1. Other Applications
We now present other applications of the idea proposed:
Resource Management. Consider a system that has data partitioned across nodes in a network and a large number of users running analytical queries over the partitioned data. Resource contention becomes a performance bottleneck when all users need to access all shards. Since, with cohash partitioning, many queries can use any subset of the data to provide approximate answers, users can be redirected to disjoint sets of machines to significantly increase throughput.
Example 1.3.
Consider users running instances of query 3 from the TPCH benchmark, which involves a join between Customer, Orders and Lineitem. Suppose the dataset is cohash partitioned so that tuples that join are colocated. If the data is sharded across nodes, each user's queries can be run on a single distinct shard when there are at least as many shards as users, or the load can be balanced by distributing the queries evenly across the shards. This reduces resource contention and increases throughput, with the tradeoff of obtaining approximate answers.
EQP + AQP. As shown in (oracleRef; snappy; jecb) and (locationAware), cohash partitioning results in huge performance improvements for EQP (Exact Query Processing). This benefit can carry over to AQP as well; in a nutshell, when data is cohash partitioned, we can provide approximate answers incrementally by processing queries using efficient exact query processing techniques, without having to repartition data randomly. Also, since cohash partitioning colocates joining tuples, it can broaden the scope of AQP to joins and more complex queries that involve nested subqueries. We argue that since we propose to exploit existing features of EQP systems to provide approximate answers, our approach may hasten the adoption of AQP in existing systems.
Example 1.4.
Consider an analyst who intends to run an aggregate query; she only needs to know a rough estimate of the aggregate but is unsure about the percentage of error that she can tolerate. She runs the query using online aggregation. If, after a few minutes, she is not comfortable with the error bounds, it is hard for her to decide whether to wait to obtain better bounds, or whether to stop the query and rerun it using EQP. This is because online aggregation can be inefficient when processing a query all the way to completion.
However, when the dataset is cohash partitioned, any subset of the database can be used to provide approximate answers, and hence, running estimates of the answer with error bounds can be provided to the user while running the query using EQP. She may stop the query after the desired confidence bounds are reached or may continue the query till the end, with no significant overhead.
2. Related Work
There is a wealth of relevant work on approximate query processing; we highlight the differences with the previous work that is most closely related to ours. To the best of our knowledge, this is the first full paper that proposes approximate answers with confidence bounds for failure/straggler scenarios.
Extended abstract X,
a preliminary publication of this submission (the workshop expects the work to be submitted to a conference later), proposes exploiting partitioning to obtain approximate answers in failure scenarios and otherwise. This submission extends those ideas in nontrivial ways: (1) X considers only nonredundant designs, whereas our work considers redundant cohash partitioning designs as well; (2) we conduct extensive experiments for varied types of queries over multiple recommended cohash partitioning designs; (3) we discuss in detail the heuristic for variance computation and validate it empirically; (4) we include a deeper discussion on obtaining estimates and confidence intervals for queries affected by stragglers/failures; (5) we propose other applications for the proposed solution; and (6) we extend the ideas to more complex queries with Group By, Having and nested subqueries.
Join synopses (synopses) result in uniform random samples of the join output, which corresponds to the statistical properties of a subset of a database under a specific cohash partitioning strategy. However, the samples in (synopses) are precomputed, requiring extensive work on top of an existing system, and the approach is not designed for handling failures or stragglers.
Wander Join (wanderJoin) provides efficient ways to approximate aggregates in the presence of joins with the help of indexes; however, it does not consider node failures in parallel systems, and the approach can be inefficient if the user chooses to run the query to near completion, as Wander Join differs substantially from efficient exact join algorithms.
BlinkDB (blink) handles only single table aggregates.
Quickr (quickr): The idea is to obtain samples from different tables from the same hash space. Depending on the cohash partitioning strategy, the effect can be similar. However, there are fundamental differences – Quickr aims to push samplers into the query plan to obtain approximate answers and does not discuss handling failures/stragglers.
The systems iOLAP (iolap) and GOLA (gola) provide incrementally improving approximate results. They require an initial random partitioning of the data into subsets and do not exploit the existing physical data partitioning that many systems rely on for efficient performance. The idea of randomly partitioning a dataset to aid online aggregation has been suggested and studied in numerous contexts (onlineAggregation; caseOnlineAgg; parallelHash; continuousSampling). However, in the previous work, the focus is on randomly partitioning each table independently of the others; cohash partitioning is not considered. Representing all possible answers (mtables) is not very insightful for aggregate queries, and providing partial results (partial) to users may not be insightful either, since users obtain neither estimates of the aggregates nor the errors possible in the given result.
3. Background
We now provide a brief background on cohash partitioning and various sampling designs.
3.1. CoHash Partitioning
We first set up some necessary notation. We assume a relational database $\mathbf{D}$ with relational schema $\mathbf{R} = \{R_1(\bar{A}_1), \ldots, R_k(\bar{A}_k)\}$. Here, $R_i$ is a relation name, and $\bar{A}_i$ is the attribute vector of the relation $R_i$. We denote by $attr(R_i)$ the set of attributes of relation $R_i$, and by $key(R_i) \subseteq attr(R_i)$ the attributes that form the primary key for $R_i$. Furthermore, we denote by $R_i^{\mathbf{D}}$ the instance of relation $R_i$ in the database $\mathbf{D}$.

Join Graph. Given two tables $R_i, R_j$, we define a join condition between $R_i, R_j$ to be an expression of the form $R_i.A = R_j.B$, where $A \in attr(R_i)$ and $B \in attr(R_j)$.
Definition 3.1 (Join Graph).
A join graph is an edge-labeled undirected graph $G = (V, E)$, where the vertex set $V$ is a subset of $\mathbf{R}$, and the label of an edge $(R_i, R_j) \in E$ is a join condition between $R_i$ and $R_j$.
Let $Q$ be a Select-Project-Join (SPJ) SQL query. Then, we can define the join graph of $Q$ as follows: its vertex set is the set of relations involved in $Q$, and there is an edge between two relations whenever they join in $Q$.
The schema graph of a schema is a join graph where the vertex set consists of all the relations in the schema, and each edge represents a possible join between two tables. There are numerous possible schema graphs for a given schema. Figure 0(a) shows the schema graph of the TPCH benchmark (tpch), constructed using the primary key-foreign key joins between the tables. The vertices represent the 8 tables in the schema, and the edges represent the primary key-foreign key joins, as indicated by the edge labels.
Partitioning Strategies. We now discuss partitioning a database into chunks. A simple strategy is to partition each table independently of the other tables. For instance, we can hash-partition each table using a hash function $h$ and a subset of attributes $X$: the hash function takes a value for each attribute in $X$ and returns a value in $\{1, \ldots, m\}$. Then, the tuple $t$ is assigned to chunk $h(t[X])$, where $t[X]$ denotes the value of tuple $t$ at the attributes $X$.
Hash partitioning can ignore relationships between data across tables. Suppose we have two tables $R$ and $S$, such that $R.A$ is a foreign key referencing the primary key $S.B$. If $R$ and $S$ are independently hash-partitioned, each on its own set of attributes, then the join generally requires data shuffling. However, if we exploit the join predicate between $R$ and $S$ and hash-partition $R$ on $A$ and $S$ on $B$ with the same hash function, then the join can be computed locally without any data shuffling. This idea has been previously proposed in multiple systems (oracleRef; jecb; locationAware). Here, we formally describe a more general partitioning framework, called cohash partitioning.
Cohashing. Suppose table $S$ is already distributed into (not necessarily disjoint) chunks $S_1, \ldots, S_m$. Let $R$ be a table that shares an edge with $S$ in the schema graph, with join condition $R.A = S.B$. Adopting the terminology from (locationAware), we say that $R$ is cohashed with respect to $S$ if the tuples from $R$ are distributed as follows:

If a tuple $t \in R$ joins with a tuple $s \in S$ on the join condition, then $t$ belongs to all the chunks to which $s$ belongs; in other words, $t$ is colocated with $s$.

If a tuple $t \in R$ does not join with any tuple from $S$ on the join condition, then $t$ is hash-partitioned using any subset of its attributes.
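The two rules can be sketched with a simplified in-memory model (the chunk layout, row encoding, and fallback hash are illustrative assumptions):

```python
def cohash(child_rows, parent_chunks, join_key, fallback_hash, num_chunks):
    """Distribute child_rows (dicts) w.r.t. an already-chunked parent table.
    parent_chunks maps a parent join-key value to the set of chunk ids
    holding that parent tuple (parents may be replicated)."""
    chunks = [[] for _ in range(num_chunks)]
    for row in child_rows:
        targets = parent_chunks.get(row[join_key])
        if targets:
            # Rule 1: a joining tuple goes to every chunk of its parent.
            for c in targets:
                chunks[c].append(row)
        else:
            # Rule 2: a non-joining tuple falls back to plain hashing.
            chunks[fallback_hash(row) % num_chunks].append(row)
    return chunks
```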
Cohashing Hierarchies. We can extend cohashing from two tables to a hierarchy. Let $T$ be a rooted tree that is a subgraph of the schema graph. We extend cohashing to $T$ by following the recursive structure of the tree. Initially, we hash-partition the root table using a subset of its attributes. We then recursively distribute each table in the tree by applying cohashing w.r.t. its parent table. Formally, we have:
Definition 3.2 (Cohash Scheme).
Let $G_S$ be a schema graph of $\mathbf{R}$. A cohash hierarchy is a tuple $(T, X)$ such that:

$T$ is a rooted directed in-tree (a directed tree where every edge is directed towards a designated root) that is a subgraph of the schema graph; and

$X$ is a subset of the attributes of the root of $T$, called the clustering attributes of the hierarchy.
A cohash scheme is defined as a collection of cohashing hierarchies of the schema graph.
Table 0(a) lists some cohashing hierarchies for the schema graph in Figure 0(a). If $R$ is cohashed w.r.t. $S$, then joining $R$ and $S$ on the cohashing join condition (or any superset of the condition) can be performed without any data shuffling.


Multiple cohash partitioning strategies have been studied in (locationAware) in terms of performance. They recommend the Schema Driven Without Redundancy (SDWithout) design as the best design without redundancy for the TPCH benchmark. This design is presented in Figure 0(b) and explained in detail in the example below:
Example 3.3.
In the SDWithout design depicted in Figure 0(b), Nation, Region and Supplier are very small and hence replicated to all partitions (they are designated with a dotted box). The remaining five tables are spanned by two hierarchies, each one depicted with a different color. The thick arrows show the orientation of the edges in the hierarchy. The first hierarchy, $h_1$, includes Customer, Orders and Lineitem and has the table Customer as its root; the second hierarchy, $h_2$, contains Part, which is the root, and Partsupp. Consider hierarchy $h_1$ with custkey as the hashing attribute for the root table Customer. Figure 1(a) shows the distribution of tuples in the tables into three different partitions.
In our study we consider both redundant and nonredundant cohash partitioning designs for providing approximate answers to queries affected by failures/stragglers.
3.2. Sampling Designs and Estimators
A sampling design refers to an algorithm used to obtain a sample from the input data. In this subsection, we introduce the various sampling designs that will be used in the paper. The notation is summarized in Table 1.
Notation | Meaning
$N$ | number of tuples in the table
$s$ | current sample
$\pi_i$ | inclusion probability of tuple $i$
$\pi_{ij}$ | joint inclusion probability of tuples $i$ and $j$
$N_k$ | total number of clusters at level $k$
$X_1, \ldots, X_k$ | clustering attributes at the different stages for hierarchy $h$
3.3. Sampling Designs
Bernoulli sampling. In Bernoulli sampling, each element is included in the sample independently with the same probability $p$. The inclusion probabilities are $\pi_i = p$, and $\pi_{ij} = p^2$ for $i \neq j$. The expected size of the sample is $pN$.
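A minimal sketch of Bernoulli sampling together with the corresponding Horvitz-Thompson SUM estimator (function names are ours):

```python
import random

def bernoulli_sample(table, p, seed=0):
    # Each element is kept independently with probability p.
    rng = random.Random(seed)
    return [t for t in table if rng.random() < p]

def ht_sum(sample, p):
    # Horvitz-Thompson estimator: inflate each value by 1/pi_i = 1/p.
    return sum(sample) / p
```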
Cluster Sampling (cluster; bookSarndal). In cluster sampling, the table is divided into a number of disjoint groups, called clusters (this is not to be confused with the dataset being divided into partitions for storage – each partition can store multiple clusters). Then, a sample of the clusters is selected, and all the tuples in the selected clusters are included in the sample. Cluster sampling is also referred to as single-stage sampling.
Example 3.4.
Consider the tuple distribution in Table 1(a) of hierarchy $h_1$ (rooted at Customer). If we obtain a random sample of customers using custkey, then we also get the relevant orders and lineitems of those customers, since they are colocated. This constitutes a cluster sample of the data in hierarchy $h_1$.
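Drawing such a cluster sample can be sketched as follows, where each cluster holds a customer together with all of its colocated orders and lineitems (the dictionary layout is an illustrative assumption):

```python
import random

def cluster_sample(clusters, n, seed=0):
    # clusters: custkey -> list of colocated tuples (the customer, its
    # orders, and their lineitems). A simple random sample of n clusters
    # is drawn; every tuple inside a chosen cluster is taken wholesale.
    chosen = random.Random(seed).sample(sorted(clusters), n)
    return {key: clusters[key] for key in chosen}
```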
Multistage Sampling (multistage; bookSarndal). In multistage sampling, we use subscripts to refer to the different stages of sampling. The tuples in the table are initially partitioned into clusters, represented by the set $C_1$, and in the first stage a sample is drawn from $C_1$. In the second stage, each sampled cluster is further partitioned into secondary clusters, represented by the set $C_2$, and for each sampled cluster a sample is drawn from its elements. We repeat this procedure of sampling subclusters for $k$ stages. At the $k$th stage, the sampling unit can be tuples or clusters. The inclusion probability of a cluster at stage one is $\pi_i$, and so on. The clustering attributes of the different stages are given by $X_1, \ldots, X_k$, where $h$ is the hierarchy being sampled.
Example 3.5.
Continuing Example 3.4, if after sampling customers at random we also sample each customer’s orders, followed by each order’s lineitems, then we obtain a three-stage sample. In this case, the clustering attributes of the three stages are given by the primary keys of the three tables, respectively.
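A two-stage version of the earlier cluster-sampling sketch (illustrative only; a real implementation would follow the hierarchy's clustering attributes at each stage):

```python
import random

def two_stage_sample(clusters, n1, n2, seed=0):
    # Stage 1: sample n1 primary clusters (e.g. customers).
    # Stage 2: within each chosen cluster, sample up to n2 secondary
    # units (e.g. that customer's orders).
    rng = random.Random(seed)
    stage1 = rng.sample(sorted(clusters), n1)
    return {c: rng.sample(sorted(clusters[c]), min(n2, len(clusters[c])))
            for c in stage1}
```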
4. Preliminaries
Notation. Let $\mathbf{D}$ be the database, and let $D_1, \ldots, D_m$ denote the chunks of the database, as partitioned by a cohash scheme. For every table $R$ (hierarchy $h$), let $R_j$ ($h_j$) denote the chunk of the table (hierarchy) in the chunk $D_j$. A chunk can be thought of as a page, a block (a contiguous set of pages), or a node in a cluster, depending on the granularity of the read access. Also, for some chunk $D_j$, $R_j$ or $h_j$ can be empty; this handles the case where different cohashed hierarchies are partitioned across disjoint sets of nodes. Suppose that we obtain a subset of the chunks in the partition. Then, for each table we obtain a subset of its chunks, and for each hierarchy we also obtain a subset of its chunks. We denote the subset of the database obtained this way by $\hat{\mathbf{D}}$.
The specifics of how these subsets are obtained in practice are not important; we discuss in Section 5 that when a database is cohash partitioned, the statistical properties of the partitions are agnostic to the way the data is accessed. A hierarchy could be accessed by reading partitions successively or simultaneously. It can also be read via indexes; if table $R$ is accessed through an index using values from an attribute of table $S$, then $R$ and $S$ will implicitly be part of a hierarchy, though not explicitly cohashed.
Query class considered. We consider queries with no selfjoins (without repeated occurrences of the same relation) of the form :
Select 
From 
Where 
Group By 
Having 
where, are one of SUM(), COUNT(), and AVG() aggregate operators, are equality join conditions, are data sources in or derived from . are predicates that can also be nested Exist clauses and are simple predicates.
Problem Statement. Suppose that while executing a query $Q$ of the above form, some of the nodes required by $Q$ are very slow or unavailable. The problem we consider is as follows:
Input | Query $Q$ of the above form (no self-joins), cohash scheme, and hierarchies chosen to answer the query: $H_Q$.
Problem 1 | Can we provide an approximate answer to $Q$ using available data?
Problem 2 | If yes, what are the estimates and confidence intervals?
We study these questions in detail in the next two sections. Before proceeding, we would like to emphasize that the aim of this work is not to study already established runtime benefits of cohash partitioning, but rather, to exploit it to provide approximate answers under failures/straggler scenarios.
5. Providing Insights
In this section, we first discuss obtaining insights for simple queries of the form SELECT <attributes> FROM <tables> WHERE <predicates>, without nested subqueries, Group By or Having clauses, and then extend the ideas to general queries. We structure the discussion as follows: (1) the failure model and the implied statistical properties of a data subset, (2) necessary conditions that should be satisfied to provide insights for a query in the presence of stragglers and failures, and (3) extensions of the ideas to provide insights for complex queries of the form $Q$.
5.1. Statistical properties of data
We first discuss the implicit statistical properties of data that is cohash partitioned. This will help us understand the properties of the available data in the event of failures and/or stragglers. We then discuss the failure/straggler model and the resulting statistical properties of the available data.
Cluster sample. Consider a hierarchy $h$, where $R$ is the root of $h$. From the cohash partitioning definition, a tuple $t \in R$ will be hashed to a machine, and hence will be located in a randomly chosen machine. The set of tuples of $R$ located at a machine corresponds to a uniform random sample of $R$.
For a tuple $t$ in a relation $R$ of a cohash hierarchy $h$, let $J(t)$ denote the set of all tuples in any child of $R$ in $h$ that join with $t$ on the corresponding join condition.

Now, we define the set of tuples from all relations in a cohash hierarchy that are colocated with a tuple $t$, denoted by $C(t)$ and defined recursively as the union of $J(t)$ and $C(t')$ for every $t' \in J(t)$.

$C(t)$ represents the set of all tuples in the cohash hierarchy that recursively join with tuple $t$. For example, consider a tuple $c \in$ Customer, where Customer is the root of its hierarchy: $C(c)$ contains all tuples in Orders that join with $c$ on custkey, the set of tuples in Lineitem that join with the preceding set of tuples from Orders, and so on. In other words, all tuples that recursively join with $t$, or with some tuple that joins with $t$, will be colocated with $t$. We call such a set of tuples a cluster, with the clustering attributes being the hashing attributes of the root of the hierarchy. That is, $t$ along with all its colocated tuples from all relations in the cohash hierarchy forms a cluster.

Since $t$ is sent to a randomly chosen machine, by implication, the cluster of tuples corresponding to $t$ is sent to a randomly chosen machine. Hence, the data on a machine corresponds to a random set of clusters of a cohash hierarchy.
Example 5.1.
Consider hierarchy $h_1$, rooted at Customer. The data on any machine will contain a subset of the clusters of hierarchy $h_1$, and these clusters constitute a cluster sample of the hierarchy with the clustering attribute being custkey.
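For the Customer hierarchy, the cluster of a single customer tuple can be materialized as below (TPCH-style key names; a naive in-memory sketch, not how a partitioner would compute it):

```python
def cluster_of(cust, orders, lineitems):
    # All tuples colocated with one customer tuple in the hierarchy
    # Customer -> Orders -> Lineitem: the customer's orders, plus every
    # lineitem of those orders (the recursive join closure).
    my_orders = [o for o in orders if o["o_custkey"] == cust["c_custkey"]]
    my_orderkeys = {o["o_orderkey"] for o in my_orders}
    my_items = [l for l in lineitems if l["l_orderkey"] in my_orderkeys]
    return [cust] + my_orders + my_items
```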
Redundancy. Cohashing may result in data redundancy in two ways: (1) a tuple may join with tuples that belong to different chunks, in which case it will be replicated; this is referred to as tuple-level redundancy. (2) A relation can be part of multiple cohashing hierarchies; this is referred to as relation-level redundancy.
Redundancy has an effect on the statistical properties of the data obtained from a machine and hence affects the insights we can provide for queries. When we obtain clusters corresponding to a hierarchy from a machine, it is equivalent to obtaining cluster samples of the result of joining the relations on the join conditions in the hierarchy. However, the statistical properties of data pertaining to part of the hierarchy (a subhierarchy) depend on whether cohash partitioning induces tuple-level redundancy for the data corresponding to the relations in the subhierarchy considered. This is illustrated in the example below:
Example 5.2.
Consider a hierarchy in which Orders is cohashed with respect to Lineitem, for the join graph in Figure 0(a). A tuple $o \in$ Orders can join multiple tuples from the table Lineitem, and hence may be replicated across chunks. A subset of data of only the Orders relation will then contain redundant tuples, and the statistical properties of this data subset are not easy to determine.
Nonredundancy of a relation can be determined from a given cohash scheme using the following conditions.
Proposition 5.3.
A relation $R$ in a cohashing hierarchy $h$ is nonredundant if either:

$R$ is the root of $h$; or

$R$ is cohashed with respect to a relation $S$ on a join condition $R.A = S.B$ such that $key(S) \subseteq B$, and $S$ is nonredundant.
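The proposition translates into a simple recursive check along the hierarchy (a sketch; the dictionary encodings of the hierarchy and keys are our own):

```python
def is_nonredundant(rel, parent_of, join_attrs_on_parent, primary_key):
    # parent_of[rel] is None for the root; join_attrs_on_parent[rel] is the
    # set of parent-side attributes in the cohashing join condition. A child
    # tuple joins at most one parent tuple exactly when those attributes
    # contain the parent's primary key, so no replication is induced.
    parent = parent_of[rel]
    if parent is None:        # the root of a hierarchy is never replicated
        return True
    return (primary_key[parent] <= join_attrs_on_parent[rel]
            and is_nonredundant(parent, parent_of,
                                join_attrs_on_parent, primary_key))
```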
The proofs of all propositions in this section are simple and were also discussed in the extended abstract X.
Example 5.4.
Consider hierarchy $h_1$, also given in Example 5.1. Orders is cohashed with respect to Customer on the join condition o_custkey = c_custkey, and c_custkey is the primary key of Customer. Customer is also the root of the hierarchy. Hence, both Orders and Customer are nonredundant relations.
For a subhierarchy with more than one relation, the data for the subhierarchy will be redundant if the relation at the root of the subhierarchy is redundant.
Failure Model. We assume that the hash functions used for hash-partitioning the root tables belong to the universal family of hash functions (universalHashing), so that the probability of any cluster being mapped to a particular partition is $1/m$, and these probabilities are independent across clusters. Hence, each cluster will be distributed independently and uniformly at random across the $m$ chunks.
For a hierarchy $h$, let $n_j$ denote the number of clusters in the chunk $D_j$. Then, the total number of clusters from $h$ in the available subset is $n = \sum_j n_j$, where the sum ranges over the available chunks. If the total number of clusters $N$ in hierarchy $h$ is known, then the available data is a cluster sample of $h$, where the inclusion probability of a cluster is $n/N$. Otherwise, we can estimate the total number of clusters as $\hat{N} = n \cdot m / m'$, where $m'$ is the number of available chunks.
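Computing (or estimating) the inclusion probability can be sketched as follows, under the uniform-distribution assumption above (function and argument names are ours):

```python
def cluster_inclusion_probability(clusters_per_chunk, chunks_read,
                                  total_chunks, total_clusters=None):
    # clusters_per_chunk: number of clusters observed in each available
    # chunk. If the true total N is unknown, estimate it by scaling the
    # observed count by the fraction of chunks read.
    n = sum(clusters_per_chunk)
    if total_clusters is None:
        total_clusters = n * total_chunks / chunks_read
    return n / total_clusters
```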
When the data subset obtained is the result of partitions being unavailable (failures) or slow (stragglers), the resulting inclusion probabilities, arising from the interaction of two random processes – slow/failed data nodes and cohash partitioning of tuples/clusters – may not be immediately apparent.
Notation | Meaning
$t$ | true value of the aggregate
$y_i$ | attribute value at tuple $i$
$\hat{t}$ | estimated value of the aggregate
We will now show that the available data after failures can be used as a cluster sample with inclusion probabilities resulting from cohash partitioning. Indeed, let us consider the following failure model for arbitrary elements $i$ and $j$ of a table ($i$ and $j$ could be tuples or clusters). Let $q$ be the probability of node failure. (This value need not be known; data nodes in a data center might require monitoring for an extended period of time to determine the expected probability of failure.) Let $s$ be the sample without any failures and let $s'$ be the subsample after failures. In the case where $s$ equals the whole table, $s'$ represents the sample for which we are deciding the sampling design. The notation for the following equations is given in Tables 1 and 2.
(1) $\Pr[i \in s' \mid i \in s] = 1 - q$

Now, suppose we define a new estimator for the SUM aggregate (following (bookSarndal)) as $\hat{t} = \sum_{i \in s} y_i / \pi_i$. Adopting the failure model given in Equation 1, we modify the previous estimator as $\hat{t}' = \sum_{i \in s'} y_i / ((1-q)\,\pi_i)$. Since the factor $(1-q)$ vanishes when the inclusion probabilities are computed from the observed fraction of available data, we can easily estimate the aggregate as before – using inclusion probabilities resulting from cohashing.
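Concretely, this means a SUM can be estimated from the surviving chunks using only the fraction of chunks that are available, regardless of why the others are missing (a minimal sketch; names are ours):

```python
def estimate_sum(cluster_totals, chunks_available, total_chunks):
    # cluster_totals: per-cluster sums of the aggregated attribute in the
    # surviving chunks. Each cluster landed on a uniformly random chunk,
    # so its inclusion probability is chunks_available / total_chunks,
    # whether the missing chunks failed or are merely slow.
    pi = chunks_available / total_chunks
    return sum(cluster_totals) / pi
```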
Properties of Available Data. The core insight from the previous discussion is that the available subset of data for a cohashed hierarchy, resulting from failures and stragglers, corresponds to a uniform random cluster sample (or one-stage sample) of the hierarchy's data, with the clustering attributes being the hashing attributes of the root of the hierarchy. However, since not all relations of a hierarchy may be present in the query, and some relations/subhierarchies can be tuple-redundant, determining whether a query can be approximated is not trivial. This is illustrated in the following example:
Example 5.5.
Consider the hierarchy in Example 5.2 and a query that involves only the Orders relation. If not all data corresponding to this hierarchy is available, then it may not be possible to approximate this query using the available data.
5.2. Feasibility
We now discuss the conditions for determining whether a query can be approximated. Given a query $Q$, let $H_Q$ be the set of hierarchies used by the query engine to execute $Q$ before the failure or straggler event occurred. For each relation in the query, there is exactly one hierarchy from which its data is being read. Also, if $h \in H_Q$, then data of at least one relation present in $Q$ is being read from $h$. The hierarchy from which the data of a relation $R$ is being read is denoted by $h(R)$. Let $F \subseteq H_Q$ be the set of hierarchies for which not all data is available due to failures and/or stragglers. We now discuss determining whether we can provide approximate answers for $Q$ using the available/processed data from the hierarchies in $F$.
We first define a partitioning-query graph (PQ graph). We construct the PQ graph, $G_{PQ}$, as follows. Let $G_Q = (V, E)$ be the join graph of the query $Q$. Then:

For every relation $R \in V$, we add $R$ to $G_{PQ}$.

For every edge $(R, S) \in E$ with join condition $\theta$, if there exists a hierarchy $h$ with $h(R) = h(S) = h$ whose join condition between $R$ and $S$ implies $\theta$ (here, the implication means that the hierarchy's join condition logically implies $\theta$), we add the edge $(R, S)$ to $G_{PQ}$.
The PQ graph determines the relationship of the tables in the query with respect to the hierarchies they belong to. The connected components of the PQ graph tell us which tables are colocated and hence need to be considered together when determining their statistical properties. Note that the components of a PQ graph are precisely subhierarchies used to answer the query. The root of a component/subhierarchy of a PQ graph can be any table in a hierarchy.
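Building the components of the PQ graph is a standard connected-components computation (a sketch; colocated joins are represented simply as table pairs):

```python
from collections import defaultdict

def pq_components(tables, colocated_joins):
    # tables: relations in the query; colocated_joins: pairs (R, S) whose
    # join condition is implied by an edge of some cohash hierarchy. Each
    # connected component is a subhierarchy whose statistical properties
    # must be analysed jointly.
    adj = defaultdict(set)
    for r, s in colocated_joins:
        adj[r].add(s)
        adj[s].add(r)
    seen, components = set(), []
    for t in tables:
        if t in seen:
            continue
        stack, comp = [t], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            seen.add(u)
            stack.extend(adj[u] - comp)
        components.append(comp)
    return components
```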
Example 5.6.
Consider an instance of the TPCH database partitioned with the SDWithout scheme, and consider a query $Q$ that joins Orders, Lineitem, and Partsupp. The hierarchies used are $h_1$ and $h_2$, and the PQ graph will have two components (or subhierarchies): one containing Orders and Lineitem, rooted at Orders, and one containing Partsupp.
In order to approximate the result, we require that each connected component of the graph with unavailable data is nonredundant, so that the data subset corresponding to each component/subhierarchy forms a cluster sample of the component. For a connected component $C$ of the graph, there is a hierarchy that corresponds to it. Let $h_C$ be the hierarchy corresponding to component $C$ and let $r_C$ be the root of the subhierarchy $C$.
Proposition 5.7.
For a given query and cohash scheme, consider a connected component of the PQ graph whose corresponding hierarchy has unavailable data. If the root relation of the component is nonredundant, then the available data corresponding to the subhierarchy is a cluster sample.
Example 5.8.
Continuing Example 5.6, if data is unavailable for the subhierarchy rooted at Orders and Orders is nonredundant, then we obtain a cluster sample from the available data of that subhierarchy.
However, if data is unavailable for both subhierarchies and Orders is nonredundant but PartSupp is not, then we do not obtain cluster samples and we cannot approximate the result of the query.
Definition 5.9 (Approximable).
A query is said to be approximable for a given cohash scheme and failure event if, for every connected component of the PQ graph whose hierarchy has unavailable data, the root relation of the component is nonredundant.
Now, with each component that has unavailable data, we associate a set of clustering attributes, a sampling probability, and a sample size (of the available data). These correspond, respectively, to the hashing attributes of the component's hierarchy, the available percentage (sampling rate) of that hierarchy, and the number of clusters for the hierarchy in the data subset. Note that multiple components can correspond to the same hierarchy.
Example 5.10.
Continuing Example 5.8, consider a component and its corresponding hierarchy. Say we obtain a subset of the tuples of the Customer table: the clustering attributes are then the hashing attributes of that hierarchy, the sampling probability is the fraction of Customer data available, and the sample size is the number of Customer clusters in the available data.
5.3. Estimators and Variance
Given a query that is approximable for a given cohash scheme and failure event, we now discuss the resulting sampling design and the corresponding estimators and variance estimators, using formulas from (bookSarndal). We start with the sum aggregate; estimators for the count and average aggregates are extensions of it. We first discuss the scenario where data is unavailable for only one hierarchy, and then the case where data is unavailable for multiple hierarchies. Notation for the following discussion is listed in Tables 1, 2 and 3.
Case 1: a single unavailable hierarchy. When the machines that failed or are slow contain data corresponding to only one hierarchy, i.e., exactly one hierarchy has unavailable data while data is available for all other hierarchies, the available data forms a cluster sample, with the clustering attributes being the hashing attributes of the affected hierarchy.
Notation  Meaning
t_i       true value of the aggregate over cluster i
t̂_i      estimated value of the aggregate over cluster i
The sum aggregate is the total of the per-cluster aggregate values over all clusters of the hierarchy. The simplest way to estimate it from a cluster sample is the Horvitz-Thompson (HT) estimator (HT), which inflates the aggregate value of each sampled cluster by the cluster's inclusion probability. The variance and variance estimators for the HT estimator are given by standard formulas adopted from (bookSarndal).
The estimator for the count aggregate easily follows from the sum estimator: replace the value of the aggregated attribute with 1. Estimating the average aggregate involves the ratio of the sum and count estimates and hence introduces a slight bias.
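As a sketch, the HT estimators for sum, count, and average over a uniform cluster sample (a single inclusion probability p shared by all clusters, an assumption made here for simplicity) can be written as:

```python
def ht_sum(sampled_cluster_totals, p):
    """Horvitz-Thompson estimate of a population sum from a uniform
    cluster sample: inflate each sampled cluster's total by the common
    inclusion probability p."""
    return sum(t / p for t in sampled_cluster_totals)

def ht_count(sampled_cluster_sizes, p):
    # count is the sum aggregate with every attribute value replaced by 1,
    # i.e., the HT estimate over the cluster sizes
    return ht_sum(sampled_cluster_sizes, p)

def ht_avg(sampled_cluster_totals, sampled_cluster_sizes, p):
    # ratio of two HT estimates; slightly biased
    return ht_sum(sampled_cluster_totals, p) / ht_count(sampled_cluster_sizes, p)
```

For example, with half the clusters available (p = 0.5), each sampled cluster's total is simply doubled.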
Case 2: multiple unavailable hierarchies. We now consider the case where the machines that are unavailable or slow contain data corresponding to multiple hierarchies. From the previous discussion, we know that the available data for each hierarchy corresponds to a uniform random cluster sample. In the presence of cluster samples from multiple hierarchies, the sampling design of the result of a query involving joins across hierarchies corresponds to a multistage sample.
The HT estimator for multistage sampling is computed recursively. Consider the aggregate sum of a cluster from the first stage, and suppose we can compute its HT estimate with respect to the later stages. The unbiased HT estimator of the total then inflates each first-stage cluster's estimate by that cluster's inclusion probability.
We now discuss the SUM estimator in the context of multistage samples obtained through unavailability of data. For a query, let C be the set of connected components of the PQ graph whose hierarchies have unavailable data. The estimator for the aggregate is the recursive HT estimator applied stage by stage over the components in C.
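The recursive HT computation can be sketched as follows, assuming a uniform inclusion probability per stage and a nested-list encoding of the sampled clusters (both assumptions for illustration):

```python
def ht_multistage(node, probs, stage=0):
    """Recursive HT estimator for a multistage sample.

    node:  a list of sampled element values at the last stage, or a list
           of sampled sub-nodes at every earlier stage
    probs: probs[k] is the (uniform) inclusion probability at stage k
    """
    p = probs[stage]
    if stage == len(probs) - 1:
        # last stage: inflate the sampled values directly
        return sum(v / p for v in node)
    # earlier stage: inflate the estimated total of each sampled cluster
    return sum(ht_multistage(child, probs, stage + 1) / p for child in node)
```

With a single stage this reduces to the plain cluster-sample HT estimator; each additional stage adds one more level of inflation.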
Variance for multistage samples. The variance estimate of a multistage sample is broken down into variance contributions from the different stages: given the variance and the variance estimator of each individual stage, the variance estimate of r-stage sampling is obtained by summing the contributions of the r stages.
For two-stage sampling, the contribution of the second stage is the sum of the per-cluster second-stage variances inflated by the first-stage inclusion probabilities, and the total variance adds this contribution to the first-stage variance.
More details about variance and variance estimation formulas can be found in (bookSarndal). For variance computation, the sampling design at each stage is assumed to be independent and invariant: invariance means that the subsampling design within a cluster does not depend on the samples selected at earlier stages, and independence means that the sampling within a cluster is carried out independently of the sampling in every other cluster of the same stage.
In the current scenario, the independence property is violated: the join involves a cross product of the samples from different hierarchies, so the tuples in the join result are not independent of one another, and the resulting intermediate table will, in general, be a correlated sample. Correlations can arise in different stages of the multistage sample, and the independence requirement may not hold.
So the variance estimate is now computed by summing, over the stages, a correlated variance estimate for each stage; observe that the first stage has no correlation. The formulas and algorithms given in (largeSampleBounds; crossJoin) to compute the variance and estimated variance apply here as well.
Cluster sizes and accuracy. For single-stage (cluster) samples, it is established (bilevel; bookSarndal) that variance in the sizes of the clusters and intracluster homogeneity (intracluster variance much lower than intercluster variance) lead to higher variance in the estimates. Hence, assuming element (tuple) level randomness instead of cluster-level randomness, when the latter actually holds, can lead to severe underestimation of variance.
6. Efficient Variance Computation
Computing the variance of multistage samples can be cumbersome and compute intensive, especially in the presence of correlations. In a multistage sample, the variance contributions decrease with the stage, with the first stage contributing the majority of the variance. Hence, to make computations easier, the variance can be approximated by the first two terms (the variance contributions from the first two stages of sampling) (bookSarndal). This can, however, introduce a slight bias into the variance estimate due to underestimation.
For the multistage samples obtained from available data, we can still apply this approximation, since there are no correlations in the first-stage sample. Note, however, that the approximation requires that the variance contributions of lower stages be larger than those of higher stages. Hence, when computing the variance of a multistage sample that results from processing available data, the mapping of clustering attributes to the different stages of the multistage sample is very important. This can be nontrivial, as illustrated in the following example:
Example 6.1.
Consider a TPCH data set that is cohash partitioned using three hierarchies. Consider Query 9 of the benchmark without the Group By and Order By clauses, whose PQ graph consists of three components corresponding to the three hierarchies. Say data is unavailable from all three hierarchies. The available data from each hierarchy corresponds to a cluster sample with clustering attributes given by the hierarchy's hashing attributes. Since there are three sets of clustering attributes, the result obtained after the join cannot be viewed as a cluster sample over any one set. It will be a 3-stage sample, with each set serving as the clustering attributes of one of the stages. However, the mapping between the clustering-attribute sets and the stages, such that the majority of the variance contribution is captured in the first two stages, is not obvious.
We want to find the clustering attributes that contribute most to the variance in order to approximate the variance computation. Determining the mapping is crucial since, as we will see later, approximating the variance estimate using an incorrect order underestimates the variance. We present Algorithm 1 to determine the number of stages in the multistage sample and the mapping of clustering attributes to the various stages so that the condition that variance contributions of lower stages are larger than those of higher stages is satisfied.
Intuition. The intuition behind the algorithm is: (a) the bigger the clusters, the bigger the variance; in fact, sampling from bigger clusters intuitively makes more sense than sampling from smaller ones; and (b) the lower the sampling rate, the smaller the number of resulting clusters.
Working of the algorithm. It takes as input the list of connected components, their clustering attribute set, the number of clusters, and the inclusion probabilities for each participating component. The algorithm assumes that joining attributes in any two tables have identical names. If this is not the case, it is easy to fix by renaming attributes before performing the join.
Now, starting at the first stage, it determines the clustering attributes for each stage by greedily choosing the component with the minimum number of clusters (line 7). It goes over other connected components and merges them if they have the same set of clustering attributes, and suitably updates the probabilities (lines 8 to 10). Finally, it returns the clustering attributes and sampling probabilities for each stage. We now provide an example to illustrate the working of the algorithm:
Example 6.2.
Continuing the setup from Example 6.1, we consider Query 9 and the three hierarchies of the cohash scheme. Let the availability for the hierarchies be 1% for the hierarchy rooted at Customer, 80% for the hierarchy rooted at Part, and 90% for the hierarchy rooted at Supplier. The PQ graph consists of 3 components, one per hierarchy. After renaming so that joining attributes have identical names, the component rooted at Customer has the smallest number of clusters (line 7), followed by the components rooted at Part and Supplier. So, Algorithm 1 determines the clustering attributes and inclusion probabilities of the multistage sequence to be (cps), in that order: Customer first, then Part, then Supplier.
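The greedy stage-ordering step of Algorithm 1 can be sketched as follows. The dictionary-based component encoding, the function name, and the product rule used to combine probabilities of merged components are assumptions made for this illustration:

```python
def stage_order(components):
    """Greedily map clustering-attribute sets to stages: at each stage,
    pick the remaining component with the fewest clusters, then merge any
    other component that has the same clustering attributes (combining
    inclusion probabilities by product, an assumption here).

    components: dicts with keys 'attrs' (frozenset of clustering
    attributes), 'n_clusters' (int), and 'p' (inclusion probability).
    Returns a list of (attrs, p) pairs, first stage first.
    """
    remaining = list(components)
    stages = []
    while remaining:
        pivot = min(remaining, key=lambda c: c["n_clusters"])
        remaining.remove(pivot)
        p = pivot["p"]
        for same in [c for c in remaining if c["attrs"] == pivot["attrs"]]:
            p *= same["p"]
            remaining.remove(same)
        stages.append((pivot["attrs"], p))
    return stages
```

With illustrative cluster counts mirroring Example 6.2 (fewest clusters for the Customer component), the ordering comes out Customer, Part, Supplier, i.e., the cps sequence.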
Variance Estimation Accuracy. We discuss the impact of the stage sequence on the accuracy of estimated variances for multistage samples by comparing different sequences against the sequence given by the algorithm, and empirically show that the variance estimate for the sequence produced by our algorithm is close to the observed variance.
We use the query and cohash partitioning strategy from Example 6.2 for the following experiments. For different availability percentages, we plot the ratio of estimated variance to observed variance for different multistage sequences in Figure 3. The estimated variance is calculated using uncorrelated variance formulas for the first two stages. There are three hierarchies in the cohash partitioning graph, with roots Customer (c), Part (p), and Supplier (s), respectively. The sequences we consider are cps, csp, pcs, psc, scp and spc.
Combination 1. From Example 6.2: availability percentages are 1% for the Customer hierarchy, 80% for Part, and 90% for Supplier, and Algorithm 1 provides the order cps. The ratios of estimated variance to observed variance for all sequences are plotted in the first set of bars in Figure 3. We see that the variance from the cps and csp sequences closely matches the observed variance, while the other sequences severely underestimate it. This suggests that determining the first-stage clustering attributes is crucial.
Combination 2. We now change the availability percentages to 80% for the Customer hierarchy, 1% for Part, and 90% for Supplier. Algorithm 1 provides the sequence psc. From Figure 3 (middle set of bars), we observe that the variance estimated by the psc sequence is closest to the observed variance.
Combination 3. We set the availability percentages to 80% for the Customer hierarchy, 90% for Part, and 10% for Supplier. Algorithm 1 provides the order scp. We observe from the last set of bars in Figure 3 that the estimated variance for the sequence scp is very close to the observed variance (ratio of 1).
The above examples validate the insight behind Algorithm 1 that the relative number of clusters in the hierarchies determines the relative variance contributions of the different stages: the fewer the clusters for a clustering attribute, the greater its contribution to the overall variance, with the first stage being the most crucial.
7. Extending to General Queries
We now discuss extending the ideas to queries containing Group By and Having clauses, and to queries with nested subqueries.
Group Bys. Approximating queries with Group By follows the same procedure as basic queries, with estimation and variance estimation done groupwise. There is a special case that applies whenever cluster sampling (or single-stage sampling) is employed: if the grouping attributes and the clustering attributes are the same, then for any group we either obtain the exact aggregate values or do not obtain the group at all. This case occurs when data is unavailable for a single hierarchy and the clustering attributes of that hierarchy are exactly the grouping attributes of the query.
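A small illustration of this special case: when the grouping key coincides with the clustering key, restricting the data to the available cluster keys yields an exact aggregate for every group that survives. The helper below is hypothetical, not from the paper:

```python
from collections import defaultdict

def groupwise_exact(rows, sampled_keys):
    """rows: (group_key, value) pairs where the grouping key is also the
    clustering key; sampled_keys: the set of cluster keys whose data is
    available. A cluster sample keeps each group either whole or not at
    all, so every returned per-group total is exact."""
    totals = defaultdict(float)
    for key, value in rows:
        if key in sampled_keys:
            totals[key] += value
    return dict(totals)
```

Groups whose cluster key was lost simply do not appear in the output; no partial (and hence biased) group totals are produced.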
Nested query predicates. A predicate in the Where clause that is a nested query should be a correlated subquery for the query to be approximable. Let O be the set of relations in the outer query that are correlated with the inner query, and let I be the set of relations in the nested subquery. For a given query and cohash scheme, we construct the following graph:

For every relation in O and I, introduce a vertex.

For every correlation or join condition between a relation in O and a relation in I, if the two relations belong to the same hierarchy and the condition logically implies equality of their partitioning attributes, add an edge between the corresponding vertices.

After the construction, the number of connected components should equal the number of relations in I in order for the query to be approximable. The intuition behind this condition is that, for a nested query predicate to be handled, the nested subquery should process only tuples that are clustered according to the correlated tuples in the outer query.
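The component-counting condition can be sketched with a small union-find. The function name, the input encoding, and the edge semantics (an edge for every correlation/join condition that keeps two relations clustered together) are assumptions made for this illustration:

```python
def nested_predicate_approximable(outer_rels, inner_rels, colocated_edges):
    """Check the component-count condition for a correlated nested-query
    predicate: build a graph over the correlated outer relations and the
    inner relations, with an edge whenever the correlation/join condition
    keeps two relations clustered together, and require that the number
    of connected components equal the number of inner relations."""
    rels = list(outer_rels) + list(inner_rels)
    parent = {r: r for r in rels}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for r, s in colocated_edges:
        parent[find(r)] = find(s)  # union the two components

    components = {find(r) for r in rels}
    return len(components) == len(inner_rels)
```

If an outer correlated relation is not clustered with any inner relation, it forms an extra component and the check fails.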
Nested query relations. When there are multiple input relations, every relation or nested subquery in the From clause should result in a uniform random cluster sample.
Having clauses. We can support Having clauses when queries fall under the special case of Group By clauses described above. That is, if data is missing from a single hierarchy and the clustering attributes are exactly the grouping attributes of the query, then we obtain complete groups and hence can apply the predicates in the Having clause to provide unbiased estimates and confidence intervals.
8. Evaluation of CoHashing
We now present empirical evidence of the accuracy of approximate answers obtained from a sample of the data that is cohash partitioned.
Experimental Setup. We run all experiments on System X, using 100GB TPCH data sets, both uniform and skewed. We generate the skewed data set using the tool from (tpchSkew) with a medium-skew parameter setting. Since our goal is not to study the already established performance benefits of cohash partitioning, but to study its benefits for obtaining estimates with confidence bounds in failure/straggler scenarios, the particular system used is not important.
Partitioning Designs. We experiment with cohash partitioning designs validated and recommended by previous work (locationAware). There are three designs we consider:

SDWithoutRedundancy: This is the schema driven (SDWithout) partitioning design recommended in (locationAware) as the best cohash partitioning strategy without redundancy for the TPCH benchmark schema. This is depicted in Figure 0(b).

SDWithRedundancy: This is the schema driven (SDWith) partitioning design suggested in (locationAware) as the best cohash partitioning strategy with tuplelevel redundancy. This is graphically presented in Figure 3(a); the root of the cohashing hierarchy is relation Lineitem and all other tables are colocated through it.

WD: This is the workload driven (WD) partitioning design suggested by the algorithm in (locationAware). It is presented in Figure 3(b). It has two hierarchies with relation level redundancy for tables Lineitem, Orders and Part along with tuple level redundancy.
The choice of design determines not only the accuracy of estimation for queries, but also the feasibility of approximation: depending on the database partitioning design, it may or may not be possible to provide approximate answers to a query, as we will see next.
Q          1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  Total Y
SDWithout  Y  N  Y  Y  Y  Y  Y  Y  Y  Y   N   Y   Y   Y   N   N   N   Y   Y   N   Y   N   15
SDWith     Y  N  Y  N  Y  Y  Y  Y  Y  Y   N   Y   N   Y   N   N   N   N   Y   N   N   N   11
WD         Y  N  Y  N  Y  Y  Y  Y  Y  Y   N   Y   N   Y   N   N   Y   N   Y   N   N   N   12
Queries. The TPCH benchmark consists of 22 analytical queries, of which 11 have nested predicates. Table 4 lists the queries that are approximable for each of the three designs we consider; ‘Y’ indicates that the query is approximable for that design and ‘N’ indicates that it is not. Notice that for SDWithout, 15 queries are approximable, whereas for SDWith only 11 queries are approximable; queries 4, 13, 18, and 21, which have nested aggregates, are approximable by SDWithout but not by the other two designs. Query 17 is approximable only by WD. However, as we will see later, SDWith provides estimates with higher accuracy.
8.1. Single Relation
We first experiment with single table queries; we consider Query 6 from the TPCH benchmark, which involves a single table, Lineitem. We vary the availability of the data, i.e., the availability of data for the respective hierarchies in all three designs, and the selectivity of the predicate on the Lineitem relation. We present the results on accuracy for the uniform data set in Figure 5.
The errors for all designs are very low and decrease with increasing availability and selectivity for all three designs. The accuracy of the SDWith design is slightly better than the rest, which suggests that increasing the length of the hierarchy exacerbates errors; the hierarchy length for SDWith is 1, whereas the hierarchy lengths for SDWithout and WD are 3 and 2, respectively.
For the skewed data set (not presented due to space constraints), this does not hold. Designs SDWith and SDWithout provide similar results, with slightly higher error rates (below 5%). However, design WD results in very high errors due to high skew in the number and values of lineitems associated with each part element. This suggests that apart from cluster sizes and length of hierarchy, skew needs to be taken into account for partitioning designs.
8.2. A single hierarchy
Next, we look at Query 3 of the benchmark, which involves the join of Customer, Orders and Lineitem. We present results for the uniform and skewed data sets in Figures 5(a) and 5(b), respectively. For both datasets, the accuracy of all three designs improves with availability and selectivity. Although the errors for the skewed dataset are much higher than for the uniform dataset, at higher availability (and selectivity) the errors come within 2%. Even when 80% of the data is unavailable, we obtain results within 5% accuracy.
8.3. Multiple hierarchies
We now consider Query 9 of the benchmark, which involves a join over Orders, Lineitem, Part, PartSupp, Supplier and Nation. This involves data across two hierarchies for SDWithout and WD, but a single hierarchy for SDWith. We vary the availability of all the hierarchies involved and plot the accuracies obtained in Figure 7 for the uniform data set. We observe that even at 1% availability, all three designs are accurate within 2% error. With the skewed data set, however, SDWith provides similar errors, but SDWithout and WD produce very high errors due to the skewed number of tuples colocated with each root tuple.
8.4. Highly selective query
Query 19 of the TPCH benchmark is a highly selective query; it involves the join of Lineitem and Part, and the result contains only 0.001% of the Lineitem table. For the SDWith and WD designs this involves a single hierarchy, whereas for SDWithout it involves two hierarchies. We vary the availability of all hierarchies and plot the results in Figure 8. We observe that SDWith and WD provide estimates with higher accuracy than SDWithout at lower availabilities; at 20% availability, all designs provide estimates within 5% error. For the skewed data set, SDWith provides errors within 5%, whereas estimates from the other two designs have high errors since most samples do not satisfy the predicates.
8.5. Query with nested predicates
8.6. Queries with Group By
We now discuss obtaining approximate results for queries with Group By. We consider Queries 3 and 9 with their Group By operations and measure how the percentage of missing groups changes with availability for the different designs. The accuracy of the estimated aggregate values is similar to the results obtained without Group By.
Query 9. The true result has a total of 175 groups and the grouping attributes involve n_name and o_orderdate. For the uniform data set, all groups are present in the result at 1% availability. For the skewed data set, however, we need 80% availability to obtain about 80% of the groups and 95% availability to obtain most of the groups. (SDWith performs better than SDWithout and WD and obtains all groups by about 85% availability.)
Query 3. The grouping attributes in this query are l_orderkey, o_orderdate, and o_shippriority, and the query requests the top 10 results. At about 80% availability, SDWithout provides 8 of the top 10 results, and at about 90% availability it consistently provides all 10. For SDWith and WD, we need a high availability of 95% to get about 8 matches. For the skewed data set, this effect is exacerbated. This is primarily because the gap in aggregated values between groups is not large, and hence even a small error can cause a group to be eliminated from the top 10.
8.7. Summary of experimental results
Our experimental results show that, even at very low availability, we can obtain estimates with high accuracy for aggregate queries by exploiting the statistical properties of cohash partitioned databases. We also found that errors can increase with the size of the hierarchy, the number of unavailable hierarchies, and high skew. But for failures and stragglers, where we expect a reasonable amount (at least 20%) of the data to be available, cohash partitioning can be exploited to obtain approximate answers instead of errors.
9. Conclusions
In this paper we proposed a novel idea: exploit cohash partitioning, an efficient partitioning strategy used in distributed systems, to address the failures and stragglers that affect data analytics users. The key idea is to determine the statistical properties of the data that remains available after a failure/straggler event, and to use them to provide approximate answers with error bounds. We validated the idea through extensive experiments on the TPCH benchmark; in particular, we found that the accuracy of approximate answers increases when the join paths and group-by attributes of a query match the hierarchies embedded in the cohash partitioning. One particularly promising avenue for future research is to exploit the same idea to provide approximate answers in the nonfailure case, where users deliberately choose approximate query processing. Applying cohash partitioning over factorization trees (fdbEngine; fdb) to provide quick approximations is another interesting direction for future work.