Adaptive Sampling for Rapidly Matching Histograms

08/20/2017 · Stephen Macke, et al. · Shanghai Jiao Tong University · University of Illinois at Urbana-Champaign

In exploratory data analysis, analysts often have a need to identify histograms that possess a specific distribution, among a large class of candidate histograms, e.g., find histograms of countries whose income distribution is most similar to that of Greece. This distribution could be a new one that the user is curious about, or a known distribution from an existing histogram visualization. At present, this process of identification is brute-force, requiring the manual generation and evaluation of a large number of histograms. We present FastMatch: an end-to-end architecture for interactively retrieving the histogram visualizations that are most similar to a user-specified target, from a large collection of histograms. The primary technical contribution underlying FastMatch is a sublinear algorithm, HistSim, a theoretically sound sampling-based approach to identify the top-k closest histograms under ℓ_1 distance. While HistSim can be used independently, within FastMatch we couple HistSim with a novel system architecture that is aware of practical considerations, employing block-based sampling policies and asynchronous statistics and computation, building on lightweight sampling engines developed in recent work. In our experiments on several real-world datasets, FastMatch obtains near-perfect accuracy with up to 100× speedups over less sophisticated approaches.


1 Introduction

In exploratory data analysis, analysts often generate and peruse a large number of visualizations to identify those that match desired criteria. This process of iterative “generate and test” occupies a large part of visual data analysis [13, 33, 62], and is often cumbersome and time consuming, especially on very large datasets that are increasingly the norm. This process ends up impeding interaction, preventing exploration, and delaying the extraction of insights.

Example 1: Census Data Exploration. Alice is exploring a census dataset consisting of hundreds of millions of tuples, with attributes such as gender, occupation, nationality, ethnicity, religion, adjusted income, net assets, and so on. In particular, she is interested in understanding how applying various filters impacts the relative distribution of tuples with different attribute values. She might ask questions like Q1: Which countries have similar distributions of wealth to that of Greece? Q2: In the United States, which professions have an ethnicity distribution similar to the profession of doctor? Q3: Which (nationality, religion) pairs have a similar distribution of number of children to Christian families in France?

Example 2: Taxi Data Exploration.

Bob is exploring the distribution of taxi trip times originating from various locations around Manhattan. Specifically, he plots a histogram showing the distribution of taxi pickup times for trips originating from various locations. As he varies the location, he examines how the histogram changes, and he notices that choosing the location of a popular nightclub skews the distribution of pickup times heavily in the range of 3am to 5am. He wonders

Q4: Where are the other locations around Manhattan that have similar distributions of pickup times? Q5: Do they all have nightclubs, or are there different reasons for the late-night pickups?

Example 3: Sales Data Exploration. Carol has the complete history of all sales at a large online shopping website. Since users must enter birthdays in order to create accounts, she is able to plot the age distribution of purchasers for any given product. To enhance the website’s recommendation engine, she is considering recommending products with similar purchaser age distributions. To test the merit of this idea, she first wishes to perform a series of queries of the form Q6: Which products were purchased by users with ages most closely following the distribution for a certain product—a particular brand of shoes, or a particular book, for example? Carol wishes to perform this query for a few test products before integrating this feature into the recommendation pipeline.

These cases represent scenarios that often arise in exploratory data analysis—finding matches to a specific distribution. The focus of this paper is to develop techniques for rapidly exploring a large class of histograms to find those that match a user-specified target.

Referring to Q1 in the first example, a typical workflow used by Alice may be the following: first, pick a country. Generate the corresponding histogram. This could be done either using a language like R, Python, or JavaScript, with the visualization generated in ggplot [73] or D3 [15], or using interactions in a visualization platform like Tableau [70]. Does the visualization look similar to that of Greece? If not, pick another, generate it, and repeat. Else, record it, pick another, generate it, and repeat. If only a select few countries have similar distributions, she may spend a huge amount of time sifting through her data, or may simply give up early.

The Need for Approximation. Even if Alice generates all of the candidate histograms (i.e., one for each country) in a single pass, programmatically selecting the closest match to her target (i.e., the Greece histogram), this could take unacceptably long. If the dataset is tens of gigabytes and every tuple in her census dataset contributes to some histogram, then any exact method must necessarily process tens of gigabytes—on a typical workstation, this can take tens of seconds even for in-memory data. Recent work suggests that latencies greater than 500ms cause significant frustration for end-users and lead them to test fewer hypotheses and potentially identify fewer insights [54]. Thus, in this work, we explore approximate techniques that can return matching histogram visualizations with accuracy guarantees, but much faster.

One tempting approach is to employ approximation using pre-computed samples [7, 6, 5, 10, 31, 28], or pre-computed sketches or other summaries [18, 60, 77]. Unfortunately, in an interactive exploration setting, pre-computed samples or summaries are not helpful, since the workload is unpredictable and changes rapidly, with more than half of the queries issued one week completely absent in the following week, and more than 90% of the queries issued one week completely absent a month later [58]. In our case, based on the results for one matching query, Alice may be prompted to explore different (and arbitrary) slices of the same data, which can be exponential in the number of attributes in the dataset. Instead, we materialize samples on-the-fly, which doesn’t suffer from the same limitations and has been employed for generating approximate visualizations incrementally [64], and while preserving ordering and perceptual guarantees [46, 8]. To the best of our knowledge, however, on-demand approximate sampling techniques have not been applied to the problem of evaluating a large number of visualizations for matches in parallel.

Key Research Challenges. In developing an approximation-based approach for rapid histogram matching we immediately encounter a number of theoretical and practical challenges.

1. Quantifying Importance. To benefit from approximation, we need to be able to quantify which samples are "important" to facilitate progress towards termination. It is not clear how to assess this importance: at one extreme, it may be preferable to sample more from candidate histograms that are more "uncertain", but these histograms may already be known to be rather far away from the target. Another approach is to sample more from candidate histograms at the "boundary" of the top-k, but if these histograms are more "certain", refining them further may be useless. Another challenge is deciding when to quantify the importance of samples: one approach would be to reassess importance every time new data become available, but this approach could be computationally costly.

2. Deciding to Terminate. Our algorithm needs to ascribe a confidence in the correctness of partial results in order to determine when it may safely terminate. This “confidence quantification” requires performing a statistical test. If we perform this test too often, we spend a significant amount of time doing computation that could be spent performing I/O, and we further lose statistical power since we are performing more tests; if we do not do this test often enough, we may end up taking many more samples than are necessary to terminate.

3. Challenges with Storage Media. When performing sampling from traditional storage media, the cost to fetch samples is locality-dependent; truly random sampling is extremely expensive due to random I/O, while sampling at the level of blocks is much more efficient, but is less random.

4. Communication between Components. It is crucial for our overall system to not be bottlenecked on any component. In particular, the process of quantifying importance (via the sampling manager) must not block the actual I/O performed; otherwise, the time for execution may end up being greater than the time taken by exact methods. As such, these components must proceed asynchronously, while also minimizing communication across them.

Our Contributions. In this paper, we have developed an end-to-end architecture for histogram matching, dubbed FastMatch, addressing the challenges identified above:

1. Importance Quantification Policies. We develop a sampling engine that employs a simple and theoretically well-motivated criterion for deciding whether processing particular portions of data will allow for faster termination. Since the criterion is simple, it is easy to update as we process new data, “understanding” when it has seen enough data for some histogram, or when it needs to take more data to distinguish histograms that are close to each other.

2. Termination Algorithm. We develop a statistics engine that repeatedly performs a lightweight “safe termination” test, based on the idea of performing multiple hypothesis tests for which simultaneous rejection implies correctness of the results. Our statistics engine further quantifies how often to run this test to ensure timely termination without sacrificing too much statistical power.

3. Locality-aware Sampling. To better exploit the locality of storage media, FastMatch samples at the level of blocks, proceeding sequentially. To estimate the benefit of blocks, we leverage bitmap indexes in a cache-conscious manner, evaluating multiple blocks at a time in the same order as their layout in storage. Our technique minimizes the time required for the query output to satisfy our probabilistic guarantees.

4. Decoupling Components. Our system decouples the overhead of deciding which samples to take from the actual I/O used to read the samples from storage. In particular, our sampling engine utilizes a just-in-time lookahead technique that marks blocks for reading or skipping while the I/O proceeds unhindered, in parallel.

Overall, we implement FastMatch within the context of a bitmap-based sampling engine, which allows us to quickly determine whether a given memory or disk block could contain samples matching ad-hoc predicates. Such engines were found to effectively support approximate generation of visualizations in recent work [8, 46, 64].

We find that our approximation-based techniques, working in tandem with our novel systems components, lead to substantial speedups over exact methods; moreover, unlike less sophisticated variants of FastMatch, whose performance can be highly data-dependent, FastMatch consistently brings latency to near-interactive levels.

Related Work. To the best of our knowledge, there has been no work on sampling to identify histograms that match user specifications. Sampling-based techniques have been applied to generate visualizations that preserve visual properties [8, 46], and for incremental generation of time-series and heat-maps [64]—all focusing on the generation of a single visualization. Similarly, Pangloss [57] employs approximation via the Sample+Seek approach [28] to generate a single visualization early, while minimizing error. One system uses workload-aware indexes called "VisTrees" [29] to facilitate sampling for interactive generation of histograms without error guarantees. M4 uses rasterization without sampling to reduce the dimensionality of a time-series visualization and generate it faster [43]. SeeDB [71] recommends visualizations to help distinguish between two subsets of data while employing approximation. However, their techniques are tailored to evaluating differences between pairs of visualizations, where the two visualizations in a pair share the same axes but different pairs do not. In our case, we need to compare one target visualization against many others, all of which have the same axes and comparable distances; hence, their techniques do not generalize.

Recent work has developed zenvisage [67], a visual exploration tool, including operations that identify visualizations similar to a target. However, to identify matches, zenvisage does not consider sampling, and requires at least one complete pass through the dataset. FastMatch was developed as a back-end with such interfaces in mind to support rapid discovery of relevant visualizations.

Outline. Section 2 articulates the formal problem of identifying the top-k closest histograms to a target. Section 3 introduces our HistSim algorithm for solving this problem, while Section 4 describes the system architecture that implements this algorithm. In Section 5 we perform an empirical evaluation on several real-world datasets. After surveying additional related work in Section 6, we describe several generalizations and extensions of our techniques in Appendix A.

Figure 1: Example visual target and candidate histogram: (a) the target (Greece) and (b) a candidate ($COUNTRY=Italy), each plotting population counts across income brackets.

2 Problem Formulation

Symbol(s) | Description
X, Z, vals(X), vals(Z), R | x-axis (grouping) attribute, candidate attribute, their respective value sets, and the relation over these attributes, used in histogram-generating queries (see Definition 1)
k, δ, ε, s* | User-supplied parameters: number of matching histograms to retrieve, error probability upper bound, approximation error upper bound, and selectivity threshold (below which candidates may optionally be ignored)
T, V_i, V*_i (T̂, V̂_i, V̂*_i) | Visual target, candidate i's estimated (unstarred) and true (starred) histogram counts (normalized variants carry hats)
d(·, ·) | Distance function, used to quantify visual distance (see Definition 2)
m_i, m̃_i, ε_i, p_i, d̂_i (d*_i) | Quantities specific to candidate i during a HistSim run: number of samples taken, estimated samples needed (see Section 4), deviation bound (see Definition 4), confidence upper bound on ε_i-deviation or rareness, and distance estimate from T̂ (true distance from T̂), respectively
m_i^(r), V_i^(r), d̂_i^(r) | Quantities corresponding to samples taken in round r of HistSim stage 2: number of samples taken for candidate i in the round, per-group counts for candidate i from samples taken in the round, and the corresponding distance estimate using only those samples, respectively
M̂, A | Set of matching histograms (see Definition 3) and set of non-pruned histograms, respectively, during a run of HistSim
N_i, N, n₀, f | Number of datapoints corresponding to candidate i, total number of datapoints, number of samples taken during stage 1, and the hypergeometric pdf
Table 1: Summary of notation.

In this section, we formalize the problem of identifying histograms whose distributions match a reference.

2.1 Generation of Histograms

We start with a concrete example of the typical database query an analyst might use to generate a histogram. Returning to our example from Section 1, suppose an analyst is interested in studying how population proportions vary across income brackets for various countries around the world. Suppose she wishes to find countries with populations distributed across different income brackets most similarly to a specific country, such as Greece. Consider the following SQL query, where $COUNTRY is a variable:

SELECT income_bracket, COUNT (*) FROM census
WHERE country=$COUNTRY
GROUP BY income_bracket

This query returns a list of 7 (income bracket, count) pairs to the analyst for a specific country. The analyst may then choose to visualize the results by plotting the counts versus the different income brackets in a histogram, i.e., a plot similar to the right side of Figure 1 (for Italy). Currently, the analyst may examine hundreds of similar histograms, one for each country, comparing each to the one for Greece, to manually identify ones that are similar.

In contrast, the goal of FastMatch is to perform this search automatically and efficiently. Conceptually, FastMatch will iterate over all possible values of country, generate the corresponding histograms, and evaluate the similarity of its distribution (based on some notion of similarity described subsequently) to the corresponding visualization for Greece. In actuality, FastMatch will perform this search all at once, quickly pruning countries that are either clearly close or far from the target.
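The conceptual (exact) version of this search can be written as a single pass over the data followed by a ranking step. The following Python sketch is illustrative only (function and variable names are ours), and it is precisely the exhaustive computation that FastMatch's sampling is designed to avoid:

from collections import defaultdict
def exact_top_k_matches(tuples, target="Greece", k=5, num_brackets=7):
    """Brute-force baseline: one full pass over (country, income_bracket) pairs,
    then rank every candidate by normalized l1 distance to the target."""
    counts = defaultdict(lambda: [0] * num_brackets)
    for country, bracket in tuples:          # bracket is an index in [0, num_brackets)
        counts[country][bracket] += 1
    def normalize(v):
        total = sum(v)
        return [x / total for x in v] if total else v
    t_hat = normalize(counts[target])
    ranked = []
    for country, v in counts.items():
        if country == target:
            continue
        v_hat = normalize(v)
        dist = sum(abs(a - b) for a, b in zip(v_hat, t_hat))
        ranked.append((dist, country))
    return sorted(ranked)[:k]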

Candidate Visualizations. Formally, we consider visualizations as being generated as a result of histogram-generating queries:

Definition 1

A histogram-generating query is a SQL query of the following type:

SELECT X, COUNT(*) FROM R
WHERE Z = z_i GROUP BY X

The table R and attributes X and Z form the query's template.

For each concrete value z_i of the candidate attribute Z specified in the query, the results of the query—i.e., the grouped counts—can be represented in the form of a vector V_i of length b, where b is the cardinality of the value set of the grouping attribute X. This b-tuple can then be used to plot a histogram visualization—in this paper, when we refer to a histogram or a visualization, we will typically be referring to such a b-tuple. For a given grouping attribute X and a candidate attribute Z, we refer to the set of all visualizations generated by letting Z vary over its value set as the set of candidate visualizations. We refer to each distinct value in the grouping attribute X's value set as a group. In our example, X corresponds to income_bracket and Z corresponds to country.

For ease of exposition, we focus on candidate visualizations generated from queries according to Definition 1, having single categorical attributes for X and Z. Our methods are more general and extend naturally to handle (i) predicates: additional predicates on other attributes, (ii) multiple and complex Xs: additional grouping (i.e., X) attributes, groups derived from binning real values (as opposed to categorical X), along with groups derived from binning multiple categorical attribute values together (e.g., quarters instead of individual months), and (iii) multiple and complex Zs: additional candidate (i.e., Z) attributes, as well as candidate attribute values derived from binning real values (as opposed to categorical Z). The flexibility in specifying histogram-generating queries—exponential in the number of attributes—makes it impossible for us to precompute the results of all such queries.
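As one illustration of extension (ii), a real-valued grouping attribute can be reduced to the categorical setting by binning before counts are accumulated; a minimal sketch (the bracket edges here are hypothetical):

import numpy as np
# Map a real-valued grouping attribute (e.g., adjusted income) onto b discrete
# groups, after which the histogram machinery treats it like any categorical X.
income = np.array([12_000, 48_500, 250_000, 7_300, 95_000])
bin_edges = np.array([0, 25_000, 50_000, 100_000, 200_000])  # b = 5 brackets
group_ids = np.digitize(income, bin_edges) - 1               # group indices in [0, b)
counts = np.bincount(group_ids, minlength=len(bin_edges))
print(counts)  # per-bracket counts, i.e., one candidate vector V_i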

Visualization Terminology. Our methods are agnostic to the particular method used to present visualizations. That is, analysts may choose to present the results generated from queries of the form in Definition 1 via line plots, heat maps, choropleths, and other visualization types, as any of these may be specified by an ordered tuple of real values and are thus permitted under our notion of a “candidate visualization”. We focus on bar charts of frequency counts and histograms—these naturally capture aggregations over the categorical or binned quantitative grouping attribute respectively. Although a bar graph plot of frequency counts over a categorical grouping attribute is not technically a histogram, which implies that the grouping attribute is continuous, we loosely use the term “histogram” to refer to both cases in a unified way.

Visual Target Specification. Given our specification of candidate visualizations, a visual target is a b-tuple, denoted by T, with entries T[1], …, T[b], that we need to match the candidates against. Returning to our census example, T would refer to the visualization corresponding to Greece, with T[1] being the count of individuals in the first income bracket, T[2] the count of individuals in the second income bracket, and so on.

Samples. To estimate these candidate visualizations, we need to take samples. In particular, for a given candidate value z_i of the candidate attribute Z, a sample corresponds to a single tuple with attribute value Z = z_i. The tuple's value x_j for the grouping attribute X increments the jth entry of the estimate V_i for the candidate histogram.

Candidate Similarity. Given a set of candidate visualizations with estimated vector representations V_1, …, V_C, such that the ith candidate is generated by selecting on Z = z_i, our problem hinges on finding the candidate whose distribution is most "similar" to the visual target T specified by the analyst. For quantifying visual similarity, we do not care about the absolute counts in V_i and T, and instead prefer to determine whether V_i and T are close in a distributional sense. Using hats to denote the normalized variants of V_i and T, write

V̂_i[j] = V_i[j] / Σ_l V_i[l]  and  T̂[j] = T[j] / Σ_l T[l].

With this notational convenience, we make our notion of similarity explicit by defining candidate distance as follows:

Definition 2

For candidate V_i and visual target T, the distance between V_i and T is defined as

d(V_i, T) = ‖V̂_i − T̂‖_1 = Σ_j | V̂_i[j] − T̂[j] |.

That is, after normalizing the candidate and target vectors so that their respective components sum to 1 (and therefore correspond to distributions), we take the ℓ_1 distance between the two vectors. When the target is understood from context, we denote the distance between candidate V_i and T by d_i.
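For concreteness, a minimal sketch of this distance computation (the helper name is ours, not part of FastMatch):

import numpy as np
def candidate_distance(candidate_counts, target_counts):
    """l1 distance between normalized count vectors (Definition 2).
    It ranges over [0, 2] and equals twice the total variation distance."""
    v_hat = np.asarray(candidate_counts, dtype=float)
    t_hat = np.asarray(target_counts, dtype=float)
    v_hat /= v_hat.sum()
    t_hat /= t_hat.sum()
    return np.abs(v_hat - t_hat).sum()
# A candidate with the same shape but 10x the total count is at distance 0.
print(candidate_distance([10, 30, 60], [100, 300, 600]))  # -> 0.0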

Figure 2: The target (departure hour histogram for ORD), the second closest in normalized ℓ_1 (DAL), and the second closest in normalized ℓ_2 (PHX)
Figure 3: The goldenrod histogram is identical to the blue one post-normalization, but appears very far visually pre-normalization.

The Need for Normalization. A natural question that readers may have is why we chose to normalize each vector prior to taking the distance between them. We do this because the goal of FastMatch is to find visualizations that have similar distributions, as opposed to similar actual values. Returning to our example, if we consider the population distribution of Greece across different income brackets, and compare it to that of other countries, without normalization, we will end up returning other countries with similar population counts in each bin—e.g., other countries with similar overall populations—as opposed to those that have similar shape or distribution. To see an illustration of this, consider Figure 3. The overlaid histogram in goldenrod is identical to the blue one, but we are unable to capture this without normalization.

Choice of Metric Post-Normalization. A similar metric, using ℓ_2 distance between normalized vectors (as opposed to ℓ_1), has been studied in prior work [71, 28] and even validated in a user study in [71]. However, as observed in [12], the ℓ_2 distance between distributions has the drawback that it could be small even for distributions with disjoint support. The ℓ_1 distance metric over discrete probability distributions has a direct correspondence with the traditional statistical distance metric known as total variation distance [32] and does not suffer from this drawback.

Additionally, we sometimes observe that ℓ_2 heavily penalizes candidates with a small number of vector entries with large deviations from each other, even when they are arguably closer visually than the candidates closest in ℓ_2. Consider Figure 2, which depicts histograms generated by one of the queries on a Flights dataset we used in our experiments, corresponding to a histogram of departure time. The target is the Chicago ORD airport, and we are depicting the first non-ORD top-k histogram for both ℓ_1 and ℓ_2 (i.e., the 2nd-ranked histogram for both metrics), among all airports. As one can see in the figure, the middle histogram is arguably "visually closer" to the ORD histogram on the left, but is not considered so by ℓ_2 due to the mismatch at about the 6th hour.

KL-divergence is another possibility as a distance metric, but it has the drawback that it will be infinite for any candidate that places 0 mass in a place where the target places nonzero mass, making it difficult to compare such candidates (note that this follows directly from the definition: D_KL(T̂ ‖ V̂_i) = Σ_j T̂[j] ln(T̂[j] / V̂_i[j]), which diverges whenever V̂_i[j] = 0 and T̂[j] > 0).
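To make the disjoint-support issue noted in [12] concrete, consider two uniform distributions over disjoint halves of a large domain (a toy example of ours):

import numpy as np
n = 1000
p = np.zeros(n); p[: n // 2] = 2.0 / n   # uniform over the first half of the domain
q = np.zeros(n); q[n // 2 :] = 2.0 / n   # uniform over the second half (disjoint support)
print(np.abs(p - q).sum())               # l1 distance = 2.0, the maximum possible
print(np.sqrt(((p - q) ** 2).sum()))     # l2 distance ~ 0.063, deceptively small
# KL(p || q) is infinite here, since q places zero mass wherever p is positive.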

2.2 Guarantees and Problem Statement

Since FastMatch takes samples to estimate the candidate histogram visualizations, and therefore may return incorrect results, we need to enforce probabilistic guarantees on the correctness of the returned results.

First, we introduce some notation: we use V_i to denote the estimate of the ith candidate visualization, while V*_i (with normalized version V̂*_i) is the true candidate visualization computed over the entire dataset. Our formulation also relies on constants ε, δ, and s*, which we assume are either built into the system or provided by the analyst. We further use N and N_i to denote the total number of datapoints and the number of datapoints corresponding to candidate i, respectively.

Guarantee 1

(Separation) Any approximate histogram V_i with selectivity N_i/N ≥ s* that is in the true top-k closest (w.r.t. Definition 2) but not part of the output will be less than ε closer to the target than the furthest histogram that is part of the output. That is, if the algorithm outputs histograms V_{o_1}, …, V_{o_k}, then, for every candidate i not among them, either d*_i > max_{1 ≤ j ≤ k} d*_{o_j} − ε, or N_i/N < s*.

Note that we use “selectivity” as a number and not as a property, matching typical usage in database systems literature [66, 45]. As such, candidates with lower selectivity appear less frequently in the data than candidates with higher selectivity.

Guarantee 2

(Reconstruction) Each approximate histogram V_i output as one of the top-k satisfies ‖V̂_i − V̂*_i‖_1 ≤ ε.

The first guarantee says that any ordering mistakes are relatively innocuous: for any two histograms V_i and V_j, if the algorithm outputs V_j but not V_i, when it should have been the other way around, then either d*_i > d*_j − ε, or N_i/N < s*. The intuition behind the minimum selectivity parameter, s*, is that certain candidates may not appear frequently enough within the data to get a reliable reconstruction of the true underlying distribution responsible for generating the original data, and thus may not be suitable for downstream decision-making. For example, in our income example, a country with a population of 100 may have a histogram similar to the visual target, but this would not be statistically significant. Overall, our guarantee states that we still return a visualization that is quite close to the target, and we can be confident that anything dramatically closer has relatively few total datapoints available within the data (i.e., N_i/N is small).

The second guarantee says that the histograms output are not too dissimilar from the corresponding true distributions that would result from a complete scan of the data. As a result, they form an adequate and accurate proxy from which insights may be derived. With these definitions in place, we now formally state our core problem:

Problem 1

(Top-K-Similar). Given a visual target T, a histogram-generating query template, and parameters k, δ, ε, and s*, display k candidate attribute values (and their accompanying visualizations V_i) as quickly as possible, such that the output satisfies Guarantees 1 and 2 with probability greater than 1 − δ.

3 The HistSim Algorithm

In this section, we discuss how to conceptually solve Problem 1. We outline an algorithm, named HistSim, which allows us to determine confidence levels for whether our separation and reconstruction guarantees hold. We rigorously prove in this section that when our algorithm terminates, it gives correct results with probability greater than 1 − δ regardless of the data given as input. Many systems-level details and other heuristics used to make HistSim perform particularly well in practice will be presented in Section 4. Table 1 provides a description of the notation used.

3.1 Algorithm Outline

Input : Columns Z and X, visual target T, parameters k, δ, ε, s*
Output : Estimates of the top-k closest candidates to T, along with their histograms

Initialization:
    m_i ← 0, V_i ← 0 for every candidate i;

stage 1 (prune rare candidates):
    Sample n₀ tuples uniformly at random without replacement;
    Update m_i and V_i based on the new samples;
    Compute a P-value p_i for the underrepresentation test of each candidate i (null hypothesis: N_i/N ≥ s*);
    Perform a Holm-Bonferroni statistical test at level δ/3 with the P-values p_i;
    A ← candidates not flagged as rare;

stage 2 (identify the top-k), in rounds r = 1, 2, …:
    do
        δ_r ← per-round error budget (chosen so that Σ_r δ_r ≤ δ/3);
        Accumulate the previous round's samples into m_i, V_i, and d̂_i for i ∈ A;
        M̂ ← the k candidates in A with smallest d̂_i;
        τ ← split point between M̂ and A \ M̂ (see Section 3.4.2);
        Take fresh uniform samples from candidates in A (see Section 4);
        Compute round-specific m_i^(r), V_i^(r), and d̂_i^(r) from the fresh samples only;
        Compute a P-value P_i for each null hypothesis H_i induced by τ (Lemma 2);
    while not all H_i can be rejected simultaneously at level δ_r (Lemma 4);

stage 3 (reconstruct the top-k):
    Sample until ε_i ≤ ε for all i ∈ M̂, with overall confidence δ/3;
    Update V_i based on the new samples;
    return M̂ and the histograms V_i for i ∈ M̂;
Algorithm 1 The HistSim algorithm
Figure 4: Illustration of HistSim.

HistSim operates by sampling tuples. Each of these tuples contributes to one or more candidate histograms, using which HistSim constructs the estimates V_1, …, V_C. After taking enough samples corresponding to each candidate, it will eventually be likely that ‖V̂_i − V̂*_i‖_1 is "small", and that |d̂_i − d*_i| is likewise "small", for each candidate i. More precisely, the set of candidates will likely be in a state such that Guarantees 1 and 2 are both satisfied simultaneously.

Stages Overview. HistSim separates its sampling into three stages, each with an error probability of at most δ/3, giving an overall error probability of at most δ:

  • Stage 1 [Prune Rare Candidates]: Sample n₀ datapoints uniformly at random without replacement, so that each candidate is sampled a number of times roughly proportional to the number of datapoints corresponding to that candidate. Identify rare candidates that likely satisfy N_i/N < s*, and prune them.

  • Stage 2 [Identify Top-k]: Take samples from the remaining candidates until the top-k have been identified reliably.

  • Stage 3 [Reconstruct Top-k]: Sample from the estimated top-k until they have been reconstructed reliably.

This separation is important for performance: the pruning step (stage 1) often dramatically reduces the number of candidates that need to be considered in stages 2 and 3.

The first two stages of HistSim factor into phases that are pure I/O and phases that involve one or more statistical tests. The I/O phases sample tuples (the sampling steps of Algorithm 1)—we will describe how in Section 4; our algorithm's correctness is independent of how this happens, provided that the samples are uniform.

Stage 1: Pruning Rare Candidates (Section 3.3). During stage 1, the I/O phase takes n₀ samples, for some n₀ fixed ahead of time. This is followed by updating, for each candidate i, the number of samples observed so far, and using the P-values of a test for underrepresentation to determine whether each candidate is rare, i.e., has N_i/N < s*.

Stage 2: Identifying Top-k (Section 3.4). For stage 2, we focus on a smaller set of candidates; namely, those that we did not find to be rare (denoted by A). Stage 2 is divided into rounds. Each round attempts to use existing samples to estimate which candidates are top-k and which are non top-k, and then draws new samples, testing how unlikely it is to observe the new samples in the event that its guess of the top-k is wrong. If this event is unlikely enough, then it has recovered the correct top-k, with high probability.

At the start of each round, HistSim accumulates any samples taken during the previous round. It then determines the current top-k candidates M̂ and a separation point τ between top-k and non top-k, as this separation point determines a set of hypotheses to test. Then, it begins an I/O phase and takes samples. The samples taken each round are used to generate the number of samples taken per candidate, m_i^(r), the estimates V_i^(r), and the distance estimates d̂_i^(r). These statistics are computed from fresh samples each round (i.e., they do not reuse samples across rounds) so that they may be used in a statistical test, discussed in Section 3.4. After computing the P-values for each null hypothesis to test, HistSim determines whether it can reject all the hypotheses with type 1 error (i.e., probability of mistakenly rejecting a true null hypothesis) bounded by δ_r and break from the loop. If not, it repeats with new samples and a smaller δ_r (where the δ_r are chosen so that the probability of error across all rounds is at most δ/3).

Stage 3: Reconstructing Top-k (Section 3.5). Finally, stage 3 ensures that the identified top-k candidates M̂ all satisfy ‖V̂_i − V̂*_i‖_1 ≤ ε for i ∈ M̂ (so that Guarantee 2 holds), with high probability.

Figure 4 illustrates HistSim stage 2 running on a toy example in which we compute the top-2 closest histograms to a target. In the initial round, it forms an estimate of the top-2 closest candidates, which it refines as rounds progress. As the rounds increase, HistSim takes more samples to get better estimates of the distances and thereby improve the chances of termination when it performs its multiple hypothesis test in stage 2.

Choosing where to sample and how many samples to take. The per-candidate estimates maintained by HistSim allow us to determine which candidates are "important" to sample from in order to allow termination with fewer samples; we return to this in Section 4. Our HistSim algorithm is agnostic to the sampling approach.

Outline. We first discuss the Holm-Bonferroni method for testing multiple statistical hypotheses simultaneously in Section 3.2, since stage 1 of HistSim uses it as a subroutine, and since the simultaneous test in stage 2 is based on similar ideas. In Section 3.3, we discuss stage 1 of HistSim, and prove that upon termination, all candidates flagged for pruning satisfy N_i/N < s* with probability greater than 1 − δ/3. Next, in Section 3.4, we discuss stage 2 of HistSim, and prove that upon termination, we have the guarantee that any non-pruned candidate mistakenly classified as top-k is no more than ε further from the target than the furthest true non-pruned top-k candidate (with high probability). The proof of correctness for stage 2 is the most involved and is divided as follows:

  • In Section 3.4.1, we give lemmas that allow us to relate the reconstruction of the candidate histograms from estimates to the separation guarantee via multiple hypothesis testing;

  • In Section 3.4.2, we describe a method to select appropriate hypotheses to use for testing in the lemmas of Section 3.4.1;

  • In Section 3.4.3, we prove a theorem that enables us to use the samples per candidate histogram to determine the P-values associated with the hypotheses.

In Section 3.5, we discuss stage 3 and conclude with an overall proof of correctness.

3.2 Controlling Family-wise Error

In the first two stages of HistSim, the algorithm needs to perform multiple statistical tests simultaneously [17]. In stage 1, HistSim tests null hypotheses of the form "candidate i is high-selectivity" versus alternatives like "candidate i is not high-selectivity". In this case, "rejecting the null hypothesis at level α" roughly means that the probability that candidate i is actually high-selectivity is at most α. Likewise, during stage 2, HistSim tests null hypotheses of the form "candidate i's true distance from T, namely d*_i, lies above (or below) some fixed value τ." If the algorithm correctly rejects every null hypothesis while controlling the family-wise error [50] at some level, then it has correctly determined which side of τ every d*_i lies on, a fact that we use to get the separation guarantee.

Since stages 1 and 2 test multiple hypotheses at the same time, HistSim needs to control the family-wise type 1 error (false positive) rate of these tests simultaneously. That is, if the family-wise type 1 error is controlled at level δ', then the probability that one or more rejecting tests in the family should not have rejected is less than δ' — during stage 1, this intuitively means that the probability that one or more high-selectivity candidates were deemed to be low-selectivity is at most δ', and during stage 2, this roughly means that the probability of selecting some candidate as top-k when it is non top-k (or vice-versa) is at most δ'.

The reader may be familiar with the Bonferroni correction, which enforces a family-wise error rate of α by requiring a significance level of α/m for each test in a family with m tests in total. We instead use the Holm-Bonferroni method [36], which is uniformly more powerful than the Bonferroni correction, meaning that it needs fewer samples to make the same guarantee. Like its simpler counterpart, it is correct regardless of whether the family of tests has any underlying dependency structure. In brief, a level-α test over a family of size m works by first sorting the P-values of the individual tests in increasing order, P_(1) ≤ ⋯ ≤ P_(m), and then finding the minimal index j (starting from 1) where P_(j) > α/(m − j + 1) (if no such index exists, set j = m + 1). The tests with smaller indices reject their respective null hypotheses at level α, and the remaining ones do not reject.
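A minimal sketch of this procedure (ours, not the FastMatch implementation):

import numpy as np
def holm_bonferroni(p_values, alpha):
    """Return a boolean array marking which null hypotheses are rejected,
    controlling the family-wise error rate at level alpha (Holm's method)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                    # indices of P-values in increasing order
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):       # rank = j - 1 for the j-th smallest P-value
        if p[idx] > alpha / (m - rank):      # compare against alpha / (m - j + 1)
            break                            # this test and all larger P-values fail to reject
        reject[idx] = True
    return reject
print(holm_bonferroni([0.001, 0.04, 0.03, 0.60], alpha=0.05))  # -> [ True False False False]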

3.3 Stage 1: Pruning Rare Candidates

One way to remove rare (i.e. low-selectivity) candidates from processing is to use an index to look up how many tuples correspond to each candidate. While this will work well for some queries, it unfortunately does not work in general, as candidates generated from queries of the form in Definition 1 could have arbitrary predicates attached, which cannot all be indexed ahead-of-time. Thus, we turn to sampling.

To prune rare candidates, we need some way to determine whether each candidate satisfies N_i/N < s* with high probability. To do so, we make the simple observation that, after drawing n₀ tuples without replacement uniformly at random, the number of tuples corresponding to candidate i follows a hypergeometric distribution [42]. The number of samples to take, n₀, is a parameter; we discuss an appropriate choice in our experiments. (Our results are not sensitive to the choice of n₀, provided n₀ is not too small (so that the algorithm fails to prune anything) or too big (i.e., a nontrivial fraction of the data).) That is, if candidate i has N_i total corresponding tuples in a dataset of size N, then the number of tuples for candidate i in a uniform sample without replacement of size n₀ is distributed according to Hypergeometric(N, N_i, n₀). As such, we can make use of a well-known test for underrepresentation [50] to accurately detect when candidate i has N_i/N < s*. The null hypothesis is that candidate i is not underrepresented (i.e., has N_i/N ≥ s*), and letting f denote the hypergeometric pdf in this case, the P-value for the test is given by

p_i = Σ_{j=0}^{x_i} f(j; N, ⌈s* N⌉, n₀),

where x_i is the number of observed tuples for candidate i in the sample of size n₀. Roughly speaking, the P-value measures how surprised we are to observe x_i or fewer tuples for candidate i when N_i/N ≥ s* — the lower the P-value, the more surprised we are.

If we reject the null hypothesis for some candidate when the P-value is at most α, we are claiming that the candidate satisfies N_i/N < s*, and the probability that we are wrong is then at most α. Of course, we need to test every candidate for rareness, not just a given candidate, which is why HistSim stage 1 uses a Holm-Bonferroni procedure to control the family-wise error at any given threshold. We note in passing that the joint probability of observing x_i samples for candidate i across all candidates is a multivariate hypergeometric distribution for which we could perform a similar test without a Holm-Bonferroni procedure, but the CDF of a multivariate hypergeometric is extremely expensive to compute, and we can afford to sacrifice some statistical power for the sake of computational efficiency since we only need to ensure that the candidates pruned are actually rare, without necessarily finding all the rare candidates — that is, we need high precision, not high recall.
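For concreteness, this per-candidate P-value is just a hypergeometric CDF evaluation; a sketch using SciPy, with illustrative parameter values:

import numpy as np
from scipy.stats import hypergeom
def underrepresentation_pvalue(x_i, n0, N, s_star):
    """P-value for the stage-1 rareness test: the probability of seeing x_i or
    fewer tuples for a candidate in a uniform sample of size n0 drawn without
    replacement, under the boundary null hypothesis N_i = ceil(s_star * N)."""
    K = int(np.ceil(s_star * N))           # fewest tuples a non-rare candidate can have
    return hypergeom.cdf(x_i, N, K, n0)    # P[X <= x_i], X ~ Hypergeometric(N, K, n0)
# A candidate seen only 3 times in 10,000 samples of a 10M-tuple dataset, with s* = 1%:
print(underrepresentation_pvalue(x_i=3, n0=10_000, N=10_000_000, s_star=0.01))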

We now prove a lemma regarding correctness of stage 1.

Lemma 1 (Stage 1 Correctness)

After HistSim stage 1 completes, every candidate removed from consideration (i.e., excluded from A) satisfies N_i/N < s*, with probability greater than 1 − δ/3.

This follows immediately from the above discussion, in conjunction with the fact that the P-values generated from each test for underrepresentation are fed into a Holm-Bonferroni procedure that operates at level δ/3, so that the probability of pruning one or more non-rare candidates is bounded above by δ/3.

3.4 Stage 2: Identifying Top- Candidates

HistSim stage 2 attempts to find the top-k closest to the target out of those candidates remaining after stage 1. To facilitate discussion, we first introduce some definitions.

Definition 3

(Matching Candidates) A candidate is called matching if its distance estimate d̂_i is among the k smallest out of all candidates remaining after stage 1.

We denote the (dynamically changing) set of candidates that are matching during a run of HistSim as M̂; we likewise denote the true set of matching candidates out of the remaining, non-pruned candidates in A as M*. Next, we introduce the notion of ε-deviation.

Definition 4

(ε-deviation) The empirical vector of counts V_i for some candidate i has ε_i-deviation if the corresponding normalized vector V̂_i is within ε_i of the exact distribution V̂*_i in ℓ_1 distance. That is,

‖V̂_i − V̂*_i‖_1 ≤ ε_i.

Note that Definition 4 overloads the symbol ε to be candidate-specific by appending a subscript. In Section 3.4.3, we provide a way to quantify ε_i given the samples taken.

If HistSim reaches a state where each matching candidate i ∈ M̂ has ε_i-deviation with ε_i ≤ ε, then it is easy to see that Guarantee 2 holds for the matching candidates. That is, in such a state, if HistSim output the histograms corresponding to the matching candidates, they would look similar to the true histograms. In the following sections, we show that ε_i-deviation can also be used to achieve Guarantee 1.

Notation for Round-Specific Quantities. In the following subsections, we use the superscript "(r)" to indicate quantities corresponding to samples taken during a particular round r of HistSim stage 2, such as m_i^(r) and d̂_i^(r). In particular, these quantities are completely independent of samples taken during previous rounds.

3.4.1 Deviation-Bounds Imply Separation

In order to reason about the separation guarantee, we prove a series of lemmas following the structure of reasoning given below:

  • We show that when a carefully chosen set of null hypotheses are all false, M̂ contains a valid set of top-k closest candidates.

  • Next, we show how to use ε_i-deviation to upper bound the probability of rejecting a single true null hypothesis.

  • Finally, we show how to reject all null hypotheses while controlling the probability of rejecting any true ones.

Lemma 2 (False Nulls Imply Separation)

Consider the set of null hypotheses defined as follows, where τ ∈ [0, 2] is a reference value:

H_i : d*_i ≥ τ for each i ∈ M̂,    and    H_i : d*_i ≤ τ − ε for each i ∈ A \ M̂.

When H_i is false for every i, then M̂ is a set of top-k candidates that is correct with respect to Guarantee 1.

When all the null hypotheses are false, then d*_i < τ for all i ∈ M̂, and d*_j > τ − ε for all j ∈ A \ M̂. This means that

d*_j > d*_i − ε for every i ∈ M̂ and j ∈ A \ M̂,

and thus M̂ is correct with respect to the separation guarantee. Intuitively, Lemma 2 states that when there is some reference point τ such that all of the candidates in M̂ have their true distance smaller than τ, and the rest have their true distance greater than τ − ε, then we have our separation guarantee.

Next, we show how to compute P-values for a single null hypothesis of the type given in Lemma 2. Below, we use "Pr_{H_i}[·]" to denote the probability of some event when hypothesis H_i is true.

Lemma 3 (Distance Deviation Testing)

Let τ be a reference value as in Lemma 2. To test the null hypothesis H_i : d*_i ≥ τ versus the alternative d*_i < τ, we have that, for any a ≥ 0,

Pr_{H_i}[ d̂_i ≤ τ − a ] ≤ Pr[ ‖V̂_i − V̂*_i‖_1 ≥ a ].

Likewise, for testing H_i : d*_i ≤ τ versus the alternative d*_i > τ, we have

Pr_{H_i}[ d̂_i ≥ τ + a ] ≤ Pr[ ‖V̂_i − V̂*_i‖_1 ≥ a ].

We prove the first case; the second is symmetric. Suppose candidate i satisfies H_i, i.e., d*_i ≥ τ. Then, if we take samples from which we construct the random quantities V̂_i and d̂_i, we have that

Pr_{H_i}[ d̂_i ≤ τ − a ] ≤ Pr[ d̂_i ≤ d*_i − a ] = Pr[ d*_i − d̂_i ≥ a ] ≤ Pr[ ‖V̂_i − V̂*_i‖_1 ≥ a ].

Each step follows from the fact that increasing the quantity to the left of the "≤" sign within the probability expression can only increase the probability of the event inside. The first step follows from the assumption that d*_i ≥ τ, and the third step follows from the triangle inequality (|d̂_i − d*_i| ≤ ‖V̂_i − V̂*_i‖_1). We use Lemma 3 in conjunction with Lemma 2 by using the thresholds from Lemma 2 (τ for candidates in M̂, and τ − ε for the rest) as the reference in Lemma 3, for a particular choice of τ (discussed in Section 3.4.2). For example, Lemma 3 shows that when we are testing the null hypothesis for i that d*_i ≥ τ and we observe d̂_i^(r) such that d̂_i^(r) = τ − a, we can use (any upper bound of) Pr[ ‖V̂_i^(r) − V̂*_i‖_1 ≥ a ] as a P-value for this test. That is, consider a tester with the following behavior:

If d̂_i^(r) ≤ τ − a, then reject H_i.

Here, the tester is testing the hypothesis that d*_i is at least τ, but it observes a value d̂_i^(r) that falls below τ by a margin of a. When the true value d*_i is at least the reference τ, the observed statistic will only be a or more below τ (and vice-versa for the symmetric case) when the reconstruction is also bad, in the sense that ‖V̂_i^(r) − V̂*_i‖_1 is at least a. If the above tester rejects only when Pr[ ‖V̂_i^(r) − V̂*_i‖_1 ≥ a ] is at most some level p, then Lemma 3 says that it is guaranteed to reject a true null hypothesis with probability at most p. We discuss how to compute an upper bound on Pr[ ‖V̂_i^(r) − V̂*_i‖_1 ≥ a ] in Section 3.4.3.

Finally, notice that Lemma 3 provides a test which controls the type 1 error of an individual H_i, but we only know that the separation guarantee holds for M̂ when all the hypotheses are false. Thus, the algorithm requires a way to control the type 1 error of a procedure that decides whether to reject every H_i simultaneously. In the next lemma, we give such a tester, which controls the error at any level δ'.

Lemma 4 (Simultaneous Rejection)

Consider any set of null hypotheses H_1, …, H_m, and consider a set of P-values P_1, …, P_m associated with these hypotheses. The tester given by

    reject all of H_1, …, H_m if and only if max_i P_i ≤ δ'

rejects one or more true null hypotheses with probability at most δ'.

Consider the set of true null hypotheses — suppose there are t of them in total (if t = 0, we have nothing to prove), and denote their P-values by P_{l_1}, …, P_{l_t}. Then

    Pr[reject one or more true null hypotheses] = Pr[reject all of H_1, …, H_m] ≤ Pr[P_{l_1} ≤ δ'] ≤ δ'.

The first step follows since null hypotheses are only rejected when they are all rejected. The second step follows since the tester only rejects when all the P-values are at most δ', including P_{l_1}, and the last step follows since P_{l_1} is a valid P-value for a true null hypothesis. Discussion of Lemma 4. At first glance, the multiple hypothesis tester given in Lemma 4, which compares all P-values to the same level δ', seems to be even more powerful than a Holm-Bonferroni tester, which compares P-values to various fractions of δ'. In fact, although based on similar ideas, they are not comparable: a Holm-Bonferroni tester may allow for rejection of a subset of the null hypotheses, whereas the tester of Lemma 4 is "all or nothing". In fact, the tester of Lemma 4 is essentially the union-intersection method formulated in terms of P-values; see [17] for details.
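Putting Lemmas 2 through 4 together, each round's termination check reduces to computing one P-value per surviving candidate and comparing the largest against δ_r. The sketch below is ours: it assumes the hypothesis split described in Lemma 2 and the deviation tail bound developed in Section 3.4.3, and all function names are hypothetical:

import numpy as np
def deviation_pvalue(gap, m, b):
    """Upper bound on P[||Vhat - Vhat*||_1 >= gap] after m samples over b groups,
    via the tail bound 2^b * exp(-m * gap^2 / 2) discussed in Section 3.4.3."""
    log_p = b * np.log(2.0) - m * gap ** 2 / 2.0
    return float(min(1.0, np.exp(log_p)))
def can_terminate(d_hat_round, m_round, b, in_top_k, tau, eps, delta_r):
    """Lemma 4's all-or-nothing test: reject every null hypothesis induced by the
    split point tau iff all per-candidate P-values are at most delta_r."""
    p_values = []
    for d_hat, m, is_top in zip(d_hat_round, m_round, in_top_k):
        # Margin by which this round's estimate clears its hypothesis boundary.
        gap = (tau - d_hat) if is_top else (d_hat - (tau - eps))
        p_values.append(1.0 if gap <= 0 else deviation_pvalue(gap, m, b))
    return max(p_values) <= delta_r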

3.4.2 Selecting Each Round’s Tests

Figure 5: Illustration of HistSim choosing the split point when testing whether the separation and reconstruction guarantees hold.

Each round of HistSim stage 2 constructs a family of tests to perform whose family-wise error probability is at most δ_r. At round r (starting from r = 1), δ_r is chosen so that the per-round budgets sum to at most δ/3, ensuring that the error probability across all rounds is at most δ/3 via a union bound (see Lemma 5 for details).

There is still one degree of freedom: namely, how to choose the split point τ used for the null hypotheses in Lemma 2. In Algorithm 1, it is chosen to be the midpoint between the largest estimated distance within M̂ and the smallest estimated distance outside M̂. The intuition for this choice is as follows. Although the quantities V_i^(r) and d̂_i^(r) are generated from fresh samples in each round of HistSim stage 2, the quantities V_i and d̂_i are generated from samples taken across all rounds of HistSim stage 2. As such, as rounds progress (i.e., if the testing procedure fails to simultaneously reject multiple times), the estimates V̂_i and d̂_i become closer to V̂*_i and d*_i, the set M̂ becomes more likely to coincide with M*, and the null hypotheses chosen become less likely to be true provided τ is chosen somewhere between max_{i∈M̂} d̂_i and min_{j∉M̂} d̂_j, since values in this interval are likely to correctly separate M̂ from the rest as more and more samples are taken. In the interest of simplicity, we simply choose the midpoint halfway between the furthest candidate in M̂ and the closest candidate not in M̂. For example, at the iteration shown in Figure 5, τ lies halfway between these two candidates. In practice, we observe that max_{i∈M̂} d̂_i and min_{j∉M̂} d̂_j are typically very close to each other, so that the algorithm is not very sensitive to the choice of τ, so long as it falls between them.

Figure 5 illustrates this choice of τ on our toy example. As in Figure 4, the boundary of M̂ is represented by the dashed box. The split point τ is located at the rightmost boundary of the dashed box. The margins (i.e., the amounts by which the round estimates d̂_i^(r) deviate from τ) determine the P-values associated with the null hypotheses H_i, which ultimately determine whether HistSim stage 2 can terminate, as we discuss more in the next section.
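A minimal sketch of this split-point choice (names are ours):

import numpy as np
def choose_split_point(d_hat, k):
    """Pick tau halfway between the k-th smallest estimated distance (the furthest
    candidate currently in M-hat) and the (k+1)-th smallest (the closest outside)."""
    order = np.argsort(d_hat)
    top_k = order[:k]                               # current estimate of M-hat
    tau = (d_hat[order[k - 1]] + d_hat[order[k]]) / 2.0
    return tau, top_k
d_hat = np.array([0.12, 0.45, 0.18, 0.90, 0.33])
tau, top_k = choose_split_point(d_hat, k=2)
print(tau, top_k)   # tau = (0.18 + 0.33) / 2 = 0.255; M-hat = {candidates 0, 2}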

3.4.3 Deviation-Bounds Given Samples

The previous section provides us a way to check whether the rankings induced by the empirical distances are correct with high probability. This was facilitated via a test which measures our "surprise" at the observed distance estimates if the current estimate M̂ is not correct with respect to Guarantee 1, which in turn used a test for how likely some candidate's true distance is to lie above (or below) some threshold after taking samples. We now provide a theorem that allows us to infer, given the samples taken for a given candidate, how to relate the deviation bound ε_i to the probability with which the candidate can fail to respect it. The bound seems to be known to the theoretical computer science community as a "folklore fact" [27]; we give a proof for the sake of completeness. Our proof relies on repeated application of the method of bounded differences [56] in order to exploit some special structure in the ℓ_1 distance metric. The bound developed is information-theoretically optimal; that is, it takes asymptotically the fewest samples required to guarantee that an empirical distribution estimated from the samples will be no further than ε from the true distribution.

Theorem 1

Suppose we have taken m samples with replacement for some candidate i's histogram, resulting in the empirical estimate V̂_i. Then V̂_i has ε-deviation with probability greater than 1 − δ for ε = sqrt(2 (b ln 2 + ln(1/δ)) / m). That is, with probability greater than 1 − δ, we have ‖V̂_i − V̂*_i‖_1 ≤ sqrt(2 (b ln 2 + ln(1/δ)) / m).

In fact, this theorem also holds if we sample without replacement; we return to this point in Section 4. For j = 1, …, b, we use V[j] to denote the number of occurrences of attribute value x_j among the m samples, and the normalized count V̂[j] = V[j] / m is our estimate of V̂*[j], the true proportion of tuples having value x_j for attribute X. Note that we have omitted the candidate subscript i for clarity.

We need to introduce some machinery. Consider functions of the form

f_s(V̂) = Σ_{j=1}^{b} s_j (V̂[j] − V̂*[j]),  where s ∈ {−1, +1}^b.

Let S be the set of all such functions, where |S| = 2^b, since there are 2^b such sign vectors. For any f_s ∈ S, consider the random variable f_s(V̂).

By linearity of expectation, it is clear that E[f_s(V̂)] = 0, since each V̂*[j] is constant and E[V̂[j]] = V̂*[j] for each j. Since each V̂[j] is a function of the samples taken, each f_s(V̂) is likewise uniquely determined from the samples, so we can write f_s(V̂) = f_s(t_1, …, t_m), where each sample t_l is a random variable distributed according to V̂*. Note that the function f_s satisfies the Lipschitz property

| f_s(t_1, …, t_l, …, t_m) − f_s(t_1, …, t_l′, …, t_m) | ≤ 2/m

for any l and any pair of samples t_l, t_l′. For example, this will occur with equality if s assigns opposite signs to the groups of t_l and t_l′; that is, if f_s assigns opposite signs to t_l and t_l′, then changing this single sample moves 1/m of the empirical mass in such a way that it does not get canceled out. We may therefore apply the method of bounded differences [56] to yield the following McDiarmid inequality—a generalization of the standard Hoeffding inequality:

Pr[ f_s(V̂) − E[f_s(V̂)] ≥ ε ] ≤ exp(−m ε² / 2).

Recalling that E[f_s(V̂)] = 0, this actually says that

Pr[ f_s(V̂) ≥ ε ] ≤ exp(−m ε² / 2).

This holds for any f_s ∈ S. Union bounding over all 2^b such functions, we have that

Pr[ ∃ f_s ∈ S : f_s(V̂) ≥ ε ] ≤ 2^b exp(−m ε² / 2).

If this does not happen (i.e., for every f_s ∈ S, we have f_s(V̂) < ε), then we have that ‖V̂ − V̂*‖_1 < ε, since for any attribute value j, |V̂[j] − V̂*[j]| = s_j (V̂[j] − V̂*[j]) for the appropriate choice of sign s_j. Conversely, if ‖V̂ − V̂*‖_1 ≥ ε, this means that we must have some f_s ∈ S (the one whose signs match those of V̂ − V̂*) such that

f_s(V̂) = ‖V̂ − V̂*‖_1 ≥ ε.

As such, 2^b exp(−m ε² / 2) is an upper bound on Pr[ ‖V̂ − V̂*‖_1 ≥ ε ]. The desired result follows from noting that setting 2^b exp(−m ε² / 2) ≤ δ and solving for ε yields ε = sqrt(2 (b ln 2 + ln(1/δ)) / m).

Optimality of the bound in Theorem 1. If we solve for m in Theorem 1, we see that we must have m ≥ 2 (b ln 2 + ln(1/δ)) / ε². That is, Θ((b + ln(1/δ)) / ε²) samples are necessary to guarantee that the empirical discrete distribution V̂ is no further than ε from the true discrete distribution V̂*, with high probability. This matches the information-theoretic lower bound noted in prior work [12, 20, 26, 72].
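For a sense of the magnitudes involved, the bound can be evaluated numerically; the sketch below assumes the form ε = sqrt(2 (b ln 2 + ln(1/δ)) / m) derived above (helper names are ours):

import numpy as np
def deviation_bound(m, b, delta):
    """Deviation eps such that ||Vhat - Vhat*||_1 <= eps holds with probability
    greater than 1 - delta after m samples over b groups (Theorem 1)."""
    return np.sqrt(2.0 * (b * np.log(2.0) + np.log(1.0 / delta)) / m)
def samples_needed(eps, b, delta):
    """Invert the bound: smallest m guaranteeing eps-deviation with probability > 1 - delta."""
    return int(np.ceil(2.0 * (b * np.log(2.0) + np.log(1.0 / delta)) / eps ** 2))
print(deviation_bound(m=50_000, b=24, delta=0.01))  # ~0.029
print(samples_needed(eps=0.05, b=24, delta=0.01))   # ~17,000 samples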

Generating P-values from Theorem 1. We use the above bound to generate P-values for testing the null hypotheses in Lemma 2. From the discussion in that lemma, a tester which rejects for when it observes