VizRec: A framework for secure data exploration via visual representation

11/01/2018 · Lorenzo De Stefani, et al. · Brown University, MIT

Visual representations of data (visualizations) are tools of great importance and widespread use in data analytics, as they provide users with visual insight into patterns in the observed data in a simple and effective way. However, since visualization tools are applied to sample data, there is a risk of visualizing random fluctuations in the sample rather than a true pattern in the data. This problem becomes even more significant when visualization is used to identify interesting patterns among many possibilities, or to identify an interesting deviation in a pair of observations among many possible pairs, as commonly done in visual recommendation systems. We present VizRec, a framework for improving the performance of visual recommendation systems by quantifying the statistical significance of recommended visualizations. The proposed methodology makes it possible to control the probability of misleading visual recommendations using both classical statistical testing procedures and a novel application of the Vapnik-Chervonenkis (VC) dimension, a fundamental concept in statistical learning theory.


1 Introduction

Visual recommendation engines, such as SeeDB [33], Voyager2 [34], Rank-By-Feature [27], Show Me [19], MuVE [9], VizDeck [16], DeepEye [23], or Draco-Learn [21], aim to help users to more quickly explore a dataset and find interesting insights. To achieve that goal they use widely different approaches and techniques. For example, SeeDB [33] makes recommendations based on a reference view; it tries to find a visualization which is very different from the one the user has currently on the screen. In contrast, DeepEye [23] tries to generally recommend a good visualization for a given dataset based on previously generated visualizations.

However, all these systems have in common that they can significantly increase the risk of finding false insights. This happens the moment a visualization is not just a pretty picture but a tool presenting facts about the data to the user. For example, consider a user exploring a dataset containing information about different wines. After browsing the data for a bit, she creates a visualization of ranking by origin showing that wines from France are rated higher. If her only takeaway from the visual is that in this particular dataset wines from France have a higher rating, there is no risk of a false insight: essentially, no inference happens, as she is completely aware that the next dataset could look entirely different. However, it is neither in the nature of users to constrain themselves to such thinking [36], nor would such a visualization be very insightful in many cases. Rather, based on the visualization she would most likely infer that French wines are generally rated higher, generalizing the insight from her visualization beyond the dataset at hand and thus creating an actually interesting insight. Statistically savvy users will now test whether this generalization is actually statistically valid using an appropriate test. Even more technically savvy users will also consider the other hypotheses they tried and adjust the statistical testing procedure to account for the multiple comparisons problem. This is important, as every additional hypothesis, explicitly expressed as a test or implicitly observed through a visualization, increases the risk of finding insights which are just spurious effects.

However, what happens when the visualization recommendations are generated by one of the above systems? First and most importantly, the user does not know whether the effect shown by the visualization is actually significant or not. Even worse, she cannot use a standard statistical method and "simply" test the effect shown in the visualization for significance. Visual recommendation engines potentially check thousands of visualizations for their interesting-factor (e.g., in the case of SeeDB, how different the visualization is from the current one) in just a few seconds. As a result, by testing thousands of visualizations it is almost guaranteed that the system will find something "interesting", regardless of whether the observed phenomenon is actually statistically valid. A test for the significance of a recommended visualization should therefore consider the whole history of tests done by the recommendation engine.

Advocates of visual recommendation engines usually argue that visual recommendation systems are meant as hypothesis generation engines, whose output should always be validated on a separate hold-out dataset. While this is a valid method to control false discoveries, it is also important to understand its implications: (1) None, really none, of the insights found on the exploration dataset should be regarded as an actual insight before it is validated. This is clearly problematic if one observation may steer the user towards another during the exploration. (2) Splitting a dataset into an exploration set and a hold-out set can significantly reduce the power (i.e., the chance to find actual true phenomena). (3) The hold-out set needs to be controlled for the multiple-hypothesis problem unless the user only wants to use it exactly once for a single test.

In this paper, we present an alternative approach, called VizRec, a framework to make visual recommendation engines "safe". We focus on the visual recommendation technique proposed by SeeDB, as it uses a clear semantics for what "interesting" means. However, our techniques can be adapted to other visualization frameworks or even hypothesis generation tools, like Data Polygamy, as long as the "interesting" criterion can be expressed as a statistical test. The core idea of VizRec is that it not only evaluates the interesting-factor using a statistical test, but also automatically adjusts the significance value based on the search space of the recommendation engine to avoid the multiple-hypothesis pitfall. We make the following contributions:

  • We formalize the process of making visualization recommendations as statistical hypothesis testing.

  • We discuss different possible approaches to making visualization recommendation engines safe, based on classical statistical testing and on the use of Chernoff-type large deviation bounds. We further discuss how the performance of both approaches decreases due to the necessity of accounting for a high number of potentially adaptively chosen tests.

  • We propose a method based on the use of the VC dimension, which allows controlling the probability of observing false discoveries during the visualization recommendation process. VizRec allows control of the Family-Wise Error Rate (FWER) at a given level $\delta$.

  • We evaluate the performance of our system in comparison with SeeDB via an extensive experimental analysis.

The remainder of this paper is organized as follows: In Section 2 we define the visualization recommendation problem in rigorous probabilistic terms. In Section 3 we discuss possible approaches to the visualization recommendation problem and highlight how both suffer from the necessity of accounting for a high number of statistical tests. In Section 4 we introduce our VizRec approach based on the VC dimension and argue how it overcomes the problems discussed in Section 3. In Section 5 we discuss some guidelines of practical interest for implementation, while in Section 6 we present an extensive experimental evaluation of the effectiveness of VizRec.

2 Problem Statement

In this section we first describe informally how SeeDB [33] makes visual recommendations, and then formalize the problem of providing statistically valid visualization recommendations.

2.1 SeeDB

SeeDB makes recommendations based on the currently shown visualization and the corresponding reference query. SeeDB explores possible recommendations (target queries), most commonly by adding to or changing the composition of the reference query. To rank the recommendations, SeeDB recommends to the user the most interesting target queries based on their deviation from the reference; that is, SeeDB assumes a larger deviation indicates a more interesting target query. SeeDB can use different types of measures to quantify the difference between reference and target visualization; however, the earth mover's distance or the KL-divergence are most commonly used. Furthermore, SeeDB discards uninteresting visualizations if the deviation value (e.g., KL-divergence) is below a certain threshold, which can be seen as a minimum visual distance.

To showcase how SeeDB can recommend spurious correlations, we used a survey conducted on Amazon Mechanical Turk [3] with answers for (mostly unrelated) multiple-choice questions. Questions range from Would you rather drive an electric car or a gas-powered car? to What is your highest achieved education level? and Do you play Pokemon Go?.

Figure 1: An example of SeeDB [33] on survey data. (a) Reference View; (b)-(d) Recommended Views 1-3.

Suppose the user analyzes U.S. voting trends over this dataset as shown in Figure 1(a). Based on this reference view created by the user, the actual SeeDB algorithm over our dataset would recommend Figures 1(b)-1(d) as visualizations (the settings of SeeDB are similar to the ones used in Figure 1 of [33]). At first, all of the visualizations recommended by SeeDB seem interesting, as they clearly indicate a trend reversal and a correlation between certain beliefs and voting behavior. However, these trends could also be purely random discoveries: the dataset may, by chance, have produced such bar graphs for these particular queries. Indeed, in Section 6.1 we show that some of these recommendations are not statistically safe.

Having a larger dataset may help an analyst to be more certain that the recommendations are actually significant. However, considering the total number of samples used for these visualizations raises the fundamental question of how big the data actually needs to be in order to guarantee that there are no false discoveries. Further, when automatically exploring the dataset, with enough filtering (as done by SeeDB) every dataset at some point becomes "too small" to guarantee anything. In the following, we formalize the recommendation process of SeeDB and introduce the visual recommendation problem and its relation to hypothesis testing.

2.2 Problem set-up

Let $\mathcal{U}$ denote the global sample space, that is, the set comprising the records of the entire population of interest (e.g., the records of all US citizens). We can imagine $\mathcal{U}$ in the form of a two-dimensional relational table with rows and columns, where each column represents a feature or attribute of the records.

In many practical scenarios the analyst is in possession of a sample $D \subseteq \mathcal{U}$, composed of a much smaller number of records $n = |D| \ll |\mathcal{U}|$, and aims to estimate the properties of $\mathcal{U}$ by analyzing the properties of the smaller dataset.

In this work we assume that the input to the recommendation engine is a dataset $D$ consisting of $n$ records chosen uniformly at random and independently from the universe $\mathcal{U}$ (e.g., our survey data). We refer to $D$ as the sample dataset or the training dataset, and we denote by $\pi$ the probability distribution that generated the sample $D$. Alternatively, one can consider $D$ as a sample of size $n$ from a distribution $\pi$ that corresponds to a possibly infinite domain.

As discussed in Section 1, rather than enabling users to only interpret visualizations for some particular dataset, we want to make sure that, when they try to generalize results, a system provides them with statistical guarantees. Hence, we consider the input not as universal, but as a random sample of the universe, and we focus on observations in $D$ that apply to the entire universe $\mathcal{U}$. For example, "prices in city centers are higher than in neighboring districts" instead of "prices in the city center for apartments A, B, C, ... in Manhattan are higher than for apartments X, Y, Z, ... in the neighboring districts of the Bronx, Queens and Brooklyn, for the data collected on July 21st, 2017".

We divide the features (or attributes) of the records in $\mathcal{U}$, and hence $D$, into four groups:

  1. binary features, taking values in $\{0, 1\}$ (e.g., unit has AC);

  2. discrete features with a total order (e.g., #bedrooms, #bathrooms);

  3. continuous features with a total order (e.g., price of the unit);

  4. categorical features, taking values in a finite unordered domain (e.g., city, zipcode).

We assume that each feature is equipped with a natural metric. This is achieved by mapping the values of a feature to a real number (e.g., for a boolean feature by mapping the value true to $1$ and false to $0$).

2.3 Visualizations

A common form of visualization used in the slice-and-dice exploration setting of an OLAP (OnLine Analytical Processing) data cube is the bar graph. We formalize the type of visualizations we investigate in our approach in the following way:

Definition 1.

A visualization $V$ is a tuple consisting of a group-by feature $X$ and a filter predicate $F$, which can be represented as a bar graph and which describes the result of a query of the form

    SELECT X, COUNT(Y) FROM D WHERE F GROUP BY X

with the aggregate COUNT(Y), represented on the y-axis, being partitioned according to the values of the discrete feature X, after restricting the records of the input dataset D to the subset for which the filter predicate F holds.

The support of a visualization $V$ is the number of records of $D$ which satisfy the predicate $F$, and is denoted as $s(V)$. The selectivity of a visualization $V$, denoted as $\sigma(V)$, is defined as the fraction of records of $D$ which satisfy $F$, that is

$$\sigma(V) = \frac{s(V)}{|D|}. \qquad (1)$$

Note that if the group-by attribute being selected is not discrete, it will also be necessary to determine a finite set of ranges for its values, or "buckets", to be used in the visualization.
The aggregate COUNT(Y) counts the records which satisfy the query predicate, grouped according to the values of the feature X, henceforth referred to as the group-by feature.

While in this work we focus on the COUNT(Y) aggregate, our approach can be extended with minor modifications to the AVG(Y) aggregate, which is given by the average of the values of Y over the records which satisfy the query predicate, grouped according to the values of the group-by feature.

Though other aggregates like MIN(Y), MAX(Y), and SUM(Y) are used in systems like SeeDB [33], we believe that they do not add value over the COUNT(Y) or AVG(Y) aggregates. Even worse, they may lead to misleading visualizations: the MIN(Y) and MAX(Y) aggregates are inherently not suited to represent statistically significant behavior of distributions. Rather than using MIN(Y) or MAX(Y), a user should consider conditional expectations (e.g., aggregates of the form $\mathbb{E}[Y \mid Y \ge c]$ for some constant $c$) to explore extreme values, following the concepts of extreme-value theory as described in [12, 10]. While our results can be easily generalized to other types of visualizations (e.g., heat maps), for the sake of applicability, in this work we focus on the type captured by Definition 1.

Visualizations as distributions: When considering the COUNT aggregate, it is possible to interpret the data distribution represented by a histogram visualization as the probability mass function (pmf) of the discrete random variable which takes the values of the group-by feature X, each with probability equal to the count of records in the corresponding column normalized by the support $s(V)$ of the visualization. Such a distribution does indeed correspond to the distribution of the values of the group-by feature X with respect to the distribution $\pi$ after conditioning (or filtering) on the predicate F associated with V. This correspondence between visualizations and distributions provides us with a natural criterion to compare visualizations by evaluating their statistical difference.
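To make this correspondence concrete, the following sketch (plain Python over a hypothetical list-of-dicts dataset with illustrative feature names) computes the empirical pmf of a COUNT visualization in the sense of Definition 1: it filters the records by a predicate, groups them by a feature, and normalizes the counts by the support. It is a minimal illustration, not the authors' implementation.

    from collections import Counter

    def visualization_pmf(records, group_by, predicate):
        """Empirical pmf of a COUNT visualization: filter by `predicate`,
        group by `group_by`, and normalize counts by the support."""
        filtered = [r for r in records if predicate(r)]    # records satisfying F
        support = len(filtered)                            # s(V)
        if support == 0:
            return {}, 0
        counts = Counter(r[group_by] for r in filtered)    # COUNT(*) GROUP BY X
        pmf = {value: c / support for value, c in counts.items()}
        return pmf, support

    # Hypothetical toy dataset and query: group wines by 'origin',
    # restricted to records with rating above 90.
    records = [
        {"origin": "France", "rating": 92},
        {"origin": "France", "rating": 88},
        {"origin": "Italy", "rating": 95},
        {"origin": "Italy", "rating": 91},
        {"origin": "Spain", "rating": 89},
    ]
    pmf, support = visualization_pmf(records, "origin", lambda r: r["rating"] > 90)
    print(pmf, support)   # {'France': 0.33..., 'Italy': 0.66...} 3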

2.4 Visualization recommendations

Following the paradigm of SeeDB [33], given a first "starting" visualization we aim to identify other "interesting" visualizations to be recommended. In particular, this follows the widespread mantra of "Overview first, zoom and filter, then details-on-demand" [28], where a visualization system ideally lets the user first pick an interesting reference visualization and then helps them automate the zoom, filter, and details-on-demand tasks.

In this work we define a visualization $V'$ to be interesting with respect to a starting visualization $V$ if $V$ and $V'$ are different, that is, if they represent a different statistical behavior (i.e., a different distribution) of the common group-by feature X under the predicates associated with $V$ and $V'$, respectively. Consistently, the greater the difference between the reference $V$ and the candidate $V'$, the higher the interest of $V'$ as a candidate recommendation for $V$.

Note that the constraint according to which a possible recommended visualization must share the same group-by feature as the starting visualization is a simple consequence of the fact that we are interested in studying how different filter predicate conditions influence the behavior of the same feature X. In the absence of such a constraint, the analyst would be comparing how different features behave, leading to observations of questionable interest.

Whereas the general problem of recommending interesting visualizations can be framed either as anomaly detection or according to the mantra of [28], we focus on the latter in this work. That is, we say that a candidate recommendation visualization $V'$ is interesting with respect to a reference visualization $V$ if the two are "different enough", i.e., if their distance according to some distance measure $d(\cdot, \cdot)$ reaches a threshold $T$.

In its simplest form, $T$ may be zero. However, it makes more sense to define $T$ in terms of a minimum visual distance required by a user to spot a difference [30] when shown both the reference and a candidate visualization of interest.

While there is a high degree of generality in the selection of the notion of difference between visualizations, in this work we leverage the correspondence, following Definition 1, between a visualization and the pmf of the group-by feature conditioned on the filter of the visualization, and we measure the difference between visualizations based on the difference between the associated pmfs.

In this work, we use the Chebyshev distance to quantify the difference between visualizations. Given two pmfs $p$ and $q$ over the same support set, the Chebyshev distance between $p$ and $q$ is given by:

$$d_\infty(p, q) = \max_{x} |p(x) - q(x)|, \qquad (2)$$

where $p(x)$ (resp., $q(x)$) denotes the probability of a random variable taking value $x$ according to the distribution $p$ (resp., $q$).

This choice of difference metric is particularly appropriate when comparing visualizations, as it highlights the maximum difference between the relative frequencies of a given value of the group-by feature according to the conditional distributions given by the filter predicates of the two visualizations being considered. In other words, when comparing two histogram visualizations the Chebyshev distance captures the maximum difference between pairs of corresponding columns of the histograms.
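As a small illustration (a sketch, not the paper's code), the Chebyshev distance between two empirical pmfs represented as Python dictionaries can be computed as follows; values missing from one pmf are treated as having probability zero.

    def chebyshev_distance(p, q):
        """d_inf(p, q) = max_x |p(x) - q(x)| over the union of supports."""
        support = set(p) | set(q)
        return max(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

    p = {"France": 0.5, "Italy": 0.3, "Spain": 0.2}
    q = {"France": 0.2, "Italy": 0.3, "Spain": 0.5}
    print(chebyshev_distance(p, q))  # 0.3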

3 Statistically safe visualizations and recommendations

While the statistical pitfalls of exploratory data analysis are well understood and documented (in scholarly papers [25, 31], "surprising" statistical discoveries [1, 2], and even a famous cartoon [35]), their connection to visualizations, and specifically to visual recommendation engines, has only recently begun to be rigorously studied [37, 17, 4]. In this section, we formulate the statistical problem in the visual recommendation context and explore simple probabilistic techniques to solve it. A more powerful method, based on statistical learning theory, is presented in Section 4. While, for concreteness, our presentation focuses on the SeeDB paradigm, the proposed model and techniques can be used for other recommendation systems too (e.g., Data Polygamy [5]).

A first crucial observation is that a system that provides visual representations of data and aims to highlight interesting relationships between visualizations should provide tools that allow the analyst to ascertain that the phenomena being observed are actually statistically relevant. A recommender system should ensure that a visualized result displays characteristics that are non-random and visually intelligible. That is, a user looking at two visualizations should be able to understand both that they are different and why they are different, without worrying whether visual features are due to insufficient support or random noise.

Recall that the dataset $D$ available to our visualization decision algorithm is a sample from an underlying distribution $\pi$. Assume that a particular visualization is interesting with respect to $D$. The question we are trying to answer is how likely it is that the visualization is also interesting with respect to the underlying distribution $\pi$.

In probabilistic terms, each query with a filter predicate F corresponds to an event $A$ over $\pi$. For concreteness, consider a visualization of the histogram of a (discrete and finite) variable $X$ conditioned on an event $A$, denoted $X \mid A$.

The true values (in $\mathcal{U}$) of this histogram are given by

$$p_i = \Pr_\pi(X = x_i \mid A), \quad i = 1, \ldots, k.$$

We estimate these values in a dataset $D$ of size $n$ by

$$\hat{p}_i = \frac{|\{ r \in D : X(r) = x_i \text{ and } r \text{ satisfies } A \}|}{|\{ r \in D : r \text{ satisfies } A \}|}. \qquad (3)$$

If in $D$ the histogram of $X \mid A$ is visually different from the histogram of $X \mid A'$ for a second event $A'$, what can we rigorously predict about the difference between the two histograms in $\mathcal{U}$?

Therefore, we say that the difference between two visualizations $V$ and $V'$ is statistically significant if and only if the difference observed between the two in the finite sample $D$ is due to an actual difference between the two histograms with respect to the distribution $\pi$. The recommendation problem, in its general form, thus becomes to recommend a candidate visualization $V'$ for a reference $V$ only if their corresponding histograms are statistically different with respect to the true underlying distribution $\pi$.

Our goal is to verify that interesting visualizations flagged by our algorithm with respect to $D$ generalize to interesting visualizations with respect to $\pi$.

3.1 Classical statistical testing

In the classical statistical testing setting, our problem could be framed in two ways: either as a goodness-of-fit test or as a homogeneity test. In a goodness-of-fit test, for a given starting reference visualization and a candidate visualization recommendation, the hypothesis considered is whether certain statistical attributes of the reference visualization (i.e., the expected attributes) fit the corresponding observed attributes of the candidate query. Classical tests include the single $\chi^2$-test for discrete distributions or the Kolmogorov-Smirnov test for continuous random variables. In the visualization context, a candidate query would thus be selected as interesting when the null hypothesis of the attributes being similar is rejected. A homogeneity test, on the other hand, tests the hypothesis that two samples were generated by the same underlying distribution, which may be unknown. This is done, for example, by a two-sample $\chi^2$-test, or by a test comparing one column across two histograms. A system based on homogeneity tests would then select a candidate query $V'$ as interesting iff the sample corresponding to $V'$ is not homogeneous with the underlying data distribution of the reference query $V$.

However, there are major difficulties in applying standard statistical tests to the visualization problem. First, depending on the input data the correct test needs to be selected. For example, when using a $\chi^2$-test over discrete attributes, no bucket must be empty, and a general rule of thumb to make sure the estimates are reliable is to have a minimum number of samples per bucket. Further, there should be enough samples to actually use the $\chi^2$-test; otherwise, Fisher's exact test should be used for small sample sizes. In addition to each test being applicable only to certain input data, they all guarantee a different notion of interest. A user that is presented with the results of one or multiple tests will usually not be able to immediately connect the results to the notion of a significant visual difference as described in Section 2.4. This brings up the problem of how comparable the results of two different tests actually are in terms of visual difference.

Second, when blindly throwing statistical tests at the visualization problem to deal with the different types of input data and the different sample sizes that queries return, the question is whether the hypotheses being tested are not too simple to recognize visual differences in a meaningful way. Consider a test that essentially checks whether the observed mean resembles the expected mean. Naturally, a consequence is that if they differ, the candidate query gets recommended. This may however lead to many wrong recommendations merely because the null hypothesis used is too simple and gets rejected too often. The solution could be to use a test better suited for the problem, e.g., a $\chi^2$-test. However, as we show in Section 6.3, a $\chi^2$-test is not best suited to spot a notion of statistically significant visual difference and comes with its own problems, as pointed out in [7]. There is no free lunch and thus no universal single test that solves the visualization recommendation problem in general.

Third, most tests offer merely asymptotic guarantees because of the test statistic they use. This is especially problematic for skewed distributions or for queries that return only a small number of rows. Consider once again a $\chi^2$-test and a heavily skewed distribution over a number of buckets. It is then very unlikely that the test statistic, for the number of samples typically available in a visualization setting, is already $\chi^2$-distributed.

Fourth, recommendations based on tests are not necessarily symmetric, in the sense that if the candidate query were used as the reference query, the old reference query would not necessarily get recommended at all. This is especially true for goodness-of-fit tests such as the $\chi^2$-test.

Lastly, one might be tempted to simply combine statistical testing with an arbitrary cutoff or a subsequent selection of visualizations based on the distance measure introduced in Equation 2. That is, an algorithm could first apply statistical testing to obtain a candidate set of potentially interesting visualizations, rank them by the distance measure, and then select as interesting all visualizations whose distance to the reference visualization is higher than $T$. However, this approach merely delays the problem of potentially making false discoveries: though according to the tests the visualizations may indeed be different, within the guarantees the employed tests offer, they do not necessarily differ by a significant enough margin. This again can be shown using the example in Section 6.3.
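For concreteness, the following sketch shows the kind of classical pipeline discussed above: a two-sample chi-squared homogeneity test (via SciPy's chi2_contingency) applied to the raw bar counts of each candidate against the reference, with a Bonferroni correction over the number of candidates. The dataset, counts, and significance level are made up for illustration, and the sketch inherits all the caveats listed in this section (minimum cell counts, asymptotic validity, choice of test).

    import numpy as np
    from scipy.stats import chi2_contingency

    def bonferroni_chi2_recommend(ref_counts, candidate_counts, alpha=0.05):
        """Flag candidates whose count histogram differs from the reference
        according to a chi-squared homogeneity test, Bonferroni-corrected
        for the number of candidates tested."""
        m = len(candidate_counts)                  # number of hypotheses tested
        corrected_alpha = alpha / m                # Bonferroni correction
        recommended = []
        for name, counts in candidate_counts.items():
            table = np.array([ref_counts, counts])     # 2 x k contingency table
            chi2, p_value, dof, expected = chi2_contingency(table)
            if p_value < corrected_alpha:
                recommended.append((name, p_value))
        return sorted(recommended, key=lambda t: t[1])

    ref = [40, 35, 25]                              # reference bar counts
    candidates = {"view_a": [10, 60, 30], "view_b": [42, 33, 25]}
    print(bonferroni_chi2_recommend(ref, candidates))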

3.2 Recommendation validation via estimation

The goal of VizRec is to provide an efficient and rigorous way to verify that two visualizations are indeed statistically different with “finite-sample” guarantees.

In VizRec we use the sample dataset $D$, and the visualizations obtained from it, in order to obtain approximations of the visualizations according to the entire global sample space $\mathcal{U}$, using first the Chernoff bound and later the VC dimension.

Consider a single histogram visualization $V$, and assume it is comprised of $k$ bars, one for each of the possible values $x_1, \ldots, x_k$ of the chosen group-by feature $X$. Let $p_1, \ldots, p_k$ denote the normalized bars corresponding to $V$ in $\mathcal{U}$. Note that each bar $p_i$ denotes the probability of a randomly chosen record from $\mathcal{U}$ being such that $X = x_i$, conditioned on the fact that the record satisfies the predicate $F$ associated with $V$.

Using the sample set $D$ we compute an approximation of each $p_i$, which we denote as $\hat{p}_i$, following (3). As only these approximations can be computed from the available data, any choice regarding which visualizations should be recommended may depend only on such approximations.

In order to provide guarantees regarding the reliability of such decisions, it is necessary to bound the maximum difference between the correct and the estimated sizes of the bars in the normalized histograms.

In particular, for a given $\delta$ (i.e., our level of control for false positive recommendations) we want to compute the minimum value $\epsilon$ such that $\Pr\left(\max_i |\hat{p}_i - p_i| > \epsilon\right) \le \delta$. Such a value would in turn quantify the accuracy of the estimation of the $p_i$'s obtained by means of their empirical counterparts $\hat{p}_i$.

Let $F$ denote the predicate associated with our visualization $V$. We denote as $\mathcal{U}_F$ (resp., $D_F$) the subset of $\mathcal{U}$ (resp., $D$) composed of those records that satisfy the predicate $F$. Given a choice of group-by attribute $X$, the value $p_i$ (resp., $\hat{p}_i$) corresponds to (resp., is computed as) the relative frequency of records such that $X = x_i$ in $\mathcal{U}_F$ (resp., $D_F$).

Fact 1.

Let $D$ be a uniform random sample of $\mathcal{U}$ composed of $n$ records. For any choice of predicate $F$, as specified in Definition 1, the subset $D_F$ is a uniform random sample of $\mathcal{U}_F$ of size $|D_F|$.

Fact 1 is a straightforward consequence of the fact that $D$ is a uniform sample of $\mathcal{U}$.

From Fact 1 and from the definitions of the $p_i$'s and of the $\hat{p}_i$'s, it clearly follows that the $\hat{p}_i$'s are unbiased estimators for the $p_i$'s. That is, for every value $x_i$ of the group-by feature $X$ we have:

$$\mathbb{E}[\hat{p}_i] = p_i. \qquad (4)$$

In order to bound the estimation error it is therefore sufficient to bound the deviation from expectation (i.e., $p_i$) of the empirical estimate (i.e., $\hat{p}_i$).

Chernoff-type bounds [20] allow us to obtain such bounds as:

$$\Pr\left(|\hat{p}_i - p_i| \ge \epsilon\right) \le 2e^{-2\epsilon^2 s(V)}. \qquad (5)$$

Recalling our definition of the selectivity of a visualization as $\sigma(V) = s(V)/n$, we can rewrite (5) as:

$$\Pr\left(|\hat{p}_i - p_i| \ge \epsilon\right) \le 2e^{-2\epsilon^2 \sigma(V)\, n}, \qquad (6)$$

where $n = |D|$. A clear consequence of (6) is that the higher (resp., the lower) the selectivity of a visualization, the higher (resp., the lower) the quality of the estimate.

In [17], the authors use the same kind of bounds to develop a sampling algorithm which ensures the visual property of relative ordering. That is, for any pair of bars corresponding to two possible values $x_i$ and $x_j$ of the same group-by feature $X$, if $p_i \ge p_j$, then, with high probability, $\hat{p}_i \ge \hat{p}_j$ holds as well.

While the method previously described based on an application of the Chernoff bound appears to be very useful and practical, it is important to remark that in a single application it may only offer guarantees on the quality of the approximation of one bar from a single visualization.

While it is in general possible to combine multiple applications of the Chernoff bound, the required correction leads to a quick and marked decrease in the quality of the bound. As an example, if our visualization is composed of $k$ bars, a bound on the quality of the approximation of all of the bars will result in:

$$\Pr\left(\max_{1 \le i \le k} |\hat{p}_i - p_i| \ge \epsilon\right) \le 2k\, e^{-2\epsilon^2 \sigma(V)\, n}.$$

This bound is obtained using the union bound [20]. While potentially tolerable for a small value of $k$, for large values the performance decrease can possibly lead to a complete loss of significance of the bound itself.
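The following sketch makes the computation implied by these bounds explicit, assuming the Hoeffding-style form reconstructed in (5)-(6): given the support of a visualization, the number of bars k, and a failure probability delta, it returns the epsilon for which the union bound guarantees that all k bars are simultaneously within epsilon of their true values.

    import math

    def union_bound_epsilon(support, num_bars, delta):
        """Smallest eps such that 2 * k * exp(-2 * eps^2 * support) <= delta,
        i.e. all k bars are within eps of their true value w.p. >= 1 - delta."""
        return math.sqrt(math.log(2 * num_bars / delta) / (2 * support))

    # Example: 500 records satisfy the filter, 8 bars, delta = 0.05.
    print(union_bound_epsilon(500, 8, 0.05))   # ~0.076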

3.3 Correcting for Adaptive Multi-Comparisons

The previous section showed how we can evaluate the interest of a single visualization as a statistical test. However, this does not yet control the multiple comparisons problem which is inherent in the recommendation process that explores many possible visualizations. Clearly, if we let a recommendation system explore an unlimited number of possible visualizations, it will eventually find an "interesting" one, even in random data. How do we distinguish truly interesting visualizations from spurious ones that are the result of random fluctuations in the sample?

The simplest and safest method to avoid the problem is to test every visualization recommendation on an independent sample not used during the exploration process that led to this recommendation. While easy and safe, this method is clearly not practical for a process that explores many possible visualizations. Data is limited, and we cannot set aside a holdout sample for each possible test. The process needs access to as much data as possible in the exploration process in order to discover all interesting insights. Can we control the generalization error when computing a number of visualizations based on one set of data?

Assume that in our exploration of possible interesting visualizations we tried $m$ different visualization patterns, and we computed for each of these patterns a bound $\delta_i$, $1 \le i \le m$, on the probability that the corresponding observation in the sample does not generalize to the distribution over the entire global sample space $\mathcal{U}$. It is tempting to conclude that the probability that any of the $m$ visualizations does not generalize is bounded by $\sum_{i=1}^{m} \delta_i$. Unfortunately, this probability is actually much larger when the choice of the tested visualizations depends on the outcome of prior tests.

This phenomenon is often referred to as Freedman's paradox, and the only known practical approach to correct for it is to sum the error probabilities of all possible tests, not only of the tests actually executed. (Theoretical methods, such as differential privacy [8], claim to offer an alternative way to address this issue; in practice, however, the signal is lost in the added randomization before the approach becomes practical.) Note that standard statistical techniques for controlling the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR) require that the collection of tests is fixed independently of the data and therefore do not apply to the adaptive exploration scenario.

In the visualization setting, we could decide a priori that we only consider visualizations from a particular set of patterns, say conditioning on no more than $f$ features. Such a restriction defines a bound on the total size of the search space, say $N$. If we now explore the search space and recommend visualizations that pass the individual visualization test with confidence level $\delta/N$, we are guaranteed that the probability that any of our recommendations does not generalize is bounded by $\delta$. As we show in the experiments section, this method is only effective for relatively small search spaces. Next, we present a novel technique that is significantly more powerful in large and more complex search spaces.
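The a-priori correction can be made concrete with a small counting sketch, under the illustrative assumption that predicates are conjunctions over at most a fixed number of features, each contributing one of a fixed number of simple clauses (or being left unused): the resulting bound N on the search-space size determines the level delta/N at which every individual visualization test must be run.

    from math import comb

    def search_space_bound(num_features, clauses_per_feature, max_active_features):
        """Upper bound on the number of candidate predicates: choose up to
        `max_active_features` features and one clause per chosen feature."""
        return sum(comb(num_features, j) * clauses_per_feature ** j
                   for j in range(max_active_features + 1))

    def per_test_level(delta, num_features, clauses_per_feature, max_active_features):
        """Bonferroni-style level at which each individual visualization test
        must be run so that the overall FWER is at most delta."""
        n_tests = search_space_bound(num_features, clauses_per_feature,
                                     max_active_features)
        return delta / n_tests

    # Example: 20 features, 5 possible clauses each, predicates use at most 2 features.
    print(search_space_bound(20, 5, 2))       # 4851
    print(per_test_level(0.05, 20, 5, 2))     # ~1.03e-05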

4 Statistical Guarantees Via Uniform Convergence Bounds

We now present a powerful alternative approach, providing strong and practical statistical guarantees through a novel application of VC-dimension theory.

VC-dimension is usually considered a highly theoretical concept, with limited practical applications, mostly because of the difficulty in estimating the value of the VC-dimension of interesting learning concept classes.

Surprisingly, we develop a simple and effective method to compute the VC-dimension of a class of predicate queries which are used to generate the visualizations according to Definition 1, which leads to a practical and efficient solution to our problem. We start with a brief overview of the VC theory and its application to sample complexity. We then discuss its specific application to our visualization problem, emphasizing a simple, efficient, and easy to compute reduction of the theory to practical applications.

4.1 VC dimension

The Vapnik-Chervonenkis (VC) dimension is a measure of the complexity or expressiveness of a family of indicator functions (or, equivalently, a family of subsets) [32]. Formally, VC-dimension is defined on range spaces:

Definition 2.

A range space is a pair $(\mathcal{X}, \mathcal{R})$ where $\mathcal{X}$ is a (finite or infinite) set and $\mathcal{R}$ is a (finite or infinite) family of subsets of $\mathcal{X}$. The members of $\mathcal{X}$ are called points and those of $\mathcal{R}$ are called ranges.

Note that both $\mathcal{X}$ and $\mathcal{R}$ can be infinite. Consider now a projection of the ranges onto a finite set of points $S \subseteq \mathcal{X}$:

Definition 3.

Let $(\mathcal{X}, \mathcal{R})$ be a range space and let $S$ be a finite set of points in $\mathcal{X}$.

  1. The projection of $\mathcal{R}$ on $S$ is defined as $P_{\mathcal{R}}(S) = \{ r \cap S : r \in \mathcal{R} \}$.

  2. If $|P_{\mathcal{R}}(S)| = 2^{|S|}$, then $S$ is said to be shattered by $\mathcal{R}$.

The VC-dimension of a range space is the cardinality of the largest set shattered by the space:

Definition 4.

Let $(\mathcal{X}, \mathcal{R})$ be a range space. The VC-dimension of $(\mathcal{X}, \mathcal{R})$, denoted $VC(\mathcal{X}, \mathcal{R})$, is the maximum cardinality of a subset of $\mathcal{X}$ shattered by $\mathcal{R}$. If there are arbitrarily large shattered subsets, then $VC(\mathcal{X}, \mathcal{R}) = \infty$.

Note that a range space with an arbitrarily large (or infinite) set of points and an arbitrarily large family of ranges can have bounded VC-dimension (see Section 4.2).

A simple example is the range space of closed intervals on the real line, where $\mathcal{X} = \mathbb{R}$ and $\mathcal{R}$ corresponds to the (infinite) set of all possible closed intervals $[a, b]$ with $a \le b$. Let $S = \{x_1, x_2, x_3\}$ be any subset of $\mathbb{R}$ with $x_1 < x_2 < x_3$. No interval in $\mathcal{R}$ can define the subset $\{x_1, x_3\}$, so the VC-dimension of this range space is $2$. This observation is generalized in the well known result:

Lemma 1.

The VC-dimension of the range space whose ranges are unions of $k$ closed intervals in $\mathbb{R}$ equals $2k$.

The VC-dimension allows us to characterize the sample complexity of a learning problem; that is, it yields a tradeoff between the number of sample points observed by a learning algorithm and the performance achievable by the algorithm itself.

Consider a range space $(\mathcal{X}, \mathcal{R})$ and a fixed range $r \in \mathcal{R}$. If we sample uniformly at random a set $S$ of size $m$, we know that the fraction $|S \cap r|/|S|$ rapidly converges to the frequency of elements of $r$ in $\mathcal{X}$. Furthermore, there are standard bounds (Chernoff, Hoeffding) for evaluating the quality of this approximation. The question becomes much harder when we want to estimate simultaneously the sizes or frequencies of all ranges in $\mathcal{R}$ using one sample of $m$ elements. A finite VC-dimension implies an explicit upper bound on the number of random samples needed to achieve that within pre-defined error bounds (the uniform convergence property).

For a formal definition we need to distinguish between a finite $\mathcal{X}$, in which case we estimate the relative sizes $|r|/|\mathcal{X}|$, and an infinite $\mathcal{X}$, where we estimate $\Pr(r)$, the frequency of $r$ under a uniform distribution over $\mathcal{X}$.

Definition 5 (Absolute approximation).

Let $(\mathcal{X}, \mathcal{R})$ be a range space and let $\epsilon \in (0, 1)$. A subset $S \subseteq \mathcal{X}$ is an absolute $\epsilon$-approximation for $(\mathcal{X}, \mathcal{R})$ iff for all $r \in \mathcal{R}$ we have, for finite $\mathcal{X}$,

$$\left| \frac{|r|}{|\mathcal{X}|} - \frac{|S \cap r|}{|S|} \right| \le \epsilon. \qquad (7)$$

The authors of [14] show an interesting connection between the VC dimension of a range space and the number of samples which are necessary in order to obtain absolute $\epsilon$-approximations of the range space itself:

Theorem 1 (Sample complexity [14]).

Let $(\mathcal{X}, \mathcal{R})$ be a range space of VC-dimension at most $d$, and let $\epsilon, \delta \in (0, 1)$. Then, there exists an absolute positive constant $c$ such that any random subset $S \subseteq \mathcal{X}$ of cardinality

$$|S| \ge \frac{c}{\epsilon^2}\left(d + \ln\frac{1}{\delta}\right) \qquad (8)$$

is an absolute $\epsilon$-approximation for $(\mathcal{X}, \mathcal{R})$ with probability at least $1 - \delta$.

The constant $c$ was shown experimentally [18] to have a small value, which we also use in our experimental evaluation.
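The bound of Theorem 1 can be used in both directions, as in the sketch below; the constant c is left as an explicit parameter (the default 0.5 is only a placeholder, standing in for the experimentally determined value reported in [18]).

    import math

    def required_sample_size(vc_dim, eps, delta, c=0.5):
        """Sample size from Theorem 1: |S| >= (c / eps^2) * (d + ln(1/delta)).
        The constant c is a parameter; 0.5 is only a placeholder default."""
        return math.ceil(c / eps ** 2 * (vc_dim + math.log(1.0 / delta)))

    def guaranteed_epsilon(vc_dim, sample_size, delta, c=0.5):
        """Invert Theorem 1: the eps-approximation guarantee of a given sample."""
        return math.sqrt(c / sample_size * (vc_dim + math.log(1.0 / delta)))

    print(required_sample_size(vc_dim=4, eps=0.05, delta=0.05))        # 1400
    print(guaranteed_epsilon(vc_dim=4, sample_size=10000, delta=0.05)) # ~0.019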

4.2 Statistically Valid Visualization through VC dimension

To apply the uniform convergence method via VC dimension to the visualization setup, we consider a range space $(\mathcal{U}, \mathcal{R}_Q)$, where $\mathcal{U}$ is the global domain and $\mathcal{R}_Q$ consists of all the possible subsets of $\mathcal{U}$ that can be selected by visualization predicates. That is, $\mathcal{R}_Q$ includes all the subsets that correspond to any bar of any visualization which can be selected using the appropriate predicate filter. Given a choice of possible allowed predicates, we refer to the associated set of ranges as the "query range space" and we denote it $\mathcal{R}_Q$.

The VC dimension of a query range class is a function of the type of selection operators allowed (e.g., $=$, $<$, $\le$, $>$, $\ge$) and of the number of (non-redundant) operators allowed on each feature in the construction of the allowed predicates. Note that, depending on the domains of the selected features and on the complexity according to which the predicate filters can be constructed, the number of possible predicates may be infinite. In order to use the VC approach it is however sufficient to efficiently compute a finite upper bound on the VC-dimension of the set of allowed predicates. We discuss an efficient method for bounding the VC-dimension of a query range space in Section 4.4.

In order to deploy the general results from the previous section, we have to verify that the sample $D$ provides an $\epsilon$-approximation of the values $p_i$ for all the visualizations considered in the query range space $(\mathcal{U}, \mathcal{R}_Q)$.

To this end, it is useful to introduce the following, well known, property of VC dimension:

Fact 2.

Let $(\mathcal{X}, \mathcal{R})$ be a range space of VC dimension $d$. For any $Y \subseteq \mathcal{X}$, the VC-dimension of the projected range space $(Y, \{ r \cap Y : r \in \mathcal{R} \})$ is bounded by $d$.

Using this fact in conjunction with Theorem 1 we have:

Lemma 2.

Let $(\mathcal{U}, \mathcal{R}_Q)$ denote the range space of the queries being considered, with VC dimension bounded by $d$, and let $\delta \in (0, 1)$. Let $D$ be a random subset of $\mathcal{U}$ of size $n$. Then there exists a constant $c$ such that, with probability at least $1 - \delta$, for any filter $F$ defined in the query class we have that the subset $D_F$ is an $\epsilon_F$-approximation of $\mathcal{U}_F$ with:

$$\epsilon_F = \sqrt{\frac{c}{|D_F|}\left(d + \ln\frac{1}{\delta}\right)}.$$

Proof.

Fact 1 ensures that, given the dataset $D$, for any choice of a predicate $F$ we have that $D_F$ is a random sample of $\mathcal{U}_F$. Therefore, regardless of the specific choice of the predicate, the VC dimension of the range space restricted to $\mathcal{U}_F$ is bounded by $d$ (Fact 2). From Theorem 1 we have that if

$$|D_F| \ge \frac{c}{\epsilon_F^2}\left(d + \ln\frac{1}{\delta}\right), \qquad (9)$$

then $D_F$ is an $\epsilon_F$-approximation for the respective set $\mathcal{U}_F$. ∎

Lemma 2 provides us with an efficient tool to evaluate the quality of our estimates of the actual ground-truth values for any choice of predicate associated with a visualization. In particular, Lemma 2 shows that the quality degrades gradually the more selective the predicate associated with a visualization is: the smaller the cardinality of $D_F$, the higher the uncertainty of the estimate.

Corollary 1.

Let $D$ be a random sample of size $n$ from $\mathcal{U}$, and let $(\mathcal{U}, \mathcal{R}_Q)$ be a query range space with VC dimension bounded from above by $d$. For any visualization $V$ with predicate $F$ and for any value $x_i$ of the group-by feature $X$, we have that, with probability at least $1 - \delta$,

$$|\hat{p}_i - p_i| \le \epsilon_V, \qquad (10)$$

where

$$\epsilon_V = \sqrt{\frac{c}{\sigma(V)\, n}\left(d + \ln\frac{1}{\delta}\right)}, \qquad (11)$$

$F$ denotes the predicate associated with the visualization $V$, and $X$ denotes the group-by feature being considered.

4.3 The VizRec recommendation validation criteria

Consider now a given reference visualization $V$ and a candidate recommendation $V'$, both using $X$ as the group-by feature, where the domain of $X$ has $k$ values. From Lemma 2, we have that with probability at least $1 - \delta$ the empirical estimates of the normalized columns of the two visualizations are accurate within $\epsilon_V$ and $\epsilon_{V'}$, respectively, each of which depends on the size of the subset of the sample dataset used for the reconstruction of the corresponding visualization.

As argued in Section 3, we consider a candidate visualization worthy of being recommended if it represents a different statistical behavior of the group-by feature with respect to the reference query.

Our VizRec strategy operates by comparing the values $\hat{p}_i$ and $\hat{p}'_i$ for all values $x_i$ in the domain of the chosen group-by feature. Let $\epsilon_V$ (resp., $\epsilon_{V'}$) denote the uncertainty such that, with probability at least $1 - \delta$, we have $|\hat{p}_i - p_i| \le \epsilon_V$ and $|\hat{p}'_i - p'_i| \le \epsilon_{V'}$ for all $i$, according to Lemma 2. If it is the case that $\max_i |\hat{p}_i - \hat{p}'_i| > \epsilon_V + \epsilon_{V'}$, then we can conclude that, with probability at least $1 - \delta$, we have $\max_i |p_i - p'_i| > 0$, i.e., the two visualizations are truly different in $\mathcal{U}$.

That is, VizRec recognizes as statistically different (and hence interesting) only pairs of visualizations for which the most different pair of corresponding columns differs by more than the error in the estimations from the sample. If that is the case, it is possible to guarantee that $V$ and $V'$ are indeed different according to the Chebyshev measure. Due to the uniform convergence bound ensured by the application of VC dimension, the probabilistic guarantees of this control hold simultaneously for all possible pairs of reference and candidate recommendation visualizations. The advantage of this approach compared to the use of multiple Chernoff bounds is discussed in Appendix C.1. Further, our VC dimension approach is agnostic to the adaptive nature of the testing, as it accounts preventively for all possible evaluations of pairs of visualizations. Therefore, we have:

Theorem 2.

For any given $\delta \in (0, 1)$, VizRec ensures FWER control at level $\delta$ while offering visual recommendations.

procedure VizRec(V, Q, D, δ)
Input: Starting visualization V, query space Q, sample dataset D, FWER target control level δ.
Output: A set of statistically safe recommendations sorted according to decreasing interest.
    R ← empty list of recommendations
    X ← the group-by feature being considered
    F ← the predicate associated with V
    ε_V ← uncertainty in the approximation of V (Lemma 2)
    for all candidate visualizations V' in Q do
        F' ← the predicate associated with V'
        ε_V' ← uncertainty in the approximation of V' (Lemma 2)
        interest(V') ← Chebyshev distance between the empirical pmfs of V and V'
        if interest(V') > ε_V + ε_V' then
            add (V', interest(V') − (ε_V + ε_V')) to R
        or, if the stricter criterion with minimum visual distance T is used:
        if interest(V') > ε_V + ε_V' + T then
            add (V', interest(V') − (ε_V + ε_V')) to R
    return R sorted according to decreasing interest value
Algorithm 1 VizRec: Visual Recommendations with VC dimension

This criterion can be strengthened by imposing a higher threshold on the difference between two visualizations in order for a candidate visualization to be considered interesting. As an example, in Section 2.4 we discussed the possible use of a threshold $T$ denoting a minimum visual distance. When using this more restrictive constraint, VizRec would accept a candidate visualization $V'$ as interesting only if $\max_i |\hat{p}_i - \hat{p}'_i| > \epsilon_V + \epsilon_{V'} + T$.

We present a simplified pseudocode of our VizRec procedure in Algorithm 1.

Our VizRec approach operates as the equivalent of a two-sample test, in the sense that we assume that, in general, there is uncertainty in the reconstruction of both the reference visualization $V$ and the candidate $V'$. In some scenarios, it may be possible to assume that the reference visualization is given exactly. In such a case, in order for the candidate to be recommended it would be sufficient that $\max_i |\hat{p}_i - \hat{p}'_i| > \epsilon_{V'}$.

After identifying a set of recommendations whose interest is guaranteed with probability at least $1 - \delta$, VizRec ranks them according to the difference between their "empirical interest" (i.e., the Chebyshev distance between the empirical pmfs) and the uncertainty of the evaluation of such a measure (i.e., $\epsilon_V + \epsilon_{V'}$). While somewhat arbitrary, we chose this heuristic as it emphasizes the intrinsic value of visualizations with large support over those with small support.
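Below is a compact Python sketch of this decision rule, using the notation reconstructed in this section: the per-visualization uncertainty follows the epsilon formula of Lemma 2 (with the constant c and the level delta as parameters), a candidate is accepted when the empirical Chebyshev distance exceeds the sum of the two uncertainties (plus an optional minimum visual distance T), and accepted candidates are ranked by empirical interest minus uncertainty. It is an illustrative re-implementation with made-up inputs, not the authors' code.

    import math

    def epsilon(support, vc_dim, delta, c=0.5):
        """Uncertainty of a visualization's bars (Lemma 2, as reconstructed here).
        The constant c = 0.5 is a placeholder, not the paper's value."""
        return math.sqrt(c / support * (vc_dim + math.log(1.0 / delta)))

    def chebyshev(p, q):
        keys = set(p) | set(q)
        return max(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

    def vizrec(reference, candidates, vc_dim, delta, min_visual_distance=0.0, c=0.5):
        """reference: (pmf, support); candidates: dict name -> (pmf, support).
        Returns accepted candidates sorted by empirical interest minus uncertainty."""
        ref_pmf, ref_support = reference
        eps_ref = epsilon(ref_support, vc_dim, delta, c)
        accepted = []
        for name, (pmf, support) in candidates.items():
            eps_cand = epsilon(support, vc_dim, delta, c)
            interest = chebyshev(ref_pmf, pmf)
            if interest > eps_ref + eps_cand + min_visual_distance:
                accepted.append((name, interest - (eps_ref + eps_cand)))
        return sorted(accepted, key=lambda t: t[1], reverse=True)

    reference = ({"dem": 0.55, "rep": 0.45}, 5000)
    candidates = {
        "country_music": ({"dem": 0.35, "rep": 0.65}, 900),
        "tiny_subgroup": ({"dem": 0.10, "rep": 0.90}, 12),
    }
    # 'country_music' is accepted; 'tiny_subgroup' is rejected because its
    # small support makes its uncertainty larger than the observed distance.
    print(vizrec(reference, candidates, vc_dim=4, delta=0.05))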

4.4 The VC dimension of the Range Space

In order to actually deploy the VC dimension bounds previously discussed, it is necessary to bound the VC dimension of the class of queries being considered. While challenging in general, we develop here a simple and effective bound based on the complexity of the constraints defining the predicates.

As discussed in Section 2.2, we assume that the values of the features can be mapped to real numbers. Hence, constraints on the values of a certain feature, formalized using comparison operators such as $<$, $\le$, $>$, $\ge$ (and $=$), correspond to selecting intervals (either open or closed) of the possible values of that feature. For each feature, the various clauses are connected by means of "or" operators. We characterize the complexity of such a connection by the minimum number of non-redundant open and closed intervals of values. In particular, we say that a connection of intervals is non-redundant if there is no connection of fewer intervals that selects the same values.

The VC dimension of a class of queries can then be characterized according to the number of non-redundant constraints applied to the various features.

Lemma 3.

Let $Q$ denote the class of query functions such that each query is a conjunction of connections of clauses on the values of $f$ distinct features. The VC dimension of $Q$ is:

(12)

where $k_i$ (resp., $o_i$) denotes the maximum number of non-redundant closed (resp., open) intervals of values corresponding to the connection of constraints regarding the value of the $i$-th feature, for $1 \le i \le f$.

The proof of Lemma 3, presented in detail in Appendix A, proceeds by induction on the number of features which can be used in constructing the queries, and builds on known results on the VC dimension of a class of functions constituted by the union of a finite number of closed intervals on $\mathbb{R}$.

Algorithm 2 (in Appendix B) outlines a procedure which reduces an input query class to an equivalent non-redundant version and computes a bound on its VC dimension.
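How such a reduction might look can be illustrated with a small interval-merging sketch (an assumption about the flavor of Algorithm 2, not its actual content): the "or"-connected clauses on a single numeric feature are viewed as closed intervals, overlapping intervals are merged, and the resulting count of non-redundant intervals is what enters the bound of Lemma 3.

    def merge_intervals(intervals):
        """Reduce a set of closed intervals [(lo, hi), ...] connected by 'or'
        to an equivalent non-redundant set by merging overlapping ones."""
        if not intervals:
            return []
        intervals = sorted(intervals)              # sort by lower endpoint
        merged = [list(intervals[0])]
        for lo, hi in intervals[1:]:
            if lo <= merged[-1][1]:                # overlaps the previous interval
                merged[-1][1] = max(merged[-1][1], hi)
            else:
                merged.append([lo, hi])
        return [tuple(iv) for iv in merged]

    # Clauses "price in [0, 10] or price in [5, 20] or price in [30, 40]"
    # reduce to two non-redundant intervals.
    clauses = [(0, 10), (5, 20), (30, 40)]
    non_redundant = merge_intervals(clauses)
    print(non_redundant, len(non_redundant))       # [(0, 20), (30, 40)] 2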

4.5 Trade-off between query complexity and minimum allowable selectivity

Consider exploring the space of possible recommendations by constructing the filter condition one clause at a time. Note that as the filter condition grows in complexity (i.e., multiple non-trivial clauses are added), the number of records selected by the predicate, and hence its selectivity, will decrease. It therefore appears reasonable to start evaluating simpler predicate filters and then proceed depth-first by adding more and more clauses. While reasonable, such a procedure may lead to exploring a very large number of queries. Moreover, most of the filters obtained by composing a large number of clauses will likely lead to visualizations supported by very few sample points and, hence, intrinsically unreliable.

Our VC dimension approach allows the system to recognize this fact and to use it to limit the search space. As discussed in Section 4.2, the lower the selectivity of the filter condition of a given visualization, the higher the uncertainty $\epsilon_{V'}$.

In order for the difference between a candidate $V'$ and the starting visualization $V$ to be deemed statistically relevant, their Chebyshev distance has to be higher than $\epsilon_V + \epsilon_{V'}$. The Chebyshev distance, and hence the maximum interest of a candidate visualization, is at most one. This clearly implies that all visualizations whose selectivity is such that

$$\epsilon_{V'} = \sqrt{\frac{c}{\sigma(V')\, n}\left(d + \ln\frac{1}{\delta}\right)} \ge 1, \quad \text{i.e.,} \quad \sigma(V') \le \frac{c}{n}\left(d + \ln\frac{1}{\delta}\right), \qquad (13)$$

are not going to be interesting according to our procedure, and, hence, when exploring the space of possible recommendations, we can stop refining the queries once the selectivity of the candidate visualization drops below the threshold given by (13). This allows us to prune the search space by eliminating from the exploration queries which are "not worth considering" as possible recommendations.
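Under the epsilon formula reconstructed above, the pruning rule of (13) translates into a minimum selectivity below which a candidate can never be recommended, since its uncertainty alone already exceeds the maximum possible Chebyshev distance of one; the sketch below computes this cutoff (the parameter values are illustrative, and c = 0.5 is a placeholder constant).

    import math

    def min_selectivity(n, vc_dim, delta, c=0.5):
        """Selectivity below which epsilon >= 1, so the candidate can never
        exceed the recommendation threshold (the maximum distance is 1)."""
        return c * (vc_dim + math.log(1.0 / delta)) / n

    n = 50_000                          # size of the sample dataset
    sigma_min = min_selectivity(n, vc_dim=6, delta=0.05)
    print(sigma_min, sigma_min * n)     # ~9.0e-05, i.e. about 4.5 records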

While this may appear as a weakness of our approach, it is instead consistent with the basic principle of distinguishing statistically relevant phenomena from effects of the noise introduced by the sampling process. While visualizations which involve just a very limited number of sample points may appear to represent a very interesting distribution of a subset of the data, they are more likely to represent random fluctuation in the selection of the input sample than a true phenomenon in the global domain.

By taking into consideration the selectivity of candidate visualizations, our method automatically adjusts the threshold of interest for each candidate visualization.

5 Discussion

In this section, we discuss and motivate some guidelines to help the analyst determine which of the discussed tools are better suited for her actual setting.

5.1 Number of hypotheses being tested

Consider a scenario in which the system is limited to the analysis of a small number of possible visualizations. In this case, it is possible to evaluate which of these candidate visualizations are actually interesting with respect to the starting visualization by applying statistical testing with an appropriate correction for the number of hypotheses being tested, which in this case corresponds to the number of candidate visualizations.

Further, if in the same exploration session the analyst also wants to evaluate which of the visualizations are interesting with respect to a different starting visualization, this would require treating all the additional candidate recommendations as additional hypotheses. In order to control the FWER it will be necessary to correct for the total number of hypotheses being tested, thus resulting in a loss of statistical power. Even though some FWER corrections, such as the Holm or the Hochberg procedure, reduce the effect of the multiple-hypotheses correction compared to the simple Bonferroni procedure, there is still a considerable decrease in statistical power, in particular in settings where only a small fraction of the hypotheses being tested have low p-values. Therefore, the testing approach appears to be more suitable for settings with a limited number of candidate recommendations being evaluated.

5.2 Bounding the complexity of the query class

The properties of our method can also be used to determine a bound on the VC-dimension of the query range space being considered, when the goal is to ensure that any candidate recommendation which differs from the reference visualization by at least $T$ is actually marked as a safe recommendation.

Let $V$ (resp., $V'$) denote the reference (resp., a candidate) visualization, and let $\sigma(V)$ (resp., $\sigma(V')$) denote its selectivity.

Given the size $n$ of the available dataset $D$ and the desired FWER control level $\delta$, the maximum VC dimension $d$ which guarantees to meet these requirements can be obtained by requiring $\epsilon_V + \epsilon_{V'} \le T$ and solving (11) for $d$:

$$d \;\le\; \frac{T^2}{c\left(\frac{1}{\sqrt{\sigma(V)\, n}} + \frac{1}{\sqrt{\sigma(V')\, n}}\right)^2} - \ln\frac{1}{\delta}.$$

This bound can be used as a guideline to limit the structure of the queries being considered. That is, it offers indications on the number of different features which can be considered when building the queries, and on their complexity, intended as the number of clauses used in their construction (as discussed in Section 4.4).

5.3 Preprocessing heuristics

In this section we outline some preprocessing heuristics which help improve the effectiveness of our control procedures.

I. Removal of constant features: Features which assume the same value in all the records of the sample can be safely ignored.

II. Removal of identifier-type features: Features which assign a different unique value to each record can be removed. This is generally the case for identifier features (e.g., "Street address" in a real estate dataset). This appears justified as, due to the uniqueness of their values, such features are not useful in constructing predicate conditions, nor should they be used as the "group-by" attribute (i.e., the attribute on the x-axis).

These heuristics share the fact that they allow us to ignore some of the features (or columns) of the records. This in turn reduces both the search space and the VC dimension of the query class being considered.

6 Experiments

In this section, we show how our framework can be applied to both real data (i.e., the collected survey data) and synthetic data. We start by demonstrating different problematic scenarios on our real dataset.

6.1 Anecdotal examples

Our first example shows that a system without statistical control may trick the user into believing insights that are actually not valid and merely random. For this, assume the user wants to explore whether there is a subpopulation that believes differently from the overall population when it comes to whether obesity is a disease or not.

Figure 2: (a) Reference View; (b)-(d) SeeDB Views 1-3. VizRec would not mark any of these visualizations as statistically significant, as the difference computed with respect to the reference is not larger than the uncertainty of estimating the bars correctly.

As depicted in Figure 2, a user may falsely believe that people who prefer potato chips with Cheese flavour are more likely to believe that obesity is a disease. Though people that consume potato chips may lean more towards the view that obesity is a disease, the insight that in particular people who prefer the Cheese flavour are the most interesting subpopulation is questionable. Since for all the other flavours in our study (BBQ, Sea Salt & Vinegar, Sour Cream and Onion, Jalapeno, Cheddar and Sour Cream, Original/Plain, "I don't eat chips") no visualization is within the top results, picking particularly the Cheese flavour as a belief changer looks like a potential false discovery. The next recommended result seems even more random: an automatic recommender system would imply to the user that persons who prefer the middle seat are more likely to believe that obesity is a disease. This seems hard to understand and looks more like some random result that the system produced.

In our second example, we want to show that VizRec may correctly identify the top SeeDB recommendation(s) as being statistically valid.

Figure 3: (a) Reference View; (b)-(d) SeeDB Views 1-3. VizRec would also recommend the top visualization, but declares the other visualizations not statistically significant enough.

Here, the user was interested in finding out whether there is a subpopulation that has a different voting behavior for the two main parties in the United States (cf. Figure 3). The recommended top visualization coincides with the top SeeDB result and seems sound. However, VizRec prevents the user from attributing a preference towards the Republican party to persons who prefer to listen to Country and Folk music.

Finally, it can also be the case that SeeDB recommends as the top result something which the VizRec framework would rule out as not significant at all. As depicted in Figure 4, again a questionable relation between people who prefer Cheese-flavoured potato chips and those who believe in Astrology would get recommended.

Figure 4: (a) Reference View; (b)-(d) SeeDB Views 1-3. VizRec would not recommend the top visualization, but the second-ranked one.

However, this is actually a false discovery and not backed statistically. On the contrary, VizRec deemed a relation between smokers and a belief in Astrology statistically sound. This example demonstrates another interesting problem: though the interestingness scores of the top results are very close to each other, the score itself is not sufficient to determine statistical relevance. In particular, employing a simple cutoff may lead to false discoveries. Besides the complexity of the exploration space, the number of samples used to estimate both the reference query and the candidate query needs to be accounted for too.

These anecdotal examples demonstrate that, without any statistical control, the user is likely to make false discoveries. Our VC approach accounts not only for the number of samples but also for the complexity of the data exploration, and is thus a well-suited tool for avoiding false discoveries in a visual recommendation system.

6.2 Random data leads to no discoveries

A meaningful baseline for any safe visual recommendation system is to make sure that random data does not lead to any recommendations. To demonstrate that the VC approach does not recommend any false positives, we generated a synthetic dataset of uniformly distributed data, with the first column selected as the aggregate and the other three columns as features. Both the aggregate and the features are uniformly distributed over their respective domains.

With simple predicates (i.e., queries formed solely from simple clauses, one per feature) there is a fixed number of visualizations to be explored. (A dummy value was used in the clauses to switch a feature on or off: a clause comparing a feature against the dummy value has no effect on the rows returned. Note that using dummy values in the clauses does not change the VC dimension.) As a reference, the uniform distribution was chosen. This means that the expected support of every visualization is large enough to give a fair estimate of the bars.
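To make this setup concrete, the following sketch (in Python, assuming NumPy and pandas; the column names, domain sizes, and sample count are illustrative rather than the ones used in the experiment) generates a uniform dataset with one aggregate column and three feature columns and computes the normalized bar distribution of the aggregate under one candidate predicate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # illustrative sample size, not the one used in the experiment

# One aggregate column and three feature columns, all uniformly distributed.
df = pd.DataFrame({
    "agg": rng.integers(0, 10, size=n),   # aggregate attribute
    "f1": rng.integers(0, 5, size=n),     # feature attributes
    "f2": rng.integers(0, 5, size=n),
    "f3": rng.integers(0, 5, size=n),
})

def bar_distribution(data, predicate=None):
    """Normalized histogram of the aggregate, optionally under a predicate."""
    sel = data if predicate is None else data[predicate(data)]
    return sel["agg"].value_counts(normalize=True).sort_index()

reference = bar_distribution(df)                          # uniform reference view
candidate = bar_distribution(df, lambda d: d["f1"] == 2)  # one candidate visualization

# With purely random data, the candidate should not deviate from the reference
# by more than the estimation uncertainty.
print((candidate - reference).abs().max())
```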

Figure 5: Blue dots represent the interestingness scores of all evaluated visualizations. The curve denotes the recommendation threshold obtained using a VC dimension of 4 to achieve control at the chosen level. The lower the VC dimension, the more the curve takes the form of an "L". Since no visualization scores higher than the threshold curve, no visualization gets recommended from the generated random data.

When not accounting for the multiple comparison problem, p-values below the nominal threshold occur inevitably on random data, so a system without FWER guarantees would report false positives. Using Bonferroni (or other comparable corrections) remedies this, however at the cost of a noticeable loss in statistical power.

In comparison, the VC approach guarantees a minimum threshold that every recommendation must exceed. As discussed in Section 4.2, the required threshold to be met by the Chebyshev-norm-induced distance measure depends on the selectivity of the query; the necessity of this can also be observed in Figure 5. Since the interestingness scores (distances) of all queries in Figure 5 lie below the threshold curve, the VC approach does not recommend any false positives in this experiment. Using different distributions instead of the uniform one showed comparable results.
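The effect of the multiple-comparisons correction on random data can be illustrated with the following sketch (a hypothetical simulation with illustrative domain sizes, not the experiment above; it uses a χ²-test against the uniform reference merely as a stand-in for a significance check and does not reproduce the VC threshold curve of Figure 5).

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(1)
n, bins, alpha = 50_000, 10, 0.05        # illustrative sizes
n_features, n_values = 3, 20             # 60 candidate views in total

agg = rng.integers(0, bins, size=n)
features = rng.integers(0, n_values, size=(n, n_features))

raw_hits = bonferroni_hits = tests = 0
for f in range(n_features):
    for v in range(n_values):
        sel = agg[features[:, f] == v]
        observed = np.bincount(sel, minlength=bins)
        expected = np.full(bins, len(sel) / bins)   # uniform reference view
        p = chisquare(observed, expected).pvalue
        tests += 1
        raw_hits += p < alpha                       # no correction
        bonferroni_hits += p < alpha / (n_features * n_values)

# On purely random data, raw_hits is typically > 0 (false positives),
# while bonferroni_hits is almost always 0, at the price of reduced power.
print(tests, raw_hits, bonferroni_hits)
```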

6.3 Statistical Testing vs. VC approach

We now show that while statistical testing in the form of a χ²-test is in general not a wrong ingredient for building a VRS, in some situations it is unable to spot meaningful visual differences, which would, however, be correctly recognized by our VizRec approach.

Assume we had a query that selects a subset of the samples and a perfect estimator for the true distribution functions of the reference and the query distribution, which shall be distributed as in Figure 6. The χ²-test would then yield a p-value small enough to imply that they are different, even after correcting with Bonferroni for the number of tested visualizations.

Figure 6: Two close distributions that should nevertheless not be recommended, since the visual-difference criterion of the VC dimension approach is not met.

However, at the VC dimension of this query class, the required distance threshold is nearly twice as high as the difference at the first bar shown in Figure 6. Thus, the VC approach would not select this visualization as significantly different, given the modest sample size. The χ²-test, in contrast, would recommend this visualization, since it only detects that there is a difference, not whether the difference is large enough to be visually meaningful.
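A small numeric sketch (with made-up bar values and sample size) illustrates the point: the χ²-test flags the two distributions below as significantly different even under a heavy Bonferroni correction, although the largest bar-wise difference is tiny and would fall well below a visual-difference threshold.

```python
import numpy as np
from scipy.stats import chisquare

# Two nearly identical 4-bar distributions (values are illustrative only).
reference = np.array([0.25, 0.25, 0.25, 0.25])
candidate = np.array([0.28, 0.24, 0.24, 0.24])
n = 20_000                      # assumed sample size behind the candidate view

observed = candidate * n
expected = reference * n
p = chisquare(observed, expected).pvalue

m = 1_000                       # number of simultaneously tested visualizations
alpha = 0.05
print("significant under Bonferroni:", p < alpha / m)   # True for these numbers
print("max bar difference:", np.abs(candidate - reference).max())  # only 0.03
```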

In practice, scenarios like the one presented here occur especially due to outliers in the data: for one feature value there may be only a handful of samples, which without any correction would lead to a positive recommendation. Though a heuristic may simply ignore visualizations below a minimum sample count, this comes at the cost of ignoring rare phenomena and of choosing a magic number for the threshold; for any such cutoff, an example with slightly more samples per bin can be constructed where the test picks up a visualization that is not visually significant.

This underscores that a VRS using the χ²-test would correctly identify two visualizations as different, but cannot guarantee a meaningful difference in terms of a distance, which is crucial for building usable systems that do not lure the user into a false sense of security. One may argue that filtering out visualizations after statistical testing remedies this (which may work in practice when the interestingness score is high enough), but then the observed distances come with no statistical guarantee. While the right choice depends on the scenario (i.e., which queries and which guarantees a system needs to fulfill), we point out that there is also a simple way to use the χ²-test to control for a significant distance.

Furthermore, we want to underscore that the chi-squared test is indeed a very powerful test, but that the correct estimation of the distribution dominates the selectivity. That is, when we guarantee that the estimates of the probability mass function are close enough to the true values, a testing procedure like the χ²-test will, even under a million possible hypotheses, only need a small number of samples to spot a difference between two distributions. We thereby regard point estimates over the small number of bars typical for a visualization as the meaningful regime.

Figure 7: Chi-square distance and minimum number of samples required for the χ²-test to reject the null hypothesis under a Bonferroni correction over the tested queries.

Figure 7 shows that even low values of the χ²-distance require only queries with hundreds of samples to be identified correctly.
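The relationship in Figure 7 can be approximated with a short power calculation: assuming the test statistic follows a non-central χ² distribution whose non-centrality is the sample size times the χ²-distance (a standard approximation), the sketch below searches for the smallest sample size at which the χ²-test rejects with a given power under a Bonferroni correction; all numeric settings are illustrative.

```python
from scipy.stats import chi2, ncx2

def min_samples(d2, k=4, alpha=0.05, m=1_000_000, power=0.8):
    """Smallest n for which the chi-squared test rejects with the given power,
    assuming a chi-square distance d2 = sum_i (p_i - q_i)^2 / q_i between the
    candidate and the reference, k bars, and a Bonferroni correction over m tests."""
    df = k - 1
    crit = chi2.ppf(1 - alpha / m, df)          # Bonferroni-corrected critical value
    for n in range(10, 2_000_000, 10):
        if ncx2.sf(crit, df, n * d2) >= power:  # power under non-centrality n * d2
            return n
    return None

for d2 in (0.05, 0.1, 0.2, 0.5):
    print(d2, min_samples(d2))
```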

7 Related Work

[24] introduced the VC approach to provide approximation guarantees for the selectivity of queries. Whereas they also consider joins in addition to multi-attribute selection queries, by restricting ourselves to conjunctions over multiple attributes, as they occur naturally in OLAP and visual recommender systems, we were able to lower the required VC dimension.

When comparing continuous aggregates for visualizations, e.g., using kernel-density estimation for distribution plots, other statistical testing procedures such as the Kolmogorov-Smirnov test can be used. As a viable alternative to goodness-of-fit tests such as Fisher's exact test, the χ²-test, or the Kolmogorov-Smirnov test, independence tests such as Kendall's tau or Spearman's rank correlation may be used. A combination of multiple tests that also accounts for differences in shape, as detailed in [22], can strengthen statistical guarantees on visualizations being different. Indeed, this is quite similar to hypothesis-based feature extraction and selection approaches as in [6, 29].

Recent work [26] introduced the problem of group-by queries leading to wrong interpretations, specifically when AVG aggregates are used. To remedy this, the notion of a biased query is introduced. However, they do not account for the multiple comparisons problem and have no notion of a significant distance.

[3] introduced various control techniques for interactive data exploration scenarios. Whereas this work accounts for the multiple comparisons problem, it does not address the problem of requiring a statistically significant distance between two visualizations.

[33] provides an approach to effectively compute visualizations over an exponential search space by reusing previous results and by approximating queries. Visualizations are recommended by treating group-by results as normalized probability distributions and ranking them according to various distance measures between two probability distributions, from which the top interesting visualizations are recommended. The authors found that the actual choice of distance did not substantially alter the results, which is not a great surprise given the relations between these distances pointed out in [11].

As described in [28], zooming into particularly interesting regions of the data is a key task performed by many users during data exploration. Our technique provides a simple and effective methodology which can be applied to a wide range of data. For example, a user may be given a geospatial dataset characterizing purchasing power and want assistance in exploring interesting subregions. We believe our VC approach can easily be extended to more complicated query types such as these.

8 Conclusion

In this work, we demonstrated why visualization recommendation systems should be built with mechanisms that ensure statistical guarantees, in order to prevent users from making false discoveries or regarding noisy data as relevant. As a novel supplement to classical statistical testing as in [37], we introduced a technique based on statistical learning theory, comparable in its power to [24]. We demonstrated various trade-offs and problems to consider when using either technique, and provided a simple heuristic for exploring the vast search space more efficiently by pruning it through different preprocessing steps.

9 Acknowledgments

This research was funded in part by the DARPA Award 16-43-D3M-FP-040, NSF Award IIS-1562657, NSF Award RI-1813444 and gifts from Google, Microsoft and Intel.

References

  • Spu [2018] Spurious correlations. http://tylervigen.com/old-version.html, 2018. Accessed: 2018-08-01.
  • pHa [2018] Hack your way to scientific glory. https://projects.fivethirtyeight.com/p-hacking/, 2018. Accessed: 2018-08-01.
  • Binnig et al. [2017] C. Binnig, L. D. Stefani, T. Kraska, E. Upfal, E. Zgraggen, and Z. Zhao. Toward sustainable insights, or why polygamy is bad for you. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017. URL http://cidrdb.org/cidr2017/papers/p56-binnig-cidr17.pdf.
  • Chaudhuri et al. [1998] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD ’98, pages 436–447, New York, NY, USA, 1998. ACM. ISBN 0-89791-995-5. doi: 10.1145/276304.276343. URL http://doi.acm.org/10.1145/276304.276343.
  • Chirigati et al. [2016] F. Chirigati, H. Doraiswamy, T. Damoulas, and J. Freire. Data polygamy: The many-many relationships among urban spatio-temporal data sets. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 1011–1025, New York, NY, USA, 2016. ISBN 978-1-4503-3531-7. doi: 10.1145/2882903.2915245. URL http://doi.acm.org/10.1145/2882903.2915245.
  • Christ et al. [2018] M. Christ, N. Braun, J. Neuffer, and A. W. Kempa-Liehr. Time series feature extraction on basis of scalable hypothesis tests (tsfresh – a python package). Neurocomputing, 307:72 – 77, 2018. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2018.03.067. URL http://www.sciencedirect.com/science/article/pii/S0925231218304843.
  • Delucchi [1983] K. L. Delucchi. The use and misuse of chi-square: Lewis and burke revisited. Psychological Bulletin, 94(1):166, 1983.
  • Dwork et al. [2015] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
  • Ehsan et al. [2016] H. Ehsan, M. A. Sharaf, and P. K. Chrysanthis. Muve: Efficient multi-objective view recommendation for visual data exploration. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 731–742, May 2016. doi: 10.1109/ICDE.2016.7498285.
  • Fasen et al. [2014] V. Fasen, C. Klüppelberg, and A. Menzel. Quantifying extreme risks. In Risk-A Multidisciplinary Introduction, pages 151–181. Springer, 2014.
  • Gibbs and Su [2002] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International statistical review, 70(3):419–435, 2002.
  • Gissibl et al. [2017] N. Gissibl, C. Klüppelberg, and J. Mager. Big data: Progress in automating extreme risk analysis. In Berechenbarkeit der Welt?, pages 171–189. Springer, 2017.
  • Greenwood and Nikulin [1996] P. E. Greenwood and M. S. Nikulin. A guide to chi-squared testing, volume 280. John Wiley & Sons, 1996.
  • Har-Peled and Sharir [2011] S. Har-Peled and M. Sharir. Relative (p, ε)-approximations in geometry. Discrete & Computational Geometry, 45(3):462–496, 2011.
  • Kanamori et al. [2012] T. Kanamori, T. Suzuki, and M. Sugiyama. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2):708–720, Feb 2012. ISSN 0018-9448. doi: 10.1109/TIT.2011.2163380.
  • Key et al. [2012] A. Key, B. Howe, D. Perry, and C. Aragon. Vizdeck: self-organizing dashboards for visual analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 681–684. ACM, 2012.
  • Kim et al. [2015] A. Kim, E. Blais, A. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. Rapid sampling for visualizations with ordering guarantees. Proceedings of the VLDB Endowment, 8(5):521–532, 2015.
  • Löffler and Phillips [2008] M. Löffler and J. M. Phillips. Shape fitting on point sets with probability distributions. CoRR, abs/0812.2967, 2008. URL http://arxiv.org/abs/0812.2967.
  • Mackinlay et al. [2007] J. Mackinlay, P. Hanrahan, and C. Stolte. Show me: Automatic presentation for visual analysis. IEEE Transactions on Visualization and Computer Graphics, 13(6):1137–1144, Nov 2007. ISSN 1077-2626. doi: 10.1109/TVCG.2007.70594.
  • Mitzenmacher and Upfal [2005] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005. ISBN 0521835402.
  • Moritz et al. [2018] D. Moritz, C. Wang, G. L. Nelson, A. H. Lin, M. Smith, B. Howe, and J. Heer. Formalizing visualization design knowledge as constraints: Actionable and extensible models in draco. In IEEE Trans. Visualization and Comp. Graphics (Proc. InfoVis), 2018.
  • Porter [2008] F. C. Porter. Testing consistency of two histograms. arXiv preprint arXiv:0804.0380, 2008.
  • Qin et al. [2018] X. Qin, Y. Luo, N. Tang, and G. Li. Deepeye: An automatic big data visualization framework. Big Data Mining and Analytics, 1(1):75–82, March 2018. doi: 10.26599/BDMA.2018.9020007.
  • Riondato et al. [2011] M. Riondato, M. Akdere, U. Çetintemel, S. B. Zdonik, and E. Upfal. The vc-dimension of queries and selectivity estimation through sampling. CoRR, abs/1101.5805, 2011. URL http://arxiv.org/abs/1101.5805.
  • Russo and Zou [2015] D. Russo and J. Zou. How much does your data exploration overfit? controlling bias via information usage. arXiv preprint arXiv:1511.05219, 2015.
  • Salimi et al. [2018] B. Salimi, J. Gehrke, and D. Suciu. Hypdb: Detect, explain and resolve bias in olap. arXiv preprint arXiv:1803.04562, 2018.
  • Seo and Shneiderman [2005] J. Seo and B. Shneiderman. A rank-by-feature framework for interactive exploration of multidimensional data. Information Visualization, 4(2):96–113, July 2005. ISSN 1473-8716. doi: 10.1057/palgrave.ivs.9500091. URL http://dx.doi.org/10.1057/palgrave.ivs.9500091.
  • Shneiderman [1996] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Visual Languages, 1996. Proceedings., IEEE Symposium on, pages 336–343. IEEE, 1996.
  • Steppe and Bauer [1996] J. M. Steppe and K. W. Bauer. Improved feature screening in feedforward neural networks. Neurocomputing, 13(1):47–58, 1996. ISSN 0925-2312. doi: https://doi.org/10.1016/0925-2312(95)00100-X. URL http://www.sciencedirect.com/science/article/pii/092523129500100X.
  • Stern and Johnson [2010] M. K. Stern and J. H. Johnson. Just noticeable difference. The Corsini Encyclopedia of Psychology, pages 1–2, 2010.
  • Taylor and Tibshirani [2015] J. Taylor and R. J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
  • Vapnik and Chervonenkis [2015] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pages 11–30. Springer, 2015.
  • Vartak et al. [2014] M. Vartak, S. Madden, A. Parameswaran, and N. Polyzotis. Seedb: automatically generating query visualizations. Proceedings of the VLDB Endowment, 7(13):1581–1584, 2014.
  • Wongsuphasawat et al. [2017] K. Wongsuphasawat, Z. Qu, D. Moritz, R. Chang, F. Ouk, A. Anand, J. Mackinlay, B. Howe, and J. Heer. Voyager 2: Augmenting visual analysis with partial view specifications. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 2648–2659. ACM, 2017.
  • xkcd [2018] xkcd. Significant. https://xkcd.com/882/, 2018. Accessed: 2018-08-01.
  • Zgraggen et al. [2018] E. Zgraggen, Z. Zhao, R. C. Zeleznik, and T. Kraska. Investigating the effect of the multiple comparisons problem in visual analysis. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI 2018, Montreal, QC, Canada, April 21-26, 2018, page 479, 2018. doi: 10.1145/3173574.3174053. URL http://doi.acm.org/10.1145/3173574.3174053.
  • Zhao et al. [2017] Z. Zhao, L. D. Stefani, E. Zgraggen, C. Binnig, E. Upfal, and T. Kraska. Controlling false discoveries during interactive data exploration. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 527–540, 2017. doi: 10.1145/3035918.3064019. URL http://doi.acm.org/10.1145/3035918.3064019.

Appendix A Proof of the VC dimension bound for classes of queries

Proof of Lemma 3.

The proof is by induction on the number of features. In the base case of a single feature, the VC dimension of the query class corresponds to the VC dimension of a union of closed and open intervals on the line; by a simple modification of the result of Lemma 1, this VC dimension is bounded as claimed. Let us now inductively assume that the statement holds when one feature is removed; to conclude the proof, we verify that it then holds for the full set of features as well.

Assume towards contradiction that there exists a set of points exceeding the claimed bound that can be shattered by the query class. From the inductive hypothesis, no subset larger than the inductive bound can be shattered by the family of query functions that express constraints only on all but the last feature. Without loss of generality, consider one of the maximal subsets that can be shattered using only constraints on all but the last feature. The following fact is important for our argument.

Fact 1: Recall that the queries in the class are logical conjunctions (i.e., "and") of connections (i.e., "or" statements) of constraints on a single feature. Hence, for any query in the class, if any of the connections evaluates to "false", then the query does not select the point, regardless of the values of the remaining conjoined connections.

Consider any assignment of labels to the points in the shattered set, and a range which realizes that assignment.

If the constraints on all but the last feature would assign value "0" to any point of the maximal subset, then, according to the structure of the queries, no constraint on the last feature could assign it value "1", and hence it would not be possible to shatter the subset.

Note that for any assignment of labels to the points in the maximal subset, there cannot exist two ranges such that, based solely on constraints on all but the last feature, one would assign "0" to a point while the other assigns "1" to the same point. If that were the case, it would be possible to shatter more points using only constraints on all but the last feature, which would violate the inductive hypothesis.

Without loss of generality, we can therefore assume in the following that, for any assignment of labels to the points in the maximal subset, the ranges that realize such an assignment based only on all but the last feature assign "1" to all of its points. This implies that the shattering of these points relies solely on the constraints on the values of the last feature.

Consider now the points of this subset; by our assumption on its size, and as discussed in the base case of the induction, it is not possible to shatter them using only the available closed and open intervals on the last dimension.

Hence it is not possible to shatter the original set of points, and we have a contradiction. ∎

Appendix B Bounding the VC dimension of a given query class

Algorithm 2 operates by "consolidating" redundant clauses into an equivalent minimal set. This consolidation is achieved by first (lines 2-10) determining the minimal open intervals for each feature, and then by merging overlapping closed intervals whenever possible. Note that, given a specific choice of query class, Algorithm 2 needs to be run just once to bound its VC dimension.

1:procedure ReduceIntervals
2:Input: A conjunction of k clauses expressed as open or closed intervals
3:Output: An equivalent non-redundant conjunction of clauses
4:      Initialization non redundant intervals
5:     if  Any constraints of the kind “” given as input then
6:          such that is clause;
7:     else     
8:      such that is clause;
9:      such that is clause;
10:      such that is clause;
11:      list of closed intervals sorted according to the values of the increasingly.
12:     
13:     if  then
14:         ;
15:     else
16:         ;
17:         if  then
18:                              
19:     
20:     if  then
21:         ;
22:     else
23:         ;
24:         if  then
25:                              
26:     if  then
27:         return
28:     else if  then
29:         return
30:     else
31:               
32:     Update by removing all intervals ;
33:     
34:     while  do Remove intervals contained in larger intervals
35:         
36:         
37:         
38:               
39:     while  do Merge intervals with superpositions
40:         
41:         
42:         if   then
43:              
44:              
45:              
46:         else
47:              
48:                              
49:     return Output non-redundant constraints
Algorithm 2 Reduction to non-redundant intervals
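The core consolidation idea for the closed intervals of a single feature (removing intervals contained in larger ones and merging overlapping intervals, as in the last two loops of Algorithm 2) can be sketched in Python as follows; this is a simplified illustration under the assumption that intervals are given as (lo, hi) pairs, not the exact procedure of Algorithm 2.

```python
def reduce_intervals(intervals):
    """Reduce a set of closed intervals on one feature to a non-redundant set:
    intervals contained in larger ones are dropped and overlapping intervals
    are merged, mirroring the last two loops of Algorithm 2.

    intervals: iterable of (lo, hi) pairs with lo <= hi
    """
    merged = []
    for lo, hi in sorted(intervals):          # sort by left endpoint
        if merged and lo <= merged[-1][1]:    # contained in or overlapping the last one
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Example: [1,4] is contained in [0,5]; [3,8] overlaps [0,5]; [10,12] is separate.
print(reduce_intervals([(0, 5), (1, 4), (3, 8), (10, 12)]))
# -> [(0, 8), (10, 12)]
```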

Appendix C Modified χ²-test

As described in [13], the χ²-test can be modified to test a more complex null hypothesis (14), together with a corresponding test statistic that follows a non-central χ² distribution with the appropriate degrees of freedom and non-centrality parameter. The distance guaranteed in this case is the χ²-distance. In fact, this testing procedure can be further generalized to other distances that belong to the family of f-divergences, such as the total variation distance or the Kullback-Leibler divergence, either via a suitable transformation or by using non-parametric models to estimate the distance measures before comparing them via statistical testing, as described in more detail in [15].
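As an illustration of how such a test can be instantiated, the sketch below tests the composite null hypothesis that the χ²-distance between candidate and reference is at most a chosen value eps0, comparing the usual χ² statistic against a non-central χ² distribution with non-centrality n·eps0. This is only a sketch under that assumed null hypothesis; the exact hypothesis and statistic of (14) follow [13], and all numbers are illustrative.

```python
import numpy as np
from scipy.stats import ncx2

def min_distance_test(observed, reference, eps0, alpha=0.05):
    """Sketch of a modified chi-squared test whose null hypothesis is
    'the chi-square distance between candidate and reference is at most eps0'.
    Rejecting it indicates (at level alpha) a distance larger than eps0.

    observed:  bar counts of the candidate visualization
    reference: reference probabilities (same number of bars)
    """
    observed = np.asarray(observed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n = observed.sum()
    stat = ((observed - n * reference) ** 2 / (n * reference)).sum()
    df = len(observed) - 1
    # Under the boundary of the null, the statistic follows a non-central
    # chi-square with non-centrality n * eps0.
    p_value = ncx2.sf(stat, df, n * eps0)
    return stat, p_value, p_value < alpha

counts = np.array([330, 240, 230, 200])      # illustrative candidate counts
ref = np.array([0.25, 0.25, 0.25, 0.25])     # uniform reference
print(min_distance_test(counts, ref, eps0=0.01))
```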

C.1 VC bounds and Chernoff-Hoeffding bounds

When only a single visualization needs to be recommended, using Chernoff-Hoeffding bounds is both easier and yields a better guarantee than a VC-based bound. However, when testing multiple visualizations on the same data, the VC approach dominates Chernoff-Hoeffding bounds.

Figure 8: Observed estimation error for a single random experiment and the bounds obtainable at a fixed significance level. For the VC dimension used, the bound is only slightly worse than the Chernoff bound for a single visualization. However, with an increasing number of visualizations, Chernoff bounds become more conservative than the bounds obtained via VC.

To demonstrate this, random data was generated to estimate the probability mass function of a biased Binomial distribution. Figure 8 shows a sample path together with the theoretical bounds for the chosen VC dimension. Compared to a Chernoff-Hoeffding bound for a single visualization, the VC approach outperforms Chernoff-Hoeffding bounds when learning multiple visualizations at once.
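The qualitative behaviour in Figure 8 can be sketched by comparing a Hoeffding bound, union-bounded over m simultaneously estimated visualizations, against an epsilon-approximation bound of the form sqrt((c/n)(d + ln(1/delta))). The constant c is a universal constant that is simply set to 0.5 here for illustration, so the absolute numbers are not the ones plotted in the figure.

```python
import math

def hoeffding_bound(n, m, delta):
    """Two-sided Hoeffding bound on the estimation error of a single proportion,
    union-bounded over m simultaneously estimated visualizations."""
    return math.sqrt(math.log(2.0 * m / delta) / (2.0 * n))

def vc_bound(n, d, delta, c=0.5):
    """An epsilon-approximation bound of the form sqrt((c/n)(d + ln(1/delta))).
    The universal constant c is unspecified in general; 0.5 is illustrative only."""
    return math.sqrt((c / n) * (d + math.log(1.0 / delta)))

n, d, delta = 100_000, 4, 0.05
for m in (1, 10, 1_000, 1_000_000):
    print(f"m={m:>9}  hoeffding={hoeffding_bound(n, m, delta):.4f}  "
          f"vc={vc_bound(n, d, delta):.4f}")
# The VC bound is independent of m, whereas the union-bounded Hoeffding
# threshold keeps growing with the number of visualizations.
```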

C.2 Restricting the search space

In this experiment we show that, without restricting the search space, it is likely that only few visualizations get recommended. For this, we again took the survey dataset and removed constant and identifier-like columns. To make the effect more visible, we also added some artificial columns such as a running identifier.

Figure 9: Threshold curves before and after restricting the search space. The distance needs to be higher than the uncertainty quantified by the VC-based threshold.

As shown in Figure 9, restricting the search space leads to more discoveries. The goal is thus not to be over-conservative, but to let the user apply meaningful preprocessing or feature selection first.