InfiniViz: Interactive Visual Exploration using Progressive Bin Refinement

10/05/2017 ∙ by Niranjan Kamat, et al. ∙ 0

Interactive visualizations can accelerate the data analysis loop through near-instantaneous feedback. To achieve interactivity, techniques such as data cubes and sampling are typically employed. While data cubes can speed up querying for moderate-sized datasets, they are ineffective at doing so at larger scales due to the size of the materialized data cubes. On the other hand, while sampling can help scale to large datasets, it adds sampling error and the associated issues into the process. While increasing accuracy by looking at more data may sometimes be valuable, providing result minutiae might not be necessary if they do not impart additional significant information. Indeed, such details not only incur a higher computational cost, but also tax the cognitive load of the analyst with worthless trivia. To reduce both the computational and cognitive expenses, we introduce InfiniViz. Through a novel result refinement-based querying paradigm, InfiniViz provides error-free results for large datasets by increasing bin resolutions progressively over time. Through real and simulated workloads over real and benchmark datasets, we evaluate and demonstrate InfiniViz's utility at reducing both cognitive and computational costs, while minimizing information loss.


I Introduction

Visualizations are widely used in data analysis. In this era of Big Data, querying large datasets has become a necessity. While analyzing large datasets helps discover insights that are otherwise unattainable [1, 2, 3], querying such large datasets is computationally expensive and inconducive to interactivity. Providing results within interactive latencies (ms) has been shown to greatly benefit analysis, with failing to do so having significant adverse consequences on the analysis outcomes [4, 5]. Marrying these twin concerns of interactivity and the need to query large datasets presents us with two compelling, contradictory forces.

A popular technique to reconcile them is to only process a sample of the data. However, this necessitates providing not only the query result but also its error [6, 7, 8, 9]. Interpretation of sampling error, even by experts, has been known to be error-prone [10, 11]. Annotation of visual results with errors can further introduce clutter [12, 13, 14].

Online aggregation builds on sampling by accessing more data over time, thereby reducing the error [17, 18, 19, 20, 21]. However, error-free results are unavailable till the entire dataset is processed. Further, as sampling error depends on the quality and size of the sample, it is common for sampling error to be large, especially at lower sampling rates, which can be expected during interactive response times (Figure 1). Additionally, highly selective queries reduce the number of tuples passing through, thereby lowering the effective sampling rate and increasing the error. Data skew further worsens these issues.

Another common technique used to achieve interactivity is data cubes [22] – a cube contains pre-computed aggregates for user-specified measures for all possible column combinations. Consequently, user queries can be run over the pre-computed result sets, which are usually smaller by multiple orders of magnitude. These result sets can be indexed, compounding the query speedups. However, cube size increases exponentially with the number of columns and their cardinalities, increasing the cube materialization cost, but more importantly from an interactive querying perspective, the time needed to query it – a cube constructed over a dataset with $d$ dimensions, with the $i$-th dimension having cardinality $c_i$, can consist of up to $\prod_{i=1}^{d} (c_i + 1)$ rows for each measure. Thus, constructing cubes over large datasets is inconducive to the pursuit of interactivity.
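To make the growth in cube size concrete, the following minimal sketch computes an upper bound on the number of cube rows per measure. It assumes the standard bound in which each dimension contributes its distinct values plus one rolled-up "ALL" value; the function name `max_cube_rows` and the example cardinalities are hypothetical and are not taken from InfiniViz.

```python
from math import prod


def max_cube_rows(cardinalities):
    """Upper bound on cube rows for one measure.

    Assumption: every dimension contributes its distinct values plus one
    rolled-up 'ALL' value, so the bound is the product of (c_i + 1).
    """
    return prod(c + 1 for c in cardinalities)


# Hypothetical comparison: 8 dimensions binned into 16 bins each,
# versus the same 8 dimensions with 1000 distinct raw values each.
binned_bound = max_cube_rows([16] * 8)
raw_bound = max_cube_rows([1000] * 8)
print(binned_bound, raw_bound)
```

Under this assumption, coarser bins shrink every factor of the product, which is why binning before cubing reduces the cube size so sharply.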

Fig. 1: Approximate Querying Modes: While sampling and online aggregation can result in large sampling error, especially during interactive response times, progressive refinement can deliver results without measure error from the get-go. The result resolution can be increased over time.
Fig. 2: InfiniViz during Progressive Refinement: We see a user’s view in her interactive exploration of a dataset having 50M rows and 8 columns, and importantly 8 linked views, through filter and refine queries (Section II-B).

In the context of visualizations, however, constructing a cube over the entire dataset might not be necessary. As screen resolution limits the information that can be presented to the user, data binning is a natural consequence, and has been looked at previously by multiple systems including Profiler [23], imMens [24], and Nanocubes [25], which construct cubes over the smaller binned datasets. This reduces the size of the data cube that needs to be computed, thus enabling interactive query execution. These systems also allow a user to increase the refinement (zoom) level of a result, thereby providing finer-grained results due to the smaller bin sizes.

In this paper, we run with this concept of result refinement, delving into its multiple benefits in delivering approximate visualizations whose resolutions increase over time. This leads us to propose a novel querying paradigm – progressive refinement (Section II-A). We treat result refinement as one of the primary query operators, alongside filtering (Section II-B). While being mindful of the required interactive latency, interesting results are refined over time to increase the resolution of the results.

Previous binning-based systems usually allow a user to refine the results by letting her specify the refinement level. We extend this approach by introducing a generalized, richer refinement operator, that allows for specification of multiple refinement related criteria such as the number of results, average deviance, relative entropy change, in addition to the refinement level (Section II-C). As these criteria might be contradictory with each other (consider maximum number of results vs minimum refinement level), we use the result content to trade them off through our novel information theory-based metrics (Section II-D), which results in our Generalized Refinement Operator (GRO) (Section II-E). While other systems have used binning as a means to achieve data reduction, we look at binning through the lens of approximate querying, including the notion of error (Section II-C3) in such systems.

Our experiments demonstrate that not only is the InfiniViz response time low ( ms), but the overall computational cost is also a couple of orders of magnitude lower than the cost of querying the underlying non-binned dataset (Section IV). Our detailed user studies demonstrate, through statistically significant results, InfiniViz’s ability to accelerate not only the individual queries, but more importantly, the overall data analysis loop as well.

Further, not only is the computational cost reduced, but the cognitive load on the user in understanding the results is reduced as well. In analyzing individual interesting results, e.g. comparing multiple bins or figuring out the relationship between different distributions, it is known that having numerous uninteresting results can hurt the analysis [26, 27]. Indeed, our user studies also demonstrate that deluging the user with insignificant results hinders her analysis. As InfiniViz does not inundate the user with multiple insignificant results, she can focus on the interesting results. Further, as people excel at summarizing and generating patterns from visual data, having fewer results does not hurt in this endeavor [28, 29].

Our progressive refinement approach can be summarized by the following SQL query. Note that while we consider the use-case of histograms, our approach can be extended in a straightforward fashion to heatmaps as well.

SELECT agg_func(agg_col) AS y
FROM UNION(set_of_binned_tables)
WHERE filters
GROUP BY grouping_col AS x
HAVING resolution(x)

I-A Motivating Example

Let us join Sia, a data analyst, in her exploratory analysis of flight patterns. The dataset (PAD, Section IV-B) consists of details of individual flights such as their duration, delays in departure and arrival times, distance covered, distance between arrival and destination, etc. Our user studies were conducted on this dataset, and the user behavior described here resembles that of our users.

Sia wishes to familiarize herself with the data, and extract interesting bits using the standard operators of filter, drill-down, and roll-up. She wants to explore flight patterns during takeoff and landing. She does so by setting a filter on the Speed dimension to only consider flights having low speeds (Figure 2). She notices that setting the filter changes the Altitude plot – only low altitude results are returned. She examines this plot in more detail by clicking on its interesting bars. The Elapsed Minutes dimension is also correlated with the Speed dimension, which she refines in an automated fashion through our GRO operator, as she wishes to refine the entire Elapsed Minutes plot. Upon observing interesting results in other plots (Latitude and Longitude), she might proceed to refine them as well. She might repeat this process with more filter queries followed by refines.

I-B Contributions

Thus, we help Sia by making the following contributions:

1. We introduce the concept of Progressive Refinement in the context of visualization-based analysis, to provide the user with error-free results within interactive response times over large datasets. While previous works allow a user to set the refinement level, we treat refinement as a first-class citizen, and look at interactive approximate querying through its lens. In this endeavor, we provide a novel refinement operator (GRO) that enables multiple resolution-related criteria to be used.

2. We introduce a novel monotonic information-theoretic metric that guides our result refinement approach.

3. Our extensive experiments, using real and simulated workloads over real-world and benchmark datasets, not only demonstrate InfiniViz’s efficacy at accelerating the data analysis loop, but also validate our proposed refinement-based interactive querying workflow (Table II).

Section II looks at the various concepts that underpin InfiniViz. We then present its system architecture in Section III. Section IV empirically validates our approaches. We then look at the related work in Section V, and finally conclude with our parting thoughts in Section VI.

II InfiniViz System

We now look at the various concepts underpinning InfiniViz. In particular, we further elucidate our progressive refinement concept, including its benefits and pitfalls, and look at its parallels with online aggregation. We also describe the possible user actions and the various refinement operators.

II-A Progressive Refinement

InfiniViz progressively improves the result resolution over time by refining the result. We now formalize these concepts of result resolution and result refinement.

II-A1 Result Resolution

The concept of result resolution is fairly straightforward – smaller bin sizes result in a higher resolution, while larger bin sizes provide lower resolution.

II-A2 Result Refinement

After providing initial results within interactive response times over lower resolution data, the results having higher information loss are refined over time, thereby increasing the result resolution (Figure 1).
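The idea of answering first at low resolution and then refining can be sketched as a generator that yields exact histograms at successively finer bin widths. This is only an illustration under stated assumptions: it recomputes each histogram from in-memory values, whereas InfiniViz queries pre-computed binned datasets, and the names `histogram` and `progressive_refinement`, the equi-width bins, and the doubling of the bin count per level are all hypothetical choices.

```python
from collections import Counter


def histogram(values, lo, hi, n_bins):
    """Exact COUNT histogram with n_bins equi-width bins over [lo, hi)."""
    width = (hi - lo) / n_bins
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    return [counts.get(b, 0) for b in range(n_bins)]


def progressive_refinement(values, lo, hi, base_bins=4, levels=3):
    """Yield (level, histogram) pairs, doubling the bin count at each level.

    Every yielded histogram is exact for its own bins; only the resolution
    along the x-axis changes between levels.
    """
    for level in range(levels + 1):
        yield level, histogram(values, lo, hi, base_bins * 2 ** level)


values = [0.5, 1.5, 1.6, 3.9, 7.2]
results = list(progressive_refinement(values, 0, 8, base_bins=2, levels=2))
for level, counts in results:
    print(level, counts)
```

The counts at every level sum to the same total, which mirrors the claim that the measure values carry no sampling error while the bin resolution improves.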

II-B User Actions & Behavior

We now look at the various direct and indirect manipulation actions available for interacting with InfiniViz. We enable two primary functionalities – filter and refine. A user can specify filters on any visualization (direct manipulation). This results in a WHERE predicate over the corresponding dimension. She can refine a specific bin’s resolution by clicking it (direct manipulation). Other direct manipulation actions include the Reset Plots button (removes all filters), Stop Refinement button (stops any further refinement), and buttons for handy, simple refinement actions (Section II-F). The generalized refinement operator (Section II-E) can be used by setting thresholds for the different refinement grammar (Section II-C) knobs (indirect manipulation). This operator can be used at the level of all visualizations, or a single visualization.

II-C Refinement Grammar

While current binning-based systems allow a user to set the refinement level, we enrich this traditional approach by providing multiple finely tunable knobs. These knobs, which together constitute our refinement grammar, allow a user to indirectly determine the result resolution.

II-C1 Refinement Level

In a traditional fashion, a user can set the minimum refinement level (MinRef). In addition, she can also set the maximum refinement level (MaxRef). Their default values are set to 0 and the resolution of the non-binned dataset, respectively.

II-C2 Number of Results

In addition, a user can specify the minimum (MinNR) or the maximum (MaxNR) number of results to be displayed. Their default values are set to 0 and $\infty$, respectively.

II-C3 Average Deviance (AD)

One of the governing principles that any progressive refinement-based system should follow is to prefer refined results that show a marked difference from their expected value – a sub-bin for a bin can be considered to impart more information if it deviates significantly from the uniform distribution , where represents the value of the bin, and sub-bins(i) represents the set of sub-bins of a bin . This motivates our AD metric, which determines how well a bin summarizes its sub-bins, and can be given for a bin by , where represents the expected value of a sub-bin given its parent bin value. We summarize AD for a plot by the average AD of its bins.

We illustrate our metrics using the following running example. Consider a plot having 4 bins with the y-values , , , and , respectively. Suppose the individual bins are split into sub-bins having y-values {, }, {, }, {, }, and {, }, respectively. Then, AD for the first bin can be given by . Similarly, AD for the other bins will respectively be , , and . AD for the plot will be .
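Because the AD formula is not legible in this text, the sketch below uses one plausible form rather than the authors' definition: the mean relative absolute deviation of each sub-bin from the value expected under a uniform split of the parent bin. The function names, the additive (COUNT or SUM) measure, and the example values are all assumptions made for illustration.

```python
def average_deviance(sub_bins):
    """AD of one bin from its sub-bin values (hypothetical form).

    Assumptions: the measure is additive, so the parent value is the sum of
    the sub-bins, and the expected sub-bin value is parent / number of sub-bins.
    """
    parent = sum(sub_bins)
    if parent == 0:
        return 0.0
    expected = parent / len(sub_bins)
    return sum(abs(s - expected) / expected for s in sub_bins) / len(sub_bins)


def plot_average_deviance(bins):
    """AD of a plot: the average AD of its bins, as stated in the text."""
    return sum(average_deviance(b) for b in bins) / len(bins)


# Hypothetical values: one skewed bin and one perfectly uniform bin.
print(average_deviance([30, 10]))                   # skewed split
print(average_deviance([20, 20]))                   # uniform split
print(plot_average_deviance([[30, 10], [20, 20]]))  # plot-level average
```

Under this form, a bin whose sub-bins match the uniform expectation scores 0, so refining it would add little beyond what the parent bin already shows.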

II-C4 Relative Entropy Change (REC)

Entropy can be used to determine the information content of a set of values by normalizing each value by the sum of all values in the set, and treating each normalized value as its probability [30]. We allow a user to set bounds on minimum entropies for either a single bin or a plot. Bin entropy can be given by $H_i = -p_i \log p_i$, where $p_i = v_i / \sum_{j \in B} v_j$, $v_i$ represents the value of bin $i$, and $B$ represents the set of bins in a visualization. Thereby, plot entropy can be given by $H = \sum_{i \in B} H_i$.

Continuing with our example, the bin values can be converted into probabilities as {, , , }, and further into entropies as {, , , }. The plot entropy will be .

Entropy is a monotonically increasing metric, i.e. splitting a bin into smaller non-trivial sub-bins causes the resulting entropy to increase. We define REC by . REC is bounded from below by . A higher value indicates that the refined bins were similar to each other – whereas a value closer to indicates that the bins were dissimilar to each other and therefore, performing this refinement was beneficial to the user. This metric can be applied at either plot or bin level.

In our example, entropies of combined sub-bins are , , , and , respectively, with the entropy of the refined plot being . The plot REC is . The REC of individual bins would be , , , and .
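The entropy computation described above can be sketched as follows. The entropy functions follow the prose directly (normalize the bin values, treat them as probabilities). The REC function is an assumption, since its formula is not legible here: it is taken to be the increase in plot entropy after refinement divided by the entropy before refinement, which is consistent with the statements that entropy never decreases under refinement and that REC is bounded from below. The base-2 logarithm and all example values are likewise hypothetical.

```python
import math


def bin_entropies(values):
    """Per-bin entropy terms -p_i * log2(p_i), with p_i = v_i / sum(values)."""
    total = sum(values)
    return [-(v / total) * math.log2(v / total) if v > 0 else 0.0 for v in values]


def plot_entropy(values):
    """Plot entropy: the sum of the per-bin entropy terms."""
    return sum(bin_entropies(values))


def relative_entropy_change(bins, sub_bins):
    """Assumed REC: (entropy after refinement - entropy before) / entropy before."""
    before = plot_entropy(bins)
    after = plot_entropy([s for group in sub_bins for s in group])
    if before == 0:
        return math.inf if after > 0 else 0.0
    return (after - before) / before


# Hypothetical plot with two bins of value 2 each.
similar = relative_entropy_change([2, 2], [[1, 1], [1, 1]])     # identical sub-bins
dissimilar = relative_entropy_change([2, 2], [[2, 0], [2, 0]])  # one non-trivial sub-bin
print(similar, dissimilar)
```

With this reading, identical sub-bins give the largest REC and a single non-trivial sub-bin per bin gives 0, matching the text's interpretation of high and low values.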

II-D Result Ranking

Once a user sets the refinement grammar knobs, InfiniViz is tasked with the following naturally arising questions:

  • Which result bins should be refined?

  • What should their refinement level be?

To answer the first question, it is clear that the bins whose refinement results in greater information gain should be preferred. However, it is not possible to know this without actually refining the bins till the underlying dataset. Hence, we use our novel IGP metric (Section II-D2), which is based on our MEI metric (Section II-D1), to estimate the information gain potential of a bin.

To answer the second question, in keeping with our underlying principle of progressive refinement, we provide results by progressively increasing the refinement level. Further refinement of a bin is stopped when doing so would violate the knobs set by the user, as elaborated by the result ranking algorithm (Section II-D3).

II-D1 Maximum Entropy Increase (MEI)

MEI is an entropy-based metric that measures the additional information that can be gained by refining a set of bins using the underlying dataset. Entropy of sub-bins will be minimized when a bin results in a single non-trivial sub-bin with identical measure value. It will be maximized when the sub-bins are identical. Thus, we can estimate the maximum possible entropy of a refined plot by

Thus, MEI possesses the important property of not needing to know the sub-bins’ values – it depends on the number of sub-bins, which is known a priori. MEI forms the primary building block for IGP as shown below.

To illustrate MEI, let the aforementioned 4 bins in Section II-C4 have a domain size (difference between upper and lower end points of a bin) of each. The MEI as a result of refining all of them will be .

II-D2 Information Gain Potential (IGP)

Since MEI values can vary greatly between plots depending on their domain size, we define a new metric, IGP, for contextualizing the values. We define IGP as the ratio of MEI and entropy of the non-refined plot. Thus, the IGP for a plot can be given by . IGP can be given for a bin by . Bins with lower IGP values are given greater importance by the ranking function. As mentioned before, we use this metric in our ranking function to determine whether to show the refined bins to the user.

In our example, IGP for the plot will be . IGP for our bins will be , , , and , respectively.
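The following sketch shows one way to compute MEI and plot-level IGP from the prose alone. It assumes that the maximum entropy after refinement is reached when each bin splits into identical sub-bins, in which case the entropy increase works out to the sum of p_i * log2(k_i), where k_i is the number of sub-bins of bin i at the highest resolution. The function names, the base-2 logarithm, and the example values are hypothetical, and the exact formulas in the paper may differ.

```python
import math


def plot_entropy(values):
    """Plot entropy with p_i = v_i / sum(values)."""
    total = sum(values)
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)


def mei(values, n_sub_bins):
    """Assumed MEI: entropy gained if every bin split into k_i identical sub-bins.

    Splitting p_i into k_i equal parts changes its entropy term from
    -p_i*log2(p_i) to -p_i*log2(p_i / k_i), a gain of p_i*log2(k_i).
    Only the sub-bin counts are needed, not the sub-bin values.
    """
    total = sum(values)
    return sum((v / total) * math.log2(k) for v, k in zip(values, n_sub_bins))


def igp(values, n_sub_bins):
    """Plot-level IGP: ratio of MEI to the entropy of the non-refined plot."""
    return mei(values, n_sub_bins) / plot_entropy(values)


# Hypothetical plot: 4 equal bins, each with 4 sub-bins at the highest resolution.
values, sub_counts = [1, 1, 1, 1], [4, 4, 4, 4]
print(mei(values, sub_counts), igp(values, sub_counts))
```

As a sanity check on the derivation, refining this example uniformly yields 16 equal sub-bins with entropy 4, which exceeds the original entropy of 2 by exactly the computed MEI.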

II-D3 Result Ranking Algorithm

In our system, the cost of applying a filter to a binned dataset is much greater than that of performing aggregation. As a result, a filter query can return more sub-bins than permitted by the MaxNR constraint. Hence, we might need to select a subset of sub-bins to display. We approach this problem by ranking the bins using AD and IGP, and displaying the top MaxNR bins.

Note that AD represents the benefit of refining a bin into the current sub-bins, whereas IGP estimates the benefit of refining the sub-bins till the highest resolution (original non-binned data). However, both these metrics cannot determine the true information gain possible (refining a bin till the highest resolution).

While it is possible to use either of these metrics to rank the results, we use a commonly-used heuristic of averaging the ranks as a result of using each individually [31, 32, 33], as we found this approach to perform the best (Section IV-F5). The highest ranked results are then displayed.
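The rank-averaging heuristic can be sketched as below. The sort directions are assumptions drawn from the surrounding text: a higher AD is treated as more important, and, following the statement in Section II-D2, a lower IGP is treated as more important. The function names and the scores are hypothetical, and ties are broken by bin index purely as a convenience.

```python
def ranks(scores, higher_is_better):
    """Rank positions (0 = best) for a list of scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    positions = [0] * len(scores)
    for position, index in enumerate(order):
        positions[index] = position
    return positions


def top_bins_by_average_rank(ad_scores, igp_scores, max_nr):
    """Indices of the max_nr bins with the best mean rank across AD and IGP."""
    ad_ranks = ranks(ad_scores, higher_is_better=True)     # assumed direction
    igp_ranks = ranks(igp_scores, higher_is_better=False)  # per Section II-D2
    mean_rank = [(a + g) / 2 for a, g in zip(ad_ranks, igp_ranks)]
    return sorted(range(len(mean_rank)), key=lambda i: mean_rank[i])[:max_nr]


# Hypothetical scores for three bins; keep the best two.
selected = top_bins_by_average_rank([0.9, 0.1, 0.5], [0.1, 0.8, 0.2], max_nr=2)
print(selected)
```

Averaging ranks rather than raw scores avoids having to put AD and IGP on a common scale.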

II-E Generalized Refinement Operator (GRO)

We have seen that a user can indirectly choose the results to display by setting thresholds for the refinement grammar knobs. While we would like to combine these knobs in a conjunctive fashion to ensure that none of them are violated, this might not always be possible. For example, suppose that MinRef results in 100 bins, while MaxNR is set to 50. Clearly, it is not possible to satisfy both constraints.

To solve this problem, we use a simple approach following Occam’s Razor. We rank the knobs based on their intuitiveness to a user and make sure that a more intuitive knob is not violated by a less intuitive one. Knobs are ranked in the following order – refinement levels, number of results, AD, and REC.

II-E1 GRO Algorithm

If MaxNR lies between the number of results obtainable at MinRef and MaxRef, we use AD and then REC to determine the bins to refine further. If the current number of bins is larger than MaxNR, we use the aforementioned result ranking algorithm. If MaxNR is not specified, we refine results till MinRef is satisfied. They are further refined till MaxRef if AD and REC are not violated. If refinement levels are not specified, we refine results as long as AD and REC are not violated. Thus, we can see that GRO is the culmination of all the techniques described so far.
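The priority ordering of the knobs can be illustrated with a simplified decision function. This is a sketch of the ordering only, not the GRO algorithm itself: it omits the result ranking fallback, and the threshold semantics for AD (a minimum) and REC (a maximum), the `Knobs` class, the field names, and the use of infinity for unset upper bounds are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Knobs:
    min_ref: int = 0
    max_ref: float = math.inf   # stand-in for "resolution of the non-binned dataset"
    min_nr: int = 0
    max_nr: float = math.inf
    min_ad: float = 0.0         # assumed: refine only if AD is at least this
    max_rec: float = math.inf   # assumed: refine only if REC is at most this


def should_refine(level, n_now, n_after, ad, rec, knobs):
    """Decide whether to refine from `level` to `level + 1`.

    Knobs are checked in the stated priority order (refinement levels, number
    of results, AD, REC), so a later knob never overrides an earlier one.
    """
    # 1. Refinement levels
    if level < knobs.min_ref:
        return True
    if level >= knobs.max_ref:
        return False
    # 2. Number of results
    if n_after > knobs.max_nr:
        return False
    if n_now < knobs.min_nr:
        return True
    # 3. AD, then 4. REC
    if ad < knobs.min_ad:
        return False
    return rec <= knobs.max_rec


# The conflict from the text: MinRef needs 100 bins while MaxNR is 50.
conflict = Knobs(min_ref=2, max_nr=50)
print(should_refine(1, 50, 100, ad=0.0, rec=0.0, knobs=conflict))
```

In the conflicting case the refinement-level knob wins, so the function returns True even though MaxNR would be exceeded.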

II-F Useful Refinement Operators

In addition to GRO, we also provide simple, single-click operators that serve different purposes.

II-F1 Refine till Highest Resolution

This operator refines results till the non-binned dataset is queried. It follows the progressive refinement principle and provides results over intermediate refinement levels along the way.

II-F2 Run on Highest Resolution

This operator simply runs the query on the non-binned dataset. It allows a user to opt out of progressive refinement.

II-F3 Refine until Non-interesting Results

This operator stops refining a bin when it results in non-interesting sub-bins. We use AD as our interestingness measure, with the interestingness threshold set to in accordance with Weber’s law for detecting visually interesting results [34, 35].

II-G Parallels with Online Aggregation

Progressive refinement provides results having smaller bin sizes over time, without any result error. Of course, there is no free lunch – the uncertainty is encapsulated in the bin sizes. On the other hand, online aggregation provides results with sampling error by using a sample of the underlying dataset. By processing more data over time, the errors usually decrease. Thus, we can draw parallels between these approaches – they both result in errors, which decrease over time. The errors are over the x-axis in the progressive refinement case, while they are over the y-axis in the online aggregation case.

II-H Pros & Cons of Progressive Refinement

While progressive refinement presents a novel, powerful approach, it is not without its pitfalls. We summarize both its benefits and shortcomings below.

II-H1 Pros

Progressive refinement generally reduces the computational load on the system (Section IV-F1). It also decreases the cognitive load on the user, which we define using the number of results displayed to answer a user’s filter and corresponding refine queries (Sections IV-F2 and IV-G5). Its response time is low, and depends on the size of the lowest resolution dataset (Sections IV-F1 and IV-G1). An important consequence of a data binning-based approach is that the size of the underlying data (number of rows) does not greatly affect the size of the binned datasets – they are more affected by the domain size and the bin resolution.

II-H2 Cons

One of the downsides is the offline pre-processing needed to compute the binned datasets. Determining the bins is also not straightforward as different filters are helped by different bin boundaries – a query will be answered precisely only when the filters align with the bins. A binning-based approach is also applicable only for algebraic and distributive measures [36].

III System Architecture

InfiniViz uses the standard client-server architecture, and comprises 3 layers – frontend to query the data and view the results, middleware to translate user queries into progressive refinement queries that can be run on the backend, and backend to run the queries. While we could use any of the previously built binning-based systems, we use Crossfilter [37], as it is well suited to our session-based querying use case, and does not suffer from the cube size explosion problem. While Crossfilter has low response times for session-based queries, it is single-threaded and therefore cannot take advantage of modern multi-core processors – we rectify this issue through the standard technique of horizontal data sharding and parallelization [38].

Fig. 3: InfiniViz System Architecture consists of 3 layers – Frontend, Middleware, and Backend – and employs a novel parallelized crossfilter at the Backend.

III-A System Components

We now describe the various components of InfiniViz in more detail.

III-A1 Frontend

A user interacts with InfiniViz through the frontend. She can issue different filter and refine queries, which are passed on to the middleware, which queries the backend and returns the results to the frontend.

III-A2 Middleware

The middleware, which runs on a Node.js server [39], interprets the user action, determines the queries that must be run, and dispatches them to the backend. Upon receiving the results back, it determines the results that must be displayed to the user, and dispatches them to the frontend.

III-A3 Backend

In an offline pre-processing step, the dataset is binned into multiple smaller datasets, which are then sharded horizontally. A parallelized multi-process crossfilter instance is created for every binned dataset, with each process running a crossfilter on its allocated shard.

At run-time, the backend end-point, termed Co-ordinator, receives a query from the middleware, consisting of a set of filters and the resolution of the binned dataset on which they must be applied. Co-ordinator passes on the query to the Master process of the specified parallelized crossfilter. Upon receiving the results from the Master, it passes on the results to the middleware.

Master: Master passes the query to its workers. Once it receives results from all workers, it aggregates the results, and returns the combined result to the Co-ordinator.

Workers: Upon receiving the filters from its Master, each Worker applies them to its crossfilter and returns the results.
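The Master and Worker roles described above follow a scatter-gather pattern, which the sketch below illustrates. It is a stand-in under stated assumptions rather than the InfiniViz backend: Python threads replace the multi-process Node.js Crossfilter instances, shards are plain lists of dictionaries, the measure is COUNT, and the names `worker`, `master`, `predicate`, and `bin_of` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def worker(shard, predicate, bin_of):
    """Apply the filter to one shard and return its per-bin counts."""
    return Counter(bin_of(row) for row in shard if predicate(row))


def master(shards, predicate, bin_of):
    """Send the query to every worker, then merge the partial results.

    COUNT is a distributive measure, so per-shard counts can simply be summed.
    """
    total = Counter()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for partial in pool.map(lambda s: worker(s, predicate, bin_of), shards):
            total.update(partial)
    return dict(total)


# Hypothetical flight rows, sharded horizontally into two shards.
rows = [
    {"speed": 100, "alt": 500},
    {"speed": 120, "alt": 1500},
    {"speed": 900, "alt": 30000},
    {"speed": 80, "alt": 700},
]
shards = [rows[:2], rows[2:]]
result = master(shards, lambda r: r["speed"] < 200, lambda r: r["alt"] // 1000)
print(result)
```

Merging by summation only works because the measure decomposes across shards, which is the same restriction to algebraic and distributive measures noted in Section II-H2.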

III-B Binning Strategies

While the concept of binning using histograms has a rich history [40, 41], determining the ideal binning strategy has remained an elusive problem. Different strategies have different benefits – wider bins reduce noise for low density areas at the cost of lower precision, while narrower bins provide higher precision for high density areas while increasing the effect of noise.

These considerations for static visualizations are further complicated in the dynamic case. In this context, our progressive refinement concept, which varies bin definitions over time, thereby providing the results at varying resolutions according to the user’s requests, can be seen as an attempt at solving the bin determination problem.

There exist two general strategies for binning – equi-width and equi-data. In the equi-width case, the domain is divided equally between bins, whereas in the equi-data case, each bin consists of approximately equal number of tuples. Our preliminary user studies guided us towards using equi-width binning since it results in the changes to x-axis being smoother and more intuitive, thereby allowing a user to focus on the changes in the y-axis. On the other hand, in the equi-data case, changes occur to the bin ranges on the x-axis, which are more difficult to visually analyze.
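The difference between the two strategies can be seen by computing bin edges for the same data. This is a minimal sketch with hypothetical function names and values; the equi-data edges are taken from simple order statistics, which is one of several reasonable ways to approximate equal tuple counts.

```python
def equi_width_edges(values, n_bins):
    """Edges that divide the value domain into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins + 1)]


def equi_data_edges(values, n_bins):
    """Edges chosen so each bin holds approximately the same number of tuples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(i * n // n_bins, n - 1)] for i in range(n_bins + 1)]


# Hypothetical skewed data: most values are small, with one large outlier.
values = [0, 1, 2, 3, 4, 5, 6, 100]
width_edges = equi_width_edges(values, 2)
data_edges = equi_data_edges(values, 2)
print(width_edges, data_edges)
```

With skewed data the equi-width edges stay evenly spaced while the equi-data edges shift toward the dense region, which illustrates why the x-axis is harder to read in the equi-data case.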

Currently, bins for different dimensions are determined independently of each other as it is unclear whether the added complexity is worthwhile – we use the marginal distribution along a dimension in determining its bins. In the future, we would like to consider prior workloads in determining the bins.

IV Experiments

We evaluated InfiniViz extensively through real and simulated workloads over real and benchmark datasets, using numerous metrics, some of which are the execution time, the number of results displayed, and the number of queries executed and hypotheses tested.

IV-A Experimental Setup

Users interact with InfiniViz through its user interface on the client machine – an Ubuntu Linux 16.04.3 LTS system with a 4-core 3.3GHz Intel Core CPU, 16GB DDR3 @ 1600 MHz memory, and a 256GB @ 7200 RPM disk. The datasets, as given in Table I, are loaded in our parallelized version of Crossfilter 1.3.12 running on Node.js 7.4.0 on our server – an Ubuntu Linux 14.04.1 LTS system with a 24-core 2.4GHz Intel Xeon CPU, 256GB DDR3 @ 1866 MHz memory, and a 500GB @ 7200 RPM disk, which communicates over a 1 Gbps network with the client.

IV-B Datasets

We evaluated InfiniViz using 5 datasets as given in Table I, with 3 of them being real-world datasets – a private aviation dataset (PAD), Flights [42], and Brightkite [43]. SPLOM [23], the standard benchmark in interactive data cubing, was used to generate two datasets having 10M and 1B rows each. To maintain uniformity across datasets in our experiments, each dataset was used to create 5 binned datasets (refinement levels from 0 to 4), with each split generating 2 sub-bins from a bin. As users can query the underlying non-binned dataset as well, this results in a total of 6 refinement levels for each dataset.

IV-C User Study Setup

We designed our user study to understand user behavior in exploration of large datasets through the progressive refinement paradigm, and evaluate the benefits and shortcomings of InfiniViz. Users were asked to explore the PAD dataset and extract possible insights. They were also asked to report any hypotheses that they might be testing, and whether their hypotheses were validated or invalidated by their queries. The participants consisted of 12 graduate students pursuing their PhD. All participants, except 2, were conducting research in the fields of either databases or data mining, and thus had a background in data analysis.

Each user study consisted of two sessions. In one of the sessions, users were asked to explore the dataset using the full-fledged InfiniViz system using filter and refine queries. In the other session, as part of the base case, users analyzed the underlying non-binned dataset using only filter queries, without any of the progressive refinement features. To control for learning and order effects, session order was randomized. Each experiment lasted for a minimum of 5 minutes. Users could continue exploration at the end of the 5 minutes, if they chose to. There was a 5 minute break between the two sessions. All user actions performed during the study were logged. At the end of the sessions, multiple metrics were computed.

IV-D Simulated Querying Setup

In addition to the user study, to study InfiniViz in more detail, we simulated user behavior through queries generated in an automated fashion, resulting in 100 queries for each of the datasets. As modeling complex user behavior is a non-trivial task, we employed a simple, generalized model, where a user changes filters on different plots, and then increases the refinement level incrementally over all visualizations. Each workload was executed 3 times, with the caches being flushed before every run. The presented results are their averages over the 3 runs.

IV-E Workloads

The simulated queries resulted in the following four workloads – PAD_Auto, Flights, Brightkite, and SPLOM_10M. To study the user behavior sessions in concert with the simulated sessions, we modified the user sessions as follows – we inserted refinement queries similar to those described in Section IV-D after every filter query. Refinement queries issued by the user were removed. This gave us the PAD_Progressive and PAD_Base workloads. User sessions, in their non-modified form, are studied in more detail in Section IV-G.

Dataset (# Dimensions)  Refinement Level  Rows    File Size
PAD (8)                 0                 64K     2M
                        1                 1M      37M
                        2                 10M     332M
                        3                 28M     865M
                        4                 38M     1.2G
                        base              50M     1.5G
Flights (6)             0                 755     22K
                        1                 21K     583K
                        2                 869K    23M
                        3                 17M     434M
                        4                 81M     2.0G
                        base              121M    2.5G
Brightkite (4)          0                 336     7K
                        1                 22K     425K
                        2                 574K    11M
                        3                 3M      56M
                        4                 4M      80M
                        base              4.7M    86M
SPLOM_10M (5)           0                 1018    28K
                        1                 10K     275K
                        2                 102K    3M
                        3                 841K    21M
                        4                 4M      102M
                        base              10M     235M
SPLOM_1B (5)            0                 1.5K    43K
                        1                 19K     525K
                        2                 226K    6M
                        3                 2.46M   63M
                        4                 22.78M  566M
                        base*             1B      22G

* Querying individual tuples of the underlying SPLOM_1B dataset is currently not possible in InfiniViz due to the memory requirements of Crossfilter. Hence, the results for SPLOM_1B are not provided – while it is possible to query the binned datasets, the baseline results are unavailable.

TABLE I: Datasets.

IV-F Results

We evaluate the benefits provided by the progressive refinement paradigm over the base case (querying the underlying dataset) exhaustively using multiple, complementary metrics.

IV-F1 Reduction in Computation Time (RCT)

The biggest benefit that progressive refinement provides is the reduction of the execution time as queries do not need to hit the underlying dataset. We define RCT as the ratio of the cumulative time taken to answer a query through the progressive refinement paradigm, to the time taken by the query running over the underlying data. A lower value indicates that the user query was satisfactorily answered at a lower computational cost, while a value larger than 1 indicates that progressive refinement might not have been useful. Figure 4 shows that while the cumulative execution time increases with the refinement level, it is still lower than the time taken to run a query over the non-binned dataset. Further, the initial response time is extremely low for all workloads.
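The RCT definition can be illustrated with a short sketch. The function name `rct_per_level` and the timings are hypothetical; only the ratio itself (cumulative progressive time over base time) comes from the definition above.

```python
from itertools import accumulate

def rct_per_level(level_times, base_time):
    """Return RCT after each refinement level.

    level_times: seconds spent answering the query at levels 0, 1, 2, ...
    base_time:   seconds for the same query over the non-binned dataset.
    """
    if base_time <= 0:
        raise ValueError("base_time must be positive")
    # The cost is cumulative: the user has paid for every level visited so far.
    return [total / base_time for total in accumulate(level_times)]

# Hypothetical timings in seconds, not measurements from the evaluation.
ratios = rct_per_level([0.05, 0.2, 1.0, 3.0], base_time=20.0)
for level, ratio in enumerate(ratios):
    print(f"level {level}: RCT = {ratio:.4f}")
```

A value below 1 at a given level means that stopping there was cheaper than running the query over the underlying data.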

Fig. 4: Computation Time.
Fig. 5: Number of Results.

IV-F2 Reduction in Number of Results (RNR)

By reducing the number of irrelevant results, InfiniViz reduces the cognitive load on the user, allowing her to focus on the more interesting results. We define RNR as the ratio of the number of results shown to the user to the number of results that can be obtained by running the query over the underlying dataset. Figure 5 shows that RNR is low – even at the refinement level of 4, RNR is at least an order of magnitude smaller than 1 for all workloads.
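As a small illustration of the RNR definition, the sketch below sums the results shown across all plots and divides by the corresponding count over the underlying data. The function name `rnr` and the per-plot counts are hypothetical.

```python
def rnr(shown_per_plot, base_per_plot):
    """Ratio of results shown to results obtainable from the base data."""
    shown = sum(shown_per_plot)
    base = sum(base_per_plot)
    if base <= 0:
        raise ValueError("the base query must return at least one result")
    return shown / base

# Hypothetical result counts for three linked plots.
value = rnr(shown_per_plot=[16, 16, 32], base_per_plot=[1200, 900, 4000])
print(f"RNR = {value:.4f}")
print("at least an order of magnitude below 1:", value <= 0.1)
```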

IV-F3 Result Error (RE)

A binned result can be used to estimate its refined results using uniform distribution (Section II-C3). The true value of refined results can be determined by running the query over the underlying non-binned dataset. While the binned results are themselves accurate, RE captures how well the bins reflect the results over the underlying data. We define RE for a bin by , where sub-bins represents the results that lie within the bin that are obtained by running the query over the non-binned dataset. represents the expected refined result under the uniform distribution assumption. Figure 6 shows that RE generally decreases over increasing refinement levels with low enough errors even at the level of 1 for some workloads.
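The uniform-distribution estimate can be sketched as follows for a count aggregate. The error measure used here (mean absolute deviation of the true sub-bin values from the uniform estimate, normalized by that estimate) is an assumed stand-in chosen for illustration and is not claimed to be the RE formula; the function names are hypothetical as well.

```python
def uniform_estimate(bin_value, num_sub_bins):
    """Expected value of each sub-bin if the bin were spread evenly."""
    return bin_value / num_sub_bins

def assumed_result_error(bin_value, sub_bin_values):
    """Assumed error measure: normalized mean absolute deviation
    of the true sub-bin values from the uniform estimate."""
    n = len(sub_bin_values)
    if n == 0:
        raise ValueError("a bin needs at least one sub-bin")
    expected = uniform_estimate(bin_value, n)
    if expected == 0:
        return 0.0  # an empty bin is reproduced exactly by empty sub-bins
    return sum(abs(v - expected) for v in sub_bin_values) / (n * expected)

# Hypothetical bin holding 100 tuples, split into four sub-bins.
print(assumed_result_error(100, [25, 25, 25, 25]))  # evenly spread
print(assumed_result_error(100, [40, 10, 30, 20]))  # skewed
```

An evenly spread bin yields zero error under this measure, so refining it would add no information, while a skewed bin yields a larger value.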

Fig. 6: Result Error.

IV-F4 Anomalous Fraction (AF)

Fig. 7: Anomalous Results.

Analyzing anomalous results is an important part of data analysis. While there exist different complex techniques to determine anomalous results, such as using the Lorenz curve [44], p-value [45], Gini coefficient [46], etc., we use a simple context-dependent metric – we term a result as being anomalous if it is significantly different from its neighbors, i.e. . Figure 7 shows that in most cases, the fraction of anomalous results decreases with increasing refinement levels – the number of results increases with increasing refinement levels, while the number of underlying anomalous results stays constant. We cannot explain the increase in this metric at refinement levels 1 and 2 for some workloads.

IV-F5 Effectiveness of Ranking Techniques

As a user can limit the number of results to display, InfiniViz uses a novel result ranking technique as detailed in Section II-D3. The true importance of a result is determined by its RE using the underlying dataset – those with larger REs are given greater importance. We present Spearman’s Correlation Coefficient [47] for each of our ranking techniques, which as described in Section II-D3, consist of using only AD, or only IGP, or their average rank. Interestingly, Figure 8 shows that using the average rank results in a better ranking scheme than using either of the techniques individually for all workloads.
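The evaluation of a ranking against the true REs can be sketched in a few lines of plain Python. The score vectors below are hypothetical, and the helper names (`ranks`, `spearman`, `average_rank`) are illustrative; how AD and IGP scores are actually computed is described in Section II-D3 and is not reproduced here.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

def average_rank(scores_a, scores_b):
    """Combine two scorings by averaging their per-result ranks."""
    return [(a + b) / 2 for a, b in zip(ranks(scores_a), ranks(scores_b))]

# Hypothetical scores for four results; true_re plays the role of ground truth.
ad = [0.9, 0.1, 0.5, 0.3]
igp = [0.2, 0.4, 0.8, 0.1]
true_re = [0.6, 0.2, 0.9, 0.1]
for name, scores in [("AD", ad), ("IGP", igp), ("average rank", average_rank(ad, igp))]:
    print(name, round(spearman(scores, true_re), 3))
```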

Fig. 8: Ranking Effectiveness.

IV-F6 Relative Entropy Change

We determine the average information loss as a result of not refining till the highest level using the average of RECs (Section II-C4). Figure 9 shows that this metric generally decreases with increasing refinement levels – this is due to the results at higher refinement levels having greater resemblance with the result over underlying dataset, causing entropies to be similar.

Fig. 9: Relative Entropy Change.

IV-F7 Data Sparsity

Since data binning forms an integral part of this paper, we look at data sparsity, i.e. the ratio of the number of rows in a binned dataset to the maximum number of rows possible, which, given cardinality $c_i$ for the $i$-th dimension, is $\prod_{i=1}^{d} c_i$ for $d$ dimensions. Unless there exist $\prod_{i=1}^{d} c_i$ distinct tuples, some bins can be expected to be empty. Due to the curse of dimensionality, we would expect this ratio to decrease with increasing resolution levels, which Figure 10 indeed demonstrates.
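The sparsity ratio is the number of rows actually present divided by the product of the per-dimension cardinalities. A minimal sketch, with hypothetical cardinalities and row counts and an illustrative function name:

```python
from math import prod

def sparsity_ratio(num_rows, cardinalities):
    """Rows present in the binned dataset over the maximum rows possible."""
    max_rows = prod(cardinalities)  # one row per combination of bin values
    if max_rows <= 0:
        raise ValueError("every dimension needs a positive cardinality")
    return num_rows / max_rows

# Hypothetical 4-dimensional dataset at two resolutions.
coarse = sparsity_ratio(3_000, [10, 10, 10, 10])   # 10,000 possible rows
fine = sparsity_ratio(40_000, [20, 20, 20, 20])    # 160,000 possible rows
print(coarse, fine)
```

Doubling the per-dimension cardinality multiplies the number of possible rows by 2^4 = 16 in this example, so the ratio falls unless the number of populated rows grows just as fast.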

Fig. 10: Data Sparsity

IV-G User Study-Specific Results

In this section, we analyze the user study results in a detailed fashion. Note that the sessions consist of user-specified filter and refine queries. We measure multiple metrics for every user query session, and aggregate them over sessions through their average and median. Table II summarily demonstrates how the progressive refinement paradigm improves upon the base case (querying the underlying non-binned dataset). We note that the results are statistically significant (), even for the stronger hypothesis of the metric in the progressive refinement case being greater, or appropriately lesser, than the base case, for all metrics except Session Duration. In discussing the results, we use the median value instead of the average to account for outliers, although both values are similar for most of the metrics.

Metric                  Median           Average          StdDev
                        Prog     Base    Prog     Base    Prog     Base
Avg Query Time (s)      0.07     27.03   0.08     24.02   0.046    17.35
Total Query Time (s)    2.64     147.3   2.64     169.2   1.31     68.16
Query Time Fraction     0.004    0.47    0.004    0.51    0.004    0.144
# Filter Queries        30.5     5       38.7     5.2     22.84    1.3
Session Duration (m)    8.5      5.2     8.5      5.6     2.7      0.89
# Hypotheses            7        1       8.91     1.8     6.2      1.3
TABLE II: User Study Results.

IV-G1 Query Execution Time

Providing low query execution times is a central requirement for an interactive analysis system. Table II shows that the progressive refinement paradigm took a median of 0.07 s to execute a query, compared with 27.03 s taken by the base case. Thus, InfiniViz not only provides a speedup of more than two orders of magnitude over the base case, but does so within interactive response times.

This results in 47% of the session time being spent running queries in the base case, compared with 0.4% for progressive refinement (median Query Time Fraction in Table II). Wasting such a large fraction of an analyst’s time clearly hinders interactivity – as we will see, this has a snowballing effect on the number of queries a user executes in a session and, more importantly, on the number of hypotheses she is able to investigate.
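As a quick check of the magnitudes involved, the median per-query times reported in Table II can be turned into a speedup factor. The variable names are illustrative.

```python
from math import log10

# Median "Avg Query Time (s)" values from Table II.
prog_median_s = 0.07
base_median_s = 27.03

speedup = base_median_s / prog_median_s
orders_of_magnitude = log10(speedup)

print(f"speedup: {speedup:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```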

IV-G2 Filter Queries

We now look at the number of filter queries in each session; Section IV-G5 discusses refinement queries. While a median of 30.5 filter queries could be run through the progressive refinement approach, only 5 could be run in the base case (Table II). This shows that users were often content with the results at lower resolution in the progressive refinement case, and as a result were able to explore multiple hypotheses quickly.

IV-G3 Session Duration

While users were made aware of when their session reached the 5 minute mark, most users continued exploring for a total duration of 8.5 minutes in the progressive refinement case, in comparison with the 5.2 minutes spent in the base case (median values, Table II). We attribute this extra time spent by busy graduate students to their curiosity in analyzing the dataset, utility of the progressive refinement paradigm, and the usefulness of InfiniViz in helping them do so. Note that even after normalizing for the session duration, the number of filter queries issued is significantly larger in the progressive refinement case.
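The normalization by session duration can be approximated from the medians in Table II. This is a rough sketch only: dividing one median by another is not the same as taking the median of per-session rates, which would require the raw session logs.

```python
# Medians from Table II: filter queries per session and session duration (minutes).
prog_queries, prog_minutes = 30.5, 8.5
base_queries, base_minutes = 5, 5.2

prog_rate = prog_queries / prog_minutes  # filter queries per minute
base_rate = base_queries / base_minutes

print(f"progressive: {prog_rate:.2f} queries/min, base: {base_rate:.2f} queries/min")
```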

IV-G4 Hypothesis Testing

An important functionality that any data exploration system should provide is facilitation of hypothesis testing, i.e. being able to quickly form and validate hypotheses. As mentioned in Section IV-C, users informed us of their hypotheses, which they tested through filter and refinement queries. Users were able to test a median of 7 hypotheses through the progressive refinement paradigm, in comparison with a single hypothesis in the base case (Table II). Note that we do not refer to the statistical sense of the term hypothesis testing.

IV-G5 Refinement Queries

In this section, we look at different refinement query statistics. We note that only of the queries refined the data – a large majority of the queries were filters. It is thus imperative that an optimal initial refinement level be chosen in a progressive refinement system – it should be detailed enough to not need further refinement, but low enough to return results within interactive times.

Refinement queries resulted in additional results being displayed. Thus, queries could be answered with fewer results, thereby, reducing users’ cognitive load.

They also resulted in a increase in the execution time. The lower increase in the execution time is due to some refinement queries being only over single plots, while each filter query needs to modify all plots.

IV-G6 Query Complexity

The number of filters (WHERE predicates) is a good indicator of query complexity – more filters indicate a more detailed and more complex query. Table III shows the percentage of queries having different numbers of filters. We can see that in the progressive refinement case, some queries had filters on up to 5 dimensions, while the base case had a maximum of 3 filters in a query – this is to be expected as the iteration speed allows users to test more complex hypotheses. Thus, not only are the users able to issue more queries, but the queries are more complex as well.

# Filters       0      1      2      3     4     5     6   7   8
%Queries_Prog   19.2   54.1   10.8   7.8   4.8   2.9   0   0   0
%Queries_Base   15.3   57.7   21.2   5.7   0     0     0   0   0
TABLE III: Fraction of Queries by Number of Filters.
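A distribution like the one in Table III can be derived from a query log by counting the filters active in each query. The sketch below assumes a hypothetical log format in which every query is represented by the set of dimensions it filters on; the function name is illustrative.

```python
from collections import Counter

def filter_count_distribution(queries, max_filters):
    """Percentage of queries having 0, 1, ..., max_filters active filters."""
    if not queries:
        raise ValueError("the query log is empty")
    counts = Counter(len(filtered_dims) for filtered_dims in queries)
    total = len(queries)
    return [100.0 * counts.get(k, 0) / total for k in range(max_filters + 1)]

# Hypothetical log: each entry is the set of dimensions with an active filter.
log = [set(), {"hour"}, {"hour", "delay"}, {"hour"}]
print(filter_count_distribution(log, max_filters=3))
```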

IV-G7 User Comments

While multiple objective metrics provided above capture different benefits of the progressive refinement paradigm, a subjective metric such as user comments captures the intangibles and provides a complementary, perhaps richer insight. To start off, users unanimously chose our progressive refinement approach over the base case. Some illustrative remarks were: "I will grab my coffee real quick while this is running", "It’s too slow to query the entire data.", "First query and I am already annoyed", "while waiting", "I did not feel seeing so many results was useful.", "ok there it goes. This is pretty much unworkable".

IV-G8 User Behavior & Future Features

Filter queries in a brushing and linking-based system such as ours fall in one of 3 categories – adding a filter (), modifying a filter (), and removing a filter ().

We observed that the most recently added filter was more likely to be removed or modified. This leads us towards an important feature that a data analysis system should provide – user guidance, and in particular, query suggestion.

Some filters resulted in multiple visualizations changing drastically – as a user cannot visually follow changes in multiple visualizations, it is important to be able to highlight results that have undergone significant changes.

It is also important to reduce the number of refinement queries a user needs to perform – a user pointed out "it breaks my flow". This points towards perhaps refining interesting results after a filter query finishes execution, by default.

A progressive refinement system is thus highly affected by having an optimal initial refinement level – there does not seem to be an ideal strategy for determining it. We used the heuristic of setting an interactive response time threshold and using the largest dataset that was able to meet it for most of the filter queries. We have set this threshold for InfiniViz at s – while the interactive response time threshold can vary with datasets and systems, research has demonstrated the need for it to be no more than ms [5].
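The heuristic described above can be sketched as follows. The function name, the interpretation of "most" as more than half of the observed filter queries, and the threshold value in the example are all assumptions made for illustration; the threshold actually used by InfiniViz is not restated here.

```python
def pick_initial_level(latencies_by_level, threshold_s, min_fraction=0.5):
    """Return the highest refinement level whose filter queries mostly
    finish within threshold_s seconds, or None if no level qualifies.

    latencies_by_level: dict mapping level -> list of observed latencies (s).
    min_fraction: share of queries that must meet the threshold
                  (assumed reading of "most": strictly more than half).
    """
    best = None
    for level in sorted(latencies_by_level):
        samples = latencies_by_level[level]
        if not samples:
            continue  # no measurements for this level
        within = sum(1 for t in samples if t <= threshold_s) / len(samples)
        if within > min_fraction:
            best = level  # keep the largest dataset that still qualifies
    return best

# Hypothetical latency measurements per refinement level, in seconds.
observed = {
    0: [0.01, 0.02, 0.02],
    1: [0.05, 0.08, 0.10],
    2: [0.20, 0.30, 0.70],
    3: [0.60, 0.90, 1.40],
}
print(pick_initial_level(observed, threshold_s=0.5))  # placeholder threshold
```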

IV-H Parallel Crossfilter Speedup

Fig. 11: Parallel Crossfilter Speedup

Section III describes InfiniViz’s system architecture – here, we look at the speedup as a result of our novel parallelized crossfilter design. We can see that parallelization can provide speedup of up to for higher refinement levels – at lower levels, due to lower execution times, the additional cost of parallelization slightly outweighs the resultant speedup. This guides us towards an interesting architectural design for the future – using non-parallelized crossfilter at lower resolutions and a parallelized crossfilter for higher resolutions, with the crossover point being determined empirically. Note that we use the standard webworkers package [48] for Node.js. It does not provide linear speedup with increasing number of cores – we profiled this to be the cause of the overall sub-linear speedup. Other packages are limited in their support for closures and are unsuitable for our purposes. In the future, we would like to develop a scalable webworkers package, which would benefit not only our parallelization framework, but more importantly, the broader web development community.
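The hybrid design suggested above, with an empirically determined crossover point, could be prototyped along these lines. The timings are hypothetical and the function names are illustrative; the sketch does not touch crossfilter or any worker library, it only shows how a crossover level might be derived from measured per-level execution times.

```python
def crossover_level(serial_times, parallel_times):
    """Lowest level from which the parallel path is faster at that level
    and at every higher level; None if it never wins at the top level."""
    if len(serial_times) != len(parallel_times):
        raise ValueError("need one timing per level for both engines")
    crossover = None
    # Walk down from the highest level while parallel execution still wins.
    for level in reversed(range(len(serial_times))):
        if parallel_times[level] < serial_times[level]:
            crossover = level
        else:
            break
    return crossover

def choose_engine(level, crossover):
    """Pick the engine for a query at the given refinement level."""
    if crossover is not None and level >= crossover:
        return "parallel"
    return "serial"

# Hypothetical per-level execution times in seconds (levels 0..4).
serial = [0.01, 0.05, 0.40, 2.00, 6.00]
parallel = [0.03, 0.06, 0.30, 0.90, 2.00]
point = crossover_level(serial, parallel)
print(point, [choose_engine(level, point) for level in range(5)])
```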

V Related Work

While data cubes [22] expedite analytical queries over large datasets, their size increases exponentially with dimension cardinalities, thereby increasing the time needed for their construction and the space needed to store them. However, more importantly from an interactive querying perspective, it affects their ability to help query large datasets within interactive response times.

Sampling can help scale to large datasets by running queries on a representative sample of the data [6, 7, 8, 9]. However, sampling introduces multiple issues in the analysis process, including sampling error, its interpretation, and visualization. Online aggregation [17, 18, 19, 20, 21] builds upon sampling by providing results whose measure error generally decreases over time, as a result of processing more data. Our approach is orthogonal to online aggregation – while online aggregation decreases error over the y-axis, progressive refinement does so over the x-axis.

Data binning for the purpose of visualizing large datasets is intuitive and has been employed by numerous systems such as Profiler [23], imMens [24], and Nanocubes [25], which construct data cubes over binned datasets. This reduces the size of the resultant cubes. However, due to the data cube size explosion problem, Profiler and imMens have been documented to scale up to 5 and 4 linked views, respectively. Nanocubes [25] allows refinement up to a fairly high-resolution version of the visualizations – unlike InfiniViz, however, it does not allow drilling down to the individual records. While Nanocubes is able to reduce the size of the data cube by the sharing factor through smart indexing, it cannot deal with the inherent theoretical data cube size explosion problem discussed earlier. By avoiding building cubes over the dataset through crossfilter, InfiniViz sidesteps this problem – while execution time will be low for the session-based querying scenario where subsequent queries are related to each other, it will be comparatively higher for random user queries.

While these systems provide the standard refinement-level based operator, their focus is different from ours. Profiler helps assess quality issues in the data, imMens incorporates parallel query processing through GPUs to visualize multivariate data tiles, and Nanocubes extends Dwarfcubes for spatiotemporal data and aims to greatly reduce the cube size. In contrast, we investigate the refinement operator in detail, and study the effect of the progressive refinement paradigm on the user. We treat refinement as a first-class citizen and evaluate the consequent benefits of using refinement as a central operator. We enrich the standard refinement operator by allowing for enhancements such as limiting the number of results, and taking the information content of the visualization into account in determining the results. Our strategies represent a middleware layer – InfiniViz could have used any of these systems as its backend, given considerable engineering effort.

Hashedcubes [51] provides an alternative to Nanocubes by using a more compact representation and a simpler implementation. Dwarfcubes [52] laid foundations for compression techniques for data cubes, which Nanocubes enriches. Other systems such as M4 [53, 54], ScalaR [55], Forecache [56], etc. modify user queries by taking the screen resolution into consideration to not only reduce the work done at the backend, but also in transmission of the results over network. VisReduce [57] incrementally computes visualizations in a distributed environment.

Recent approaches have looked at incorporating sampling into visualizations. VAS [58] provides high-quality visualizations using a small subset of the data. Pangloss [14] enumerates numerous visualization issues in approximate query processing. Kwon et al. make the case for using sampling in visualizations and detail numerous issues important to sampling-based visual analytics [59]. ProgressiVis enables changes at the language and library level to support exploratory analysis systems [60]. IncVisage [61] builds on the zenvisage system [62], by using sampling to quickly reveal salient features of a visualization, while minimizing the result error. Unlike InfiniViz, none of these systems investigate result refinement through binning, with sampling being their preferred tool for building interactive visualization systems.

VI Conclusion & Future Work

InfiniViz provides novel techniques for interactive visualization-driven analysis of large datasets, through our querying approach of progressive refinement, where we treat result refinement as a first-class operator. Our visualizations consequently are devoid of any error in the measure, with information loss being contained in the bin sizes. Progressive refinement has been highly influenced by the Progressive Transmission technique – transmitting successively refined images to users using Image Pyramids – prevalent in the early days of the internet with low network speeds [63]. Progressive refinement also draws parallels with online aggregation, as it reduces error over time over the x-axis as opposed to the y-axis, and thus presents a new paradigm in approximate querying. We demonstrate theoretically as well as empirically that both the computational load on the system and the cognitive load on the user are reduced – this results in the user being able to execute an order of magnitude more queries. The queries also possess more complexity. This culminates in the users being able to test a significantly higher number of hypotheses.

In the future, we would like to take query sessions into consideration in constructing the binned datasets. We would also like to develop a cost-based optimizer to determine results to display, as opposed to our current rule-based approach. We would also like to improve the standard webworkers package for Node.js to provide linear speedups through parallelism. While InfiniViz currently supports one-dimensional aggregations, extending it towards two dimensions (heatmaps) is theoretically straightforward. Other avenues for future work include speculative query execution to further speed up querying and user guidance [64]. We would also like to explore possibilities for integrating progressive refinement with online aggregation.

References

  • [1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs et al., “Big Data: The Next Frontier for Innovation, Competition, and Productivity,” 2011.
  • [2] F. Provost and T. Fawcett, “Data Science and its Relationship to Big Data and Data-driven Decision Making,” Big Data, 2013.
  • [3] H. Chen, R. H. Chiang, and V. C. Storey, “Business Intelligence and Analytics: From Big Data to Big Impact,” MIS quarterly, 2012.
  • [4] B. Shneiderman, “Response Time and Display Rate in Human Performance with Computers,” CSUR, 1984.
  • [5] Z. Liu and J. Heer, “The Effects of Interactive Latency on Exploratory Visual Analysis,” TVCG, 2014.
  • [6] F. Olken, “Random Sampling from Databases,” 1993.
  • [7] S. Chaudhuri, R. Motwani, and V. Narasayya, “On Random Sampling over Joins,” SIGMOD, 1999.
  • [8] S. Agarwal, B. Mozafari et al., “BlinkDB: Queries with Bounded Errors and Bounded Response Times on very Large Data,” Eurosys, 2013.
  • [9] S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, and B. Ding, “Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters,” SIGMOD, 2016.
  • [10] S. Belia, F. Fidler et al., “Researchers Misunderstand Confidence intervals and Standard Error Bars,” Psychological Methods, 2005.
  • [11] G. Cumming, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis.   Routledge, 2013.
  • [12] N. Ferreira, D. Fisher, and A. C. Konig, “Sample-oriented Task-driven Visualizations: Allowing Users to Make Better, More Confident Decisions,” SIGCHI, 2014.
  • [13] C. Olston and J. D. Mackinlay, “Visualizing Data with Bounded Uncertainty,” INFOVIS, 2002.
  • [14] D. Moritz, D. Fisher, B. Ding, and C. Wang, “Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data,” CHI, 2017.
  • [15] P. J. Haas, J. F. Naughton, S. Seshadri et al., “Selectivity and Cost Estimation for Joins Based on Random Sampling,” JCSS, 1996.
  • [16] P. J. Haas, “Hoeffding Inequalities for Join-Selectivity Estimation and Online Aggregation,” Research Report, 1996.
  • [17] J. M. Hellerstein et al., “Online Aggregation,” SIGMOD, 1997.
  • [18] P. J. Haas et al., “Ripple Joins for Online Aggregation,” SIGMOD, 1999.
  • [19] C. Jermaine, A. Dobra, S. Arumugam, S. Joshi, and A. Pol, “A Disk-based Join with Probabilistic Guarantees,” SIGMOD, 2005.
  • [20] S. Nirkhiwale, A. Dobra, and C. Jermaine, “A Sampling Algebra for Aggregate Estimation,” VLDB, 2013.
  • [21] F. Li, B. Wu, K. Yi, and Z. Zhao, “Wander Join: Online Aggregation via Random Walks,” SIGMOD, 2016.
  • [22] J. Gray et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-tab, and Sub-totals,” DMKD, 1997.
  • [23] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer, “Profiler: Integrated statistical analysis and visualization for data quality assessment,” AVI, 2012.
  • [24] Z. Liu, B. Jiang, and J. Heer, “imMens: Real-time Visual Querying of Big Data,” Computer Graphics Forum, 2013.
  • [25] L. Lins, J. T. Klosowski, and C. Scheidegger, “Nanocubes for Real-Time Exploration of Spatiotemporal Datasets,” TVCG, 2013.
  • [26] D. Ariely, “Seeing Sets: Representation by Statistical Properties,” Psychological science, 2001.
  • [27] G. A. Alvarez and A. Oliva, “Spatial Ensemble Statistics Are Efficient Codes That Can Be Represented with Reduced Attention,” Proceedings of the National Academy of Sciences, 2009.
  • [28] D. Keim, G. Andrienko, J.-D. Fekete et al., “Visual analytics: Definition, process, and challenges,” Lecture notes in computer science, 2008.
  • [29] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, “Summarizing Visual Data Using Bidirectional Similarity,” CVPR, 2008.
  • [30] T. Palpanas and N. Koudas, “Entropy based Approximate Querying and Exploration of Datacubes,” SSDBM, 2001.
  • [31] P. C. Fishburn, “On the Sum-of-ranks Winner When Losers are Removed,” Discrete Mathematics, 1974.
  • [32] S. Siegel and J. W. Tukey, “A Nonparametric Sum of Ranks Procedure for Relative Spread in Unpaired Samples,” Journal of the American Statistical Association, 1960.
  • [33] L. Festinger, “The Significance of Difference between Means without Reference to the Frequency Distribution Function,” Psychometrika, 1946.
  • [34] E. H. Weber, EH Weber on the tactile senses.   Psychology Press, 1996.
  • [35] G. E. Legge, “A Power Law for Contrast Discrimination,” Vision Research, 1981.
  • [36] E. Malinowski et al., Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications.   Springer, 2008.
  • [37] squareup.com, “Crossfilter,” square.github.io/crossfilter/, 2012.
  • [38] I. Green, “Web workers: Multithreaded programs in javascript,” 2012.
  • [39] S. Tilkov and S. Vinoski, “Node. js: Using JavaScript to Build High-performance Network programs,” IEEE Internet Computing, 2010.
  • [40] K. Pearson, “Contributions to the mathematical theory of evolution,” Philosophical Transactions of the Royal Society of London. A, 1894.
  • [41] D. Howitt and D. Cramer, “Introduction to statistics in psychology,” Pearson Education, 2007.
  • [42] A. S. D. Expo, “Flights Dataset,” http://stat-computing.org/dataexpo/2009, 2009.
  • [43] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and Mobility: User Movement in Location-based Social Networks,” SIGKDD, 2011.
  • [44] K. McGarry, “A Survey of Interestingness Measures for Knowledge Discovery,” The Knowledge Engineering Review, 2005.
  • [45] R. L. Wasserstein and N. A. Lazar, “The ASA’s Statement on p-values: Context, Process, and Purpose,” Taylor & Francis, 2016.
  • [46] C. Gini, “Concentration and dependency ratios,” Rivista di Politica Economica, 1997.
  • [47] J. L. Myers, A. Well, and R. F. Lorch, “Research Design and Statistical Analysis,” Routledge, 2010.
  • [48] A. Tang. (2012) Node Webworkers. [Online]. Available: https://www.npmjs.com/package/webworker-threads
  • [49] M. Bostock, “D3. js,” Data Driven Documents, 2012.
  • [50] X. Li, J. Han, Z. Yin, J.-G. Lee, and Y. Sun, “Sampling Cube: A Framework for Statistical OLAP over Sampling Data,” SIGMOD, 2008.
  • [51] C. A. Pahins, S. A. Stephens, C. Scheidegger, and J. L. Comba, “Hashedcubes: Simple, Low Memory, Real-time Visual Exploration of Big Data,” TVCG, 2017.
  • [52] Y. Sismanis, A. Deligiannakis, Y. Kotidis, and N. Roussopoulos, “Hierarchical Dwarfs for the Rollup Cube,” Proceedings of the 6th ACM international workshop on Data warehousing and OLAP, 2003.
  • [53] U. Jugel, Z. Jerzak, G. Hackenbroich et al., “M4: A Visualization-oriented Time Series Data Aggregation,” VLDB, 2014.
  • [54] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl, “VDDA: Automatic Visualization-driven Data Aggregation in Relational Databases,” The VLDB Journal, 2016.
  • [55] L. Battle, M. Stonebraker, and R. Chang, “Dynamic reduction of query result sets for interactive visualizaton,” Big Data, 2013 IEEE International Conference on, 2013.
  • [56] L. Battle, R. Chang, and M. Stonebraker, “Dynamic Prefetching of Data Tiles for Interactive Visualization,” SIGMOD, 2016.
  • [57] J.-F. Im, F. G. Villegas, and M. J. McGuffin, “VisReduce: Fast and Responsive Incremental Information Visualization of Large Datasets,” Big Data, 2013 IEEE International Conference on, 2013.
  • [58] Y. Park, M. Cafarella, and B. Mozafari, “Visualization-aware Sampling for Very Large Databases,” ICDE, 2016.
  • [59] B. C. Kwon, J. Verma, P. J. Haas et al., “Sampling for Scalable Visual Analytics,” IEEE Computer Graphics and Applications, 2017.
  • [60] J.-D. Fekete and R. Primet, “Progressive analytics: A Computation Paradigm for Exploratory Data Analysis,” arXiv preprint arXiv:1607.05162, 2016.
  • [61] S. Rahman, M. Aliakbarpour, H. K. Kong, E. Blais, K. Karahalios, A. Parameswaran, and R. Rubinfield, “I’ve Seen “Enough”: Incrementally Improving Visualizations to Support Rapid Decision Making,” VLDB, 2017.
  • [62] T. Siddiqui, A. Kim, J. Lee, K. Karahalios, and A. Parameswaran, “Effortless data exploration with zenvisage: an expressive and interactive visual analytics system,” VLDB, 2016.
  • [63] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, “Pyramid Methods in Image Processing,” RCA Engineer, 1984.
  • [64] K. Dimitriadou, O. Papaemmanouil, and Y. Diao, “Explore-by-example: An automatic query steering framework for interactive data exploration,” SIGMOD, 2014.