1 Interaction/Demo Scenario
We first present two usage scenarios that demonstrate the two classes of queries supported in ScatterSearch; we then describe the system in more detail in the next section.
Scenario #1: Region Query for Sports Analytics
Jane is a sports analyst interested in how different plays and strategies are used by NCAA basketball teams in different parts of the basketball court. She loads the NCAA play-by-play data (console.cloud.google.com/marketplace/details/ncaa-bb-public/ncaa-basketball) into ScatterSearch and examines the players' positions on the court, broken down by event type (event_type). She sees an overview of all the datapoints plotted on the querying canvas and notices a dense region near the free-throw line. She draws a region query on the canvas (Figure 1B, green region) and finds that many of these events were indeed free throws, but that there were also rebounds near the free-throw line. To look at shots that were made in the game, she switches the category to shot_type. She finds it difficult to compare the relative locations of the different shot types on the court, so she selects the option to fix the x and y extent of the resulting visualizations. She learns that while dunks, layups, hook shots, and tip shots happen in close proximity to the hoop, jump shots can happen almost anywhere on the court and account for most of the three-pointers. She is now interested in finding teams that make a lot of three-pointers. She selects the region around the three-point line and finds that the Tigers, Wildcats, and Bulldogs have many plays around the three-point line.
Scenario #2: Query-by-Visualization for Social Data Analysis
John is a social scientist interested in relationships between different socio-economic factors and their connection with crime rates across different communities. He examines a sampled dataset of 190 visualizations generated from the pairwise combinations of 20 different quantitative attributes in the dataset (archive.ics.uci.edu/ml/datasets/communities+and+crime). By first browsing through the data, he finds that the percentage of households with rent and investment incomes is inversely correlated with the percentage of households that are on public assistance. He wants to find other attribute pairs that show a similar trend, so he drags and drops this visualization onto the canvas (Figure 1B) as a query. He finds several variables that are inversely correlated with elderly retirement communities, such as the percentage of the population between ages 12 and 29 (agePct12t29) and household size (householdsize). He also finds that as the family median income in a community increases, its need for public assistance decreases.
2 System Description
, a VQS for line charts. Users can either issue a region query or query by dragging and dropping an existing visualization. Based on the selected axes, the query manager generates a collection of scatterplot visualizations by retrieving the data values from the underlying database. Each scatterplot consists of a series of X, Y datapoints. These raw values can be preprocessed to improve the downstream tasks. For example, normalization and outlier removal accentuate the key patterns in the scatterplot, and sampling decreases the amount of data that needs to be processed. Next, each scatterplot is further processed into a representation that facilitates effective similarity computation with the queried scatterplots. For example, a scatterplot can be binned into a heatmap matrix, parametrized into a functional representation, or represented as a graph or polygon. In the current version of ScatterSearch, we use the raw representation for region queries and the heatmap representation for query-by-visualization. The scatterplot representations are inputs to the scoring functions, which produce a score indicating how well a scatterplot satisfies the query. Finally, the scatterplots are sorted by this score and displayed as a ranked list to the user.
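The preprocessing and binning steps described above can be sketched as follows. This is a minimal illustration, not the ScatterSearch implementation: the function names `normalize` and `to_heatmap`, the choice of min-max scaling, and the use of point fractions as cell values are all assumptions made for the example. Outlier removal and sampling are omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalize(points: Sequence[Point]) -> List[Point]:
    """Min-max scale x and y independently into [0, 1] (assumed scheme)."""
    if not points:
        return []
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Fall back to a span of 1.0 when an axis is constant, to avoid dividing by zero.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0
    return [((x - x_min) / x_span, (y - y_min) / y_span) for x, y in points]


def to_heatmap(points: Sequence[Point], bins: int) -> List[List[float]]:
    """Bin points that lie in [0, 1] x [0, 1] into a bins x bins grid.

    Each cell holds the fraction of points that fall into it, so
    scatterplots with different numbers of points stay comparable.
    """
    grid = [[0.0] * bins for _ in range(bins)]
    if not points:
        return grid
    for x, y in points:
        # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        grid[row][col] += 1.0
    n = len(points)
    return [[count / n for count in row] for row in grid]
```

For instance, `to_heatmap(normalize(raw_points), 8)` would yield an 8 x 8 density grid for one scatterplot, which could then be passed to a scoring function.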
Scoring Metrics: For query-by-visualization, we developed a novel multi-level Euclidean distance that operates on the heatmap matrix representations of the scatterplots. To compare two scatterplots, we first generate a set of heatmap matrices at several resolutions for each scatterplot. We then compute the distance between the two scatterplots' corresponding grids at each resolution. The resulting distances are aggregated into a single score as a weighted sum, with one constant weight per resolution.
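A rough sketch of such a multi-level distance is shown below. The specific resolutions and weights used in ScatterSearch are not recoverable from the text, so the values `(2, 4, 8)` and `(0.5, 0.3, 0.2)` are placeholders chosen for illustration, as are the helper names. The sketch assumes that points have already been scaled into the unit square.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def heatmap(points: Sequence[Point], bins: int) -> List[List[float]]:
    """Fraction of points per cell on a bins x bins grid over [0, 1] x [0, 1]."""
    grid = [[0.0] * bins for _ in range(bins)]
    for x, y in points:
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        grid[row][col] += 1.0 / len(points)
    return grid


def euclidean(a: List[List[float]], b: List[List[float]]) -> float:
    """Euclidean distance between two equally sized grids."""
    return math.sqrt(
        sum((u - v) ** 2 for row_a, row_b in zip(a, b) for u, v in zip(row_a, row_b))
    )


def multi_level_distance(
    p: Sequence[Point],
    q: Sequence[Point],
    resolutions: Sequence[int] = (2, 4, 8),          # assumed grid sizes
    weights: Sequence[float] = (0.5, 0.3, 0.2),      # assumed; coarse grids weigh more
) -> float:
    """Weighted sum of per-resolution Euclidean distances (lower = more similar)."""
    if len(resolutions) != len(weights):
        raise ValueError("need exactly one weight per resolution")
    return sum(
        w * euclidean(heatmap(p, r), heatmap(q, r))
        for r, w in zip(resolutions, weights)
    )
```

With these placeholder settings, two scatterplots whose points fall into the same coarse cells but different fine cells receive a smaller distance than two scatterplots that already differ on the coarsest grid.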
The intuition for computing a score across multiple resolutions is that coarse-grained grids give a general sense of the scatterplot density at different locations and are cheaper to compute than fine-grained grids. Accordingly, we assign higher weights to coarse-grained grids: if a pair of scatterplots is not "roughly" similar on a coarse-grained grid, the pair is unlikely to be similar at a finer level or overall. For the region query, we use the number of points that lie inside the selected region as the score.
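The region-query score can be illustrated with a short sketch. The shape of the drawn region is not specified in the text, so an axis-aligned rectangle is assumed here for simplicity; `region_score`, `rank_by_region`, and the example category names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed shape


def region_score(points: Sequence[Point], region: Rect) -> int:
    """Count the points that fall inside the rectangle (edges inclusive)."""
    x_min, y_min, x_max, y_max = region
    return sum(1 for x, y in points if x_min <= x <= x_max and y_min <= y <= y_max)


def rank_by_region(
    plots: Dict[str, Sequence[Point]], region: Rect
) -> List[Tuple[str, int]]:
    """Score every scatterplot and return (name, score) pairs, highest score first."""
    scores = [(name, region_score(pts, region)) for name, pts in plots.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical per-category scatterplots and a region drawn near the origin.
plots = {
    "free_throw": [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0)],
    "rebound": [(1.0, 1.0), (6.0, 6.0)],
    "dunk": [(9.0, 9.0)],
}
ranking = rank_by_region(plots, (0.0, 0.0, 2.0, 2.0))
```

A raw count favors scatterplots with many points overall; whether ScatterSearch normalizes by the total number of points is not stated, so this sketch leaves the count unnormalized.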
3 Conclusion & Future Work
We presented ScatterSearch, a novel VQS for scatterplot visualizations. To our knowledge, ScatterSearch is the first end-to-end system that enables users to visually query for scatterplots of interest. Our preliminary prototype serves as an experimentation framework to evaluate the effect of different query specification interfaces, preprocessing procedures, representations, and similarity metrics. With this modular framework, our next step is to work with real-world users and datasets to evaluate the efficacy of ScatterSearch's pattern matching capabilities and to understand the types of queries that users are interested in when searching for scatterplots. We hope to gather formative feedback from the VIS community through the poster demonstration to improve this work.