Scatterplots are an important type of visualization used extensively in data science and visual analytic systems. Objects in a dataset are visualized on a 2D Cartesian plane, with the dimensions being two quantitative attributes from the objects. Each object can be represented as a point, polygon or other mark. Aggregation-based marks (e.g. pie chart, heatmap) can also be used to represent groups of objects. The user of a scatterplot can perform a variety of tasks to provide insights into the underlying data, such as discovering global trends, inspecting individual objects or characterizing distributions[sarikaya2017scatterplots].
Despite the usefulness of static scatterplots, they suffer from a significant overdraw problem on big datasets[perrot2015large, mayorga2013splatterplots]. Here, we focus on scatterplots with millions to billions of objects, where significant overlap of marks is unavoidable, making the visualization ineffective. To address this issue in scatterplots, there has been substantial research[heimerl2018visual, jo2018declarative, liu2013immens, lins2013nanocubes] on devising aggregation-based scatterplots using visual aggregates such as contours or hexagon bins. While avoiding visual clutter, these approaches do not support inspecting individual objects, which is a fundamental scatterplot task[sarikaya2017scatterplots]. Prior works also used transparency[fekete2002interactive, kosara2002focus+], animation[chen2018using] and displacements of objects[waldeck2004mobile, keim1998gridfit, trutschl2003intelligently] to ease the overdraw problem. However, due to limited screen resolution, these methods have scalability limits.
On the other hand, the use of zooming in scatterplots has the potential to effectively mitigate visual clutter. By expanding the 2D Cartesian plane into a series of zoom levels with different scales, more screen resolution becomes available, allowing for object layouts that avoid occlusion and excessive density. Inspecting large numbers of objects thus becomes feasible. Aggregation-based marks such as circles or heatmaps can still be used to visualize groups of objects. The teaser figure shows such a visualization created by the system we introduce in this paper. It shows one billion comments made by users on Reddit.com, where $x$ is the posting time and $y$ is the number of characters in the comment. Additional examples are in Figure 1. For simplicity, we term such visualizations scalable scatterplot visualizations, or SSV.
There has been significant work on building systems/toolkits to aid the creation of SSVs (e.g.[tao2019kyrix, bederson1994pad++, guo2018efficient, das2012efficient]). Specifically, prior systems can be classified into two categories: general pan/zoom systems and specialized SSV systems. General pan/zoom systems are typically expressive, supporting not only SSVs, but also pan/zoom visualizations of other types of data (e.g. hierarchical and temporal data) or that connect multiple 2D semantic spaces (footnote 1: A 2D semantic space consists of zoom levels sharing the same coordinate system and visualizing the same type of objects. An SSV has only one semantic space. General pan/zoom systems typically allow “semantic jumping” from one semantic space to another[tao2019kyrix, pietriga2005toolkit], e.g. from a space of Reddit comments to a space of Reddit forums). Specialized SSV systems (e.g. [guo2018efficient, chen2014visual]), on the other hand, generally have a narrow focus on SSVs.
While these systems have been shown to be effective, they can suffer from some drawbacks that limit their ability to support general SSV authoring at scale. In particular, limited scalability is a common drawback of both types of systems. As often as not, implementations assume all objects reside in the main memory of a computer[chen2014visual, mayorga2013splatterplots, liao2017cluster, guo2018efficient, lins2013nanocubes, miranda2017topkube, bederson1994pad++, pietriga2005toolkit].
General pan/zoom systems, while being flexible, generally incur too much developer work due to their low-level nature. When authoring an SSV, the developer needs to manually generate the layout of visual marks on zoom levels. In very large datasets, there will be many levels (e.g. Google Maps has 20). Individually specifying the layout of a set of levels is tedious and error-prone. In particular, big or skewed data can make it challenging for the developer to specify a layout that avoids occlusion and excessive density in the visualization.
Another drawback of specialized SSV systems is low flexibility. Oftentimes systems are hardcoded for specific scenarios (e.g., supporting specific types of visual marks such as heatmaps[perrot2015large, lins2013nanocubes] or points[das2012efficient, chen2014visual], enforcing a density budget but not removing overlap, etc.) and are not extensible to general use cases. The developer cannot make free design choices when using these systems, and is forced to constantly switch tools for different application requirements.
In this paper, we describe Kyrix-S (footnote 2: The birth of Kyrix-S is driven by the limitations we see when we use Kyrix[tao2019kyrix], a general pan/zoom system we have developed, to build real-world SSV-based applications. The name Kyrix-S here suggests that we implement Kyrix-S as an extension of Kyrix for SSVs, rather than a replacement. S may suggest scale, scatterplots or spatial partitioning. More detailed discussion on the relationship between the two systems can be found in Sections 2 and 7.), a system for SSV authoring at scale which addresses the issues of existing systems identified above. To enable rapid authoring, we present a high-level declarative grammar for SSVs. We abstract away low-level details such as rendering of visual marks so that the developer can author a complex SSV in a few tens of lines of JSON. We show that compared to a state-of-the-art system, this is a 4X–9X reduction in specification on several examples. In addition, we build a gallery of SSVs to show that our grammar is expressive and that the developer can easily extend it to add their own visual marks.
This grammar for SSVs is supported by an algorithm that automatically chooses the layout of visual marks on all zoom levels, thereby freeing the developer from writing custom code. We store objects in a multi-node parallel database using multi-node spatial indexing. As we show in Section 8, this allows us to respond to any pan/zoom action in under 500ms on datasets with billions of objects.
To summarize, we make the following contributions:
2 Related Works
2.1 General Pan/zoom Systems
A number of systems have been developed to aid the creation of general pan/zoom visualizations[bederson1994pad++, bederson2003jazz, pietriga2005toolkit, tao2019kyrix]. These systems are expressive and capable of producing not only SSVs, but also pan/zoom visualizations of other types of data (e.g. hierarchical, temporal, etc) or with multiple semantic spaces connected by semantic zooms[pietriga2005toolkit]. However, as mentioned in the introduction, these systems fall short in supporting SSVs due to limited scalability and too much developer work.
Kyrix[tao2019kyrix] is a general pan/zoom system we have developed. Here, we summarize the novel aspects of Kyrix-S compared to Kyrix and similar systems (e.g., ForeCache [battle2016dynamic], Nanocubes [lins2013nanocubes], imMens [liu2013immens]):
Kyrix-S provides a high-level grammar for SSVs, which enables much shorter specification than what Kyrix’s grammar requires for the same SSV (see Section 8.2 for an empirical comparison);
Kyrix-S implements a layout generator which frees the developer from deciding the layout of objects on zoom levels. Kyrix does not assist the developer in choosing an object layout, which makes authoring SSVs using Kyrix fairly challenging;
Kyrix-S is integrated with a distributed database which scales to billions of objects. In contrast, Kyrix only works with a single-node database which cannot scale to billions of objects.
Note that Kyrix-S has a narrow focus on SSVs and is not intended to completely replace general pan/zoom systems. As we will discuss more in Section 7, we implement Kyrix-S as an extension to Kyrix.
2.2 Specialized SSV Systems
There has been considerable effort made to develop specialized SSV systems, which mainly suffer from two limitations: low flexibility and limited scalability.
Many systems focus on a small subset of the SSV design space, and are not designed/coded to be easily extensible. For example, many focus on specific visual marks such as small-sized dots (e.g.[das2012efficient, chen2014visual, kefaloukos2014declarative]), heatmaps (e.g.[perrot2015large, lins2013nanocubes, pahins2016hashedcubes, miranda2017topkube, liu2019smartcube]), text[cartolabe], aggregation-based glyphs[liao2017cluster, beilschmidt2017linear] and contours[mayorga2013splatterplots]. Some works maintain a visual density budget[das2012efficient, perrot2015large, guo2018efficient], while some focus on overlap removal[beilschmidt2017linear, chen2014visual, derthick2003constant]. In contrast to these systems, Kyrix-S aims at a much larger design space. We provide a diverse library of visualization templates that are suitable for a variety of scatterplot tasks. For high extensibility, Kyrix-S’s declarative grammar is designed with extensible components for authoring custom visual marks.
In addition to the limited focus, most specialized SSV systems cannot scale to large datasets with billions of objects due to an in-memory assumption[leafletcluster, delort2010vizualizing, guo2018efficient, mayorga2013splatterplots, chen2014visual, lekschas2019pattern, drosou2012disc, nutanong2012multiresolution]. We are only aware of the work by Perrot et al.[perrot2015large] which renders large heatmaps using a distributed computing framework. However, that work only focuses on heatmaps.
Specialized SSV systems generally come with a layout generation module which computes the layout of visual marks on each zoom level. The design of Kyrix-S’s layout generation is inspired by many of them and bears similarities in some aspects. For example, favoring placements of important objects on top zoom levels is adopted by many works[guo2018efficient, cartolabe, das2012efficient]. The idea of enforcing a minimum distance between visual marks comes from blue-noise sampling strategies[perrot2015large, chen2014visual, guo2018efficient].
However, the key differentiating factor of Kyrix-S comes from its more stringent requirements on scalability and the design space. These requirements (see Section 3) pose new algorithmic challenges. For instance, Sarma et al.[das2012efficient] use an integer programming solution without considering overlaps of objects. To enable overlap removal, one needs to add pairwise non-overlap constraints into the integer program, making it hard to solve in reasonable time. As another example, Guo et al.[guo2018efficient] and Chen et al.[chen2014visual] do not support visual marks that show a group of objects with useful aggregated information. Supporting such marks requires a bottom-up aggregation process which breaks their top-down algorithmic flow. In order to scale to billions of objects, Kyrix-S cannot rely on existing algorithms and instead needs to compute visual mark layouts in parallel using a distributed algorithm as described in Section 6.
2.3 Static Scatterplot Designs
Alleviating the overdraw problem of static scatterplot visualizations has been a popular research topic for a long time. Many methods have been proposed, including binned aggregation[moritz2019falcon, liu2013immens, jo2018declarative], appearance optimization[fekete2002interactive, kosara2002focus+, chen2018using], data jittering[waldeck2004mobile, keim1998gridfit, trutschl2003intelligently] and sampling[dix2002chance, chen2019recursive]. We refer interested readers to existing surveys on scatterplot tasks and designs[sarikaya2017scatterplots], binned aggregation[heimerl2018visual] and visual clutter reduction[ellis2007taxonomy, elmqvist2009hierarchical]. Kyrix-S’s design follows many guidelines in these works, which we elaborate in Section 3.
2.4 Declarative Visualization Grammars
Numerous declarative grammars have been proposed for authoring visualizations at different levels of abstractions. The first of these is Wilkinson’s grammar of graphics (GoG)[wickham2010layered], which forms the basis of subsequent works. For example, ggplot2[Wickham_2009] is the direct implementation of GoG in R and is widely used. D3[d3] and Protovis[bostock2009protovis] are low-level libraries that provide useful primitives for authoring basic visualizations. Vega is the first grammar that concerns specifications of interactions. Built on top of Vega, Vega-lite[satyanarayan2017vegalite] offers a more succinct grammar for authoring interactive graphics. Recently, more specialized grammars have emerged for density maps[jo2018declarative], unit visualizations[park2017atom], and pan/zoom visualizations[tao2019kyrix].
Despite the diversity of this literature, not many grammars support SSVs well. Some low-level grammars such as D3[d3], Vega[vega] and Kyrix[tao2019kyrix] can express SSVs, but the specification is often verbose due to their low-level and general-purpose nature. Kyrix-S, on the contrary, uses a high-level grammar that abstracts away unimportant low-level details. For example, switching mark representations can be simply done by changing a renderer type parameter (e.g. from “circle” to “heatmap”) without writing a renderer. Furthermore, different from aforementioned grammars, Kyrix-S’s grammar allows specifications of multiple zoom levels altogether with convenient components for specifying sampling/aggregation semantics.
3 Design Goals
Limitations of prior art, existing guidelines and our experience with SSV users drive the design of Kyrix-S. Here, we present a few goals we set out to achieve.
G1. Rapid authoring. Our declarative grammar should enable specification of SSVs in a few tens of lines of code. This goal is inspired by the design rationale of several high-level declarative languages (e.g. Vega-lite[satyanarayan2017vegalite] and Atom[park2017atom]), and driven by the limitations we see in using Kyrix[tao2019kyrix] to author SSVs.
G2. Visual expressivity. Kyrix-S should enable the exploration of a broad SSV design space and not limit itself to specific visual representations. Moreover, it is crucial to allow inspection of individual objects in addition to showing aggregation information. As outlined by Sarikaya et al. [sarikaya2017scatterplots], there are four common object-centric scatterplot tasks: identify object, locate object, verify object and object comparison. A recent study[lekschas2019pattern] also highlights the importance of browsing objects in multi-scale visualizations.
G3. Usable SSVs. The SSVs authored with Kyrix-S should be usable, e.g. free of visual clutter, using simple visual aggregates, etc. We identify usability guidance from a range of surveys and SSV systems (e.g.[elmqvist2009hierarchical, guo2018efficient, das2012efficient]), which we formally describe in Section 6.
G4. Scalability. Kyrix-S should be able to handle large datasets with billions of objects and potentially skewed spatial distribution. This goal has the following two subgoals:
G4-a. Scalable offline indexing. Offline indexing should finish in reasonable time on big data, and scale well as the data size grows.
G4-b. Interactive online serving. The end-to-end response time to any user interaction (pan or zoom) should be under 500ms, an empirical upper bound that ensures fluid interactions[liu2014effects].
In the rest of the paper, we justify the design choices we make by referencing the above goals when appropriate.
4 Declarative Grammar
In this section, we present Kyrix-S’s declarative grammar. We start with showing a gallery of example SSVs authored with Kyrix-S (Section 4.1), which we then use to illustrate the design of the grammar in Section 4.2.
4.1 Example SSVs
Figure 1 shows a gallery of SSVs and their specifications.
Taxi. In Figure 1a, a multi-scale heatmap shows 178.5M taxi trips in Chicago since 2013, where $x$ is trip length (in seconds) and $y$ is trip total (in dollars). In the overview (upper), the long thin “heat” region suggests that most trips have a similar total-length ratio. In a zoomed-in view (lower), we see vertical “heat” regions around whole minutes. In fact, more than 70% of the trips have a length that is a whole number of minutes, indicating the possible prevalent use of minute-precision timers. Figure 1b is the same representation of this dataset in contour lines.
FIFA. The SSV in Figure 1c visualizes 18,207 soccer players in the video game FIFA 19. $x$ and $y$ are respectively the shooting and defensive rating of players. Players with the highest wages are shown at top levels. Lesser-paid players are revealed as one zooms in. Figure 1f is a radar-based SSV with the same $x$ and $y$. Each radar chart shows the averages of eight ratings (e.g. passing, power) of a cluster of players. When hovering over a radar, three players from that cluster with the highest wages are shown.
Liquor. Figure 1d is an SSV of 17.3M liquor purchases by retailers in Iowa since 2012. The $x$ and $y$ axes are the unit price (dollars) and quantity (# of bottles) of the purchases. Each pie shows a cluster of purchases grouped by day of the week. One can hover over a pie to see a tabular visualization of the three most recent purchases, as well as a convex hull showing the boundary of the cluster.
Reddit. Figure 1e is another representation of the one-billion Reddit comments dataset. Different from the teaser figure, comments are directly visualized as non-overlapping texts. The number above each comment represents how many comments are nearby, giving the user an understanding of the data distribution hidden underneath.
4.2 Grammar Design
The primary goal of Kyrix-S’s declarative grammar is to help the developer quickly navigate a large SSV design space (G1 and G2). The high-level design of the grammar closely follows a survey of scatterplot designs and tasks by Sarikaya et al.[sarikaya2017scatterplots], which outlined four common design variables of scatterplot visualizations: point encoding (i.e. visual representation of one object), point grouping (i.e. visual representation of a group of objects), point position (e.g. subsampling, zooming) and graph amenities (e.g. axes, annotations). These design variables map to the highest-level components in Kyrix-S’s grammar, i.e., Marks, Layout, Data and Config, as illustrated in Figure 2 using the BNF notation[knuth1964backus]. We elaborate on their design below.
4.2.1 Marks: Templates + Extensible Components
The Marks component (Rules 2-14) defines the visual representation of one or more objects, and covers both point encoding and point grouping in [sarikaya2017scatterplots]. (Footnote 4: Hereafter, rules referenced inside parentheses implicitly refer to rules in Figure 2. A rule defines the composition logic of one component in the grammar.) Visual marks of a single object or a cluster of objects span a huge space of possible visualizations. To keep our grammar high-level (G1), we adopt a templates+extensible components methodology, where we provide a diverse library of template mark designs, and offer extensible components for authoring custom marks.
We divide the Marks component into two subcomponents: Cluster (Rule 3) and Hover (Rule 4).
Cluster: cluster marks are static marks rendering one or a group of objects. Currently, Kyrix-S has five built-in Cluster marks including Circle (the teaser figure), Contour (Figure 1b), Heatmap (Figure 1a), Radar (Figure 1f) and Pie (Figure 1d). The developer can choose one of these marks by specifying just a name (G1). These built-in Cluster marks are carefully chosen to cover a range of aggregate-level SSV tasks[sarikaya2017scatterplots]. For example, heatmaps and contour plots enable the user to characterize distribution and identify correlation between the two axes. The user can perform numerosity comparison and identify anomalies with circle-based SSVs. Radar-based and pie-based SSVs allow for exploring object properties within a neighborhood. For fast authoring, Kyrix-S sets reasonable default values for many parameters (G1), e.g., inner/outer radius of a pie and bandwidth of heatmaps. The developer can also customize these parameters (G2) using a Config component (Rules 3 and 24).
The Aggregate component (Rule 6) specifies details of the aggregation statistics shown by a Cluster mark, and is composed of Dimensions (Rule 10) and Measures (Rule 11). A Dimension is a categorical field of the objects indicating how objects are grouped (e.g. by day of the week in Figure 1d). A Measure defines an aggregation statistic (e.g. average of a rating in Figure 1f). Currently Kyrix-S supports six aggregation functions: count, average, min, max, sum and square sum (Rule 14).
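As an illustration of how such measures can be maintained when clusters are combined, the following Python sketch keeps a mergeable summary of one numeric field. The class `Agg` and its attribute names are hypothetical and are not part of Kyrix-S. The sketch only shows that count, sum, square sum, min and max can be merged directly, while the average (and, as a by-product, the variance) can be derived from them.

```python
from dataclasses import dataclass


@dataclass
class Agg:
    """Mergeable summary of one numeric field over a cluster (hypothetical helper)."""
    count: int
    total: float      # sum of the values
    sq_total: float   # sum of the squared values ("square sum")
    lo: float         # min
    hi: float         # max

    @classmethod
    def of(cls, value: float) -> "Agg":
        # Summary of a cluster that contains a single object.
        return cls(1, value, value * value, value, value)

    def merge(self, other: "Agg") -> "Agg":
        # Every stored statistic combines without revisiting the raw objects.
        return Agg(
            self.count + other.count,
            self.total + other.total,
            self.sq_total + other.sq_total,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    @property
    def average(self) -> float:
        # Derived from sum and count rather than stored.
        return self.total / self.count

    @property
    def variance(self) -> float:
        # Population variance, derived from count, sum and square sum.
        return self.sq_total / self.count - self.average ** 2


ratings = Agg.of(80.0).merge(Agg.of(90.0)).merge(Agg.of(70.0))
print(ratings.count, ratings.average, ratings.lo, ratings.hi)
```

Because the summaries merge associatively, the same helper would work whether clusters are combined one object at a time or in larger groups.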
Hover: Hover marks add more expressivity into the grammar by showing additional marks when the user hovers over a Cluster mark. For example, in the teaser figure three example comments are shown upon hovering over a circle. The motivation for adding this component is two-fold.
First, as outlined in G2, we want to enable tasks that require inspection of individual objects in addition to showing visual aggregates with Cluster marks. To this end, we design a Ranklist component which visualizes objects with top-k importance (Rule 7). The importance of objects is defined in the layout component as a field from the objects. We offer a default tabular visualization template (e.g. Figure 1d), and allow custom marks via Custom (e.g. player profiles in Figure 1f).
Secondly, multi-scale visualizations often suffer from the “desert fog” problem[jul1998critical], where the user is lost in the multi-scale space and not sure what is hidden underneath the current zoom level. Boundary is designed to aid the user in navigating (G3) by showing the boundaries of a cluster of objects (Rule 8), using either the convex hull (Figure 1d) or the bounding box (the teaser figure). By hinting that there is more to see by zooming in, more interpretability is added to the visualization.
4.2.2 Layout: Configuring All Zoom Levels at Once
The Layout component (Rules 15-22) controls the placement of visual marks on zoom levels, which corresponds to the point position design variable in [sarikaya2017scatterplots]. (Footnote 5: For KDE-based SSVs (e.g. heatmaps and contours), a visual mark here refers to the kernel density estimates generated by a weighted object.) We aim to assist the developer in specifying the layout for all zoom levels together rather than independently, motivated by the limitation of general pan/zoom systems[tao2019kyrix, bederson2003jazz, bederson1994pad++] that mark placements are manually configured for every zoom level.
X and Y (Rules 16 and 17) define the two spatial dimensions. The only specifications required are two raw data columns that map to the two dimensions (e.g. trip length and total in Figures 1a and 1b). An optional Extent component (Rule 21) can be used to indicate the visible range of raw data values on the top zoom level.
The component in Rule 18 controls how visual marks are distributed across zoom levels. Drawing on prior works[das2012efficient, guo2018efficient, cartolabe], we use a usability heuristic that makes objects with higher importance more visible on top zoom levels. The importance is defined by a field of the objects. For example, in Figure 1e, the highest-scored comments are displayed on top zoom levels.
Optionally, Theta is a number between 0 and 1 indicating the amount of overlap allowed between Cluster marks (Rule 19), with 0 meaning that arbitrary overlap is allowed and 1 meaning that no overlap is allowed. For instance, Theta is 0.5 in Figure 1c, making the player profiles overlap to a certain degree.
The above layout-related parameters serve as inputs to the layout generator, which we detail in Section 6.
4.2.3 Data and Config
We assume that the raw spatial data exists in the database, and can be specified as a SQL query (Rule 23). The highest-level Config component corresponds to the design variable graph amenities in[sarikaya2017scatterplots]. The developer can use it to specify global rendering parameters such as the size of the top zoom level, number of zoom levels, as well as annotations such as axes, grid lines and legends.
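To make the four top-level components concrete, the sketch below writes a specification loosely modeled on the Liquor example as a Python dictionary and serializes it to JSON. Every key name, the table and column names in the SQL query, and the numeric settings are illustrative assumptions. They are not the actual Kyrix-S grammar, whose rules are given in Figure 2.

```python
import json

# Hypothetical SSV specification. All key names, column names and values are
# illustrative assumptions, not the real Kyrix-S grammar.
liquor_spec = {
    # Data: the raw spatial data, given as a SQL query.
    "data": {
        "query": "SELECT unit_price, quantity, day_of_week, purchase_date FROM liquor"
    },
    # Layout: two spatial dimensions, an importance field, and the overlap setting.
    "layout": {
        "x": {"field": "unit_price"},
        "y": {"field": "quantity"},
        "importance": {"field": "purchase_date", "order": "desc"},
        "theta": 1.0,  # 1 means that Cluster marks may not overlap
    },
    # Marks: a built-in Cluster mark plus Hover marks.
    "marks": {
        "cluster": {
            "mode": "pie",
            "aggregate": {
                "dimensions": [{"field": "day_of_week"}],
                "measures": [{"field": "*", "function": "count"}],
            },
        },
        "hover": {
            "rankList": {"mode": "tabular", "topk": 3},
            "boundary": "convexhull",
        },
    },
    # Config: global rendering parameters and annotations.
    "config": {"numLevels": 10, "axis": True},
}

print(json.dumps(liquor_spec, indent=2))
```

Under these assumed names, switching the mark representation would amount to changing the single `mode` value, which is the kind of brevity the grammar aims for (G1).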
5 Optimization Framework
Figure 3 illustrates the optimization framework adopted by Kyrix-S to scale to large datasets (G4). There are two main phases: offline indexing and online serving. Specifically, given an SSV specification, the layout generator computes offline the placement of visual marks on zoom levels using several usability considerations (G3), e.g., bounded visual density, free of clutter, etc. Along the way, useful aggregation information (e.g. statistics and cluster boundaries) is also collected. The computed layout information is stored in a multi-node database with multi-node spatial indexes. Online, the data fetcher communicates with the frontend and fetches data in the user’s viewport from the multi-node database with sub-500ms response times (G4-b). In the next section, we describe these two components in greater detail.
6 Layout Generation and Data Fetching
Here, we first describe how we model the layout generation problem (Section 6.1). We then describe a single-node layout algorithm (Section 6.2), which is the basis of a distributed algorithm detailed in Section 6.3. Lastly, Section 6.4 describes the design of the data fetcher.
6.1 Layout Generation: Problem Definition
We assume that there is a discrete set of zoom levels numbered 1, 2, 3, ... from top to bottom with a constant zoom factor between adjacent levels (e.g. 2 as in many web maps). The layout generation problem concerns how to, in a scalable manner, place visual marks onto these zoom levels in a general way that works for any SSV that Kyrix-S’s declarative grammar can express (G2).
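The relationship between zoom levels can be sketched in a few lines of Python. The sketch assumes that pixel coordinates on level 1 (the top level) serve as the reference and that each level down multiplies them by the zoom factor. The function names are hypothetical.

```python
def level_scale(level: int, zoom_factor: float = 2.0) -> float:
    """Magnification of a zoom level relative to level 1 (the top level)."""
    if level < 1:
        raise ValueError("zoom levels are numbered from 1")
    return zoom_factor ** (level - 1)


def to_level_pixels(x_top: float, y_top: float, level: int,
                    zoom_factor: float = 2.0) -> tuple:
    """Map a top-level pixel position to the pixel space of a deeper level."""
    s = level_scale(level, zoom_factor)
    return (x_top * s, y_top * s)


print(to_level_pixels(100, 50, 3))  # (400.0, 200.0)
```

With a zoom factor of 2, two marks that are 50 pixels apart on level 1 are 200 pixels apart on level 3, which is why marks that overlap on a top level may no longer overlap further down.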
To aid the formulation of the layout generation problem, we collect a set of existing layout-related usability considerations from prior SSV systems and surveys[guo2018efficient, beilschmidt2017linear, chen2014visual, das2012efficient, elmqvist2009hierarchical, liao2017cluster], and list them as subgoals of G3: Usable SSVs.
G3-a. Non/partial overlap. Cluster visual marks (Rule 3) should not overlap or only overlap to a certain degree (if specified by Theta in Rule 19). For simplicity, we assume that Cluster marks have a fixed-size bounding box, which is either decided by Kyrix-S or specified by the developer (see Figure 1e for an example). We then only check the overlap of bounding boxes.
G3-b. Bounded visual density. Mark density in any viewing region should not exceed an upper bound. Excessive density stresses the user and slows down both the client and the server. Kyrix-S sets a default upper bound on how many marks should exist in any viewport-sized region based on empirical estimates of the processing capability of the database and the frontend. We should also avoid very low visual density, which often leads to too many zoom levels and thus increased navigation complexity. We therefore try to maximize spatial fullness without violating the overlap constraint and the density upper bound.
G3-c. Zoom consistency. If one object is visible on zoom level $i$, either through a custom Cluster mark or a Ranklist mark (Rule 7), it should stay visible on all levels $j > i$. This principle is adopted by many SSV systems that support inspection of individual objects (e.g. [das2012efficient, chen2014visual, guo2018efficient]). The rationale is to aid object-centric tasks where keeping track of locations of objects is important.
G3-d. Data abstraction quality. Data abstraction characterized by visual marks should be interpretable and not misinform the user. For Cluster marks, it is important to reduce within-cluster variation[cui2006measuring, yang2003interactive, elmqvist2009hierarchical], which can be characterized by the average distance of objects to the visual mark that represents them[cui2006measuring]. We also adopt an importance policy, where objects with higher importance (Rule 18) should be more likely to be visible on top zoom levels. This is a commonly adopted principle to help the user see representative objects early on[guo2018efficient, das2012efficient].
Discussion. Although subgoals G3-a to G3-d all come from existing works, we are not aware of any prior system that addresses all of them. As mentioned in Section 2, a key distinction of Kyrix-S’s layout generation lies in the more stringent requirements of scalability and the design space. Due to this broad focus, finding an “optimal layout” with the objectives and constraints in G3-a to G3-d is hard. In fact, a prior work[das2012efficient] proves that with only a subset of G3-a to G3-d, finding the optimal layout is NP-hard (for an objective function they define). Therefore, we do not attempt to define a formal constraint solving problem. Instead we keep our goals qualitative and look for heuristic solutions.
6.2 A Single-node Layout Algorithm
Here, we describe a single-node layout algorithm which assumes that data fits in the memory of one computer.
We assume that the $x$/$y$ placement of a Cluster mark comes from an object it represents. Alternatively, one could consider inexact placement of the marks (e.g. “median location” or binned aggregation), which we leave as future work. Additionally, we assume that the $x$/$y$ placement of a Hover mark is the same as that of the corresponding Cluster mark. So in the rest of Section 6, any mention of a mark refers to a Cluster mark if not explicitly stated.
We make two important algorithmic choices. First, we enforce a minimum distance between marks in order to cope with the overlap and density constraints (G3-a and G3-b). Second, we use a hierarchical clustering algorithm to ensure zoom consistency (G3-c) and data abstraction quality (G3-d).
Enforcing a minimum distance between marks. For overlap and density constraints, we make use of the normalized chessboard distance ($ncd$) between two marks $v_i$ and $v_j$:

$$ncd(v_i, v_j) = \max\left(\frac{|x_i - x_j|}{W}, \frac{|y_i - y_j|}{H}\right)$$

where $x_i$ ($y_i$) is the $x$ ($y$) coordinate of the centroid of $v_i$ in the pixel space and $W$ ($H$) is the width (height) of the bounding box of a mark (note that bounding boxes of marks are of the same size).
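A minimal Python sketch of this distance follows. It takes the normalized chessboard distance to be the larger of the horizontal and vertical centroid offsets, each divided by the bounding box width or height, which is our reading of the definition above. The function and parameter names are hypothetical.

```python
def ncd(a, b, box_w, box_h):
    """Normalized chessboard distance between two mark centroids (pixel space)."""
    dx = abs(a[0] - b[0]) / box_w
    dy = abs(a[1] - b[1]) / box_h
    return max(dx, dy)


def boxes_overlap(a, b, box_w, box_h):
    """True if two same-sized boxes centered at a and b share interior area."""
    return abs(a[0] - b[0]) < box_w and abs(a[1] - b[1]) < box_h


a, b, c = (0, 0), (50, 10), (100, 0)
print(ncd(a, b, 100, 40))  # 0.5
print(ncd(a, c, 100, 40))  # 1.0
```

For same-sized boxes, `boxes_overlap` returns true exactly when `ncd` is below 1, so a single threshold on this distance is enough to express the non-overlap constraint.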
$ncd$ helps us reason about non/partial overlap constraints. If $ncd(v_i, v_j) \geq 1$, $v_i$ and $v_j$ do not overlap because they are at least one bounding box width/height away on $x$ or $y$. Even if $ncd(v_i, v_j)$ is smaller than one, the degree of overlap is bounded. For example, if $ncd(v_i, v_j) \geq 0.5$, the centroids of $v_i$ and $v_j$ remain visible despite the potential overlap.
To this end, we set a lower bound $\theta$ on the $ncd$ between any two visual marks, which is specified through the Theta component (e.g. Figure 1c) or built-in with Cluster marks.
We also use $\theta$ to enforce the visual density upper bound (G3-b). Intuitively, the smaller $\theta$ is, the closer marks are, and thus the denser the visualization is. We search for the smallest $\theta$ (for maximum spatial fullness, G3-b) that does not allow more marks than the density upper bound in any viewport-sized region. To find this value, we show in Figure 4 another perspective on how $\theta$ controls the placement of marks: enforcing that any $ncd \geq \theta$ is equivalent to scaling the bounding boxes of marks by a factor of $\theta$, and then enforcing that none of these scaled bounding boxes overlap. So we are left with a simple bin-packing problem. For a given $\theta$, the maximum number of marks that can be packed into a viewport follows from the viewport size and the size of the scaled bounding boxes. With this, we can find the smallest $\theta$ for which this maximum does not exceed the density upper bound, using a binary search on $\theta$.
We take the larger of the $\theta$ values calculated/specified for the overlap and density constraints. By imposing this lower bound on $ncd$, these two constraints are strictly satisfied.
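The search for the density-driven lower bound can be sketched as follows. The packing bound used here, a product of two ceilings, is an assumption made for illustration and may differ from the exact expression used by Kyrix-S. The names `max_marks`, `smallest_theta` and all parameters are likewise hypothetical.

```python
import math


def max_marks(theta, view_w, view_h, box_w, box_h):
    """Assumed packing bound: grid cells of the scaled boxes that fit the viewport."""
    return (math.ceil(view_w / (theta * box_w))
            * math.ceil(view_h / (theta * box_h)))


def smallest_theta(k, view_w, view_h, box_w, box_h, tol=1e-6):
    """Binary search for the smallest theta with max_marks(theta) <= k."""
    if k < 1:
        raise ValueError("the density upper bound must be at least 1")
    lo = tol                                         # assumed too dense
    hi = 2.0 * max(view_w / box_w, view_h / box_h)   # packs one mark, so feasible
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if max_marks(mid, view_w, view_h, box_w, box_h) <= k:
            hi = mid   # feasible: try a smaller theta
        else:
            lo = mid   # too dense: theta must grow
    return hi


theta_density = smallest_theta(k=100, view_w=1000, view_h=1000, box_w=100, box_h=100)
theta_overlap = 0.5                         # e.g. a value given through Theta
theta = max(theta_overlap, theta_density)   # the larger bound satisfies both
print(round(theta, 3))
```

The search relies only on the bound being non-increasing in theta, so another monotone packing bound could be substituted without changing the loop.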
Hierarchical clustering. The key part of the algorithm is a bottom-up hierarchical clustering process. Suppose there are $l$ zoom levels. We start with a fake bottom level, below the most detailed zoom level, where every object is in its own cluster. Each cluster’s aggregation information (e.g. aggregated stats and cluster boundaries) is initialized using the only object in it, which we call the “representative object” of the cluster in the following.
Then we build the clusters level by level, from bottom to top. For each zoom level $i$, we construct a new set of clusters by merging the clusters on level $i+1$, the level immediately below it. Zoom consistency (G3-c) is then guaranteed because each zoom level merges clusters from the level below. By mathematical induction, we can show that if an object is visible on level $i$, it is visible on any level $j > i$.
Specifically, we iterate over all clusters on level $i+1$ in the order of the importance of their representative objects, which is a greedy strategy to make important objects more visible (G3-d). For each cluster $\alpha$ on level $i+1$, we search for a cluster $\beta$ on the current level $i$ with the closest $ncd$. If this $ncd$ is smaller than $\theta$, we merge $\alpha$ into $\beta$; otherwise we add $\alpha$ to level $i$. By merging a cluster into its nearest neighbor (measured in $ncd$), within-cluster variances can be reduced (G3-d). Figure 5 shows an example with 9 objects and 3 zoom levels.
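The bottom-up process can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the system's implementation: clusters are plain dictionaries with hypothetical fields `id`, `x`, `y`, `importance` and `count`; the nearest-neighbor search is a brute-force scan rather than a spatial tree; and adjacent zoom levels are assumed to differ by a constant zoom factor, which is modeled by enlarging the bounding box size instead of rescaling coordinates.

```python
def ncd(a, b, w, h):
    """Normalized chessboard distance between two cluster centroids."""
    return max(abs(a["x"] - b["x"]) / w, abs(a["y"] - b["y"]) / h)

def build_level(lower, theta, w, h):
    """Merge the clusters of the level below into a new, coarser level."""
    upper = []
    # Greedy order: most important representative objects first (G3-d).
    for c in sorted(lower, key=lambda c: c["importance"], reverse=True):
        nearest = min(upper, key=lambda u: ncd(u, c, w, h), default=None)
        if nearest is not None and ncd(nearest, c, w, h) < theta:
            nearest["count"] += c["count"]  # merge c into its nearest neighbor
        else:
            upper.append(dict(c))           # copy, so the lower level stays intact
    return upper

def build_hierarchy(objects, num_levels, theta, w, h, zoom=2.0):
    """Return the levels from top to bottom. ASSUMPTION: each level up shrinks
    pixel coordinates by `zoom`, modeled here by growing w and h instead."""
    current = [dict(o, count=1) for o in objects]  # fake bottom level
    levels = []
    for k in range(num_levels):                    # k = 0 is the most detailed level
        s = zoom ** k
        current = build_level(current, theta, w * s, h * s)
        levels.append(current)
    return levels[::-1]

objects = [
    {"id": "A", "x": 0, "y": 0, "importance": 3},
    {"id": "B", "x": 5, "y": 0, "importance": 2},
    {"id": "C", "x": 30, "y": 0, "importance": 1},
]
levels = build_hierarchy(objects, num_levels=3, theta=1.0, w=10, h=10)
```

Because every level is built only from the representatives of the level below, the representatives on a level are always a subset of those on the levels beneath it, which is the zoom consistency property.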
The single-node algorithm preserves an outlier if it is not within $\theta$ (in $ncd$) of any other object. To identify less isolated outliers, one would need to assign each object a score (i.e. the importance field) indicating how distant it is from other objects. A kernel density estimate is one example of such a score.
Optimizations and complexity analysis. Let $n$ be the total number of objects. When constructing clusters for level $i$, sorting the clusters on level $i+1$ takes $O(n \log n)$. We maintain a spatial search tree (e.g. R-tree) of the clusters on level $i$ so that nearest neighbor searches can be done in $O(\log n)$. Inserting a new cluster into the tree also takes $O(\log n)$. Therefore, the overall time complexity of this algorithm is $O(n \log n)$ if we treat the number of zoom levels as a constant.
6.3 A Multi-node Distributed Layout Algorithm
The algorithm presented in Section 6.2 works only on a single machine, which has limited memory. Here, we extend it to work with a multi-node database system. (The distributed algorithm proposed here works with any multi-node database that supports basic data partitioning, e.g. hash-based, and 2D spatial indexes.)
Given the sequential nature of the single-node algorithm, one major challenge here is how to utilize the parallelism offered by the multi-node database. Our idea is to spatially partition a zoom level, perform clustering in each partition independently in parallel and then merge the partitions. Figure 6 illustrates the three steps. We detail them in the following, assuming the context of constructing clusters on zoom level $i$ from the clusters on level $i+1$.
Step 1: skew-resilient spatial partitioning. We use a KD-tree[bentley1990k] to spatially partition the 2D plane so that each resulting partition has a similar number of clusters from zoom level $i+1$. Note that each cluster belongs to exactly one partition according to its centroid. A KD-tree is a binary tree (Figure 6a) where every non-leaf tree node represents a split of a subplane, and every leaf tree node is a final partition stored as a table in one database node. KD-tree splits are axis-aligned and alternate between horizontal and vertical as one goes down the hierarchy. For each split, the median value of the corresponding axis is used as the split point. We stop splitting when the clusters in a partition can fit into the memory of one database node.
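The partitioning in Step 1 can be illustrated with a short in-memory Python sketch. It is a simplification under assumptions: `capacity` stands in for the number of clusters that fit in one database node's memory, the first split is taken on the $x$ axis, and the median is found by sorting rather than by a database query.

```python
def kd_partition(points, capacity, depth=0):
    """Recursively split (x, y) centroids at the median of alternating axes
    until every leaf partition holds at most `capacity` points (capacity >= 1)."""
    if len(points) <= capacity:
        return [points]                    # leaf: one final partition
    axis = depth % 2                       # 0: split on x, 1: split on y
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                # median position as the split point
    return (kd_partition(ordered[:mid], capacity, depth + 1)
            + kd_partition(ordered[mid:], capacity, depth + 1))

# Example: 100 unevenly spread centroids, at most 10 per partition.
points = [(i * i % 97, i) for i in range(100)]
partitions = kd_partition(points, capacity=10)
```

Splitting at the median keeps the two halves balanced in size regardless of how skewed the spatial distribution is, which is why partitions end up with similar numbers of clusters.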
Step 2: processing partitions in parallel. Since each partition fits in the memory of one database node, we can efficiently run the single-node clustering algorithm on each partition in parallel. As a result, a new set of clusters is produced in each partition where no two clusters have an $ncd$ smaller than the threshold $\theta$ (Figure 6c).
Step 3: merging clusters on partition boundaries. After Step 2, some clusters close to partition boundaries may have an $ncd$ smaller than $\theta$. Step 3 resolves these border cases by merging clusters along KD-tree splits. We “process” (i.e. merge clusters along) KD-tree splits in a bottom-up fashion, starting with splits that connect two leaf partitions. After the KD-tree root is processed, we finish the layout generation for level $i$.
When processing a given split, we make use of the fact that only clusters whose centroid is within a certain distance of the split ($\theta H$ for a horizontal split, $\theta W$ for a vertical one) need to be considered. Consider the horizontal split in Figure 7. The two horizontal dashed lines indicate the range of cluster centroids that we need to consider. Any cluster whose centroid is outside this range is at least $\theta$ away (in $ncd$) from any cluster on the other side of the split.
We use a greedy algorithm to process a KD-tree split. We iterate over all clusters in the aforementioned range in the order of their $x$ coordinates ($y$ if the split is vertical). We keep track of the last added/merged cluster $\alpha$. Let $\beta$ be the currently considered cluster. If $ncd(\alpha, \beta) \geq \theta$, we add $\beta$ and set $\alpha$ to $\beta$; otherwise we merge $\alpha$ and $\beta$: the one with the less important representative object is merged into the other (G3-d). Then we update $\alpha$ accordingly.
Consider again Figure 7. There are five clusters A-E in decreasing importance order. The boxes around clusters are their bounding boxes scaled by a factor of $\theta$. So if two boxes overlap, the two corresponding clusters have an $ncd$ smaller than $\theta$ (see Figure 4). The above algorithm iterates over the clusters in the following order: . When , . is then merged into because and has a less important representative object. For the same reason, is merged into .
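The greedy processing of a single split can be sketched as below. The sketch assumes that the input already contains only the clusters whose centroids fall in the band around the split, and it uses hypothetical dictionary fields (`id`, `x`, `y`, `importance`, `count`); the cluster positions in the example are made up and do not reproduce Figure 7.

```python
def ncd(a, b, w, h):
    """Normalized chessboard distance between two cluster centroids."""
    return max(abs(a["x"] - b["x"]) / w, abs(a["y"] - b["y"]) / h)

def process_split(band, theta, w, h, horizontal=True):
    """Greedily merge clusters near one KD-tree split.
    `band` holds only the clusters within the band around the split."""
    axis = "x" if horizontal else "y"   # sweep along the split line
    result = []
    last = None                         # last added/merged cluster
    for cur in sorted(band, key=lambda c: c[axis]):
        cur = dict(cur)                 # do not mutate the caller's clusters
        if last is None or ncd(last, cur, w, h) >= theta:
            result.append(cur)          # far enough: keep as its own cluster
            last = cur
        else:
            # Too close: the less important cluster is merged into the other.
            if last["importance"] >= cur["importance"]:
                keep, drop = last, cur
            else:
                keep, drop = cur, last
            keep["count"] += drop["count"]
            result[-1] = keep
            last = keep
    return result

band = [
    {"id": "A", "x": 0, "y": 0, "importance": 5, "count": 1},
    {"id": "B", "x": 4, "y": 0, "importance": 3, "count": 1},
    {"id": "C", "x": 20, "y": 0, "importance": 1, "count": 1},
    {"id": "D", "x": 24, "y": 0, "importance": 4, "count": 1},
]
merged = process_split(band, theta=1.0, w=10, h=10)
```

In this made-up example, B is merged into A and C is merged into D, since in each pair the two clusters are closer than the threshold and the second-named cluster has the more important representative object.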
Optimizations and complexity analysis. Let be the upper bound on the number of clusters that can fit in memory. Hence there are roughly partitions, which means there are KD-tree nodes. Determining the splitting point can be done in , thus constructing the spatial partitions takes . Step 1 also involves distributing the clusters to the correct database node, which is often an expensive I/O bound process. So we do spatial partitioning only once based on the bottom level, and reuse the same partition scheme for other levels to avoid moving data around database nodes. Step 2 runs in because the single node algorithm is run in parallel across partitions. Step 3 takes because there are KD-tree levels in total, and we need to consider for each KD-tree level clusters in the worst case. However, Step 3 is expected to run very fast in practice because most clusters are out of the range in Figure 7.
Other partitioning strategies. One could partition the data using fields other than $x$ and $y$ and then, in a similar fashion, run the single-node algorithm on the resulting partitions in parallel. However, since the two spatial attributes are not involved in partitioning, objects in each partition would span the whole 2D space. So even though overlap and density constraints are satisfied within each partition, they will very likely be violated once the partitions are merged, unless extra spatial postprocessing is in place. We therefore choose to perform spatial partitioning throughout to guarantee G3-a and G3-b.
6.4 Data Fetching
The data fetcher’s job is to efficiently fetch data in the user’s viewport (G4-b). We make use of multi-node spatial indexes, which can help fetch objects in a viewport-sized region with interactive response times.
Creating multi-node spatial indexes. Suppose the -th () partition on zoom level is stored in the database table , which has roughly clusters. We augment all such with a box-typed column bbox, which stores the bounding box of cluster marks. We then build a spatial index on column bbox, by issuing the following query:
where gist is the spatial index based on the generalized search tree[postgresgist]. In practice, these CREATE INDEX statements can be run in parallel by the multi-node database.
Fetching data from relevant partitions. Given a user viewport $V$ on zoom level $i$, clusters from partition $j$ that are inside $V$ can be fetched by a query like the following:
where && is the intersection operator. The spatial index on bbox ensures that this query runs fast. We traverse the KD-tree to find the partitions that intersect the viewport, run the above query on these partitions and union the results. Note that for top zoom levels that are small in size, too many partitions can intersect the viewport, which hurts data fetching performance because we need to wait for sequential network trips to many database nodes. Therefore, we merge all partitions on each of the top $k$ levels into one database table, where $k$ is an empirically determined constant based on the relative size of the zoom levels to the viewport size.
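To make the two kinds of statements discussed in this section concrete, the following Python sketch assembles a PostgreSQL index-creation statement and a viewport query as strings. The table naming scheme (`level3_part7`), the index name and the helper functions are hypothetical; only the `USING gist` clause, the `box` type and the `&&` operator are taken from the text above. A real implementation would also need to guard against unsafe table names.

```python
def create_index_sql(table):
    """Statement that builds a GiST spatial index on the bbox column."""
    return f"CREATE INDEX {table}_bbox_idx ON {table} USING gist (bbox);"

def fetch_sql(table, x1, y1, x2, y2):
    """Query for clusters whose bbox intersects the viewport rectangle."""
    return (f"SELECT * FROM {table} "
            f"WHERE bbox && box '(({x1},{y1}),({x2},{y2}))';")

def fetch_viewport_sql(tables, x1, y1, x2, y2):
    """One query per partition table that intersects the viewport;
    the caller runs them and unions the results."""
    return [fetch_sql(t, x1, y1, x2, y2) for t in tables]

index_stmt = create_index_sql("level3_part7")
queries = fetch_viewport_sql(["level3_part7", "level3_part8"], 0, 0, 1000, 1000)
```

The list of partition tables passed to `fetch_viewport_sql` would come from the KD-tree traversal described above.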
We implement Kyrix-S as an extension to Kyrix[tao2019kyrix], a general pan/zoom system we have built. This enables the developer to both rapidly author SSVs and reuse features of a general pan/zoom system in one integrated system. For example, Kyrix supports multiple coordinated views. Without switching tools, the developer can construct a multi-view visualization in which one or more views are SSVs authored with Kyrix-S. As another example, the developer can augment SSVs with the semantic zooming functionality provided by Kyrix, where the user can click on a visual mark and zoom into another SSV. Furthermore, Kyrix provides APIs for integrating a pan/zoom visualization into a web application, which are highly desired by the SSV developers we collaborate with. Examples include programmatic pan/zoom control, notifications of pan/zoom events, and getting the currently visible data items.
Specification compilation. Kyrix-S uses a Node.js module to validate the JSON-based SSV specification. Validated specifications are compiled into low-level Kyrix specifications so that part of Kyrix’s frontend code can be reused to handle rendering and pan/zoom interactions.
Layout generator and data fetcher. Kyrix-S’s layout generator and data fetcher override Kyrix’s index generator and data fetcher, respectively. Both components are written in the same Java application, using Java Database Connectivity (JDBC) to talk to Citus (https://www.citusdata.com/).
Database deployment and orchestration. Kyrix-S provides useful scripts for one-command deployment of Kyrix-S and database dependencies (G1). We use Kubernetes (https://cloud.google.com/kubernetes-engine/) to orchestrate a group of nodes running containerized Citus and Kyrix-S built with Docker (https://www.docker.com/).
| Building KD-tree (Step 1) | 11.8 | 10.5 | 2.7 | 2.4 | 0.7 |
| Redistributing data (Step 1) | 94.3 | 100.0 | 8.5 | 8.4 | 1.3 |
| Parallel clustering (Step 2) | 9.9 | 3.7 | 6.9 | 9.0 | 4.7 |
| Merge partitions (Step 3) | 61.3 | 18.2 | 1.1 | 0.8 | 0.1 |
| Creating Spatial Indexes | 2.4 | 1.3 | 1.2 | 1.2 | 1.3 |
We conducted extensive experiments to evaluate two aspects of Kyrix-S: 1) performance and 2) authoring effort.
We conducted performance experiments to evaluate the online serving and indexing performance of Kyrix-S. We used both the example SSVs in the teaser figure and Figure 1 and a synthetic circle-based SSV, Syn, that visualizes a skewed dataset where 80% of the objects are in 20% of the 2D plane, and the remaining 20% are uniformly distributed across the 2D plane. For database partitioning, we set the partition capacity to 2 million, i.e., each partition has roughly 2 million objects. So for a dataset with $n$ objects, there are roughly $n / (2 \times 10^6)$ partitions. Based on the number of partitions, we provision a Google Cloud Kubernetes cluster of n1-standard-8 PostgreSQL nodes (8 vCPUs, 30GB memory), each serving 8 partitions.
8.1.1 Online Serving Performance
To measure the online response times, we used a user trace where one pans around to find the most skewed region on a zoom level, zooms in, repeats until reaching the bottom level and then zooms all the way back to the top level. We measured the 95th percentile of all data fetching times and network times. (A 95th percentile means that 95% of the time, the response time is equal to or below this value. This is a common metric for measuring network latency of web applications.)
Table 1 shows the results on five SSVs. The 95th-percentile data fetching times were all below 32ms. This was because we only fetched data from the partitions that intersect with the viewport and the spatial indexes sped up the spatial queries. Network times were mostly negligible except for Taxi Heatmap and Taxi Contour, where many more data items were fetched due to smaller $\theta$ values.
Figure 8 shows the response times on different sizes of the synthetic SSV Syn. We can see that the response times remained stably under 20ms for data sizes from 32 million to 1 billion.
8.1.2 Offline Indexing Performance
Table 2 shows the indexing performance of the layout generator on five example SSVs. We make the following observations. First, the indexing phase finished in reasonable time: every example finished in less than 3 hours. Second, redistributing the data to the correct spatial partition was the most time-consuming part since it was an I/O bound process. Fortunately, the same spatial partitions can be reused for updatable data if the spatial distribution does not change drastically. Third, parallel clustering and spatial index creation took the least time because they could be run in parallel across partitions. Fourth, merging clusters along KD-tree splits was mostly a cheap process. In fact, the largest number of clusters along a KD-tree split was 16,647. This step took longer on Reddit Text than on Reddit Circle because Reddit Text had more zoom levels (20 vs. 15) due to its larger mark size (text vs. circle). Moreover, iterating through objects along KD-tree splits was much more time-consuming on the bottom five levels.
Figure 9 shows how indexing time changed for different sizes of Syn. We can see that the indexing time scaled well as the data size grew: as data size doubled, indexing time roughly doubled as well.
8.2 Authoring Effort
To evaluate the accessibility of our grammar, we compared the authoring effort of Kyrix-S with Kyrix[tao2019kyrix], a state-of-the-art general pan/zoom system. To the best of our knowledge, Kyrix is the only system that offers declarative primitives for general pan/zoom visualizations, and it has been shown to be accessible to visualization developers in a user study[tao2019kyrix]. Earlier systems/languages such as D3[d3], Pad++[bederson1994pad++], Jazz[bederson2003jazz] and ZVTM[pietriga2005toolkit] require procedural programming, which generally takes more authoring effort[tao2019kyrix]. We measured lines of specifications using both systems for the two example SSVs in Figures 1d and 1f. We used a code formatter (https://prettier.io/) to standardize the specifications, and counted only non-blank and non-comment lines. The code in this experiment is included in the supplemental materials.
| Example | Kyrix-S | Kyrix |
| Figure 1d | 62 lines | 568 lines |
| Figure 1f | 164 lines | 610 lines |
| Figure 1f (excluding custom renderer) | 68 lines | 514 lines |
Table 3 shows the results. When authoring the two example SSVs, Kyrix-S achieved roughly 9.2× and 3.7× savings in specification lines compared to Kyrix (62 vs. 568 lines and 164 vs. 610 lines, respectively). In the second example, when we excluded the custom renderer for soccer players (which has 96 lines), the saving was roughly 7.6× (68 vs. 514 lines). These savings came from Kyrix-S abstracting away low-level details such as rendering of visual marks, configuring zoom levels, etc.
The above comparison did not include the code for layout generation. To enable the comparison, we stored the layouts generated by Kyrix-S as database tables so that Kyrix could directly use them. However, programming the layout was in fact a challenging task, as indicated by the total lines of code of Kyrix-S’s layout generator (1,439). Therefore, we conclude that Kyrix-S greatly reduced the user’s effort in authoring SSVs compared to general pan/zoom systems.
9 Limitations and Future Work
Other layout strategies. Kyrix-S assumes that the location of a mark comes from an object. This can be relaxed to diversify our layout generator. For example, supporting inexact placement of marks such as binned aggregation[heimerl2018visual] in SSVs is one future direction. We also plan to investigate layout strategies that concern multi-class scatterplots, e.g. how to preserve relative density orders among multiple classes[chen2014visual, chen2019recursive].
More built-in templates. Our declarative grammar is designed to enable rapid extension of the system with custom marks. This motivates us to engage more with the open-source community and enrich our built-in mark gallery with templates commonly required/authored by developers.
Incremental updates. Currently, Kyrix-S assumes that data is static and pre-materializes mark layouts. To debug interactively, the developer needs to either use a sample of the data or reduce the number of zoom levels. It is our future work to identify ways to incrementally update our mark layout upon frequent changes of developer specifications, as well as when the data itself is updated dynamically.
Animated transitions. A discrete-zoom-level model simplifies layout generation, but can potentially lead to abrupt visual effect upon level switching, especially for KDE-based renderers such as heatmaps. As future work, we will use animated transitions to counter this limitation.
Raster image-based SSVs. The visual density constraint, partly due to limited processing capabilities of the frontend and the database, forbids the creation of dense visualizations such as point clouds[cartolabe]. We envision the use of raster images to remove this constraint for visualizations where interaction with objects is not required.
In this paper, we presented the design of Kyrix-S, a system for easy authoring of SSVs at scale. Kyrix-S contributed a declarative grammar that enabled concise specification of a wide range of SSVs and rapid authoring of custom marks. Behind the scenes, Kyrix-S automatically generated layout of visual marks on zoom levels using a range of usability guidelines such as maintaining a visual density budget and high data abstraction quality. To scale to big datasets, Kyrix-S worked with a multi-node parallel database system to implement the layout algorithm in a distributed setting. Multi-node spatial indexes were built to achieve interactive response times. We demonstrated the expressivity of Kyrix-S with a gallery of example SSVs. Experiments on real and synthetic datasets showed that Kyrix-S scaled to big datasets with billions of objects and reduced the authoring effort significantly compared to a state-of-the-art authoring system.
We thank the anonymous reviewers for their thoughtful feedback. This work was in part supported by NSF OAC-1940175, OAC-1939945, IIS-1452977, DGE-1855886, IIS-1850115, DARPA FA8750-17-2-0107 and the Data Systems and AI Lab initiative under Grant 3882825.