Scaling interactive visual data exploration to massive datasets is becoming increasingly important with the rapid generation of data across domains, from healthcare to the sciences. It is not unusual for analysts in application domains to deal with datasets on the order of terabytes or petabytes. Since fluid interactions help allocate human attention efficiently over data (Liu and Heer, 2014), interactivity should not be compromised when exploring big datasets, which can easily overwhelm analysts.
Details-on-demand (Shneiderman, 1996) is a common interaction pattern that arises from exploratory data analysis practices and can be particularly effective in exploring complex datasets, reducing the user’s information load. In this interaction pattern, users start with an overview of a dataset and then zoom into a smaller subset of interest within the dataset to examine this data patch (Pirolli and Card, 1999), while querying details on items within the focused region as needed. Users repeat the same process after zooming further into or out of the current region. However, most visual exploration systems cannot handle very big datasets, let alone enable details-on-demand interactions. Large data scales make it challenging to bound interaction response times within 500 ms, which is required for sustaining an interactive user experience (Liu and Heer, 2014).
Several earlier details-on-demand systems address interactivity challenges at scale with highly-customized implementations. Google Maps and Aperture Tiles (Cheng et al., 2013) precompute image tiles of the entire world map at multiple detail (zoom) levels for scalable panning and zooming. Similarly, imMens (Liu et al., 2013) supports interactive panning, zooming and brushing & linking in binned plots by precomputing tiled data cubes. ATLAS (Chan et al., 2008) uses predictive prefetching and level-of-detail management to improve the performance of panning and zooming interactions on large time-series datasets. ForeCache (Battle et al., 2016) also uses predictive prefetching along with data tiling to help sustain an interactive details-on-demand exploration of satellite images. Although these earlier systems use similar approaches to facilitate scalability, they are highly customized one-off tools developed from scratch for specific datasets. The optimization techniques used in these systems are often inaccessible to visualization developers at large, who are not necessarily experts in performance optimization. Furthermore, current visualization specification tools provide no or limited support for developers to create interactive visualization applications at scale.
To accelerate and improve the development of scalable visual data exploration systems, we need general purpose tools that can help developers handle large datasets by using effective optimization techniques (e.g. indexing, caching and prefetching). This warrants an integrative, end-to-end approach to visualization specification, where performance optimizations and data are pushed to the server side computation and DBMSs, which can scale reasonably well with increasing data sizes.
In this paper, we present the design of Kyrix, a novel system for developers to build large-scale details-on-demand visualizations. Our goal is to achieve both generality and scalability. Figure 1 shows the architecture of Kyrix. On the developer side, we offer a concise yet expressive declarative language for specifying visualizations. Declarative designs hide execution details (e.g., backend optimization, frontend rendering) from developers, so that they can focus on visual specification (Satyanarayan et al., 2014). On the execution side, there are three main components: the compiler, the backend server, and the frontend renderer. The compiler parses the developer’s specification and performs basic constraint checks. Based on the developer specification, the backend server then builds indexes and performs the necessary precomputation. The frontend renderer is responsible for listening to users’ activities, communicating with the backend server to fetch data, and rendering the visualizations.
In the following, we first discuss a simple map visualization created using Kyrix, briefly demonstrating the use of its services and declarative language. We then introduce optimizations used by Kyrix that facilitate the development of fluid details-on-demand interactions. Next we discuss useful extensions to the Kyrix system along with avenues of future research. We then put Kyrix in the context of earlier scalable visualization systems and grammars for declarative visualization specification. We conclude by summarizing our contributions and reiterating our vision on accelerating the development of interactive visualizations for massive datasets.
2. Developing Interactive Visualizations with Kyrix
The goal of Kyrix is to provide an end-to-end solution for visualization developers to create details-on-demand visualizations. To this end, Kyrix provides a declarative language for visualization specification.
2.1. Kyrix Declarative Language
Kyrix’s declarative model has two basic abstractions: canvas and jump. A canvas is an arbitrary size worksheet with one or more overlaid layers, forming a single view showing a static visualization. A jump is a customized transition from one canvas to another. This model allows easy specifications of common details-on-demand interactions such as panning, geometric and semantic zooming. (Geometric zooming refers to scaling the visualization to show different levels of details. Data type and visual encoding are unchanged. Semantic zooming, in contrast, connects different views showing related data using smooth zoom-like transitions. Data type and visual encoding can both be changed.)
The Kyrix declarative language is data type agnostic and supports a myriad of specific visualizations. To render a layer, developers specify the following:
The data needed for the layer. This is specified using a SQL query to a DBMS along with a transform function postprocessing the query result. Developers can use existing visualization libraries (e.g., D3 and Vega) to specify a desired transform function (e.g., layout transforms, scaling, etc). However, this transform function is not required. Developers still can and should transform their data outside Kyrix if it is more convenient.
The location of each returned data object on the canvas. This is specified using a placement function.
A rendering function that converts a canvas object to pixels on the screen. Kyrix’s rendering functions can be written using lower-level visualization specification libraries such as D3.
A jump transition can be established simply by specifying a from canvas, a to canvas and a transition type (right now it can be geometric zoom, semantic zoom or both). It can also be customized in many ways. For example, developers can specify a subset of objects on the from canvas that can trigger this jump. For more details on the language, interested readers can refer to our developer manual (https://github.com/tracyhenry/Kyrix/blob/master/compiler/README.md).
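The two abstractions and the three per-layer ingredients can be pictured with a small data model. The sketch below is written in Python purely for illustration and is not Kyrix's actual specification language; every class, field and function name in it (Layer, Canvas, Jump, dot_placement and so on) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y) on the canvas


@dataclass
class Layer:
    query: str                                # SQL text sent to the DBMS
    placement: Callable[[Row], BBox]          # where a returned object sits on the canvas
    rendering: Callable[[List[Row]], str]     # turns objects into drawable output
    transform: Optional[Callable[[Row], Row]] = None  # optional post-processing step
    is_static: bool = False                   # static layers are not re-rendered on pan


@dataclass
class Canvas:
    canvas_id: str
    width: int
    height: int
    layers: List[Layer] = field(default_factory=list)


@dataclass
class Jump:
    from_canvas: str
    to_canvas: str
    jump_type: str                            # e.g. "geometric_zoom" or "semantic_zoom"
    selector: Callable[[Row], bool] = lambda row: True  # objects that may trigger the jump


def dot_placement(row: Row) -> BBox:
    # Hypothetical placement: a 10 x 10 box centred on the row's (x, y) attributes.
    return (row["x"] - 5, row["y"] - 5, row["x"] + 5, row["y"] + 5)


def dot_rendering(rows: List[Row]) -> str:
    # Hypothetical renderer that emits SVG circles as text.
    return "".join(f'<circle cx="{r["x"]}" cy="{r["y"]}" r="5"/>' for r in rows)


overview = Canvas("overview", 2000, 1000,
                  [Layer("SELECT x, y FROM dots", dot_placement, dot_rendering)])
detail = Canvas("detail", 8000, 4000)
zoom_in = Jump("overview", "detail", "geometric_zoom")
```

Keeping placement separate from rendering matters later: the backend only needs the placement function to decide which objects fall inside a requested region, while rendering stays entirely on the frontend.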
2.2. Example: Map of US Crime Rate
The state map canvas is specified in lines 5–21. This canvas contains two overlaid layers: a static legend layer (lines 13–15) and a pannable state border layer (lines 18–21). Each layer is specified using an identifier of a data transform (lines 9 and 10) and a boolean value indicating whether this layer is static (lines 13 and 18). Static layers do not need to be re-rendered when the user pans. So in this case, when the user pans, the legend stays unchanged in the upper right-hand corner, overlaid on the state border layer. The county map canvas is specified similarly. In Figure 3 we leave out the specification of the county map canvas along with the transform, rendering and placement functions due to limited space.
A jump transition from the state canvas to the county canvas is defined in line 36. In the constructed jump object, the first two arguments respectively identify the state and county canvases. The third argument specifies the jump type. The rest of the arguments are used to customize the jump transition. To complete the specification of the application, the developer also specifies an initial canvas and a viewport center (line 39).
3. Interactivity in Kyrix
In general, the interactivity problem in Kyrix is to achieve a 500 ms response time to the following user interactions: (1) A pan to a different location on the same canvas and (2) a jump to a different canvas.
In Section 3.1 we discuss how Kyrix fetches data in response to user interactions. Then in Section 3.2, we give some general guidelines that help achieve this goal. Lastly, Section 3.3 reports initial end-to-end performance numbers measured against this goal. We discuss other performance options in Section 4. If accepted, we would expect to present this work primarily as a demo.
3.1. Data Fetching
When a user performs one of the operations (pan or jump), or when an application is first loaded, Kyrix’s frontend communicates with the backend to retrieve the data needed to render the viewport. Like previous systems (e.g., ForeCache (Battle et al., 2016)), Kyrix employs both a frontend cache and a backend cache. If there is a cache miss in both, the Kyrix backend queries the backing DBMS to fetch data. In this data fetching process, we identify two important factors that can affect Kyrix’s performance: (1) fetching granularity and (2) database design and indexing. In the following, we describe these two factors in detail.
Fetching Granularity. The standard wisdom, as applied in Google Maps, ForeCache (Battle et al., 2016) and Aperture Tiles (Cheng et al., 2013), is to decompose a canvas into fixed-size static tiles (Figure 4(a)). The frontend then requests the tiles that intersect with the given viewport. Every tile is individually fetched and rendered. Kyrix currently supports static tiling. Kyrix also contributes a novel fetching granularity, dynamic boxes. Dynamic box fetching amounts to requesting a box that contains the given viewport (Figure 4(b)). We call this enclosing box a dynamic box because its size and location change dynamically. Whenever the viewport moves outside the current box, the frontend sends the current viewport location to the backend and requests a new box. There are numerous ways to calculate a box, e.g., a box centered at the viewport center having width (height) 50% larger than the viewport width (height). We expect dynamic boxes to outperform static tiles for the following reasons:
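The lookup order described above (frontend cache, then backend cache, then DBMS) can be sketched as follows. This is a minimal illustration, not Kyrix's implementation: the LRU eviction policy, the cache capacities and the names LRUCache, fetch and query_dbms are all assumptions, since the text does not say how either cache is managed.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional, Tuple


class LRUCache:
    """Small least-recently-used cache (the eviction policy is an assumption)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used entry


def fetch(key: Hashable, frontend: LRUCache, backend: LRUCache,
          query_dbms: Callable[[Hashable], Any]) -> Tuple[Any, str]:
    """Return (data, source); the DBMS is queried only when both caches miss."""
    data = frontend.get(key)
    if data is not None:
        return data, "frontend"
    data = backend.get(key)
    source = "backend"
    if data is None:
        data = query_dbms(key)                # miss in both caches
        source = "dbms"
        backend.put(key, data)
    frontend.put(key, data)
    return data, source
```

Here key stands for whatever identifies a fetch unit (a tile ID or a box); results are assumed to be non-None values such as lists of rows.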
compared to large tiles, dynamic boxes fetch less data;
compared to small tiles, dynamic boxes require fewer frontend-backend requests in general;
in cases where data is not uniformly distributed, dynamic boxes can adjust their sizes and locations based on data sparsity, incurring far fewer network and database trips than static tiles.
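The example box calculation mentioned above (a box centered at the viewport center, 50% wider and taller than the viewport) and the rule for requesting a new box can be sketched as below. Clipping the box to the canvas and the names Rect, dynamic_box and needs_new_box are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, other: "Rect") -> bool:
        return (self.min_x <= other.min_x and self.min_y <= other.min_y
                and self.max_x >= other.max_x and self.max_y >= other.max_y)


def dynamic_box(viewport: Rect, canvas: Rect, extra: float = 0.5) -> Rect:
    """Box centred on the viewport, (1 + extra) times its width and height.

    Clipping to the canvas bounds is an assumption, not stated in the text.
    """
    pad_x = (viewport.max_x - viewport.min_x) * extra / 2
    pad_y = (viewport.max_y - viewport.min_y) * extra / 2
    return Rect(max(canvas.min_x, viewport.min_x - pad_x),
                max(canvas.min_y, viewport.min_y - pad_y),
                min(canvas.max_x, viewport.max_x + pad_x),
                min(canvas.max_y, viewport.max_y + pad_y))


def needs_new_box(viewport: Rect, current_box: Optional[Rect]) -> bool:
    """A new box is requested once the viewport is no longer inside the current one."""
    return current_box is None or not current_box.contains(viewport)
```

With a 1,000 × 800 viewport, the 50% box leaves a margin of 250 pixels horizontally and 200 vertically, so pans shorter than that margin are served without any new request.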
In Section 3.3, we use two simple box calculation algorithms to experimentally show that dynamic boxes are a more performant option than static tiles. We leave an in-depth performance study as future work.
Database Design and Indexing. We now describe two database designs along with two indexing schemes that we use to support static tiles and dynamic boxes. Our first database design maps tuples to static tiles and has two tables. The first table is a record table containing all the raw data attributes in addition to an auto-increment tuple_id attribute. The second table contains two columns, tuple_id and tile_id. Each record in this table corresponds to a tuple that overlaps a tile. The Kyrix backend uses placement functions specified by developers to precompute the second table. We then build B-tree or hash indexes on the tuple_id column of the first table and the tile_id column of the second table. At runtime, tile queries are answered by joining these two tables on the tuple_id column.
Our second database design is based on a spatial index in PostgreSQL. In addition to raw data attributes, we store a bbox attribute representing the bounding box of a tuple on a canvas. (We assume records are generally rendered bigger than a single pixel. This bounding box information is derived from the placement functions specified by developers.) We then build a spatial index on the bbox column. Using this design, queries that request tuples whose bounding boxes intersect with a given rectangle should run fast. Therefore, this design can be used by both static tiles and dynamic boxes.
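The tuple-tile mapping design can be illustrated with a toy version built on SQLite from Python's standard library. This is only a sketch: the experiments in this paper use PostgreSQL, and the table names (record, tile_map), the textual tile ID format, the tile size and the placement function are all hypothetical.

```python
import sqlite3

TILE = 1024  # tile side length in canvas pixels (assumed for the example)


def tiles_for_bbox(min_x, min_y, max_x, max_y, tile=TILE):
    """IDs of all tiles overlapped by a bounding box, formatted as 'col_row'."""
    return [f"{c}_{r}"
            for c in range(int(min_x // tile), int(max_x // tile) + 1)
            for r in range(int(min_y // tile), int(max_y // tile) + 1)]


def build(conn, rows, placement):
    """Create the record table and precompute the tuple-to-tile mapping table."""
    conn.executescript("""
        CREATE TABLE record (tuple_id INTEGER PRIMARY KEY AUTOINCREMENT,
                             x REAL, y REAL);
        CREATE TABLE tile_map (tuple_id INTEGER, tile_id TEXT);
    """)
    for x, y in rows:
        cur = conn.execute("INSERT INTO record (x, y) VALUES (?, ?)", (x, y))
        for tile_id in tiles_for_bbox(*placement(x, y)):
            conn.execute("INSERT INTO tile_map VALUES (?, ?)",
                         (cur.lastrowid, tile_id))
    # record.tuple_id is already indexed as the primary key; index tile_id too.
    conn.execute("CREATE INDEX idx_tile_map_tile ON tile_map (tile_id)")


def fetch_tile(conn, tile_id):
    """Answer a tile query by joining the two tables on tuple_id."""
    return conn.execute(
        "SELECT r.tuple_id, r.x, r.y FROM tile_map m "
        "JOIN record r ON r.tuple_id = m.tuple_id "
        "WHERE m.tile_id = ? ORDER BY r.tuple_id", (tile_id,)).fetchall()


def placement(x, y):
    # Hypothetical placement: each record is drawn as a 10 x 10 box.
    return (x - 5, y - 5, x + 5, y + 5)


conn = sqlite3.connect(":memory:")
build(conn, [(100, 100), (1022, 500), (3000, 2000)], placement)
```

Note that the second record straddles a tile boundary and therefore appears twice in the mapping table, once per overlapped tile. This duplication is what lets every tile query be answered independently.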
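The query that the spatial design must answer is a rectangle intersection test. The sketch below gives its reference semantics in Python as a full scan, which is exactly the work the spatial index is meant to avoid. The two SQL strings are hypothetical: they assume the bbox column uses PostgreSQL's built-in box type with a GiST index and an illustrative table named records, none of which is specified in the text.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

# Hypothetical PostgreSQL statements (assumed schema, shown for orientation only).
CREATE_INDEX_SQL = "CREATE INDEX records_bbox_idx ON records USING gist (bbox);"
FETCH_SQL = ("SELECT * FROM records "
             "WHERE bbox && box(point(%s, %s), point(%s, %s));")


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def fetch_by_scan(bboxes: Iterable[BBox], rect: BBox) -> List[BBox]:
    """Reference result of the fetch: every stored box that intersects rect."""
    return [bbox for bbox in bboxes if intersects(bbox, rect)]
```

Because the predicate takes an arbitrary rectangle, the same table and index can serve a tile (a rectangle on a fixed grid) or a dynamic box (a rectangle anywhere on the canvas).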
3.2. Performance Hygiene
Parallelism. We can apply parallelism to improve data management in Kyrix. All data and metadata (canvas definitions, etc.) are stored in and retrieved from the DBMS. Although the performance experiments in the next section use PostgreSQL, it would be prudent to replace the DBMS with a parallel one if performance requirements warrant a switch. Currently, rendering is performed by a separate process on a separate CPU in the frontend. This operation can also be easily parallelized. Lastly, each concurrent Kyrix application runs in a separate process, since there is no interaction between applications except through the DBMS. Right now, Kyrix applications function like read-only browsers. Future releases will extend Kyrix to allow editing updates, which can be supported by DBMS concurrency control.
Application Design. Managing visual density on the screen, which can overwhelm users as well as the client (e.g., the browser) resources, is an important concern in visualization of large datasets. Application design must deal with what canvases exist and how to put data onto these canvases so that visual density is not too high.
Separability. Recall that in Section 3.1 we describe how Kyrix precomputes database tables and indexes to ensure fast data fetching. However, when the data is huge or the SQL query corresponding to a canvas layer is complex, this precomputation can take a long time. We identify a common case where this precomputation can be avoided: the (x, y) placement of objects is given directly by raw data attributes, or by some simple scaling of raw data attributes. In these separable cases, if we assume DBAs have built spatial indexes on the relevant raw data attributes when data is first loaded into the DBMS, we do not have to precompute the tables described in Section 3.1. For separable cases, we provide developers with the option to specify the relevant attributes so that Kyrix can skip precomputation. There are cases where this requirement cannot be met, i.e., the placement of an object depends on multiple data attributes or on the placements of other objects. We call these cases non-separable. A pie chart is an example.
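One way to see why separable placements need no precomputation: when the canvas position is a linear scaling of a raw attribute, a requested canvas rectangle can be translated back into ranges over the raw attributes, and those ranges can be answered by an index that already exists on them. The sketch below assumes a placement of the form canvas = a * attribute + b with a nonzero a; the function names and the (a, b) parameterization are illustrative, not part of Kyrix.

```python
from typing import Tuple

Scale = Tuple[float, float]  # (a, b) in canvas = a * attribute + b, with a != 0


def invert_range(lo: float, hi: float, scale: Scale) -> Tuple[float, float]:
    """Map a canvas interval back to the raw-attribute interval that produces it."""
    a, b = scale
    first, second = (lo - b) / a, (hi - b) / a
    # A negative scale factor (e.g. a flipped y axis) reverses the interval.
    return (min(first, second), max(first, second))


def canvas_rect_to_attribute_ranges(rect, x_scale: Scale, y_scale: Scale):
    """Translate (min_x, min_y, max_x, max_y) on the canvas into attribute ranges."""
    min_x, min_y, max_x, max_y = rect
    return (invert_range(min_x, max_x, x_scale),
            invert_range(min_y, max_y, y_scale))
```

The resulting ranges can be placed directly in a WHERE clause over the raw columns. A pie chart cannot be handled this way because the angle of one slice depends on the values of all the others.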
3.3. Initial Performance Experiments
We conducted performance experiments on two synthetic datasets using three viewport movement traces. The goal of these experiments is to study the characteristics of the two fetching granularities when combined with different database designs. All experiments are done on an AWS EC2 m4.2xlarge instance with 8 cores and 32GB RAM. PostgreSQL 9.3 is the backing DBMS.
Datasets. We used two synthetic datasets, Uniform and Skewed. In Uniform, there are 100M random dots evenly distributed on a 1M × 0.1M canvas. In Skewed, 80M dots lie in 20% of the canvas area (a 0.4M × 0.05M rectangle) and 20M dots lie in the rest of the canvas. Skewed corresponds to the likely scenario when objects are distributed unevenly on a canvas.
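A scaled-down generator for the two datasets might look like the sketch below. The canvas and rectangle dimensions come from the description above; the position of the dense rectangle (anchored at the canvas origin here), the use of rejection sampling, and the function names are assumptions. The density arithmetic follows directly: 80% of the dots in 20% of the area versus 20% of the dots in 80% of the area gives a 16-fold density difference.

```python
import random

CANVAS_W, CANVAS_H = 1_000_000, 100_000   # 1M x 0.1M canvas
DENSE_W, DENSE_H = 400_000, 50_000        # 0.4M x 0.05M dense rectangle

AREA_FRACTION = (DENSE_W * DENSE_H) / (CANVAS_W * CANVAS_H)
DENSITY_RATIO = (0.8 / AREA_FRACTION) / (0.2 / (1 - AREA_FRACTION))


def uniform(n, rng):
    """n dots spread evenly over the whole canvas."""
    return [(rng.uniform(0, CANVAS_W), rng.uniform(0, CANVAS_H)) for _ in range(n)]


def in_dense(x, y):
    # Assumption: the dense rectangle is anchored at the canvas origin.
    return x < DENSE_W and y < DENSE_H


def skewed(n, rng, dense_fraction=0.8):
    """dense_fraction of the dots inside the dense rectangle, the rest outside it."""
    n_dense = round(n * dense_fraction)
    dots = [(rng.uniform(0, DENSE_W), rng.uniform(0, DENSE_H)) for _ in range(n_dense)]
    while len(dots) < n:
        x, y = rng.uniform(0, CANVAS_W), rng.uniform(0, CANVAS_H)
        if not in_dense(x, y):            # rejection sampling for the sparse region
            dots.append((x, y))
    return dots


rng = random.Random(0)
sample = skewed(1000, rng)
```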
Viewport Movement Traces. In our experiments we use three viewport movement traces illustrated in Figure 5.
Trace-a: The viewport is always aligned with tile boundaries. It moves horizontally leftwards six steps (the length of a tile) then vertically upwards six steps.
Trace-b: The viewport is never aligned with tiles. It also moves horizontally leftwards six steps (the length of a tile) then vertically upwards six steps.
Trace-c: The viewport moves diagonally from bottom left to top right. There are six steps in total.
Fetching schemes. We evaluated the following fetching schemes.
Dbox: Dynamic boxes with spatial index. The box fetched is exactly the viewport in each step.
Dbox 50%: Dynamic boxes with spatial index. The box fetched is 50% larger than the viewport.
Tile spatial: Static tiles with spatial index (three tile sizes tested: 256, 1,024 and 4,096).
Tile tuple-tile mapping: Static tiles with tuple-tile mapping (tile size 1,024 tested). A B-tree index is used on the tuple_id and tile_id columns.
Dbox has the best overall performance on both Uniform and Skewed. The reasons are twofold. First, it fetches the least amount of data needed to render the viewport. Second, compared to small tiles, it issues far fewer queries.
Tile 1,024 spatial has competitive performance on trace-a, and is even better than Dbox 50%. This is because the viewport completely aligns with tile boundaries in trace-a.
Tile 4,096 and 256 spatial have the worst performance. This is expected, since the tile size 4,096 fetches more data than the other fetching schemes and the tile size 256 issues more queries than the other fetching schemes.
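The trade-off between tile sizes can be made concrete with some arithmetic. The sketch below counts how many tiles a single viewport touches and how much canvas area those tiles cover relative to the viewport. It assumes a hypothetical 1,024 × 1,024 viewport (the viewport size is not given in the text) and ignores caching across steps, so it illustrates the trend rather than reproducing the measured results.

```python
def tiles_hit(view_x, view_y, view_w, view_h, tile):
    """Number of fixed-size tiles intersecting a viewport with integer coordinates."""
    cols = (view_x + view_w - 1) // tile - view_x // tile + 1
    rows = (view_y + view_h - 1) // tile - view_y // tile + 1
    return cols * rows


def overfetch(view_x, view_y, view_w, view_h, tile):
    """Area of all fetched tiles divided by the viewport area."""
    fetched = tiles_hit(view_x, view_y, view_w, view_h, tile) * tile * tile
    return fetched / (view_w * view_h)


VIEW = 1024  # hypothetical viewport side length

for tile in (256, 1024, 4096):
    aligned = tiles_hit(4096, 4096, VIEW, VIEW, tile)
    unaligned = tiles_hit(4196, 4196, VIEW, VIEW, tile)
    print(tile, aligned, unaligned, overfetch(4196, 4196, VIEW, VIEW, tile))
```

Under these assumptions, 256-pixel tiles need 16 requests for an aligned viewport and 25 for an unaligned one, 4,096-pixel tiles need a single request but cover 16 times the viewport area, and 1,024-pixel tiles match the viewport exactly only when it is aligned with tile boundaries. A dynamic box that equals the viewport always needs one request and fetches no extra area.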
4. Discussion and Future Work
Previous work (Battle et al., 2016) has studied prefetching data ahead of the user’s interaction. Specifically, both momentum-based and semantic-based prefetching were considered in a tiling context. To determine what to prefetch, semantic-based prefetching uses similarity in data characteristics (e.g., distribution) to recently viewed data, whereas momentum-based prefetching takes the user’s recent movements (e.g., pan and zoom) into account. We plan to evaluate the effectiveness of momentum-based prefetching in the context of dynamic boxes. Our future work will also study caching options for Kyrix. Caching and prefetching are challenging given the jump operation, and will become more challenging with the extension of Kyrix to support coordinated views.
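As a rough illustration of the momentum idea, the sketch below extrapolates the next viewport center from the average of the user's most recent pan displacements. It is a deliberately simple stand-in and not the prediction model used by ForeCache or planned for Kyrix; the function name and the window parameter are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def predict_next_center(history: List[Point], window: int = 3) -> Point:
    """Extrapolate the next viewport centre from the last `window` movements.

    `history` holds past viewport centres, oldest first, and must be non-empty.
    """
    if len(history) < 2:
        return history[-1]                 # no movement observed yet
    recent = history[-(window + 1):]       # window movements span window + 1 points
    steps = len(recent) - 1
    dx = (recent[-1][0] - recent[0][0]) / steps
    dy = (recent[-1][1] - recent[0][1]) / steps
    return (recent[-1][0] + dx, recent[-1][1] + dy)
```

In a dynamic-box setting, the predicted center could be used to request the next box before the viewport leaves the current one, or to shift the box in the direction of travel instead of centering it on the viewport.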
Currently, we are collaborating with a neurology group at Massachusetts General Hospital (MGH), which we anticipate motivating various future extensions of Kyrix. Our collaborators want to be able to interactively explore 50 terabytes of electroencephalogram (EEG) data collected from sleeping subjects. They want three different views of the data, a temporal view, a spectral view and a composite clustering view, to be coordinated. For instance, movement in the temporal view should cause an appropriate change in the spectral view. Hence, Kyrix must be extended to support multiple canvases on the screen simultaneously and to have pan/zoom operations in one canvas cause desired actions in other canvases. In addition, MGH wants an update model for Kyrix so they can edit and tag relevant data. Fifty terabytes will require a parallel multi-node DBMS to achieve our performance goals.
Lastly, we envision Kyrix as an integrated environment for developing scalable visualization applications. To this end, we plan, for example, to work on an “application by example” interface, whereby a user can drag and drop screen objects, and Kyrix can learn to automatically generate the location function (and perhaps other parts of the application).
5. Related Work
Kyrix is related to prior efforts in scalable visualization systems and declarative visualization specification.
5.1. Scalable Visualization Systems
Earlier research has proposed methods for scalable interactive data analysis that generally fall into one of two categories: precomputation and sampling (Hellerstein, 2015). Precomputation, which traditionally referred to processing data into formats such as prespecified tiles or cubes, has been the prevalent approach to interactively answer queries via zooming, panning, brushing and linking. Google Maps precomputes image tiles for multiple zoom layers to support scalable panning and zooming. Extending the tiling idea to structured data, imMens (Liu et al., 2013) computes multivariate data tiles in advance along with projections corresponding to materialized database and performs fast “roll ups” and rendering on the GPU. Nanocubes (Lins et al., 2013) stores and queries multi-dimensional aggregated data at multiple levels of resolution in memory for visualization. Hashedcubes (Pahins et al., 2017) improves on the memory footprint and implementation complexity of Nanocubes at the cost of longer query times. ForeCache (Battle et al., 2016) uses data tiling together with predictive prefetching and in-memory caching to enable scalable panning and zooming for visualizations of array-based datasets. When precomputation is not possible (e.g., queries are not known in advance), sampling, often combined with precomputation, and online aggregation (Hellerstein et al., 1997; Agarwal et al., 2013; Battle et al., 2013) are used to improve user experience.
Kyrix precomputes database indexes and uses novel data fetching mechanisms to efficiently respond to pan and zoom interactions. Kyrix’s new dynamic-box fetching together with a spatial index outperforms tile-based fetching used in earlier systems. To ensure the 500 ms response time, Kyrix also adopts predictive prefetching and caching techniques (Chan et al., 2008; Battle et al., 2016).
5.2. Declarative Visualization Specification
Earlier research proposes declarative grammars over data as well as visual encoding and design variables to specify visualizations. In a seminal work, Wilkinson introduces a grammar of graphics (Wilkinson, 1999) and its implementation (VizML), forming the basis of subsequent research on visualization specification. Drawing from Wilkinson’s grammar of graphics, Polaris (Stolte et al., 2002) (commercialized as Tableau) uses a table algebra, which later evolved into VizQL (Hanrahan, 2006), the underlying representation of Tableau visualizations. Wickham introduces ggplot2 (Wickham, 2010), a widely popular package in the R statistical language, based on Wilkinson’s grammar. Similarly, Protovis (Bostock and Heer, 2009), D3 (Bostock et al., 2011), Vega (Satyanarayan et al., 2016), Brunel (Wills, 2017), and Vega-Lite (Satyanarayan et al., 2017) all provide grammars to declaratively specify visualizations.
Kyrix’s declarative grammar differs from these earlier efforts by providing constructs for the specification of scalable interactive visualizations and by integrating visualization specification with server-side processing and scalable data management for performance optimization.
6. Conclusion
The current practice of purpose-built scalable visualization tools is itself not scalable under the fast growth of large datasets across domains. To accelerate the development pace of interactive visualization systems at scale, we need to make it easier for developers to access scalable data management models as well as performance optimizations needed for sustaining interactive rates. In this paper, we present the design of Kyrix, a novel end-to-end system for developers to build interactive, details-on-demand visualizations at scale. Kyrix enables developers to declaratively specify visualizations, while utilizing Kyrix’s suite of optimizations and data management model. Kyrix also contributes a novel dynamic fetching scheme that outperforms tile-based fetching common to existing systems.
- Agarwal et al. (2013) Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 29–42.
- Battle et al. (2016) Leilani Battle, Remco Chang, and Michael Stonebraker. 2016. Dynamic Prefetching of Data Tiles for Interactive Visualization. In ACM SIGMOD. 1363–1375.
- Battle et al. (2013) Leilani Battle, Michael Stonebraker, and Remco Chang. 2013. Dynamic Reduction of Query Result Sets for Interactive Visualizaton. In Proc. IEEE Conference on Big Data.
- Bostock and Heer (2009) Michael Bostock and Jeffrey Heer. 2009. Protovis: A Graphical Toolkit for Visualization. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis) (2009).
- Bostock et al. (2011) Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis) (2011).
- Chan et al. (2008) Sye-Min Chan, Ling Xiao, J. Gerth, and P. Hanrahan. 2008. Maintaining interactivity while exploring massive time series. In IEEE Symposium on Visual Analytics Science and Technology. 59–66.
- Cheng et al. (2013) Daniel Cheng, Peter Schretlen, Nathan Kronenfeld, Neil Bozowsky, and William Wright. 2013. Tile based visual analytics for twitter big data exploratory analysis. In Big Data, 2013 IEEE International Conference on. IEEE, 2–4. http://aperturetiles.com/
- Hanrahan (2006) Pat Hanrahan. 2006. Vizql: a language for query, analysis and visualization. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 721–721.
- Hellerstein (2015) Joseph M. Hellerstein. 2015. Interactive Analytics. In Readings in Database Systems (5th ed.). MIT Press.
- Hellerstein et al. (1997) Joseph M Hellerstein, Peter J Haas, and Helen J Wang. 1997. Online aggregation. In ACM SIGMOD Record, Vol. 26. ACM, 171–182.
- Lins et al. (2013) Lauro Lins, James T. Klosowski, and Carlos Scheidegger. 2013. Nanocubes for real-time exploration of spatiotemporal datasets. IEEE TVCG 19, 12 (2013), 2456–2465.
- Liu and Heer (2014) Zhicheng Liu and Jeffrey Heer. 2014. The effects of interactive latency on exploratory visual analysis. IEEE transactions on visualization and computer graphics 20, 12 (2014), 2122–2131.
- Liu et al. (2013) Zhicheng Liu, Biye Jiang, and Jeffrey Heer. 2013. imMens: Real-time Visual Querying of Big Data. Comput. Graphics Forum 32 (2013), 421–430.
- Pahins et al. (2017) Cicero A. L. Pahins, Sean A. Stephens, Carlos Scheidegger, and Joao L. D. Comba. 2017. Hashedcubes: Simple, Low Memory, Real-Time Visual Exploration of Big Data. IEEE Transactions on Visualization and Computer Graphics (2017), 671–680.
- Pirolli and Card (1999) Peter Pirolli and Stuart Card. 1999. Information foraging. Psychological review 106, 4 (1999), 643.
- Satyanarayan et al. (2017) Arvind Satyanarayan, Dominik Moritz, Kanit Wongsuphasawat, and Jeffrey Heer. 2017. Vega-Lite: A Grammar of Interactive Graphics. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis) (2017).
- Satyanarayan et al. (2016) Arvind Satyanarayan, Ryan Russell, Jane Hoffswell, and Jeffrey Heer. 2016. Reactive Vega: A Streaming Dataflow Architecture for Declarative Interactive Visualization. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis) (2016).
- Satyanarayan et al. (2014) Arvind Satyanarayan, Kanit Wongsuphasawat, and Jeffrey Heer. 2014. Declarative Interaction Design for Data Visualization. In ACM User Interface Software & Technology (UIST). http://idl.cs.washington.edu/papers/reactive-vega
- Shneiderman (1996) Ben Shneiderman. 1996. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages.
- Stolte et al. (2002) C. Stolte, D. Tang, and P. Hanrahan. 2002. Polaris: a system for query, analysis, and visualization of multidimensional relational databases. IEEE Transactions on Visualization and Computer Graphics 8, 1 (2002), 52–65. https://doi.org/10.1109/2945.981851
- Wickham (2010) Hadley Wickham. 2010. A layered grammar of graphics. Journal of Computational and Graphical Statistics 19, 1 (2010), 3–28.
- Wilkinson (1999) Leland Wilkinson. 1999. The Grammar of Graphics (1st ed.). Springer.
- Wills (2017) Graham Wills. 2017. Brunel v2.5. https://github.com/Brunel-Visualization/Brunel. Accessed: 2018-04-04.