Data visualizations use visual representations of abstract data to amplify human cognition. Researchers have traditionally investigated visualizations as artifacts created for people. This paper revisits this traditional perspective in response to the growing research interest in applying artificial intelligence (AI) to visualizations. Similar to common data formats such as text and images, visualizations are increasingly created, shared, collected, and reused with the power of AI. Thus, we see visualizations becoming a new data format processed by AI. For instance, this trend was evident at the 2020 IEEE Visualization Conference, where multiple techniques were proposed for automating the creation of visualizations [qian2020retrieve, wu2020mobilevisfixer, oppermann2020vizcommender, shi2020calliope, kim2020gemini], retargeting visualizations [zhang2020viscode, fu2020chartem], and analyzing visualization ensembles [chen2020composition, zhao2020chartseer]. In light of this trend, new concepts and research problems are emerging, raising the need to organize the existing literature and clarify the research landscape.
This survey describes the research vision of formalizing visualizations as an emerging data format and reviews recent advances in developing AI approaches for visualization data (AI4VIS). We define visualization data as the digital representations of visualizations in computers and focus on visualizations in information visualization and visual analytics. However, AI is a broad notion that has been studied in different areas. These areas differ in their motivations and research questions for applying AI to visualization data, in the techniques they propose, and in the content formats they use to represent visualizations. For instance, the Web and Information Retrieval community advances technologies for searching visualizations (e.g., [chen2015diagramflyer, ray2015architecture]), while Computer Vision research has recently studied visual question answering on charts (e.g., [kafle2018dvqa, chen2020figure]). Therefore, a comprehensive understanding of AI4VIS research requires drawing on a rich literature from diverse disciplines.
We construct a literature corpus with a relation-search approach [mcnabb2017survey], i.e., graph traversal over citation and reference networks. This approach yields 98 publications from 10 research communities, roughly one-third of which come from the visualization community. While not exhaustive, our corpus provides sufficient research instances to synthesize relevant work that contributes to the understanding of important common research questions and techniques. As shown in Figure 1, we organize and categorize AI4VIS research following a well-established why-what-how viewpoint [xu2020survey]:
Why apply AI to visualization data. We identify three common goals for AI4VIS research, namely visualization generation, enhancement, and analysis (Figure 1). We classify those goals into subcategories to provide a comprehensive response to ongoing discussions such as “why should we teach machines to read charts made for humans” [ono2018should].
What is visualization data. We formalize the concept of visualization data by providing an overview of existing content formats of visualization data as well as their representations (Figure 1).
How to apply AI to visualization data. Most importantly, we contribute a task abstraction regarding how to apply AI to visualization data (Figure 1). The task abstraction is critical since it allows for domain-agnostic and consistent descriptions of research questions across different disciplines. It also facilitates an organized discussion that reconciles distinct visualization paper types [lee2019broadening]: a technique paper typically focuses on a single task, while a system paper might accomplish multiple tasks. In total, we identify seven common tasks and discuss the AI approaches for each task separately.
Drawing upon this discussion, we outline research opportunities. We make the list of surveyed papers and related materials available online at ai4vis.github.io. We hope that our survey will stimulate new theories, problems, techniques, and applications in this growing research area.
2 Related Surveys
Our ability to collect data has significantly exceeded our ability to analyze it, contributing to the emergence of AI approaches that automate the analysis process. Recently, there have been active discussions about how visualization research could be interwoven with AI [visai, andrienko2020big]. However, AI has been defined and operationalized as a broad notion. To make our discussion concrete, our specific aim is to formalize visualizations as an emerging data format. We see visualizations as different from existing data types such as images and text, thereby raising many research questions regarding how AI facilitates the manipulation and analysis of visualizations.
Several surveys review techniques for automating the creation of visualizations. Saket et al [saket2018beyond] discussed the prospect of learning visualization design and classified automated visualization design systems into knowledge-based (i.e., rule-based), data-driven (i.e., machine-learning), and hybrid approaches. This classification was systematically reviewed in a recent survey on visualization and infographics recommendation [zhu2020survey]. In addition, Qin et al [qin2020making] drew on research in the database community to survey what makes data visualization more efficient and effective. Beyond automated creation and recommendation, Davila et al [davila2020chart] reviewed research over the past eight years to formalize chart mining, defined as “the process of automatic detection, extraction and analysis of charts to reproduce the tabular data that was originally used to create them”. Wang et al [wang2020applying] recently surveyed machine learning models applied to visualizations. Unlike these surveys, our objective is to provide a comprehensive review of AI4VIS research, that is, a scaffold within which emerging research problems (e.g., automatic assessment and summarization of visualizations) can be formulated and understood.
In this section, we describe the scope of our literature survey, the search methodology and corpus, as well as our analysis method.
3.1 Definition and Scope
This survey focuses on AI approaches applied to visualization data. We define visualization data as the digital representations of visualizations in computers. Thus, we include existing work that contributes AI techniques or systems whose primary input or output is visualization data. However, due to the wide scope of visualizations and related research, we restrict the scope of visualizations for manageability.
Excluding scientific visualizations. We cover literature related to information visualization and visual analytics, particularly where visualizations are represented as charts and infographics. This restriction excludes literature that primarily focuses on scientific visualizations, which represent scientific data such as flows and volumes and are typically designed with a strong inherent reference to space and time [kehrer2012visualization]. We exclude them because their heterogeneous nature yields different research challenges and interests [xu2020survey].
Excluding research specific to a chart type. The research problems and proposed techniques can vary among chart types. For instance, a dedicated conference in the graph visualization community (the International Symposium on Graph Drawing) focuses on the graph layout problem, which might not apply to other visualizations such as histograms. The body of chart-specific studies is large and might not be covered in a single survey; e.g., Behrisch et al [behrisch2018quality] devoted a survey to assessment metrics for different chart types. In contrast, our goal is to identify common research problems (e.g., assessment and recommendation) irrespective of chart type. Nevertheless, we note a few chart-type-specific studies and discuss how they might fit into our taxonomy in section 8.
Excluding human interaction data. Finally, we emphasize the central role of visualization data in the surveyed AI approaches. In other words, we do not consider work whose primary goal is to collect and analyze human interaction data when creating or using visualizations. Therefore, we exclude provenance data [xu2020survey] and data collected for natural language interfaces [srinivasan2017natural].
3.2 Method and Corpus
To establish the corpus of papers discussed in this survey, we apply a relation-search method [mcnabb2017survey] to traverse and search the literature. Our method starts with a linear scan of the full papers published at the 2020 IEEE Visualization Conference to collect the starting points. This initial set consists of nine papers [qian2020retrieve, wu2020mobilevisfixer, oppermann2020vizcommender, shi2020calliope, kim2020gemini, zhang2020viscode, fu2020chartem, chen2020composition, zhao2020chartseer]. We further augment these starting points with papers covered in related surveys [saket2018beyond, davila2020chart, zhu2020survey, qin2020making]. If a paper is selected for inclusion, we traverse both its references and its citations. We search for papers breadth-first to avoid over-focusing on one particular line of research.
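The traversal described above can be sketched as a breadth-first search over a citation graph. This is a minimal illustration under stated assumptions, not the tooling used for this survey: the dictionaries `references` and `citations` and the hypothetical `is_relevant` predicate (which stands in for the manual inclusion decision) are our own. Only included papers are expanded, matching the rule that references and citations are traversed once a paper is selected.

```python
from collections import deque
from typing import Callable, Dict, List


def relation_search(
    seeds: List[str],
    references: Dict[str, List[str]],
    citations: Dict[str, List[str]],
    is_relevant: Callable[[str], bool],
) -> List[str]:
    """Breadth-first traversal over reference and citation links.

    Only papers judged relevant are added to the corpus and expanded;
    irrelevant papers are visited once and then dropped.
    """
    visited = set(seeds)
    queue = deque(seeds)
    corpus: List[str] = []
    while queue:
        paper = queue.popleft()
        if not is_relevant(paper):  # hypothetical stand-in for manual screening
            continue
        corpus.append(paper)
        # Follow both backward (references) and forward (citations) links.
        for neighbor in references.get(paper, []) + citations.get(paper, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return corpus


# Toy graph: A references B and C, B references D; E cites A, F cites E.
references = {"A": ["B", "C"], "B": ["D"]}
citations = {"A": ["E"], "E": ["F"]}
print(relation_search(["A"], references, citations, lambda p: p != "C"))
# ['A', 'B', 'E', 'D', 'F']
```

Using a FIFO queue is what makes the search breadth-first: all direct neighbors of the seeds are screened before any second-hop paper, which is the property the survey relies on to avoid drifting down one line of research.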
Our relation-search method results in an interdisciplinary corpus of 98 papers from 10 research areas. We determine the research area of each paper according to the CSRankings classification of computer science areas (http://csrankings.org/). Figure 2 lists the research areas, indicating the interdisciplinary nature and widespread popularity of visualizations. The primary area is Visualization, accounting for around one-third of the corpus (34/98), followed by Human-Computer Interaction and Databases. We also see research efforts from Artificial Intelligence, Data Mining, Computer Graphics, and the Web & Information Retrieval.
Another finding is the increasing trend of applying AI methods to data visualization. Figure 2 shows that the total number of publications has been increasing steadily over the last decade, with a surge since 2018 and a peak in 2020. Given the wide and ever-growing research efforts and interests, we believe that this topic will receive more attention in the near future.
We acknowledge the limitations of our search methodology, which is based on manual search over citation and reference graphs. Consequently, our survey should be seen as an effort to investigate the diversity of related research, provide sufficient research instances to contextualize the current research landscape, and indicate future research opportunities. As such, we claim neither comprehensiveness nor exhaustiveness. Future research could augment the corpus through automated graph traversal.
3.3 Coding and Classification
Figure 1 provides an overview of our classifications of surveyed papers, which is organized according to the why-what-how axes. This organization is well-established in visualization-related surveys [xu2020survey, behrisch2018quality]. Nevertheless, we introduce two modifications to the “what” and “how” axes in order to contextualize our discussion in AI4VIS research.
Firstly, from the “what” perspective, we discuss not only what visualization data is but also how it is represented (section 5). For representation, we review internal representations (how visualization data is stored and operated on in systems) and feature representations (how visualizations are converted into features that are mathematically and computationally convenient for analysis).
Secondly, from the “how” perspective, we organize the AI approaches with a novel task abstraction. We identify seven common tasks for AI4VIS research and discuss approaches for each task in section 6. Our motivations for such task abstractions are two-fold:
Reconciling technique and system papers. Our corpus mixes technique and system papers, which confounds the discussion. A technique paper typically solves a single task, while a system paper could consist of multiple components, each addressing a different task [lee2019broadening]. As such, we aim to decompose system papers into abstract tasks that allow for a collectively exhaustive taxonomy of tasks.
Unifying Inconsistent Vocabularies. Due to the interdisciplinary nature of our corpus, we find that tasks are often described with inconsistent vocabularies. For instance, the task of extracting encoding choices from visualization images or specifications is described as deconstruction [wu2020mobilevisfixer, qian2020retrieve] or chart mining [davila2020chart]. Thus, we wish to establish a common vocabulary that enables researchers from different areas to discuss tasks consistently and communicate their relevance and subtleties.
For the above purposes, we adopt a bottom-up approach by iteratively categorizing and labeling the tasks in surveyed papers. Thanks to the interdisciplinary nature of our corpus, we are able to start from several “seeding” tasks by referring to task taxonomies in other fields [computerVision]. Subsequently, we verify whether each task maps to an existing category and, if not, discuss alternative task categorizations. Most tasks in the final taxonomy conform to well-known terminology, with a few exceptions that reflect the peculiarity of visualizations (e.g., visualization recommendation). We discuss details of the task taxonomy in section 6. Our supplemental website, available at ai4vis.github.io, provides the details of our labels. In particular, we label tasks at the (sub-)section level and provide quotes from each paper to help readers understand why it falls into the task category.
4 Goal: Why Apply AI to Visualization Data
The goals of applying AI to visualization data cover a wide spectrum and are pursued by research efforts from different areas. We adopt a deductive classification method to create a mutually exclusive and collectively exhaustive taxonomy that better structures our discussion. Specifically, we subdivide goals deductively along two axes: whether visualizations are the input or the output, and whether there is a single visualization or multiple visualizations. From an inductive perspective, we then merge outputting a single visualization and outputting multiple visualizations, since we observe that they share the same subcategorization. Therefore, we classify goals into three categories, which are further subdivided (Figure 3):
Visualization Generation outputs single or many visualizations given different user inputs.
Visualization Enhancement processes and applies enhancement to an input visualization.
Visualization Analysis concerns organizing and exploiting a visualization collection.
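The two deductive axes and the inductive merge can be made explicit as a small lookup. This is only an illustrative restatement of the taxonomy above; the function name and boolean arguments are hypothetical.

```python
def classify_goal(visualization_is_output: bool, multiple: bool) -> str:
    """Map the two deductive axes to the three goal categories.

    Output-single and output-multiple are merged into one category,
    so `multiple` only matters when visualizations are the input.
    """
    if visualization_is_output:
        return "Visualization Generation"
    return "Visualization Analysis" if multiple else "Visualization Enhancement"


for is_output in (True, False):
    for multiple in (False, True):
        print(is_output, multiple, classify_goal(is_output, multiple))
```

Because the merge collapses two of the four cells, the taxonomy remains collectively exhaustive (every cell maps to a category) while still being mutually exclusive.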
4.1 Visualization Generation
One of the central research problems in the visualization community is to ease the creation of visualizations. This is important since authoring effective and elegant visualizations is challenging even for professionals [qin2020making]. It is typically tedious and time-consuming to craft visualizations that clearly convey the insights while satisfying effectiveness and aesthetic goals. As such, the ultimate goal of work in this category is to automatically generate visualizations. We identified four subcategories that distinguish visualization generation approaches by user input.
Data-based generation outputs visualizations given a database or a data table. These approaches assist in visual data analysis and have been extensively studied over the last decades. Early research dates back to 1986 [mackinlay1986automating], and the problem remains important today. Recent work such as Draco [moritz2018formalizing], DeepEye [luo2018deepeye], VizML [hu2019vizml], and DataShot [wang2019datashot] advances this direction.
Visual analysis is an iterative process in which the next step of analysis often depends on earlier insights, motivating research on anchor-based generation. The problem is to recommend visualizations given an anchor visualization. For instance, SeeDB [vartak2015seedb] recommends visualizations that deviate substantially from the anchor visualization, since such visualizations are deemed the most “interesting” to users. Similarly, DiVE [mafrur2018dive] aims to diversify the recommended visualizations, while Dziban [lin2020dziban] aims to maintain consistency with the anchor.
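Deviation-based recommendation can be sketched as follows, assuming rows are plain Python dictionaries. Each candidate view (a group-by dimension and a summed measure) is scored by how far its normalized distribution on a target subset deviates from the same view on a reference dataset (here, all rows). The L1 distance and every function name below are our assumptions for illustration; this is not SeeDB's implementation and it omits the query optimizations a real system would need.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def distribution(rows: List[Row], dim: str, measure: str) -> Dict[object, float]:
    """SUM(measure) GROUP BY dim, normalized so the values sum to 1."""
    totals: Dict[object, float] = defaultdict(float)
    for row in rows:
        totals[row[dim]] += float(row[measure])
    total = sum(totals.values())
    return {k: v / total for k, v in totals.items()} if total else {}


def l1_deviation(p: Dict[object, float], q: Dict[object, float]) -> float:
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def rank_views(rows: List[Row], target: Callable[[Row], bool],
               dims: List[str], measures: List[str], k: int = 3) -> List[Tuple[float, str, str]]:
    """Score each (dimension, measure) view by target-vs-reference deviation."""
    subset = [r for r in rows if target(r)]
    scored = [
        (l1_deviation(distribution(subset, d, m), distribution(rows, d, m)), d, m)
        for d in dims
        for m in measures
    ]
    scored.sort(reverse=True)  # most deviating ("interesting") views first
    return scored[:k]


rows = [
    {"region": "east", "product": "a", "sales": 10, "profit": 1},
    {"region": "east", "product": "b", "sales": 10, "profit": 9},
    {"region": "west", "product": "a", "sales": 10, "profit": 5},
    {"region": "west", "product": "b", "sales": 10, "profit": 5},
]
print(rank_views(rows, lambda r: r["region"] == "east", ["product"], ["sales", "profit"]))
```

In this toy data, sales by product look identical for the target and the reference, so that view scores zero, whereas the profit view deviates (score of about 0.4) and is ranked first.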
Related to anchor-based generation, design-based generation studies the problem of generating visualizations by injecting the target data into a reference design. This is referred to as style transfer in Harper and Agrawala’s approach [harper2017converting] or as visualization-by-example [wang2019visualization]. Another recent example is Retrieve-Then-Adapt [qian2020retrieve], which applies pre-defined design templates to user information.
The last category is context-based generation, where the input only provides contextual information such as a natural language description [cui2019text] or news articles [lin2018vizbywiki]. An important task for context-based generation is to recommend the data most related to the given context.
4.2 Visualization Enhancement
The proliferation of visualizations gives rise to research efforts on enhancing the use of existing visualizations. An important question is how to retarget visualizations to different environments. For example, VisCode [zhang2020viscode] and Chartem [fu2020chartem] study how to encode additional information in visualization images. MobileVisFixer [wu2020mobilevisfixer] attempts to automatically convert web visualizations into mobile-friendly designs. Other work explores adding interactions to visualizations to improve their legibility and interactivity: Graphical Overlays [kong2012graphical] uses layered elements to aid chart reading, and Interaction+ [lu2017interaction+] enhances visualizations with dynamic, interactive visual exploration. It is also common to summarize visualizations to generate natural language descriptions such as captions and annotations. This approach transforms visualizations from a visual to a non-visual modality, thereby enabling multimodal interactions [lai2020automatic] or allowing people with vision impairments or low vision to consume visualizations [choi2019visualizing]. Related to natural language, several recent studies challenge machines to perform question answering on visualizations, that is, to generate answers to a given question (e.g., [kafle2018dvqa]).
4.3 Visualization Analysis
Finally, with the increasing availability of visualization data, researchers have constructed visualization databases and investigated methods for managing and analyzing these collections. Retrieval has been extensively studied in the fields of information systems and databases, helping users search for visualizations that match their needs. For instance, Retrieve-Then-Adapt [qian2020retrieve] assists users in finding an example visualization that is suitable for encoding their data. Saleh et al [saleh2015learning] developed a search engine that returns stylistically similar visualizations given a query visualization.
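Query-by-example retrieval of this kind typically reduces to nearest-neighbor search over feature vectors. The sketch below assumes each visualization has already been mapped to a small, hypothetical "style" vector and ranks the collection by cosine similarity to the query. How the style features are learned, which is the substance of Saleh et al’s work, is not reproduced here.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float], collection: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the names of the k visualizations most similar to the query."""
    ranked = sorted(collection, key=lambda name: cosine(query, collection[name]), reverse=True)
    return ranked[:k]


# Hypothetical 3-D style vectors: (pastel-palette score, gridline density, dark-background score).
collection = {
    "pastel_bar": [0.9, 0.1, 0.0],
    "dark_line": [0.0, 0.2, 0.9],
    "pastel_pie": [0.8, 0.3, 0.1],
}
print(retrieve([1.0, 0.2, 0.0], collection))  # ['pastel_bar', 'pastel_pie']
```

A linear scan is fine for a toy collection; large visualization databases would usually swap in an approximate nearest-neighbor index, while keeping the same similarity function.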
Another promising line of work mines visualization collections to derive useful information such as visualization usage on the web [battle2018beagle] or in the scientific literature [ray2015architecture, lee2017viziometrics], as well as design patterns in visualizations [hoque2019searching, smart2020color] and multiple-view systems [chen2020composition]. The mined patterns provide evidence for recommending visualizations. In addition, some work [dai2018chart, zhao2020chartseer] treats charts as the analytical target and provides visual analytics approaches for analyzing data patterns in chart ensembles.
5 Data: What is Visualization Data
In this section, we formalize the concept of visualization data. Specifically, we discuss and categorize visualization data in terms of its raw data format (subsection 5.1) and representations (subsection 5.2). As shown in Figure 4, we classify visualization data formats into graphics, programs, and hybrid formats that blend the benefits of both. Beyond raw data, we note that surveyed systems sometimes represent visualizations in carefully designed internal representation formats. Internal representations are usually proposed to facilitate computation by removing unnecessary information; e.g., the VQL format [luo2020interactive] stores only data transformations and encodings, without style information. As such, internal representations are usually not exposed (i.e., output or shared). Finally, we review feature representations, covering feature engineering and feature learning. Feature representations are vital for machine learning tasks because they convert visualizations into features that are mathematically and computationally convenient to analyze. We discuss them due to the increasing interest in applying machine learning to visualizations.
Visualization data can be stored in different content formats such as graphics and programs. The choice of content format directly influences the downstream operations possible on the visualization data, since each format has its own advantages and disadvantages. Here, we discuss the three formats identified in our survey: graphics, programs, and hybrid formats.
Graphics are a natural and expressive content format for visualizations, since visualizations are defined as graphical representations of data. It is common to author and store visualizations as raster graphics (bitmaps) for easy use and sharing [satyanarayan2019critical]. Nevertheless, raster graphics are a standalone and lossy representation that loses visualization semantics (e.g., chart type, visual encoding, underlying data). To perform automated analysis, reverse engineering is often a prerequisite, i.e., reconstructing the lost information from raster graphics using computer vision and machine learning approaches [poco2017reverse, savva2011revision]. However, reverse engineering remains an open problem, with challenges to overcome in terms of robustness and accuracy [fu2020chartem]. In short, the lossy nature of raster graphics hinders machines from easily interpreting and transforming visualizations [fu2020chartem].
Vector graphics represent a less lossy alternative. They have an advantage over raster graphics in that they can be scaled up without aliasing. Vector visualizations are usually stored in the Scalable Vector Graphics (SVG) format [satyanarayan2019critical], which describes visual elements as shapes (e.g., rectangles and text) with attributes (e.g., positions and fill color). These low-level descriptions reduce the difficulty of reverse engineering; e.g., it is no longer necessary to apply computer vision techniques to detect objects such as text [Moritz2017]. Besides, this format supports interactivity and animation. Nevertheless, high-level visualization semantics such as visual encodings and underlying data are still lost, and extracting them requires considerable effort [harper2014deconstructing, wu2020mobilevisfixer].
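The advantage of SVG for reverse engineering can be illustrated with Python’s standard XML parser. In the hypothetical bar chart below, bar geometry and label text are read directly from element attributes, with no object detection. Pairing each label with the bar whose horizontal center is nearest is a simplifying heuristic of ours. Note that the recovered heights are pixel lengths: the scale that maps data values to pixels is not stored in the SVG, which is exactly the high-level semantics that remains lost.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical two-bar chart; coordinates are in pixels.
BAR_CHART_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="110">
  <rect x="10" y="40" width="30" height="60" fill="steelblue"/>
  <rect x="60" y="20" width="30" height="80" fill="steelblue"/>
  <text x="25" y="108" text-anchor="middle">A</text>
  <text x="75" y="108" text-anchor="middle">B</text>
</svg>"""


def label_bar_heights(svg_text: str) -> dict:
    """Map each text label to the pixel height of the nearest bar."""
    root = ET.fromstring(svg_text)
    bars = [
        {k: float(r.get(k)) for k in ("x", "y", "width", "height")}
        for r in root.iter(SVG_NS + "rect")
    ]
    result = {}
    for t in root.iter(SVG_NS + "text"):
        tx = float(t.get("x"))
        # Heuristic: pair the label with the bar whose horizontal center is closest.
        bar = min(bars, key=lambda b: abs(b["x"] + b["width"] / 2 - tx))
        result[t.text] = bar["height"]
    return result


print(label_bar_heights(BAR_CHART_SVG))  # {'A': 60.0, 'B': 80.0}
```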
To conclude, graphics are a human-friendly and expressive content format for visualizations. However, their lossy nature limits machine interpretation and computation and necessitates reverse engineering. In this respect, vector graphics are more amenable to reverse engineering than raster graphics.
Researchers have developed approaches for describing and storing visualizations as computer programs. Programs retain the information necessary to construct the visualization, e.g., the underlying data. This information is expressed in programming languages, which can be classified as imperative or declarative.
Declarative visualization languages ask programmers to describe the desired results directly; such descriptions are usually referred to as visualization specifications. Specifications (e.g., Vega [satyanarayan2015reactive] and Vega-Lite [satyanarayan2016vega]) encapsulate the step-by-step commands for constructing a visualization into semantic components such as data encodings and axis and legend properties. This encapsulation is achieved by providing sensible defaults and introducing constraints with prescribed properties and structures. As such, declarative programs tend to be less expressive than, or as expressive as, imperative programs, depending on their design. Since specifications contain tags or markers that separate semantic elements and enforce hierarchies, they are deemed semi-structured and thus more amenable to computer processing [buneman1997semistructured]. It is, therefore, common practice to generate or collect specifications for data-driven research; e.g., VizML [hu2019vizml] collects a corpus from Plotly to train visualization recommendation systems.
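The semi-structured nature of specifications means that semantics lost in graphics can be read off directly. The sketch below parses a small Vega-Lite bar-chart specification (the data values are illustrative) and extracts the mark type, the channel-to-field encodings, and the inline data size with nothing more than a JSON parser. The `summarize` helper is hypothetical and handles only the fields shown.

```python
import json

# A small Vega-Lite bar chart; data values are illustrative.
SPEC = json.loads("""{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {"values": [{"a": "A", "b": 28}, {"a": "B", "b": 55}]},
  "mark": "bar",
  "encoding": {
    "x": {"field": "a", "type": "nominal"},
    "y": {"field": "b", "type": "quantitative"}
  }
}""")


def summarize(spec: dict) -> dict:
    """Extract mark type, encodings, and inline row count from a Vega-Lite spec."""
    mark = spec["mark"]
    # In Vega-Lite, `mark` may be a string ("bar") or an object ({"type": "bar", ...}).
    mark_type = mark["type"] if isinstance(mark, dict) else mark
    encodings = {
        channel: (enc.get("field"), enc.get("type"))
        for channel, enc in spec.get("encoding", {}).items()
    }
    rows = spec.get("data", {}).get("values", [])
    return {"mark": mark_type, "encodings": encodings, "n_rows": len(rows)}


print(summarize(SPEC))
# {'mark': 'bar', 'encodings': {'x': ('a', 'nominal'), 'y': ('b', 'quantitative')}, 'n_rows': 2}
```

Compared with the SVG case, no geometric heuristics are needed: the encoding is stated explicitly, which is why specifications are popular training data for data-driven approaches.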
Arguably, programs are not friendly to most people other than programmers. As such, laypeople tend to share programs online less often, which hinders their collection and reuse. This might be exemplified by data collection for visualization research: even though programs are more commonly used in AI approaches than graphics, existing corpora mainly include graphics [battle2018beagle, borkin2013makes, lee2017viziometrics] or tabular datasets [hu2019viznet] rather than programs. This suggests a need to better balance the machine- and human-friendliness of visualization formats.
Recent research proposes several hybrid content formats that incorporate the benefits of both graphics and programs. Although such efforts remain limited, we provide our preliminary classification here, hoping to motivate future theories and models.
Two approaches aim to embed programs into graphics. VisCode [zhang2020viscode] presents an embedding approach based on deep image steganography, i.e., it conceals visualization specifications and meta-information within the bitmap image. Similarly, Chartem [fu2020chartem] encodes information (e.g., the visualization specifications) in the background of a chart image. Both techniques reduce the overhead of decoding the underlying visualization specifications: they show that the encoded information can be extracted efficiently and in a less error-prone manner. In addition, both approaches reduce interference with human perception by avoiding visually important areas of the bitmap image.
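To make the idea of hiding a program inside a graphic concrete, the sketch below uses classic least-significant-bit (LSB) steganography on a flattened list of 8-bit grayscale values. This is a deliberately simpler, swapped-in technique: it is neither VisCode’s learned encoder-decoder nor Chartem’s background encoding, it does not avoid visually important regions, and LSB payloads typically do not survive lossy compression. The 4-byte length header and all names are our assumptions.

```python
from typing import List


def _bits(data: bytes) -> List[int]:
    """Split bytes into bits, most-significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def embed(pixels: List[int], payload: bytes) -> List[int]:
    """Hide a 4-byte length header plus `payload` in the pixels' lowest bits."""
    bits = _bits(len(payload).to_bytes(4, "big") + payload)
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # changes each value by at most 1
    return stego


def extract(pixels: List[int]) -> bytes:
    """Recover the payload written by `embed`."""
    def read(start_byte: int, n_bytes: int) -> bytes:
        out = bytearray()
        for b in range(start_byte, start_byte + n_bytes):
            value = 0
            for p in pixels[8 * b : 8 * b + 8]:
                value = (value << 1) | (p & 1)
            out.append(value)
        return bytes(out)

    length = int.from_bytes(read(0, 4), "big")
    return read(4, length)


cover = [200] * (64 * 64)  # hypothetical flattened 64x64 grayscale chart image
spec = b'{"mark": "bar", "encoding": {"x": {"field": "a"}}}'
stego = embed(cover, spec)
print(extract(stego) == spec)  # True
print(max(abs(a - b) for a, b in zip(cover, stego)))  # 1
```

Decoding here is an exact bit read rather than a vision pipeline, which mirrors, in a much weaker form, the efficiency argument made for embedded specifications over reverse engineering.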
Loom [raji2020dataless] takes a different approach and seeks to organize graphics by programs. Loom proposes to share interactive visualizations by bridging the gap between two extremes, i.e., sharing non-interactive formats such as images, and sharing the data, source code, and software. Specifically, it formulates interactive visualizations as a standalone object built on an action tree. Each intermediate node of the tree represents an interaction such as hovering or clicking, and each leaf node stores the resulting visualization image. Therefore, users can interact with the graphics as if they had access to the original source code and software. In this way, the hybrid approach provides a reproducible and sustainable format that promotes the shareability of visualizations.
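An action tree of this kind can be sketched as a trie keyed by interactions. In the hypothetical `ActionNode` class below, each child node is reached by one interaction, and a node at the end of a recorded interaction sequence stores the pre-rendered image. Replaying a sequence simply walks the tree, so no data or rendering code is needed. The interaction labels and file names are illustrative, and Loom’s actual file format is not reproduced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ActionNode:
    """Hypothetical action-tree node: children are keyed by interaction."""
    image: Optional[str] = None  # pre-rendered image for this interaction sequence
    children: Dict[str, "ActionNode"] = field(default_factory=dict)

    def record(self, actions: List[str], image: str) -> None:
        """Store the image produced by performing `actions` in order."""
        node = self
        for action in actions:
            node = node.children.setdefault(action, ActionNode())
        node.image = image

    def replay(self, actions: List[str]) -> str:
        """Return the stored image for `actions`, without any source data."""
        node = self
        for action in actions:
            if action not in node.children:
                raise KeyError(f"interaction not recorded: {action}")
            node = node.children[action]
        if node.image is None:
            raise KeyError("no image stored for this interaction sequence")
        return node.image


tree = ActionNode()
tree.record(["hover:bar_A"], "hover_bar_A.png")
tree.record(["click:legend_2019", "hover:bar_B"], "legend_2019_hover_B.png")
print(tree.replay(["click:legend_2019", "hover:bar_B"]))  # legend_2019_hover_B.png
```

The design trade-off is visible in the `KeyError` branch: the format is self-contained and reproducible, but only interactions that were recorded in advance can be replayed.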
Table I: Feature representations of visualizations (feature spaces: graphics, program, text, and underlying data).
Feature engineering:
general image descriptors [savva2011revision, lee2017viziometrics, choudhury2016scalable, ray2015architecture]
element positions or regions [choi2019visualizing, savva2011revision, poco2017reverse, battle2018beagle, huang2007system]
element styles [battle2018beagle]
parameters [wu2020mobilevisfixer, qian2020retrieve]
communicative signals [burns2012automatically]
design rules [moritz2018formalizing, lin2020dziban]
statistical models [oppermann2020vizcommender, choudhury2016scalable, kim2018multimodal, luo2020steerable]
statistics [hu2019vizml, hu2019viznet, luo2018deepeye, key2012vizdeck, luo2018deepeyekeyword]
one-hot vector [xu2018chart]
Feature learning:
convolutional neural network [bylinskii2017learning, lin2018vizbywiki, ma2020ladv, choi2019visualizing, poco2017reverse, siegel2016figureseer, fu2019visualization, dai2018chart, kim2018multimodal, chagas2018evaluation, tsutsui2017data, tang2016deepchart, haehn2018evaluating, chaudhry2020leaf, chen2019neural, chen2020figure, kafle2018dvqa, kahou2017figureqa, methani2020plotqa, reddy2019figurenet, singh2020stl]
autoencoder [zhang2020viscode, fu2019visualization]
autoencoder [zhao2020chartseer, obeid2020chart]
embedding models [xu2018chart, oppermann2020vizcommender]
autoencoder [dibia2019data2vis]
Now that we have considered the different content formats of raw visualization data, the next challenge is its representation in AI approaches. Firstly, raw visualization data in the format of images or programs might not explicitly represent the semantic information needed (e.g., chart type). Thus, it is usually helpful to store and operate visualization data in the internal representation formats to facilitate the process. Secondly, raw data need to be converted into feature representation to enable machine-learning techniques. Thus, we discuss feature engineering and feature learning approaches used to extract visualization features.
5.2.1 Internal Representation
Visualization programs tend to contain extraneous details (e.g., visual style) or miss semantics (e.g., chart type) that might not meet the particular needs of research. Thus, systems usually express and operate on visualizations in simpler or more structured formats by removing unwanted or unnecessary specifications and adding customized information, which we refer to as internal representations. However, we find that most surveyed papers do not explicitly discuss the data structures of their internal representations. Nevertheless, we note three common formal internal representation formats designed towards the high-level goal of facilitating computation (Figure 5).
Draco [moritz2018formalizing] (Figure 5) uses Answer Set Programming to express visualization specifications as logical facts. This falls under logic programming [lifschitz1999action], which expresses problem domains as facts and rules and supports logical inference. For instance, the logical facts in Draco can be used to check whether a specification satisfies compound rules that encode visualization design knowledge.
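The sketch below emulates the facts-and-rules idea in plain Python. Draco itself states facts and constraints in Answer Set Programming and evaluates them with an ASP solver; here the facts are tuples whose predicate names loosely follow Draco’s style but are not guaranteed to match its actual vocabulary, and the single design rule (shape should not encode a quantitative field) is an illustrative assumption rather than a quoted Draco constraint.

```python
# A scatterplot specification as a set of logical facts (illustrative predicate names).
FACTS = {
    ("mark", "point"),
    ("encoding", "e0"), ("channel", "e0", "x"),
    ("field", "e0", "horsepower"), ("type", "e0", "quantitative"),
    ("encoding", "e1"), ("channel", "e1", "shape"),
    ("field", "e1", "weight"), ("type", "e1", "quantitative"),
}


def shape_on_quantitative(facts: set) -> list:
    """Return encodings violating the illustrative rule 'shape must not encode
    a quantitative field'. As an ASP integrity constraint this could be written
    roughly as:  :- channel(E, shape), type(E, quantitative).
    """
    encodings = sorted(f[1] for f in facts if f[0] == "encoding")
    return [
        e for e in encodings
        if ("channel", e, "shape") in facts and ("type", e, "quantitative") in facts
    ]


print(shape_on_quantitative(FACTS))  # ['e1']
```

Expressing each specification as facts lets a design rule be checked by membership tests alone, which is what makes the logic-programming representation convenient for validating specifications against design knowledge.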
Literature from data mining and databases [luo2018deepeye, luo2018deepeyekeyword, ehsan2017efficient, ehsan2016muve, mafrur2018dive, luo2020interactive, luo2020steerable, siddiqui2016effortless, vartak2015seedb, wu2017combining] usually uses relational programming to express visualizations as database queries (Figure 5). These relational queries facilitate operations on collections of visualizations such as composing, filtering, comparing, and sorting. This programming paradigm is also adopted by the CompassQL language in Voyager [wongsuphasawat2016towards, wongsuphasawat2015voyager, wongsuphasawat2017voyager].
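Treating a visualization as a query makes it cheap to enumerate and transform candidates. The sketch below, using Python’s built-in `sqlite3` module and a hypothetical `sales` table, generates one `GROUP BY` aggregate query per (dimension, measure, aggregate) combination; each query defines the data behind one bar chart, and filtering is expressed by adding a `WHERE` clause. The helper name and schema are our assumptions, not the query language of any cited system.

```python
import sqlite3
from itertools import product
from typing import Iterator, Optional, Sequence


def candidate_queries(table: str, dims: Sequence[str], measures: Sequence[str],
                      aggs: Sequence[str] = ("SUM", "AVG"),
                      where: Optional[str] = None) -> Iterator[str]:
    """Yield one GROUP BY query per (dimension, measure, aggregate) combination.

    Identifiers are interpolated directly, so they must come from a trusted
    schema, never from user input.
    """
    clause = f" WHERE {where}" if where else ""
    for d, m, a in product(dims, measures, aggs):
        yield f"SELECT {d}, {a}({m}) FROM {table}{clause} GROUP BY {d} ORDER BY {d}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("east", "a", 10), ("east", "b", 20), ("west", "a", 5)])

queries = list(candidate_queries("sales", ["region", "product"], ["revenue"]))
print(len(queries))  # 4
print(conn.execute(queries[0]).fetchall())  # [('east', 30.0), ('west', 5.0)]
```

Because every candidate is a query, operations such as filtering or comparing visualizations become query rewriting and result comparison, which the database engine handles directly.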
Finally, Wang et al [wang2019visualization] proposed a set-theoretic programming representation that describes visualizations as sets of visual elements (Figure 5). This representation facilitates set computation, e.g., determining whether one visualization is a superset of another.
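Python’s built-in sets are enough to illustrate the set-theoretic view. In the sketch below, each visual element is simplified to a hypothetical (mark type, x value, y value) tuple; Wang et al’s formalism is richer, so this only shows how subset and difference operations answer questions such as whether one chart contains everything another chart shows.

```python
from typing import FrozenSet, Iterable, Tuple

# Simplified visual element: (mark type, x value, y value).
Element = Tuple[str, str, float]


def as_elements(mark: str, rows: Iterable[Tuple[str, float]]) -> FrozenSet[Element]:
    """Represent a chart as the set of its visual elements."""
    return frozenset((mark, x, y) for x, y in rows)


base = as_elements("bar", [("A", 28), ("B", 55)])
extended = as_elements("bar", [("A", 28), ("B", 55), ("C", 43)])

print(extended >= base)         # True: every element of `base` appears in `extended`
print(base >= extended)         # False
print(sorted(extended - base))  # [('bar', 'C', 43)]
```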
5.2.2 Feature Engineering and Feature Learning
In this section, we discuss the features of visualizations, i.e., the measurable properties that serve as input to machine learning models. Features are extracted by feature engineering or feature learning [bengio2013representation]. Feature engineering is the process of using domain knowledge to extract features from raw data, while feature learning replaces this manual process with approaches that automatically discover useful representations. For our discussion, we classify existing approaches according to the feature space, including graphics, programs, text, and underlying data (Table I). Note that some papers use multiple features for different tasks [choi2019visualizing, poco2017reverse] or fuse features to improve performance [savva2011revision, choudhury2016scalable, kim2018multimodal]. In the following, we describe each category in detail.
Graphical features are the most common features of visualizations. They mainly serve predictive tasks (e.g., predicting the chart type or the “goodness” of a chart) and detection tasks. For these purposes, early work uses general image descriptors such as bag-of-keypoints [choudhury2016scalable] or patch descriptors [savva2011revision]. These image descriptors are designed for general visual content and capture only low-level information such as shapes and regions. To raise the level of abstraction, researchers have proposed domain-specific descriptors that capture visualization-specific information and in many cases outperform the general descriptors (e.g., [poco2017reverse]). Examples include the regions of text elements (e.g., titles and labels) [savva2011revision], the positions of text and mark elements [choi2019visualizing, poco2017reverse, battle2018beagle], the relative positions between text and marks [huang2007system], and the visual styles of elements [battle2018beagle]. Despite promising results, such feature engineering remains labour-intensive and requires expertise. Moreover, it remains unclear whether hand-crafted features are informative and discriminative enough for machine learning tasks. Therefore, recent work has predominantly adopted automated feature learning with deep learning models. In particular, convolutional neural networks (CNNs) have been widely used to automatically learn spatial hierarchies of image features and have been shown to outperform early approaches [lin2018vizbywiki, ma2020ladv, choi2019visualizing, poco2017reverse, siegel2016figureseer, dai2018chart, kim2018multimodal, chagas2018evaluation, tsutsui2017data, tang2016deepchart, haehn2018evaluating, bylinskii2017learning, chaudhry2020leaf, chen2019neural, chen2020figure, kafle2018dvqa, kahou2017figureqa, methani2020plotqa, reddy2019figurenet, singh2020stl]. VisCode [zhang2020viscode] recently used an autoencoder to learn an effective representation that can conceal additional information. Fu et al [fu2019visualization] used the latent vectors of an autoencoder to predict an assessment score for visualization images. However, these deep learning models currently face the same challenge as early general image descriptors: they might not capture visualization-specific information. This limits their capability in high-level tasks such as automatic assessment [fu2019visualization] and visual question answering [kafle2018dvqa], where performance remains unsatisfactory.
Program features are extracted from visualization programs such as specifications. Perhaps the most straightforward representation is the set of parameters. For instance, MobileVisFixer [wu2020mobilevisfixer] and Retrieve-then-Adapt [qian2020retrieve] train models that learn to operate on chart parameters, e.g., to position elements. Burns et al [burns2012automatically] extracted communicative signals (such as whether a group of bars is colored differently from the other bars) and fed them into a Bayesian model. Draco [moritz2018formalizing] contains constraints over facts that encode visualization design knowledge; these constraints describe whether a visualization conforms to best practices of effective visual design. However, little work uses programs as training input, possibly for several reasons, including the lack of training data and the overhead of reverse engineering programs from visualization graphics. Still, programs are a promising representation because they contain high-level, visualization-specific information. This is exemplified by the recent work ChartSeer [zhao2020chartseer], where the Vega-Lite specifications of charts are converted into visualization embeddings by autoencoders. The resulting embeddings are used to measure similarities between charts to assist in analyzing chart ensembles, and proved effective in controlled user studies. These results suggest that program features are promising for semantically characterizing charts.
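As a lightweight alternative to learned embeddings, program features can be obtained by flattening a specification into path-value tokens and one-hot encoding them. The sketch below assumes Vega-Lite-like dictionaries. The two specifications are hypothetical, and ChartSeer itself learns embeddings with an autoencoder rather than using this encoding.

```python
def flatten(spec, prefix=""):
    """Turn a nested specification into sorted 'path=value' tokens."""
    tokens = []
    for key, value in sorted(spec.items()):
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            tokens.extend(flatten(value, path))
        else:
            tokens.append(f"{path}={value}")
    return tokens


def one_hot(tokens, vocabulary):
    """Binary feature vector over a fixed token vocabulary."""
    present = set(tokens)
    return [1 if token in present else 0 for token in vocabulary]


bar = {"mark": "bar",
       "encoding": {"x": {"field": "region", "type": "nominal"},
                    "y": {"field": "revenue", "type": "quantitative", "aggregate": "sum"}}}
line = {"mark": "line",
        "encoding": {"x": {"field": "month", "type": "temporal"},
                     "y": {"field": "revenue", "type": "quantitative", "aggregate": "sum"}}}

vocabulary = sorted(set(flatten(bar)) | set(flatten(line)))
print(flatten(bar))
print(one_hot(flatten(bar), vocabulary))
```

Such vectors expose shared structure (here, the identical y encoding) but, unlike learned embeddings, treat every token as equally important.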
Text features refer to the text content in visualizations, such as titles. They are used to make features more informative by incorporating semantic information. For instance, two systems describe text with a statistical model, i.e., bag-of-words [choudhury2016scalable, kim2018multimodal], and combine the resulting text features with graphical features to improve performance on chart detection and classification. Moreover, text features can capture the subject matter of visualizations. VizCommender [oppermann2020vizcommender] is a recent content-based recommendation system built on machine learning models that predict the semantic similarity between two visualization repositories; its predictions show high agreement with a human majority vote. Chart Constellations [xu2018chart] uses word embeddings to measure the similarity between charts. However, it remains unclear how text features can be effectively fused with graphical features to represent a visualization more comprehensively, e.g., to balance text-based and style-based similarities.
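A bag-of-words comparison of chart titles can be sketched with the Python standard library alone. Tokenization here is a simple lower-case alphanumeric regex, and the titles are hypothetical. The cited systems use their own preprocessing, and word embeddings (as in Chart Constellations) would replace the raw counts.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; no stemming or stop-word removal."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


t1 = bag_of_words("Monthly revenue by region")
t2 = bag_of_words("Revenue by product category")
print(cosine(t1, t2))  # 0.5
```

The shared words "revenue" and "by" drive the score, including the uninformative "by", which illustrates why raw counts capture subject matter only coarsely.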
The last category of visualization features concerns the underlying data that the visualization encodes. Chart Constellations [xu2018chart] describes the encoded data columns with a one-hot vector encoding, which is then used to compute chart-wise similarities. Beyond such descriptive purposes (i.e., describing a visualization), data features are mainly used for prediction (i.e., predicting the visual encoding). Data2Vis [dibia2019data2vis] adopts a sequence-to-sequence autoencoder structure that models the input dataset in JSON format. However, the sequence-to-sequence structure might not capture characteristics of the underlying data such as the data type. DeepEye [luo2018deepeye, luo2018deepeyekeyword, luo2020steerable] and VizDeck [key2012vizdeck] perform feature engineering to account for data statistics such as the number of unique values in a column. VizML [hu2019vizml] further extends this approach to 841 features, including single- and pairwise-column features of the input dataset, and significantly outperforms Data2Vis and DeepEye. The analysis in VizML suggests that these features are not independent and that some appear to be of little importance. Future research should propose effective feature selection approaches or complementary feature learning methods.
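Data features of the kind used by DeepEye, VizDeck, and VizML can be illustrated with a few single-column statistics and one pairwise statistic. The functions below compute an illustrative handful of features, far fewer than VizML's 841. The feature names and the choice of Pearson correlation as the pairwise feature are assumptions for this sketch.

```python
import math


def column_features(values):
    """A few single-column statistics (illustrative subset, not VizML's feature set)."""
    n = len(values)
    numeric = [v for v in values if isinstance(v, (int, float)) and not isinstance(v, bool)]
    n_unique = len(set(values))
    features = {
        "n_values": n,
        "n_unique": n_unique,
        "unique_ratio": n_unique / n if n else 0.0,
        "fraction_numeric": len(numeric) / n if n else 0.0,
    }
    if numeric:
        features["min"] = min(numeric)
        features["max"] = max(numeric)
        features["mean"] = sum(numeric) / len(numeric)
    return features


def pearson(xs, ys):
    """A pairwise-column feature: Pearson correlation of two numeric columns."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


print(column_features(["north", "south", "north"])["n_unique"])  # 2
print(round(pearson([1, 2, 3], [2, 4, 6]), 6))                   # 1.0
```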
In summary, we find the following open research questions regarding the visualization features:
Learning visualization-specific features. Existing off-the-shelf computer vision and machine learning models were originally designed for general visual content or relational data. As such, they might fall short when applied to visualizations, e.g., in visualization assessment [fu2019visualization] and Data2Vis [dibia2019data2vis], because they do not capture informative visualization-specific features. To address this problem, researchers have proposed feature engineering approaches that manually craft features, which, however, is labour-intensive and offers no guarantee of success. With the rapid development of deep learning techniques, we envision deep learning models tailored to visualizations that effectively learn visualization-specific features.
Fusing multi-modal features. A visualization is unlikely to be comprehensively represented by any single one of the aforementioned feature types. In several cases, researchers have demonstrated that feature fusion can improve the performance of chart detection and classification [choudhury2016scalable, kim2018multimodal]. Promising avenues for future work lie in leveraging feature fusion for high-level tasks such as question answering and similarity-based recommendation, which single-modal features have not solved well.
6 Tasks: How to Apply AI to Visualization Data
In this section, we focus on common tasks that researchers apply to visualization data. We organize the observed tasks into seven primary tasks as follows:
Transformation processes visualization graphics to output corresponding programs or another graphic.
Assessment measures the absolute or relative quality of a visualization in terms of scores or rankings.
Comparison estimates the similarity or other metrics between two visualizations.
Querying refers to the problem of finding target visualizations relevant to a user query within visualization collections.
Reasoning challenges machines to interpret visualizations to derive high-level information such as insights and summaries.
Recommendation automates the creation of visualizations by suggesting data and/or visual encodings.
Mining aims to discover insights from visualization databases.
Most tasks originate from well-known terminology. For instance, transformation, assessment, and visual reasoning are well-studied tasks in the field of computer vision [computerVision], while querying and mining come from database and information systems research. Two exceptions are comparison and recommendation. Although comparison is similar to the image similarity search task in computer vision [computerVision], we find a large body of visualization research studying other metrics (e.g., difference) between two visualizations and therefore chose the term comparison. Recommendation is a widely studied task in the visualization literature (e.g., [wongsuphasawat2016towards, zhu2020survey]).
In the following text, we describe the problem statement and challenges, summarize the existing techniques, and outline open research questions for each task.
Transformation is the operation that converts visualizations between content formats. In particular, it is straightforward to transform visualization programs into graphics with visualization tools or libraries (e.g., [bostock2011d3, satyanarayan2016vega]). A more challenging problem is the reverse process, i.e., reconstructing programs from graphics. This process is also known as reverse engineering [poco2017reverse]. In the following text, we focus on the reverse engineering problem.
Relations to goals and other tasks. Transformation is usually the first task for visualization enhancement and analysis, especially when the input consists of images. As such, it is often a prerequisite for the other tasks. For instance, the extracted information can be used for querying [hoque2019searching] and reasoning [kim2020answering].
Relations to visualization data. Several works study the problem of transforming a visualization image into another image by altering the data or visual styles [qian2020retrieve, hoque2019searching, harper2017converting]. Nevertheless, their approaches are built on reverse engineering, i.e., they first extract the encodings and then replace the data. Little research has explored direct transformation in the image space [fu2020chartem, zhang2020viscode, brosz2013transmogrification, zhang2020dataquilt]. As such, we do not dedicate a separate discussion to image-to-image transformation at this stage. Future research could extend our taxonomy as this field develops.
6.1.1 Challenges and Methods
Visualization reverse engineering has been studied extensively over the past decades. Early research dates back to 2001, when Zhou and Tan [zhou2001learning] proposed a learning-based paradigm for chart recognition. Since then, much research has been devoted to extracting semantic information from visualization images, such as chart types, visual encodings, and underlying data. Ideally, reverse engineering should be cycle-consistent, that is, its output should be able to regenerate the original visualization. Despite promising preliminary results, several challenges remain, since much work makes simplifying assumptions about the expected input or output. For instance, several approaches take vector graphics as input [hoque2019searching, harper2017converting, harper2014deconstructing, wang2018narvis, lu2017interaction+, huang2007extraction, wu2020mobilevisfixer, battle2018beagle], assuming that each visual element and its type are available in the SVG. Most work is limited to a predefined set of chart types, while little work [lin2018vizbywiki, hoque2019searching, harper2017converting, harper2014deconstructing, wang2018narvis, wu2020mobilevisfixer, ma2020ladv, poco2017extracting] applies to more bespoke visualizations. Besides, a large body of research focuses on extracting a particular portion of information, such as chart classification [lin2018vizbywiki, ma2020ladv, lee2017viziometrics, kim2018multimodal, chagas2018evaluation, tsutsui2017data, tang2016deepchart], object separation [browuer2008segregating, ray2016curve], and object clustering [lu2017interaction+, wang2018narvis, wu2020mobilevisfixer].
We summarize a conceptual framework of the reverse engineering process to provide an overview of existing technical development and identify research gaps. We developed the framework via a bottom-up approach, where we abstract existing methods, identify their simplifying assumptions, and iteratively merge the results. As shown in Figure 6, we find that reverse engineering can be classified into two distinct phases. The first phase decomposes visualization graphics into semantic elements (e.g., axis, mark) through machine learning and computer vision techniques including object detection, classification, and clustering. The second phase performs mathematical computation over the decomposed semantic elements to extract visual encoding and/or the underlying data. In the following text, we describe each phase in detail.
|Output|Approaches|
|Ranking|effectiveness rankings [mackinlay1986automating]; learning-to-rank [lin2018vizbywiki, luo2018deepeye, luo2018deepeyekeyword, luo2020steerable]|
|Scoring|convert rankings to scores [wongsuphasawat2015voyager, wongsuphasawat2017voyager, moritz2018formalizing]; hand-crafted metrics [cui2019text, ehsan2016muve, ehsan2017efficient, bryan2016temporal, savvides2019significance, lee2019avoiding, zhang2020viscode, kim2020gemini, wu2020mobilevisfixer]; learning-to-rank with scores [qian2020retrieve]; predictive regression [key2012vizdeck, hu2019viznet, fu2019visualization]; learning-to-weight hand-crafted metrics [moritz2018formalizing]|
Decomposing. The decomposing phase varies depending on the input (Figure 6). The primary step for raster graphics is to detect and classify visual elements such as text and shapes. Early work approaches this with traditional image processing techniques (e.g., edge detection, morphological operations) [savva2011revision, choi2019visualizing, gao2012view, huang2007system], while work published in 2015 and later uses machine learning or deep learning approaches (e.g., Mask R-CNN) [lai2020automatic, choi2019visualizing, chen2019towards, poco2017extracting, poco2017reverse, choudhury2016scalable, chen2015diagramflyer, siegel2016figureseer, dai2018chart, al2017machine]. This element recognition step faces chart-specific challenges, e.g., coping with visual clutter in line charts and scatterplots [browuer2008segregating, ray2016curve]. In addition, since existing object detection models are sensitive to rotation, which is common for pie sectors, Choi et al [choi2019visualizing] proposed a special heuristic for pie charts that groups nearby pixels with the same color. The output of this element recognition step is usually the position and class of each visual element, which are already available in the SVG specifications of vector graphics. In other words, vector graphics remove the overhead of element recognition.
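The pie-chart heuristic described above amounts to grouping connected pixels of identical color. A minimal sketch, assuming a tiny raster given as a grid of color names and exact color equality (real images need a color tolerance to handle anti-aliasing), labels 4-connected same-color regions with a breadth-first flood fill. This is a generic technique and not necessarily Choi et al's exact procedure.

```python
from collections import deque


def color_regions(image):
    """Label 4-connected regions of identical color; return (color, pixels) per region."""
    h, w = len(image), len(image[0])
    label = [[-1] * w for _ in range(h)]
    regions = []
    for sy in range(h):
        for sx in range(w):
            if label[sy][sx] != -1:
                continue  # pixel already belongs to a region
            color, rid = image[sy][sx], len(regions)
            pixels, queue = [], deque([(sy, sx)])
            label[sy][sx] = rid
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w and label[ny][nx] == -1
                            and image[ny][nx] == color):
                        label[ny][nx] = rid
                        queue.append((ny, nx))
            regions.append((color, pixels))
    return regions


W, R, G = "white", "red", "green"
image = [
    [W, R, R],
    [W, R, G],
    [W, G, G],
]
for color, pixels in color_regions(image):
    print(color, len(pixels))
```

Each sector becomes one region regardless of its orientation, which is why such grouping sidesteps the rotation sensitivity of box-based detectors.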
Chart detection and classification form another step of the decomposing phase. This step involves two important choices: the classifier and the feature representation. In visualization classification tasks, classical classifiers (e.g., support vector machines, random forests) [savva2011revision, battle2018beagle, choudhury2016scalable, prasad2007classifying] have been gradually superseded by deep learning classifiers (e.g., convolutional neural networks (CNNs)) [lin2018vizbywiki, ma2020ladv, choi2019visualizing, lee2017viziometrics, siegel2016figureseer, dai2018chart, kim2018multimodal, chagas2018evaluation, tsutsui2017data, tang2016deepchart]. This is mainly because CNNs can effectively learn abstract features from raw visualization images, while classical classifiers require hand-crafted image features such as histograms of oriented gradients (HOG) [prasad2007classifying] and dense sampling [savva2011revision, choudhury2016scalable]. Several approaches seek to make features more representative by incorporating element-level features such as text [savva2011revision, kim2018multimodal] and shape style features [battle2018beagle]. The last step of the decomposing phase is element clustering, that is, clustering visual elements into semantic groups including guides (axes and legends), marks, and other information such as annotations. This step is typically discussed separately for text and shape elements. On the one hand, clustering text is usually formalized as a classification problem, that is, classifying and grouping text according to its role, such as x-axis-label and legend-title [al2016automatic, lai2020automatic, choi2019visualizing, savva2011revision, poco2017reverse, choudhury2016scalable, chen2015diagramflyer, dai2018chart, al2017machine, huang2007system]. On the other hand, this classification-based approach is not always readily applicable to shape clustering, since the roles of shapes depend on the chart type.
As such, researchers usually simplify this problem by focusing on common charts where shapes are well defined. For instance, Poco and Heer [poco2017reverse] trained a classifier to detect area, bar, line, and plotting shapes, and then grouped shapes of the same type. To support more bespoke visualizations, several approaches [wang2018narvis, wu2020mobilevisfixer, hoque2019searching, harper2017converting, harper2014deconstructing] use node hierarchy information to group shape nodes under the same ancestor. Nevertheless, these approaches are only applicable to vector graphics. Finally, the shape clusters are associated with the text clusters to identify axes, legends, and label-mark relationships.
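For vector graphics, the node-hierarchy idea can be sketched with Python's standard `xml.etree.ElementTree`. This simplified version groups shape elements by their immediate parent `<g>` and keys each group by its `class` attribute. The SVG snippet, the set of shape tags, and the use of `class` as a group name are illustrative assumptions; the cited systems use richer hierarchy and style cues.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SHAPE_TAGS = {"rect", "circle", "ellipse", "line", "path", "polygon", "polyline"}

SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <g class="marks">
    <rect x="0" y="10" width="8" height="40"/>
    <rect x="10" y="30" width="8" height="20"/>
  </g>
  <g class="x-axis">
    <line x1="0" y1="50" x2="20" y2="50"/>
  </g>
</svg>"""


def group_by_parent(svg_text):
    """Group shape elements by their immediate parent, keyed by the parent's class."""
    root = ET.fromstring(svg_text)
    groups = defaultdict(list)
    for parent in root.iter():
        for child in parent:
            tag = child.tag.split("}")[-1]  # drop the "{namespace}" prefix
            if tag in SHAPE_TAGS:
                groups[parent.get("class", parent.tag)].append(tag)
    return dict(groups)


print(group_by_parent(SVG))  # {'marks': ['rect', 'rect'], 'x-axis': ['line']}
```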
Composing. After the visualization graphic has been decomposed into semantic groups (e.g., guides and marks), the composing phase (Figure 6) aims to extract the visual encoding and/or the underlying data from this semantic information. Unlike the decomposing phase, which relies on computer vision and machine learning techniques, the composing phase mainly uses heuristics that leverage domain knowledge about visualizations. We find two common methodological themes among those heuristics, depending on whether the chart type has been extracted in the previous phase.
The first class of heuristics, which is dominant in our corpus [al2016automatic, lai2020automatic, choi2019visualizing, chen2019towards, savva2011revision, choudhury2016scalable, siegel2016figureseer, dai2018chart, al2017machine, gao2012view, huang2007system], uses information about the chart type and guides to determine the scale and the encoding. For instance, given a scatterplot with axes and legends, it is straightforward to derive the visual encodings (i.e., the x/y positions map to numerical values and the color encodes categorical data) and to calculate the scales. The underlying data can then be computed by applying the inverse scale to the marks. However, these heuristics are often limited to a small set of chart types.
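Once axis tick labels have been recognized, recovering data from marks reduces to fitting and inverting the scale. The sketch below assumes a linear scale and tick labels already parsed into (data value, pixel position) pairs. The tick and mark positions are made-up numbers. The code fits pixel = a * value + b by least squares and then inverts it.

```python
def fit_linear_scale(ticks):
    """Least-squares fit of pixel = a * value + b from (value, pixel) tick pairs."""
    n = len(ticks)
    mean_v = sum(v for v, _ in ticks) / n
    mean_p = sum(p for _, p in ticks) / n
    cov = sum((v - mean_v) * (p - mean_p) for v, p in ticks)
    var = sum((v - mean_v) ** 2 for v, _ in ticks)
    a = cov / var
    return a, mean_p - a * mean_v


def invert(pixel, a, b):
    """Inverse scale: map a pixel position back to a data value."""
    return (pixel - b) / a


# y axis: value 0 at pixel 300, value 100 at pixel 100 (SVG y grows downward).
ticks = [(0, 300), (50, 200), (100, 100)]
a, b = fit_linear_scale(ticks)
bar_tops = [250, 140]  # pixel y of two detected bar tops
print([invert(p, a, b) for p in bar_tops])  # [25.0, 80.0]
```

Fitting over all ticks rather than two endpoints gives some robustness to small localization errors; log or time scales would need a different model.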
Another class [hoque2019searching, harper2017converting, harper2014deconstructing] studies the more challenging problem of decoding bespoke visualizations where the chart type is unknown. However, these approaches focus on D3 charts, where the underlying data can be obtained by crawling the SVG nodes on the web. Building on this, they develop heuristics to determine the scale from guides and data, and to derive the encoding from data and marks.
6.1.2 Discussion and Open Questions
In conclusion, it remains an open challenge to derive both the visual encoding and the underlying data from bespoke visualization graphics whose chart types are not limited to common ones. An interesting future research direction would be to improve the current heuristics for determining visual encodings from bespoke visualizations, e.g., with machine learning approaches.
Another primary challenge of reverse engineering lies in robustness and accuracy. As discussed above, the reverse engineering pipeline usually consists of multiple sequentially dependent tasks and is therefore prone to single points of failure: the failure of one task propagates to the whole system. For instance, researchers have reported common failure cases in text detection (e.g., [poco2017reverse, lai2020automatic]), which impede the extraction of guides and consequently of the visual encoding.
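A back-of-the-envelope calculation illustrates why sequential dependence hurts. Assume, purely for illustration, that a pipeline has k stages that succeed independently with probabilities p_1 to p_k (real stage errors are rarely independent). The end-to-end success rate is then the product of the stage rates, so even fairly accurate stages compound into a noticeably weaker pipeline:

```latex
P_{\text{pipeline}} = \prod_{i=1}^{k} p_i,
\qquad \text{e.g., } k = 4,\ p_i = 0.95 \;\Rightarrow\; P_{\text{pipeline}} = 0.95^{4} \approx 0.81.
```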
This motivates semi-automatic approaches that compensate for imperfect algorithms with human intervention [kong2012graphical, mendez2016ivolver, jung2017chartsense]. For example, ChartSense [jung2017chartsense] allows users to adjust incorrectly recognized data shapes. Nevertheless, its framework for combining automatic algorithms with human intervention is specific to chart types and therefore not readily applicable to more general situations. A related research direction is to investigate a general framework for bespoke visualizations that reverses the process of constructing bespoke charts [ren2018charticulator].
There is a long history of research on teaching machines to assess and rank the quality of data visualizations. As shown in Table II, assessment outputs a numerical score of the visualization quality or measures the relative quality in terms of ranking.
Relations to goals and other tasks. The key motivation of assessment is to improve visualization design, e.g., by deriving scoring metrics that can be used as cost functions for automatic generation. For this reason, assessment is often combined with recommendation.
Relations to visualization data. Most surveyed techniques take visualization programs as input, focusing on the visual encoding and data quality. Nevertheless, Fu et al [fu2019visualization] propose an approach for assessing visualization images.
6.2.1 Challenges and Methods
Assessment is challenging due to the human-centred nature of visualizations, which requires large-scale empirical experiments to understand what makes a visualization “good”. However, the knowledge derived from such experiments is often represented as design guidelines rather than quantifiable rules. As such, much research aims to quantify knowledge about “good” visualizations. In 1986, Mackinlay [mackinlay1986automating] developed the APT system, which ranked the effectiveness of visual encodings according to the accuracy rankings of quantitative perceptual tasks for different visual encoding channels.
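The ranking idea behind APT can be illustrated by ordering candidate designs according to the perceptual rank of the channel that encodes a quantitative field. The list below reproduces only the top of Mackinlay's ranking for quantitative data. The candidate designs and their single-channel mapping are hypothetical simplifications, and APT itself searches a much richer design space.

```python
# Top of Mackinlay's ranking of channels for quantitative data (most to least accurate).
QUANTITATIVE_RANKING = ["position", "length", "angle", "slope", "area", "volume"]


def rank_designs(designs):
    """Order designs by the rank of the channel encoding the quantitative field."""
    return sorted(designs, key=lambda d: QUANTITATIVE_RANKING.index(d["channel"]))


designs = [
    {"name": "pie chart", "channel": "angle"},
    {"name": "dot plot", "channel": "position"},
    {"name": "bubble chart", "channel": "area"},
]
print([d["name"] for d in rank_designs(designs)])  # ['dot plot', 'pie chart', 'bubble chart']
```

The output is only an order, which is exactly the limitation the next paragraph raises: a ranking says which design is better but not by how much.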
However, this ranking-based approach only reflects the relative quality of visualizations. Scoring-based approaches are often more desirable since scores measure absolute quality and therefore benefit downstream tasks, e.g., scores can be used as the cost function for optimization. To that end, Voyager [wongsuphasawat2015voyager, wongsuphasawat2017voyager] and Draco [moritz2018formalizing] map different single-criterion rankings to numerical scores. In addition, researchers often leverage domain knowledge to design hand-crafted, rule-based metrics that measure visualization quality, such as informativeness [cui2019text], interestingness [ehsan2016muve, ehsan2017efficient, bryan2016temporal], accuracy [ehsan2016muve, ehsan2017efficient], significance [savvides2019significance], saliency [lee2019avoiding], visual importance [zhang2020viscode], complexity [kim2020gemini], and mobile-friendliness [wu2020mobilevisfixer]. However, designing hand-crafted metrics usually requires considerable effort. More critically, the design process is often unsystematic and lacks a strong methodological basis. For example, Wu et al [wu2020mobilevisfixer] demonstrated that even seemingly reasonable metrics do not always survive experimental scrutiny. Moreover, questions arise about how to weight scores to reflect the overall, multi-criteria quality. Several systems [cui2019text, wongsuphasawat2015voyager, wongsuphasawat2017voyager, kim2020gemini, ehsan2016muve, ehsan2017efficient] determine the weight of each score through manual refinement, which can become unsystematic.
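The weighting problem is easy to see in a small example. The sketch combines per-criterion scores with hand-tuned weights as a weighted average. The criterion names echo metrics cited above, but the scores and both weightings are invented for illustration.

```python
def overall_score(scores, weights):
    """Weighted average of per-criterion scores (each in [0, 1], higher is better)."""
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total


chart_a = {"informativeness": 0.9, "significance": 0.5, "saliency": 0.7}
chart_b = {"informativeness": 0.6, "significance": 0.9, "saliency": 0.9}

# Two hypothetical hand-tuned weightings that differ only in informativeness.
equal_weights = {"informativeness": 1.0, "significance": 1.0, "saliency": 1.0}
info_heavy = {"informativeness": 3.0, "significance": 1.0, "saliency": 1.0}

for weights in (equal_weights, info_heavy):
    a, b = overall_score(chart_a, weights), overall_score(chart_b, weights)
    print(f"a={a:.2f} b={b:.2f} best={'a' if a > b else 'b'}")
```

With equal weights `chart_b` wins (0.80 vs 0.70); tripling the informativeness weight reverses the order (0.78 vs 0.72). Small manual adjustments can thus flip rankings, which is one reason ad hoc weight tuning is fragile.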
As such, another line of research proposes more systematic machine learning approaches that learn to rank and/or score visualizations from data collected in empirical studies. VizByWiki [lin2018vizbywiki] and DeepEye [luo2018deepeye, luo2018deepeyekeyword, luo2020steerable] formulate a learning-to-rank problem that learns to rank visualizations from crowdsourced data. Retrieve-then-Adapt [qian2020retrieve] extends the learning-to-rank model to simultaneously output paired scores. VizDeck [key2012vizdeck] learns a linear scoring function from users’ up- and downvotes. VizNet [hu2019viznet] demonstrates the feasibility of training a machine learning model to predict the effectiveness of visual encodings from crowdsourced data.
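A common way to learn from pairwise preferences is to fit a linear scoring function whose scores agree with observed (better, worse) pairs. The following sketch uses a simple perceptron update on feature differences, a generic technique related to pairwise ranking methods such as RankSVM. It is not the specific model of any system cited above, and the two features and the preference pairs are hypothetical.

```python
def score(w, x):
    """Linear score of a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def learn_pairwise_weights(pairs, dim, epochs=100, lr=0.1):
    """Perceptron on pair differences: update w whenever a (better, worse)
    pair is not ranked strictly in the right order."""
    w = [0.0] * dim
    for _ in range(epochs):
        mistakes = 0
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            if score(w, diff) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
                mistakes += 1
        if mistakes == 0:  # every pair is ordered correctly
            break
    return w


# Hypothetical features per chart: [data-ink ratio, number of encodings].
pairs = [
    ([0.9, 2], [0.4, 2]),
    ([0.8, 1], [0.8, 3]),
    ([0.7, 1], [0.3, 4]),
]
w = learn_pairwise_weights(pairs, dim=2)
print(all(score(w, b) > score(w, c) for b, c in pairs))  # True
```

Because the learned `w` produces a score for any single chart, a model trained on pairs can still rank an unseen collection.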
Nevertheless, these machine learning approaches face two major challenges: poor generalisability and poor explainability. First, the aforementioned models are trained on statistical or visual encoding properties, assuming that the underlying dataset and specification are available. More importantly, they only support a limited number of chart types. To that end, Fu et al [fu2019visualization] propose a more general approach to assessing the quality of visualization images that generalizes to different visualization types and requires no information beyond the images themselves. Second, machine learning models lack explainability, which might decrease trust. Moreover, they often exclude knowledge derived from empirical studies. To address these limitations, Draco [moritz2018formalizing] takes a hybrid perspective by encoding design knowledge as constraints and learning a weighting function to trade off those constraints, thereby outputting a final score. Besides automatic chart design, Draco can be used as a “visualization spell checker” that explains violations of design guidelines and why they matter.
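The hybrid idea can be sketched as weighted soft constraints that both score a design and explain its problems. The constraints, specification keys, and weights below are hypothetical. Draco writes its constraints in Answer Set Programming and learns its weighting function, as described above, and this Python sketch reproduces neither.

```python
# Hypothetical soft constraints: (name, violated(spec) -> bool, explanation).
SOFT_CONSTRAINTS = [
    ("pie_many_slices",
     lambda s: s["mark"] == "arc" and s.get("n_categories", 0) > 6,
     "Pie charts with many slices are hard to compare."),
    ("rainbow_quantitative",
     lambda s: s.get("color_scheme") == "rainbow" and s.get("color_type") == "quantitative",
     "Rainbow color schemes can distort quantitative differences."),
    ("bar_without_zero",
     lambda s: s["mark"] == "bar" and not s.get("zero", True),
     "Bar lengths mislead when the scale does not start at zero."),
]

# Made-up weights standing in for a learned weighting function.
WEIGHTS = {"pie_many_slices": 3.0, "rainbow_quantitative": 2.0, "bar_without_zero": 5.0}


def cost_and_explanations(spec):
    """Total weighted cost plus a human-readable reason for each violation."""
    violated = [(name, why) for name, check, why in SOFT_CONSTRAINTS if check(spec)]
    cost = sum(WEIGHTS[name] for name, _ in violated)
    return cost, [why for _, why in violated]


spec = {"mark": "bar", "zero": False, "color_scheme": "rainbow", "color_type": "quantitative"}
cost, reasons = cost_and_explanations(spec)
print(cost)  # 7.0
for reason in reasons:
    print("-", reason)
```

The same violation list serves two purposes: its weighted sum ranks designs, and its explanations play the “spell checker” role.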
6.2.2 Discussion and Open Questions
Existing approaches predominantly focus on objective qualities that can be measured via user studies (e.g., task performance and completion time). However, subjective qualities such as aesthetics are relatively underexplored, although they are considered important features of good visualizations [McCandless09]. This is challenging because it is difficult to harvest reliable crowdsourced data on the subjective quality of visualizations, as crowdsourced judgments can be inconsistent and inaccurate. This underscores the need for methods that generate large-scale training datasets for visualization research in a reliable and sustainable manner. One promising way is to incorporate expert knowledge and crowdsourcing experiments in dataset generation.
Going forward, we envision machine-learning approaches that not only assess the visualization but also provide insightful explanations. In this way, the approaches embrace explainability by translating ML models into human-readable explanations and even useful design guidelines.
Characterizing the similarity or other metrics between two visualizations is helpful when dealing with a visualization collection.
Relations to goals and other tasks. Comparison is found to assist in visualization generation and analysis. Comparison metrics can be used as cost functions in 1) recommendation, to perform anchor-based visualization generation (e.g., [vartak2015seedb]), and 2) querying, to perform “query-by-example” (e.g., [chen2020composition]). In addition, assessment metrics can be used to compute the difference between visualizations.
Relations to visualization data. Comparison is studied on both visualization programs and graphics, as shown in Table III.
|Distance||[zhao2020chartseer, luo2020steerable, mafrur2018dive]||[saleh2015learning, chen2020composition]||[oppermann2020vizcommender, xu2018chart]||[kandel2012profiler, law2018duet, luo2020interactive, vartak2015seedb, zhao2020chartseer, luo2020steerable, mafrur2018dive]|
|Difference||[lin2020dziban, kim2017graphscape, xu2018chart]|
6.3.1 Challenges and Methods
Perhaps the most straightforward approach to comparing two visualizations is to calculate their difference. GraphScape [kim2017graphscape] is a directed graph model in which each link represents an edit operation (e.g., adding a field) and nodes denote the resulting visualizations. Each edit operation is assigned a cost, which is learned from human judgments. Dziban [lin2020dziban] further translates the graph model into a set of constraints and weights, similar to Draco [moritz2018formalizing]. In this way, both approaches explicitly model the difference between two charts as operations associated with numerical costs. Nevertheless, this difference-based approach becomes complicated when multiple, often under-specified, operations exist between two charts, where graph traversal is needed to search and weight all possible paths. This is further complicated by the limitation that GraphScape only includes operations regarding data transformation and visual encodings. Although it is methodologically feasible to extend GraphScape to support other operations such as recoloring, such extensions are labor-intensive and offer no guarantee of exhaustiveness.
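The difference-based view treats comparison as a shortest-path problem over edit operations. The sketch below runs Dijkstra's algorithm with Python's `heapq` over a tiny hand-written graph. The chart states, operation names, and costs are hypothetical stand-ins for learned costs, and GraphScape's actual operation set and cost model are richer.

```python
import heapq


def transition_cost(start, goal, edges):
    """Dijkstra's algorithm: cheapest sequence of edit operations from start to goal.
    edges maps a chart state to a list of (operation, next_state, cost)."""
    queue = [(0.0, start, [])]
    best = {start: 0.0}
    while queue:
        cost, state, ops = heapq.heappop(queue)
        if state == goal:
            return cost, ops
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry
        for op, nxt, op_cost in edges.get(state, []):
            new_cost = cost + op_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, ops + [op]))
    return float("inf"), None


# Hypothetical chart states and edit costs.
edges = {
    "bar": [("encode color", "bar+color", 1.0), ("change mark to line", "line", 0.8)],
    "line": [("encode color", "line+color", 1.0)],
    "bar+color": [("change mark to line", "line+color", 0.5)],
}
print(transition_cost("bar", "line+color", edges))
# (1.5, ['encode color', 'change mark to line'])
```

Two paths reach the goal (costs 1.5 and 1.8), which shows why search over all paths is needed once several operation sequences connect the same pair of charts.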
Partly due to the above challenges of difference-based methods, most research adopts distance-based measurements [chen2020composition, kandel2012profiler, law2018duet, luo2020interactive, luo2020steerable, mafrur2018dive, oppermann2020vizcommender, saleh2015learning, vartak2015seedb, xu2018chart, zhao2020chartseer]. The key idea is to convert a visualization into a feature vector, and compute the distance between two feature vectors according to distance functions. Thus, the technical challenges of distance-based measurements are two-fold: the choice of features and the distance function.
The features of visualizations vary with their underlying sources, as discussed in Section 5.2.2. In the context of comparison, we identify four primary sources: graphics, text, data, and specifications. For instance, Saleh et al [saleh2015learning] extract low-level visual features from graphics to learn style similarity, and Chen et al [chen2020composition] model the configuration pattern of multiple-view visualization systems as a vector describing the layout. Regarding text features, VizCommender [oppermann2020vizcommender] uses both hand-crafted features (e.g., TF-IDF) and learned features (e.g., Doc2Vec). More work uses hand-crafted features for the data [kandel2012profiler, law2018duet, luo2020interactive, vartak2015seedb]. Finally, ChartSeer [zhao2020chartseer] uses deep learning to convert Vega-Lite specifications into embeddings.
The derived features are subsequently fed into a distance function. Examples of common distance functions include mutual information [chen2020composition, kandel2012profiler], Earth Mover’s Distance [luo2020interactive, vartak2015seedb], the Bhattacharyya coefficient [law2018duet], and the Jaccard coefficient [mafrur2018dive]. Nevertheless, the process of selecting distance functions is rarely detailed in the literature, leaving the rationales and insights unexplored. This is worsened by a potential downside of distance-based measurements: the feature representation is less interpretable than the operations in difference-based measurements. Thus, it is often difficult to interpret the results, and a user study of ChartSeer suggests that the similarity measurement “does not fully understand their (users’) intent” [zhao2020chartseer].
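For reference, the following sketch implements three of the measures named above on simple inputs: Jaccard for sets (e.g., sets of data attributes), the Bhattacharyya coefficient for two discrete distributions over the same bins, and Earth Mover's Distance in its one-dimensional special case. That special case assumes equal total mass and unit bin spacing, where EMD equals the summed absolute difference of the cumulative distributions. Jaccard and Bhattacharyya are similarities (higher means more alike), whereas EMD is a distance. The inputs are hypothetical.

```python
import math


def jaccard(a, b):
    """|A intersect B| / |A union B|; defined as 1.0 for two empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions over the same bins."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def emd_1d(p, q):
    """1-D Earth Mover's Distance for histograms with equal mass and unit bin spacing."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi      # running difference of the two CDFs
        total += abs(cdf_gap)   # mass that must move across this bin boundary
    return total


print(jaccard({"region", "revenue"}, {"revenue", "month"}))  # about 0.333
print(bhattacharyya([0.5, 0.5], [0.5, 0.5]))                 # 1.0
print(emd_1d([1, 0, 0], [0, 0, 1]))                          # 2.0
```

EMD rewards histograms whose mass is nearby even when no bins overlap, whereas the Bhattacharyya coefficient of those two histograms would be 0. This is one concrete reason the choice of function matters.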
An important question then arises: how to measure overall similarity when combining multiple sources. Naive feature concatenation is a natural way to combine features from different sources [mafrur2018dive, luo2020interactive]. For instance, Luo et al [luo2020steerable] propose a feature vector concatenated from five aspects, each represented by a one-hot vector describing visualization types, x-axis, y-axis, group/bin operations, and aggregation functions. Another method is to compute an aggregated distance by weighting hybrid distances, e.g., the chart encoding distance, keyword tagging distance, and dimensional interaction distance used by Xu et al [xu2018chart]. Nevertheless, both feature concatenation and distance aggregation assume a linear relationship among different vectors, which seems far from capturing real-world complexity and thus yields limited performance as perceived by users.
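Both combination strategies are linear, as the sketch below makes explicit. The aspect vocabularies follow the five aspects listed for Luo et al, but the concrete values, the Hamming distance, and the aggregation weights are hypothetical choices for illustration.

```python
# Hypothetical vocabularies for the five aspects (one one-hot block per aspect).
ASPECTS = {
    "vis_type": ["bar", "line", "pie", "scatter"],
    "x": ["region", "month", "price"],
    "y": ["revenue", "count"],
    "transform": ["none", "group", "bin"],
    "aggregate": ["none", "sum", "avg", "count"],
}


def encode(chart):
    """Concatenate one one-hot block per aspect into a single vector."""
    vector = []
    for aspect, vocab in ASPECTS.items():
        vector.extend(1 if chart[aspect] == value else 0 for value in vocab)
    return vector


def hamming(u, v):
    """Number of positions where two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def aggregated_distance(distances, weights):
    """Linear aggregation of per-source distances."""
    return sum(weights[k] * distances[k] for k in weights)


c1 = {"vis_type": "bar", "x": "region", "y": "revenue", "transform": "group", "aggregate": "sum"}
c2 = dict(c1, vis_type="line")
print(len(encode(c1)), hamming(encode(c1), encode(c2)))  # 16 2
print(aggregated_distance({"encoding": 0.2, "keyword": 0.6}, {"encoding": 0.5, "keyword": 0.5}))
```

Here changing the chart type costs exactly as much as changing the y field, regardless of how differently users perceive the two edits. This is the kind of rigidity the linear assumption imposes.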
6.3.2 Discussion and Open Questions
Comparison and assessment are closely related and share the goal of outputting a numerical score. Comparison has been predominantly rooted in feature engineering and hand-crafted distance functions. The major downside is that such approaches do not learn from user feedback and thus often fail to meet the users’ intent. Unlike assessment, comparison has seen few machine learning approaches. Several approaches (e.g., [xu2018chart]) use pre-trained ML models to perform feature learning. However, they do not fine-tune the models on user feedback data. Thus, proposing datasets and ML approaches for comparison is a clear step toward improving performance.
Nevertheless, adapting ML approaches to comparison is non-trivial, since comparison involves two visualizations while standard ML models take a single entity as input. ScatterNet [ma2018scatternet] addresses this issue: it is a deep learning model that predicts similarities between scatterplots by learning from crowdsourced human feedback data. However, it is unclear how to adapt this approach to other statistical charts. A key challenge is that CNN models such as those used in ScatterNet are worse at capturing human perception of other chart types [haehn2018evaluating, fu2019visualization].
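One generic way to feed a pair of visualizations to a standard single-input model is to map the pair to one vector, e.g., the element-wise absolute difference of their feature vectors, and train a classifier on human judgments of whether the two look similar. The sketch below does this with a hand-rolled logistic regression; the two scatterplot features and the tiny labeled set are hypothetical. It is not ScatterNet, which learns from crowdsourced data with a CNN; it only illustrates how the two-input problem can be reduced to a one-input one.

```python
import math


def pair_features(a: list[float], b: list[float]) -> list[float]:
    """Reduce a pair of charts to one model input: |a - b| per feature."""
    return [abs(x - y) for x, y in zip(a, b)]


def train_pairwise(pairs, labels, lr=0.5, epochs=500):
    """Logistic regression on pair features; label 1 = judged similar by people."""
    dim = len(pair_features(*pairs[0]))
    w, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for (a, b), y in zip(pairs, labels):
            x = pair_features(a, b)
            z = bias + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            bias -= lr * g
    return w, bias


def similarity(model, a: list[float], b: list[float]) -> float:
    """Predicted probability that people would judge the pair as similar."""
    w, bias = model
    z = bias + sum(wi * xi for wi, xi in zip(w, pair_features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical scatterplot features: [correlation strength, share of outliers].
PAIRS = [
    ([0.90, 0.10], [0.85, 0.10]),  # judged similar
    ([0.20, 0.30], [0.25, 0.30]),  # judged similar
    ([0.90, 0.10], [0.10, 0.10]),  # judged different
    ([0.80, 0.20], [0.00, 0.25]),  # judged different
]
LABELS = [1, 1, 0, 0]
model = train_pairwise(PAIRS, LABELS)
print(round(similarity(model, [0.50, 0.20], [0.52, 0.20]), 2))
```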
Querying is the task of retrieving relevant visualizations that satisfy the users’ needs from a visualization collection. It is a crucial component of Information Retrieval (IR) systems, which are also known as search engines especially in the context of the web [cerulo2004taxonomy]. Querying in this context is distinct from visualization query language (Figure 5). The latter specifies visualizations as a query into a database, while the former describes a query into a visualization collection.
Relations to goals and other tasks. Querying is mainly for visualization retrieval (Table IV). It is often built upon other tasks like transformation (e.g., [hoque2019searching]) and comparison (e.g., [chen2020composition]).
Relations to visualization data. Querying directly in the image space can be difficult since semantic information is lost. As such, it is often performed on visualization programs, where semantic information such as titles and axis labels is available.
|Matching|Keyword|Natural language|Structural|Example|
|---|---|---|---|---|
|Exact| | |[chen2020composition, hoque2019searching, siegel2016figureseer]| |
|Best|[ray2015architecture, srinivasan2018augmenting]|[li2014infographics, li2015novel]|[qian2020retrieve, chen2015diagramflyer]|[saleh2015learning]|
6.4.1 Challenges and Methods
There are two main viewpoints that characterize querying: how to specify users’ needs and how to return visualizations that match the needs.
The simplest form of query syntax is keywords. Keywords are popular since they are intuitive and easy to express. Choudhury and Giles [ray2015architecture] developed a search engine that allows users to search figures by keywords in their captions. Similarly, Voder [srinivasan2018augmenting] supports keyword-based queries on data fields as well as on general words like ‘outlier’ from the data facts associated with a visualization. Li et al [li2014infographics, li2015novel] extend keywords to natural language queries by extracting structural keywords from queries and matching the extracted words with text in visualizations. Nevertheless, keyword-based queries often fail to disambiguate unstructured queries, since words can have multiple meanings.
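A minimal keyword search over figure captions, in the spirit of the caption-based search described above, can be sketched as follows. The captions are hypothetical, and ranking by the number of matched query terms is a simplifying assumption. It also exhibits the weakness noted above: a term such as “sales” matches every caption containing it, whatever role the user intended.

```python
import re

# Hypothetical figure captions indexed by figure id.
CAPTIONS = {
    "fig1": "Line chart of monthly sales with an outlier in March",
    "fig2": "Bar chart of sales by region",
    "fig3": "Scatterplot of price versus rating",
}


def tokenize(text: str) -> set[str]:
    """Lower-cased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, captions: dict[str, str]) -> list[tuple[str, int]]:
    """Return (figure id, number of matched query terms), best matches first."""
    terms = tokenize(query)
    scored = [(fig, len(terms & tokenize(cap))) for fig, cap in captions.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: (-s[1], s[0]))


print(keyword_search("sales outlier", CAPTIONS))  # [('fig1', 2), ('fig2', 1)]
```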
Structural queries are a mechanism for resolving ambiguity and improving retrieval quality, and have been used in multiple systems [chen2015diagramflyer, chen2020composition, hoque2019searching, qian2020retrieve, siegel2016figureseer]. They build on keywords by adding structural constraints. For instance, DiagramFlyer [chen2015diagramflyer] is a search engine where a query contains eight key structural fields (e.g., type, x-label, legend) that can uniquely describe a visualization. Such structural constraints make it possible to search for information beyond the text in visualizations. For instance, Retrieve-then-Adapt [qian2020retrieve] retrieves infographics based on a query composed of graphical and textual elements. Notably, visualization specifications are a natural candidate for structural queries. The search engine of Hoque and Agrawala [hoque2019searching] indexes D3 visualizations as Vega-Lite specifications and supports queries in the Vega-Lite syntax. Accordingly, it lets users find visualizations based on a wide range of constraints such as mark types, encodings, and non-data-encoding attributes. In the extreme case where the input query is a complete Vega-Lite specification, their search engine effectively supports “query-by-example”.
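Treating a partial Vega-Lite specification as a structural query can be sketched as a recursive containment test: a chart matches if every property in the query appears with the same value in its specification. The corpus and the matching rule are illustrative assumptions, not Hoque and Agrawala’s implementation, and the sketch only handles `mark` given as a plain string (Vega-Lite also allows a mark definition object). Passing a complete specification as the query turns the same test into a strict form of query-by-example.

```python
# Hypothetical corpus of indexed charts, stored as Vega-Lite-style specs.
CORPUS = {
    "sales_by_region": {
        "mark": "bar",
        "encoding": {
            "x": {"field": "region", "type": "nominal"},
            "y": {"field": "sales", "type": "quantitative", "aggregate": "sum"},
        },
    },
    "sales_over_time": {
        "mark": "line",
        "encoding": {
            "x": {"field": "date", "type": "temporal"},
            "y": {"field": "sales", "type": "quantitative"},
        },
    },
}


def matches(query, spec) -> bool:
    """True if every constraint in the (partial) query also holds in `spec`.
    Nested objects are compared recursively; other values must be equal."""
    if isinstance(query, dict):
        return isinstance(spec, dict) and all(
            key in spec and matches(value, spec[key]) for key, value in query.items()
        )
    return query == spec


def structural_search(query: dict, corpus: dict) -> list[str]:
    """Exact-match filtering: names of all charts satisfying the query."""
    return [name for name, spec in corpus.items() if matches(query, spec)]


print(structural_search({"mark": "bar"}, CORPUS))  # ['sales_by_region']
print(structural_search({"encoding": {"y": {"field": "sales"}}}, CORPUS))
```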
Example-based queries are another format that offers an intuitive way for users to specify their intent. Saleh et al [saleh2015learning] implemented a search engine for stylistic search over infographic corpora that returns stylistically similar images given a query image. However, more sophisticated analysis methods are necessary to capture characteristics beyond stylistic similarity.
Having discussed query syntax, the next challenge is to determine which visualizations are most relevant to the user’s query; methods for doing so can be classified into exact-match and best-match methods. Exact-match techniques filter visualizations by strict conditions, e.g., to retrieve visual analytics systems containing four views [chen2020composition], bar charts [hoque2019searching], or charts describing a given dataset [siegel2016figureseer]. However, exact matching is not always possible, especially when the input conditions are too strict. As such, most systems use best-match approaches [chen2015diagramflyer, li2014infographics, li2015novel, qian2020retrieve, ray2015architecture, srinivasan2018augmenting, saleh2015learning] that rank visualizations by metrics measuring the degree to which a visualization is relevant to the input query. These metrics use natural language models for text components (e.g., synonyms [chen2015diagramflyer, srinivasan2018augmenting], text relevance [li2014infographics, li2015novel]) and similarity metrics [qian2020retrieve, chen2020composition]. However, relevance or similarity metrics alone are insufficient for retrieving visualizations that are not only relevant but also of high quality. In response, Retrieve-then-Adapt [qian2020retrieve] learns the distribution of visual elements in the corpus in an attempt to emphasize visualizations with common elements, assuming that the more frequent an element is, the better it is.
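The difference between exact and best matching can be shown with a relevance score equal to the fraction of query constraints a chart satisfies, after mapping terms through a small synonym table. The charts, the synonym table, and the scoring rule are hypothetical simplifications; exact matching keeps only charts scoring 1.0, while best matching ranks every chart.

```python
# Hypothetical synonym table that normalizes terms before matching.
SYNONYMS = {"revenue": "sales", "earnings": "sales", "column": "bar"}

CHARTS = {
    "c1": {"mark": "bar", "x": "region", "y": "sales"},
    "c2": {"mark": "line", "x": "date", "y": "sales"},
    "c3": {"mark": "bar", "x": "region", "y": "profit"},
}


def normalize(term: str) -> str:
    term = term.lower()
    return SYNONYMS.get(term, term)


def relevance(query: dict[str, str], chart: dict[str, str]) -> float:
    """Fraction of query constraints satisfied (1.0 means an exact match)."""
    hits = sum(normalize(chart.get(k, "")) == normalize(v) for k, v in query.items())
    return hits / len(query)


def exact_match(query, charts):
    return [name for name, c in charts.items() if relevance(query, c) == 1.0]


def best_match(query, charts, k=3):
    ranked = sorted(charts, key=lambda name: relevance(query, charts[name]), reverse=True)
    return [(name, round(relevance(query, charts[name]), 2)) for name in ranked[:k]]


query = {"mark": "column", "x": "region", "y": "revenue"}
print(exact_match(query, CHARTS))  # ['c1']
print(best_match(query, CHARTS))   # [('c1', 1.0), ('c3', 0.67), ('c2', 0.33)]
```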
6.4.2 Discussion and Open Questions
Research on indexing visualizations has been relatively limited in the past decade, leaving ample room for technical development. Due to the diversity of visualizations, even state-of-the-art methods are often restricted to certain types of visualizations, e.g., proportion-related infographics [qian2020retrieve] or basic D3 visualizations [hoque2019searching]. Generalizing these approaches to more types of visualizations is non-trivial and requires a deeper understanding of how visualizations should be indexed. For instance, Hoque and Agrawala’s approach [hoque2019searching] indexes D3 visualizations in Vega-Lite syntax, which cannot express user queries such as “sunburst diagrams”. Indexing becomes even more challenging for raster graphics (bitmap images), which call for further research efforts, e.g., in reverse engineering.
The intention gap is another challenge for querying visualizations. From a theoretical perspective, there are too few empirical studies of user needs when searching for visualizations. Such formative studies could motivate the design space of visualization indexes. In a related vein, Oppermann et al [oppermann2020vizcommender] recently found that information seeking is a core task when browsing visualization repositories, and that users were more interested in content than in style. Similar studies are needed to survey users and inform the design of future visualization search engines. From a practical perspective, it is crucial to design convenient query interfaces that help users specify their intent. For instance, users could specify a region of interest on an example visualization and interactively browse the results. Hoque and Agrawala’s query-by-example feature seems a promising start. We also notice a large body of research on query-by-pattern or query-by-sketch in time-series visualizations, e.g., [wattenberg2001sketching, fan2020sketch]. Future work could generalize these query paradigms to all visualization types.
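Query-by-sketch over time series is often approached with a simple baseline: z-normalize the sketch and each window of the series, slide the sketch along the series, and keep the window with the smallest Euclidean distance. The sketch below implements that baseline on hypothetical data; it is not the algorithm of the cited systems, but it shows why sketch queries can ignore absolute scale and offset.

```python
import math


def znorm(xs: list[float]) -> list[float]:
    """Rescale to zero mean and unit variance so that only the shape matters."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [0.0] * len(xs) if std == 0 else [(x - mean) / std for x in xs]


def sketch_query(sketch: list[float], series: list[float]) -> tuple[int, float]:
    """Return (start index, distance) of the window that best matches the sketch."""
    q = znorm(sketch)
    best_start, best_dist = -1, math.inf
    for start in range(len(series) - len(sketch) + 1):
        window = znorm(series[start:start + len(sketch)])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, window)))
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist


series = [5, 5, 5, 6, 8, 12, 8, 6, 5, 5]
sketch = [0, 1, 3, 1]  # a user-drawn peak on an arbitrary scale
print(sketch_query(sketch, series)[0])  # 3
```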
Reasoning challenges machines to “read charts made for humans” [ono2018should]. Reasoning requires interpreting visualizations to derive high-level information, such as insights, that goes beyond the visual encodings and data extracted via reverse engineering. Reasoning is distinct from assessment since reasoning usually outputs semantic information (e.g., insights, text summaries, visual importance maps) rather than a numerical score. As shown in Table V, we find three common classes depending on the targeted output of the reasoning process: visual perceptual learning, chart summarization, and visual question answering. In the following text, we first describe each class separately, followed by a discussion of existing methods and research gaps.
Relations to goals and other tasks. Reasoning is mainly for visualization enhancement (e.g., generating natural language summaries [obeid2020chart]). It sometimes relies on reverse engineering to improve algorithm performance (e.g., [kim2020answering]).
Relations to visualization data. Reasoning is studied on both visualization images and programs.
|Visual Perceptual Learning||[bryan2016temporal]||[bylinskii2017learning, haehn2018evaluating]|
|Chart Summarization||[mittal1998describing, luo2018deepeyekeyword, burns2012automatically, choudhury2016scalable, cui2019datasite, demir2012summarizing]||[liu2020autocaption, chen2019neural, chen2020figure, obeid2020chart]|
|Visual Question Answering||[huang2007system, kim2020answering]||[chaudhry2020leaf, methani2020plotqa, kahou2017figureqa, reddy2019figurenet, kafle2018dvqa]|
6.5.1 Challenges and Methods
Visual perceptual learning aims to solve visual tasks by analyzing visual information. For instance, Temporal Summary Images [bryan2016temporal] automatically extracts points of interest in charts according to predefined heuristics. Recently, deep learning methods, trained on labeled datasets, have been applied to improve machine perception of visualization images. Bylinskii et al [bylinskii2017learning] presented neural network models to predict human-perceived visual importance of visualization images. Similarly, Haehn et al [haehn2018evaluating] evaluated the performances of CNNs on Cleveland and McGill’s 1984 perception experiments [cleveland1984graphical] and concluded that off-the-shelf CNNs were not currently a good model for human graphical perception. Their initial results underscore the importance of continued research to improve the performance.
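For reference, Cleveland and McGill scored each judgment with a log absolute error, log2(|judged percent - true percent| + 1/8), and Haehn et al report the mean of this quantity (MLAE) when comparing CNN predictions with human accuracy. The small sketch below computes it; treat the exact formulation as our reading of those papers rather than a verified reimplementation.

```python
import math


def log_absolute_error(judged_pct: float, true_pct: float) -> float:
    # The 1/8 offset keeps the logarithm finite when a judgment is exactly right.
    return math.log2(abs(judged_pct - true_pct) + 0.125)


def mlae(judged: list[float], truth: list[float]) -> float:
    """Mean log absolute error over a set of percentage judgments."""
    return sum(log_absolute_error(j, t) for j, t in zip(judged, truth)) / len(truth)


print(mlae([50.0, 25.0], [50.0, 25.0]))  # -3.0 (perfect judgments)
print(mlae([57.875], [50.0]))            # 3.0
```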
Chart summarization becomes increasingly important with the rapid popularization of visualizations. Most existing approaches generate text summaries such as natural language descriptions or captions [burns2012automatically, choudhury2016scalable, cui2019datasite, demir2012summarizing, luo2018deepeyekeyword, obeid2020chart, mittal1998describing, liu2020autocaption]. The simplest approach is to provide a short description of how to interpret the chart [mittal1998describing, luo2018deepeyekeyword], e.g., “This chart shows the trend of average departure delay in January”. More advanced approaches focus on explaining and communicating the high-level insights conveyed by charts. These approaches extract data patterns and subsequently convert them to natural language according to pre-defined templates [burns2012automatically, choudhury2016scalable, cui2019datasite, demir2012summarizing], e.g., “X was the most/least frequent sub-category in A”. Liu et al [liu2020autocaption] offered an alternative perspective that learns the most noteworthy insights with deep learning. A common limitation of the above work is that natural language summaries are generated via pre-defined templates and are therefore limited in variation and generality. Thus, recent research proposes several end-to-end deep learning solutions for generating chart captions [chen2019neural, chen2020figure, obeid2020chart]. However, summarization remains highly under-explored, and current algorithms leave much room for improvement.
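The template-based strategy can be illustrated by filling the example template quoted above from a table of category counts. The carrier data and the function are hypothetical; the point is that every summary differs only in its slots, which is the lack of variation noted above.

```python
def summarize_extrema(category: str, counts: dict[str, int]) -> list[str]:
    """Fill a fixed template with the most and least frequent sub-categories."""
    template = "{x} was the {which} frequent sub-category in {a}."
    most = max(counts, key=counts.get)
    least = min(counts, key=counts.get)
    return [
        template.format(x=most, which="most", a=category),
        template.format(x=least, which="least", a=category),
    ]


for sentence in summarize_extrema("Carrier", {"AA": 120, "DL": 95, "UA": 40}):
    print(sentence)
# AA was the most frequent sub-category in Carrier.
# UA was the least frequent sub-category in Carrier.
```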
Visual question answering is another emerging research area that aims to answer a natural language question about a visualization image. Traditional methods [huang2007system] first decode visualizations into data tables and then parse template-based questions into queries over the data table to generate answers. Kim et al [kim2020answering] recently improved a natural language parser to support free-form, crowdsourced questions. Another line of research studies end-to-end deep learning approaches [chaudhry2020leaf, methani2020plotqa, kahou2017figureqa, reddy2019figurenet, kafle2018dvqa]. The key challenge is that answering questions about visualizations requires high-level reasoning of which existing visual question answering models are not capable [kahou2017figureqa, kafle2018dvqa]. Moreover, visualization images are sensitive to small local changes, e.g., shuffling the colors in a legend greatly alters the information a chart conveys. Therefore, the major problem is how to learn features from visualization images and fuse them with features from natural language questions. DVQA [kafle2018dvqa] learns and fuses features via a sophisticated model containing multiple sub-networks, each responsible for a different component such as spatial attention. Later work extends this approach with improved models, e.g., PlotQA [methani2020plotqa] and LEAF-QA [chaudhry2020leaf] explicitly apply reverse engineering to retrieve visual elements and feed the extracted information into sub-networks. Despite these advances, there remains room for future work, since datasets, models, and evaluation metrics are still lacking.
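The traditional pipeline (decode the chart into a table, then map a templated question onto a query over that table) can be sketched as follows. The decoded table is assumed to come from a reverse-engineering step that is not shown, and the two question templates are hypothetical; anything outside the templates goes unanswered, which is the limitation that free-form parsers and end-to-end models try to address.

```python
import re

# Data table assumed to have been decoded from a bar chart by reverse engineering.
TABLE = {"2018": 34.0, "2019": 41.5, "2020": 29.0}

# Hypothetical templates: each regex is paired with a query over the table.
TEMPLATES = [
    (re.compile(r"what is the value (?:of|for) (\w+)\??", re.IGNORECASE),
     lambda table, m: table.get(m.group(1))),
    (re.compile(r"which \w+ has the (highest|lowest) value\??", re.IGNORECASE),
     lambda table, m: (max if m.group(1).lower() == "highest" else min)(table, key=table.get)),
]


def answer(question: str, table: dict[str, float]):
    """Answer a question only if it fits one of the templates."""
    for pattern, query in TEMPLATES:
        match = pattern.fullmatch(question.strip())
        if match:
            return query(table, match)
    return None  # the question does not fit any template


print(answer("What is the value of 2019?", TABLE))        # 41.5
print(answer("Which year has the highest value?", TABLE))  # 2019
print(answer("Why did the value drop in 2020?", TABLE))    # None
```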
6.5.2 Discussion and Open Questions
The reasoning task is currently undergoing changes as machine learning approaches, particularly deep learning models, are increasingly used. This change may be attributed to the rapid advancement and successful application of deep learning in visual reasoning. In the visualization context, research gaps emerge because off-the-shelf models for natural images often yield unsatisfactory performance on visualizations (e.g., [haehn2018evaluating, kafle2018dvqa]). This gap is not surprising, since visualizations contain relational information that is sensitive to small details uncommon in natural images, e.g., a local change to a bar’s shape might significantly alter the encoded data and conveyed meaning. This research area remains largely under-explored. First, few datasets are available, which hinders model development and validation. For instance, most existing datasets for visual question answering contain synthetic questions and charts, which are far from representative of actual use. Second, it would be pertinent to study feature learning models tailored to visualization images. The recent trend of decomposing end-to-end models into structures containing multiple sub-networks might be a promising direction (e.g., [methani2020plotqa]).
Recommendation is an important step for automating the creation of visualizations. As shown in Table VI, there are three methods for recommending visualizations [wongsuphasawat2016towards]:
Data recommendation suggests interesting data, insights, or data transformation to be visualized from a database
Encoding recommendation determines the visual encoding (including both data and non-data encodings) given the data or other visualization elements
Hybrid recommendation decides both data and encodings
Relations to goals and other tasks. Recommendation is mainly for visualization generation. It is related to assessment and comparison since the derived metrics are used as cost functions.
Relations to visualization data. Recommendation outputs visualization programs and subsequently visualization graphics.
6.6.1 Challenges and Methods
Data Recommendation. Given a dataset, one step in recommending visualizations is to select the data fields to be visualized and, when applicable, the corresponding data transformations. The simplest approach to deciding fields is enumeration. For instance, Voyager [wongsuphasawat2015voyager, wongsuphasawat2017voyager] enumerates all possible fields in a predefined display order based on type and name. DataSite [cui2019datasite] improves this enumeration approach by computing and communicating data facts associated with the selected fields according to pre-defined templates (e.g., “Correlation of A was found between X and Y” if the selected data fields are X and Y). However, enumeration imposes a heavy burden on users, which motivates other research to recommend only the most useful data facts, also known as insights.
|Data||[demiralp2017foresight, kandel2012profiler, srinivasan2018augmenting, wang2019datashot, luo2020interactive, shi2020calliope, ding2019quickinsights, mafrur2018dive, vartak2015seedb]|
|Encoding||[mackinlay1986automating, mackinlay2007show, wongsuphasawat2015voyager, ananthanarayanan2018datavizard, cui2019datasite, kandel2012profiler, narechania2020nl4dv, shi2020calliope, bouali2016vizassist, srinivasan2018augmenting, moritz2018formalizing, lin2020dziban, bryan2016temporal, cui2019text, kim2020gemini, ma2020ladv, wu2020mobilevisfixer, qian2020retrieve, smart2020color]||[dibia2019data2vis, hu2019vizml, wang2019datashot]|
|Hybrid||[ananthanarayanan2018datavizard, cui2019datasite, kandel2012profiler, shi2020calliope, chen2020augmenting, wang2019datashot, wongsuphasawat2017voyager, wongsuphasawat2016towards, luo2018deepeye, luo2018deepeyekeyword, luo2020steerable]|
Recommending insights is often approached by proposing a taxonomy of insight types (e.g., extrema), each type associated with an assessment metric. Examples are Foresight [demiralp2017foresight], Profiler [kandel2012profiler], Voder [srinivasan2018augmenting], DataShot [wang2019datashot], VisClean [luo2020interactive], and Calliope [shi2020calliope]. One key challenge that then arises is the assessment metric. Voder [srinivasan2018augmenting] introduces threshold-based heuristics that classify data facts into different tiers, while the remaining systems propose various cost functions that better capture the differences between data facts. Remarkably, QuickInsights [ding2019quickinsights] proposes a unified formulation of insights and scoring metrics irrespective of insight type. In the context of anchor-based generation, where the goal is to recommend data that meets some criteria with respect to anchor data, the aforementioned assessment metrics are augmented or replaced by comparison metrics in VisClean [luo2020interactive], DiVE [mafrur2018dive], and SeeDB [vartak2015seedb]. For instance, SeeDB [vartak2015seedb] recommends data by its deviation from an anchor visualization.
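A minimal version of the taxonomy-plus-metric approach assigns each insight type its own scoring function and scores every candidate field or field pair. The two types below, extremes and correlations, and their scores are illustrative assumptions rather than the metrics of any cited system. The raw scores of different types live on different scales (a standardized maximum versus an absolute correlation), which is one reason a unified scoring formulation such as that of QuickInsights is attractive.

```python
import math
from itertools import combinations


def pearson(xs: list[float], ys: list[float]) -> float:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def extreme_score(xs: list[float]) -> float:
    """How far the maximum lies above the mean, in standard deviations."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return 0.0 if sd == 0 else (max(xs) - m) / sd


def candidate_insights(table: dict[str, list[float]]):
    """Score every (insight type, fields) candidate with its type's metric."""
    facts = [("extreme", (c,), extreme_score(v)) for c, v in table.items()]
    facts += [("correlation", (a, b), abs(pearson(table[a], table[b])))
              for a, b in combinations(table, 2)]
    return sorted(facts, key=lambda f: -f[2])


TABLE = {"price": [1, 2, 3, 4, 5], "sales": [10, 8, 6, 4, 2], "noise": [3, 1, 4, 1, 5]}
for kind, fields, score in candidate_insights(TABLE):
    print(kind, fields, round(score, 2))
```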
Given the above assessment or comparison metrics, the next challenge is to compute the best or top-k insights. This is challenging since the space of data facts grows exponentially with the number of columns in the data table. As such, researchers have proposed multiple strategies to speed up the computation. For instance, Foresight [demiralp2017foresight] uses sketching to quickly approximate the costs. Other systems introduce efficient search algorithms to recommend the top-k insights. These search algorithms are primarily progressive or iterative [luo2020interactive, shi2020calliope, vartak2015seedb, mafrur2018dive, wang2019datashot], outputting intermediate solutions that approximate the optimal one. Notably, two approaches explicitly adopt tree-based algorithms [luo2020interactive, shi2020calliope], leveraging the idea from GraphScape [kim2017graphscape] that the visualization design space can be modeled as a graph for greedy search or dynamic programming. That said, this data recommendation problem has not yet been formulated as a prediction problem solved by machine learning models, probably due to the lack of labeled training data.
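The progressive top-k idea can be sketched with a bounded min-heap over a lazily enumerated candidate space: candidates are scored one at a time, and an optional evaluation budget stops the search early and returns the current approximate top-k. This is a generic illustration under hypothetical scores, not Foresight’s sketching or the search algorithm of any cited system.

```python
import heapq
from itertools import combinations, islice


def field_subsets(columns, max_size):
    """Lazily enumerate field subsets; with max_size = len(columns) the
    space holds 2**n - 1 subsets, before insight types are even considered."""
    for size in range(1, max_size + 1):
        yield from combinations(columns, size)


def progressive_top_k(candidates, score, k, budget=None):
    """Score candidates one at a time, keeping the k best in a min-heap.
    With a budget, stop early and return the current, approximate top-k."""
    heap = []  # entries are (score, arrival index, candidate); worst on top
    for i, cand in enumerate(islice(candidates, budget)):
        entry = (score(cand), i, cand)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    return [cand for _, _, cand in sorted(heap, reverse=True)]


# Hypothetical per-field "interestingness"; a subset scores the sum of its fields.
WEIGHTS = {"a": 4, "b": 3, "c": 2, "d": 1}


def subset_score(subset):
    return sum(WEIGHTS[f] for f in subset)


print(progressive_top_k(field_subsets(WEIGHTS, 2), subset_score, k=2))            # [('a', 'b'), ('a', 'c')]
print(progressive_top_k(field_subsets(WEIGHTS, 2), subset_score, k=2, budget=4))  # [('a',), ('b',)]
```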
Several systems recommend data based on inputs beyond a database. In particular, these inputs involve natural language, which is beyond the core scope of this survey. Examples include natural language statements in Text-to-Viz [cui2019text], news articles in VizByWiki [lin2018vizbywiki], keyword queries [luo2018deepeyekeyword], and natural language interfaces (e.g., NL4DV [narechania2020nl4dv] and FlowSense [yu2019flowsense]).
Encoding Recommendation decides data encodings and/or non-data encodings for styling (e.g., positions).
Data encodings are extensively studied in the literature. Early approaches date back to 1986, when the APT system [mackinlay1986automating] enumerated the visual encoding space and selected the “best” encodings according to a ranking in terms of expressiveness and effectiveness. This ranking-based recommendation was implemented and extended in later systems such as ShowMe [mackinlay2007show] and Voyager [wongsuphasawat2015voyager]. In addition to ranking, several systems propose heuristic rules to decide visual encodings given the insights extracted via data recommendation or the visual tasks, e.g., extreme-value insights or the task of finding extremes are mapped to histograms or scatterplots. Examples include DataVizard [ananthanarayanan2018datavizard], DataSite [cui2019datasite], Profiler [kandel2012profiler], NL4DV [narechania2020nl4dv], Calliope [shi2020calliope], VizAssist [bouali2016vizassist], and Voder [srinivasan2018augmenting].
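The ranking idea behind APT can be sketched as a greedy assignment: fields are processed in priority order, and each receives the most effective channel that is still free and can express its data type. The channel rankings below are a small hypothetical subset that only loosely follows Mackinlay’s ordering (position ranked first for both quantitative and nominal data); APT itself searches a much richer space of encodings.

```python
# Hypothetical, heavily simplified effectiveness rankings per data type.
# A channel missing from a list is treated as not expressive for that type.
RANKING = {
    "quantitative": ["x", "y", "size", "color"],
    "nominal": ["x", "y", "color", "shape"],
}


def assign_encodings(fields: list[tuple[str, str]]) -> dict[str, str]:
    """Greedily map each (field, data type), in priority order, to its best free channel."""
    encoding: dict[str, str] = {}
    for name, dtype in fields:
        for channel in RANKING[dtype]:
            if channel not in encoding:
                encoding[channel] = name
                break
        else:
            raise ValueError(f"no expressive channel left for {name!r}")
    return encoding


print(assign_encodings([("sales", "quantitative"), ("profit", "quantitative"),
                        ("region", "nominal")]))
# {'x': 'sales', 'y': 'profit', 'color': 'region'}
```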
The above heuristic-based data-encoding recommenders have recently been superseded by machine learning approaches due to the increasing availability of datasets. Draco [moritz2018formalizing] and Dziban [lin2020dziban] learn to assess visualizations by weighting design rules for visualizations. Subsequently, they formulate a constraint optimization problem to recommend the best or top-k visualizations. As discussed, such approaches combining assessment and optimization are commonly adopted for recommending insights. Other ML approaches seek to directly learn the mappings between data and visual encodings by training an end-to-end model, including Data2Vis [dibia2019data2vis], VizML [hu2019vizml], and DataShot [wang2019datashot].
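The assess-then-optimize pattern can be sketched with weighted soft constraints: each rule flags a violation, a candidate’s cost is the summed weight of its violations, and the recommender returns the cheapest candidates. The rules, the candidates, and the ranking-perceptron update used here to learn weights from pairwise preferences are hypothetical simplifications; Draco, for instance, expresses its design knowledge as constraints handled by a dedicated solver and uses its own learning procedure rather than this code.

```python
# Hypothetical soft design rules: each returns True when a chart violates it.
RULES = {
    "avoid_pie": lambda v: v["mark"] == "arc",
    "aggregate_needs_bar_or_line": lambda v: v["aggregate"] and v["mark"] not in ("bar", "line"),
    "too_many_colors": lambda v: v.get("color_cardinality", 0) > 10,
}


def violations(vis: dict) -> set[str]:
    return {name for name, rule in RULES.items() if rule(vis)}


def cost(vis: dict, weights: dict[str, float]) -> float:
    return sum(weights[name] for name in violations(vis))


def learn_weights(preferences, weights, lr=1.0, epochs=10):
    """Ranking-perceptron updates on (preferred, dispreferred) chart pairs."""
    w = dict(weights)
    for _ in range(epochs):
        for good, bad in preferences:
            if cost(good, w) >= cost(bad, w):  # ranking is wrong or tied
                for name in violations(bad) - violations(good):
                    w[name] += lr
                for name in violations(good) - violations(bad):
                    w[name] -= lr
    return w


def recommend(candidates, weights, k=1):
    """Return the k candidates with the lowest weighted violation cost."""
    return sorted(candidates, key=lambda v: cost(v, weights))[:k]


bar = {"mark": "bar", "aggregate": True}
pie = {"mark": "arc", "aggregate": True}
point = {"mark": "point", "aggregate": True}
w = learn_weights([(bar, pie)], dict.fromkeys(RULES, 0.0))
print(recommend([pie, point, bar], w))  # [{'mark': 'bar', 'aggregate': True}]
```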
In addition to data encodings, other approaches study how to recommend non-data-encoding attributes such as layouts and colors. Most approaches formulate optimization problems, where the primary challenge is to define the optimization target, that is, the assessment metrics. Several metrics are human-crafted cost functions [bryan2016temporal, cui2019text, kim2020gemini, ma2020ladv, wu2020mobilevisfixer], while others are data-driven, including machine learning models trained on human assessment datasets [qian2020retrieve] and distances to common patterns mined from a corpus [smart2020color]. Several systems contribute novel optimization algorithms to improve efficiency, including reinforcement learning [wu2020mobilevisfixer] and Markov chain Monte Carlo methods [qian2020retrieve]. Similar to the algorithms for recommending top-k insights, both algorithms are progressive.
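Retrieve-then-Adapt is described above as using Markov chain Monte Carlo; the general mechanism can be illustrated with a plain Metropolis sampler over a two-parameter layout state. The label-placement cost function, the step size, and the temperature are hypothetical, and neither the cost nor the proposal distribution is taken from the cited systems. Tracking the best state seen so far makes the search progressive in the sense used above: it can be stopped at any point and still return a usable layout.

```python
import math
import random


def label_cost(pos: tuple[float, float]) -> float:
    """Hypothetical cost for a text label in a unit square: stay close to its
    anchor at (0.5, 0.5) but do not sit on a data mark at (0.55, 0.5)."""
    x, y = pos
    to_anchor = math.hypot(x - 0.5, y - 0.5)
    to_mark = math.hypot(x - 0.55, y - 0.5)
    return to_anchor + 50 * max(0.0, 0.1 - to_mark)  # overlap penalty


def metropolis_search(cost, start, steps=2000, step_size=0.05, temperature=0.05, seed=0):
    """Metropolis sampling with Gaussian proposals; proposals leaving the unit
    square are rejected. Improvements are always accepted, worse states with
    probability exp(-increase / temperature). Returns the best state seen."""
    rng = random.Random(seed)
    state, current = start, cost(start)
    best, best_cost = state, current
    for _ in range(steps):
        proposal = tuple(v + rng.gauss(0.0, step_size) for v in state)
        if not all(0.0 <= v <= 1.0 for v in proposal):
            continue  # outside the canvas: reject and keep the current state
        new = cost(proposal)
        if new <= current or rng.random() < math.exp((current - new) / temperature):
            state, current = proposal, new
            if current < best_cost:
                best, best_cost = state, current
    return best, best_cost


best, best_cost = metropolis_search(label_cost, start=(0.9, 0.9))
print(best, round(best_cost, 3))
```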
Hybrid Recommendation decides both data and encodings. A straightforward approach to hybrid recommendation is to combine data and encoding recommendation sequentially. This approach is widely implemented in visualization recommenders including DataVizard [ananthanarayanan2018datavizard], DataSite [cui2019datasite], Profiler [kandel2012profiler], Calliope [shi2020calliope], Voder [chen2020augmenting], and DataShot [wang2019datashot]. Other approaches take an end-to-end perspective, formulating the recommendation task as a single optimization problem. Examples are Voyager [wongsuphasawat2017voyager, wongsuphasawat2016towards] and DeepEye and its extensions [luo2018deepeye, luo2018deepeyekeyword, luo2020steerable]. Here, the core problem is to provide an overall assessment score covering both the data and the encodings. The first step is to translate a visualization into a formal representation, i.e., a visualization query language (VQL). Assessment metrics are then proposed to evaluate the quality of a VQL representation. Finally, Voyager ranks the recommended visualizations, while DeepEye proposes efficient algorithms to generate the top-k visualizations.
6.6.2 Discussion and Open Questions
An ongoing discussion in recommending visualizations concerns the problem formulation: optimization or prediction. Optimization requires the careful design of objective functions, which are often assessment scores. Another important concern is designing efficient algorithms for solving the complex, sometimes multi-objective, optimization problem that arises from the huge design space of visualizations. Generally speaking, optimization-based approaches have the potential to be extended to human-in-the-loop approaches, since the predicted assessment scores help users judge visualization quality. Nevertheless, hand-crafted cost functions for assessment are often insufficient, while machine learning assessment requires massive training data labeled with human assessments and is therefore expensive. In contrast, prediction-based approaches such as VizML [hu2019vizml] only demand training data describing datasets and their visualizations, without the need for human labeling. This reduces the overhead of constructing datasets and thus helps boost performance. The above discussion therefore opens up several general questions: How can prediction-based ML models be interpreted to understand their reasoning and potentially derive assessment scores? How can the cost of collecting training datasets be reduced?
Other research gaps exist with respect to recommendation. For instance, few ML approaches have been applied to recommend insights and non-data encodings. Is it beneficial to collect training data for such tasks, e.g., predicting important data fields given a data table? Besides, many existing approaches only recommend the top-k individual visualizations given a large data table, while it is unlikely that “top-k visualizations fit all”. Future research should study how to recommend dashboards, and even visual analytics systems, for more comprehensive and intelligent data analysis. Generating a coherent data story comprising multiple data facts is another interesting question. Finally, existing systems are confined to well-known charts. Recent research in computer vision has proposed generative models for synthesizing images. An interesting direction is to apply generative models to recommend synthetic, novel visualizations.
Mining is an emerging task motivated by the rapid popularization and accumulation of visualization data online (Table VII). Generally, there are two kinds of mining tasks, i.e., mining design patterns and mining data patterns, that are discussed in the following text.
Relations to goals and other tasks. The goal of mining is visualization analysis, i.e., to discover useful information or patterns from a visualization collection. Reverse engineering is often a prerequisite for mining to obtain semantic information.
Relations to visualization data. Mining is studied on both visualization graphics and programs.
|Design Pattern||[ray2015architecture, lee2017viziometrics, battle2018beagle, hoque2019searching, chen2020composition]||[smart2020color]|
|Data Pattern||[xu2018chart, zhao2020chartseer]|
6.7.1 Challenges and Methods
Mining Design Patterns. Design mining refers to leveraging data mining techniques to derive design principles from existing artifacts. Smart et al [smart2020color] formally introduced design mining to visualization and proposed an unsupervised clustering technique to derive common color ramps. However, design mining had already been practiced implicitly in several earlier systems, albeit for different goals. Choudhury and Giles [ray2015architecture] investigate the design patterns (e.g., colored or not) of 300 line graphs sampled from top computer science conferences. Viziometrics [lee2017viziometrics] computes the average number of visualizations in academic papers from different research domains. In the visualization community, Beagle [battle2018beagle] automatically crawled SVG-based visualizations online to investigate how popular chart types such as line and bar charts are on the web. Hoque and Agrawala [hoque2019searching] performed a design demographics analysis of 7,860 D3 visualizations to identify common patterns such as how frequently circles are used. More recently, Chen et al [chen2020composition] studied the composition and configuration patterns in multiple-view visualizations collected from visualization papers.
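Design demographics of this kind reduce, at their simplest, to frequency counts over a corpus of specifications. The sketch below computes the share of each mark type across a hypothetical corpus of Vega-Lite-style specs; it accepts `mark` either as a string or as an object with a `type` property, as Vega-Lite allows, and it is not the analysis pipeline of any cited study.

```python
from collections import Counter


def mark_type(spec: dict) -> str:
    """Mark type, whether given as a string or as a mark definition object."""
    mark = spec.get("mark", "unknown")
    return mark.get("type", "unknown") if isinstance(mark, dict) else mark


def mark_demographics(specs: list[dict]) -> dict[str, float]:
    """Share of each mark type in a corpus, most common first."""
    counts = Counter(mark_type(s) for s in specs)
    total = sum(counts.values())
    return {mark: n / total for mark, n in counts.most_common()}


corpus = [{"mark": "bar"}, {"mark": "line"}, {"mark": {"type": "bar"}}, {"mark": "circle"}]
print(mark_demographics(corpus))  # {'bar': 0.5, 'line': 0.25, 'circle': 0.25}
```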
The above systems mostly apply simple statistical analysis [ray2015architecture, lee2017viziometrics, battle2018beagle, hoque2019searching, chen2020composition] or clustering techniques [smart2020color]. However, patterns might become meaningless numbers unless they are interpreted as insights or used to empower novel applications, e.g., to motivate design ideas. As such, several systems [hoque2019searching, chen2020composition] propose interactive visual interfaces for users to browse the visualization database and find example visualizations. Other approaches also leverage mined design patterns to recommend visual designs [smart2020color, chen2020composition].
Mining Data Patterns. Another line of research aims to explore the data patterns encoded in visualization ensembles. This concept has been widely implemented for exploring ensembles of a specific chart type (e.g., ScagExplorer [dang2014scagexplorer] and TimeSeer [dang2012timeseer]). We identify two approaches that are independent of chart type, namely Chart Constellations [xu2018chart] and ChartSeer [zhao2020chartseer]. Both adopt a visual analytics approach that projects charts into a 2D space, thereby supporting clustering and interactive analysis. Chart Constellations [xu2018chart] defines several cost functions for computing the distances between charts, while ChartSeer [zhao2020chartseer] uses features learned via deep learning models to improve performance.
6.7.2 Discussion and Open Questions
There exist several promising directions for future work. For design patterns, existing approaches focus on visualization usage. It would be beneficial to mine underlying semantic patterns from visualization collections, e.g., the relationships between linked views in a multiple-view visualization, and the “good” or “bad” practices in existing visualization designs. Such guidelines could inform the design of recommender systems that automate the creation of visualizations. For instance, many recommender systems in our survey rely on manual coding to derive the design space of visualizations (e.g., [wang2019datashot, wu2020mobilevisfixer, cui2019text]). Automated mining of the design space would significantly reduce this manual effort.
Most current mining techniques are limited to simple statistics. However, it is likely that there exist hidden patterns in visualization corpora. Therefore, one could explore the visualization collection with advanced mining techniques and human-in-the-loop analytics to uncover those patterns. A research challenge would be to develop visual analytic systems for analyzing visualization collections.
7 Future Research Opportunities
Despite extensive research efforts, considerable research gaps and opportunities for future research remain. We have identified and discussed research opportunities regarding representation and each task in section 5 and section 6. In this section, we present an organized overview of future research directions.
7.1 Visualization Standards and Interoperability
Our analysis reveals that the systems and techniques reviewed here have inconsistently adopted different content formats of visualization data (subsection 5.1). Such inconsistency impedes interoperability among different visualization systems, even though combining different systems and libraries is a common need in application settings. For instance, visualization generation tools such as VizML [hu2019vizml] recommend data encodings, while other systems like MobileVisFixer [wu2020mobilevisfixer] adjust non-data encodings (visual styles). The two systems cannot easily be combined because their formats are incompatible. Notably, it is common practice to select partial information from the full specification as the intermediate format. This leads to the need for a common visualization standard that covers all existing partial formats, as well as derivative tools that auto-complete partial specifications into universally compatible formats.
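The sketch below illustrates what such auto-completion might look like. Both partial formats and the `DEFAULTS` table are hypothetical and only loosely Vega-Lite-style; a real standard would have to reconcile far more fields. Partial specifications are deep-merged (later ones win), disagreements between them are reported, and any field still missing is filled from the defaults.

```python
import copy

# Hypothetical defaults for a "complete" specification.
DEFAULTS = {
    "mark": "point",
    "encoding": {},
    "config": {"axis": {"labelFontSize": 11, "titleFontSize": 11}, "background": "white"},
}


def deep_merge(base, update, path="", conflicts=None):
    """Recursively merge `update` into a copy of `base`; later values win.

    Dotted paths where two non-dict values disagree are appended to `conflicts`.
    """
    if conflicts is None:
        conflicts = []
    merged = copy.deepcopy(base)
    for key, value in update.items():
        key_path = f"{path}.{key}" if path else key
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key], _ = deep_merge(merged[key], value, key_path, conflicts)
        else:
            if key in merged and merged[key] != value:
                conflicts.append(key_path)
            merged[key] = copy.deepcopy(value)
    return merged, conflicts


def complete_spec(*partials):
    """Combine partial specs, then fill every field that is still missing from DEFAULTS."""
    combined, conflicts = {}, []
    for part in partials:
        combined, conflicts = deep_merge(combined, part, conflicts=conflicts)
    # Overriding a default is expected, so conflicts from this step are ignored.
    full, _ = deep_merge(DEFAULTS, combined)
    return full, conflicts
```

Reporting conflicts, rather than only letting the last partial specification win, matters when two tools touch the same field, e.g., a style fixer that also changes the mark type.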
Besides, since visualizations are naturally shared and stored in graphical formats, systems for visualization enhancement usually have to resort to reverse engineering to extract the underlying program. Despite extensive research efforts, reverse engineering remains computationally expensive and lacks robustness [poco2017reverse]. In particular, our analysis in subsection 6.1 suggests that it is currently impossible to reverse-engineer bespoke charts. Although recent work like Chartem [fu2020chartem] and VisCode [zhang2020viscode] proposes new standards for storing programs in graphics, it will likely take a long time before such standards are adopted and implemented in existing systems. Thus, continued research on reverse engineering is essential and beneficial to interoperability.
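As a rough illustration of storing a program inside graphics, the sketch below hides a JSON specification in the least significant bits of grayscale pixel values. This is classic least-significant-bit steganography, not the encoding used by Chartem or VisCode; pixels are modeled as a flat list of integers from 0 to 255, and `embed_spec`/`extract_spec` are hypothetical names.

```python
import json


def _bits(data: bytes):
    """Yield the bits of `data`, most significant bit of each byte first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def embed_spec(pixels, spec):
    """Hide a JSON-serialized spec in the lowest bit of each pixel value."""
    payload = json.dumps(spec, separators=(",", ":")).encode("utf-8")
    header = len(payload).to_bytes(4, "big")  # 32-bit payload length
    bits = list(_bits(header + payload))
    if len(bits) > len(pixels):
        raise ValueError("image too small for this specification")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit  # overwrite only the lowest bit
    return out


def _read_bytes(pixels, start, count):
    """Reassemble `count` bytes from the lowest bits of pixels[start:]."""
    data = bytearray()
    for b in range(count):
        value = 0
        for i in range(8):
            value = (value << 1) | (pixels[start + b * 8 + i] & 1)
        data.append(value)
    return bytes(data)


def extract_spec(pixels):
    """Recover the spec written by embed_spec."""
    length = int.from_bytes(_read_bytes(pixels, 0, 4), "big")
    return json.loads(_read_bytes(pixels, 32, length).decode("utf-8"))
```

Each pixel changes by at most one intensity level, so the chart looks unchanged. The scheme's fragility also hints at why robustness matters: lossy recompression (e.g., JPEG) or resizing would very likely destroy the hidden bits.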
7.2 Visualization-Tailored Machine Learning
Recent research has increasingly leveraged machine learning to generate or transform visualizations. However, it remains an open challenge to choose the “best” representation and machine learning model for visualizations. On the one hand, programs are compact and effective representations of visualizations that are computationally inexpensive [hu2019vizml, moritz2018formalizing]. However, program-based models do not generalize well, since they are often limited to specific visualization types and might not handle parameter values not observed during training. On the other hand, graphics appear to be a general representation. However, much research suggests that off-the-shelf models for natural images achieve unsatisfactory results in tasks such as visual perceptual learning [haehn2018evaluating], visual question answering [kafle2018dvqa], and assessment [fu2019visualization].
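The generalization problem of program-based representations can be seen in a minimal feature encoder. The sketch below one-hot encodes a few fields of a flattened specification using only the values observed during training. The field names (`mark`, `x_type`, `y_type`) and helpers are hypothetical simplifications, not VizML's actual feature set.

```python
def fit_vocabulary(specs, keys=("mark", "x_type", "y_type")):
    """Collect, per field, the categorical values observed in the training specs."""
    return {k: sorted({s[k] for s in specs if k in s}) for k in keys}


def encode(spec, vocab):
    """One-hot encode a flattened spec; values unseen during training map to all zeros."""
    vector = []
    for key, values in vocab.items():
        vector.extend(1 if spec.get(key) == v else 0 for v in values)
    return vector
```

A chart type absent from the training data (say, an `arc` mark when only bars and lines were seen) is encoded with an all-zero segment, so a downstream model receives no information about it.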
This gap is unsurprising given the characteristics of visualization images. Compared with natural images, visualizations are particularly sensitive to local details, e.g., the sizes of graphical marks are of vital importance. Text is also critical: it takes on various roles (e.g., legends, axis labels) that tolerate no misinterpretation, since missing a single character in an axis label can lead to a serious error. These characteristics make it challenging to design machine learning models tailored to visualizations. Recent work in visualization question answering [methani2020plotqa, chaudhry2020leaf, kafle2018dvqa] has proposed sophisticated models to capture visualization-specific features. However, such attempts are limited and their efficacy remains to be thoroughly evaluated.
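A small worked example shows why a single misread character matters. Recovering a data value from a chart image typically involves inverting a linear axis from two tick labels and their pixel positions. The helper below is a hypothetical sketch of that step, assuming a linear scale and correctly located ticks; it is not a component of the cited systems.

```python
def pixel_to_value(pixel, tick_a, tick_b):
    """Invert a linear axis from two ticks given as (pixel_position, label_text)."""
    (p0, text0), (p1, text1) = tick_a, tick_b
    v0, v1 = float(text0), float(text1)  # the label text as read by OCR
    return v0 + (pixel - p0) * (v1 - v0) / (p1 - p0)
```

With tick "0" at pixel 300 and tick "1000" at pixel 100, a bar ending at pixel 160 maps to 700. If OCR drops one zero and reads the upper tick as "100", the same bar maps to 70, a tenfold error that no amount of pixel-level precision can repair.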
Another important perspective lies in augmenting machine learning models with knowledge derived from empirical studies. Machine learning models are often criticized for poor explainability. Fortunately, empirical research in the visualization field has accumulated a valuable knowledge base about how visualizations should be interpreted, assessed, and created. It is therefore promising to incorporate that knowledge into ML models tailored to visualizations, as exemplified by Draco [moritz2018formalizing].
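Draco encodes design knowledge as hard constraints, which rule a design out, and weighted soft constraints, whose violations add to a cost; the weights can be learned from experimental data. Draco itself is written in Answer Set Programming. The Python sketch below only mimics that structure at toy scale: the constraint names, predicates, and weights are invented for illustration and are not Draco's knowledge base, and specs are flattened to hypothetical channel-to-type mappings.

```python
# Toy design knowledge. A hard constraint returns True when a design is invalid;
# a soft constraint returns True when a (hypothetical) guideline is violated.
HARD_CONSTRAINTS = {
    "bar_without_two_axes": lambda s: s["mark"] == "bar" and not ("x" in s and "y" in s),
}
SOFT_CONSTRAINTS = {
    # name: (weight, predicate); these weights are invented, not learned.
    "nominal_on_size": (3, lambda s: s.get("size") == "nominal"),
    "quantitative_on_color": (1, lambda s: s.get("color") == "quantitative"),
}


def cost(spec):
    """Return None for invalid designs, else the weighted sum of soft violations."""
    if any(violated(spec) for violated in HARD_CONSTRAINTS.values()):
        return None
    return sum(w for w, violated in SOFT_CONSTRAINTS.values() if violated(spec))


def rank(candidates):
    """Order valid candidate designs from lowest to highest cost."""
    scored = [(cost(c), i, c) for i, c in enumerate(candidates)]
    # The index i breaks ties, so the spec dicts themselves are never compared.
    return [c for _, _, c in sorted(s for s in scored if s[0] is not None)]
```

Because every violated constraint is named, the cost of a design can be traced back to specific guidelines, which is one way such knowledge can make a model's output easier to explain.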
7.3 A Big Data Perspective to Visualization Data
In this survey, we highlight the theme of considering visualizations as a data type. Moving forward, an evolving and promising theme is big visualization data, which concerns processing and analyzing visualization data at larger scales. This perspective raises several issues that deserve research efforts. First, visualization database systems should effectively store, manage, and retrieve visualizations that are heterogeneous and often unstructured. Second, it would be beneficial to facilitate the sharing of visualizations to support their distribution and reuse. Sharing visualizations currently faces multiple problems, such as the dependency on the raw data table and the exponentially growing number of states induced by user interactions [raji2020dataless]. This underscores the need for research on effectively sharing visualizations. Finally, it is unclear how to mine and analyze big visualization data, since existing approaches focus on small-scale visualization collections and mainly use simple statistics. These challenges demand a holistic adaptation of data mining techniques to visualization data, including but not limited to data cleaning, data transformation, data reduction, feature extraction, and analysis. For instance, when collecting visualization datasets by web crawling, it is important to clean the collection by removing noise such as non-visualization images. Therefore, many research topics remain open in analyzing big visualization data.
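Two of the data-mining steps above, cleaning and reduction, can be sketched for a crawled collection of specifications. The default `is_visualization` test (requiring a `mark` key) is a hypothetical placeholder; for crawled images, this filter would instead be an image classifier. Exact duplicates are removed by hashing a canonical JSON serialization, so specs that differ only in key order count as the same chart.

```python
import hashlib
import json


def canonical_key(spec):
    """Hash a spec so that key order and whitespace do not create false duplicates."""
    text = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def clean_collection(specs, is_visualization=lambda s: "mark" in s):
    """Drop non-visualization records (noise) and exact duplicates (reduction)."""
    seen, cleaned = set(), []
    for spec in specs:
        if not is_visualization(spec):  # noise removal (placeholder test)
            continue
        key = canonical_key(spec)
        if key in seen:  # exact duplicate after canonicalization
            continue
        seen.add(key)
        cleaned.append(spec)
    return cleaned
```

Near-duplicates (e.g., the same chart with a different color) would survive this step; detecting them would need a similarity measure rather than exact hashing.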
8 Discussion and Limitation
In this section, we discuss the limitations in terms of mutual exclusiveness and collective exhaustiveness, as well as generalizability.
8.1 Mutual Exclusiveness and Collective Exhaustiveness
In this survey, we use an inductive approach to organize the literature and construct our taxonomy by observing existing work and iteratively generalizing the classification.
We note several dependencies among tasks. For instance, assessment and comparison metrics are often used as optimization functions for recommendation. However, those dependencies should not be interpreted as violations of mutual exclusiveness. Instead, dependencies suggest that tasks can be sequentially combined into a system pipeline for solving complex problems.
Because we collected the papers for this survey inductively, we do not claim that our taxonomy is exhaustive, especially our task and goal taxonomies, since there may be many potential research questions that we have not yet observed. Therefore, it would be interesting to refine the taxonomy from a deductive perspective by referring to task taxonomies in related fields such as computer vision, artificial intelligence, and databases. For instance, image compression and style transfer are well-studied tasks in the computer vision community [computerVision] but remain unexplored in the context of visualizations. Such gaps point to promising directions for future research on applying artificial intelligence to visualizations.
8.2 Generalization to Visualizations Beyond Charts
In this survey, we make simplifying assumptions by focusing on charts and infographics, excluding scientific visualizations and work tailored to a specific type of visualization. An important concern regarding our taxonomy is therefore how to generalize or extend it to a wider spectrum of visualization data. During our analysis, we found that most of the excluded work fits well into our proposed taxonomy. However, a few exceptions warrant future improvement. In the following, we discuss notable extensions to our what-why-how taxonomy.
What. Our what taxonomy primarily focuses on visualization data. However, visualizations are rarely treated as standalone data in AI-empowered systems. Instead, it is often necessary to provide ground-truth labels for visualizations as auxiliary data, e.g., chart types [poco2017reverse]. Another type of auxiliary data is user-generated, e.g., interaction logs [fan2018fast] and analysis provenance [xu2020survey]. Future research should study the taxonomy of auxiliary data to better contextualize the opportunities for AI for visualization research.
Why. The why taxonomy is organized along two axes: single versus multiple visualizations, and inputting versus outputting visualizations. A missing perspective lies in work that neither inputs nor outputs visualizations but instead exploits visualization data in an intermediate stage. For instance, Lallé et al. [lalle2015prediction] collected eye-tracking data while users browsed visualizations and trained an ML model to predict users' learning curves. Our survey does not cover this kind of research because our core theme centers on visualization as a data format, emphasizing how visualization data are processed and produced. Another way to improve our taxonomy is to expand the sub-categories by adding goals that are currently confined to particular visualization types. For instance, several systems propose machine learning methods for brushing point-based visualizations [chen2019lassonet, fan2018fast], which fall under the visualization enhancement category.
How. Echoing the above discussion of “what”, there is a corresponding need to identify tasks for processing and analyzing auxiliary data. This is crucial since our current task taxonomy concerns only visualization data. Moreover, we notice another potential task for visualization data, namely visualization collection summarization, which aims to represent visualization collections in an effective and compact manner. However, existing approaches are limited and specific to visualization types, e.g., Scagnostics for scatterplots [dang2014scagexplorer] and line charts [dang2012timeseer]. In addition to those statistical summarization approaches, more recent visual analytics approaches use glyphs [zhao2020chartseer]. As research efforts grow, future work could provide a systematic overview of visualization summarization.
In this paper, we probe the concept of considering visualization as an emerging data format and investigate advances in applying artificial intelligence to visualization data. We present a novel classification that enables readers to find relevant literature among a wide variety of research areas in computer science. Our classification can also help readers understand what techniques have been developed and identify areas for future research. We hope that our survey can serve as a tutorial that helps stimulate new theories, problems, techniques, and applications.
The authors would like to thank…