VisImages: A Large-scale, High-quality Image Corpus in Visualization Publications

07/09/2020 ∙ by Dazhen Deng, et al. ∙ 0

Images in visualization publications contain rich information, such as novel visual designs, model details, and experiment results. Constructing such an image corpus can contribute to the community in many aspects, including literature analysis from the perspective of visual representations, empirical studies on visual memorability, and machine learning research for chart detection. This study presents VisImages, a high-quality and large-scale image corpus collected from visualization publications. VisImages contain rich and diverse annotations for each image, including captions, types of visual representations, and bounding boxes. First, we algorithmically extract the images associated with captions and manually correct the errors. Second, to categorize visualizations in publications, we extend and iteratively refine the existing taxonomy through a multi-round pilot study. Third, guided by this taxonomy, we invite senior visualization practitioners to annotate the visual representations that appear in each image. In this process, we borrow techniques such as "gold standards" and majority voting for quality control. Finally, we recruit crowd workers to draw bounding boxes for visual representations in the images. The resulting corpus contains 35,096 annotated visualizations from 12,267 images with 12,057 captions in 1,397 papers from VAST and InfoVis. We demonstrate the usefulness of VisImages through the following four use cases: 1) analysis of color usage in VAST and InfoVis papers across years, 2) discussion of researchers' preferences for visualization types, 3) spatial distribution analysis of visualizations in visual analytics systems, and 4) training visualization detection models.




1 Introduction

As the old saying goes, a picture is worth a thousand words. Images are crucial in publications, especially in the visualization community. Specifically, images in visualization publications showcase visual designs, system frameworks, model details, experiment results, etc. Building a dataset of images from visualization publications contributes to the community in three respects. First, images contain rich information, including detailed visual designs, the co-occurrence of charts, color styles, etc. Analyzing images from visualization publications provides new perspectives for understanding trends in the field and discovering new research interests. In visual literature analysis, current methods mainly focus on four types of data, i.e., text, citations, authors, and metadata[stasko2013citevis, isenberg2016keyvis, ponsard2016paperquest, delest2004exploring, li2019galex, chuang2013topic], and leave visualization images largely unexplored [federico2016survey].

Secondly, visualization images are critical for cognitive studies, such as aesthetics and memorability, and can shed light on the design criteria [borkin2013makes, borkin2015beyond, li2018toward]. Therefore, with an image dataset from visualization publications, numerous studies on colors, data-ink ratio, feature integration theory[treisman1991search, treisman1980feature], and texton theory[julesz1981textons, julesz1984brief] can be supported.

Thirdly, such a dataset affords new opportunities for developing machine learning models for the visualization community. Existing studies [battle2018beagle, savva2011revision, siegel2016figureseer] have collected chart corpora and utilized computer vision algorithms for chart classification, recognition, and redesign. However, the models trained on these corpora are inapplicable to images in visualization publications for three reasons. First, the images in these corpora usually contain a single chart, whereas images in visualization publications often consist of multiple chart types with complex layouts. Second, because they lack proper annotations, such as bounding boxes of charts, existing corpora cannot be used for computer vision tasks such as object detection. Finally, existing corpora usually do not contain semantic information and thus provide limited support for multimodal applications [baltruvsaitis2018multimodal], such as generating visualizations from task descriptions.

In this paper, we aim to build a high-quality dataset containing a large number of images from visualization publications with rich annotations, including image captions, visualization types, and visualization locations in the images (see the teaser figure). With the dataset, we hope to open up a wide range of applications such as literature analysis, cognitive studies, and machine learning in the visualization community.

However, establishing the proposed dataset faces three major challenges. The first challenge is the diversity of visualizations. Researchers have proposed various taxonomies to define and categorize visualizations based on data types (e.g., spatial, temporal, hierarchical, etc.)[shneiderman1996eyes] or graphical representations (e.g., point, line, area, etc.)[borkin2013makes, roberts2000display]. However, these taxonomies do not fully cover the visualizations in the publications, which include variations of existing visualizations or novel glyphs (Fig. 1(A) and (B)). Second, the layout of visualizations is diverse in the publication images. For example, in both single-panel charts (Fig. 1(D)) and multiple-view visual analytics (VA) systems (Fig. 1(C)), different visualizations are organized and structured for analytical tasks. Identifying the boundaries between visualizations can therefore be ambiguous for images in academic publications. Third, the diversity of visualizations and the complexity of layouts mean that extensive visualization knowledge is required to understand and recognize visual representations in images. However, annotating a large number of images with limited expert resources is costly and impractical.

Figure 1: The cases with new designs and complex layouts. (A) shows the design of OpinionSeer[wu2010opinionseer], consisting of a triangle scatterplot, a donut chart, and a bar chart. (B) shows the pixel bar charts[Keim2002PixelBC], an idea of combining bar chart and heatmap. (C) shows the interface of TPFlow[liu2018tpflow], with a set of views of different visualizations. (D) shows the design by Lex et al.[lex2010comparative], with a combination of sankey diagram, heatmap, and tree.

To address the first challenge, we conduct a pilot study and iteratively build up a new taxonomy with 12 categories and 30 visualization subtypes. The taxonomy covers most of the visualizations in visualization publications while being feasible for annotation. For the second challenge, we set up a series of criteria that explicitly specify how to annotate the visualization types and draw bounding boxes for images with diverse graphics combinations and complex layouts. To address the third challenge, we adopt an annotation pipeline that involves both senior visualization practitioners and crowd workers to ensure the efficiency and quality of annotation. The pipeline starts with specifying visualization types by visualization practitioners. Then crowd workers annotate bounding boxes in the images. During the process, we adopt a series of measures, including the gold standard, majority voting, and sampling test for quality control. Our contributions are threefold:

  • A broad taxonomy including 12 categories and 30 visualization subtypes in visualization publications.

  • A large-scale dataset called VisImages containing 12,267 images collected from visualization publications. Each image is accompanied by a textual caption and contains the annotation of visualization types and bounding boxes. We make our dataset and all related tools for image data collection and processing publicly available.

  • Four cases demonstrating the usefulness of the dataset, i.e., investigating color evolution, analyzing the visualization preferences of researchers, discovering spatial distribution of each chart in VA systems, and developing a visualization detection model.

2 Related Work

This section introduces the related studies on visualization datasets and visualization literature analysis.

2.1 Image Datasets in Visualization

The visualization community has built a variety of image datasets and trained machine learning models for different purposes. ReVision[savva2011revision] incorporates the images from Prasad et al.[prasad2007classifying] and Huang et al.[huang2007system] and delivers a dataset containing single-chart images in 10 categories. The images are used to study the feature representation of bitmap charts and the extraction of marks and data for chart reconstruction. Similarly, Reverse-engineering Visualization[poco2017reverse] collects more than 5000 bitmap images from online resources and annotates the corresponding visualization types (e.g., area chart, bar chart, etc.) and the roles of the text content (e.g., chart title, axis label, etc.). The data are used to develop a pipeline for recognizing visual encodings and reconstructing visualizations with declarative grammars, such as Vega-Lite[satyanarayan2016vega]. Beagle[battle2018beagle] gathers SVG-based charts from the Internet, manually labels them under 24 visualization types, and trains classification models to analyze the chart type distribution on the web. Viziometrics[lee2017viziometrics] focuses on the classification of non-textual information in scientific publications, including equations, tables, photos, diagrams, and plots. VisImages are different from the aforementioned datasets in three aspects. First, VisImages focus on visualization publications, where the visualizations exhibit diverse designs with varying layouts. Second, our corpus contains rich annotations, including image captions, visualization types, and bounding boxes. Finally, VisImages support applying diverse machine learning models to visualization scenarios, such as visualization detection, visualization captioning, etc.

Several image corpora have been created for empirical studies. Borkin et al.[borkin2013makes] manually annotated single-chart visualizations from sources including infographics and news media and conducted user studies with the annotated visualizations to understand the memorability of visualizations. The results reveal that memorability is a critical measure of information utility. They further extended the dataset and formed MassVis[borkin2015beyond], a dataset with richer content about visualizations (e.g., annotations, axes, data, etc.) for memorability studies. Li et al.[li2018toward] conducted a user study using images from SciVis and discovered a correlation between memorability and both clutter and the number of distinct colors. Different from the corpora mentioned above, our corpus focuses on the images from InfoVis and VAST, providing new opportunities for cognition and perception research.

2.2 Visualization Literature Analysis & Datasets

Literature analysis is an important research area for indexing and understanding the publications. Current works mainly focus on the following four types of data: text, citations, authors, and metadata[federico2016survey]. In this paper, we focus on the studies in the visualization research community.

Many datasets (e.g., [fekete2004infovis, xie2016visualizing, plaisant2007promoting, cook2014vast, isenberg2016vispubdata]) have been proposed to support interactive literature analysis for visualization publications. The most up-to-date dataset is vispubdata.org [isenberg2016vispubdata], which contains metadata of publications in IEEE VIS sub-conferences, i.e., VIS, InfoVis, VAST, and SciVis. The publication data, including authors, references, keywords, etc., are collected from the electronic proceedings. A series of VA tools, such as CiteVis2, CiteMatrix, and VisList[isenberg2016vispubdata], have been built on this dataset. To assist researchers in conducting literature reviews, Ponsard et al. proposed PaperQuest[ponsard2016paperquest], a tool that searches for papers relevant to the user's interests. Several works[chuang2013topic, isenberg2016keyvis] also attempted to organize publications based on research topics. Chuang et al.[chuang2013topic] introduced a framework using topic modeling to analyze the InfoVis corpus. Isenberg et al. proposed KeyVis[isenberg2016keyvis], which extracts the keywords of visualization papers and presents an interactive interface for exploration. However, none of the above studies investigates images in the publications. VisImages serve as a complement to these works and provide a large corpus of images with rich annotations and semantic information, including visualization types, bounding boxes, and captions.

3 Dataset Construction

In this study, we focused on 2D static visualizations and collected the images from VAST (IEEE Conference on Visual Analytics Science & Technology) and InfoVis (IEEE Conference on Information Visualization). We excluded SciVis (IEEE Conference on Scientific Visualization) because a large number of its images depict the results of 3D rendering, which differ from 2D static visualizations. We began by downloading papers according to the paper list provided by [isenberg2016vispubdata]. Next, we used PDFFigures 2.0[clark2016pdffigures] to extract images and the corresponding captions from these papers. We focused on figures and tables indexed by "Fig." and "Table", as well as inline images without a caption. To ensure quality, we manually checked and revised the image bounding boxes and captions obtained by PDFFigures 2.0. As a result, we processed a total of 1,397 papers in VAST and InfoVis ranging from 1996 to 2018 and collected 12,267 images with 12,057 captions.

With the images and captions, we further specified the visualization types in the images, as well as their positions. However, recognizing the visualization types in an image from a VAST or InfoVis paper requires extensive domain knowledge, since some of them are combinations or variants of basic visualizations. For example, the visualization in Fig. 1(A) consists of a triangle scatterplot, a donut chart, and a bar chart; the pixel bar chart in Fig. 1(B) is quite different from typical bar charts. To ensure the quality of annotation, we adopted a construction pipeline with carefully designed tasks and cross-validation procedures. We first built a taxonomy for visualizations based on prior work[borkin2013makes] and several rounds of refinement. With the taxonomy, we further designed a lab study to annotate chart types and employed crowd workers for bounding box annotation. The taxonomy and annotation pipeline are introduced in Sections 4 and 5.

Figure 2: Distribution of the visualization subtypes. The internal bars show the numbers of images containing each subtype, and the external bars show the numbers of bounding boxes of each subtype.

4 Taxonomy

Categories    | Subtypes
--------------|---------------------------------------------------------------------
Area          | area chart, proportional area chart
Bar           | bar chart
Circle        | donut chart, pie chart, sector chart
Diagram       | flow diagram, chord diagram, sankey diagram
Statistic     | box plot, error bar, stripe graph
Grid          | matrix, table, small multiple
Line          | line chart, storyline, polar plot, parallel coordinate
Map           | map
Point         | scatter plot
Units & Glyph | heatmap, glyph-based visualization, unit visualization
Word          | word cloud
Tree & Graph  | graph, tree, treemap, hierarchical edge bundling, sunburst/icicle chart

Table 1: The taxonomy of VisImages.

Figure 3: The pipeline of data annotation. (A) shows the original image for annotation. (B) shows the process of visualization type annotation, in which three senior visualization practitioners make different selections of the visualization subtypes in the image. (C) shows the process of majority voting, in which visualization types that obtain fewer than two votes are removed. (D) shows the process of bounding box drawing, in which each crowd worker focuses on one type of visualization and draws the bounding boxes.

To identify the categories for annotation, we started from the taxonomy by Borkin et al.[borkin2013makes] that covers 12 categories (e.g., Bar, Line, Point, etc.) and 63 subtypes of visualization. However, their taxonomy was built on images in social and scientific domains (i.e., infographics, news media, scientific publications, and government & world organization), which are aesthetically pleasing. By contrast, visualizations in publications are complex in both visual representation and layout.

To build a taxonomy tailored to visualizations in publications, we refined the taxonomy over several rounds. Each round consisted of three steps, namely, annotation, discussion, and refinement. First, to validate the feasibility of the taxonomy, we annotated a small-scale corpus of images with it. We focused on recent papers, in which diverse and novel visualization categories increasingly emerge. We randomly selected 5% (40/816) of the VAST and InfoVis papers from 2010 to 2018 and obtained 394 images. Three authors of this paper annotated each image independently. Second, in the discussion step, we conducted internal discussions to identify challenges in the annotations. The authors reached a consensus on only 70% (274/394) of the images after discussion. From the discussion, we identified two problems: 1) some visualizations cannot be properly labeled using the existing taxonomy, and 2) many subtypes seem redundant, with only a small number of instances. Third, to address the challenges identified in the second step, we interviewed two senior visualization experts with more than six years of research experience. The experts started by suggesting missing subtypes to help categorize the unlabeled cases according to their domain knowledge. They also suggested merging subtypes with similar definitions. Based on their suggestions, we refined the taxonomy and obtained a new one. We repeated these rounds until the authors reached a consensus on all sampled images. The refinement process involved four operations: subtype addition, subtype merging, subtype deletion, and category adjustment.

  • Subtype Addition. We added visualization types according to expert knowledge. Motivated by recent publications[park2017atom, borgo2013glyph], the experts indicated that some visualizations, such as unit visualization and glyph-based visualization, should be included. In addition, a new subtype, sunburst/icicle chart, was split from the type treemap because it encodes hierarchical information differently.

  • Subtype Merging. We merged subtypes that are similar in definition. For example, slope graph was merged into parallel coordinate because a slope graph can be regarded as a simplified case of parallel coordinates. Similarly, the overlapped area chart, stacked area chart, and error band were merged into area chart.

  • Subtype Deletion. After merging, we further deleted subtypes with low frequency (e.g., vector graph, timeline, venn diagram, etc.) and rarely occurring subtypes (e.g., stem-and-leaf plot, text chart, etc.).

  • Category Adjustment. After the above operations, we adjusted the categories to better group the subtypes according to their geometric features. We added a new category called Units & Glyph, including heatmap, unit visualization, and glyph-based visualization. The category Grid consists of matrix, small multiple, and table, and we deleted the category Table.

As a result, we obtained a new taxonomy with 12 categories and 30 subtypes, as shown in Table 1.
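For readers who want to work with the taxonomy programmatically, the following is a minimal sketch that transcribes Table 1 into a Python mapping and checks the stated counts. The names `TAXONOMY`, `SUBTYPE_TO_CATEGORY`, and `category_of` are illustrative and are not part of any released VisImages tooling.

```python
# Table 1 transcribed as a mapping from category to its subtypes.
# The identifiers below are hypothetical helpers, not an official API.
TAXONOMY = {
    "Area": ["area chart", "proportional area chart"],
    "Bar": ["bar chart"],
    "Circle": ["donut chart", "pie chart", "sector chart"],
    "Diagram": ["flow diagram", "chord diagram", "sankey diagram"],
    "Statistic": ["box plot", "error bar", "stripe graph"],
    "Grid": ["matrix", "table", "small multiple"],
    "Line": ["line chart", "storyline", "polar plot", "parallel coordinate"],
    "Map": ["map"],
    "Point": ["scatter plot"],
    "Units & Glyph": ["heatmap", "glyph-based visualization", "unit visualization"],
    "Word": ["word cloud"],
    "Tree & Graph": [
        "graph",
        "tree",
        "treemap",
        "hierarchical edge bundling",
        "sunburst/icicle chart",
    ],
}

# Reverse lookup from subtype to category.
SUBTYPE_TO_CATEGORY = {
    subtype: category
    for category, subtypes in TAXONOMY.items()
    for subtype in subtypes
}


def category_of(subtype: str) -> str:
    """Return the category that a subtype belongs to (raises KeyError if unknown)."""
    return SUBTYPE_TO_CATEGORY[subtype]


if __name__ == "__main__":
    print(len(TAXONOMY), "categories,", len(SUBTYPE_TO_CATEGORY), "subtypes")
    print(category_of("treemap"))
```

A flat reverse lookup of this kind is convenient when annotation records store only the subtype and analyses need to be aggregated per category.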

5 Data Annotation

In this step, we assign two types of annotations for each image, i.e., visualization types and bounding boxes.

5.1 Visualization Type Annotation

Distinguishing visual representations and their variations is challenging and requires extensive knowledge in visualization. To ensure quality, we recruit senior visualization practitioners to annotate the subtypes.

Tasks. We designed a multi-label task for type annotation. We first developed an interface for the annotation tasks. The interface contained buttons for the 30 subtypes and a “submit” button. In addition, we provided an “others” category for exceptional cases. To help participants locate the buttons quickly, we organized the buttons by category. Each task included a visualization image and a multi-choice request: “please select all subtypes occurring in the image.” Each round contained 40 tasks and took 8 to 12 minutes to finish. We adopted “gold standards” and majority voting to ensure data quality. The “gold standards” aimed to make sure that the participants stayed focused on the tasks. We manually selected the “gold standards,” which are simple charts in prominent positions of the image. Each round contained eight tasks with “gold standards.” If a participant failed two or more “gold standards,” all results of this round were rejected, and our interface reassigned these images to other participants. In addition, we used majority voting to address ambiguity in the annotation. Each task was completed by three participants, and each selection of a subtype by a participant counted as one vote. For each image, the subtypes that gained at least two votes were accepted; the remaining subtypes were set aside for further discussion. Given the majority voting, the entire annotation process contained at least 12,267 images × 3 repetitions = 36,801 tasks.
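The two quality-control rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the interface's actual code: the function names are hypothetical, and we assume that a “gold standard” task is failed when the participant's selection does not include every expected subtype, which the text does not specify.

```python
from collections import Counter


def round_is_accepted(answers, gold, max_failures=1):
    """Check one round against its gold-standard tasks.

    answers: dict mapping task id -> set of subtypes the participant selected.
    gold:    dict mapping task id -> set of subtypes that must be selected.
    Assumption: a gold task fails if any expected subtype is missing.
    The round is rejected when two or more gold tasks fail.
    """
    failures = sum(
        1
        for task_id, expected in gold.items()
        if not expected <= answers.get(task_id, set())
    )
    return failures <= max_failures


def majority_vote(selections, min_votes=2):
    """Aggregate the selections of the three participants for one image.

    selections: list of sets of subtypes, one set per participant.
    Returns (accepted, suspended): subtypes with at least `min_votes` votes,
    and subtypes with fewer votes that are left for further discussion.
    """
    votes = Counter(subtype for selected in selections for subtype in set(selected))
    accepted = {subtype for subtype, n in votes.items() if n >= min_votes}
    suspended = set(votes) - accepted
    return accepted, suspended


if __name__ == "__main__":
    accepted, suspended = majority_vote(
        [{"bar chart", "tree"}, {"bar chart"}, {"bar chart", "map", "tree"}]
    )
    print(sorted(accepted), sorted(suspended))
```

In this example, bar chart and tree each receive at least two votes and are accepted, while map receives a single vote and is set aside.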


Participants. We recruited 25 participants, including one senior visualization expert with six years of experience in visualization research; 13 Ph.D. candidates whose research interests include information visualization, visual analytics, etc.; seven master's students majoring in information visualization; and four undergraduate students who had taken an undergraduate course on data visualization. Most of them (15/25) had publications accepted by IEEE VAST or InfoVis.

Procedure. The annotation procedure consisted of a training workshop and a formal study. The training workshop aimed to familiarize participants with the annotation tasks. During the workshop, we introduced the definition of each visualization type with examples. After that, participants took a test to ensure that they correctly understood the definitions in the taxonomy. The test contained 20 tasks covering all visualization types. Participants were considered eligible only after they successfully answered more than 18 tasks. After passing the training workshop, each participant started the formal study and was asked to finish 40 rounds of tasks. In total, each participant was assigned 40 tasks × 40 rounds = 1,600 tasks. To avoid overloading participants, we allowed them to complete all tasks within five days. We paid $0.05 for each accepted task.

Results. Finally, we found that […] out of 12,267 images were categorized into the taxonomy. The numbers of images containing each subtype are shown in the internal bars in Fig. 2. We observed that the five subtypes contained in the most images are bar chart, scatterplot, graph, line chart, and table.

Figure 4: (A), (B), (C), and (D) show different cases of the criteria for bounding box drawing. (E) shows how to compute the IoU of two bounding boxes. (F) shows the computation of recall, precision, and F1 score with a case, where red boxes represent the boxes annotated by a crowd worker and dotted boxes represent the ground truth. In this case, the red dotted boxes are true positives, the dotted boxes are condition positives, and the red boxes are annotated positives.

5.2 Bounding Box Annotation

Figure 5: Horizon charts showing the evolution of each visualization type over time.
Figure 6: Comparison of the distribution of visualization categories from visualization publications, scientific publications, infographics, news media, government, and world organizations. The data of the sources in light blue comes from the work of Borkin et al.[borkin2013makes].

Based on the visualization types annotated by senior visualization practitioners, we further focused on bounding box annotation to specify the position of the visualizations in the images. We employed the crowd from a data annotation company, whose workers are well-trained for similar tasks.

Tasks. The challenge of drawing bounding boxes in our scenario is to recognize various visual representations and their variations. To reduce the mental load, each crowd worker was guided to focus on only one subtype. Each task included an image containing this specific subtype and a request to draw bounding boxes around all visualizations of this subtype. The basic rule was that each bounding box should cover only one object of this subtype. We established a set of more detailed criteria, which are introduced in the following paragraph. During annotation, sampling tests were adopted to ensure the quality of the workers' tasks. We paid $0.03 for each accepted bounding box.

Criteria. The key to the annotation criteria is to specify which contents of a visualization a bounding box should cover. Our criteria are based on the layout of the visualizations, i.e., whether or not a visualization has a coordinate system. For a visualization with a coordinate system, the bounding box should cover all the components of the coordinate system, e.g., axis names, axis labels, the chart title, and legends if they are close to the visual representations (Fig. 4(B)). If more than one subtype shares a coordinate system (e.g., error bar & bar chart in Fig. 4(B)), their bounding boxes cover the same area. For visualizations without a coordinate system, we distinguish two situations, i.e., 1) independent visualizations without any connection to or overlap with other visualizations and 2) visualizations connected to or overlapping with other visualizations. In the first case, the contents are the visualization itself (Fig. 4(A1)). In the second case, we only focus on the contents of the requested subtype. For example, the tree in Fig. 4(C) is connected to the sankey diagram, and the word cloud in Fig. 4(D) is overlaid on the area charts. The bounding boxes only cover the contents of the tree and the word cloud, respectively. In addition to the aforementioned criteria, one exception requires further specification. Some visualizations contain multiple smaller visualizations of an identical subtype (e.g., the donut charts in the map in Fig. 4(A2)). In this case, we annotate them as a whole with a single bounding box.

Quality Measurement & Control. We defined bounding box correctness and task correctness to measure the quality of annotation. The correctness of a bounding box was measured by its intersection over union (IoU, whose definition is illustrated in Fig. 4(E)) with the ground-truth bounding box. A bounding box was accepted only when its IoU was higher than 0.9. In addition, the quality of a series of tasks was measured by the F1 score (Fig. 4(F3)), a metric balancing recall (Fig. 4(F1)) and precision (Fig. 4(F2)).
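The measures above can be illustrated with a short Python sketch. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`, takes the task-level score to be the usual F1 score, and uses a simple greedy one-to-one matching between annotated and ground-truth boxes; the matching strategy and function names are our own assumptions, since the text does not describe them.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(annotated, truth, threshold=0.9):
    """Score a set of annotated boxes against ground-truth boxes.

    An annotated box is a true positive if its IoU with a still-unmatched
    ground-truth box is higher than `threshold` (greedy matching, assumed).
    """
    unmatched = list(truth)
    true_positives = 0
    for box in annotated:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) > threshold:
            true_positives += 1
            unmatched.remove(best)
    precision = true_positives / len(annotated) if annotated else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    annotated = [(0, 0, 10, 10), (50, 50, 60, 60), (20, 20, 30, 30.5)]
    print(precision_recall_f1(annotated, truth))
```

In the usage example, two of the three annotated boxes match a ground-truth box, so precision is 2/3, recall is 1, and the F1 score is 0.8.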

For quality control, we adopted sampling tests at both the batch level and the worker level. We divided the 12,267 images equally into five batches and performed annotation batch by batch. The batch-level sampling test was performed after a batch of annotations was completed. We randomly sampled 10% of the results and evaluated the F1 score. If the score was not higher than 95%, the whole batch of annotations was rejected and annotated again until the score exceeded 95%. The worker-level sampling test was conducted during a batch of annotations: 15% of a worker's annotations were randomly sampled for score evaluation. If the score was not higher than 95%, all finished tasks of this worker in this batch were rejected and annotated again. For workers who failed the sampling test, the sampling rate increased by 5% at the next test.
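The sampling tests can be sketched as follows. This is a simplified illustration: `score_fn` is a hypothetical callback standing in for the score evaluation on the sampled results, and the rounding of the sample size is our assumption.

```python
import random


def sampling_test(results, score_fn, rate, threshold=0.95, rng=None):
    """Randomly sample a fraction `rate` of `results` and test their score.

    Returns True if the score of the sample is higher than `threshold`.
    `score_fn` is a hypothetical callback that scores a list of results.
    """
    if rng is None:
        rng = random.Random()
    sample_size = max(1, round(len(results) * rate))
    sample = rng.sample(results, sample_size)
    return score_fn(sample) > threshold


def next_worker_rate(rate, passed, step=0.05):
    """Raise a worker's sampling rate by `step` after a failed test."""
    return rate if passed else min(1.0, rate + step)


if __name__ == "__main__":
    # Toy data: each result is 1 if the annotation is correct and 0 otherwise,
    # and the score is simply the share of correct annotations in the sample.
    share_correct = lambda sample: sum(sample) / len(sample)
    batch = [1] * 100
    print(sampling_test(batch, share_correct, rate=0.10, rng=random.Random(0)))
    print(next_worker_rate(0.15, passed=False))
```

A batch-level test would call `sampling_test` with `rate=0.10` on a finished batch, while a worker-level test would start at `rate=0.15` and update the rate with `next_worker_rate` after each test.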

Procedure. The annotation procedure consisted of a training session and formal annotation. During the training session, each crowd worker was assigned a subtype. The definition, examples, and annotation criteria were introduced to the crowd workers. The training session also included a test annotation, whose pipeline was the same as that of the formal annotation. Through the test annotation with the first batch of images, the crowd workers familiarized themselves with the subtypes and criteria. After the training session, the crowd workers proceeded to the formal annotation of the last four batches of images.

6 VisImages

In this section, we first give an overview of VisImages. Next, we conduct a comparative analysis with other sources to show the distribution of visualizations in VisImages. Finally, we revisit the taxonomy and gain insights into the subtypes that are easily confused during annotation.

6.1 Overview of the Data

Through data annotation, we obtain a dataset containing 12,267 images from VAST and InfoVis with 12,057 captions and 35,096 bounding boxes. Fig. 2(A) shows the number of images (internal bars) and bounding boxes (external bars) of each subtype. We find that for some subtypes, such as table and flow diagram, the height of the internal bar is close to the height of the external bar, which means that the number of images containing these subtypes is similar to the number of visualizations of these subtypes. This is because, in visualization publications, these subtypes usually occupy the entire image. By contrast, for some subtypes, e.g., bar chart, scatterplot, and graph, the external bars are about twice as high as the internal bars. The images usually contain more than one instance of these subtypes because they are basic charts that commonly serve as components of small multiples (e.g., a scatterplot matrix).

To analyze the evolution of each visualization type, we count the number of bounding boxes for each type across years and depict the distribution with horizon charts (Fig. 5). The horizon charts are vertically aligned according to years and horizontally aligned with the same height of 50. The darker the color, the larger the number of visualizations.
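The counting step behind such a chart can be sketched as below. The input format (one `(year, subtype)` pair per bounding box) is assumed for illustration, and `horizon_bands` reflects only one possible reading of the "height of 50", namely that each band of a horizon chart spans a count of 50 and that additional bands are drawn in darker colors.

```python
from collections import Counter, defaultdict


def boxes_per_year(annotations):
    """Count bounding boxes per subtype and year.

    annotations: iterable of (year, subtype) pairs, one per bounding box
    (an assumed input format, not the released annotation schema).
    Returns a dict mapping subtype -> Counter of year -> number of boxes.
    """
    counts = defaultdict(Counter)
    for year, subtype in annotations:
        counts[subtype][year] += 1
    return counts


def horizon_bands(count, band_height=50):
    """Split a count into (number of full bands, remainder in the top band).

    Assumption: each horizon-chart band spans `band_height` boxes.
    """
    return divmod(count, band_height)


if __name__ == "__main__":
    toy = [(2016, "map"), (2016, "map"), (2015, "map"), (2016, "tree")]
    counts = boxes_per_year(toy)
    print(dict(counts["map"]))
    print(horizon_bands(120))
```

Under this reading, a count of 120 would be drawn as two full bands plus a third, darker band filled to a height of 20.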

As the number of papers increases, many visualization types appear increasingly often, such as bar chart, area chart, scatterplot, and table (Fig. 5(A)). We notice that the dark area for graph is distributed evenly across years (Fig. 5(C)), because graph visualization has long been a hot research topic in the visualization community. Besides, we observe that the dark area of treemap becomes larger in 2005 (Fig. 5(F)), while the dark area of tree has grown since 2003 (Fig. 5(E)). This indicates that treemap became popular after the rise in popularity of tree. In addition, the number of error bars increases after 2006, as indicated by Fig. 5(B). This is because VAST was established in 2006, which boosted the development of visual analytics systems. As a result, more error bars are adopted in user studies to evaluate the usefulness and effectiveness of the systems. Moreover, the type map reaches its largest number in 2016 (the darkest area in Fig. 5(D)) because of the popularity of urban visualization, reflected in the fact that a dedicated VAST session, “Traffic and Urban Planning,” was held that year.

6.2 Comparison with Visualizations from Other Sources

We characterize the distribution of VisImages in comparison to four corpora described in Borkin et al.[borkin2013makes], i.e., scientific publications, infographics, news media, and government & world organization. To compare the above corpora under the same metric, we categorize the types in VisImages according to the taxonomy described in Borkin et al.[borkin2013makes]. We include an “Others” category to classify visualizations beyond the scope of the original taxonomy. In Fig. 6, we notice that the distribution of visualization publications is more balanced than that of the other sources. Tree and Networks occupy the largest share in visualization publications but do not appear frequently in other sources. Since a considerable amount of research in our community focuses on presenting data with complex relationships, trees and networks are frequently employed. On the other hand, news media and government & world organizations prefer basic visual representations such as Bars, Table, and Lines because they target a general audience. Scientific papers prefer Diagrams, Lines, and Points for the presentation of methodology and experiment results. We notice that Text accounts for a noticeable portion in visualization publications but rarely appears in other sources. In the original taxonomy, the category Text includes word cloud, word tree, and phrase net. These visualizations are not frequently used in other sources. By contrast, much visualization research has investigated variations of the word cloud to make it more informative and effective, such as ManiWordle[koh2010maniwordle] and dynamic word cloud[cui2010context].

6.3 Taxonomy Revisiting

Figure 7: Confusion matrix between different visualization types. The white lines between the cells separate the matrix into blocks of different categories. The HEB and PAC are the abbreviations of hierarchical edge bundling and proportional area chart, respectively. (A), (B), and (C) indicate the cells of similar definitions, similar geometry, and high co-occurrence, respectively.
Figure 8: Case 1: the evolution of color used in VAST and InfoVis over the years. (A) shows the stream graph of the colors. The streams in (A2) show the interval of 1997-2009, in which purple and red are frequently used, and the streams in (A3) show the interval of 2009-2018, in which they are less used. The bars of blue and yellow are shown in (A1) and (A4), respectively. Image examples containing the colors of (A1), (A2)&(A3), and (A4) are shown in (B), (C), and (D), respectively.

We borrow the idea of the confusion matrix to revisit our taxonomy and analyze the confusion between different subtypes. We define a confusion score between each pair of subtypes based on the following intuition. Taking donut chart and pie chart as an example, a donut chart might be mistakenly recognized as a pie chart in some cases. Therefore, the two subtypes tend to be selected by different participants among the three who annotate an image, and majority voting will keep one and reject the other. In other words, if two subtypes tend to be confused with each other, the probability that both are selected for the same image but one is rejected is high. The confusion score builds on this observation: for two different subtypes, it is computed from the set of selected types and the set of rejected types after majority voting.

The confusion matrix is depicted as a heatmap in Fig. 7. The darker a cell’s color, the higher the confusion score between the corresponding subtypes. By analyzing cells with high confusion scores, we derive three insights into the relationships among chart types.
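The following sketch illustrates one way a score of this kind could be computed from the selected and rejected sets. It is a possible formalization of the intuition only: the normalization (dividing by the number of images in which both subtypes were chosen by at least one annotator), the record layout, and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def confusion_scores(images):
    """Compute a symmetric confusion score for pairs of subtypes.

    `images` is a list of (selected, rejected) pairs of sets: the subtypes
    kept and the subtypes discarded by majority voting for one image.
    Assumed normalization: divide by the number of images in which both
    subtypes were chosen by at least one annotator.
    """
    confused = Counter()  # one subtype kept, the other rejected
    together = Counter()  # both subtypes chosen by someone
    for selected, rejected in images:
        for a, b in product(selected, rejected):
            if a != b:
                confused[frozenset((a, b))] += 1
        chosen = sorted(selected | rejected)
        for i, a in enumerate(chosen):
            for b in chosen[i + 1:]:
                together[frozenset((a, b))] += 1
    return {pair: confused[pair] / together[pair] for pair in confused}


# Toy annotations, not real VisImages data.
toy_images = [
    ({"pie chart"}, {"donut chart"}),
    ({"donut chart"}, {"pie chart"}),
    ({"pie chart", "donut chart"}, set()),
    ({"bar chart"}, {"pie chart"}),
]
scores = confusion_scores(toy_images)
```

In the toy data, pie chart and donut chart are chosen together in three images and split by majority voting in two of them, so their score under this assumed normalization is 2/3.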

Similar Definition. Some types are similar in definition (Fig. 7(A)). For example, small multiple and matrix are both based on grids, and their difference lies in the complexity of the grid elements: the grid cells of a small multiple are usually composed of complex visualizations, while the cells of a matrix are usually simple shapes. Another example is tree and graph, where a tree is a special case of a graph. Due to the similarity in definition, participants may get confused when distinguishing between the two visualizations.

Similar Geometry. Some types are similar in visual representation (Fig. 7(B)). For example, pie chart and sunburst both use radial layouts, but sunburst focuses on hierarchical data, while pie chart focuses on proportional sectors. Another example is sankey diagram and area chart. If participants are not paying close attention, they might make mistakes in recognizing subtypes with similar geometry.

High Co-occurrence. Some subtypes have high co-occurrence within a single visualization (Fig. 7(C)). During annotation, the participants might focus on the more obvious one and overlook the others. For example, when matrix and heatmap occur at the same time, they serve as different features of a visualization: the matrix is the layout, while the heatmap is the visual encoding. However, the participants might be attracted by the color pattern of the heatmap and fail to notice the grid layout. Another example is hierarchical edge bundling and chord diagram, where hierarchical edge bundling focuses on the bundling technique, but chord diagram focuses on the radial layout. The participants tend to label the chord diagram and miss the bundled edges.

From the above observations, we report two reflections for future annotation. First, we can notify participants when they select easily confused subtypes, especially those with similar geometry and high co-occurrence. Second, to ensure annotation quality, we can add more rounds of tasks for easily confused subtypes.

7 Use Cases

In this section, we introduce four use cases. Case 1 shows the usage of color information gained from the images; Case 2 presents how different chart types are preferred by scholars; Case 3 shows how we analyze the spatial distribution of different charts in visual analytic systems using captions and bounding boxes; Case 4 shows how the bounding boxes help apply machine learning models in the visualization community.

7.1 Color Evolution in VAST & InfoVis

Our goal is to analyze how color usage has evolved in VAST & InfoVis papers. The images are represented in the CIELAB color space, in which changes in color values match changes in human perception. The CIELAB color space is composed of three dimensions, i.e., the lightness L*, the green-red component a*, and the blue-yellow component b*. We divide each dimension into five bins and obtain a discrete color space with 5 × 5 × 5 = 125 color values. For an image, we count the number of pixels of each color value and obtain a histogram with 125 color bins. Afterward, for each year, we sum up the color histograms of the images by color value and normalize them by the number of pixels. Most of the colors in the images are grayscale (a* and b* close to zero). Therefore, we remove the grayscale colors and visualize the evolution of the chromatic colors with a stream graph and a bar chart summarizing the color distribution (Fig. 8).
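The binning and normalization described above can be sketched as follows. This is a minimal illustration operating on pixels that are already converted to CIELAB. The value ranges (L* in [0, 100], a* and b* in [-128, 127]), the treatment of the middle a*/b* bin as grayscale, and all function names are assumptions for illustration.

```python
def lab_bin_index(L, a, b, bins=5):
    """Map one CIELAB pixel to one of bins**3 discrete color values.

    Assumed value ranges: L in [0, 100], a and b in [-128, 127].
    """
    def bin_of(value, low, high):
        idx = int((value - low) / (high - low) * bins)
        return min(max(idx, 0), bins - 1)  # clamp the upper edge into the last bin

    l_bin = bin_of(L, 0.0, 100.0)
    a_bin = bin_of(a, -128.0, 127.0)
    b_bin = bin_of(b, -128.0, 127.0)
    return (l_bin * bins + a_bin) * bins + b_bin


def color_histogram(pixels, bins=5):
    """Count the pixels of one image per discrete color value."""
    hist = [0] * bins ** 3
    for L, a, b in pixels:
        hist[lab_bin_index(L, a, b, bins)] += 1
    return hist


def yearly_distribution(image_histograms):
    """Sum the per-image histograms of one year and normalize by pixel count."""
    total = [sum(column) for column in zip(*image_histograms)]
    n = sum(total)
    return [c / n for c in total] if n else total


def is_grayscale_bin(index, bins=5):
    """True if both chromatic components fall in the middle (near-zero) bin."""
    a_bin = (index // bins) % bins
    b_bin = index % bins
    return a_bin == bins // 2 and b_bin == bins // 2
```

Dropping the indices for which `is_grayscale_bin` is true leaves the chromatic bins whose yearly proportions feed the stream graph.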

We notice that purple and red are used more frequently before 2009 (Fig. 8(A2)) than after (Fig. 8(A3)). Some examples that contribute the most purple and red pixels are shown in Fig. 8(C). Purple serves as the background color in some charts, while red is used heavily in matrix, map, and parallel coordinates, giving users a strong visual impact. From the distribution of the colors, the two most popular colors are yellow (Fig. 8(A4)) and blue (Fig. 8(A1)). Yellow is often used as the background of maps (Fig. 8(D1)) or in heatmaps to encode the lower values of an attribute (Fig. 8(D2)). Blue is used to present graphical elements in charts, such as the bars of bar charts (Fig. 8(B1)) and the sea in maps (Fig. 8(B2)).

7.2 Visualization Preference of the Top Researchers

Figure 9: Case 2: the evolution of the visualization preference of distinguished authors. The top three authors with the most publications in InfoVis and VAST are presented in (A) and (B), respectively. The bar charts on the right of the stream graphs show the distribution of different visualization types.

This case explores the visualization preference of top researchers in the visualization community. We analyze the researchers who have published the most papers in InfoVis (from 2004 to 2018) and VAST (from 2006 to 2018). For each researcher, we obtain the chart distribution annually and illustrate the evolution of the distribution in Fig. 9.

Fig. 9(A) shows the comparison between three distinguished researchers in InfoVis, i.e., Hanspeter Pfister (22 papers, same below), Jeffrey Heer (21), and Sheelagh Carpendale (19). For Pfister, flow diagram and graph account for a large portion of the visualizations in 2013 (Fig. 9(A1)), because Entourage[lex2013entourage] contains numerous flow diagrams and graphs to visualize the relationships between biological pathways. Besides, the subtypes matrix and heatmap occupy a larger share in 2017 (Fig. 9(A2)) because HiPiler[lekschas2017hipiler] and LSTMVis[strobelt2017lstmvis] contain numerous matrices and heatmaps. From the bar charts on the right, we discover that the distributions of Pfister (Fig. 9(A3)) and Heer (Fig. 9(A4)) have multiple peaks, while that of Carpendale (Fig. 9(A5)) has only one peak. The reason is that Pfister develops many applications[strobelt2015vials, dinkla2016screenit, strobelt2017lstmvis], in which multiple visualizations are adopted to facilitate coordinated analysis. For Heer, one of his research interests is visualization grammars, such as Vega-Lite[satyanarayan2016vega] and Reactive Vega[satyanarayan2015reactive], in which various visualizations are used as cases to demonstrate the usefulness of the grammar. Carpendale, on the other hand, mostly focuses on design studies[carpendale1996distortion, carpendale2004phyllotrees], in which bar charts are used to present the experiment results, as reflected in the single peak in Fig. 9(A5).

Fig. 9(B) shows the chart preference of Daniel A. Keim (23), Huamin Qu (18), and William Ribarsky (17) in VAST. The stream of Qu is the shortest because Qu started publishing VAST papers later than Keim and Ribarsky, yet Qu caught up quickly and reached a competitive number of publications. The pink and purple areas of Qu (Fig. 9(B2)) and Ribarsky (Fig. 9(B4)) are larger than those of Keim (Fig. 9(B1)), indicating that Qu and Ribarsky use bar charts and area charts more than Keim does. By looking up the papers, we discover that Ribarsky uses many bar charts and area charts (Fig. 9(B5)) in visual analytic systems such as DemographicVis[dou2015demographicvis], VAiRoma[cho2015vairoma], HierarchicalTopics[dou2013hierarchicaltopics], and NewsLab[ghoniem2007newslab]. Similarly, Qu employs plenty of bar charts in SmartAdp[liu2016smartadp], DropoutSeer[chen2016dropoutseer], and NameClarifier[shen2016nameclarifier], and adopts many area charts in iForum[fu2016visual] and VisMatchmaker[law2016vismatchmaker]. In contrast, Keim prefers scatterplot and heatmap (Fig. 9(B3)), which occupy a large portion over the years. The reason is that one of Keim’s research interests is high-dimensional data[sacha2016visual, jackle2015temporal], for which scatterplots are commonly used to visualize the results of dimension reduction. Keim is also interested in sports analytics[sacha2014feature, stein2017bring], where heatmaps are frequently used to present spatio-temporal data.

7.3 Spatial distribution of each type in VA systems

Figure 10: Case 3: heatmaps showing the position distribution of the visualization in visual analytic systems.

To understand how visualizations are spatially distributed in VA systems, we obtain the images of VA systems and analyze the visualization positions. To collect interface images of VA systems, we use a support vector machine (SVM) to classify the captions describing VA systems. To build the training data for the SVM, we first search for captions containing “interface” and “system overview,” and manually verify whether the corresponding images are system interfaces. If so, we assign the label “interface” to these captions. In addition, we randomly select captions that are not about system interfaces and label them as “others.” In total, we obtain 250 “others” captions and 251 “interface” captions. We then adopt term frequency-inverse document frequency (TF-IDF) to vectorize the captions and conduct binary classification with the SVM on all captions. We manually verify the predicted results and obtain 332 images of system interfaces. Finally, we aggregate the bounding boxes of each chart type according to their relative positions in the images. Each image is scaled to the same size so that the bounding boxes can be aligned accordingly. Then we draw the aligned bounding boxes on the same canvas and obtain the heatmaps (Fig. 10). The brightness indicates the spatial density of each chart type in VA systems.
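The aggregation of bounding boxes onto a common canvas can be sketched as follows. This is a minimal illustration, assuming each box comes with the pixel size of its source image; the record layout, the grid resolution, and the function name are hypothetical.

```python
import math


def accumulate_boxes(boxes, grid=20):
    """Accumulate bounding boxes from differently sized images on one canvas.

    `boxes` holds (x0, y0, x1, y1, width, height) tuples: pixel corners of a
    box plus the size of its source image. This record layout is assumed.
    Returns a grid x grid matrix counting how many boxes cover each cell.
    """
    canvas = [[0] * grid for _ in range(grid)]
    for x0, y0, x1, y1, width, height in boxes:
        # Rescale to relative positions so boxes from different images align.
        col0 = max(0, math.floor(x0 / width * grid))
        col1 = min(grid, math.ceil(x1 / width * grid))
        row0 = max(0, math.floor(y0 / height * grid))
        row1 = min(grid, math.ceil(y1 / height * grid))
        for r in range(row0, row1):
            for c in range(col0, col1):
                canvas[r][c] += 1
    return canvas
```

Running this once per chart type yields one density matrix per type, in which larger counts correspond to brighter heatmap areas.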

In Fig. 10, the upper center areas are bright in the plots of scatterplot, glyph-based visualization, graph, map, and heatmap, which means that these categories tend to be placed at the upper center of VA systems, the most eye-catching area. The scatterplot is usually used in projection views to show the results of dimension reduction or clustering, while heatmaps are used in density views to show the distribution of the samples, serving as an important part of VA systems. Besides, many applications focus on multi-variable data, spatial data, and network data and develop methods based on glyph-based visualization, map, and graph. Therefore, these visualizations tend to be placed in the dominant position, serving as the main views. The bright area of bar charts covers most of the heatmap, indicating that they can be placed at any position in a VA system. The reason is that the bar chart is the most commonly used visualization (see the ranking in Fig. 2(A)), adopted by both main views and supporting views. Interestingly, compared to the bar chart, the line chart, another basic and commonly used chart type, is generally placed at the top in a long stripe area. Because line charts are commonly used to present an overview from the temporal aspect, they are usually placed at the top as the starting point of analysis. We also notice that matrix, parallel coordinates, and table are distributed close to the periphery of VA systems. We infer that for VA systems not targeting matrix visualization, the matrices usually serve as a reference for the analysis (e.g., a confusion matrix). Similarly, tables are usually used to show raw data. In addition, parallel coordinates are borrowed to support filtering of multi-dimensional data. Therefore, these visualizations are commonly placed in subordinate positions in VA systems.

7.4 Faster R-CNN on VisImages

Figure 11: Case 4: results of visualization detection with Faster R-CNN. The red boxes are the predictions of the model, and the blue boxes indicate positions where the model failed to produce a prediction.

Owing to its large scale, VisImages can be used as a benchmark for training and evaluating machine learning models in the visualization community. In this section, we train Faster R-CNN[ren2015faster], one of the most popular object detection models, to localize and recognize different visualizations in images. However, our task differs slightly from typical object detection in the number of predicted categories. Object detection models usually categorize each bounding box into exactly one of many classes. In our case, a chart can be labeled with multiple subtypes, for example, error bar and bar chart in Fig. 4(B). To handle this, we train a separate Faster R-CNN for each subtype and adopt cross-subtype merging on the bounding boxes to obtain multiple labels. We choose the subtypes with more than 300 image samples and obtain 14 subtypes. For each subtype, 85% of the images are used for training and validation, and 15% for testing. All models are trained by stochastic gradient descent (SGD) for 15k mini-batches, with a momentum of 0.9 and weight decay. The IoU threshold for cross-subtype bounding box merging is set to 0.7.
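Cross-subtype merging can be sketched as follows. This is a minimal illustration of one possible strategy: the greedy grouping, the use of the first box of a group as its reference, and the function names are assumptions, and only the IoU threshold of 0.7 comes from the setup above.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_across_subtypes(detections, threshold=0.7):
    """Greedily merge boxes predicted by different per-subtype detectors.

    `detections` is a list of (box, subtype) pairs. A detection joins the
    first existing group whose reference box overlaps it with
    IoU >= threshold; otherwise it starts a new group (assumed strategy).
    """
    merged = []  # list of [reference_box, set_of_labels]
    for box, subtype in detections:
        for group in merged:
            if iou(group[0], box) >= threshold:
                group[1].add(subtype)
                break
        else:
            merged.append([box, {subtype}])
    return [(box, sorted(labels)) for box, labels in merged]


# Toy detections, not real model output.
toy = [((0, 0, 100, 100), "bar chart"),
       ((2, 0, 100, 100), "error bar"),
       ((200, 200, 300, 300), "scatterplot")]
print(merge_across_subtypes(toy))
```

In the toy input, the first two boxes overlap with an IoU of 0.98, so they collapse into one box carrying both labels, while the third box stays on its own.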

We select some images with multiple chart types and run the models on them. The results are shown in Fig. 11. From the results, we notice some failure cases with interesting patterns. In Fig. 11(A), (B), and (E), the model correctly localizes and recognizes the bar charts. However, the model misses the cases in Fig. 11(C2) and (D6), where we use blue boxes to indicate the expected positions of the bounding boxes, although some similar cases are recognized properly (Fig. 11(C1) and (D2)). The model does not seem robust to bar charts with imbalanced distributions. In addition, the model includes an area chart as part of the bar chart during region proposal in Fig. 11(D4), where the area chart looks like a bar of the bar chart. This suggests that the model fails to distinguish bars with different edge shapes. The bounding boxes in Fig. 11(D1), (D3), and (D5) are all recognized as scatterplots. However, the charts in Fig. 11(D3) and (D5) are a matrix and a bar chart, respectively. Graphical elements that resemble points may confuse the model. From the cases above, the models seem to focus excessively on low-level visual features (e.g., the overall distribution of the bars, the edge shape of a bar, and point shapes). These failure cases reveal an opportunity for researchers from machine learning and visualization to work jointly on adapting machine learning models to visualization scenarios.

8 Discussion

VisImages contains fine-grained annotations of various types of information, i.e., visual information of images, semantic information of chart types, and textual information of captions. Considering its rich annotations, our corpus is the largest human-annotated image dataset in the visualization community.

Opportunities for Machine Learning with VisImages. Our corpus can serve as a benchmark for a wide range of machine learning tasks, such as visualization detection, recommendation, and captioning, providing opportunities to develop and apply intelligent models in visualization scenarios. For example, image captioning is the task of generating a textual description from an image. Image captioning models trained on VisImages can help machines understand and explain visualizations with human-readable text. Another example is training visualization detection models for parsing the interfaces of visual analytics systems. Well-performing models trained on VisImages can contribute to scenarios such as automatic layout generation and recommendation.

Benefits to Empirical Studies. VisImages enables large-scale empirical studies with a collection of high-quality and diverse visual representations from the top venues in visualization. For example, a study on human memorability of professional information visualizations can be conducted by following the pipeline of Borkin et al.[borkin2013makes]. With such a study, we might obtain criteria for designing novel visualizations with better memorability.

Limitations. Though we have demonstrated the significance and usefulness of VisImages, it still has limitations. First, although “gold standards” and majority voting are adopted to control the quality, mislabeling is inevitable due to the challenge of recognizing various charts and their variations. We hope to invite the visualization community, especially the authors of the publications, to post requests on our website to correct the mislabeled visualizations. The website contains an interactive interface for exploring the annotations, along with a form for reporting problems. Second, we revise the taxonomy based on the visualizations in InfoVis and VAST, leaving some important categories uncovered, such as volume rendering. Building a taxonomy that includes all visualizations seems impossible, because our community is developing rapidly and more and more novel visual designs emerge. To facilitate a broader range of annotation, we will design taxonomies for other sources and scenarios. Third, our corpus contains three types of annotation (i.e., image captions, visualization types, and positions), leaving a wealth of information unexplored, such as the texts in visualizations (e.g., x ticks, y ticks, and axis names). We argue that more annotations can be conducted with the help of VisImages. Finally, our annotation pipeline relies heavily on human resources, especially senior visualization practitioners, and therefore does not scale as the number of images increases. A potential solution is to use deep learning models trained on VisImages to conduct annotation automatically. We plan to explore this direction in the future.

9 Conclusion and Future Work

We have presented VisImages, a high-quality and large-scale dataset containing images from the visualization research community. To facilitate the annotation, we have proposed a new taxonomy tailored to visual representations in publications. The types of visualizations were annotated by senior visualization practitioners, and the bounding boxes were drawn by crowd workers. Besides, we adopted several measures for quality control, such as “gold standards,” majority voting, and sampling tests. The usefulness and significance of VisImages are demonstrated by four cases, including visual literature analysis and building machine learning models for the visualization community. To benefit our community, the dataset and all related tools for image data collection and processing are publicly available.

The current version of VisImages is the first step of a long-term project, and we plan to continuously improve VisImages in the following three aspects. First, we will expand VisImages to cover images from other important but unexplored conferences and journals, such as EuroVis and TVCG. As a result, our corpus will expand dramatically, which poses challenges to the annotation process. Second, we plan to develop a pipeline combining both human and machine intelligence to scale up the annotation process. Specifically, we plan to train a chart classification model to reduce the need for visualization experts. Lastly, we will gradually refine and design new taxonomies to meet the growing diversity of visualization designs. For example, we will build a taxonomy to include visual representations in SciVis papers.