Learning Visual Importance for Graphic Designs and Data Visualizations

08/08/2017 ∙ by Zoya Bylinskii, et al. ∙ 0

Knowing where people look and click on visual designs can provide clues about how the designs are perceived, and where the most important or relevant content lies. The most important content of a visual design can be used for effective summarization or to facilitate retrieval from a database. We present automated models that predict the relative importance of different elements in data visualizations and graphic designs. Our models are neural networks trained on human clicks and importance annotations on hundreds of designs. We collected a new dataset of crowdsourced importance, and analyzed the predictions of our models with respect to ground truth importance and human eye movements. We demonstrate how such predictions of importance can be used for automatic design retargeting and thumbnailing. User studies with hundreds of MTurk participants validate that, with limited post-processing, our importance-driven applications are on par with, or outperform, current state-of-the-art methods, including natural image saliency. We also provide a demonstration of how our importance predictions can be built into interactive design tools to offer immediate feedback during the design process.



There are no comments yet.


page 1

page 2

page 3

page 4

page 6

page 7

page 8

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1: We present two neural network models trained on crowdsourced importance. We trained the graphic design model using a dataset of 1K graphic designs with GDI annotations [34]. For training the data visualization model, we collected mouse clicks using the BubbleView methodology [23] on 1.4K MASSVIS data visualizations [4]. Both networks successfully predict ground truth importance and can be used for applications such as retargeting, thumbnailing, and interactive design tools. Warmer colors in our heatmaps indicate higher importance.

A crucial goal of any graphic design or data visualization is to communicate the relative importance of different design elements, so that the viewer knows where to focus attention and how to interpret the design. In other words, the design should provide an effective management of attention [40]. Understanding how viewers perceive a design could be useful for many stages of the design process; for instance, to provide feedback [41]. Automatic understanding can help build tools to search, retarget, and summarize information in designs and visualizations. Though saliency prediction in natural images has recently become quite effective, there is little work in importance prediction for either graphic designs or data visualizations. Our online demo, video, code, data, trained models, and supplemental material are available at visimportance.csail.mit.edu.

Figure 2: We show an interactive graphic design application using our model that lets users change and visualize the importance values of elements. Users can move and resize elements, as well as change color, font, and opacity, and see the updated realtime importance predictions. For instance, a user changes the color of the text to the left of the runner to increase its importance (middle panel). The rightmost panel includes a few additional changes to the size, font, and placement of the text elements to modify their relative importance scores. A demo is available at visimportance.csail.mit.edu.

We use “importance” as a generic term to describe the perceived relative weighting of design elements. Image saliency, which has been studied extensively, is a form of importance. However, whereas traditional notions of saliency refer to bottom-up, pop-out effects, our notion of importance can also depend on higher-level factors such as the semantic categories of design elements (e.g., title text, axis text, data points).

This paper presents a new importance prediction method for graphic designs and data visualizations. We use a state-of-the-art deep learning architecture, and train models on two types of crowdsourced importance data: graphic design importance (GDI) annotations [34] and a dataset of BubbleView clicks [23] we collected on data visualizations.

Our importance models take input designs in bitmap form. The original vector data is not required. As a result, the models are agnostic to the encoding format of the image and can be applied to existing libraries of bitmap designs. Our models pick up on some of the higher-level trends in ground truth human annotations. For instance, across a diverse collection of visualizations and designs, our models learn to localize the titles and correctly weight the relative importance of different design elements (Fig. 


We show how the predicted importance maps can be used as a common building block for a number of different applications, including retargeting and thumbnailing. Our predictions become inputs to cropping and seam carving with almost no additional post-processing. Despite the simplicity of the approach, our retargeting and thumbnailing results are on par with, or outperform, related methods, as validated by a set of user studies launched on Amazon’s Mechanical Turk (MTurk). Moreover, an advantage of the fast test-time performance of neural networks makes it feasible for our predictions to be integrated into interactive design tools (Fig. 2). With another set of user studies, we validate that our model generalizes to fine-grained design variations and correctly predicts how importance is affected by changes in element size and location on a design.

Contributions: We present two neural network models for predicting importance: in graphic designs and data visualizations. This is the first time importance prediction is introduced for data visualizations. For this purpose, we collected a dataset of BubbleView clicks on 1,411 data visualizations. We also show that BubbleView clicks are related to explicit importance annotations [34] on graphic designs. We collected importance annotations for 264 graphic designs with fine-grained variations in the spatial arrangement and sizes of design elements. We demonstrate how our importance predictions can be used for retargeting and thumbnailing, and include user studies to validate result quality. Finally, we provide a working interactive demo.

2 Related Work

Designers and researchers have long studied eye movements as a clue to understanding the perception of interfaces [10, 17]. There have also been several recent studies of eye movements and the perception of designs [3, 13]. However, measuring eye movements is an expensive and time-consuming process, and is rarely feasible for practical applications.

Figure 3: Left: Comparison of eye movements collected in a controlled lab setting [3], and clicks that we crowdsourced using the BubbleView interface [23, 24]. Right: Comparison of importance annotations from the GDI dataset, and clicks that we crowdsourced using the BubbleView interface. These examples were chosen to demonstrate some of the similarities and differences between the modalities. For instance, compared to eye fixations, clicks are sometimes more concentrated around text. Compared to the GDI annotations, clicks do not assign uniform importance to whole design elements. Despite these differences, BubbleView data leads to similar importance rankings of visual elements (Evaluation).

Few researchers have attempted to automatically predict importance in graphic designs. The DesignEye system [41] uses hand-crafted saliency methods, demonstrating that saliency methods can provide valuable feedback in the context of a design application. O’Donovan et al. [34] gather crowdsourced importance annotations, where participants are asked to mask out the most important design regions. They train a predictor from these annotations. However, their method requires knowledge of the location of design elements to run on a new design. Haass et al. [12] test three natural image saliency models on the MASSVIS data visualizations [4]

, concluding that, across most saliency metrics, these models perform significantly worse on visualizations than on natural images. Several models also exist for web page saliency. However, most methods use programmatic elements (e.g., the DOM) as input to saliency estimation rather than allowing bitmap images as input 

[5, 48]. Pang et al. predict the order in which people will look at components on a webpage [37] by making use of the DOM and manual segmentations. Other works use the web page image itself as input to predict saliency [44, 46]. Most of these methods use older saliency architectures based on hand-crafted features that are inferior to the state-of-the-art neural networks we use in our approach. Our work also relates to the general program of applying computer vision and machine learning in the service of graphic design tools [28, 29, 43].

Predicting eye movements for natural images is a classic topic in human and computer vision. The earliest natural image saliency methods relied on hand-coded features (e.g., [16]). Recently, deep learning methods, trained on large datasets, have produced a substantial jump in performance on standard saliency benchmarks [8, 9, 15, 30, 36, 50]. However, these methods have been developed exclusively for analyzing natural images, and are not trained or tested on graphic designs. Our work is the first to apply neural network importance predictors to both graphic designs and data visualizations.

3 Data Collection

To train our models we collected BubbleView data [23, 24] for data visualizations, and used the Graphic Design Importance (GDI) dataset by O’Donovan et al. [34] for graphic designs. We compared different measurements of importance: BubbleView clicks to eye movements on data visualizations, and BubbleView clicks to GDI annotations on graphic designs.

3.1 Ground truth importance for data visualizations

Large datasets are one of the prerequisites to train neural network models. Unfortunately, collecting human eye movements for even hundreds of images is extremely expensive and time-consuming. Instead, we use the BubbleView interface by Kim et al. [23, 24] to record human “attention” that is correlated with eye fixations. Unlike eye tracking, which requires expensive equipment and a controlled lab study, BubbleView can be used to to collect large datasets with online crowdsourcing.

In BubbleView, a participant is shown a blurry image and can click on different parts of the image to reveal small regions, or bubbles, of the image at full resolution. Initial experiments by Kim et al. [24] showed a high correlation between eye fixations collected in the lab and crowdsourced BubbleView click data. In this paper, we confirm this relationship.

Concurrent work in the computer vision community has applied a similar methodology to natural images. SALICON [19] is a crowdsourced dataset of mouse movements on natural images that has been shown to approximate free-viewing eye fixations. Current state-of-the-art models on saliency benchmarks have all been trained on the SALICON data [9, 15, 30, 36, 50]. BubbleView was concurrently developed [24] to approximate eye fixations on data visualizations with a description task. Some advantages of BubbleView over SALICON are discussed in [23].

Using Amazon’s Mechanical Turk (MTurk), we collected BubbleView data on a set of 1,411 data visualizations from the MASSVIS dataset [4], spanning a diverse collection of sources (news media, government publications, etc.) and encoding types (bar graphs, treemaps, node-link diagrams, etc.). We manually filtered out visualizations containing illegible and non-English text, as well as scientific and technical visualizations containing too little context. Images were scaled to have a maximum dimension of 600 pixels to a side while maintaining their aspect-ratios to fit inside the MTurk task window. We blurred the visualizations using a Gaussian filter with a radius of 40 pixels and used a bubble size with a radius of 32 pixels as in [23]. MTurk participants were additionally required to provide descriptions for the visualizations to ensure that they meaningfully explored each image. Each visualization was shown to an average of 15 participants. We aggregated the clicks of all participants on each visualization and blurred the click locations with a Gaussian filter with a radius of 32 pixels, to match the format of the eye movement data.

We used the MASSVIS eye movement data for testing our importance predictions. Fixation maps were created by aggregating eye fixation locations of an average of 16 participants viewing each visualization for 10 seconds. Fixation locations were Gaussian filtered with a blur radius of 32 pixels. Fig. 3a includes a comparison of the BubbleView click maps to eye fixation maps from the MASSVIS dataset.

3.2 Ground truth importance for graphic designs

We used the Graphic Design Importance (GDI) dataset [34] which comes with importance annotations for 1,078 graphic designs from Flickr. Thirty-five MTurk participants were asked to label important regions in a design using binary masks, and their annotations were averaged. Participants were not given any instruction as to the meaning of “importance.” To determine how BubbleView clicks relate to explicit importance annotations, we ran the BubbleView study on these graphic designs and collected data from an average of 15 participants per design. Fig. 3b shows comparisons between the GDI annotations and BubbleView click maps. In both data similar elements and regions of designs emerge as important.

Each representation has potential advantages. The GDI annotations assign a more uniform importance score to whole elements. This can serve as a soft segmentation to facilitate design applications like retargeting. BubbleView maps may be more appropriate for directly modeling human attention.

4 Models for predicting importance

Given a graphic design or data visualization, our task is to predict the importance of the content at each pixel location. We assume the input design/visualization is a bitmap image. The output importance prediction at each pixel is , where larger values indicate higher importance. We approach this problem using deep learning, which has lead to many recent breakthroughs on a variety of image processing tasks in the computer vision community [26, 39], including the closely related task of saliency modeling.

Similar to some top-performing saliency models for natural images [15, 27], our architecture is based on fully convolutional networks (FCNs) [33]

. FCNs are specified by a directed acyclic graph of linear (e.g., convolution) and nonlinear (e.g., max pool, ReLU) operations over the pixel grid, and a set of parameters for the operations. The network parameters are optimized over a loss function given a labeled training dataset. We refer the reader to Long et al. 

[33] for more details.

We predict real-valued importance using a different training loss function from the original FCN work, which predicted discrete object classes. Given ground truth importances at each pixel , , we optimize the sigmoid cross entropy loss for FCN model parameters over all pixels :


where is the output prediction of the FCN

composed with the sigmoid function

. Note that the same loss is used for binary classification, where . Here, we extend it to real-valued . We use a different loss than other saliency models based on neural networks that optimize Euclidean [27, 35], weighted Euclidean [9], or binary classification losses [30, 50]. Our loss is better suited to values, and is equivalent to optimizing the KL loss commonly used for saliency evaluation.

We trained separate networks for data visualizations and for graphic designs. For the data visualizations, we split the 1.4K MASSVIS images for which we collected BubbleView click data into 1,209 training images and 202 test images. For the test set we chose MASSVIS images for which eye movements are available [3]. For the graphic designs, we split the 1,078 GDI images into 862 training images and 216 test images (80-20% split). We used the GDI annotations [34] for training. We found that training on the GDI annotations rather than the BubbleView clicks on graphic designs facilitated the design applications better, since the GDI annotations were better aligned to element boundaries.

Figure 4: We increase the precision of our FCN-32s predictions by combining output from the final layer of the network with outputs from lower levels. The resulting predictions, FCN-16s and FCN-8s, capture finer details. We found FCN-16s sufficient for our model for graphic designs, as FCN-8s did not add a performance boost. For our model for data visualizations, we found no performance gains beyond FCN-32s.

Model details:

We converted an Oxford VGG-16 convolutional neural network 


to an FCN-32s model via network surgery using the implementation in Caffe 

[18]. The model’s predictions are 1/32 of the input image resolution, due to successive pooling layers. To increase the resolution of the predictions and capture fine details, we followed the procedure in Long et al. [33] to add skip connections from earlier layers to form FCN-16s and FCN-8s models, that are respectively, 1/16 and 1/8 of the input image resolution. We found that the FCN-16s (with a single skip connection from pool4) improved the graphic design importance maps relative to the FCN-32s model (Fig. 4), but that adding an additional skip connection from pool3 (FCN-8s) performed similarly. We found that skip connections lead to no gains for the data visualization importance. For our experiments we used the trained FCN-16s for graphic designs and the FCN-32s for data visualizations.

Since we have limited training data we initialized the network parameters with the pre-trained FCN32s model for semantic segmentation in natural images [33], and fine-tuned it for our task. The convolutional layers at the end of the network and the skip connections were randomly initialized. Training details are provided in the Supplemental Material.

We opted for a smaller architecture with fewer parameters than some other neural network saliency models for natural images. This makes our model more effective for our datasets, which are currently an order-of-magnitude smaller than the natural image saliency datasets.

5 Evaluation of model predictions

We compare the performance of our two importance models to ground truth importance on each dataset. For data visualizations, we compare predicted importance maps to bubble clicks gathered using BubbleView, and to eye fixations from the MASSVIS dataset. For graphic designs, we compare predicted importance maps to GDI annotations.

5.1 Evaluation criteria

We evaluate the similarity of our predicted and ground truth importance maps using two metrics commonly used for saliency evaluation [7]

: Kullback-Leibler divergence (KL) and cross correlation (CC). CC measures how correlated the pixel-wise values are in the two maps, and treats both false positives and false negatives equally. KL, however, measures how well one distribution predicts another. Our importance maps can be interpreted as providing, for each pixel, the probability that the pixel would be considered important by ground truth observers. KL highly penalizes missed predictions, so a sparse map that fails to predict a ground truth important location will receive a large KL value (poor score). Given the ground truth importance map

and the predicted importance map , KL is computed as:


where is the entropy of the ground truth importance map and is the cross entropy of the prediction and ground truth. Note the similarity to the loss in Equation (1

), which is over a Bernoulli random variable; here the random variable is instantiated. A large KL divergence indicates a high dissimilarity between maps, whereas

indicates two maps are identical. KL is in principle unbounded, so to provide a feasible range, we include chance baselines in our experiments. CC is computed as:


where , and respectively for . CC ranges from -1 to 1, where 1 indicates maximal correlation between two maps and . For further intuition about how KL and CC metrics score similarity, we provide scores above each image in Fig. 5, and additional examples in the Supplemental Material, showing high- and low-scoring predictions.

5.2 Prediction performance on data visualizations

We include predictions from our importance model in Fig. 5. Notice how we correctly predict important regions in the ground truth corresponding to titles, captions, and legends. We quantitatively evaluate our approach on our collected dataset of BubbleView clicks. We report CC and KL scores averaged over our dataset of 202 test images in Table 1.

We compare against the following baselines: chance, Judd saliency [22], and DeepGaze [30], a top neural network saliency model trained on the SALICON dataset [19] of mouse movements on natural images. The chance baseline, used in saliency benchmarks [7, 21], is computed by uniformly sampling a real value between 0 and 1 at each image pixel. Our approach out-performs all baselines. KL is highly sensitive to false negatives and drastically penalizes sparser models [7]111Because of the sensitivity of KL to output regularization, we advise against using it (solely) to compare models [7]., explaining the high KL values for DeepGaze in Table 1. Post-processing or directly optimizing models for specific metrics can yield more favorable performances [31].

to c c c Model & CC score & KL score
Chance & 0.00 & 0.75
Judd [22] & 0.11 & 0.49
DeepGaze [30] & 0.57 & 3.48
Our model & 0.69 & 0.33

Table 1: How well can our importance model predict the BubbleView click maps? We add comparisons to two other top-performing saliency models and a chance baseline. Scores are averaged over 202 test data visualizations. A higher CC score and lower KL score are better.

How well does our neural network model, trained on clicks, predict eye fixations? We find that the predicted importance is representative of eye fixation patterns as well (Table 2), although the difference in scores indicates that our model might be learning from patterns in the click data that are different from fixations.

to c c c Model & CC score & KL score
Chance & 0.00 & 1.08
Judd [22] & 0.19 & 0.74
DeepGaze [30] & 0.53 & 3.10
Our model & 0.54 & 0.63
Bubble clicks & 0.79 & 0.28

Table 2: How well can human eye fixations be predicted? We measured the similarity between human fixation maps and various predictors. Scores are averaged over 202 test data visualizations. A good model achieves a high CC score and low KL score. Our neural network model was trained on BubbleView click data, so that is the modality it can predict best. Nevertheless, its predictions are also representative of eye fixation data. As an upper bound on this prediction performance, we consider how well the BubbleView click data predicts eye fixations, and as a lower bound, how well chance predicts eye fixations.

Which elements are most important? For our analysis, we used the element segmentations available for the visualizations in the MASSVIS dataset [3]. We overlapped these segmentations with normalized maps of eye fixations, clicks, and predicted importance. We computed the max score of the map within each element to get an importance ranking across elements222A similar analysis was used to rank the relative importance of objects in natural images [8, 19].. Text elements, such as titles and captions, were the most looked at333Among the text and other content in a visualization, titles tend to be best remembered by human observers [3]., and clicked on, elements, and were also predicted most important by our model (Fig. 6). Even though our model was trained on BubbleView clicks, the predicted importance remains representative of eye fixation patterns. With regards to differences, our model overpredicts the importance of titles. Our model learns to localize visualization titles very well (Fig. 5).

Figure 5: Importance predictions for data visualizations, compared to ground truth BubbleView clicks and sorted by performance. Our model is good at localizing visualization titles (the element clicked on, and gazed at, most by human participants) as well as picking up the extreme points on graphs (e.g., top and bottom entries). We include a failure case where our model overestimates the importance of the visual map regions. More examples in the Supplemental Material.
Figure 6: Relative importance scores of different elements in a data visualization assigned by eye fixation maps, BubbleView click maps, and model predictions. Scores were computed by overlapping element segmentations with normalized importance maps, and taking the max of the map within each element as its score. The elements that received the most clicks also tended to be highly fixated during viewing (Spearman’s ). Text (titles, labels, paragraphs) received a lot of attention. Our neural network model correctly predicted the relative importance of these regions relative to eye movements ().

5.3 Prediction performance on graphic designs

The closest approach to ours is the work of O’Donovan et al. [34] who computed an importance model for the GDI dataset. We re-ran their baseline models on the train-test split used for our model (Table 3). To replicate their evaluation, we report root-mean-square error (RMSE) and the coefficient, where indicates a perfect predictor, and is the baseline of predicting the mean importance value (details in Supplemental Material). The full O’Donovan model (OD-Full) requires manual annotations of text, face, and person regions, and would not be practical in an automatic setting. For a fair comparison, we evaluate our automatic predicted importance model (Ours) against the automatic portion of the O’Donovan model, which does not rely on human annotations (OD-Automatic). We find that our model outperforms OD-Automatic. Our model is also 100X faster, since it requires a single feed-forward pass through the network (0.1 s/image on a GPU). O’Donovan’s method requires separate computations of multiple CPU-based saliency models and image features (10 s/image at the most efficient setting).

In Table 3, we include the performance of Ours+OD, where we added our importance predictions as an additional feature during training of the O’Donovan model, and re-estimated the optimal weights for combining all the features. Ours+OD improves upon OD-Full indicating that our importance predictions are not fully explainable by the existing features (e.g., text or natural image saliency). This full model is included for demonstration purposes only, and is not practical for interactive applications.

We also annotated elements in each of the test graphic designs using bounding boxes, and computed the maximum importance value in each bounding box as the element’s score (Fig. 7). We obtain an average Spearman rank correlation of 0.56 between the predicted and ground truth scores assigned to the graphic design elements.

Some examples of predictions are included in Fig. 8. Our predictions capture important general trends, such as larger and more central text and visual elements being more important. However, text regions are not always well segmented (predicted importance is not uniform over a text element), and text written in unusual fonts is not always detected. Such problems could be ameliorated through training on larger datasets. Harder cases are directly comparing the importance of a visual and text, which can depend on the semantics of the text itself (how informative it is) and the quality of the visual (how unexpected, aesthetic, etc.).

to c c c Model & RMSE &
Saliency & .229 & .462
OD-Automatic & .212 & .539
Ours & .203 & .576
[2pt/2pt] OD-Full & .155 & .754
Ours+OD & .150 & .769

Table 3: A comparison of our predicted importance model (Ours) with the model of O’Donovan et al. [34]. Lower and higher are better. Our model outperforms the fully automatic O’Donovan variant (OD-Automatic). Another fully automatic model from [34] is Saliency, a learned combination of 4 saliency models: Itti&Koch [16], Hou&Zhang [14], Judd et al. [22], and Goferman et al. [11]. We also report the results of the semi-automatic OD-Full model, which includes manual annotations of text, face, and person regions. When we combine our approach with OD-Full (Ours+OD), we can approve upon the OD model. More comparisons are included in the Supplemental Material.
Figure 7: An example comparison between the predicted importance of design elements and the ground truth GDI annotations. The heatmaps are overlapped with element bounding boxes and the maximum score per box is used as the element’s importance score (between 0 and 1).
Figure 8: Importance predictions for graphic designs, sorted by performance. Performance is measured as the Spearman rank correlation (R) between the importance scores assigned to design elements by ground truth (GDI annotations) and predicted importance maps. A score of 1 indicates a perfect rank correlation; a negative score indicates the element rankings are reversed. The predicted importance maps distribute importance between text and visual features. We include a failure case where the importance of the man in the design is underestimated. More examples in the Supplemental Material.

5.4 Prediction performance on fine-grained design variations

To check for feasibility of an interactive application we perform a more fine-grained test. We want the importance rankings of elements to be adjusted accurately when the user makes changes to their current design. For example, if the user makes a text box larger, then its importance should not go down in the ranking. Our predicted importance model has not been explicitly trained on systematic design variations, so we test if it can generalize to such a setting.

We used the Design Improvement Results dataset [34] containing 11 designs with an average of 35 variants. Across variants, the elements are preserved but the location and scale of the elements varies. We repeated the MTurk importance labeling task of O’Donovan et al. [34] on a subset of 264 design variants, recruiting an average of 19 participants to annotate the most important regions on each design. We averaged all participant annotations per design to obtain ground truth importance heatmaps. We segmented each design into elements and used the ground truth and predicted importance heatmaps to assign importance scores to all the elements, calculating the maximum heatmap value falling within each segment. The predicted and ground truth importance scores assigned to these elements achieved an average Spearman’s correlation . As Fig. 9 shows, even though we make some absolute errors, we successfully account for the impact of design changes such as the location and size of various elements.

Figure 9: Sample input designs, and how the relative importance of the different design elements changes as they are moved, resized, and otherwise modified. For instance, compared to in (a), the event date stands out more and gains importance when it occurs at the bottom of the poster, in large font, on a contrasting background (b). Similarly, when the most important text of the design in (c) is moved to the upper righthand corner where it is not surrounded by other text, it gains prominence (d). Our automatic model makes similar predictions of the relative importance of design elements as ground truth human annotations.

6 Applications

We now demonstrate how automatic importance prediction can enable diverse applications. An importance map can provide a common building block for different summarization and retrieval tasks, including retargeting, thumbnailing, and interactive design tools. These prototypes are meant as proofs-of-concept, showing that our importance prediction alone can give good results with almost no additional post-processing.

6.1 Retargeting

The retargeting task is to take a graphic design as input, and to produce a new version of that design with specific dimensions. Retargeting is a common task for modern designers, who must work with many different output dimensions. There is a substantial amount of work on automatic retargeting for natural images, e.g., [2, 42]. Several of these methods have shown that saliency or gaze provide good cues for retargeting, to avoid cropping out image content that people are likely to pay most attention to, such as faces in photographs.

The only previous work on retargeting graphic designs is by O’Donovan et al. [34]. They assumed knowledge of the underlying vector representation of the design and used an expensive optimization with many different energy terms. The method we propose uses bitmap data as input, and is much simpler, without requiring any manual annotations of the input image.

Importance-based retargeting for graphic designs should preserve the most important regions of a design, such as the title and key visual elements. Given a graphic design bitmap as input and specific dimensions, we use the predicted importance map to automatically select a crop of the image with highest importance (Fig. 10). Alternative variants of retargeting (e.g., seam carving) are discussed in the Supplemental Material.

Evaluation: We ran MTurk experiments where 96 participants were presented with a design and 6 retargeted variants, and were asked to score each variant using a 5-point Likert scale with 1 = very poor and 5 = very good (Fig. 11). Each participant completed this task for 12 designs: 10 randomly selected from a collection of 216 designs, and another 2 designs used for quality control. We used this task to compare crops retargeted using predicted importance to crops retargeted using ground truth GDI annotations, Judd saliency [22], DeepGaze saliency [30], and an edge energy map. We extracted a crop with an aspect ratio of 1:4 from a design using the highest-valued region, as assigned by each of the saliency/importance maps. As a baseline, we selected a random crop location.

After an analysis of variance showed a significant effect of retargeting method on score, we performed Bonferonni paired t-tests on the scores of different methods.

Across all 216 designs, crops obtained using ground truth GDI annotations had the highest score (Mean: 3.19), followed by DeepGaze (Mean: 2.95) and predicted importance (Mean: 2.92). However, the difference between the latter pair of models was not statistically significant. Edge energy maps (Mean: 2.66) were worse, but not significantly; while Judd saliency (Mean: 2.47) and the random crop baseline (Mean: 2.23) were significantly worse in pairwise comparisons with all the other methods ( for all pairs). Results of additional experimental variants are reported in the Supplemental Material.

Our predicted importance outperforms Judd saliency, a natural saliency model commonly used for comparison [32, 34]. Judd saliency has no notion of text. Predicted importance, trained on less than 1K graphic design images, performs on par with DeepGaze, the currently top-performing neural network-based saliency model [6] which has been trained on 10K natural images, including images with text. Both significantly outperform the edge energy map, which is a common baseline for retargeting. These results show the potential use case of predicted importance for a retargeting task, even without any post-processing steps.

6.2 Thumbnailing

Thumbnailing is similar to retargeting, but with a different goal. It aims to provide a visual summary for an image to make it easier to find relevant images in a large collection [20, 47]. Unlike previous methods, our approach operates directly on a bitmap input, rather than requiring a specialized representation as input. For this example our domain is data visualizations rather than graphic designs.

Given a data visualization and an automatically-computed importance map as input, we generate a thumbnail by carving out the less important regions of the image. The importance map is used as an energy function, whereby we iteratively remove image regions with least energy first. Rows and columns of pixels are removed until the desired proportions are achieved, in this case a square thumbnail. This is similar to seam carving [2, 42], but using straight seams, found to work better in our setting. The boundaries of the remaining elements are blurred using the importance map as an alpha-mask with a fade to white. Qualitatively, the resulting thumbnails consist of titles and other main supporting text, as well as data extremes (from the top and bottom of a table, for instance, or from the left and right sides of a plot).

Figure 10: (a) Input designs, (b) our predicted importance maps, and (c) automatic retargeting results using the predicted importance maps to crop out design regions with highest overall importance. This is compared to: (d) edge-based retargeting, where gradient magnitudes are used as the energy map, and (e) Judd saliency, a commonly-used natural image saliency model. Additional comparisons are provided in the Supplemental Material.
Figure 11: MTurk interface for evaluating retargeting results of predicted importance compared to other baselines. More experimental details are provided in the Supplemental Material.

(a) (b)

Figure 12: Given a set of data visualizations (a), we use our importance maps to automatically generate thumbnails (b). The thumbnails facilitate visual search through a database of visualizations by summarizing the most important content. More examples can be found in the Supplemental Material.

Evaluation: We designed a task intended to imitate a search through a database of visualizations. MTurk participants were given a description and a grid of 60 thumbnails, and were instructed to find the visualization that best matches the description. We ran two versions of the study: with the original visualizations resized to thumbnails (Fig. 12a), and another with our automatically-computed importance-based thumbnails (Fig. 12b). We measured how many clicks it took for participants to find the visualization corresponding to the description in each version.

A total of 400 participants were recruited for our study. After filtering, we compared the performance of 200 participants who performed the study with resized visualizations and 169 participants who saw the importance-based thumbnails.

Each MTurk assignment, containing a single search task assigned to a single participant, was treated as a repeated observation. We ran an unpaired two-sample t-test to compare the task performance of both groups. On average, participants found the visualization corresponding to the description in fewer clicks using the importance-based thumbnails (1.96 clicks) versus using the resized visualizations (3.25 clicks, t(367) = 5.10, p < .001). Our importance-based thumbnails facilitated speedier retrieval, indicating that the thumbnails captured visualization content relevant for retrieval.

6.3 Interactive applications

An attractive aspect of neural network models is their fast run-time performance (Table 4). As a prototype, we integrated our importance prediction with a simple design layout tool that allows users to move and resize elements, as well as change color, text font, and opacity (Fig. 2). With each change in the design, an importance map is recomputed automatically to provide immediate feedback to the user. The accompanying video and demo (visimportance.csail.mit.edu) demonstrate the interactive capabilities of our predictions. Our experiments in the Evaluation section provide initial evidence that our model can generalize to the kind of fine-grained design manipulations, like the resizing and relocation of design elements, that would be common in an interactive setting. Determining how best to use importance prediction to provide feedback to users is an interesting problem for future work. For example, importance prediction could help in formulating automatic suggestions for novice users to improve their designs.

Timing: On a Titan-X GPU, our model computes the importance map for a design in the GDI dataset (600450 pixels) in 100 ms. Table 4 provides some timing information for our model on differently-sized images.

to c c c c c c Image size (pixels/side) & 300 & 600 & 900 & 1200 & 1500
Avg. compute time (ms) & 46 & 118 & 219 & 367 & 562

Table 4: Time (in milliseconds) taken by our model to compute an importance map for differently-sized images, averaged over 100 trials.

7 Limitations

Our neural network model is only as good as the training data we provide it. In the case of data visualizations, there is a strong bias, both by the model and the ground truth human data, to focus on the text regions. This behavior might not generalize to other types of visualizations and tasks. Click data, gathered via the BubbleView interface, is not uniform over elements, unlike explicit bounding box annotations (i.e., as in the GDI dataset [34]). While this might be a better approximation to natural viewing, non-uniform importance across design elements might cause side-effects for downstream applications like thumbnailing, by cutting off parts of elements or text.

8 Conclusions

We curated hundreds of examples of graphic designs and data visualizations, annotated with importance, to train fully convolutional neural network models to predict importance maps for novel designs. We showed that our computational predictions approximate ground truth human data enough to be used for a number of automatic applications. Our importance maps act as a common underlying representation for retargeting of graphic designs, thumbnailing of data visualizations, and in a prototype interactive design application.

This paper presents the first neural network model for predicting saliency or importance in graphic designs and data visualizations, capable of generalizing to a wide range of design formats. Moreover, the fast test-time performance of our model makes it feasible for the predictions to be used in interactive design tools. Our approach is not limited to graphic designs and data visualizations. The methodology and models can easily be adapted to other visual domains, such as websites [23]. As better webcam-based eyetracking methods become available (e.g., [25, 38, 49]) possibilities also open up for directly training our model from eye movement data. Future work can also explore the use of importance predictions to offer more targeted design feedback and to provide automated suggestions to a user.

9 Acknowledgements

We would like to thank Joel Brandt and the anonymous reviewers for useful feedback. Thank you to Matthias Kümmerer and Adrià Recasens for help computing saliency models. ZB would like the acknowledge the support of the Natural Sciences and Engineering Research Council of Canada Postgraduate Doctoral Scholarship (NSERC PGS-D).