An Automated Approach to Reasoning About Task-Oriented Insights in Responsive Visualization

Authors often transform a large screen visualization for smaller displays through rescaling, aggregation and other techniques when creating visualizations for both desktop and mobile devices (i.e., responsive visualization). However, transformations can alter relationships or patterns implied by the large screen view, requiring authors to reason carefully about what information to preserve while adjusting their design for the smaller display. We propose an automated approach to approximating the loss of support for task-oriented visualization insights (identification, comparison, and trend) in responsive transformation of a source visualization. We operationalize identification, comparison, and trend loss as objective functions calculated by comparing properties of the rendered source visualization to each realized target (small screen) visualization. To evaluate the utility of our approach, we train machine learning models on human-ranked small screen alternative visualizations across a set of source visualizations. We find that our approach achieves an accuracy of 84% in ranking visualizations. We demonstrate this approach in a prototype responsive visualization recommender that enumerates responsive transformations using Answer Set Programming and evaluates the preservation of task-oriented insights using our loss measures. We discuss implications of our approach for the development of automated and semi-automated responsive visualization recommendation.

1 Related Work

1.1 Responsive Visualization and Design Transformations

We use visualization design transformation to refer to the transformation of a source visualization specification to a new visualization specification intended to better achieve certain context-specific constraints. These might be screen size limitations for responsive visualization, audience-related constraints in visualization simplification or audience retargeting (e.g., [Bottinger2020, Johnson2014]), or style constraints in style transfer [Harper2014, harper2017], for example. Visualization design transformation differs from creating multiple different views from the same dataset (e.g., a visualization sequence or dashboard) in that transformations of an original source view typically are intended to preserve many properties of the source while changing select properties.

Prior work on responsive visualization, which tends to focus on Web-based communicative visualization, or scalable visualization [cook2005illuminating] more broadly, emphasizes the importance of maintaining intended takeaways between source and transformed views. Analyzing 378 responsive visualization pairs on desktop and mobile devices, Kim et al [kim2021] identify density-message trade-offs in responsive visualization where authors need to balance adjusting visual density or complexity for different screen types while maintaining patterns, trends or other important information conveyed in the source view. Focusing on maintaining key information at different scales, earlier work on visualization resizing introduces algorithms that repeatedly remove the pixels determined to be least important [Giacomo2015] and iteratively minimize scaling in more salient regions [Wu2013], for example. We extend prior approaches by proposing approximation methods for task-oriented visualization insights.

Defaulting to simpler views over complex, over-encoded plots is often recommended when exploring or publicizing complex data [kelleher2011]. Authors accomplish this through data-level transformations, such as data abstraction, clutter reduction, filtering, or clustering. For example, data abstraction studies have attempted to enhance the simplicity of a view while preserving original structure or insights (e.g., aggregating a large movement dataset [Adrienko2011], using interactive dimensionality reduction [Johansson2009], using hierarchical aggregation [Elmqvist2010], and measuring the quality of an abstraction [cui2006]).

1.2 Visualization Recommendation

We discuss two approaches in visualization recommendation (insight-based and similarity-based) that are relevant to our goal of approximating changes in task-oriented insights. Prior work on visualization recommendation employs statistical calculations to characterize properties of a visualization thought to relate to the insights a user can draw from it. Often these ‘insights’ are intended to capture how well a user can perform analytic tasks, such as recognizing trends or identifying and comparing data points. Tang et al [tang2017topk] suggest detecting ‘top-k insights’ from data using statistical significance testing (e.g., low $p$-value of a linear regression coefficient for slope insight). Similarly, Foresight [demiralp2017], DataSite [cui2019], and Voder [Srinivasan2019] use statistics calculated on the data, such as correlation coefficient and interquartile range, and recommend visualization types predicted to better support extracting such information. However, statistics on data are invariant for views sharing the same data set and hence of limited use for comparing different ways of visualizing the same underlying data. Our work instead considers statistics calculated on the rendered visualization.

Several prior visualization recommenders model similarity between views, but assume a scenario where the underlying dataset changes. GraphScape [kim2017graphscape] offers a view similarity model that assigns costs to visualization pairs that are intended to approximate the cognitive cost of transitioning from one view to another in a visualization sequence. GraphScape applies an a priori cost model in which data transformation (e.g., binning, modifying scales) is always less costly than changes in encoding. Hence, filtering data has a lower cost than transposing axes. However, filtering operations like removing a bar from a bar chart or rescaling a y-axis can significantly change the presumed “take-aways” of a chart (e.g., [gelman2020, hofman2020visualizing]). The space of transformations covered by GraphScape also does not include view size transformations, so it cannot assign costs to changes in aspect ratio common to responsive visualization. Although Dziban [dziban] extends GraphScape to suggest a view that is ‘anchored’ to the previous view for an exploratory data analysis process, it also assumes different subsets of data between the previous and current views and focuses more on similar chart encodings than on preserving task-oriented insights.

1.3 Comparing Visual Structure by Processing Signal

Signal processing-based approaches analyze the underlying visual or perceptual structure of a visualization to enable multi-scale visualizations (i.e., providing different insights at different scales) and to enhance visualization effectiveness. Prior work has attempted to enable multi-scale views through perceptual organization analysis of an information graphic at each scale [Wattenberg2003, Wattenberg2004] and hybrid-image visualization that displays different aggregation levels at different viewing distances [Isenberg2013], for example. Signal processing approaches have also been applied to improve the effectiveness of a visualization, for instance, by measuring the difference between the visual salience of a representation and salience of signals in data [janicke2010, Kindlmann2014], comparing kernel density estimations between a LOESS curve and different representations [Wang2018], and extending a structural similarity index for image compression to data visualization [Veras2020]. Signal processing-based approaches have typically been applied to single views, and are generally confined to a predefined set of marks and visual variables (e.g., a line chart, a scatterplot), restricting their applicability for settings like ours.

2 Problem Formulation

We propose formulating responsive visualization as a search problem from an input source view to transformed target views, following the characterization proposed by Kim et al [kim2021]. Consider a recommender that takes a source desktop view as input and returns a ranked set of targets as illustrated in Figure 2. The first step in creating such a recommender is to define a search space that can enumerate well-formed responsive targets. To generate useful target views from a source (large screen) visualization, a search space should cover common transformation strategies in responsive visualization, such as rescaling, aggregating, binning, and transposing [kim2021, hoffswell2020].

After enumerating target views, a responsive visualization recommender should evaluate how well each target preserves certain information or “insights.” While the term insight can be overloaded [zgraggen2018investigating], a relatively robust way to define insights comes from typologies for describing visualization judgments or patterns [amar2005, Brehmer2013]. These typologies suggest defining insights around common low-level visual analysis tasks like identifying and comparing data. In an automated design recommendation scenario, these task-oriented insights can be approximated by objective functions (i.e., loss measures) that capture support for common tasks, applied to both the source and target view. Finally, the recommender returns the set of target designs based on how well they minimize these loss measures. We formalize this problem and motivate and define three loss measures that we call task-oriented insight preservation measures. In section 4, we describe a prototype visualization recommender in which we implemented the approach.

Figure 2: A pipeline for a responsive visualization recommender

2.1 Notation

We define a visualization (or a view), $V$, as a three tuple

$$V = [D_V, C_V, R_V] \tag{1}$$

where $D_V$ is the data used in $V$, $C_V$ is a visualization specification (defining encodings, chart size, mark type, etc.), and $R_V$ is a set of rendered values that we compute our measures on. For example, suppose a bivariate data set $D_V$ with GDP and GNI fields (i.e., $D_V = [d_1, d_2, \ldots, d_n]$, where $d_i = (d_i.\mathrm{GDP}, d_i.\mathrm{GNI})$). $C_V$ maps GDP and GNI to x and y positions of point marks, respectively, producing a scatterplot. The corresponding set of rendered values $R_V$ is a set of Cartesian coordinates on the XY-plane (i.e., $R_V = [r_1, r_2, \ldots, r_n]$, where $r_i = (r_i.x, r_i.y)$ is the tuple of rendered values for $d_i$). Similarly, for a data set containing a field (emission) that is mapped to color, $r_i.\mathrm{color}$ would correspond to the rendered value of $d_i.\mathrm{emission}$. For brevity, we define $D_V.f$ as a vector of $f$ field values and $R_V.c$ as a vector of rendered values in the $c$ channel. Our notation is also illustrated in Figure 3.

Figure 3: Our notation for a visualization. Rendered values are defined in the space implied by the visual variable (e.g., pixel space for position or size, color space for color).

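Given a source view $S$ and a transformation (or target) $T$, we represent the loss of insight type $M$ from $S$ to $T$ as below:

$$\mathrm{Loss}(S \rightarrow T; M) \tag{2}$$

For example, $\mathrm{Loss}(S \rightarrow T; \mathrm{trend})$ indicates trend loss from $S$ to $T$.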

3 Task-oriented Insight Preservation Measures

High level criteria for preserving task-oriented insights of a visualization include preserving datum-level information, maintaining comparability of data points, and preserving the aggregate features [Brehmer2013]. We use these distinct classes of information to define task-oriented insight loss measures for approximating how well a responsive transformation preserves support for low-level tasks of identifying data, comparing data, and identifying trend. Our goal is to define a small set of measures that capture important types of low-level tasks a designer might wish to preserve in responsive transformation. Each measure should be distinct (i.e., mostly independent of the others) and should improve accuracy when combined with the others (such as through regression or ML modeling) to predict human judgments about how visualization transformations rank. Together, the measures should outperform reasonable baseline approaches based on simple heuristics. While chosen to cover three important classes of low-level analytic task, the measures we describe are not meant to be exhaustive, as there are many ways one could approximate support for task-oriented insights.

3.1 Identification Loss

Responsive visualization strategies often alter the number of visual attributes of marks that viewers can identify (affecting a low-level identification task [amar2005, Brehmer2013]). As illustrated in Figure 4, when the number of bin buckets of a histogram is decreased in a mobile view (a), each bar encodes more information on average than in the desktop view, such that some information about the distribution is lost. Similarly, strategies to adjust graphical density, like aggregating distributions (b) and filtering certain data (c), also reduce the number of identifiable attributes. We use identification loss to refer to changes to the identifiability of rendered values between a source view and a target.

Figure 4: Responsive transformations that may cause identification loss.

In information theory, Shannon entropy (entropy, hereafter) captures the information in a signal by measuring the minimum number of bits needed to encode it [shannon1948]. Given a random variable $X$, entropy is defined as $H(X) = -\sum_{i} P(x_i) \log_2 P(x_i)$. Applying this to visualization, suppose that for a source visualization $S$, a vector for data field $f$, $D_S.f$, is mapped to an encoding channel $c$. The corresponding rendered values $R_S.c$ compose a random variable that takes the set of unique values of $R_S.c$ as its outcome space, where the probability of $R_S.c$ taking $r$ is defined as the relative frequency of $r$ in $R_S.c$, formalized as

$$P(R_S.c = r) = \frac{|\{r_i \in R_S.c : r_i = r\}|}{|R_S.c|} \tag{3}$$

$$H(R_S.c) = -\sum_{r} P(R_S.c = r) \log_2 P(R_S.c = r) \tag{4}$$

We can similarly compute the probabilities of rendered values, $P(R_T.c = r)$, and the entropy of an encoding channel, $H(R_T.c)$, for a target view $T$. Finally, we can calculate the identification loss for the channel $c$ as the absolute difference in entropy (i.e., $|H(R_S.c) - H(R_T.c)|$), where 0 difference is the identity. The final identification loss from $S$ to $T$ is the sum of absolute differences in entropy for each encoding channel between the two views:

$$\mathrm{Loss}(S \rightarrow T; \mathrm{identification}) = \sum_{c} |H(R_S.c) - H(R_T.c)| \tag{5}$$
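The entropy-based definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dict-of-channels layout, the function names, the choice of base 2, and the decision to sum only over channels present in both views are all assumptions made here.

```python
import numpy as np
from scipy.stats import entropy


def channel_entropy(rendered_values):
    """Shannon entropy (in bits) of the relative frequencies of unique rendered values."""
    _, counts = np.unique(np.asarray(rendered_values), return_counts=True)
    # scipy normalizes the counts to probabilities; base=2 is an assumption
    return float(entropy(counts, base=2))


def identification_loss(source, target):
    """Sum of absolute entropy differences over channels shared by both views.

    `source` and `target` are hypothetical dicts mapping a channel name
    (e.g. "x", "y") to the list of rendered values in that channel.
    """
    shared = source.keys() & target.keys()
    return sum(abs(channel_entropy(source[c]) - channel_entropy(target[c])) for c in shared)


# Toy views: binning collapses four distinct x positions into two.
source = {"x": [10, 20, 30, 40], "y": [5, 5, 9, 9]}
target = {"x": [10, 10, 30, 30], "y": [5, 5, 9, 9]}
loss = identification_loss(source, target)  # x loses one bit, y is unchanged
```

In this toy case the x channel drops from four equally likely values to two, so its entropy falls from 2 bits to 1 bit while y is unchanged.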

3.2 Comparison Loss

Responsive transformations like resizing or scaling a view or aggregating data can alter the number of possible data comparisons that a user can make and how perceptually difficult they are (affecting a low-level comparison task [amar2005, Brehmer2013]). For instance, in Figure 5, resizing (a) diminishes the magnitude of difference between two highlighted data points in the small screen design. In a mobile design with aggregation (b), viewers are no longer able to make each comparison that is available in the large screen view. This motivates estimating how similarly viewers are able to discriminate between pairs of points in a target view compared to the source view, which we refer to as comparison loss.

Figure 5: Responsive transformations that may cause comparison loss.

Empirical visualization studies (e.g., [szafir2018, kim2018]) often operationalize accuracy as the viewer’s ability to perceive relationships between pairs of values. While simpler scalar statistics like a sum or mean might suffice under some transformations, a method that preserves the distribution of distances will be more robust to transformations that change the number of data points or scales (e.g., log-scale). We operationalize comparison loss as the difference in pairwise discriminability, measured using Earth Mover’s Distance (EMD), between the source and a target in each encoding channel used in a visualization:

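$$\mathrm{Loss}(S \rightarrow T; \mathrm{comparison}) = \sum_{c} \mathrm{EMD}(B_{S.c}, B_{T.c'}) \tag{6}$$

where $B_{S.c}$ and $B_{T.c'}$ are the discriminability distributions of the source and target views in encoding channels $c$ and $c'$, respectively, that encode the same data field.

Given a source visualization $S$, we define the discriminability distribution $B_{S.c}$ of an encoding channel $c$ as the set of distances between each pair of rendered values ($r_i$, $r_j$) of $R_S.c$ in terms of $c$. This is formalized as

$$B_{S.c} = \{ d_c(r_i, r_j) : r_i, r_j \in R_S.c,\ i < j \} \tag{7}$$

where $d_c$ is a distance metric for the encoding channel $c$.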

Distance metrics: Ideally, comparison loss should account for differences in how well visual channels support perception of numerical values. Informed by visual perception models, we select several distance metrics intended to provide a rough proxy of the perceptual difference between two visual signals. While visual variables can have interaction effects [brychtova2017, smart2019, szafir2018], for simplicity in demonstrating our approach, we limit our use of perceptual distance metrics to encoding channel specific measures. However, as the state-of-the-art in predicting effects of visual variable interactions develops, our approach could be amended to consider combinations.

For position channels, we use the absolute difference between two position values (in pixel space), as human vision is highly accurate in discriminating positions according to Stevens’ power law [stevens1957psychophysical, stevens2017psychophysics] and empirical studies [heer2009, heer2010]:

$$d_{\mathrm{position}}(r_i, r_j) = |r_i - r_j| \tag{8}$$

We measure distance in a size channel using the absolute difference between two size values (in pixels) raised to the estimated Stevens’ exponent of 0.7 [stevens1957psychophysical, stevens2017psychophysics]:

$$d_{\mathrm{size}}(r_i, r_j) = |r_i^{0.7} - r_j^{0.7}| \tag{9}$$

We calculate the Euclidean distance in the perceptual color space CIELAB [fairchild2004color] (CIELAB 2002):

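$$d_{\mathrm{color}}(r_i, r_j) = \sqrt{(L_i - L_j)^2 + (a_i - a_j)^2 + (b_i - b_j)^2} \tag{10}$$

where $L$, $a$, and $b$ represent $L^*$, $a^*$, and $b^*$ in CIELAB space.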

Lastly, for shape encodings, we employ a perceptual kernel [demiralp2014], a (symmetric) matrix of pairwise distances between visual attributes. The $(i, j)$-th element in the perceptual kernel for shape is the empirical probability of discriminating shape $i$ from shape $j$ based on an online crowdsourced experiment in which workers completed a triplet discrimination task where they chose the most dissimilar shape out of three shapes. Formally, our shape distance metric can be stated as:

$$d_{\mathrm{shape}}(r_i, r_j) = K_{\mathrm{shape}}[r_i, r_j] \tag{11}$$

where $K_{\mathrm{shape}}$ is the perceptual kernel for shape.
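The four channel-specific distance metrics can be written as small Python functions. This is a sketch under stated assumptions: the function names are hypothetical, the size metric raises each size value to the exponent before differencing (one reading of the description above), and the kernel entries in the example are made-up placeholders rather than values from the cited crowdsourced study.

```python
import math

STEVENS_SIZE_EXPONENT = 0.7  # exponent quoted in the text


def position_distance(p, q):
    """Absolute difference between two positions in pixel space."""
    return abs(p - q)


def size_distance(s, t, exponent=STEVENS_SIZE_EXPONENT):
    """Assumption: each size is raised to the Stevens exponent, then differenced."""
    return abs(s ** exponent - t ** exponent)


def color_distance(lab1, lab2):
    """Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(lab1, lab2)


def shape_distance(kernel, shape_a, shape_b):
    """Look up a pairwise distance in a perceptual kernel (nested dict here)."""
    return kernel[shape_a][shape_b]


# Placeholder kernel: the numbers are illustrative only.
toy_kernel = {
    "circle": {"circle": 0.0, "square": 0.4},
    "square": {"circle": 0.4, "square": 0.0},
}

d_pos = position_distance(120, 80)
d_col = color_distance((50, 0, 0), (50, 3, 4))
d_shape = shape_distance(toy_kernel, "circle", "square")
```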

Comparing discriminability distributions: To quantify the discrepancy between the discriminability distributions of encoding channels $c$ and $c'$ (mapping the same field) for the source and target (i.e., $B_{S.c}$ and $B_{T.c'}$, respectively), we compute Earth Mover’s Distance [Villani2009] (EMD or Wasserstein distance). We use EMD, which measures the minimum cost to transform one distribution into another, because it is non-parametric, symmetric, and unbounded. An EMD of 0 is the identity, and the greater the EMD is, the more different the two distributions are. Thus, the comparison loss between the source view and a target view is the sum of the EMD between their discriminability distributions in each encoding channel, formalized in Equation 6.
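As a rough illustration of the comparison loss, the sketch below builds the pairwise distance distribution for each channel and compares source and target with SciPy's `wasserstein_distance`. The dict layout, the `channel_map` argument (which source channel corresponds to which target channel), and the use of plain absolute difference for every channel are simplifying assumptions for this example.

```python
from itertools import combinations

from scipy.stats import wasserstein_distance


def discriminability(values, distance=lambda a, b: abs(a - b)):
    """Distances between every unordered pair of rendered values in one channel."""
    return [distance(a, b) for a, b in combinations(values, 2)]


def comparison_loss(source, target, channel_map):
    """Sum of EMDs between source and target discriminability distributions.

    `channel_map` (hypothetical) maps a source channel to the target channel
    that encodes the same data field, e.g. {"x": "y"} after transposing axes.
    """
    total = 0.0
    for src_channel, tgt_channel in channel_map.items():
        total += wasserstein_distance(
            discriminability(source[src_channel]),
            discriminability(target[tgt_channel]),
        )
    return total


source = {"x": [0, 100, 200]}
halved = {"x": [0, 50, 100]}        # same data, half the width
transposed = {"y": [0, 100, 200]}   # same pixel distances on the other axis

loss_halved = comparison_loss(source, halved, {"x": "x"})
loss_transposed = comparison_loss(source, transposed, {"x": "y"})
```

Halving the width shrinks every pairwise distance, so the EMD is positive, whereas the transposed view keeps all pixel distances and yields zero loss.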

Figure 6: Responsive transformations motivating trend loss.

3.3 Trend Loss

Responsive transformations like disproportionate rescaling and changes to binning may impact the implied relationship (or trend) between two or more variables represented in a target view compared to the source view (affecting low-level trend identification [Brehmer2013]). As shown in Figure 6, different aspect ratios can alter the magnitude of the slope of a trend, and modifying bin size affects the amount of distributional information available. We use trend loss to refer to changes in the implied trend from the source to a target.

Figure 7: Components of computing trend loss. (a) Calculating area between curves by standardizing chart size and interpolating break points. (b) Dividing and matching subgroups. (c) Linearizing color scale. LS is large screen, and SS is small screen.

Figure 8: Prototype pipeline. (1) The full specification of an input source view in ASP. (2) Enumerating targets by extracting a partial specification of the source view and generating a search space using an ASP solver. (3) Evaluating targets by computing our loss measures and ranking them using a model trained on human-produced rankings. (4) Ranked targets.

To capture representative data patterns while avoiding influences of noise, our trend loss first estimates trend models between the source and target views using LOESS. We then compare the area (or volume) of the estimated trends because it is more sensitive to details that simpler methods (e.g., the difference between regression coefficients) might ignore. We define trend models for the quantitative encoding channels in our scope (position, color and size):

  • $y \sim x$: a 2D trend of y on x as appears in a simple scatterplot, line chart, or bar graph.

  • $\mathrm{color} \sim x + y$: a 3D trend of color on x and y (e.g., a heatmap or a scatterplot with a continuous color channel)

  • $\mathrm{size} \sim x + y$: a 3D trend of size on x and y (e.g., a scatterplot with a continuous size encoding)

After calculating trend models for a source and target, we can define trend loss as the sum of the relative area between curves (or volume between surfaces) of the estimated trends in each trend model ($m$). This is formalized as:

$$\mathrm{Loss}(S \rightarrow T; \mathrm{trend}) = \sum_{m} A(m_S, m_T) \tag{12}$$

where $A(m_S, m_T)$ stands for the relative area between curves (ABC) between the source and target trends ($m_S$ and $m_T$), normalized by dividing by the area under the curve of the source trend for a 2D model. For a 3D model, $A(m_S, m_T)$ is the relative volume between surfaces (VBS), which is the VBS of the source and target trends divided by the volume under the surface of the source trend.

We estimate the trend models using LOESS regression [cleveland1979] as it is non-parametric. We use uniform weights and bandwidth of 0.5 [cleveland1979]. LOESS regression returns an estimate at each observed value of the independent variable(s) (as an array of coordinates): an estimated curve for a 2D model and an estimated surface for a 3D model. Thus, when source and target views have different chart sizes or different sets of rendered values for the independent variable(s), it is difficult to directly compare the LOESS estimations. As shown in Figure 7a, we first standardize the chart sizes of two views by rescaling an estimated LOESS curve or surface in a target view to have the same chart width as the source. Then, we interpolate the LOESS curve to have equal distances between two consecutive coordinates for a 2D model (Figure 7b). We interpolate on 300 breakpoints in a 2D model by default, where one breakpoint corresponds to one to three pixels in many Web-based visualizations. For a 3D model, we interpolate $300 \times 300$ breakpoints from a LOESS surface in a similar way. Given these interpolations for the LOESS curves (or surfaces) in the source and target, we obtain the ABC (or VBS) segment at each breakpoint.
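The 2D case of this procedure can be sketched with NumPy alone. Several pieces are assumptions made for the sake of a runnable example: the smoother is a simplified uniform-weight local linear regression standing in for the LOESS package, the target curve is rescaled uniformly (x and y by the same factor) so that its chart width matches the source, y values are treated as heights above the baseline, and the dict layout with `x`, `y`, and `width` keys is hypothetical.

```python
import numpy as np


def local_linear_smooth(x, y, bandwidth=0.5):
    """Uniform-weight local linear regression (a simplified stand-in for LOESS).

    For each x, fit a straight line to the nearest `bandwidth` fraction of
    points and evaluate that line at x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    k = max(2, int(np.ceil(bandwidth * len(x))))
    fitted = np.empty_like(y)
    for i, x0 in enumerate(x):
        nearest = np.argsort(np.abs(x - x0))[:k]
        slope, intercept = np.polyfit(x[nearest], y[nearest], 1)
        fitted[i] = slope * x0 + intercept
    return x, fitted


def standardize(x, y, chart_width, reference_width, n_breakpoints=300):
    """Rescale a curve to the reference chart width and resample it evenly."""
    scale = reference_width / chart_width  # assumption: uniform scaling of x and y
    grid = np.linspace(0.0, reference_width, n_breakpoints)
    return grid, np.interp(grid, x * scale, y * scale)


def trapezoid_area(x, y):
    """Trapezoidal-rule area under y over x."""
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))


def relative_abc(source, target):
    """Area between the two smoothed curves divided by the area under the source."""
    sx, sy = local_linear_smooth(source["x"], source["y"])
    tx, ty = local_linear_smooth(target["x"], target["y"])
    grid, s = standardize(sx, sy, source["width"], source["width"])
    _, t = standardize(tx, ty, target["width"], source["width"])
    return trapezoid_area(grid, np.abs(s - t)) / trapezoid_area(grid, np.abs(s))


xs = np.linspace(0.0, 600.0, 21)
source = {"x": xs, "y": 0.5 * xs, "width": 600}             # 600 x 300 chart
proportional = {"x": xs / 2, "y": 0.25 * xs, "width": 300}  # 300 x 150 chart
stretched = {"x": xs / 2, "y": 0.5 * xs, "width": 300}      # 300 x 300 chart

loss_proportional = relative_abc(source, proportional)
loss_stretched = relative_abc(source, stretched)
```

Under these assumptions a proportionately rescaled target reproduces the source curve and incurs essentially no loss, while the target with a changed aspect ratio doubles the slope and yields a relative ABC of about 1.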

Subgroups: When a nominal variable encoded by color or shape divides the data set into subgroups, viewers might naturally consider each subgroup’s trend independently. To distinguish trends implied by subgroups, we first identify and match subgroups which occur in both the source and target views by looking at their nominal data values, as depicted in Figure 7b. Then, we compute the relative ABC (or VBS) of each subgroup and combine them by taking their average.
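The subgroup handling amounts to matching groups by their nominal value and averaging a per-group loss. The sketch below shows only that matching and averaging step; `loss_fn` stands for whichever relative ABC or VBS computation is used, and the toy loss and single-number "trends" in the example are placeholders, not part of the original method.

```python
def subgroup_trend_loss(source_groups, target_groups, loss_fn):
    """Average `loss_fn` over subgroups that occur in both views.

    `source_groups` and `target_groups` are hypothetical dicts keyed by the
    nominal data value that defines each subgroup.
    """
    shared = sorted(source_groups.keys() & target_groups.keys())
    if not shared:
        raise ValueError("no subgroup appears in both views")
    losses = [loss_fn(source_groups[g], target_groups[g]) for g in shared]
    return sum(losses) / len(losses)


def toy_loss(src, tgt):
    """Placeholder: relative difference between two single-number 'trends'."""
    return abs(src - tgt) / abs(src)


source_groups = {"Asia": 4.0, "Europe": 2.0, "Africa": 5.0}
target_groups = {"Asia": 3.0, "Europe": 2.0}  # "Africa" was filtered out
avg = subgroup_trend_loss(source_groups, target_groups, toy_loss)
```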

Color scale linearization: Although a continuous color scale encodes a unidimensional vector, color is often modeled on a multi-dimensional space (e.g., RGB, CIELAB), which makes it complex to estimate a LOESS surface. Similar to how common color schemes such as viridis or magma are designed to be perceptually uniform by keeping equi-distance in a perceptual color space between two consecutive color points [mpl], we can make use of the Euclidean distance between rendered color values in CIELAB to linearize a 3D color scheme. Specifically, we recursively accumulate the distances between each consecutive pair of rendered values to create a unidimensional vector. In Figure 7c, we show how the linear value of the $i$-th color point is computed from that of the $(i-1)$-th point; we take the calculated value of the $(i-1)$-th point and add to it the distance between the $(i-1)$-th and $i$-th points. The first color point is assigned zero.
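The accumulation step is a running sum of consecutive CIELAB distances. A minimal sketch, assuming the color points are already ordered along the scale and given as (L*, a*, b*) triples; the function name and the sample colors are illustrative only.

```python
import math
from itertools import accumulate


def linearize_color_scale(lab_points):
    """Map ordered CIELAB points to cumulative distances along the scale.

    The first point is assigned zero; each later point adds the Euclidean
    distance from its predecessor.
    """
    steps = [math.dist(a, b) for a, b in zip(lab_points, lab_points[1:])]
    return list(accumulate(steps, initial=0.0))


# Illustrative points, not taken from any real color scheme.
scale = [(50, 0, 0), (50, 3, 4), (50, 3, 16)]
linear_values = linearize_color_scale(scale)
```

Here the consecutive distances are 5 and 12, so the three points map to 0, 5, and 17.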

Figure 9: (a, b) Example target transformations enumerated by our prototype responsive visualization recommender (total size of search space per source given as #Targets). indicates rankings of each five targets per source view predicted by our best model (see subsection 5.3). (c) Source visualizations for our user study (also includes a and b). Source views have a width of 600 px and a height of 300 px. The width of targets is fixed at 300 px. Data sets are from Our World in Data [owid-economy, owid-health, owid-life]. Continuous, Nominal, Temporal, Identification loss, Comparison loss, and Trend loss.

4 Prototype Responsive Visualization Recommender

To implement our task-oriented insight preservation measures, we developed a prototype responsive visualization recommender that enumerates and evaluates responsive designs (or targets). As shown in Figure 8, given an input source (large screen) view, our recommender first converts it to a partial specification, and then generates a search space of small screen targets based on the partial specification. We adopt the desktop-first approach that visualization authors have described using [hoffswell2020, kim2021]. Finally, the recommender computes our measures between the source view and each target to rank those targets using an ML model trained on human-labeled rankings.

4.1 Enumerating Target Views

To enumerate target views, we need a formal grammar for representing visualization specifications and formulating a search space. We use Answer Set Programming (ASP) [Brewka2011], particularly by modifying Draco [draco]. ASP is a declarative programming language for complex search problems (e.g., satisfiability problems) that encodes knowledge as facts, rules, and constraints. Rules generate further facts, and constraints prevent certain combinations of facts. Formalized in ASP, for example, Draco has a rule that if an encoding is binned, then it is discrete, and a constraint that disallows logarithmic scale on a discrete encoding [draco]. A constraint solver then solves an ASP program (the partial specification of a source view and our search space), returning stable sets of non-conflicting facts (enumerated target views with different transformation strategies). We use Clingo [gebser2014, gebser2011] as our solver.

Converting to a partial specification: Our recommender converts the full specification of an input source view to a partial specification to allow applying responsive transformation strategies. We maintain the data specification (data file, data field definitions, and the number of rows) and encoding information (e.g., count aggregation, association of data field) that are not changed under transformation. We indicate the rest of the specification (mark type, chart size, and encoding channels) as information about the source view to constrain responsive transformation strategies (e.g., constraining possible mark type replacement, allowing for swapping position encodings for axis-transpose).

Generating a search space: Our goal in generating a search space is to produce a set of reasonable targets that a responsive visualization author might consider given a source view. We generate a search space by automatically applying responsive visualization transformations recently observed in an empirical study of common responsive visualization design strategies [kim2021] to a source visualization. Our prototype implements rescaling, aggregation, binning, transposing, and select changes to marks and encodings. For rescaling, we fix the width of target views and vary heights, in the range from the height resulting from proportionate rescaling to the height that forms the inverse aspect ratio, with an increment of 50 px. For example, if the source view has a width of 600 px and a height of 300 px (an aspect ratio of 2:1) and the width of target views is fixed at 300 px, then the height varies from 150 px (2:1) to 600 px (1:2) by 50 px (i.e., 150, 200, …, 550, 600 px). Given a disaggregated source view, we generate alternatives by applying binning (max bin buckets of 25, 15, and 5) and aggregation (count, mean, median, sum) as graphical density adjustment strategies. We also generate alternatives by transposing axes (i.e., swapping x and y position channels). Finally, in line with the observation of prior work that responsive visualization authors occasionally substituted mark types when adding an encoding channel for aggregation, we allow a mark type change in scatterplots from a point mark to a rectangle (heatmap). We formulate these strategies in ASP format and add them to Draco [draco].
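The rescaling rule and the way strategies multiply into a search space can be illustrated in Python. This is only a toy enumeration: the prototype formulates these strategies as ASP rules and constraints in Draco, whereas the unconstrained Cartesian product, the function name, and the dictionary keys below are assumptions for illustration.

```python
from itertools import product


def candidate_heights(src_width, src_height, tgt_width, step=50):
    """Heights from proportionate rescaling up to the inverse aspect ratio."""
    proportional = round(tgt_width * src_height / src_width)
    inverse = round(tgt_width * src_width / src_height)
    low, high = sorted((proportional, inverse))
    return list(range(low, high + 1, step))


heights = candidate_heights(600, 300, 300)

# Toy search space: no constraints are applied, unlike an ASP-based enumeration.
transposed_options = [False, True]
max_bins_options = [None, 25, 15, 5]  # None means no binning
toy_targets = [
    {"width": 300, "height": h, "transposed": t, "max_bins": b}
    for h, t, b in product(heights, transposed_options, max_bins_options)
]
```

For the 600 px by 300 px source in the text, this produces the ten heights 150, 200, …, 600 px, and the toy product contains 80 combinations before any constraint would prune them.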

4.2 Evaluating and Ranking Targets

To evaluate enumerated targets, we calculate our loss measures on rendered values after rendering source and target views using Vega-Lite [Satyanarayan2017vegalite]. We obtain the rendered values of a visualization by gleaning Vega [vega] states (a set of raw rendered values [vegaState]). We implemented the loss measures in Python using SciPy [2020SciPy-NMeth]’s stats.entropy and stats.wasserstein_distance methods for entropy and EMD, respectively. To compute LOESS regression, we use the LOESS package [pyLoess]. Finally, to rank the enumerated targets, we combine the computed loss values by training ML models, which we detail in section 5. We use ML models for ranking instead of formalizing them in ASP because our measures are not declarative (not rule-based).

4.3 Examples

We introduce two example cases of transformations generated by our prototype and describe how our measures distinguish target views.

4.3.1 Case 1: Simple scatterplot

In the source scatterplot (Figure 9a), each point mark represents a country, and x and y positions encode Gini coefficients and annual growth rate of GDP per capita of different countries, respectively. The first example transformation (Ta1) is simple resizing. The second target view (Ta2) is transposed from the source view while keeping the size. The third and fourth target views (Ta3 and Ta4) are resized, binned in x and y scales, and aggregated by count, so the size of each dot represents the number of data points in the corresponding bin bucket. In the fifth target (Ta5), the mark type is changed from point to rectangle in addition to resizing, binning, and aggregating, and the color of each rectangle encodes the number of data points in that cell.

Because Ta1 and Ta2 perfectly preserve the number of identifiable rendered values, identification loss is zero. Ta4 and Ta5 have more identifiable points than Ta3 (due to their smaller bin size), so they have smaller identification loss. While Ta1 has disaggregated values, Ta4 better preserves the distances between points in terms of position encoding, so it has smaller comparison loss. Compared to the source view, the implied trend given x and y positions in Ta1 has a more similar slope and hence smaller trend loss than Ta2, whereas Ta2 preserves the differences in the position encodings, resulting in zero comparison loss. Similarly, Ta3 has a smaller trend loss than Ta4 because Ta3 better preserves the visual shape of the distribution in the source view.

4.3.2 Case 2: Histogram

The source histogram in Figure 9b shows the distribution of GDP per capita of different countries. There are 23 bins along the x axis and each bar height (y position) represents the number of countries in the corresponding bin. The first target view (Tb1) is resized. The second and third target views (Tb2 and Tb3) are transposed with different resizing. In the fourth and fifth target views (Tb4 and Tb5), the number of bins is changed from 23 to 10 and 5, respectively, with Tb5 transposed.

As Tb1, Tb2, and Tb3 have no changes in binning, they have zero identification loss, whereas Tb4 and Tb5 have greater identification loss proportional to their bin sizes. While Tb1, Tb2, and Tb3 have the same binning, Tb3 has the most similar differences between bar heights and bar intervals in pixel space, so it has the smallest comparison loss among them. Transposing axes (Tb3) better preserves the resolution for comparison (i.e., chart height and width), often resulting in a smaller comparison loss than other similarly transformed targets. Tb5 has smaller trend loss than Tb4 as it shows a similar aspect ratio to the source view, though inverted, as implied by x and y positions.

5 Model Training and Evaluation

A responsive visualization recommender should combine loss measures to rank a set of targets by how well they preserve task-oriented insights. For our prototype recommender, we train machine learning models to efficiently combine our loss measures and rank enumerated targets. We describe training data collection, model specification, and results.

5.1 Labeling

We obtained training and test data consisting of ranked target views for a set of source views using a Web-based task completed by nine visualization experts. As shown in Figure 10a, each labeler was assigned one out of three trial sets and performed 36 trials, with each trial asking them to rank five target transformations (small screen) given a source visualization (large screen).

Task materials: To create instances for labeling, we selected six desktop visualizations (source views) as shown in Figure 9c. Our goal was to include different chart types, multiple encoding channels for identification and comparison losses, and different types of examples for trend loss (e.g., 2D/3D models, subgroups, color scale linearization). Our prototype generates 60 to 620 target transformations (2,120 in total) for these six source views. We generated three sets of 30 target views per source view for labeling, using quintile sampling per preservation (loss) measure, to ensure relatively diverse sets of targets. After sorting targets in terms of each of our three measures, we sampled two targets from each quintile of the top 100 targets per measure, as depicted in Figure 10c. We took the top 100 targets after inspecting the best ranked views per measure for each source view, to avoid labeling examples that might be obviously inferior. Because identification loss is measured using entropy and is primarily affected by how data are binned, certain source views had fewer than five unique discrete values within top 100 targets. In this case, we proportionately sampled each discrete value.

After sampling 30 targets for a source view in a trial set, we randomly divided them into six trials (but fixed these trials between labelers in the same trial set), so we had 1,080 pairs labeled by three people each. Of these, 1,077 pairs were unique: each source view had 180 pairs, but the histogram source view, with 60 transformations, had 177 unique pairs. We randomly assigned each trial set to labelers (3 trial sets (between) × 6 source views × 30 targets (within), Figure 10a).
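The quintile sampling described above can be sketched as follows, under the assumption that each target carries one scalar value per loss measure. The function name, the dictionary layout, and the made-up loss values are hypothetical; sampling two targets from each of five quintiles for each of three measures yields the 30 targets per source view mentioned in the text.

```python
import random


def quintile_sample(targets, loss_of, top_n=100, per_quintile=2, seed=0):
    """Sample targets evenly across quintiles of the top_n best-ranked by one loss."""
    rng = random.Random(seed)
    ranked = sorted(targets, key=loss_of)[:top_n]  # smaller loss ranks first
    size = len(ranked) // 5
    sample = []
    for q in range(5):
        # The last quintile absorbs any remainder when top_n is not divisible by 5.
        end = (q + 1) * size if q < 4 else len(ranked)
        sample.extend(rng.sample(ranked[q * size:end], per_quintile))
    return sample


# Hypothetical targets, each with a single made-up loss value.
targets = [{"id": i, "trend_loss": (i * 37 % 211) / 211} for i in range(200)]
picked = quintile_sample(targets, loss_of=lambda t: t["trend_loss"])
print([t["id"] for t in picked])
```

This sketch does not handle the case noted above in which a measure has fewer than five distinct values among the top 100 targets; there the text samples proportionately per discrete value instead.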

Labelers: All five authors, who have considerable background in visualization design and evaluation, and an additional convenience sample of four visualization experts (representing postdoctoral researchers and graduate students in visualization) participated in labeling. All labelers worked independently.

Labeling task: Each labeler was asked to imagine that they were a visualization designer for a responsive visualization project, tasked with ranking a set of small screen design alternatives created by transforming the source. Their goal was to consider what would be an appropriate small screen design that would also preserve insights or takeaways conveyed in the desktop version as much as possible.

The study interface is shown in Figure 10b. Each labeler completed 36 trials (6 desktop visualizations × 6 sets of 5 smartphone design candidates). In each trial, the desktop visualization and five smartphone design candidates were shown, and the labeler ranked the candidates by dragging and dropping them into an order. Trial order was randomized.

Figure 10: (a) Study design. (b) Task interface. (c) Quintile sampling of targets for task materials.

Aggregating labels: From the task, we collected human-judged rankings of 1,080 pairs, each of which was ranked by three labelers. To produce our training data set, we aggregated the three labels obtained for each pair into a single label representing the majority opinion, such that the label for a pair indicates which of its two targets is more likely to appear higher in the ranking.

(13)

To avoid a biased distribution of training data as well as minimize the ordering effect within each pair, we randomized the order of targets within pairs so that each of the two label values accounts for half of the pairs, which naturally sets the baseline training accuracy to 50%.
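A minimal sketch of the aggregation and balancing steps is given below. The label encoding (1 when the first target of a pair ranks higher, 0 otherwise) and both function names are assumptions for illustration, since the exact label values are not recoverable from the text.

```python
import random
from collections import Counter


def majority_label(votes):
    """Return the option chosen by most labelers (three votes, two options, no ties)."""
    return Counter(votes).most_common(1)[0][0]


def balance_pairs(pairs, seed=0):
    """Orient pairs so that half carry label 1 and the other half label 0.

    Each input pair is (winner, loser) according to the majority vote.
    Label 1 means the first element ranks higher; label 0 means the second does.
    """
    rng = random.Random(seed)
    order = list(range(len(pairs)))
    rng.shuffle(order)
    flip = set(order[: len(pairs) // 2])  # randomly chosen half to swap
    balanced = []
    for i, (winner, loser) in enumerate(pairs):
        balanced.append((loser, winner, 0) if i in flip else (winner, loser, 1))
    return balanced


# Three hypothetical labelers judge which of targets "A" and "B" ranks higher.
print(majority_label(["A", "B", "A"]))
balanced = balance_pairs([(f"win{i}", f"lose{i}") for i in range(10)])
print(sum(label for _, _, label in balanced), "of", len(balanced), "pairs have label 1")
```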

5.2 Model Description

Prior approaches to visualization ranking problems (e.g., Draco-Learn [draco], DeepEye [luo2018]) utilize ML methods that convert the ranking problem to a pairwise ordering problem, such as RankSVM (Support Vector Machine) [Herbrich1999] and the learning-to-rank model [burges2005]; we adopt a similar approach. A model takes as input a pair of objects and returns their order (i.e., which of the two ranks higher).

(14)

Here, a mapping function combines the features from a pair of objects. We consider vector difference and concatenation as mapping functions. Our models take two target views representing transformations of the same source view and return the one with higher predicted ranking, as depicted in Figure 8.3.
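The two mapping functions considered here can be sketched in a few lines. The function names and the example loss values are illustrative assumptions; the vectors stand for the per-target features of the two targets in a pair.

```python
def difference(features_a, features_b):
    """Element-wise difference of two equally long feature vectors."""
    return [a - b for a, b in zip(features_a, features_b)]


def concatenate(features_a, features_b):
    """Both feature vectors placed side by side, doubling the feature count."""
    return list(features_a) + list(features_b)


# Hypothetical (identification, comparison, trend) losses for two targets.
target_a = [0.5, 0.25, 1.0]
target_b = [0.25, 0.25, 0.5]

print(difference(target_a, target_b))   # one row with 3 features
print(concatenate(target_a, target_b))  # one row with 6 features
```

Difference keeps the number of features fixed and encodes only relative loss, whereas concatenation preserves the absolute loss of each target at the cost of twice as many features.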

Features: We define a feature matrix in which each row corresponds to a pair of target visualizations and the columns represent the features (converted by the mapping function). We use our proposed loss measures as the features (Table 1). Aggregated features refer to our three loss measures: identification, comparison, and trend loss, as described in Equations 5, 6, and 12 (section 3). Disaggregated features refer to the components of the aggregated features (e.g., the EMD value in each encoding channel for comparison loss). We standardized all features.

Table 1: The set of features for our ML models by each chart type. These features are either concatenated or differentiated for each pair of targets. Aggregated features are the sum of the corresponding disaggregated features. Pink, bold-bordered circles represent required features, and yellow, light-bordered circles optional encoding-specific features.

Model training: We train SVM with a linear kernel, K-nearest neighbors (KNN), logistic regression, decision tree (DT), and a Multilayer Perceptron (MLP) with four layers and 128 perceptrons per layer, similar to other recent applications of ML in data visualization (e.g., Hu et al [hu2019], Luo et al [luo2018]). We also train ensemble models of DTs: random forest (RF) with 50 and 100 estimators, Adaptive Boosting (AB), and gradient boosting (GB). Given the moderate number of observations (1,067) in our data set, we use leave-one-out (LOO) as a cross validation iterator to obtain robust training results. We used Scikit-Learn [scikit] for training.
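The training setup can be illustrated with a small scikit-learn sketch. The synthetic feature matrix and labels below are placeholders, not the study data, and only the random forest is shown; the sketch demonstrates how LeaveOneOut can serve as the cross validation iterator for a pairwise classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the pairwise feature matrix: 60 pairs, 3 features
# (e.g., differences in identification, comparison, and trend loss).
n_pairs = 60
X = rng.normal(size=(n_pairs, 3))
# Synthetic labels: 1 if the first target of the pair "ranks higher".
y = (X @ np.array([1.0, 2.0, 1.5]) > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
# With LOO, each fold holds out one pair, so every score is either 0 or 1.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
accuracy = scores.mean()
print(f"LOO accuracy over {n_pairs} pairs: {accuracy:.2f}")
```

LOO fits one model per observation, which is affordable for a data set of about a thousand pairs but would become costly for much larger ones.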

Baselines: In addition to the natural baseline of 50% (random), we include two simple heuristic-based baselines to evaluate the performance of our models. The first baseline (1) includes the changes in chart width and height between a target and its source, capturing an intuition about maintaining size and aspect ratio. The second baseline (2) is whether x and y axes are transposed, capturing an intuition that, of the strategies in our search space, transposing is the most drastic change.

Figure 11: (a) Joint distributions of rankings of target views in each pair of aggregated features (the source visualization is shown in Figure 9a). (b) Kendall rank correlation coefficients for targets of our source views in Figure 9. Cmp (comparison).
Table 2: (a) Prediction accuracy of our models, averaged over LOO cross validation. Other performance measures (AUC score and F1-score) appeared similar to accuracy. (b) Average importance of disaggregated features (mapping function = difference) measured by impurity-based importance from training a random forest model (50 estimators) 10 times.

5.3 Results

All the experimental materials and files used for analysis are included in the supplementary materials, available at https://osf.io/jcvbx.

5.3.1 Rank correlation between loss measures

To ensure that our loss measures capture different information about transformations, we compute and inspect rank correlations between each pair of aggregated measures and each pair of disaggregated measures. If two different loss measures produce highly similar rankings of target views, then one of them might be redundant. Our measures tend to be orthogonal to each other; Figure 11b reports the Kendall rank correlation coefficients [kendall1948rank] between them. The same pattern is observed for the disaggregated measures (see supplementary material).
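As an illustration of this redundancy check, the sketch below computes Kendall rank correlations between every pair of loss measures using scipy.stats.kendalltau. The loss values and the helper name are made up for the example.

```python
from itertools import combinations

from scipy.stats import kendalltau


def rank_correlations(losses):
    """Kendall's tau for every pair of loss measures over the same targets."""
    result = {}
    for (name_a, a), (name_b, b) in combinations(losses.items(), 2):
        tau, _p_value = kendalltau(a, b)
        result[(name_a, name_b)] = tau
    return result


# Hypothetical loss values for five targets of one source view.
losses = {
    "identification": [0.10, 0.40, 0.20, 0.80, 0.60],
    "comparison": [0.50, 0.10, 0.90, 0.30, 0.70],
    "trend": [0.30, 0.20, 0.70, 0.10, 0.90],
}
taus = rank_correlations(losses)
for pair, tau in taus.items():
    print(pair, round(tau, 2))
```

Coefficients near zero suggest that two measures order the targets differently and therefore carry complementary information, while values near 1 or -1 would flag a potentially redundant measure.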

When the chart type of a source view allows only a few responsive transformation strategies due to its own design constraints (e.g., line chart, heatmap), the correlation between measures appears slightly higher than for the other chart types. For example, it is often impossible to add a new encoding channel through aggregation or binning in a line chart. This makes the line chart more sensitive to chart size changes, resulting in relatively higher negative rank correlation between comparison and trend loss. Similarly, different binning levels in a heatmap can affect both one’s ability to identify data points in different encoding channels and to recognize a trend implied by x and y on color channels, leading to a slightly higher positive rank correlation between identification and trend loss.

5.3.2 Monotonicity

Ranking problems through pairwise comparison assume that the partial rankings used as input are consistent with the full ranking (monotonicity of rankings) [jamieson2011active, Herbrich1999]. In other words, we need to ensure that the partial pairwise rankings that we calculate based on aggregated expert labels can yield a monotonic full ranking. Comparison sorting algorithms can be used to determine whether a monotonic full ranking can be obtained from pairwise rankings, as a comparison sort will only result in a monotonic ranking if the principles of transitivity (if a ranks above b and b ranks above c, then a ranks above c) and connexity (for any two items a and b, either a ranks above b or b ranks above a) hold.

To confirm whether our expert labels satisfy the monotonicity assumption, we first sort the five target views in each of our 108 trials, using the ten aggregated pairwise rankings as a comparison function. Next, we check whether each consecutive pair in the reproduced ordering conflicts with the aggregated expert labels, because if a pair in the reproduced ordering is not aligned with the aggregated label, that trial violates the monotonicity assumption. 102 out of 108 trials (94.4%) in our data set had fully monotonic orderings. Of the six orderings which are not fully monotonic, five are partially monotonic with only one misaligned pair each (out of the ten ordered pairs). The other non-monotonic ordering (a trial with line chart as the source view) had multiple conflicts; we dropped this ordering from our training data, resulting in 1,070 training pairs (1,067 unique training pairs).
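The monotonicity check can be sketched with a comparison sort driven by the pairwise labels. The data layout (a dictionary from unordered pairs to the preferred target) and the function name are assumptions; this sketch counts disagreements over all pairs of a trial rather than only consecutive ones, which is a slightly stricter variant of the check described above.

```python
from functools import cmp_to_key
from itertools import combinations


def count_conflicts(targets, prefers):
    """Sort targets by pairwise labels, then count labels the ordering violates.

    prefers maps frozenset({a, b}) to whichever of a and b is ranked higher.
    Zero conflicts means the pairwise labels form a monotonic full ranking.
    """
    def compare(a, b):
        return -1 if prefers[frozenset((a, b))] == a else 1

    ordering = sorted(targets, key=cmp_to_key(compare))
    position = {t: i for i, t in enumerate(ordering)}
    conflicts = 0
    for a, b in combinations(targets, 2):
        higher = a if position[a] < position[b] else b
        if prefers[frozenset((a, b))] != higher:
            conflicts += 1
    return ordering, conflicts


# Consistent labels derived from a hypothetical true ranking of five targets.
truth = ["T3", "T1", "T5", "T2", "T4"]
consistent = {frozenset((a, b)): a for a, b in combinations(truth, 2)}
ordering, conflicts = count_conflicts(sorted(truth), consistent)
print(ordering, conflicts)

# Cyclic labels (A over B, B over C, C over A) cannot form a monotonic ranking.
cyclic = {frozenset("AB"): "A", frozenset("BC"): "B", frozenset("CA"): "C"}
_, cyclic_conflicts = count_conflicts(["A", "B", "C"], cyclic)
print(cyclic_conflicts)
```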

5.3.3 Training results

Model performance: Overall, our models with disaggregated and aggregated features achieved prediction accuracy greater than 75% (Table 2a), showing the utility of our measures in ranking responsive design transformations. Ensemble models (RF, AB, and GB) resulted in the highest overall accuracy (above 81%) because they iterate over multiple different models and we have a relatively small number of features. In particular, RF with 100 estimators showed the highest accuracy of 84%. Our neural network model (MLP) also provided comparable performance to the ensemble models.

Models with disaggregated features in general obtained higher accuracy than those with aggregated features, and combining them did not provide a significant gain in accuracy. Although they had only three features, our models with aggregated features showed reasonable accuracy under both the concatenation and difference mapping functions (Table 2a). For mapping functions, concatenation performed slightly better than difference for our best performing models (RF).

Our models all outperformed those trained on either set of baseline features, indicating that our loss measures capture information that simple heuristics, such as changes in chart size or axes transposition, are unable to capture. When we trained the best performing model (RF) with the features of only a single loss criterion (e.g., only trend), accuracy ranged from 52.6% to 79.7%, implying that our measures are more useful when combined than when used individually. As hypothetical upper bounds for accuracy, training and testing the model on the same data set resulted in accuracy from 84% (KNN) to 100% (RF).

Feature importance: To understand how the different loss measures function in our models, we inspected the importance of each disaggregated feature (mapping function = difference) using the impurity-based importance measure (average information gain) by training a random forest model with 50 estimators (averaged over 10 training iterations). As shown in Table 2b, features related to position encodings in general seem to have higher importance, which makes sense given their ubiquity in our sample.
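The averaging of impurity-based importances can be sketched as follows, using the feature_importances_ attribute of scikit-learn's RandomForestClassifier. The synthetic data and feature names are placeholders; the first feature is constructed to drive the label so that the expected pattern is easy to see.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic pairwise features; names are placeholders, not the paper's features.
feature_names = ["f0", "f1", "f2", "f3"]
X = rng.normal(size=(200, len(feature_names)))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)  # f0 dominates the label

# Train 10 forests with different seeds and average their importances.
runs = [
    RandomForestClassifier(n_estimators=50, random_state=seed)
    .fit(X, y)
    .feature_importances_
    for seed in range(10)
]
importances = np.mean(runs, axis=0)

for name, value in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name}: {value:.3f}")
```

Impurity-based importances sum to one within each forest and can favor features with many distinct values, so they are best read as a relative ordering rather than as effect sizes.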

Predicted rankings of example cases: Using the best prediction model (RF with 100 estimators), we predicted the rankings of example cases described in subsection 4.3. Transformations from the simple scatterplot example (Figure 9a) are ranked as: Ta1 (resizing), Ta2 (transposing axes), Ta4, Ta3 (binning, resizing, aggregation), and Ta5 (binning, resizing, aggregation, mark type change). Ta1 appears higher in ranking than Ta2 because Ta2 has higher trend loss, while Ta1 has slightly higher comparison loss. Ta4 is ranked in a higher position than Ta5 because Ta5 has higher comparison and trend loss. Responsive transformations from the histogram example (Figure 9b) are ranked as: Tb2 (transposing axes, resizing), Tb3 (transposing axes, resizing), Tb1 (resizing), Tb5 (resizing, changing bin size), and Tb4 (resizing, changing bin size). The transposed views (Tb2 and Tb3) are ranked higher than Tb1, probably because the model places more emphasis on comparison loss, as the feature importance (Table 2b) shows. The ordering between Tb5 and Tb4 can be explained by the smaller trend loss of Tb5, while the difference in comparison loss between them appears subtle.

6 Discussion and Future Work

6.1 Extending and Validating Our Preservation Measures

We devised a small set of measures for three common low-level tasks in visualization use and found that they can be used to build reasonably well-performing ML models for ranking small screen design alternatives given a large screen source view. Our measures are not strongly correlated, and removing some of the measures results in lower predictive accuracy. However, there are other forms of prominent task-oriented insights that could extend our approach if approximated well, such as clustering data points or identifying outliers. As our measures lose information by processing rendered values, future work could estimate task-oriented insights with different methods, such as extracting and directly comparing image features from rendered visualizations.

There are also opportunities to strengthen and extend our measures through human subject studies. These include more formative research with mixed methods to understand heuristics and other strategies that visualization authors and users employ to reason about how well a design transformation preserves important takeaways. In addition, future work could conduct perceptual experiments that more precisely estimate human baselines for identification, comparison, and trend losses. We also used simple approximations of perceptual differences in position, size, and color channels which could be improved through new experiments specifically designed to understand how perception is affected on smaller screen sizes, adding to work like examining task performance on smaller screens by different chart types [brehmer2020] and comparing task performance between small and large screens [Blascheck2019, Blascheck2021]. A limitation is that our experiment was conducted on desktop devices. Future work could test on mobile devices, as well as explore mobile-first design contexts, as our measures are designed to be symmetric.

6.2 Responsive Visualization Authoring Tools

Our work demonstrates how task-oriented insight preservation can be used to rank design alternatives in responsive visualization. To do so, we formulated and evaluated our insight preservation measures on a search space representing common responsive visualization design strategies and mark-encoding combinations. However, our work should be extended in several important ways to support a responsive visualization authoring use case.

First, while more drastic encoding changes than those supported by our generator are rare in practice [kim2021], this might be because responsive visualization authoring is currently a tedious process and authors satisfice by exploring smaller regions of the design space. There are many strategies that could be added to a search space like the one we defined, and used to evaluate our measures as well as to learn more about how authors react when confronted with more diverse sets of design alternatives. For example, while we mainly consider single-view, static visualizations, many communicative visualizations employ multiple views and interactivity [Segel2010, Hullman2011, hullman2013deeper]. Ideally a responsive visualization recommender should be able to formulate related strategies (e.g., rearranging the layout of multiple views, omitting an interaction feature, editing non-data ink like legends). Recommenders may need to consider further conditions such as consistency constraints for multiple views [qu2016], effectiveness of visualization sequence [kim2017graphscape], semantics of composite visualizations [javed2012], and effectiveness of interactive graphical encoding [saket2018]. As indicated in Kim et al [kim2021], loss measures should be able to address concurrency of information because rearranging multiple views (e.g., serializing) can make it difficult to recognize data points at the same time on small screen devices. In addition, they should also account for loss of information that can only be obtained via user interaction (e.g., trend implied by filtered marks).

We envision our measures, and similar measures motivated to capture other task-oriented insights, being surfaced for an author to specify preferences on in a semi-automated responsive visualization design context. Because what our measures capture is relatively interpretable, authors may find it useful to customize them for certain design tasks, such as prioritizing one measure or changing how information is combined to capture identification, comparison, or trend loss. This is a strength of our approach relative to using a more “black-box” approach where model predictions might be difficult to explain.

6.2.1 Extending ML-based approaches

The human labelers in our experiment, including the authors, seemed to at times use strategies or heuristics such as preferring non-transposed views in their rankings or trying to minimize changes to aspect ratio for some chart types. However, models with our loss measures as features perform better than heuristic approaches like detecting axes-transpose and chart size changes, implying that task-oriented insights may be the right level at which to model rankings. As an extension, future work might learn pre-defined costs for different transformation strategies to reduce the time complexity of evaluating task-oriented insights preservation, similar to the approach adopted by Draco-Learn [draco] which obtained costs for constraint violation. Learning such pre-defined costs may also enable a better understanding of how each responsive design strategy contributes to changes in task-oriented insights. An alternative approach could be to use our loss measures as cost functions and optimize different strategies to reduce them, as MobileVisFixer [wu2020mobilevisfixer] fixes non-mobile-friendly visualizations for a mobile screen by minimizing heuristic-based costs. As recent deep learning models [wu2021, ma2020] have performed well in visualization ranking problems, future work may further elaborate on those models. In doing so, one could combine our measures with image features (e.g., ScatterNet [ma2020]) or chart parameters (e.g., aspect ratio, orientation [wu2021]).

As noted in subsubsection 5.3.2, there were a few partial and not fully monotonic orderings in our data set. A better model might ignore this assumption and try to identify highly recommendable transformations or classify them into multiple ordinal classes, yet this might come at the cost of lower interpretability of recommendations due to a lack of explicit ordinal relationship between transformations.

6.3 Generalizing Our Measures to Other Design Domains

Our approach to task-oriented insight preservation is likely to be useful in visualization design domains beyond responsive visualization, like style transfer and visualization simplification, although other domains may also require different transformation strategies. Style transfer, for instance, may involve techniques like aggregation but is more likely to change visual attributes of marks or references. While our loss measures are designed to be low-level enough to apply relatively generically to visualizations, their precise formulation and the combination strategy might warrant changes in other domains. For example, in visualization simplification, minimizing trend loss is likely to be more important than preserving identification and comparison of individual data points. Style transfer often focuses on altering color schemes, size scales, or mark types [Harper2014] which can result in different discriminability distributions, so it might put more emphasis on comparison loss.

7 Conclusion

Responsive visualization transformations often alter task-oriented insights obtainable from a transformed view relative to a source view. To enable automated recommenders for iterative responsive visualization design, we suggest loss measures for identification, comparison, and trend insights. We developed a prototype responsive visualization recommender that enumerates transformations and evaluates them using our measures. To evaluate the utility of our measures, we trained ML models on human-produced orderings that we collected, achieving accuracy of up to 84% with a random forest model.

Acknowledgements.
Jessica Hullman thanks NSF (#1907941) and Adobe.

References