How Do Captions Affect Visualization Reading?

by Shelly Cheng, et al.
Columbia University

Captions help readers better understand visualizations. However, if the visualization is intended to communicate specific features, should the caption be statistical, and focus on specific values, or perceptual, and focus on general patterns? Prior work has shown that when captions mention visually salient features, users tend to recall those features. Still, we lack explicit guidelines for how to compose the appropriate caption. Further, what if the author wishes to emphasize a less salient feature? In this paper, we study how the visual salience of the feature described in a caption, and the semantic level of the caption description, affect a reader's takeaways from line charts. For each single- or multi-line chart, we generate 4 captions that 1) describe either the primary or secondary most salient feature in a chart, and 2) describe the feature either at the statistical or perceptual levels. We then show users random chart-caption pairs and record their takeaways. We find that the primary salient feature is more memorable for single-line charts when the caption is expressed at the statistical level; for secondary salient features in single- and multi-line charts, the perceptual level is more memorable. We also find that many users will tend to rely on erroneous data in the caption and not double-check its veracity against the data in the chart.




1 Related Work

Various accessible media publishers have established general guidelines for people to write natural language captions [diagramcenter, GBH, CFPB, W3C], but those guidelines often lack rationales and support from empirical experiments [Jung2021CommunicatingVW]. This critique motivates our empirical study to better understand how to write effective data captions.

1.1 The Impact of Visual Salience on Reader Takeaway

Prior work has shown that the salience of the feature described by a caption affects what the reader takes away. Kim et al. [Kim2021TowardsUH] show that, for single-line charts, when the caption mentions a salient feature, reader takeaways more consistently mention the feature; when the caption mentions a less salient feature, reader takeaways are more likely to mention the most salient feature than the feature described in the caption. In our study, we investigate how visual salience and semantic level of a caption affect the reader’s takeaway. We also offer recommendations when practitioners want to emphasize less salient features.

1.2 Natural Language Models for Captions

The lack of rationale for caption guidelines raises the need to analyze captions systematically. While the three-level model by Kim et al. [Kim2021AccessibleVD] guides people on how to scaffold visualization information in order, Lundgard et al. [2022-vis-text-model] propose a concrete model that categorizes the content in a caption into four semantic levels: 1) elemental and encoded, 2) statistical and relational, 3) perceptual and cognitive, and 4) contextual and domain-specific (Table 1).

Table 1: Statistical and perceptual levels of Lundgard et al.’s [2022-vis-text-model] Model of Semantic Content.

Lundgard et al. [2022-vis-text-model] evaluate the effectiveness of each semantic level, and suggest that both sighted and visually impaired readers tend to regard semantic levels that communicate perceptual information (level 3) and statistical information (level 2) as useful to an extent. Although sighted users also ranked contextual level (level 4) highly, this level depends on the reader’s subjective knowledge about the world events (context). Our work teases out the nuances between captions that communicate data in the chart—in other words, the perceptual and statistical levels.

1.3 Evaluating Captions

Methods used in past empirical research to evaluate captions include semi-structured interviews [Jung2021CommunicatingVW], pre-constructed questions [Moraes2014EvaluatingTA, Kildal2006NonvisualOO], and takeaway recall [Kong2019TrustAR, Borkin2016BeyondMV, Kim2021TowardsUH].

Semi-structured interviews allow users to express their thoughts and preferences directly after reading captions [Jung2021CommunicatingVW]. While such methods can provide detailed and insightful qualitative results, it is challenging to derive generalizable quantitative results from them at scale.

Moraes et al. [Moraes2014EvaluatingTA] and Kildal et al. [Kildal2006NonvisualOO] ask users to listen to captions or data values and then answer pre-constructed overview questions. The time taken to answer those questions and the number of correct answers are used to evaluate the effectiveness of the caption or tool. This method is helpful for quantitatively evaluating captions that provide an overall picture of a visualization. However, it is not suited for captions categorized under Lundgard et al.’s model, which focuses on specific visualization features.

Kong et al. [Kong2019TrustAR], Borkin et al. [Borkin2016BeyondMV], and Kim et al. [Kim2021TowardsUH] ask users to recall their takeaways from visualizations or their captions, and analyze the takeaways. We find this method to be the most relevant to our study, and follow a takeaway recall method similar to that of Kim et al. [Kim2021TowardsUH]. However, the purpose of our study differs from that of Kim et al., who focus on what information to put in captions. Our study focuses on how varying the semantic level of caption content affects users’ understanding of charts and captions.

2 Approach Overview

We study how captions written at the statistical or perceptual levels affect the memorability of visualization features. Since these semantic levels are based on Lundgard et al.’s model [2022-vis-text-model], we use three single- and two multi-line charts from their corpus [mitdataset]. The third single-line chart was used to evaluate the effects when the caption contains errors. We follow the 3-step process used by Kim et al. [Kim2021TowardsUH]: identify visually salient features, generate captions, and show captioned visualizations to users to collect takeaways (Figure 1). The study was approved by our institution’s IRB.

Figure 1: 3-step process of our study. Yellow boxes are crowd-sourced data collection steps, and the grey box is performed by the authors.

2.1 Step 1: Identify visually salient features

We first determined the primary and secondary most salient visualization features. We showed 9 university students 7 charts each, and asked them to draw boxes around salient features and label each box as primary or secondary. They wrote a short description if drawing a box was inadequate. We then aggregated their submissions to choose the primary and secondary visual features for users in steps 2 and 3 of the study.

Figure 2 shows the primary and secondary boxes drawn for one of the charts. We clustered the boxes and chose the regions that overlapped with the top two clusters, using those regions (in orange and blue) as the primary and secondary features for which we write captions in the next step. We found consensus for the primary/secondary features in 5 of the 7 charts. Thus, we used 4 of them for our main study, and the fifth to study the impact of caption errors (Figure 3).

Figure 2: An example of identified primary and secondary features from different users’ responses. The primary feature (in orange highlight) is generalized to be the minimum in 2009, the maximum in 2010, and the increasing trend in between; the secondary feature (in blue highlight) is generalized to be the relatively stable trend after 2011.

2.2 Step 2: Generate Captions

We generated two captions for each feature in each chart—one at the statistical level, and one at the perceptual level. To ensure the captions are consistent with Lundgard et al. [2022-vis-text-model], we used the following caption guidelines for 5 charts in their corpus [mitdataset]:

  • Statistical: Follow the general template: At [year], [dependent variable] is [value]. Insert “lowest” or “highest” in grammatically appropriate places if the feature contains an extremum.

  • Perceptual: Use one or more keywords that describe trends and patterns: increase, decrease, drop, rebound, stable, trend, gap. Add values in grammatically appropriate places so that both levels are comparable in granularity.
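
The two guidelines above can be illustrated with a minimal sketch. The function names, the clause joining, and the substring keyword check are assumptions made for illustration only; the captions in the study were written by the authors, not generated by code.

```python
# Keywords listed in the perceptual guideline above.
PERCEPTUAL_KEYWORDS = ("increase", "decrease", "drop", "rebound",
                       "stable", "trend", "gap")


def statistical_caption(points, variable, extrema=None):
    """Fill the template "At [year], [dependent variable] is [value]".

    points   -- list of (year, value_text) pairs describing the feature
    variable -- name of the dependent variable
    extrema  -- optional dict mapping a year to "lowest" or "highest"
    """
    extrema = extrema or {}
    clauses = []
    for year, value in points:
        label = extrema.get(year)  # None when the point is not an extremum
        if label:
            clauses.append(f"at {year}, the {label} {variable} is {value}")
        else:
            clauses.append(f"at {year}, the {variable} is {value}")
    text = "; ".join(clauses) + "."
    return text[0].upper() + text[1:]


def uses_perceptual_keyword(caption):
    """Crude substring check (hypothetical helper) for a trend/pattern keyword."""
    lowered = caption.lower()
    return any(keyword in lowered for keyword in PERCEPTUAL_KEYWORDS)


if __name__ == "__main__":
    print(statistical_caption(
        [(2009, "-45%"), (2010, "75%")],
        "change of exports",
        extrema={2009: "lowest", 2010: "highest"},
    ))
    # At 2009, the lowest change of exports is -45%; at 2010, the highest change of exports is 75%.
```

The substring check is deliberately loose (it accepts "decreases" and "rebounds"), so it can also match unrelated words that happen to contain a keyword.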

The independent variables are the feature described in the caption (primary or secondary) and the semantic level of the caption (statistical or perceptual), and the dependent variable is the user takeaway. For each chart, we generated 4 captions (statistical or perceptual, for the primary or secondary feature), along with a no-caption control group. As an example, the captions for Figure 3(b) are:

  • Primary, Statistical: At 2009, Taiwan has the lowest change of exports of -45%; at 2010, it has the highest change of exports of 75%.

  • Primary, Perceptual: After a dramatic drop in exports in 2009, the percentage of exports rebounds significantly and reaches the peak of 75% in 2010.

  • Secondary, Statistical: At 2011, the annual change of exports in Taiwan is 30%; at 2016, it is -18%.

  • Secondary, Perceptual: After 2011, the exports become more stable, but the overall trend decreases, with exports reaching -18% in 2016.

Finally, we generated correct and incorrect captions at both semantic levels to describe the selected feature (in purple highlight) in the chart (Figure 3(e)). The only difference between the incorrect and correct captions was numeric errors (in red). The statistical caption states “At mid-2004, the lowest price of oranges is $0.10 per pound; at 2007, the highest price of oranges is $2.60 per pound,” and the perceptual caption states “From mid-2004 to 2007, the price of oranges per pound increases significantly from $0.10 to $2.60.”

Figure 3: Single- and multi-line charts with identified salient features (primary in orange and secondary in blue highlight) used in step 2 and step 3. The selected feature for the wrong caption chart is labeled in purple highlight.

2.3 Step 3: Collect Takeaways for Charts & Captions

Finally, we showed the captioned visualizations to a different set of users and collected their takeaways. We recruited users from three main channels: a university campus, Amazon Mechanical Turk (AMT), and visualization-related online communities. Users were asked to fill out an online questionnaire (Figure 4).

Figure 4: Procedures for Collecting Takeaways.

The questionnaire starts with a screening test to ensure that a user can read values and trends from a visualization. Each user then reads 5 charts in random order, with a random caption version for each chart. To mimic real-world settings, the user can read the chart-caption pair for as long as they wish, and then click “next” to write their takeaways. To prevent users from copying and pasting the caption into their takeaways, we do not allow them to return to the previous page. This condition is made clear for every chart. We also ask the user to report their reliance on the chart and caption when writing their takeaways, on a 5-point Likert scale.
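
The randomization described above (random chart order, one random caption version per chart) could be implemented along the following lines. This is only a sketch: the chart identifiers, condition names, and the set of versions offered for the error chart are hypothetical, since the questionnaire implementation is not described.

```python
import random

# Hypothetical condition names: four captions plus the no-caption control.
MAIN_VERSIONS = (
    "primary-statistical", "primary-perceptual",
    "secondary-statistical", "secondary-perceptual",
    "no-caption",
)

# Assumed versions for the chart used to study caption errors.
ERROR_VERSIONS = (
    "correct-statistical", "correct-perceptual",
    "wrong-statistical", "wrong-perceptual",
)


def assign_conditions(versions_by_chart, rng):
    """Return a shuffled list of (chart_id, caption_version) pairs for one user."""
    charts = list(versions_by_chart)
    rng.shuffle(charts)  # random presentation order
    return [(chart, rng.choice(versions_by_chart[chart])) for chart in charts]


if __name__ == "__main__":
    versions = {f"chart_{name}": MAIN_VERSIONS for name in "abcd"}
    versions["chart_e"] = ERROR_VERSIONS
    # A per-user seed makes each assignment reproducible.
    print(assign_conditions(versions, random.Random(42)))
```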

The first two authors of this paper coded the takeaways based on the criteria in Table 2. Each author labeled the takeaways independently and discussed confusing cases together.

Label       Explanation
Primary     If takeaway mentions primary feature.
Secondary   If takeaway mentions secondary feature.
Numeric     If takeaway contains x- or y-axis values.
Wrong       If takeaway contains inaccuracies.

Table 2: Labels for possible information in takeaways.

3 Results

We collected 500 takeaways from 100 users. 92% were 18-54 years old, 88% held a Bachelor’s degree or higher, and 92% agreed that they read visualizations proficiently in everyday life. Assuming that captions are more effective at communicating a given feature in the data when the user mentions the feature in their takeaway, we quantify and report memorability as the percentage of users that mentioned the feature (primary or secondary) in their takeaways.
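
As a minimal sketch of this memorability measure, the snippet below computes, for each caption type, the percentage of coded takeaways that carry a given label from Table 2. The record layout (`caption_type`, `labels`) and the toy records are assumptions for illustration, not the study's data.

```python
from collections import defaultdict


def memorability(takeaways, feature):
    """Percentage of takeaways, per caption type, coded as mentioning `feature`.

    takeaways -- iterable of dicts with keys "caption_type" and "labels"
                 (a set of Table 2 labels assigned by the coders)
    feature   -- "Primary" or "Secondary"
    """
    mentioned = defaultdict(int)
    total = defaultdict(int)
    for takeaway in takeaways:
        caption_type = takeaway["caption_type"]
        total[caption_type] += 1
        if feature in takeaway["labels"]:
            mentioned[caption_type] += 1
    return {c: 100.0 * mentioned[c] / total[c] for c in total}


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy = [
        {"caption_type": "no-caption", "labels": {"Primary"}},
        {"caption_type": "no-caption", "labels": set()},
        {"caption_type": "primary-statistical", "labels": {"Primary", "Numeric"}},
        {"caption_type": "primary-statistical", "labels": {"Primary"}},
    ]
    print(memorability(toy, "Primary"))
    # {'no-caption': 50.0, 'primary-statistical': 100.0}
```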

As a first analysis, we partitioned the results by chart type (single- and multi-line) and takeaway feature (whether the primary or secondary feature was mentioned in the takeaway). We call a caption consistent with the takeaway when it describes the same feature as mentioned in the takeaway. We then ran Chi-squared tests for each partition to measure the independence between three caption types (two consistent and no-caption types) and whether the user mentioned that takeaway feature. For instance, for the primary takeaway feature, we ran Chi-squared tests using the captions that described the primary feature, along with the no-caption type. The p-values all fall below the significance threshold and imply that caption type does affect memorability (Table 3).

                          Single-line Charts        Multi-line Charts
Mentioned in takeaway?    Primary    Secondary      Primary    Secondary
Caption Type

Table 3: Chi-squared tests between caption type and whether a user mentioned the feature in their takeaway.
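
To make the structure of each test concrete, here is a minimal sketch of a Chi-squared test of independence on a 3x2 contingency table (three caption types by mentioned / not mentioned). The counts are hypothetical and are not the study's data. For this table shape there are 2 degrees of freedom, for which the Chi-squared survival function has the closed form exp(-x/2), so no statistics library is needed here.

```python
import math


def chi_squared_independence(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_two_dof(statistic):
    """Survival function of the Chi-squared distribution with exactly 2 dof."""
    return math.exp(-statistic / 2.0)


if __name__ == "__main__":
    # Hypothetical counts: rows are caption types, columns are
    # (mentioned the feature, did not mention the feature).
    observed = [
        [30, 10],  # consistent, statistical
        [24, 16],  # consistent, perceptual
        [12, 28],  # no caption
    ]
    statistic, dof = chi_squared_independence(observed)
    print(round(statistic, 2), dof, p_value_two_dof(statistic))
```

A small p-value on such a table indicates only that the mention rate differs across the three caption types; identifying which types differ requires the pairwise comparisons reported in the next section.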

3.1 Semantic Level and Feature Salience

How do the feature and semantic level of the caption affect memorability? Figure 5 reports the percentage of users that mention the primary/secondary feature (columns) for single- and multi-line charts (rows), as we vary the caption type (x-axis). We darkened the bars for consistent captions. We ran pairwise Fisher’s exact tests between bars in each chart, and marked a bar with *, **, or *** to denote successively smaller p-value thresholds as compared to the no-caption type. We also marked a bar with an additional symbol if it is statistically different from all other bars.

Figure 5: Percentage of users that mention a feature for each caption type (x-axis). *, **, and *** denote successively smaller p-value thresholds as compared to the no-caption type; an additional marker indicates statistical significance over all other types. Lighter bars indicate that the caption type does not describe the takeaway feature.
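
For readers unfamiliar with the pairwise comparison, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table (caption type by mentioned / not mentioned). It uses the common convention of summing the probabilities of all tables that are no more likely than the observed one; whether the study used this exact convention is an assumption, and the counts in the usage example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are two caption types; columns are (mentioned, not mentioned).
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(low, high + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts: 3 of 4 users mention the feature with a caption,
    # 1 of 4 without a caption.
    print(fisher_exact_two_sided(3, 1, 1, 3))  # 34/70, about 0.486
```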

Across all charts, using the statistical level to describe consistent captions improved memorability of the caption’s feature by a statistically significant amount, particularly when the feature is prominent in the chart (single-line charts, primary feature case in Figure 5(a)). Overall memorability is lower in multi-line (Figure 5(c)(d)) than single-line charts (Figure 5(a)(b)), potentially because they are more complex and no single feature “stands out” [Carpenter1998AMO]. In these cases, consistent captions at either semantic level improve memorability.

Secondary features (Figure 5(b)(d)) are also more memorable when the caption is consistent, though considerably more memorable when described as a trend or pattern (perceptual). Surprisingly, they are more memorable in single-line charts simply by having a caption, irrespective of which feature it describes and at which semantic level (Figure 5(b)).

Reliance on Captions: Along with takeaways, users also reported their reliance on the chart and caption. Figure 6 shows that when the caption describes the primary feature, users relied on the caption the same amount irrespective of the semantic level. In contrast, when it describes the secondary feature, users relied more on perceptual captions in both single- and multi-line charts (Mann-Whitney tests).

Figure 6: Self-reported reliance on different caption types for single- and multi-line charts. 1 means not dependent and 5 means entirely dependent. * and ** denote successively smaller p-value thresholds.
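
Likert ratings are ordinal and heavily tied, which is why a rank-based test is a natural choice here. The sketch below computes the Mann-Whitney U statistic with average ranks for ties and a two-sided p-value from the normal approximation with a tie correction. The choice of approximation (and the absence of a continuity correction) is an assumption, and the ratings in the usage example are invented for illustration.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value via normal approximation)."""
    counts = Counter(list(x) + list(y))
    # Assign each distinct value the average of the ranks it occupies.
    ranks, position = {}, 0
    for value in sorted(counts):
        ties = counts[value]
        ranks[value] = position + (ties + 1) / 2.0
        position += ties

    n1, n2 = len(x), len(y)
    n = n1 + n2
    u1 = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2.0

    # Variance of U with the standard correction for tied ranks.
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u1, 1.0  # all ratings identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2.0) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u1, p_value


if __name__ == "__main__":
    # Hypothetical 5-point reliance ratings for two caption types.
    perceptual = [4, 5, 5, 4, 3]
    statistical = [2, 3, 1, 2, 3]
    print(mann_whitney_u(perceptual, statistical))
```

With samples this small the normal approximation is rough; an exact test would be preferable in practice.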

3.2 Do Captions Help Users Remember Numbers?

Can caption type help users remember chart details? To this end, we focused on takeaways from single-line charts and computed the percentage of users that mentioned the x-axis (time) or y-axis (measure) value of a point in the chart. A Chi-squared test shows that the caption type does affect the memorability of y-axis values. We further ran Fisher’s exact test between each bar and its corresponding no-caption type to measure statistical significance.

Even without a caption, users were more likely to mention values along the x-axis than the y-axis. A potential reason is that, even when describing trends, users still need to refer to x-axis values (e.g., “increase between 2000 and 2010”). A statistical level caption significantly increases the memorability of y-axis values, such that it is comparable with the memorability of x-axis values. However, perceptual captions do not appear to have an effect. These results suggest that a statistical level caption is helpful if there is a need to highlight specific y-axis values.

The multi-line chart results (not shown) are similar under no caption, but captions do not show statistically significant differences.

Figure 7: Percentage of users that mention an x- or y-axis value of a point in single-line charts. * indicates statistical significance over the corresponding no-caption bar.

3.3 Can Errors in Captions Hurt?

If the caption contains numeric errors inconsistent with the chart, does the semantic level of the caption affect whether or not the user remembers incorrect information? Answering this question signals the extent to which readers rely on captions in practice.

We used a single-line chart (Figure 3(e)) and generated statistical and perceptual captions that described the selected feature in the chart. However, the wrong caption type used incorrect numbers that were easy to double check by reading the chart. Given takeaways that contain a number, we measured the percentage that contained an incorrect number, and whether the error was found in the caption (“caption-related”) or not (“caption-unrelated”).

We first studied whether the semantic level and caption correctness affect whether users mention numeric values in their takeaways, and found no statistically significant effects. Then we studied the percentage of wrong takeaways among all numeric takeaways (Figure 8). As expected, there is a baseline level of errors regardless of caption correctness and its semantic level. However, caption errors increase the error rate considerably. Errors in the statistical caption were reproduced in nearly 30% of numeric takeaways, and errors in perceptual captions were reproduced in around 10% of numeric takeaways. We find no statistically significant differences, potentially due to the low sample size of wrong takeaways.

These results suggest that many users did not verify the veracity of the data in captions, and wrong captions can bias the users into drawing conclusions inconsistent with the chart data. We further posit that readers rely more heavily on specific numbers mentioned in statistical captions, whereas they focus more on general patterns and mildly on specific numbers in perceptual captions. In summary, this phenomenon merits further study.

Figure 8: Percentage of numeric takeaways that contain numeric errors, for different caption types (x-axis). The stacked bars report errors unrelated to the caption contents, and errors that reproduce the error in the caption.

4 Findings, Applications, and Future Work

We examine how captions of varying semantic levels can affect the features mentioned in the users’ takeaways. These results lead to recommendations for composing captions, depending on what features the author wishes to emphasize and convey.

We find that users tend naturally to remember a highly salient feature in a chart, such as an extremum or abrupt change in a single-line chart. Yet, the author can further emphasize the primary feature by focusing the caption on the statistical properties of the feature. Meanwhile, captions can also focus the user on medium or low-salience features—these may be secondary trends or any feature in a complex multi-line chart. In these cases, captions should focus on perceptual level descriptions such as trends and patterns. To summarize these recommendations:

  • Use Statistical Level for high salience features in single-line charts. This level is especially helpful for conveying numeric information in single-line charts.

  • Use Perceptual Level for medium/low salience features.

Applications: Our findings can help data visualization practitioners write captions to more effectively communicate patterns that they want to highlight. They can also help standardize guidelines for machine-generated captions. Although this study only involves sighted users, we hope this approach of categorizing and analyzing captions’ semantic levels can be applied to exploring effective alt-text writing for the visually impaired community as well.