1 Related Work
Various accessible media publishers have established general guidelines for people to write natural language captions [diagramcenter, GBH, CFPB, W3C], but those guidelines often lack rationales and support from empirical experiments [Jung2021CommunicatingVW]. This critique motivates our empirical study to better understand how to write effective data captions.
1.1 The Impact of Visual Salience on Reader Takeaway
Prior work has shown that the salience of the feature described by a caption affects what the reader takes away. Kim et al. [Kim2021TowardsUH] show that, for single-line charts, when the caption mentions a salient feature, reader takeaways more consistently mention the feature; when the caption mentions a less salient feature, reader takeaways are more likely to mention the most salient feature than the feature described in the caption. In our study, we investigate how visual salience and semantic level of a caption affect the reader’s takeaway. We also offer recommendations when practitioners want to emphasize less salient features.
1.2 Natural Language Models for Captions
The lack of rationale for caption guidelines raises the need to analyze captions systematically. While the three-level model by Kim et al. [Kim2021AccessibleVD] guides people on how to scaffold visualization information in order, Lundgard et al. [2022-vis-text-model] propose a concrete model that categorizes the content in a caption into four semantic levels: 1) elemental and encoded, 2) statistical and relational, 3) perceptual and cognitive, and 4) contextual and domain-specific (Table 1).
Lundgard et al. [2022-vis-text-model] evaluate the effectiveness of each semantic level and suggest that both sighted and visually impaired readers tend to regard the levels that communicate perceptual information (level 3) and statistical information (level 2) as useful to an extent. Although sighted users also ranked the contextual level (level 4) highly, this level depends on the reader’s subjective knowledge of world events (context). Our work teases out the nuances between captions that communicate data in the chart, in other words, the perceptual and statistical levels.
1.3 Evaluating Captions
Methods used to evaluate captions in past empirical research include semi-structured interviews [Jung2021CommunicatingVW], pre-constructed questions [Moraes2014EvaluatingTA, Kildal2006NonvisualOO], and takeaway recall [Kong2019TrustAR, Borkin2016BeyondMV, Kim2021TowardsUH].
Semi-structured interviews allow users to express their thoughts and preferences directly after reading captions [Jung2021CommunicatingVW]. While such methods can provide detailed and insightful qualitative results, it is challenging to derive generalizable quantitative results from them at scale.
Moraes et al. [Moraes2014EvaluatingTA] and Kildal et al. [Kildal2006NonvisualOO] ask users to listen to captions or data values and then answer pre-constructed overview questions. The time taken to answer those questions and the correctness of the answers are used to evaluate the effectiveness of the caption or tool. This method is helpful for quantitatively evaluating captions that provide a big-picture overview of a visualization. However, it is not suited for captions categorized under Lundgard et al.’s model, which focuses on specific visualization features.
Kong et al. [Kong2019TrustAR], Borkin et al. [Borkin2016BeyondMV], and Kim et al. [Kim2021TowardsUH] ask users to recall their takeaways from visualizations or their captions, and then analyze the takeaways. We find this method to be the most relevant to our study and follow a takeaway recall method similar to that of Kim et al. [Kim2021TowardsUH]. However, the purpose of our study differs from that of Kim et al., who focus on what information to put in captions. Our study focuses on how varying the semantic level of caption content affects users’ understanding of charts and captions.
2 Approach Overview
We study how captions written at the statistical or perceptual levels affect the memorability of visualization features. Since these semantic levels are based on Lundgard et al.’s model [2022-vis-text-model], we use three single- and two multi-line charts from their corpus [mitdataset]. The third single-line chart was used to evaluate the effects when the caption contains errors. We follow the 3-step process used by Kim et al. [Kim2021TowardsUH]: identify visually salient features, generate captions, and show captioned visualizations to users to collect takeaways (Figure 1). The study was approved by our institution’s IRB.
2.1 Step 1: Identify visually salient features
We first determined the primary and secondary most salient visualization features. We showed 9 university students 7 charts each and asked them to draw boxes around salient features and label each box as primary or secondary. They wrote a short description when drawing a box was inadequate. We then aggregated their submissions to choose the primary and secondary visual features for users in steps 2 and 3 of the study.
Figure 2 shows the primary and secondary boxes drawn for one of the charts. We clustered the boxes, chose the regions that overlapped with the top two clusters, and used those regions (in orange and blue) as the primary and secondary features for which we write captions in the next step. We found consensus on the primary and secondary features in 5 of the 7 charts. We therefore used 4 of these charts for our main study and the fifth to study the impact of caption errors (Figure 3).
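The aggregation of participant boxes can be illustrated with a minimal sketch. The text does not specify the clustering method, so everything here is an assumption made for illustration: boxes are `(x0, y0, x1, y1)` tuples, clustering is a greedy pass using intersection-over-union (IoU) against the first box of each cluster, the threshold of 0.3 is arbitrary, and each cluster is summarized by its coordinate-wise mean box.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cluster_boxes(boxes: List[Box], threshold: float = 0.3) -> List[List[Box]]:
    """Greedy clustering: a box joins the first cluster whose seed box it overlaps."""
    clusters: List[List[Box]] = []
    for box in boxes:
        for cluster in clusters:
            if iou(box, cluster[0]) >= threshold:
                cluster.append(box)
                break
        else:
            clusters.append([box])
    # Largest clusters first (stable sort keeps insertion order on ties).
    return sorted(clusters, key=len, reverse=True)


def mean_box(cluster: List[Box]) -> Box:
    """Coordinate-wise mean of the boxes in one cluster."""
    n = len(cluster)
    return tuple(sum(b[i] for b in cluster) / n for i in range(4))


def top_two_regions(boxes: List[Box], threshold: float = 0.3) -> List[Box]:
    """Summary regions for the two most populated clusters."""
    return [mean_box(c) for c in cluster_boxes(boxes, threshold)[:2]]
```

Under these assumptions, a chart lacks consensus when the two largest clusters are small or similar in size to the remaining ones; the cutoff for that judgment is not given in the text.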
2.2 Step 2: Generate Captions
We generated two captions for each feature in each chart—one at the statistical level, and one at the perceptual level. To ensure the captions are consistent with Lundgard et al. [2022-vis-text-model], we used the following caption guidelines for 5 charts in their corpus [mitdataset]:
Statistical: Follow the general template: At [year], [dependent variable] is [value]. Insert “lowest” or “highest” in grammatically appropriate places if the feature contains an extremum.
Perceptual: Use one or more keywords that describe trends and patterns: increase, decrease, drop, rebound, stable, trend, gap. Add values in grammatically appropriate places so that both levels are comparable in granularity.
The independent variables are the feature described in the caption (primary or secondary) and the semantic level of the caption (statistical or perceptual); the dependent variable is the user takeaway. For each chart, we generated 4 captions (statistical or perceptual level, for the primary or secondary feature), along with a no-caption control group. As an example, the captions for Figure 3(b) are:
Primary, Statistical: At 2009, Taiwan has the lowest change of exports of -45%; at 2010, it has the highest change of exports of 75%.
Primary, Perceptual: After a dramatic drop in exports in 2009, the percentage of exports rebounds significantly and reaches the peak of 75% in 2010.
Secondary, Statistical: At 2011, the annual change of exports in Taiwan is 30%; at 2016, it is -18%.
Secondary, Perceptual: After 2011, the exports become more stable, but the overall trend decreases, with exports reaching -18% in 2016.
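The two caption guidelines can be sketched in code. This is only an illustration: the captions in the study were written by hand, and the function names `statistical_caption` and `perceptual_keywords` are hypothetical. The first function fills the statistical template; the second checks which of the listed perceptual keywords (including inflected forms such as "rebounds") appear in a caption.

```python
import re
from typing import List, Optional

# Keywords taken from the perceptual guideline.
PERCEPTUAL_KEYWORDS = ["increase", "decrease", "drop", "rebound", "stable", "trend", "gap"]
_KEYWORD_RE = re.compile(r"\b(" + "|".join(PERCEPTUAL_KEYWORDS) + r")", re.IGNORECASE)


def statistical_caption(year, variable: str, value: str,
                        extremum: Optional[str] = None) -> str:
    """Fill the template 'At [year], [dependent variable] is [value].'

    `extremum` may be 'lowest' or 'highest'; it is placed before the variable.
    """
    qualifier = f"the {extremum} " if extremum else "the "
    return f"At {year}, {qualifier}{variable} is {value}."


def perceptual_keywords(caption: str) -> List[str]:
    """Sorted list of guideline keywords whose stem starts a word in the caption."""
    return sorted({m.lower() for m in _KEYWORD_RE.findall(caption)})


example = statistical_caption(2007, "price of oranges", "$2.60 per pound", "highest")
```

A caption with no matching keyword, such as the primary statistical example above, returns an empty list, which gives a quick consistency check that a statistical caption has not drifted into describing trends.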
Finally, we generated correct and incorrect captions at both semantic levels to describe the selected feature (in purple highlight) in the chart (Figure 3(e)). The only difference between the incorrect and correct captions was numeric errors (in red). The statistical caption states “At mid-2004, the lowest price of oranges is $0.10 per pound; at 2007, the highest price of oranges is $2.60 per pound,” and the perceptual caption states “From mid-2004 to 2007, the price of oranges per pound increases significantly from $0.10 to $2.60.”
2.3 Step 3: Collect Takeaways for Charts & Captions
Finally, we showed the captioned visualizations to a different set of users and collected their takeaways. We recruited users from three main channels: a university campus, Amazon Mechanical Turk (AMT), and visualization-related online communities. Users were asked to fill out an online questionnaire (Figure 4).
The questionnaire starts with a screening test to ensure that a user can read values and trends from a visualization. Each user then reads 5 charts in random order, with a random caption version for each chart. To mimic real-world settings, the user can read the chart-caption pair as long as they wish, and then click “next” to write their takeaways. We don’t allow users to go back to the previous page to prevent copy-and-pasting the caption as part of their takeaways. This condition is made clear for every chart. We also ask the user to report their reliance on the chart and caption when writing their takeaways, based on a 5-point Likert scale.
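The randomization described above can be sketched as follows. The chart identifiers and condition names are hypothetical, and the set of caption versions shown for the error chart is an assumption, since the text does not list it; the sketch only shows one way to draw a random caption version per chart and a random chart order per user.

```python
import random
from typing import List, Tuple

# Hypothetical identifiers, for illustration only.
MAIN_CHARTS = ["single-1", "single-2", "multi-1", "multi-2"]
MAIN_CONDITIONS = [
    "none",
    "primary-statistical",
    "primary-perceptual",
    "secondary-statistical",
    "secondary-perceptual",
]
ERROR_CHART = "single-error"
# Assumed conditions for the error chart (correct/wrong x statistical/perceptual).
ERROR_CONDITIONS = [
    "correct-statistical",
    "correct-perceptual",
    "wrong-statistical",
    "wrong-perceptual",
]


def assign_session(seed: int) -> List[Tuple[str, str]]:
    """Return (chart, caption version) pairs in the order one user sees them."""
    rng = random.Random(seed)  # per-user seed keeps the assignment reproducible
    session = [(chart, rng.choice(MAIN_CONDITIONS)) for chart in MAIN_CHARTS]
    session.append((ERROR_CHART, rng.choice(ERROR_CONDITIONS)))
    rng.shuffle(session)  # random chart order
    return session
```

Independent random draws like these do not guarantee balanced group sizes; a balanced design would need to track how many users each version has already received.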
The first two authors of this paper coded the takeaways based on the criteria in Table 2. Each author labeled the takeaways independently and discussed confusing cases together.
| Label | Criterion |
| Primary | If takeaway mentions primary feature. |
| Secondary | If takeaway mentions secondary feature. |
| Numeric | If takeaway contains x- or y-axis values. |
| Wrong | If takeaway contains inaccuracies. |
We collected 500 takeaways from 100 users. 92% were 18-54 years old, 88% held a Bachelor’s degree or higher, and 92% agreed that they read visualizations proficiently in everyday life. Assuming that captions are more effective at communicating a given feature in the data when the user mentions the feature in their takeaway, we quantify and report memorability as the percentage of users that mentioned the feature (primary or secondary) in their takeaways.
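As a minimal sketch of this memorability measure, assume each coded takeaway is a record with a caption type and the set of labels from Table 2. The record layout and the sample records are made up for illustration and are not study data.

```python
from collections import defaultdict
from typing import Dict, Iterable


def memorability_by_caption(takeaways: Iterable[dict], feature: str) -> Dict[str, float]:
    """Percentage of takeaways per caption type whose labels include `feature`."""
    counts = defaultdict(lambda: [0, 0])  # caption type -> [mentions, total]
    for t in takeaways:
        entry = counts[t["caption"]]
        entry[0] += 1 if feature in t["labels"] else 0
        entry[1] += 1
    return {caption: 100.0 * m / n for caption, (m, n) in counts.items()}


# Made-up records, for illustration only.
sample = [
    {"caption": "primary-statistical", "labels": {"primary", "numeric"}},
    {"caption": "primary-statistical", "labels": {"primary"}},
    {"caption": "none", "labels": {"secondary"}},
    {"caption": "none", "labels": {"primary"}},
]
result = memorability_by_caption(sample, "primary")
```

Since each user writes one takeaway per chart, counting takeaways within a chart and caption type is equivalent to counting users.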
As a first analysis, we partitioned the results by chart type (single- and multi-line) and takeaway feature (whether the primary or secondary feature was mentioned in the takeaway). We call a caption consistent with the takeaway when it describes the same feature as mentioned in the takeaway. We then ran Chi-squared tests for each partition to test the independence between three caption types (the two consistent types and the no-caption type) and whether the user mentioned that takeaway feature. For instance, for the primary takeaway feature, we ran Chi-squared tests using the captions that described the primary feature, along with the no-caption type. The p-values all fall below the significance threshold, implying that caption type does affect memorability (Table 3).
| | Single-line Charts | | Multi-line Charts | |
| Mentioned in takeaway? | Primary | Secondary | Primary | Secondary |
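The Chi-squared test of independence used for each partition can be sketched without external libraries. With three caption types and two outcomes (mentioned or not), the test has (3 - 1)(2 - 1) = 2 degrees of freedom, and for 2 degrees of freedom the p-value has the closed form exp(-statistic / 2). The counts below are made up for illustration and are not the study's data.

```python
import math
from typing import List


def chi_squared_statistic(table: List[List[int]]) -> float:
    """Pearson Chi-squared statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat: float) -> float:
    """Upper-tail probability of a Chi-squared variable with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Rows: statistical (consistent), perceptual (consistent), no caption.
# Columns: feature mentioned, feature not mentioned. Made-up counts.
counts = [[18, 2], [15, 5], [8, 12]]
stat = chi_squared_statistic(counts)
p = p_value_df2(stat)
```

The closed form only holds for 2 degrees of freedom; tables of other shapes need a general Chi-squared survival function, for example from a statistics library.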
3.1 Semantic Level and Feature Salience
How do the feature and semantic level of the caption affect memorability? Figure 5 reports the percentage of users that mention the primary/secondary feature (columns) for single- and multi-line charts (rows), as we vary the caption type (x-axis). We darkened the bars for consistent captions. We ran pairwise Fisher’s exact tests between bars in each chart and marked a bar with *, **, or *** according to its p-value relative to the no-caption type, with more asterisks indicating a smaller p-value. We also marked bars that are statistically different from all other bars.
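For reference, a two-sided Fisher’s exact test on a 2x2 table (caption type by whether the feature was mentioned) can be sketched with the standard library alone. It sums the hypergeometric probabilities of all tables with the same margins that are no more likely than the observed one. The function name and the small relative tolerance are choices made for this sketch.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)
```

For example, `fisher_exact_two_sided(3, 1, 1, 3)` evaluates the table with rows (3, 1) and (1, 3), for which 34 of the 70 equally weighted arrangements are at least as extreme as the observed one.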
Across all charts, using the statistical level to describe consistent captions improved memorability of the caption’s feature by a statistically significant amount, particularly when the feature is prominent in the chart (single-line charts, primary feature case in Figure 5(a)). Overall memorability is lower in multi-line (Figure 5(c)(d)) than single-line charts (Figure 5(a)(b)), potentially because they are more complex and no single feature “stands out” [Carpenter1998AMO]. In these cases, consistent captions at either semantic level improve memorability.
Secondary features (Figure 5(b)(d)) are also more memorable when the caption is consistent, though considerably more memorable when described as a trend or pattern (perceptual). Surprisingly, they are more memorable in single-line charts simply by having a caption, irrespective of which feature it describes and at which semantic level (Figure 5(b)).
Reliance on Captions: Along with takeaways, users also reported their reliance on the chart and caption. Figure 6 shows that when the caption describes the primary feature, users relied on the caption the same amount irrespective of the semantic level. In contrast, when it describes the secondary feature, users relied more on perceptual captions (Mann-Whitney tests for both single-line and multi-line charts).
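A Mann-Whitney comparison of two groups of Likert ratings can be sketched as below. This is one common variant and not necessarily the one used in the study: it assigns midranks to ties and derives a two-sided p-value from the normal approximation with a tie correction and no continuity correction. The ratings in the usage note are made up.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U for sample x, two-sided p-value from the normal approximation)."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        t = j - i                     # size of this group of tied values
        midrank = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2.0))
```

With the made-up ratings `[4, 5, 5, 4, 5]` and `[2, 3, 3, 2, 4]`, the first group has U = 24 out of a maximum of 25. For samples this small an exact test would be preferable to the normal approximation.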
3.2 Do Captions Help Users Remember Numbers?
Can caption type help users remember chart details? To this end, we focused on takeaways from single-line charts and computed the percentage of users that mentioned the x-axis (time) or y-axis (measure) value of a point in the chart. A Chi-squared test shows that the caption type does affect the memorability of y-axis values. We further ran Fisher’s exact test between each bar and its corresponding no-caption type to measure statistical significance.
Even without a caption, users were more likely to mention values along the x-axis rather than the y-axis. A potential reason is that, even when describing trends, users still need to refer to x-axis values (e.g., “increase between 2000 and 2010”). A statistical level caption significantly increases the memorability of y-axis values, such that it is comparable with the memorability of x-axis values. However, perceptual captions do not appear to have an effect. These results suggest that a statistical level caption is helpful if there is a need to highlight specific y-axis values.
The multi-line chart results (not shown) are similar under no caption, but captions do not show statistically significant differences.
3.3 Can Errors in Captions Hurt?
If the caption contains numeric errors inconsistent with the chart, does the semantic level of the caption affect whether or not the user remembers incorrect information? Answering this question signals the extent to which readers rely on captions in practice.
We used a single-line chart (Figure 3(e)) and generated statistical and perceptual captions that described the selected feature in the chart. However, the wrong caption type used incorrect numbers that were easy to double check by reading the chart. Given takeaways that contain a number, we measured the percentage that contained an incorrect number, and whether the error was found in the caption (“caption-related”) or not (“caption-unrelated”).
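The measurement above can be illustrated with a hypothetical automated version of the coding. The takeaways in the study were coded by hand, so this sketch is only an approximation of that process: it extracts unsigned numbers with a regular expression, treats any number missing from the chart as wrong, and uses a made-up erroneous caption value of 3.60, because the actual incorrect numbers are not given in the text.

```python
import re
from typing import List, Optional, Sequence

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # unsigned integers and decimals


def extract_numbers(text: str) -> List[float]:
    return [float(m) for m in NUMBER.findall(text)]


def _contains(values: Sequence[float], x: float, tol: float = 1e-9) -> bool:
    return any(abs(v - x) <= tol for v in values)


def classify_takeaway(text: str, chart_values: Sequence[float],
                      caption_errors: Sequence[float]) -> Optional[str]:
    """Return None for non-numeric takeaways, else one of three labels."""
    numbers = extract_numbers(text)
    if not numbers:
        return None
    if any(_contains(caption_errors, n) for n in numbers):
        return "caption-related"      # repeats a wrong number from the caption
    if any(not _contains(chart_values, n) for n in numbers):
        return "caption-unrelated"    # wrong, but not taken from the caption
    return "correct"


def wrong_rate(labels: Sequence[Optional[str]]) -> float:
    """Percentage of numeric takeaways that contain an incorrect number."""
    numeric = [label for label in labels if label is not None]
    if not numeric:
        return 0.0
    wrong = sum(1 for label in numeric if label != "correct")
    return 100.0 * wrong / len(numeric)


CHART_VALUES = [2004, 2007, 0.10, 2.60]  # values named in the correct captions
CAPTION_ERRORS = [3.60]                  # made-up erroneous value
```

Exact matching is stricter than a human coder would be: an approximate reading such as "about $2.50" would be flagged as wrong here, so a practical version would need a tolerance around chart values.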
We first studied whether the semantic level and caption correctness affect whether users mention numeric values in their takeaways, and found no statistically significant effects. Then we studied the percentage of wrong takeaways among all numeric takeaways (Figure 8). As expected, there is a baseline level of errors regardless of caption correctness and its semantic level. However, caption errors increase the error rate considerably. Errors in the statistical caption were reproduced in nearly 30% of numeric takeaways, and errors in perceptual captions were reproduced in around 10% of numeric takeaways. We find no statistically significant differences, potentially due to the low sample size of wrong takeaways.
These results suggest that many users did not verify the veracity of the data in captions, and wrong captions can bias the users into drawing conclusions inconsistent with the chart data. We further posit that readers rely more heavily on specific numbers mentioned in statistical captions, whereas they focus more on general patterns and mildly on specific numbers in perceptual captions. In summary, this phenomenon merits further study.
4 Findings, Applications, and Future Work
We examine how captions of varying semantic levels can affect the features mentioned in the users’ takeaways. These results lead to recommendations for composing captions, depending on what features the author wishes to emphasize and convey.
We find that users tend naturally to remember a highly salient feature in a chart, such as an extremum or abrupt change in a single-line chart. Yet, the author can further emphasize the primary feature by focusing the caption on the statistical properties of the feature. Meanwhile, captions can also focus the user on medium or low-salience features—these may be secondary trends or any feature in a complex multi-line chart. In these cases, captions should focus on perceptual level descriptions such as trends and patterns. To summarize these recommendations:
Use Statistical Level for high salience features in single-line charts. Especially helpful in conveying numeric information in single-line charts.
Use Perceptual Level for medium/low salience features.
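These two recommendations can be expressed as a small decision rule, for instance as a starting point for a caption-authoring tool. The function name and the string values for salience and chart type are hypothetical; the rule itself only restates the recommendations above.

```python
def recommended_level(salience: str, chart_type: str) -> str:
    """Suggest a semantic level for a caption.

    salience:   'high', 'medium', or 'low' (hypothetical categories)
    chart_type: 'single-line' or 'multi-line'
    """
    if salience == "high" and chart_type == "single-line":
        return "statistical"
    # Medium/low salience features, and any feature in a multi-line chart.
    return "perceptual"
```

For example, `recommended_level("low", "single-line")` suggests a perceptual caption, which matches the finding that secondary features were more memorable when described as a trend or pattern.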
Applications: Our findings can help data visualization practitioners write captions to more effectively communicate patterns that they want to highlight. They can also help standardize guidelines for machine-generated captions. Although this study only involves sighted users, we hope this approach of categorizing and analyzing captions’ semantic levels can be applied to exploring effective alt-text writing for the visually impaired community as well.