Words of Estimative Correlation: Studying Verbalizations of Scatterplots

11/28/2019 ∙ by Rafael Henkin, et al. ∙ City, University of London

Multimodal approaches where interactive visualization and natural language are used in tandem are emerging as promising techniques for data analysis. A significant barrier in effectively designing such multimodal techniques is the lack of a systematic understanding of how people verbalize visual representations of data. Motivated by these gaps, this paper devises and applies a transferable, semi-automated methodology to systematically study the relation between visualization and natural language through two crowd-sourced experiments and natural language analysis. We describe these experiments, analyze the resulting corpus of utterances with natural language processing techniques and derive an empirically supported semantic lexicon for aligning visualizations and verbalizations of data. Our results indicate a wide range of vocabulary used to describe visualizations and led to a number of high level concepts to categorize the space of words and related utterances. We discuss how our findings can be used for natural language generation, also reflecting on the design of the experiments and the semi-automated methodology used in the analysis. We discuss further research directions and argue for a role for such multimodal experiments in advancing our understanding of how people work with visualizations and also data at large.




1 Introduction

In his seminal, originally classified report, Central Intelligence Agency (CIA) analyst Sherman Kent introduces “Words of Estimative Probability” – a group of terms that are “advisable” to use when intelligence analysts talk about the likelihood of future forecasts with some known/estimated probabilities [36]. In order to establish this set of words, Kent, together with his team, investigated how terms such as “almost certainly”, “probably” and “unlikely” are perceived and understood by gathering empirical data through controlled studies (see p. 79 of the “Handbook for Decision Analysis” by Barclay et al. [1]) on how the perception of these terms matches numeric probabilities. The insight this research provided on the links between the intended and perceived semantics of probability has been fundamental in how intelligence analysts report critical pieces of information. Inspired by these studies, this paper, through a novel data-intensive methodology, aims to understand how visual depictions of correlation are perceived, understood and verbalized by their observers. By doing so, we also aim to establish a more general role for natural language-based experiments both in understanding the inner workings of visual representations of data and also in providing empirical support to inform the design of solutions that combine visual and textual representations.

Such multimodal interactive systems where visualization is combined with natural language are gaining increased interest both in research [39, 38, 15, 33, 40, 12] and in industry as evidenced by implementations in modern business analytic systems [41, 27]. Natural language has been shown to allow more expressiveness in interactions [33] and as a means to complement or reinforce visualizations in data exploration and analysis [38, 34].

In order to design such systems effectively, the process of understanding and/or generating language should be informed by user requirements and align with the domain of the underlying analytical tasks [29]. This very need is also highlighted for developing natural language generation (NLG) systems in the Natural Language Processing (NLP) literature [28]. Reiter et al. refer to the knowledge acquisition stage where researchers try to acquire “relevant knowledge about the domain, the users, the language used in the texts” as “one of the biggest problems in applied NLG” [29]. Methods such as structured interviews with experts and learning from collections of accurate text corpora are often the proposed approaches to gathering required linguistic elements for building a natural language generation system [29]. In visualization research, this need has often been addressed on a case-by-case basis [33, 39], not always resorting to empirically supported knowledge on how people perceive and understand the natural language information related to visualizations.

This paper is motivated by this need for a methodical understanding of how natural language is used to verbalize visualizations, and takes the first steps towards building such an understanding by developing and conducting a novel methodology to study the interaction between visualizations and verbalizations of systematically varied data relationships. To inspire and facilitate further studies, we devise a semi-automated methodology that involves user experiments along with data and natural language analysis methods that will serve as an extendable approach for studying verbalizations of data visualizations and for generating semantic lexicons and utterances to serve as the basis for the design of multimodal solutions.

We introduce and apply our methodology by investigating how the visualization of correlation relationships between two variables is verbalized by people. We are interested in correlation analysis as a fundamental task in data science and as a representative of cases where people reason with visual representations, namely scatterplots in this instance. We describe two crowd-sourced experiments aimed at gaining insight on the use of natural language with visualizations, drawing inspiration from studies that investigate the perception of systematically varied correlation in scatterplots [30, 13]. The results are analyzed semi-automatically using a number of natural language analysis and machine learning techniques.

The first experiment is a visualization to natural language study, in which participants are asked to provide verbalizations of a data relationship when viewing scatterplots for the varied levels of correlation. We then analyze the results to construct a semantic lexicon that maps higher level concepts, related to data visualizations, to the vocabulary used by participants. The analysis is described as steps that can be later used to expand the constructed lexicon.

We then describe a second experiment, on natural language to visualization, where we prepare utterances based on the semantic lexicon and ask participants to select scatterplots that match the utterances. We analyze how the utterances resonate with participants and compare the results of the two experiments, demonstrating how the data collected across the experiments can be used as empirically evidenced templates for natural language generation.

The results indicate a wide range of vocabulary used by participants to refer to and describe relations, which we then organize under five concepts that indicate “what” participants refer to, and five traits that outline “how” they refer to them. We observe that responses often involved multiple concepts and traits, with particular combinations being prominent, and that their use varied distinctively over the whole correlation range. The comparative analysis between the first and the second experiments’ results reveals word combinations that are represented consistently and would serve as reliable vocabulary for constructing “empirically evidenced” descriptions of visual representations. The analyses in Sections 4 and 5 and the following discussions in Section 6 elaborate these points further.

The contributions of the paper are thus summarized as:

• Empirical data, analysis and observations to advance towards an understanding of how people process and verbalize correlation in scatterplots;

• A transferable methodology for systematic studies of verbalizations of visual representations of data through crowd-based studies and natural language processing;

• A semantic lexicon and empirically evidenced utterances that can help structure templates for natural language generation in multimodal interactive data analysis systems;

• A research agenda for further studies that systematically investigate the verbalizations of visual representations of data.

2 Related work

2.1 Visualization and natural language

The use of natural language generation (NLG) to complement visual information has been investigated in several application areas, as well as in domain-agnostic settings. As authors often describe the development of multimodal systems as a whole, the language realization step is rarely discussed, with issues about the use of certain words often appearing during evaluation. Srinivasan et al. [37] combined the generation of data-driven statements with the exploration of alternative visualizations to facilitate visual analysis; the authors report on study participants emphasizing the need to understand what words such as moderate and strong meant for the system. A slightly different approach, less focused on natural language, was used by Demiralp et al. [11] when providing an insight-based visualization exploration system. Mumtaz et al. [23], who introduced a multimodal tool to help with code quality analysis, acknowledged this need by including tooltips to clarify the boundaries when values change from low to medium. Jain and Keller [16] also reported changing their textual health alerts based on feedback on the use of words such as significant. We also identified instances of authors using synonym lists to keep statements interesting [19], but in none of these cases is there a thorough consideration of the alignment of data, visualizations and natural language from a language perspective.

In addition to the importance of acquiring knowledge for this alignment, as emphasized in the literature (e.g., Reiter [28, 29]), we also motivate our work from the perspective of pre-design empiricism in visualization [7], which advocates gathering empirical evidence and documenting it before proceeding with designing multimodal tools. With this in mind, our contributions in this paper complement the described works through an open and reproducible methodology that can be part of the pre-design process to improve NLG support.

Natural language is also used as an input in multimodal systems, as a means to provide alternative methods for hands-free interaction [39, 38] or as a complementary medium to traditional interaction modes [15, 33, 40, 12], allowing users to communicate with their own words rather than learning a potentially complex set of interactions. The methodology introduced in this paper also has a role in these cases, though in a more limited scope, to reveal the different manners through which users communicate about visualizations, which can then be adapted accordingly.

2.2 Perceptual studies

Fig. 1: Overview of the proposed methodology. The orange boxes visual encoding and data relationship represent the objects of the study. Experiment 1, in the upper box, contains three steps: (1) conducting the user study, (2) analyzing the results and (3) building the semantic lexicon from the analysis. Experiment 2 complements it through a (4) second user study and (5) analysis of results, leading to the refined lexicon. The arrows from E1 to E2 indicate how the two experiments are related: through the visualizations that are used in both experiments and the classified utterances that are used for the comparison. The arrows with the blue labels indicate the outputs: the utterances and selected visualizations from each user study, and the categories of high-level concepts derived from the utterances.

Our work is also related to the investigation of perception of visualization. Most prominently, researchers have long sought to investigate how humans perceive correlation in scatterplots, with various studies [5, 26] exploring the ability to estimate varying levels of correlation. More recently, the information visualization community has approached this problem from a design perspective, focusing on the perception of correlation as a way to model the effectiveness of visualization design. Results from these studies showed how varying correlation for normally distributed variables can be modeled with a few parameters [30], and also systematically demonstrated how different types of visualization are more or less suitable for these tasks [13, 17]. Beecham et al. [3] extended such studies into studying visual inferences from systematically varying levels of structure in visual representations of geospatial data. Sher et al. [35] reported results that raised new questions for the field, demonstrating a difficulty in making accurate estimations when the underlying data deviated from normal bivariate distributions. Correll and Heer [10] explored similar conditions when investigating how reliable people’s estimates of trends and missing values are. The idea in these studies, of systematically presenting varied levels of correlation, served as an inspiration for our experiments. In contrast to these studies, however, our experiments focus on the highly subjective application of natural language rather than the accuracy of estimations.

Other tasks were also tested regarding the depiction of correlation with scatterplots. Pandey et al. [25] investigated the perception of scatterplots based on judgment of similarity. Rather than modeling perception based on correlation, their aim was to understand which perceptual features of scatterplots were used when participants grouped them. The output of this study was a list of concepts that describe scatterplots, extracted from judgments of participants. Although their concepts partly overlap with our findings, our semantic lexicon includes a wider range of characteristics.

3 Methodology

As discussed in the introduction, the overarching goal of our study is to build a deeper and systematic understanding of how people verbalize correlation relations as seen through scatterplots. Through this investigation, we also aim to generate an empirically supported semantic lexicon [31, 20] to use as a basis for future multimodal interactive systems. To inspire and facilitate further research in this space, we devise a transferable, semi-automated methodology for conducting systematic studies of verbalizations of data visualizations. The methodology, illustrated in figure 1, is guided by two general objectives that relate verbalizations with visualizations and vice-versa:

O1: Determine the fundamental linguistic aspects of the association of verbalizations with visualizations;

O2: Identify the alignment between the association of verbalizations with visualizations and of visualizations with verbalizations.

In this paper, we use these objectives in the context of visualizing correlation relationships in scatterplots; however, it is important to note that the methods and analyses presented are limited neither to scatterplots nor to correlation. Rather than pinning down preferences or accurate estimations through hypothesis-driven experiments, we design and conduct exploratory experiments aimed at collecting a wide range of verbalizations that will enable us to achieve the objectives. The methodology thus comprises two main experiments: E1: Visualization to Natural Language, in which we gather verbalizations of scatterplots of systematically varied levels of correlation between two variables, and E2: Natural Language to Visualization, in which we gather data on how participants associate scatterplots of varying correlations to natural language descriptions.

To achieve O1, E1 consists of three steps: (1) a user study to acquire utterances; (2) a semi-automated analysis pipeline, in which natural language processing (NLP) methods are alternated with manual coding steps, and (3) the construction of a semantic lexicon from the derivation of high-level concepts of how participants verbalize the visualizations.

Objective O2 is achieved through E2, in two steps: (4) a second user study, which asks participants to choose scatterplots that match statements describing a correlation relationship, and (5) the analysis and comparison between distributions of the selected scatterplots and the distribution of utterances matching concepts and traits from E1, based on the distance between histograms of these distributions. This second step also helps with the validation and refinement of the semantic lexicon. In addition to the analysis steps, Sections 4 and 5 also include descriptions of material preparation, participants and procedure for the two studies.
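As a rough illustration of step (5), the comparison of two distributions over the nine correlation levels can be sketched as follows. This section does not name the distance, so the total variation distance (half the L1 distance between normalized histograms) is an assumption here, as are the function names and the toy counts.

```python
# Correlation levels shared by both experiments: -0.8, -0.6, ..., 0.8.
LEVELS = [i / 5 for i in range(-4, 5)]

def normalize(counts):
    """Turn raw counts per correlation level into proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram has no observations")
    return [c / total for c in counts]

def histogram_distance(counts_a, counts_b):
    """Total variation distance between two histograms (assumed metric).

    Returns 0.0 for identical shapes and 1.0 for non-overlapping ones.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must share the same bins")
    p, q = normalize(counts_a), normalize(counts_b)
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

# Hypothetical counts per correlation level, for illustration only.
e1_utterances = [30, 22, 10, 4, 1, 3, 9, 20, 28]   # E1 utterances tagged with one category
e2_selections = [25, 24, 12, 5, 2, 4, 10, 22, 26]  # E2 scatterplots selected for one statement

distance = histogram_distance(e1_utterances, e2_selections)
```

A small distance would suggest that a statement built from a given concept or trait is matched to the same correlation levels at which participants originally used it.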

4 Experiment 1: Visualization to Natural Language

The first experiment involves asking participants to provide textual descriptions of scatterplots. The result is a collection of utterances for the varying levels of correlation, from which a semantic lexicon is built. The experiment includes three major steps of the methodology: the study itself, the analysis of utterances and the construction of the semantic lexicon by extracting the main concepts and traits used in the utterances. This enables completing the first objective (O1). We describe material preparation and task procedure, discuss details about the analysis methods used and present the semantic lexicon.

Fig. 2: Screenshot from user study #1 where we ask participants to describe the relation in a scatterplot.
Fig. 3: Screenshot from user study #2 where we ask participants to indicate the scatterplots that match a verbal description of a relationship.

4.1 User study #1

Fig. 6: a) Top 30 words extracted from all answers; nouns “A” and “B” were replaced by vara and varb to facilitate analysis. b) Distribution of common adjectives extracted from bigrams and trigrams for levels of correlation. Looking at the changes in how adjectives are observed over correlation levels reveals certain adjectives being over- or under-represented, with certain levels having less dominant terms (e.g., around 0 correlation).

Materials and participants: We generated 9 scatterplots (fig. 3) inspired by the scatterplots used in previous research [30, 13, 17]. They were generated by drawing samples from a bivariate normal distribution, with correlations ranging from -0.8 to 0.8, in increments of 0.2. Scatterplot figures are 288 pixels in both dimensions, with axes named for abstract variables A and B and ticks drawn for even numbers from 0 to 10. We recruited 160 participants through the Prolific online platform; our only pre-screening filter was self-declaration of native language as English. We used abstract variable names instead of real concepts (such as birth rate, GDP, etc.) so that participants would not relate more or less strongly to the data, avoiding any potential bias due to familiarity with the data. Moreover, this avoids additional layers of interpretation whilst the scatterplots are verbalized and the resulting data is analyzed.
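The sampling step can be sketched with the standard library alone. This is a minimal sketch, not the authors' generation code: the number of points per scatterplot, the seed and the helper names are assumptions, and the samples are left in standard units rather than rescaled to the 0 to 10 axes.

```python
import math
import random

# Nine target correlations: -0.8, -0.6, ..., 0.8.
CORRELATIONS = [i / 5 for i in range(-4, 5)]

def sample_bivariate_normal(rho, n, rng):
    """Draw n (a, b) pairs from a standard bivariate normal with correlation rho."""
    points = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        # Mixing x with independent noise z yields corr(x, y) = rho.
        y = rho * x + math.sqrt(1.0 - rho * rho) * z
        points.append((x, y))
    return points

def pearson(points):
    """Sample Pearson correlation of a list of (a, b) pairs."""
    n = len(points)
    mean_a = sum(a for a, _ in points) / n
    mean_b = sum(b for _, b in points) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in points)
    var_a = sum((a - mean_a) ** 2 for a, _ in points)
    var_b = sum((b - mean_b) ** 2 for _, b in points)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(0)  # fixed seed (assumption) so the samples are reproducible
# 2000 points per level is an arbitrary choice that keeps sampling noise small.
datasets = {rho: sample_bivariate_normal(rho, 2000, rng) for rho in CORRELATIONS}
```

With fewer points per plot, the sample correlation can drift from its target, so a stimulus generator may want to check `pearson` and resample when the deviation is too large.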

Procedure: Each participant was asked to write an answer to the instruction “Please describe the relationship between the variables A and B in this chart” for each scatterplot, in a randomized order (see Supplement1 for the online surveys submitted as additional material). Before proceeding with the task, participants were given a description on the use of scatterplots to analyze the data – we note that this description, while it included the word relationship, did not include the term correlation. We carefully avoided biasing participants in this training by not providing any interpretation or description of the relation between the variables, and instructed participants to give complete answers and avoid answering with “same as before” or “same”.

4.2 Analysis

We collected 1428 answers out of 1440 potential answers; 11 participants from an initial test run saw 8 instead of 9 scatterplots and we removed one answer which was “n/a”. As part of the data cleaning process, we corrected spelling errors when sensible, such as “realtionship”, but did not change words such as “correlationship”. With our aim of applying a reusable methodology, we proceeded with a semi-automated analysis pipeline, combining natural language processing (NLP) methods with manual coding. A completely manual approach, such as thematic analysis, would not fit our criteria for transferability; on the other hand, our aim was to map concepts to answers rather than classify answers by concepts, thus we decided not to use classification or clustering techniques such as topic modeling. In the following, we go over the NLP analysis, followed by the derivation of conceptual categories before we move on to the semantic lexicon construction. All the steps described in the paper are accessible in the Python notebooks included in the supplementary material (also in the online repository at https://gitlab.city.ac.uk/nlvis/wec_supplement).

4.2.1 NLP pipeline

NLP processing: the first step taken was to replace occurrences of references to the variables A and B with identifiable proper nouns VarA and VarB. Following this, we proceeded with the tokenization (a process to break down text into individual words), part-of-speech (POS) tagging (marking up a word in a text as corresponding to a part of speech such as nouns, adverbs or verbs), and lemmatization (the process to group together the different inflected forms of a word into a single word, e.g., “am”, “is”, “are” being mapped to “be”) of each answer [9]. After a few tries with different tokenization methods, we decided to use regular expressions so that hyphenated words would be preserved. Tagging was done using the Stanford CoreNLP library [42] and lemmatization was done using the WordNet lemmatizer [4] applied only to the supported POS tags (nouns, adjectives and verbs). In the processed results, the average number of tokens per answer was 12.01; across correlations, the average number of tokens did not vary much, falling between 11.14 and 12.64 tokens.
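The first two steps, variable replacement and hyphen-preserving tokenization, can be sketched with regular expressions. The patterns and function names below are assumptions for illustration and are not taken from the supplementary notebooks; POS tagging and lemmatization are left out because they depend on external models.

```python
import re

# Standalone capital A or B; "As" or "6A" are not matched because \b needs a word boundary.
# Limitation of this naive rule: a sentence-initial article "A" would also be replaced.
VAR_PATTERN = re.compile(r"\b([AB])\b")

# A token is a run of letters/digits, optionally continued by hyphenated parts.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def replace_variables(text):
    """Rewrite references to the variables A and B as VarA and VarB."""
    return VAR_PATTERN.sub(lambda m: "Var" + m.group(1), text)

def tokenize(text):
    """Split text into word tokens, keeping hyphenated words in one piece."""
    return TOKEN_PATTERN.findall(text)

answer = "As A increases, B shows a well-defined upward trend"
tokens = tokenize(replace_variables(answer))
```

Keeping "well-defined" as a single token matters for the later adjective analysis, since splitting it would produce two unrelated words.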

Filtering and summarization: after the initial processing, the analysis proceeded with an exploratory stage to further understand the results, supported by other NLP-related methods. First, we excluded stop words and summarized the remaining words to find out which words and parts of speech were common and relevant in our data, to decide how to proceed with the exploration. The summary, shown in Fig. 6a, revealed that a large number of answers made direct references to the variable names, followed by variable, relationship (a word that was part of the instruction), correlation, increase, high and low. In the figure, words that might be different parts of speech are aggregated – increase includes both nouns (singular and plural) and verbs (third person singular or plural). Nonetheless, we identified a high number of nouns, which is in line with the focus of the rest of the analysis.
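A minimal sketch of this summarization step, assuming answers are already lemmatized. The stop word list below is a tiny illustrative stand-in and the example answers are invented.

```python
from collections import Counter

# Tiny illustrative stop word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "be", "and", "as", "of", "in", "to", "there"}

def summarize(lemmatized_answers, top_n=30):
    """Count non-stop-word lemmas over all answers and return the most common."""
    counts = Counter()
    for lemmas in lemmatized_answers:
        counts.update(w.lower() for w in lemmas if w.lower() not in STOP_WORDS)
    return counts.most_common(top_n)

# Invented lemmatized answers, for illustration only.
answers = [
    ["as", "VarA", "increase", "VarB", "increase"],
    ["there", "be", "a", "strong", "positive", "correlation"],
    ["no", "clear", "relationship", "between", "VarA", "and", "VarB"],
]
top_words = dict(summarize(answers))
```

Note that "no" is deliberately not treated as a stop word here, since negation turns out to be relevant later in the analysis.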

Collocation analysis: as the next step, we analyzed collocations – word combinations that co-occur frequently [9], initially only with adjectives and nouns and focusing on bigrams and trigrams. Here, it is important to note that our analysis had a different final objective compared to traditional NLP scenarios where collocations are used. In these scenarios, distinguishing collocations are important to differentiate documents (or utterances). However, in our case, the objective was not to characterize individual answers, but to understand how language was used in these answers across the dataset. From the resulting list of bigrams and trigrams, we extracted a list of the most common adjectives and compared their distribution across correlations to investigate how these words were used.
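The idea can be sketched with plain frequency counts over POS-tagged answers. This is an assumption-laden simplification: it uses Penn Treebank tag prefixes (JJ for adjectives, NN for nouns), counts each adjective once per answer, and ranks by raw frequency rather than by any association measure the notebooks may use.

```python
from collections import Counter, defaultdict

def is_adj_or_noun(tag):
    """Penn Treebank tags: JJ* are adjectives, NN* are nouns."""
    return tag.startswith("JJ") or tag.startswith("NN")

def collocations(tagged, n):
    """Yield n-grams of adjacent tokens that are all adjectives or nouns."""
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(is_adj_or_noun(tag) for _, tag in window):
            yield tuple(window)

def adjectives_by_level(tagged_answers):
    """Count adjectives found in bigram/trigram collocations per correlation level."""
    per_level = defaultdict(Counter)
    for level, tagged in tagged_answers:
        found = set()
        for n in (2, 3):
            for gram in collocations(tagged, n):
                found.update(w for w, tag in gram if tag.startswith("JJ"))
        per_level[level].update(found)  # each adjective counted once per answer
    return per_level

# Invented tagged answers as (correlation level, [(word, tag), ...]).
tagged_answers = [
    (0.8, [("strong", "JJ"), ("positive", "JJ"), ("correlation", "NN")]),
    (0.8, [("strong", "JJ"), ("relationship", "NN")]),
    (0.0, [("no", "DT"), ("clear", "JJ"), ("relationship", "NN")]),
]
per_level = adjectives_by_level(tagged_answers)
```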

The analysis of the types of adjectives and how often they appeared provided an overview of how the varying levels of correlation are verbalized. Most adjectives, such as positive and negative, were expected; however, our task procedure did not suggest that people should focus on one or another characteristic of the relationship, and the variety of adjectives reflects this. Another observation in our analysis was the low occurrence of some potentially common adjectives, which suggested that collocation analysis alone was not sufficient for categorizing and counting certain words.

TABLE I: Definition, words and examples of answers for the traits identified through the analysis of results.

4.2.2 Derivation of categories

The previous two steps indicated the types of words that participants used to describe the relationship between variables, in a general manner and also across correlations. This preliminary knowledge allowed us to proceed to capture higher level ideas that would be used in the lexicon. For this stage, rather than using automated approaches such as topic modeling or analyzing whole sentences and manually tagging them, we decided to create categories with the collocation analysis as a starting point. For that, adjectives and nouns were separated into two lists of groups: in one list, each adjective was paired with the groups of nouns that it appeared next to in a collocation; in the other list, the inverse was done: each noun was paired with the groups of adjectives that appeared next to it. The same procedure was done with adverbs and verbs.
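The two lists of groups can be sketched as a pair of dictionaries built from collocation pairs. The function name and the example pairs are hypothetical; the same helper would apply to adverb and verb pairs.

```python
from collections import defaultdict

def pair_groups(collocation_pairs):
    """Build both lookup lists from (modifier, head) collocation pairs.

    Returns (heads_by_modifier, modifiers_by_head), e.g. nouns grouped under
    each adjective and adjectives grouped under each noun.
    """
    heads_by_modifier = defaultdict(set)
    modifiers_by_head = defaultdict(set)
    for modifier, head in collocation_pairs:
        heads_by_modifier[modifier].add(head)
        modifiers_by_head[head].add(modifier)
    return heads_by_modifier, modifiers_by_head

# Invented adjective-noun pairs, for illustration only.
pairs = [
    ("strong", "correlation"),
    ("strong", "relationship"),
    ("clear", "relationship"),
    ("positive", "correlation"),
]
nouns_by_adjective, adjectives_by_noun = pair_groups(pairs)
```

Reading the two dictionaries side by side shows which modifiers are shared across heads, which is a useful starting point for grouping words into categories by hand.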

Based on their functions, we organized nouns and verbs under concepts and the adjectives and adverbs that modify these concepts under the traits, which became the two axes in the resulting semantic lexicon [20]. Here, rather than starting with a complete bottom-up approach, we focused on names for concepts and traits from the continuum between data visualization and statistical relationships. This continuum includes words that describe elements of charts and visual channels – shapes, position, transparency, etc. – and words related to data/statistics – relationships, arrangements, characteristics of data. Our objective was to define categories that would be sufficiently distinct, but that could also potentially encompass sub-categories and be open for future changes. Several rounds of manual coding, refinement and discussion between authors were done until a final set of core concepts and traits was defined. Most of the resulting words were categorized in this manner – words that did not fit into our refined continuum were left without a category and deemed unhelpful for the lexicon.

TABLE II: Definition, words and examples of answers of the concepts identified through the analysis of results.

The following concepts, summarized with examples in table II, emerged from our analysis:

  • Relation: words in this category describe the relationship between variables, including primarily relationship and correlation. References to the variables A and B and the word variable are also included in this category, as they were used often in answers that define a correlation relationship (e.g. “as A increases, so does B”). This concept also includes link, association, connection, degree, level, proportion;

  • Behaviour: words that describe arrangements or motion of data points, including trend, pattern, variation, rate, etc. This category also includes words that rather explicitly suggest direction, such as increase – the key difference between behaviour and direction (which is one of the traits we discuss next) is the part of speech, or how the word is used in a sentence: nouns and verbs are categorized as concepts, whereas adjectives and adverbs are categorized as traits;

  • Graphic: words that refer to visual features, including dots, line, points, chart, graph, plot, space, curve;

  • Space: words that refer to spatial distribution, grouping of items or measurement scale, including values, group, location, anomaly, cluster, outlier, distribution, concentration;

  • Inferences: words that are used when the participant described additional information or gave meaning to the visual features, including confidence, data, results, information, likelihood, majority, meaning, significance.

These concepts appeared along the following traits, summarized with examples in table I:

  • Magnitude: mapped adjectives and adverbs in this category refer to strength or amount, including words such as strong, many, moderate, slight, weak, minimal;

  • Direction: this category includes references to signs such as positive and negative, as well as upwards, downwards, increasing, rising, declining;

  • Position: this category includes spatial positions, such as high, low or bottom, as well as descriptions of spread of distribution, such as random, widely or outside, among others;

  • Discernibility: this category refers to how easy it is to identify a concept, including words such as clear, identifiable, discernible, noticeable, diffuse, distinct, vague;

  • Regularity: this category describes the stability or the form of a concept, including words such as tight, linear, steady, direct, inconsistent, loose, uneven.

We additionally defined a special category Negate to distinguish between affirmative and negative statements, as a quick glance over the collected utterances revealed the need to account for the negation of some of the concepts or traits, in sentences such as “no clear relationship”.

4.3 Lexicon construction

This process resulted in four lists of pairs of word and category, for adjectives, nouns, verbs and adverbs, which were used to map the concepts and traits to the answers. Each answer was tagged only once per occurrence of a concept or trait. This led to a further refinement of the categories: once a first round of analysis was done, the remaining uncategorized answers were analyzed and new words were added to the lists and categorized, until the remaining answers had at least one category, either concept or trait. A final set of 8 utterances was left untagged, as their words did not map to any concept or trait.
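The tagging step can be sketched as a dictionary lookup over lemmas. The lexicon below is a small excerpt built from words listed under each category above. Collapsing repeated mentions into a set, and using "no" and "not" as triggers for Negate, are assumptions rather than documented choices.

```python
# Excerpt of a word-to-category mapping, using words listed for each concept and trait.
LEXICON = {
    "relationship": "Relation", "correlation": "Relation",
    "trend": "Behaviour", "pattern": "Behaviour",
    "points": "Graphic", "cluster": "Space",
    "strong": "Magnitude", "weak": "Magnitude",
    "positive": "Direction", "negative": "Direction",
    "clear": "Discernibility", "linear": "Regularity",
    "random": "Position",
}
NEGATION_WORDS = {"no", "not"}  # assumed triggers for the Negate category

def tag_answer(lemmas):
    """Return the set of categories mentioned in one answer."""
    categories = {LEXICON[w] for w in lemmas if w in LEXICON}
    if any(w in NEGATION_WORDS for w in lemmas):
        categories.add("Negate")
    return categories

def untagged(answers):
    """Answers with no concept or trait, i.e. candidates for the next coding round."""
    return [a for a in answers if not (tag_answer(a) - {"Negate"})]

tags = tag_answer(["no", "clear", "relationship"])
```

In an iterative workflow, the output of `untagged` would be reviewed manually, new words added to `LEXICON`, and the tagging rerun until few or no answers remain.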

The iterative process also included prepositions, such as above or below, that were not captured by the collocation analysis. The complete set of tagged utterances ranges from answers with a single concept to long, multi-sentence answers that include up to 9 concepts or traits. The distribution of the number of answers per number of concepts and/or traits tagged, for the complete set of answers, is shown in Fig. 7. The distribution shows that, while the majority of answers referred to 2-4 concepts or traits, a high number of participants provided very long answers even though they were not asked to write a specific number of words or cover all aspects of the chart.

Fig. 7: Distribution of answers per number of concepts or traits mentioned. While a moderate number of answers include only a single concept or trait, the majority of the answers include 2 to 4 concepts or traits. There are no noticeable patterns in the distributions.
Fig. 8: Most common combinations of concepts, traits and negative statements with examples of utterances. With the exception of one combination, the common combinations involve 2 or 3 concepts or traits. The lower number of answers for the bottom four combinations in the table reinforces the notion that participants used a wide variety of unique combinations of concepts and traits.

4.4 Analysis of concepts and traits

After arriving at a list of concepts and traits, we proceeded to analyze the categorized utterances according to the varying levels of correlation. As an initial overview, we looked at the combinations of concepts, traits and negative statements for the whole data, revealing 419 distinct sets of combinations, with about half of them (210) being unique to a single answer. Fig. 8 displays the ten most common combinations of categories for the whole data; describing the absence of relationship with a negation was the most common combination (e.g., “no clear relationship”), followed by description of behaviour and inferences (e.g., “as A increases, B decreases”). The low number of answers for the most common combination (52), relative to the total number of answers (1428), demonstrates the diversity of combinations for describing the data relationships.
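Counting combinations of this kind amounts to treating each answer's set of categories as a key. A minimal sketch, with hypothetical names and invented data:

```python
from collections import Counter

def combination_stats(tagged_answers):
    """Count category combinations across answers.

    Returns the counter, the number of distinct combinations, and how many
    combinations occur in exactly one answer.
    """
    combos = Counter(frozenset(tags) for tags in tagged_answers)
    distinct = len(combos)
    singletons = sum(1 for count in combos.values() if count == 1)
    return combos, distinct, singletons

# Invented category sets, one per answer, for illustration only.
tagged = [
    {"Negate", "Discernibility", "Relation"},
    {"Relation", "Negate", "Discernibility"},
    {"Behaviour", "Relation"},
    {"Magnitude", "Direction", "Relation"},
]
combos, distinct, singletons = combination_stats(tagged)
top_combo, top_count = combos.most_common(1)[0]
```

Using `frozenset` makes the combination independent of the order in which categories appear in an answer, so the first two invented answers count as the same combination.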

Figure 9 shows the co-occurrence of concepts and traits for the varying levels of correlations; each stacked bar corresponds to affirmative (in blue) and negative statements (in gold). As the categorization is a many-to-many process, each chart in the figure is potentially depicting utterances that appear in other plots as well; as such, the aggregated distributions of concepts and traits, above and to the left, are not the marginal distributions for each row or column.

As seen in the figure, certain concepts, traits or their combinations have representations concentrated on particular ranges of correlation. Two pronounced examples are direction and behaviour that are over-represented for more extreme values of correlation, i.e., bi-modal distribution with peaks around higher and lower correlation values. This indicates that participants identify strong correlations with their inclinations, e.g. positive, negative, increasing and how the points are arranged, e.g. trend, pattern, rate. Some others, such as magnitude or inferences, are used uniformly regardless of the correlation value. Such ”uniform” categories indicate language that is regularly used by participants to describe the scatterplots. For instance, the magnitude of the correlations is often important to mention and participants indicate inferences and try to interpret the “meaning” of the scatterplots irrespective of the strength of relations. An increasing use of negative statements for position, combined with inferences and relations, for the lower levels of correlation, is also noticeable. An example of utterance with this combination is ”the relationship between A and B is random and there is no correlation”. Discernibility also contrasts with the other traits for its common combination with negative statements rather than affirmative (e.g. ”no clear relationship”).

In addition to these prominent observations, we also identified what we consider a few peculiarities on how participants used some of the concepts. An interesting use of the concept of space was in describing a coordinate system, with the variables A and B as references. Examples of such utterances are ”loose cluster of data points distributed around A6, B6 with a few wild outliers” and ”scattered points concentrated most densely between 6A and 4-8B”. What stands out regarding this is that the scatterplots do not contain gridlines, and ticks are included only for even numbers (which at least explains why most of the grid references are indeed for the even positions).

Fig. 9: Counts of co-occurrences of concepts and traits per level of correlation, with color separating affirmative statements (blue) from negative statements (in gold). The upper row and the leftmost column are the individual distributions of occurrences for concepts and traits, across correlations. In the central region, the charts show the co-occurrence of pairs of concepts and traits; the counted utterances may also include other concepts and traits. The figure shows the dominance of the combination of magnitude with inferences and also relation. As inferences include references to the variable names, and relation includes references to ”correlation” or ”relationship”, the distribution matches well with the most common words identified (as seen in fig. (a)a).

5 Experiment 2: Natural Language to Visualization

This section describes the second experiment, in which natural language statements are presented to participants, who then select the scatterplots to which the statement applies. The results are the frequency of choices for varying levels of correlation, providing empirical evidence for natural language generation. As part of the overall methodology, the statements are prepared based on the combinations of concepts and traits synthesized in E1. As such, the experiment serves as a tool to validate and refine the semantic lexicon, as the results can be compared with the distribution of concepts and traits across correlations from E1. However, the statements and the charts themselves can also come from other sources, so the experiment can also be conducted independently. We describe the steps of material preparation, task procedure and analysis.

5.1 User study #2

5.1.1 Material preparation

For this experiment, we prepared 15 statements based on seven combinations of concepts and traits. Four of these groupings include the relation concept, which includes terms such as correlation and relationship, combined with four traits: magnitude, direction, regularity and discernibility – this was additionally combined with a negation. Another pair of statements was the combination of behaviour and direction, whilst the last two pairs of statements combine inferences with space and position. This resulted in the following statements:

(A) Relation and Magnitude:

• There is strong correlation between the variables A and B

• There is moderate correlation between the variables A and B

• There is weak correlation between the variables A and B

(B) Relation and Direction:

• There is a positive correlation between the variables A and B

• There is a negative correlation between the variables A and B

(C) Relation and Discernibility with Negation:

• There is a clear relationship between A and B

• There is no obvious relationship between A and B

(D) Relation and Regularity:

• There is a tight relationship between A and B

• There is a loose relationship between A and B

(E) Inferences and Position

• When variable A is high, so is variable B

• When variable A is high, variable B is low

(F) Behaviour and Direction:

• There is an upward trend

• There is a downward trend

(G) Inferences, Space and Position:

• When the values of A are high, so are the values of B

• When the values of A are high, the values of B are low

We would like to note that here we identified a set of representative categories and associated utterances to demonstrate how we conduct E2 and compare results from E1 and E2. Even though we aimed to select the most common category combinations and utterance structures, we used our judgment to identify utterances that we thought would resonate with participants, while covering a wide range of concepts and traits. An alternative and a much more comprehensive study could have selected utterances from all the most common combinations, as seen in figure 9 – in this case this would mean 25 distinct pairs of categories.

5.1.2 Participants and procedure

We recruited 120 participants through the Prolific online platform; our pre-screening filter was self-reported English as a native language and no participation in the first experiment. For the survey itself, we again used the Qualtrics system. The scatterplots from experiment 1 were positioned on a 3x3 grid and labeled sequentially Chart 1, …, Chart 9. A randomizer was set up so that, on average, each participant would see 4-5 statements and thus each sentence would have been seen by a similar number of participants.

Each participant was asked to follow the instruction: ”Please select all the charts that apply to the following statement”, followed by the utterance. Before proceeding with the task, participants were shown a description about the use of scatterplots to analyze data, similarly to the first experiment.

There are three points in our procedure here that need highlighting as potential issues: participants’ lack of enough knowledge about statistics, lack of attention to the task at hand or lack of engagement with the experiment. We included one knowledge check question about correlation, which would not interrupt the task. After comparing the answers to the question with the overall answers, we decided not to filter any participants. This was also due to the fact that we are not judging accurate estimations and that some of the statements involved highly subjective interpretations. The answers to this knowledge check question are included in the complete data in the supplementary materials.

5.2 Analysis

Fig. 10: Iterative steps of semantic grouping for Relation and Magnitude. The first groups are defined based on the intensity-driven word vector distances, with acknowledgment that it is not the optimal configuration. After step one, words are moved between groups or dropped until the distances between the two experiments, for each group, are subjectively acceptable. In the example, the best configuration of answers from E1 which are mapped to Strong is when the word strong itself is the only word left in the group. That is, this is the configuration for which the Hellinger distance between the distributions of E1 and E2, for Strong, is the shortest, based on the words elicited in E1.

We collected 550 answers from 120 participants. We first present and discuss the results of this experiment, and then compare them with the results of the first experiment and discuss the implications. In the following, we organize the discussions according to the categories of utterances introduced in Section 5.1.1.

Relation and Magnitude: this category included three utterances, with magnitude varying between strong, moderate and weak. For strong, the distribution for the second experiment is bimodal and symmetric, with peaks in the extreme values (right side in fig. 10), indicating that participants have an almost unambiguous association of strong with minimum and maximum values of correlation. For moderate, although the distribution is also bimodal and symmetric, it is not skewed and the peaks are located towards the mid-values of correlation. The shape of the distribution suggests that, when all scatterplots are seen together, moderate correlation is more ambiguous. Lastly, the distribution for weak is relatively uniform, except for the low occurrence in extreme values, indicating that even the absence of correlation is also associated with the word.

Relation and Discernibility with Negation: for this combination, utterances included one affirmative and one negative utterance. Overall, the results (top side in fig. 11) suggest that participants associate less ambiguously the existence of a clear relationship with the scatterplots that depict higher levels of correlation (0.6 and 0.8, positive and negative) and, conversely, no obvious relationship from -0.4 to 0.4.

Relation and Regularity: the utterances for this combination referred to tight and loose relationships. The distribution of results (bottom right side in fig. 11) shows that participants strongly associate tight relationship with the extreme values of correlation with little ambiguity. In contrast, the distribution for the choices associated with loose is almost uniform, with very small peaks around 0.6, suggesting that it is an ambiguous term. Although we did not ask participants regarding their views on the meaning of words, we also speculate that tight and loose are not interpreted as having exact opposite meanings in relation to a data relationship such as correlation.

The results for the three combinations described so far characterize the empirical support for terms for which the meanings are not directly linked to the underlying mathematical definitions or to the visual forms of the charts. The next four combinations include traits that have a closer association with the mathematical definitions of correlation, such as direction with positive and negative. This means that for such statements, it is possible to assign a correct answer. However, as our experiment was not designed to obtain accurate estimations, our analysis focuses on the interpretation of the statements in relation to the scatterplots.

Relation and Direction: this combination had two utterances, for positive correlation and negative correlation. The distribution of answers (bottom right fig. 12) for negative is right-skewed, decreasing across correlations. For positive, the distribution is left-skewed, being uniformly low from -0.8 until 0.4 and then sharply increasing at 0.6. We speculate that the differences between the results are likely the result of an anchoring or baseline effect, as participants saw all stimuli at the same time. However, further experiments would be needed to confirm such an effect or clear up the differences between positive and negative.

Behaviour and Direction: for this combination, the statements mentioned either an upward trend or a downward trend. Unlike the previous combination, the results for both (top right side in fig. 12) seem to be more closely aligned, being right-skewed for downward and left-skewed for upward in a similar manner. We speculate that the difference between the categories is due to the fact that upward and downward are associated with the visual form rather than the mathematical definition of correlation.

Inferences and Position and Space and Position: in these two pairs of utterances, the difference is the use of the words variable and values. The results for both combinations are very similar (see fig. 13), with a high number of selections in the range of 0.6-0.8. The two categories are more focused on the visual descriptions rather than an interpretation associated with degrees of correlation. For both categories, all scatterplots were chosen at least once; this stands out considering that some of the scatterplots clearly do not match the descriptions. Although we did not ask for justifications or time the answers, we speculate that participants might have disengaged from the task given the longer form of these utterances compared to the other simpler statements.

Fig. 11: Comparison of results between experiments for Relation + Discernibility and Relation + Regularity, with the comparison aiming to validate the definition of traits. For the first combination, the distributions are very similar, with distances of 0.2917 for clear relationship and 0.1260 for no obvious relationship, suggesting that both traits are meaningful. For relation and regularity, although the numeric distance is low enough (0.2554 for loose and 0.2514 for tight), the shape of the distribution for loose in E2 suggests that the trait is quite ambiguous.
Fig. 12: Comparison of results between experiments for Behaviour + Direction and Relation + Direction, with the comparison aiming to validate the definition of traits and identify differences between concepts. For the first combination, the distances between the distributions are 0.1555 and 0.1691 for downward trend and upward trend, respectively. For the second combination, the distances between the distributions are 0.2509 and 0.2444 for negative correlation and positive correlation, respectively. Overall, all distances are low enough to support the definition of traits. For both upward and positive, the distributions in E2 are slightly more skewed compared to E1, suggesting that, in a verbalization to visualization scenario (E2), it is possible that the trait resonates more than in a visualization to verbalization scenario (E1).

5.3 Comparison with E1

The connection with the previous experiment through the statements and scatterplots enables different analyses based on the types of traits:

(i) for traits that have a more subjective interpretation, the comparison enables going back to the answers collected in E1 and iteratively grouping statements and words based on their similarity – this is the case for the magnitude trait, in which words such as moderate do not have an anticipated meaning for correlation;

(ii) the comparison also serves to further validate the definition of certain traits, in particular for their use in NLG, complementing the overall methodology. This is the case for discernibility and regularity, which display differences between experiments.

The following approaches were used here to compare the results for these different cases; we also discuss the value and limitations of the experiment further in this section.

Semantic grouping: the semantic grouping involves matching words, identified from the utterances in E1, with the smaller set of words used in the statements in E2. This is required for traits for which there is no clear quantifiable distinction between adjectives. In the case of magnitude, for example, there might be clearer distinctions between strong, moderate and weak. However, deciding if an utterance that describes slight correlation corresponds to moderate or weak is more challenging. Such mapping is important in any case of using natural language in data analysis: if users describe slight correlation, the system must have ways to deal with it; if an output describes slight correlation, it is expected that a majority of users should interpret it appropriately.

There are several ways to do this. In this paper, we illustrate one data-driven method and use an existing vector space of words [18] where the distance between adjectives takes the semantic intensity of the adjectives into account. This means that, in this space, vectors representing words such as weak and strong have a longer distance between them than in unweighted vector spaces. A 2D projection of this space, using the t-SNE algorithm, is used to determine initial groupings of words (included in the supplementary materials); iterative refinements of the groups, based on a quantified similarity (see below), are done until an adequate set of interchangeable words is found.

Affirmative-negative matching: for binary traits such as discernibility (clear or vague), utterances that include them in combination with a negative statement configure a pair of similar utterances. An example is vague relationship being similar to no noticeable relationship. For the comparison, distributions of both kinds of answers should be combined to better reflect the meaning of the utterances. Further down this section, this is what is done to compare discernibility and regularity traits.
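A minimal Python sketch of this combination step follows. It assumes nine correlation levels from -0.8 to 0.8 in steps of 0.2, and both the function name and the counts are invented for illustration; none of these values come from the experiments.

```python
# Assumed correlation levels of the nine scatterplots (an assumption).
CORRELATION_LEVELS = [-0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8]


def combine_and_normalize(affirmative_counts, negated_opposite_counts):
    """Add per-level counts of two matched kinds of answers and normalize.

    For example, answers tagged as "vague" are combined with answers that
    negate "clear" (such as "no noticeable relationship").
    """
    if len(affirmative_counts) != len(negated_opposite_counts):
        raise ValueError("both count lists must cover the same levels")
    combined = [a + b for a, b in zip(affirmative_counts, negated_opposite_counts)]
    total = sum(combined)
    if total == 0:
        raise ValueError("no answers to normalize")
    return [c / total for c in combined]


# Hypothetical counts of answers per correlation level.
vague_counts = [0, 0, 1, 2, 3, 2, 1, 0, 0]
negated_clear_counts = [0, 1, 2, 3, 5, 3, 2, 1, 0]

vague_distribution = combine_and_normalize(vague_counts, negated_clear_counts)

for level, share in zip(CORRELATION_LEVELS, vague_distribution):
    print(f"{level:+.1f}: {share:.3f}")
```

Counts are summed before normalizing so that each answer carries the same weight, regardless of whether it was phrased affirmatively or as a negation.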

Fig. 13: Distribution of results for the second experiment for Relation + Position and Space + Position. For these results, what stands out is that every scatterplot was chosen at least once across the statements. As the statements are more descriptive about the scatterplots, therefore not requiring data-related interpretation, we speculate that participants perhaps did not engage much, possibly due to the longer length of the statements.

Quantifying similarities: To compare the distributions of the answers quantitatively, we use normalized distributions of charts chosen by participants, for each of the nine levels of correlation across the statements. As the resulting distributions are discrete, we use Hellinger distance [2] to quantify the similarity of the distributions from the two experiments, where applicable. The distance is a metric that produces results from 0.0 to 1.0, where values closer to 0.0 mean that the distributions are more similar and values closer to 1.0 mean that the distributions are more different. The metric is also advantageous as it is nonparametric and distribution-free, therefore it does not impose any assumptions about the shape of the distributions that are being compared.
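For two discrete distributions P and Q over the same levels, the Hellinger distance is the square root of half the sum of squared differences between the square roots of the probabilities. The sketch below implements this standard definition in Python; the `normalize` helper and the example counts are illustrative assumptions and do not reproduce the analysis code from the supplementary materials.

```python
import math


def normalize(counts):
    """Turn raw selection counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an all-zero count vector")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlapping support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Hypothetical counts per correlation level for one statement (E2)
# and for the matching utterances (E1).
e1_counts = [9, 5, 1, 0, 0, 0, 1, 5, 9]
e2_counts = [10, 5, 1, 0, 0, 0, 1, 5, 10]

distance = hellinger(normalize(e1_counts), normalize(e2_counts))
print(f"Hellinger distance: {distance:.4f}")
```

Normalizing first means the two experiments can be compared even though they collected different numbers of answers.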

Relation and Magnitude: as described above, the semantic grouping step is required to be able to compare results of the magnitude trait between experiments. Figure 10 illustrates four iterations of magnitude groupings: the first step consists of swapping the initial assignment of few and slight, from moderate to weak and vice versa. This results in a stronger group for weak, with the Hellinger distance between the experiments, which is already quite low at 0.1481, going down to 0.0968. However, the distance for moderate increases significantly to 0.2929. The next step to improve this is moving three words, broad, condensed and concentrated, from strong to moderate, which has the effect of improving the distances for both groups. Lastly, we remove the word large, leaving strong itself as the only word for its category, with a final distance of 0.1166. Here it is important to emphasize again that the comparison aims at finding groups that are good enough, a highly subjective decision that data-driven methods can support.
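One refinement step of this kind can be sketched as follows: pool the E1 counts of every word in a candidate group, compare the pooled distribution with the E2 distribution for the statement, and check whether dropping a word shortens the distance. The per-word counts, the `group_distance` helper and the resulting numbers are hypothetical. In the paper the decision to move or drop a word remains a subjective one, so this only illustrates the quantitative check.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def group_distance(words, e1_word_counts, e2_counts):
    """Distance between the pooled E1 distribution of a word group and E2."""
    n_levels = len(e2_counts)
    # Sum the per-level counts of every word assigned to the group.
    pooled = [sum(e1_word_counts[w][i] for w in words) for i in range(n_levels)]
    pooled_total = sum(pooled)
    e2_total = sum(e2_counts)
    p = [c / pooled_total for c in pooled]
    q = [c / e2_total for c in e2_counts]
    return hellinger(p, q)


# Hypothetical E1 counts per correlation level for two words.
e1_word_counts = {
    "strong": [9, 5, 1, 0, 0, 0, 1, 5, 9],
    "large": [1, 2, 3, 2, 1, 2, 3, 2, 1],
}
# Hypothetical E2 selections for the statement using the word "strong".
e2_strong_counts = [10, 5, 1, 0, 0, 0, 1, 5, 10]

with_large = group_distance(["strong", "large"], e1_word_counts, e2_strong_counts)
strong_only = group_distance(["strong"], e1_word_counts, e2_strong_counts)

print(f"strong + large: {with_large:.4f}")
print(f"strong only:    {strong_only:.4f}")
```

With these invented counts, removing large brings the group closer to the E2 distribution, mirroring the last step described for the strong group.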

The comparison of results indicates that strong is a suitable category, aligned between experiments, but it is also best represented by the word strong alone, without other potential synonyms found in E1. Moderate also aligns well between experiments and includes some interchangeable words. Weak too aligns well, but it is quite ambiguous for mid-range and lower levels of correlation.

Relation and Discernibility and Relation and Regularity: the comparison of these two pairs aims at validating the definition of the traits and giving insight into how the traits vary between the two scenarios – visualization to verbalization and verbalization to visualization. For discernibility, from the answers of E1, statements were matched to clear or vague; for vague, negated statements for clear were also grouped together. The Hellinger distances between the distributions are 0.2917 for clear relationship and 0.1260 for no obvious relationship, validating the choice for the traits. The slightly higher distance for clear relationship suggests that the absence of a relationship is a stronger concept than the existence of it, in a visualization to verbalization scenario.

For regularity, the distributions of answers from the first experiment were combined in the following manner: answers mapped to ”loose” were combined with the negated utterances for tight, whilst answers mapped to tight were combined with negated utterances for loose. The Hellinger distances of 0.2554 and 0.2514 for loose and tight, respectively, validate the choice for the trait. However, as discussed above, a tight relationship is a much less ambiguous concept than a loose relationship in the verbalization to visualization scenario. As the second experiment only covered the word ”loose”, further experiments focused on linguistic aspects would help to clarify the differences.

Relation and Direction and Behaviour and Direction: For these two categories, the comparison is again aimed at both validating the traits and understanding the differences in the scenarios. For the first combination, the distances between the distributions are 0.2509 for negative correlation and 0.2444 for positive correlation. The main difference between results is the shape of the distribution for positive correlation, which in the second experiment is concentrated at 0.6 and 0.8, whereas in the first experiment, participants identified the scatterplots from 0.2.

For behaviour and direction, the distances are slightly lower, with 0.1555 for downward trend and 0.1691 for upward trend. As discussed in the previous section, it is possible that the small difference between the two categories relates to the fact that upward and downward are linked to the visual form, but the data collected do not support further conclusions. From the results, both combinations seem appropriate for describing correlation in scatterplots.

6 Discussion

6.1 Study design

We presented two possibilities of experimental setups for achieving our main goal, each addressing a different objective. For the first objective, determining the fundamental linguistic aspects of the association of verbalizations with visualizations, we devised an experiment aimed at collecting utterances without biasing participants in particular directions or language. This meant, for example, choosing a dataset with abstract attributes. While this fulfills our initial aim, it does not reflect real-world situations, which could include analysts dealing with either familiar or unfamiliar data. It is possible that with such data, participants would refer to the variables more often and would try to include additional interpretations of the scatterplots.

For the first experiment, an important aspect is participants’ knowledge about correlation, which affects the variety of utterances and the related concepts and traits. As this is the first published lexicon of this kind, we argue that it is preferable to have a wide range of vocabulary representative of a diverse cohort. Therefore, we did not restrict participants based on any criteria besides self-reported English as a native language. We also argue that knowledge or attention checks during the experiment would likely influence participants’ answers. We recognize that some of the observations and language would be different for different cohorts, which might be more or less numerate. In future work that expands the lexicon for specific types of visualization or that targets a narrower set of participants, a more rigorous screening procedure would be welcome.

For the second experiment, the task procedure is also important. Our setup presents a small number of statements to each participant, one at a time, from the whole pool of statements. One of the aims in doing so was to collect results that would mirror the structure of results of E1, of answers per level of correlation, so that a comparison between experiments would be possible as part of the methodology. Variations of the presentation of the statements and scatterplots include, for example, showing one scatterplot at a time, to potentially minimize comparison between scatterplots.

In terms of overall design, an alternative study that would elicit a similar kind of information is matching statements to scatterplots via a proxy measurement, such as level of agreement. While this would work well for correlation estimation (such as those by Rensink [30]), for comparing results our preference was to ask for direct mappings between the statements and the scatterplots, aware of the fact that this hides potential uncertainties about the answers.

A further question is whether a setup similar to the original ”Words of Estimative Probability” study [36] could have been conducted here; that is, presenting participants with some words and asking them to enter a number for the correlation strength they think the word represents. Unlike the notion of probability, where there is a more widely understood link between concept and number, the concept of correlation is open to interpretation and a much less established understanding could be expected. To further complicate matters, there is even inconsistency between different numerical metrics that measure correlation [14]. Motivated by these considerations, we did not expect participants to represent their interpretations through a number or a level of confidence, as discussed above.

6.2 Context of the concepts and traits

The concepts and traits that we derived from our results are limited by our context of correlations in scatterplots. Although it is likely that, as higher level categories, they will generalize to other visualizations and relationships, it is uncertain how the same words would be used in different experimental setups. Nonetheless, as we discuss in sec. 7, our aim is to provide a baseline for future studies, which could extend or replace parts of our overall categorization and word mappings.

6.3 Employing a semi-automated approach

As outlined in the previous sections, the proposed methodology employs automated natural language processing and machine learning techniques to facilitate the analysis of the resulting data. Methods such as thematic analysis [6], where the whole data set is coded and categorized manually by experts, could be considered more suitable for this setting. With the crowd-based studies, however, conducting such thematic analysis on the whole set is much less feasible due to the large volume of responses, and the automated steps reduce the volume of data to be manually analyzed to more manageable levels. Compared to the conventional settings where thematic analysis is applied, the data we operate on in our setting is much more structured and responses are much more targeted, making it suitable for automated analysis such as part-of-speech tagging or n-gram extraction. The methodology and second experiment are also a way to provide a scalable alternative to identifying empirical support for words and concepts. A more conservative method would be to run E2 for as many words as needed; the aim of the comparison between experiments is to mitigate the need for that.

7 Research Roadmap

The research presented in this paper is a starting point towards the wider goal of understanding how natural language and visualization work together in data analysis contexts. In this section, we present a research agenda to facilitate the further exploration of this topic and we think that the experiment designs, computational analysis routines, and the resulting data, which are all available in an online repository (http://gitlab.city.ac.uk/nlvis/wec_supplement), will serve as stepping stones.

7.1 Verbalization as an evaluation methodology

We observe and demonstrate that our verbalization study into scatterplots provides unprecedented insights on how participants internalize the information conveyed by the scatterplots and how they then externalize this information through language. Systematically conducted verbalization studies have the potential to serve as a new lens to look into visualization designs directly, providing insights into how they work and how well they meet the intended design goals. Visualization researchers already work with natural language utterances of people using their designs collected through methods such as think-aloud protocols or interviews [8]. However, the scope of language used in data from such studies is broad and requires extensive analysis and interpretation from researchers. We argue that visual-verbalization studies will complement existing visualization evaluation methodologies, by providing researchers with systematically structured data on the language used by people reading and processing information communicated through visual representations of data. This should eventually point to the limitations and strengths of the visualization designs. A future challenge to address here is to conduct such studies for more complex tasks while still maintaining the applicability of the natural language processing techniques as described in this paper.

Another potentially interesting avenue is to consider interaction, and how researchers can gather utterances from users that inform them on how well the interactive process is running in a visualization solution. One direction could be to investigate multimodal interaction interfaces, such as chatbots, as an experiment medium to gather data for the evaluation of interaction between systems and users.

7.2 Moving towards interactive systems

One motivation for this work is the increasing prominence of interactive, multimodal data analysis systems [33, 39]. In this paper we produce a semantic lexicon and empirically evidenced utterances to help structure such multimodal interfaces. Our approach here, however, is limited to one-off utterances, i.e. not part of a dialogue but rather as responses to a single stimulus. The structure of dialogues and how the interactive narrative flow could be designed are also key considerations in the design of multimodal interfaces [21], and the semantic lexicon we produce could be considered a part of what Mitchell et al. refer to as a “dialogue seed corpus” [22]. A future direction is to extend these visual verbalization studies to interactive settings where the sequential nature of interactive dialogues could also be captured.

7.3 Linguistic analysis and other languages

The scope of the analysis of utterances in E1 was targeted at understanding the connection between vocabulary and the varying levels of correlation, of which the implications for NLG-empowered visualization are more immediate. Although the methodology employs natural language processing methods, linguistic aspects related to sentence construction or the association with the varied backgrounds of participants were not examined. Investigating this perspective can be part of an investigation about the cognitive processes related to verbalization in data analysis contexts. Such an in-depth understanding, in particular if related to the theoretical frameworks emerging from cognitive science in similar vein to what Padilla et al. did within the context of decision-making [24], has the potential to inform designers on respecting and acting on cognitive limitations and strengths of viewers of visualizations.

Expanding the lexicon and the experiments to languages other than English is also an interesting path to take. Given the different ways of constructing words and sentences, such as declensions in German, it is likely that a wholly different analysis process would emerge, as well as other combinations of concepts and traits. As related cultural and educational aspects would also be part of this, it would be interesting to see how visual literacy and numeracy affect the language used.

7.4 Visualization and data relationship path

Immediate follow-up experiments can be based on varying the visualizations or the underlying data relationship. The semantic lexicon produced in this paper can be enriched by new experiments, or used as a baseline for comparing how people respond to different stimuli.

Types of visualization: systematic variations of visualizations based on visual variables can be used to compare results with the baseline lexicon for correlation in scatterplots. Experiments can be set up to investigate changes across point-based visualizations, such as dot plots, or with substantially different encodings, such as bar or line charts. Using one lexicon as a baseline will enable quantifying the differences in results and understanding how the language used by participants varies based on the encodings. For NLG-assisted visual analysis, acquiring knowledge about how concepts and traits overlap or differ across lexicons will help to generate statements that can be generalized to different types of visualizations.
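As an illustration of how differences between a baseline lexicon and a lexicon from a new experiment might be quantified, the following is a minimal sketch that compares the word distributions of two sets of utterances using the Jensen-Shannon divergence. The choice of measure, the whitespace tokenization, the function names and the example utterances are all assumptions made for illustration; they are not the analysis used in this paper.

```python
import math
from collections import Counter


def word_distribution(utterances):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(word for u in utterances for word in u.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with no shared vocabulary.
    """
    jsd = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            jsd += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            jsd += 0.5 * qw * math.log2(qw / mw)
    return jsd


# Hypothetical utterances, invented for this example only.
scatterplot_utterances = ["strong positive correlation", "points rise together"]
bar_chart_utterances = ["bars increase steadily", "strong upward trend"]

baseline = word_distribution(scatterplot_utterances)
variant = word_distribution(bar_chart_utterances)
divergence = jensen_shannon_divergence(baseline, variant)
```

A low divergence would suggest that participants reuse much of the baseline vocabulary for the new encoding, while a value close to 1 would suggest largely separate vocabularies. In practice the comparison would more likely be made at the level of concepts and traits rather than raw tokens.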

Types of statistical relationships and characteristics: follow-up experiments can also vary the data relationship under focus or characteristics of the data, such as those governing cluster separation [32]. Proxy measures for scagnostics [43] can be used to simulate data for further studies, enabling both comparison and expansion of the lexicon. As correlation studies are generally focused on bivariate datasets, investigating multivariate or univariate data, as well as temporal or spatial attributes, would also greatly expand the scope of the lexicon and the potential for applying the knowledge to a wider range of NLG-supported tools.
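To make the idea of simulating data for a target characteristic concrete, the sketch below generates bivariate points whose population Pearson correlation equals a chosen value. It is a simple stand-in based on correlated standard normal variables, not a scagnostics measure and not necessarily the procedure used to generate the stimuli in this paper; the function names and parameters are hypothetical.

```python
import math
import random


def simulate_bivariate(n, rho, seed=None):
    """Draw n (x, y) points from standard normals with population correlation rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must be in [-1, 1]")
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Mixing x with independent noise gives corr(x, y) = rho in expectation.
        y = rho * x + math.sqrt(1 - rho ** 2) * noise
        points.append((x, y))
    return points


def pearson_r(points):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    return cov / math.sqrt(var_x * var_y)


# Example: one candidate stimulus dataset per target correlation level.
stimuli = {rho: simulate_bivariate(200, rho, seed=42) for rho in (0.1, 0.5, 0.9)}
```

Because the sample correlation of a finite draw only approximates the target, a study that needs exact levels could resample until `pearson_r` falls within a chosen tolerance of `rho`.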

8 Conclusion

We described two experiments aimed at advancing the understanding of how people relate verbalizations to visualizations of correlation in scatterplots. We contribute data, analysis and observations towards that goal, in the form of a collection of utterances and a semantic lexicon from a visualization to verbalization experiment. Our findings present a diversity of vocabulary for the different levels of correlation, with some concepts more prevalent than others in different ranges. We also show that our organization of concepts into utterances resonates with users in a verbalization to visualization experiment.

We conducted our analysis through a transferable methodology to support a research roadmap towards a further understanding of the use of natural language with visualization for other types of visualization, statistical relationships and languages. We also argue for a role for systematic natural language experiments as a means of evaluating visualizations and advancing multimodal interactive systems. The potential success of future work relies heavily on multi-disciplinary thinking, with concerted effort from researchers in visualization and linguistics, as well as those in cognitive sciences to relate to theories of cognition. With the eventual aim of building a comprehensive and multi-faceted understanding of how visual and textual representations work in human reasoning and decision making, this strand of research is likely to open up new directions and further the impact of human-data interaction systems in our increasingly data-rich society.


We thank Johannes Liem for the help with setting up the online studies and discussions. This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) with grant number EP/P025501/1.


  • [1] S. Barclay, R. V. Brown, C. W. Kelly III, C. R. Peterson, and L. D. Phillips (1977) Handbook for decision analysis. Technical report DECISIONS AND DESIGNS INC MCLEAN VA. Cited by: §1.
  • [2] M. Basseville (2013-04) Divergence measures for statistical data processing - An annotated bibliography. Signal Processing 93 (4), pp. 621–633. External Links: Document, ISSN 01651684, Link Cited by: §5.3.
  • [3] R. Beecham, J. Dykes, W. Meulemans, A. Slingsby, C. Turkay, and J. Wood (2017) Map LineUps: Effects of spatial structure on graphical inference. IEEE Transactions on Visualization and Computer Graphics 23 (1), pp. 391–400. External Links: Document, ISSN 10772626, Link Cited by: §2.2.
  • [4] S. Bird, E. Klein, and E. Loper (2009) Natural Language Processing with Python. O’Reilly Media Inc.. Cited by: §4.2.1.
  • [5] P. Bobko and R. Karren (1979-06) The Perception of Pearson Product Moment Correlations From Bivariate Scatterplots. Personnel Psychology 32 (2), pp. 313–325. External Links: Document, ISSN 17446570, Link Cited by: §2.2.
  • [6] V. Braun and V. Clarke (2012) Thematic analysis. H. Cooper (Ed.), Vol. 2. External Links: ISBN 9781433810039 Cited by: §6.3.
  • [7] M. Brehmer, S. Carpendale, B. Lee, and M. Tory (2014) Pre-design empiricism for information visualization: Scenarios, methods, and challenges. Proceedings of the Fifth Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization, pp. 147–151. External Links: Document, ISBN 9781450332095 Cited by: §2.1.
  • [8] S. Carpendale (2008) Evaluating information visualizations. In Information visualization, pp. 19–45. Cited by: §7.1.
  • [9] K. B. Cohen and A. Dolbey (2007) Foundations of Statistical Natural Language Processing (review). Language 78 (3), pp. 599–599. External Links: Document, ISBN 0262133601, Link Cited by: §4.2.1, §4.2.1.
  • [10] M. Correll and J. Heer (2017) Regression by eye: estimating trends in bivariate visualizations. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, New York, NY, USA, pp. 1387–1396. External Links: ISBN 978-1-4503-4655-9, Link, Document Cited by: §2.2.
  • [11] Ç. Demiralp, P. J. Haas, S. Parthasarathy, and T. Pedapati (2017-08) Foresight: recommending visual insights. Proc. VLDB Endow. 10 (12), pp. 1937–1940. External Links: ISSN 2150-8097, Link, Document Cited by: §2.1.
  • [12] T. Gao, M. Dontcheva, E. Adar, Z. Liu, and K. Karahalios (2015) DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, Charlotte, NC, USA, pp. 489–500. External Links: Document, ISBN 978-1-4503-3779-3 Cited by: §1, §2.1.
  • [13] L. Harrison, F. Yang, S. Franconeri, and R. Chang (2014-12) Ranking Visualizations of Correlation Using Weber’s Law. IEEE Transactions on Visualization and Computer Graphics 20 (12), pp. 1943–1952. External Links: Document, ISSN 1077-2626, Link Cited by: §1, §2.2, §4.1.
  • [14] J. Hauke and T. Kossowski (2011) Comparison of values of pearson’s and spearman’s correlation coefficients on the same sets of data. Quaestiones geographicae 30 (2), pp. 87–93. Cited by: §6.1.
  • [15] E. Hoque, V. Setlur, M. Tory, and I. Dykeman (2018-01) Applying Pragmatics Principles for Interaction with Visual Analytics. IEEE Transactions on Visualization and Computer Graphics 24 (1), pp. 309–318. External Links: Document, ISSN 1077-2626, Link Cited by: §1, §2.1.
  • [16] A. Jain and J. M. Keller (2015) Textual summarization of events leading to health alerts. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS 2015-November, pp. 7634–7637. External Links: Document, ISBN 9781424492718, ISSN 1557170X Cited by: §2.1.
  • [17] M. Kay and J. Heer (2016) Beyond Weber’s Law: A Second Look at Ranking Visualizations of Correlation. IEEE Transactions on Visualization and Computer Graphics 22 (1), pp. 469–478. External Links: Document, ISBN 1077-2626, ISSN 10772626 Cited by: §2.2, §4.1.
  • [18] J. Kim, M. de Marneffe, and E. Fosler-Lussier (2016) Adjusting Word Embeddings with Semantic Intensity Orders. pp. 62–69. External Links: Document Cited by: §5.3.
  • [19] S. Latif and F. Beck (2019) Interactive map reports summarizing bivariate geographic data. Visual Informatics 3 (1), pp. 27–37. External Links: Document, ISSN 2468502X, Link Cited by: §2.1.
  • [20] H. Liu, S. T. Wu, D. Li, S. Jonnalagadda, S. Sohn, K. Wagholikar, P. J. Haug, S. M. Huff, and C. G. Chute (2012) Towards a semantic lexicon for clinical natural language processing.. AMIA Annu Symp Proc 2012, pp. 568–576. External Links: ISSN 1942-597X, Link Cited by: §3, §4.2.2.
  • [21] S. Liu, S. Seneff, and J. Glass (2010) A collective data generation method for speech language models. In 2010 IEEE Spoken Language Technology Workshop, pp. 223–228. Cited by: §7.2.
  • [22] M. Mitchell, D. Bohus, and E. Kamar (2014) Crowdsourcing language generation templates for dialogue systems. In Proceedings of the INLG and SIGDIAL 2014 Joint Session, pp. 172–180. Cited by: §7.2.
  • [23] H. Mumtaz, S. Latif, F. Beck, and D. Weiskopf (2019) Exploranative Code Quality Documents. IEEE Transactions on Visualization and Computer Graphics, pp. 1–1. External Links: Document, 1907.11481, ISSN 1077-2626, Link Cited by: §2.1.
  • [24] L. M. Padilla, S. H. Creem-Regehr, M. Hegarty, and J. K. Stefanucci (2018) Decision making with visualizations: a cognitive framework across disciplines. Cognitive Research: Principles and Implications 3 (1), pp. 29. Cited by: §7.3.
  • [25] A. V. Pandey, J. Boy, E. Bertini, C. Felix, and J. Krause (2016) Towards Understanding Human Similarity Perception in the Analysis of Large Sets of Scatter Plots. pp. 3659–3669. External Links: Document, ISBN 9781450333627 Cited by: §2.2.
  • [26] I. Pollack (1960) Identification of visual correlational scatterplots. Journal of Experimental Psychology. External Links: Document, ISSN 00221015 Cited by: §2.2.
  • [27] Qlik Software Qlik Insight Bot. External Links: Link Cited by: §1.
  • [28] E. Reiter and R. Dale (2000) Building natural language generation systems. Cambridge University Press. External Links: ISBN 0521620368 Cited by: §1, §2.1.
  • [29] E. Reiter, S. G. Sripada, and R. Robertson (2003) Acquiring Correct Knowledge for Natural Language Generation. Technical report Vol. 18. External Links: Link Cited by: §1, §2.1.
  • [30] R. A. Rensink and G. Baldridge (2010-08) The Perception of Correlation in Scatterplots. Computer Graphics Forum 29 (3), pp. 1203–1210. External Links: Document, ISSN 01677055, Link Cited by: §1, §2.2, §4.1, §6.1.
  • [31] B. Roark and E. Charniak (1998) Noun-phrase co-occurrence statistics for semiautomatic semantic lexicon construction. In Proceedings of the 17th international conference on Computational linguistics -, Vol. 2, Morristown, NJ, USA, pp. 1110–1116. External Links: Document, Link Cited by: §3.
  • [32] M. Sedlmair, A. Tatu, T. Munzner, and M. Tory (2012) A taxonomy of visual cluster separation factors. In Computer Graphics Forum, Vol. 31, pp. 1335–1344. Cited by: §7.4.
  • [33] V. Setlur, S. E. Battersby, M. Tory, R. Gossweiler, and A. X. Chang (2016) Eviza: A Natural Language Interface for Visual Analysis. Proceedings of the 29th Annual Symposium on User Interface Software and Technology 10 (16), pp. 365–377. External Links: Document, ISBN 9781450341899 Cited by: §1, §1, §2.1, §7.2.
  • [34] R. Sevastjanova, F. Beck, B. Ell, C. Turkay, R. Henkin, M. Butt, D. A. Keim, and M. El-Assady (2018) Going beyond visualization: verbalization as complementary medium to explain machine learning models. In Workshop on Visualization for AI Explainability at IEEE VIS, Cited by: §1.
  • [35] V. Sher, K. G. Bemis, I. Liccardi, and M. Chen (2017) An Empirical Study on the Reliability of Perceiving Correlation Indices using Scatterplots. Computer Graphics Forum 36 (3), pp. 61–72. External Links: Document, ISSN 14678659 Cited by: §2.2.
  • [36] S. Kent (1964) Words of estimative probability. Technical report Central Intelligence Agency. Cited by: §1, §6.1.
  • [37] A. Srinivasan, S. M. Drucker, A. Endert, and J. Stasko (2019) Augmenting visualizations with interactive data facts to facilitate interpretation and communication. IEEE Transactions on Visualization and Computer Graphics 25 (1), pp. 672–681. External Links: Document, ISSN 19410506 Cited by: §2.1.
  • [38] A. Srinivasan and J. Stasko (2017) Natural Language Interfaces for Data Analysis with Visualization: Considering What Has and Could Be Asked. In EuroVis 2017 - Short Papers, B. Kozlikova, T. Schreck, and T. Wischgoll (Eds.), External Links: Document Cited by: §1, §2.1.
  • [39] A. Srinivasan and J. Stasko (2017) Orko: Facilitating Multimodal Interaction for Visual Exploration and Analysis of Networks. InfoVis TVCG 24 (SI), pp. 11. External Links: Document, ISSN 1077-2626 Cited by: §1, §1, §2.1, §7.2.
  • [40] Y. Sun, J. Leigh, A. Johnson, and S. Lee (2010) Articulate: A semi-automated model for translating natural language queries into meaningful visualizations. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6133 LNCS, pp. 184–195. External Links: Document, ISBN 3642135439, ISSN 03029743 Cited by: §1, §2.1.
  • [41] Tableau Software Tableau. External Links: Link Cited by: §1.
  • [42] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer (2007) Feature-rich part-of-speech tagging with a cyclic dependency network. External Links: Document Cited by: §4.2.1.
  • [43] L. Wilkinson, A. Anand, and R. Grossman (2005) Graph-theoretic scagnostics. In Proceedings - IEEE Symposium on Information Visualization, INFO VIS, External Links: Document, ISBN 078039464X, ISSN 1522404X Cited by: §7.4.