Quda: Natural Language Queries for Visual Data Analytics

05/07/2020 ∙ by Siwei Fu, et al.

Visualization-oriented natural language interfaces (V-NLIs) have been explored and developed in recent years. One challenge faced by V-NLIs is the formation of effective design decisions, which usually requires a deep understanding of user queries. Learning-based approaches have shown potential in V-NLIs and reached state-of-the-art performance in various NLP tasks. However, because of the lack of sufficient training samples that cater to visual data analytics, cutting-edge techniques have rarely been employed to facilitate the development of V-NLIs. We present a new dataset, called Quda, to help V-NLIs understand free-form natural language. Our dataset contains 14,035 diverse user queries annotated with 10 low-level analytic tasks that assist in the deployment of state-of-the-art techniques for parsing complex human language. We achieve this goal by first gathering seed queries with data analysts who are target users of V-NLIs. Then we employ extensive crowd force for paraphrase generation and validation. We demonstrate the usefulness of Quda in building V-NLIs by creating a prototype that makes effective design decisions for free-form user queries. We also show that Quda can be beneficial for a wide range of applications in the visualization community by analyzing the design tasks described in academic publications.


1 Related Work

Our work draws on prior research in V-NLIs, datasets for visualization research, and similar corpora in the NLP domain.

1.1 NLI for Data Visualization

Our research is motivated by recent progress in V-NLIs, which provide an engaging and effective user experience by combining direct manipulation and natural language as input.

Some research emphasizes multi-modal interfaces that interact with users using natural language. As early as 2010, Articulate [Sun2010] identified nine task features and leveraged keyword-based classification methods to understand how a user's imprecise query is weighted across these features. Sisl [Cox2001] is a multi-modal interface that accepts various input modes, including a point-and-click interface, NL input in textual form, and NL via speech. Analyza [Dhamdhere2017] is a system that combines a V-NLI with a structured interface to enable effective data exploration. To address the ambiguity and complexity of natural language, Datatone [Gao2015] presents a mixed-initiative approach to manage ambiguity in the user query. Flowsense [Yu2020] is a V-NLI designed for a dataflow visualization system; it applies semantic parsing to support natural language queries and allows users to manipulate multiple views in the dataflow system. Several commercial tools, such as Microsoft Power BI [microsoft], IBM Watson Analytics [IBM], Wolfram Alpha [wolfram], Tableau [tableau], and ThoughtSpot [thoughtspot], integrate V-NLIs to provide a better analytic experience for novice users.

Another line of research targets the conversational nature of V-NLIs. Fast et al. proposed Iris [Fast2018], a conversational user interface that helps users with data science tasks. Kumar et al. [Kumar2016] aimed to develop a data analytics system that automatically generates visualizations using a full-fledged conversational interface. Eviza [Setlur2016] and Evizeon [Hoque2018] are visualization systems that enable natural language interactions to create and manipulate visualizations in a cycle of visual analytics. Similarly, Orko [Srinivasan2018] is a prototype visualization system that combines a natural language interface with direct manipulation to assist visual exploration and analysis of graph data.

The aforementioned approaches are mainly implemented using rule-based language parsers, which provide limited support for free-form NL input. Instead of generating visualizations, Text-to-Viz [Cui2020] aims to generate infographics from natural language statements with proportion-related statistics. The authors sampled valid proportion-related statements and built a machine learning model to parse utterances. Although promising, Text-to-Viz does not support queries for a broad range of analytic activities.

The performance and usage scenarios of V-NLIs depend heavily on language parsers. Cutting-edge NLP models have reached close-to-human accuracy on various tasks, such as semantic parsing, text classification, and paraphrase generation. However, few have been applied to V-NLIs. We argue that the release of a high-quality textual dataset would assist the training and evaluation of NLP models in the domain of data analytics and visualization.

1.2 Datasets for Visualization Research

An increasing number of studies in the visualization community have employed supervised learning approaches in various scenarios, such as mark type classification [Savva2011], reverse engineering [Poco2017], color extraction, etc. The capability of these approaches, however, relies heavily on the availability of massive datasets for training and evaluation.

Some datasets consisting of visual diagrams have been constructed and open-sourced for visualization research. For example, Savva et al. [Savva2011] compiled a dataset of over 2,500 chart images labeled by chart type. Similarly, the Massvis dataset [Borkin2013] contains over 5,000 static chart images, of which over 2,000 are labeled with chart type information. Beagle [Battle2018] embeds a web crawler that extracts more than 41,000 SVG-based visualizations from the web, all labeled by visualization type. Viziometrics [Lee2018] collects about 4.8 million images from scientific literature and classifies them into five categories: equation, diagram, photo, plot, and table. Poco and Heer [Poco2017] compiled a chart corpus in which each chart image is coupled with bounding boxes and transcribed text from the image. Instead of compiling a corpus of chart images, Quda includes queries accompanied by analytic tasks for V-NLIs.

Viznet, on the other hand, collects over 31 million real-world datasets that provide a common baseline for comparing visual designs. Though it contains rich textual information, Viznet can hardly be applied to provide training samples for learning-based NLP models. In contrast, Quda is designed to help V-NLIs understand utterances describing analytic tasks in information visualization.

1.3 Corpora in NLP Domain

Our corpus is primarily designed to help V-NLIs classify queries into analytic tasks, so we first survey related corpora for text classification. Second, we introduce paraphrase datasets because our corpus can support paraphrasing tasks to some extent. Third, we present prior corpora for text-to-SQL because they belong to the domain of data analytics, which is similar to Quda.

Text classification is the task of categorizing a piece of textual content into an appropriate category, and a number of voluminous datasets have been proposed for it. For example, AG's News corpus [Zhang2015] includes news articles accompanied by high-level categories. The Sogou News corpus [Zhang2015] consists of training and testing examples, each classified into one of five categories: "sports", "finance", "entertainment", "automobile", and "technology". Similar corpora [Zhang2015] include DBPedia, Yelp Review Polarity, Yelp Review Full, etc. Though large in scale, sentences in the aforementioned corpora differ from V-NLI queries in syntax and semantics. Akin to Quda, TREC [Voorhees1999] is a corpus for question answering. It has two versions, i.e., TREC-6 and TREC-50. TREC-6 assigns each question one of six labels, i.e., abbreviation, description and abstract concepts, entities, human beings, locations, and numeric values, whereas TREC-50 uses finer-grained categories. Both provide training and test samples. Quda differs from TREC in two aspects: 1) it is roughly three times larger in scale, and 2) TREC covers different domains from Quda.

Given an original sentence, techniques for paraphrase generation aim to generate text that conveys the same meaning as the original. Rich and voluminous paraphrase datasets have been constructed to support learning-based models for paraphrase generation. For example, the Twitter URL Paraphrasing Corpus [Lan2017] is a continuously growing paraphrase dataset that links tweets through shared URLs. Quora Question Pairs [quora] contains potential duplicate question pairs. MSCOCO [Lin2014] contains images, each associated with several different sentences describing it. The key difference between Quda and the existing paraphrase corpora is the application domain, which affects lexical distribution and syntax. Our corpus is the first attempt to build a paraphrase corpus for data analytics. Paraphrase generation is an important task that influences downstream tasks such as dialog systems, semantic parsing, and information retrieval. We envision that the release of Quda could foster research in paraphrase generation and broader NLP tasks for data analytics.

Text-to-SQL is the task of mapping natural language to meaningful, executable SQL queries, which requires a deep understanding of both natural language and databases. Example corpora include WikiSQL [Zhong2017], Spider [Yu2018], Advising [FineganDollak2018], Yelp and IMDB [Yaghmazadeh2017], MAS [Li2014], and SCHOLAR [Iyer2017], to name a few. These corpora are powerful in applications such as database-oriented visualization. Quda can be distinguished from the corpora mentioned above in two aspects. First, queries in text-to-SQL corpora are not labeled with analytic tasks, on which the choice of visualization highly relies [Saket2019]; the lack of task information would hinder V-NLIs from choosing a proper visualization to respond to a query. In contrast, our study focuses on constructing a corpus for low-level analytic tasks. Second, Quda covers a broader range of analytic tasks than text-to-SQL corpora. As discussed in Spider [Yu2018], the existing corpora are limited to a small set of SQL operators, such as ORDER BY, GROUP BY, and WHERE. Hence, supporting complex tasks such as "Find Anomalies" or "Correlate" is not feasible for text-to-SQL corpora.

2 Problem Definition

Analytic tasks play a pivotal role in visualization design and evaluation. Understanding tasks is beneficial for a wide range of applications in the visualization community, such as visualization recommendation, computational linguistics for data analytics, etc. The goal of this paper is to propose a corpus in the domain of visual data analytics that facilitates the deployment of learning-based NLP techniques for V-NLIs. Specifically, our corpus focuses on queries that reflect how data analysts ask questions in V-NLIs. Due to the variation and ambiguity of natural language, the space of possible queries is intractably large. To narrow down the scope of our corpus, we borrow ideas from prior work on visualization frameworks [Munzner2014, Amar2005, Rind2016] and V-NLIs for visualization [Tory2019, Srinivasan2018] and characterize Quda along six dimensions, i.e., abstraction level, composition level, perspective, type of data, type of task, and context-dependency.

Abstraction: Concrete. The abstraction level is one dimension of analytic tasks describing the concreteness of a query [Rind2016]. Abstract queries are generic and are useful for generalizing the concept behind a concrete task (an example is "Find maximum"). Such queries can be addressed in multiple ways depending on the interpretation. However, to obtain a reasonable response from a V-NLI, an analyst should author a query that provides sufficient information [Srinivasan2018], such as "Find top movies with most stars" or "Retrieve the country with the most population". Therefore, our corpus focuses on queries at a low abstraction level that express both tasks and values explicitly.

Composition: Low. The composition level describes the extent to which a query encompasses sub-queries [Rind2016]. Composition is a continuous scale from low-level to high-level. A query with a high composition level consists of multiple sub-queries, and a V-NLI may answer it in multiple steps. For example, for "For the movie with the most stars, find the distribution of salaries of the filming team," a V-NLI first needs to identify the movie with the most stars and then display the salary distribution of all staff in its filming team. As our corpus is the first attempt to collect queries for V-NLIs, we focus on queries with a low composition level at this stage. We plan to include queries with a high composition level in future research.

Perspective: Objectives. Queries in V-NLIs can be classified into two categories, i.e., objectives and actions [Rind2016]. Objectives are queries raised by analysts seeking to satisfy a curiosity or solve a problem. Actions in V-NLIs are executable steps toward achieving objectives, and usually relate to interactive features of visualization artifacts, such as "Show the distribution using a bar chart" or "Map the popularity of applications to the x-axis." However, assuming that V-NLI users are knowledgeable in constructing effective visualizations is unrealistic. Instead of enumerating actions, we aim to collect queries that are objectives raised by data analysts.

Type of Data: Table. The type of dataset affects the syntax and semantics of queries. For example, analysts may seek to identify the shortest path between two nodes in a network dataset, whereas such queries rarely occur for tabular datasets because links between items are not represented explicitly. Data and datasets can be categorized into five types [Munzner2014], including tables, networks & trees, fields, geometry, and clusters & sets & lists. At the time of submission, our corpus targets queries based on tabular datasets; nevertheless, we argue that it can support other types of data to some extent. For example, the task of "Find Nodes" in networks is similar to "Retrieve Value" in tabular data. We plan to extend our research to support other data types comprehensively in the future.

Type of Tasks: 10 Low-level Tasks. Collecting a corpus for all possible tasks is not feasible. Therefore, we turned to related studies to identify analytic tasks on which to focus. In this study, we adopt the taxonomy proposed by Amar et al. [Amar2005], which categorizes ten low-level analytic activities, such as Retrieve Value, Sort, etc. These tasks serve as a good starting point for our research despite not being comprehensive.

Context Dependency: Independent. The conversational nature of V-NLIs may result in queries that rely on contextual information [Srinivasan2018, Tory2019]. For example, a query "Find the best student in the class" may be followed by "Obtain the English score of the student." The second query is incomplete on its own and is a follow-up to the first. Such queries are referred to as contextual queries and are not the focus of our research. Our corpus focuses on queries that contain complete references to tasks and the values associated with them.

3 Constructing Quda

Quda is the first part of an ambitious project. We aim to incorporate theories and techniques from the visualization and NLP domains to facilitate the process of data analytics. Thus far, we have collected queries based on data tables and low-level analytic tasks, and we plan to construct additional large-scale corpora in the next two years. In this work, we incorporate both expert and crowd intelligence in data acquisition, which is divided into three stages. In the first stage, we conduct an interview study in which data analysts author queries based on data tables and low-level tasks; we derive the expert queries from the study results. In the second stage, we perform a large-scale crowdsourced experiment to collect sentential paraphrases for each expert query, i.e., restatements of the original query with approximately the same meaning [Wang2018]. Finally, we design a validation stage involving both crowd force and a machine learning algorithm to ensure data quality. The third stage results in a validated set of paraphrases, with each expert sentence accompanied by multiple paraphrases on average. The teaser figure shows an overview of the entire data acquisition procedure.

4 Employing Expert Intelligence

In this section, we describe expert interviews that aim to collect professional queries from data analysts, given analytic tasks and data tables in different domains.

4.1 Participants and Apparatus

To understand how data analysts raise questions from data tables, we recruit 20 individuals (10 female, 10 male) with ages ranging from 24 to 31 years. In our study, we identify "data analysts" as people who are experienced in data mining or visual analytics and have at least one publication in related fields. Most participants are postgraduate students majoring in Computer Science or Statistics, and the rest work as data specialists in IT companies. The experiments are conducted on a laptop (2.8GHz 4-Core Intel Core i7, 16 GB memory) on which participants read documents and create queries.

4.2 Tasks and Data Tables

Tasks play a vital role in authoring queries. Our study begins with the taxonomy of low-level visualization tasks [Amar2005], including Retrieve Value, Compute Derived Value, Find Anomalies, Correlate, etc. The pilot study shows that participants may be confused about some tasks. For example, one commented, “Determine Range is to find the maximum and minimum values in one data field, which is similar to Find Extremum.” To help participants clarify the scope of each task, we compiled a document presenting each task from three aspects, i.e., general description, definition, and example sentences. The document is shown in the supplementary material.

Data tables provide rich context for participants to create queries. Hence, to diversify the semantics and syntax of queries, we prepared real-world data tables covering different domains, including health care, sports, entertainment, etc. Tables with insufficient data fields may not support some types of queries. For example, suppose a table about basketball players has only two columns, players' names and their nationalities; participants may find it hard to author queries in the "Find Extremum" or "Correlate" categories, which usually require numeric fields. Hence, instead of skipping such tasks, we allow participants to revise tables by adding new columns or editing existing ones if necessary. Moreover, we selectively chose data tables with rich background information and an explanation of each column so that participants can become familiar with them in a short time. All data were collected from Kaggle (https://www.kaggle.com/datasets), Data World (https://data.world/), and Google Dataset Search (https://datasetsearch.research.google.com/).

4.3 Methodology and Procedure

The interview began with a brief introduction to the purpose of the study and the rights of each participant. We then collected demographic information from each participant, such as age, sex, and experience in data mining and visual analytics. After that, participants were asked to familiarize themselves with the analytic tasks by reading the document.

In the training stage, participants were asked to author queries for tasks based on a sample data table that differs from those used in the main experiment. We instructed them to think aloud and resolved any difficulties they encountered. We encouraged participants to author queries with diverse syntactic structures and semantics.

The pilot results indicate that participants were more enthusiastic about generating sentences with diverse syntax when they were interested in the context of the table. At the beginning of the main experiment, we presented the tables, together with detailed explanations of their context and data fields, to participants and encouraged them to choose two based on their interests. We randomized the presentation order of the tasks. Participants were guided to author at least two queries for each combination of table and task, and no time limit was given for completing a query. Because some participants enjoyed the authoring experience and generated more sentences for some tasks, the total number of collected queries exceeded the target set for this stage.

After the main experiment, a post-interview discussion was conducted to collect feedback. Interviews were audio-recorded, and we took notes of the participants' comments for further analysis. Each interview lasted approximately two hours, and each interviewee received a gift card for their participation.

4.4 Results

The result of our interview study is a corpus of queries generated by data analysts. We characterize these queries from three aspects and examine how the analytic tasks vary in each aspect. First, we derive basic statistics for each task, i.e., the number of sentences per task and the distribution of sentence length. Next, we count the number of uni-grams (with stop-words removed) and bi-grams to measure lexical diversity (a larger number indicates higher diversity). Finally, to evaluate diversity at the sentence level, we compute the average pairwise BLEU score [Papineni2002] for queries within a task (the lower the score, the higher the diversity). We do not explain these metrics in detail because they have been presented at length in prior work. Figure 1 shows the distribution of each property.
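
For illustration, the two diversity measures can be computed as sketched below with NLTK; the tokenization, stop-word handling, and BLEU weights are assumptions rather than the exact settings used in the analysis.

```python
from itertools import permutations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def avg_pairwise_bleu(queries):
    """Average pairwise BLEU among the queries of one task (lower = more
    diverse). `queries` is a list of tokenized sentences (lists of words)."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([ref], hyp, smoothing_function=smooth)
              for ref, hyp in permutations(queries, 2)]
    return sum(scores) / len(scores)

def distinct_ngrams(queries, n=1):
    """Number of distinct n-grams (n=1 or 2) across the queries of one task,
    used as a simple lexical-diversity proxy."""
    return len({tuple(q[i:i + n]) for q in queries for i in range(len(q) - n + 1)})
```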

We observe that "Sort" has the fewest sentences. Since we emphasized diversity during the interview, participants did not attempt to author more queries when they found it hard to create sentences with diverse structures; we therefore infer that participants found it hard to create diverse queries for "Sort". This observation is confirmed quantitatively: "Sort" has the fewest uni-grams and bi-grams compared to the other tasks, and its average pairwise BLEU score is also high. Akin to "Sort", "Characterize Distribution" has the highest pairwise BLEU score while its numbers of uni-grams and bi-grams are low. That is, its queries are similar at both the lexical and sentential levels.

On the contrary, some tasks, such as "Find Anomalies", "Correlate", and "Filter", have lower BLEU scores and more uni-grams and bi-grams, which means queries for these tasks are linguistically diverse. We observed that participants found it enjoyable to write queries for these tasks. For example, one participant commented in the post-interview discussion, "Find Anomalies is an interesting task because it requires an in-depth understanding of the data table." She further added, "I asked 'Based on the relationship among rating, installs, and the number of reviews, which app's rating does not follow the trend?' because this kind of abnormal app is worth investigating."

Figure 1: The analysis of tasks using five metrics. The two lexical features, i.e., the numbers of uni-grams and bi-grams, follow similar trends, while the sentential feature (average pairwise BLEU score) shows a distinct distribution compared to the lexical features.

4.5 Reflections and Discussion

The target corpus should meet a set of requirements as discussed in Section 2. However, we did not attempt to overload participants by explaining all requirements during the study. Instead, we carefully guided them to avoid potential pitfalls in writing queries. In this section, we report the considerations and reflections.

How is the abstraction level concretized? Our corpus focuses on queries with a low abstraction level. However, some queries were authored at a high abstraction level, which raises ambiguity in understanding the intent. For example, given a movie dataset, one participant wrote down "Sort all movies." The query does not state on what criteria the movies should be sorted. Hence, to make queries more concrete, participants were asked to answer the question, "How would you draw a visualization to address this query?" The answer may reveal the participant's implicit intention, and they were then encouraged to revise the query by adding more details.

How do we lower the composition level? We sought to collect queries with a low composition level. However, identifying the extent to which a query encompasses sub-queries is difficult. For example, given a table about students' performance in a class, one participant wanted to know, "Who is the best student in the class?" This query can be broken down into "Compute the total score of all subjects for each student" and "Find the student with the maximum score." Thus, to identify an adequate composition level, we asked participants to keep one question in mind: "Can this query be answered with a single visualization in the first step?" Participants were asked to revise the query until the answer was "Yes". Taking the above sentence as an example, an analyst might answer the query directly with a stacked bar chart in which the x-axis is the student ID and the scores of each subject are stacked on the y-axis. Hence, the composition level of that query is appropriate.

How should they focus on the objectives? Some of our participants are visualization experts. Hence, they tended to raise queries such as "Show a histogram for ... (one data field)" for the task "Characterize Distribution". As discussed in Section 2, we focus on objectives rather than actions. Therefore, we asked them to revise such queries by pretending that they had no prior knowledge of visualization.

How to assign task labels? Some queries naturally belong to multiple tasks. For example, the sentence "For applications developed in the US, which is the most popular one?" falls into both the "Filter" and "Find Extremum" categories. Identifying all task labels is important for reflecting the characteristics of each query. However, in our corpus each query is attached to a single task label based on the participant's intent. For example, if the above sentence was authored under the task "Find Extremum", it is attached to that label. We plan to address this issue with two approaches in future research. First, we can post-process all queries using machine learning approaches to identify the other tasks they may belong to. Second, we can improve the design of the interview study by asking participants to review their queries.

How to diversify the syntax and semantics? The diversity of syntax and semantics is critical to the quality of our corpus. Instead of asking participants to use different syntax, we instructed them to raise "interesting and meaningful" questions. By focusing on semantics, participants tended to construct queries with detailed information, which benefits diversity at both the lexical and sentence levels. For example, one participant noted, "Retrieve Value is a kind of trivial task. However, it is non-trivial to me because I was thinking about which value is worth retrieving." She further commented, "In the second table (containing demographic information for the USA), I asked 'How many of the Orange County residents are Asian?' because Asians are dominant in Orange County. Such a question is meaningful, I believe."

5 Borrowing Crowd Intelligence

Paraphrase acquisition is a common approach to data augmentation. The goal of this stage is to extend the corpus by collecting paraphrases of the expert sentences using crowd intelligence. This stage is divided into two steps, i.e., paraphrase generation and validation. All experiments were conducted on MTurk.

5.1 Paraphrase Acquisition at Scale

In this step, we aim to acquire multiple paraphrases for each expert sentence. We followed an established crowdsourcing pipeline [Burrows2013] to collect paraphrases using crowd intelligence. We developed a web-based system that the crowd can access through MTurk. Our system records both the crafted queries and user behavior, e.g., keystrokes and duration, for further analysis.

Following [Potthast2010], we describe our task to crowd workers as, "Rewrite the original text found below so that the rewritten version has the same meaning, but a completely different wording and phrasing." The interface also encourages the crowd to craft paraphrases that differ from the original in sentence structure, and we show both valid and invalid examples to explain our requirements. The interface displays an expert's sentence and instructs workers to rewrite it. After a sentence is submitted, workers may author paraphrases for other expert sentences, skip sentences, or quit by clicking "Finish". Finally, we use an open-ended question to collect workers' comments and suggestions about the job. We sent a $0.10 bonus for each sentence after validation.

Based on prior work in corpus collection [Lasecki2013, Burrows2013] and our pilot experiments, we established a set of rules in the main experiment to ensure the quality of the collected sentences. First, to collect paraphrases that are diverse in structure, we limited the maximum number of sentences one worker could create, so as to involve more crowd workers. Second, to obtain high-quality sentences, workers were informed that the results would be reviewed and validated before they received a bonus. Third, the interface recorded and analyzed keystrokes and the time spent crafting each sentence in order to reject invalid inputs; for example, we rejected sentences that were crafted within a few seconds or with too few keystrokes. Fourth, to avoid duplicates, we compared each input with all sentences already in the database.
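
A compact sketch of such screening logic is given below; the concrete duration and keystroke thresholds are placeholders, since the exact values are not stated above.

```python
def accept_paraphrase(text, duration_s, keystrokes, existing,
                      min_seconds=10, min_keystrokes=20):
    """Screen a crowd-submitted paraphrase before it enters the corpus.
    min_seconds and min_keystrokes are illustrative placeholders, not the
    thresholds actually used in the study."""
    if duration_s < min_seconds or keystrokes < min_keystrokes:
        return False                        # likely copy-paste or low effort
    if text.strip().lower() in existing:    # exact duplicate of a stored sentence
        return False
    return True
```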

Our job has received positive feedback from the crowd. We were encouraged by comments like “This was fun, and it made me use my brain” and “This wasn’t easy, but I hope I did it right, according to your rules. Fun and challenging. :) ”.

5.2 Validation via Pairwise Comparison

The result of the first step is a corpus of crowd-generated paraphrases accompanied by the expert-generated sentences. The goal of this step is to filter out paraphrases that are not semantically equivalent to their expert sentence.

Due to the ambiguity of human language, directly asking workers whether a paraphrase is semantically equivalent to the expert sentence would be restrictive and unreliable. For example, given an expert's sentence, "How many faculties are there at Harvard University?", and a worker's, "How many departments are there at Harvard?", it is not clear whether the two sentences have the same meaning because "faculty" could mean both "department" and "professor". Instead of asking whether a paraphrase is semantically equivalent, we ask comparison questions to capture the equivalence strength of one paraphrase relative to others. Specifically, given an expert sentence, a worker is shown a pair of paraphrases written by the crowd and asked to choose which one is closer in meaning to the expert's sentence. Pairwise comparison is effective in capturing semantic relationships [Parikh2011]. However, exhaustive comparison, i.e., comparing each paraphrase against all others with a large number of workers, is unnecessary and cost-prohibitive. Hence, borrowing the idea of relative attributes [Parikh2011], we design a strategy to collect a pairwise comparison dataset for all paraphrases and train scoring functions to estimate their semantic equivalence scores.

5.2.1 Collecting Pairwise Dataset

Each expert sentence is associated with a set of paraphrases, and we collect pairwise comparison data among them. If each of the $n$ paraphrases of an expert sentence is compared against $k$ others, the number of unique comparisons per expert sentence is $nk/2$, since every comparison involves two paraphrases. To alleviate randomness, each comparison is completed by several unique workers, which multiplies the number of comparison tasks per expert sentence accordingly; summed over all expert sentences, this yields the full set of comparison tasks distributed to unique workers.
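
The arithmetic behind the task count can be summarized in a small helper; the example values in the comment are purely hypothetical.

```python
def comparison_tasks(n_paraphrases, k_per_paraphrase, workers_per_comparison):
    """Number of pairwise-comparison tasks for one expert sentence: each of
    the n paraphrases is compared against k others, every comparison involves
    two paraphrases (hence the division by two), and each comparison is
    answered by several unique workers."""
    unique_comparisons = n_paraphrases * k_per_paraphrase // 2
    return unique_comparisons * workers_per_comparison

# Hypothetical values, for illustration only:
# comparison_tasks(18, 4, 3) -> 36 unique comparisons x 3 workers = 108 tasks
```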

We set up a web-based system to assist crowd workers in conducting the comparison jobs. The system displays job descriptions, examples, and requirements. Each job consists of comparison tasks drawn from distinct expert sentences, along with two gold-standard instances for quality control. We manually crafted a set of gold-standard tasks that are unambiguous and have a clear answer, and we rejected any job in which a gold task was answered incorrectly. To measure subtle differences in semantics, we used a two-alternative forced-choice design for each comparison, so workers could not answer "no difference". Workers were paid $0.15 per job.

5.2.2 Estimating Semantic Equivalence Score

After collecting the pairwise comparison data, we employed the relative attributes method [Parikh2011] to estimate the semantic equivalence score of each paraphrase candidate.

Given an expert sentence, we have $n$ training paraphrases represented by feature vectors $\{x_1, \dots, x_n\}$ in $\mathbb{R}^d$. We have collected a set of ordered pairs of paraphrases $O = \{(i, j)\}$ such that $i \succ j$, i.e., paraphrase $i$ is closer in semantics to the expert sentence than paraphrase $j$. Our goal is to learn a linear scoring function

$$r(x_i) = w^{T} x_i, \qquad (1)$$

such that the maximum number of the following constraints is satisfied:

$$w^{T} x_i > w^{T} x_j, \quad \forall (i, j) \in O. \qquad (2)$$

To capture the semantic features of each paraphrase, we employed the Universal Sentence Encoder [Cer2018] and obtained a 512-dimensional feature vector for each paraphrase. We used an SVM-based solver [Parikh2011] to approximate $w$ and obtained the estimated score of each paraphrase using Equation 1.

We set a threshold $\tau$ so that paraphrases with scores lower than $\tau$ are filtered out. Since the scores can be positive or negative, we set $\tau = 0$ to reject all paraphrases with negative scores. Finally, we obtained a set of high-quality paraphrases for the expert sentences.
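
One way to approximate this scoring step is the standard RankSVM-style reduction, in which every ordered pair becomes a classification example on the difference of embedding vectors. The sketch below illustrates this with scikit-learn on precomputed sentence embeddings; it is a simplification of the solver in [Parikh2011], which handles slack and similarity constraints more carefully.

```python
import numpy as np
from sklearn.svm import LinearSVC

def equivalence_scores(embeddings, ordered_pairs):
    """Estimate semantic-equivalence scores w.x from ordered pairs (i, j),
    where paraphrase i was judged closer to the expert sentence than j.
    embeddings: (n, d) array of sentence embeddings (e.g., from USE)."""
    diffs, labels = [], []
    for i, j in ordered_pairs:
        diffs.append(embeddings[i] - embeddings[j]); labels.append(1)
        diffs.append(embeddings[j] - embeddings[i]); labels.append(-1)
    clf = LinearSVC(fit_intercept=False, C=1.0, max_iter=10000)
    clf.fit(np.asarray(diffs), np.asarray(labels))
    w = clf.coef_.ravel()
    return embeddings @ w   # one score per paraphrase

# Keep only paraphrases with non-negative scores (threshold tau = 0):
# scores = equivalence_scores(emb, pairs)
# kept = [p for p, s in zip(paraphrases, scores) if s >= 0]
```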

6 Characteristics of Quda

In this section, we demonstrate the characteristics of Quda in comparison with other popular datasets through 1) statistical analysis of these corpora and 2) benchmarking a range of learning-based approaches. Based on the potential usage of our corpus, we focus on two fundamental NLP tasks, namely text classification and paraphrase generation.

6.1 Text Classification

6.1.1 Statistical Analysis

Quda contains 14,035 sentences in 10 categories. Compared to some well-known datasets designed for benchmarking deep text classification models, i.e., TREC and SST [Socher2013], Quda not only has a comparable scale but also offers a new sub-domain that focuses on query intention classification. These properties make Quda the largest corpus in the domain of visual data analytics to date.

6.1.2 Benchmarking

Assigning a task label to a query is critical in V-NLIs. Therefore, we conduct an experiment to evaluate how text classification models perform on our corpus and to reveal its characteristics under different experimental settings.

Models: We conduct our experiment with four well-known text classification models as baselines. The first is a convolutional neural network, CNN [Kim2014], built on top of pre-trained word vectors for sentence classification. The second model, C-LSTM [Zhou2015], combines the strengths of CNNs and long short-term memory recurrent neural networks (LSTMs). The third, C-CNN, adds a recurrent connection to the convolutional layer [Shin2018]. Lastly, we employ Ad-RNN [Miyato2016], which uses adversarial and virtual adversarial training to improve the RNN model. We use the default parameter settings for all models.
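
For reference, a minimal Kim-style CNN text classifier might look as follows in PyTorch; the hyperparameters shown are common defaults, not necessarily the settings used in the experiments reported here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Minimal CNN sentence classifier in the spirit of Kim (2014): embed
    tokens, apply parallel 1-D convolutions, max-pool over time, and classify
    the concatenated features."""
    def __init__(self, vocab_size, num_classes=10, embed_dim=300,
                 kernel_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed, seq)
        feats = [F.relu(conv(x)) for conv in self.convs]
        pooled = [f.max(dim=2).values for f in feats]  # global max over time
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))
```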

Dataset: We prepared two corpora for this experiment. Different data splits (train and test) may affect the performance of models. We split the dataset by experts; that is, the experts are divided into train and test sets (Quda_expert), and all queries created by the same expert, together with their paraphrases, stay in the same set. We observed that high-frequency words appearing in the queries may correlate strongly with the contents of the data tables, i.e., these words or phrases are values in the tables or synonyms of column names. Such correlations may introduce biases, make the model ignore the real intentions of the queries, and thereby hurt generalization. Inspired by prior work [Zhang2015], we further apply a data augmentation technique on Quda that replaces words or phrases with their synonyms. Specifically, we derive a list of high-frequency words after stop-word removal. We then manually categorize these words into five groups, i.e., Place, Person, Time, Date, and Others. Next, for each group, we compile a set of alternative words from WordNet [Fellbaum2000] and randomly replace high-frequency words with these alternatives. Finally, we split the augmented corpus by experts to obtain Quda_aug_expert. To reduce bias in the data split, we report the average F1 score of 10-fold cross-validation to estimate the performance of the models on these corpora.
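
A rough sketch of such synonym replacement using NLTK's WordNet interface is shown below; the replacement probability, word groups, and curated synonym lists used in the actual augmentation are not reproduced here.

```python
import random
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def synonym_augment(tokens, frequent_words, p=0.3, seed=None):
    """Randomly replace high-frequency content words with WordNet synonyms.
    A simplified stand-in for the grouping into Place/Person/Time/Date/Others
    and the curated alternative-word lists described above."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok.lower() in frequent_words and rng.random() < p:
            lemmas = {l.name().replace('_', ' ')
                      for s in wn.synsets(tok) for l in s.lemmas()}
            lemmas.discard(tok)
            if lemmas:
                out.append(rng.choice(sorted(lemmas)))
                continue
        out.append(tok)
    return out
```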

Figure 2: The performance of four well-known text classification baseline models on Quda_expert and Quda_aug_expert in terms of F1 score.

Results: Figure 2 shows the performance of the four baseline models on the two corpora in terms of F1 score. Compared to the state-of-the-art results on other text classification datasets of similar scale (e.g., TREC [Cer2018]), the numbers in Figure 2 indicate that classifying Quda queries is challenging, leaving room for new methods specifically designed for this task. As shown in Figure 2, the augmented corpus (Quda_aug_expert) improves the scores on average, which suggests that the data augmentation introduced in this paper helps model generalization.

6.2 Paraphrase Generation

Paraphrase generation is important in NLP and can benefit a wide range of downstream applications, such as question answering, semantic parsing, information retrieval, and text summarization [Li2018]. Each expert query in Quda is accompanied by paraphrases, and thus our corpus can serve as a benchmark for paraphrase identification, extraction, generation, etc. We explore the properties of Quda in the context of paraphrase generation.

6.2.1 Statistical Analysis

Table 1 shows the statistics of Quda in comparison with other paraphrasing corpora. The Quora dataset [quora] is a collection of question paraphrases composed of pairs of duplicate questions. Following the experimental protocol of Patro et al. [Patro2018], we sample 50K (Quora-50K) and 100K (Quora-100K) pairs as training sets and use 10K and 30K of the remaining pairs as test sets, respectively. The Twitter dataset [Lan2017], which is widely used for paraphrasing research, consists of a large number of human-annotated sentence pairs collected from Twitter by linking tweets through shared URLs. This dataset is growing continuously and accumulated a large set of paraphrase pairs after a one-year collection (from 2016.10 to 2017.10) [Socher2013]. We randomly sample 100K pairs as a training set (Twitter-100K) and 15K as the test set. We construct paraphrasing datasets from Quda using two settings. First, we allow sentences to repeat by using permutations (Quda-78K): given a group of sentences with the same meaning (one expert sentence plus its paraphrases), we construct all ordered pairs among them, so each sentence appears in multiple paraphrase pairs; this yields 78,939 pairs consisting of 11,102 unique sentences. Second, we avoid sentence repetition by selecting pairs randomly, which results in 5,365 pairs (10,730 unique sentences) in total (Quda-5K). For Quda-5K and Quda-78K, we split training (90%) and test (10%) sets randomly, as shown in Table 1.
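
A sketch of the two pairing strategies is given below, assuming each meaning group is a list holding one expert sentence and its validated paraphrases; the function names and sampling details are illustrative.

```python
import random
from itertools import permutations

def pairs_with_repetition(group):
    """All ordered pairs within one meaning group (an expert sentence plus
    its paraphrases); every sentence appears in many pairs (the Quda-78K
    setting)."""
    return list(permutations(group, 2))

def pairs_without_repetition(group, seed=0):
    """Disjoint pairs sampled from one group so that no sentence is reused
    (the Quda-5K setting)."""
    shuffled = list(group)
    random.Random(seed).shuffle(shuffled)
    return [(shuffled[i], shuffled[i + 1])
            for i in range(0, len(shuffled) - 1, 2)]
```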

Compared to datasets like Quora or Twitter, Quda differs in both vocabulary space and design principles. Quda-5K and Quda-78K are tailored for data analytics, and the queries in Quda reflect how analysts may ask questions about data, whereas utterances in the Quora or Twitter datasets are mainly general-purpose and cover a wide range of casual topics. In summary, although Quda-5K and Quda-78K are not competitive with the Quora or Twitter datasets in terms of the number of pairs and unique sentences, they provide a new data source with new problems that is complementary to existing paraphrasing corpora.

6.2.2 Benchmarking

We first define the task of paraphrase generation, then introduce the setup of the experiment. Finally, we present the benchmarking results to provide deep insights into our corpus in comparison to others.

Task Definition: We follow the description in Patro et al.'s work [Patro2018] to introduce the task of paraphrase generation. Given an original sentence $s_o$ and a reference paraphrase $s_r$ with almost the same meaning, we need to generate another sentence $s_g$ so that the distance between $s_g$ and the reference $s_r$ is minimal. The training data consist of pairs of paraphrases $(s_i, s_j)$, where $s_i$ and $s_j$ are paraphrases of each other.

Dataset # Pairs # Sentences Train Test
Quda-5k 5,365 10,730 5,365 684
Quda-78k 78,939 11,102 78,939 10,302
Quora-50k 50,000 66,621 50,000 10,000
Quora-100k 100,000 112,397 100,000 30,000
Twitter-100k 100,000 31,465 100,000 15,000
Table 1: Statistical comparison between five paraphrasing datasets.

Models: We choose three representative techniques for paraphrase generation. The first model (EDLPS) uses a pairwise-discriminator-based encoder-decoder for paraphrase generation, in which the encoder, decoder, and discriminator are built upon LSTMs [Patro2018]. Second, we choose VAE-SVG, which combines a deep generative model (VAE) with sequence-to-sequence (LSTM) models and can generate multiple paraphrases for a given sentence [Gupta2018]. The third model (DiPS) generates paraphrases that are similar in semantics yet diverse in structure; it is based on a basic Seq2Seq framework and features monotone submodular function maximization [Kumar2019]. We use the same parameter settings on all datasets for each model.

Evaluation Metrics: The evaluation metrics include METEOR [Lavie2007], BLEU [Papineni2002], and TER [Snover2006]. Though these metrics were originally designed for machine translation, prior research has shown that they correlate with human judgments of the quality of generated paraphrases [Wubben2010]. The BLEU score considers modified n-gram precisions and penalizes paraphrases that are too short or too long; we use unigram precision to present the results, and higher BLEU scores indicate better generated paraphrases. METEOR computes a word mapping along with unigram precision and recall between the generated paraphrase and the reference sentence; a higher METEOR score reflects better paraphrasing quality. TER (Translation Edit Rate) measures how much editing a human would have to perform to change a generated paraphrase into the reference; lower TER scores mean better quality.

Results: Table 2 shows the performance of the three models on four datasets under three metrics. Generally, models do not perform well on Quda-5K compared to the other datasets: all models yield lower BLEU and METEOR scores and higher TER for Quda-5K and Quda-78K. Given the same experimental settings, the performance of the models is highly related to the volume of the training sets. An interesting observation is that models perform better on Quora-50K than on Quda-78K, even though Quda-78K contains more paraphrase pairs. That is probably because the number of unique sentences in Quda-78K is about six times smaller than in Quora-50K.

Dataset Model BLEU METEOR TER
Quda-5k EDLPS 16.1 5.2 107.6
VAE-SVG 28.0 11.7 89.0
DiPS 19.1 7.9 92.3
Quda-78k EDLPS 21.1 8.2 98.6
VAE-SVG 26.6 12.5 88.6
DiPS 24.1 9.5 94.8
Quora-50k EDLPS 29.6 14.5 87.9
VAE-SVG 40.8 21.6 74.4
DiPS 44.8 25.1 63.9
Quora-100k EDLPS 37.4 19.9 76.9
VAE-SVG 42.3 23.0 70.2
DiPS 45.9 25.9 63.1
Table 2: The performance of three paraphrase generation models on four datasets measured using BLEU, METEOR, and TER.

7 Applications

We explore the use of Quda in two applications, i.e., a natural language interface and the analysis of design tasks described in academic literature from two domains.

7.1 Natural Language Interface

Our research is motivated by the rising trend of V-NLIs. Within their large design space [Yu2020], one challenge faced by V-NLI designers is choosing a proper visualization to answer an analytic question. The choice of visualization design relies heavily on analytic tasks: given the type of task, a V-NLI can make an effective design choice based on a large body of empirical research [Saket2019]. In this section, we present the design and implementation of a V-NLI prototype, called FreeNLI, that benefits from the construction of Quda. FreeNLI has two key features: (1) it accepts free-form natural language as input, and (2) it forms effective design decisions by recognizing the analytic task in the input query.

7.1.1 The FreeNLI System

Figure 3 shows the system architecture of FreeNLI, which consists of three major components: a language parser that processes NL queries, a pool of design rules that identify effective design decisions, and an interface that interacts with data analysts.

Language Parser. Given an NL query, FreeNLI parses it from two aspects. First, it applies a trained text classification model to categorize the type of analytic task involved in the query. We use the Ad-RNN model trained on Quda_aug_expert, which has the best performance as described in Section 6.1. By default, the model classifies queries into one of the 10 analytic tasks. However, analysts may ask queries beyond this scope; to recognize such exceptions, we categorize a query as "Others" if the maximum estimated confidence is lower than a threshold set empirically in the current prototype.

Second, the language parser identifies data fields for visualization. The current prototype uses simple string matching to determine which columns are involved in the visualization; therefore, users are required to mention column names in the query. Due to the sheer variety of natural language expressions, data analysts may use abbreviations or alternative terms to refer to data fields. For example, a data table on the performance of basketball players may have two columns, i.e., name and score, and a data analyst may ask, "Which player (name) has the best performance (score)?" The mapping from NL to data fields is challenging, and in future research we plan to employ more advanced techniques to relax this restriction.
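
A minimal sketch of this two-part parsing step is shown below, assuming a classifier callable that returns a task label and a confidence; the threshold value and helper names are illustrative, not those of the actual prototype.

```python
def parse_query(query, column_names, classifier, threshold=0.8):
    """Sketch of FreeNLI's parser: task classification with an 'Others'
    fallback, plus naive string matching of column names. `classifier` is
    assumed to return (task_label, confidence); the threshold shown here is
    a placeholder, not the value used in the prototype."""
    task, confidence = classifier(query)
    if confidence < threshold:
        task = "Others"                      # query outside the 10 supported tasks
    fields = [c for c in column_names if c.lower() in query.lower()]
    return task, fields
```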

Figure 3: Overview of the system architecture of FreeNLI.

Design Decisions. Given the data fields and the analytic task, FreeNLI makes effective design decisions based on the findings of an empirical study [Saket2019]. Specifically, we derive design decisions from three aspects, i.e., data transformation, design choice, and design ranking. First, to accommodate analytic tasks such as "Tell me the distribution of movies' budgets", we employ simple data transformation and aggregation techniques, such as binning, counting, sorting, and deriving the maximum, minimum, and average. Next, following the empirical study, FreeNLI supports a limited set of visualization types (scatterplots, bar charts, and line charts), numbers of data fields (one or two), and types of data fields (numerical and nominal). Finally, to help users complete analytic tasks efficiently, we choose and rank visualizations to minimize completion time [Saket2019].
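
To make the rule pool concrete, here is an illustrative and deliberately tiny mapping from analytic task and field types to candidate charts. The entries are examples only, not the actual rule set distilled from [Saket2019], and the names are hypothetical.

```python
# Illustrative rules only; not the prototype's real rule table.
DESIGN_RULES = {
    ("Characterize Distribution", ("numerical",)):        ["bar chart (binned)"],
    ("Correlate", ("numerical", "numerical")):             ["scatterplot", "line chart"],
    ("Find Extremum", ("nominal", "numerical")):           ["bar chart (sorted)"],
    ("Compute Derived Value", ("nominal", "numerical")):   ["bar chart (aggregated)"],
}

def recommend(task, field_types):
    """Return ranked candidate charts for a (task, field-type) combination,
    falling back to a bar chart when no rule matches."""
    return DESIGN_RULES.get((task, tuple(field_types)), ["bar chart"])
```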

Interface. Figure 4(a) shows FreeNLI, which provides a concise interface for the user. To start, a user selects or uploads a dataset, which is then shown as a table so the user can browse the raw data and form hypotheses; simple manipulations, such as sorting and filtering, are supported. The user can then pose queries about the data table. When a query is submitted, a pop-up window with effective visualizations is displayed to answer it, and the user can hover over visual elements to explore details in a tooltip. Since FreeNLI was built for demonstration purposes, we kept the interface simple and provide limited functionality and customization in the current prototype. We plan to refine and augment FreeNLI iteratively in two directions. First, we plan to provide rich configuration for visualization and data transformation, such as adjusting bin sizes, pivoting, and sampling. Second, we will improve the user experience with more features, such as exploratory analysis and history.

7.1.2 Usage Scenario

We describe a usage scenario to illustrate the usefulness of FreeNLI. Suppose Jane is a fan of Hollywood movies and is interested in analyzing a movie dataset comprising nine fields, including budget, title, genre, votes, etc. The dataset contains a large number of records, and she aims to gain a comprehensive sense of the data and identify trends, distributions, or other insights worth sharing. After loading the movie dataset into FreeNLI, a data table is shown that allows Jane to explore the dataset by sorting and filtering. "I do not want to use the table to explore such a large dataset," Jane murmurs. Instead, she decides to ask questions to visualize the dataset. For an overview of worldwide gross profit, Jane types the query "How about the worldwide gross profit of different genres?" A bar chart pops up to display the distribution of Worldwide Gross profit across Genre (Figure 4(b)). She observes that adventure movies have the highest gross profit compared to other types of movies. Jane wonders whether adventure movies are released more often than others, so she asks "Show me the distribution of genre." FreeNLI shows a pie chart to address this query (Figure 4(d)). Surprisingly, Adventure accounts for only a small share of all movies; Drama, Comedy, and Action have larger market shares. Jane infers that adventure movies usually require high budgets, so only filming teams with sufficient resources can afford them. To verify the hypothesis, she asks "What is the correlation between Genre and Budget?" A bar chart is displayed showing that Adventure requires the largest budget (71 million dollars on average) among all categories (Figure 4(c)). Finally, she concludes, "Though adventure movies have high worldwide gross profit, only filming teams with a high budget can afford them."

Figure 4: (A) The interface of FreeNLI. (B)-(D) Two bar charts and one pie chart are displayed to answer different queries.

7.1.3 Discussions and Future Directions

A large body of empirical research has evaluated the effectiveness of various design decisions given analytic tasks and datasets. However, data analysts without visualization background may not benefit from these studies. FreeNLI recognizes analytic tasks for free-form natural language queries and helps data analysts gain insights into data with effective visual representations. However, despite its promise, the current prototype is limited in two aspects. First, input queries should articulate which data fields are involved. Although this step is common for many existing V-NLIs, it is not natural for free-form NL queries. Hence, we plan to remove this restriction by employing models in text-to-SQL to map NL queries to data fields. Second, the current prototype is not scalable because of the need to construct a large number of rule-based design strategies. The second issue can be resolved by building end-to-end models that can learn expert behaviors in authoring visualizations, which is a promising research direction.

7.2 Comparison of Two Application Fields

To demonstrate the usefulness of our corpus beyond V-NLIs, we analyze and compare the design requirements of two application domains, i.e., visual analytics of sports data and urban computing. To answer questions such as "How are tasks distributed across the two domains?" and "What are the dominant tasks in each domain?", we first collect publications from leading venues. Then, we manually gather task utterances from the publications. Next, we employ the Ad-RNN model trained on Quda_aug_expert to classify task utterances into the 10 categories, with an additional "Others" category for utterances that do not belong to any of the tasks. Finally, we report the analytic results comparing the two domains.

7.2.1 Data Collection

To ensure that the papers are of high quality, we focus on those published in top venues in visual analytics, transportation, and sports analytics, such as IEEE Transactions on Visualization and Computer Graphics, IEEE Transactions on Intelligent Transportation Systems, and the Proceedings of the CHI conference. In addition, to obtain design requirements, we selectively choose papers that 1) target domain problems and 2) solve those problems using visual analytics. In total, we collected a set of papers split between the urban and sports domains. The full paper list can be found in the supplementary material.

The text classification model trained on Quda may perform better when the data for prediction share the same vocabulary and syntax space as the training samples. Our corpus features queries that are objectives, concrete in abstraction level, and low in composition level; therefore, we focus on task utterances that share these characteristics. To gather such text, we went through sections related to design tasks, requirement analysis, system design, and case studies, and collected all candidate utterances from the publications. We then manually filtered out utterances that do not meet our requirements, obtaining a set of utterances split between the sports and urban domains.

7.2.2 Results

The task distribution of each domain is shown in Figure 5. The y-axis presents the task categories, while the x-axis shows the proportion of tasks in each domain. We observe that some tasks, such as Cluster and Correlate, are dominant in both domains. Since the number of entities, e.g., candidate solutions or tactics, is large in both domains, clustering is usually applied to group similar entities and provide an overview of the entire dataset. For example, in a billboard placement project in the urban domain, analysts ask, "How many groups of candidate solutions exist?" Similarly, analysts in the sports domain often seek to understand "What are the frequent patterns of tactics?" In the sports domain, Correlate ranks first among all tasks. By scrutinizing the papers, we find that analysts often need to understand the potential effect of a particular adjustment in tactics; they may ask questions such as "How does a tactical adjustment influence the strokes and tactics?" and "What is the effect of a formation change?" An interesting finding is that publications in the urban domain contain a larger portion of Characterize Distribution tasks than those in the sports domain. Since the urban domain usually involves the analysis of spatio-temporal data, analysts may ask questions such as "How are the target trajectories distributed across the city?" and "What is the difference between weekday and weekend?"

Figure 5: Comparison of task analysis between two domains, i.e., sports and urban computing. The x-axis presents the proportion of analytic tasks in each domain.

8 Discussions

Quda, the largest textual corpus in visual data analytics to date, is the first attempt to construct a large-scale corpus for V-NLIs. Though we have shown the usefulness of Quda in various application scenarios, some limitations exist in the current version of the corpus. First, according to the six dimensions discussed in Section 2, our corpus covers a limited scope of queries; for example, Quda does not include contextual queries, which frequently appear in V-NLIs and systems with multimodal interaction [Srinivasan2018]. Corpora are fundamental to machine learning research, and those catering to visual data analytics facilitate the development of machine/deep learning approaches in the visualization community. We plan to construct more textual corpora in the context of V-NLIs in future research.

Second, data tables have a strong influence on queries: tables from varied domains diversify queries in semantics and syntax. Our corpus is constructed with data tables from a limited set of domains. Though enumerating all data tables is infeasible, we should collect queries based on the most representative tables and cover as many domains as possible. In future research, we will investigate the datasets used in the domain of visual data analytics.

Third, paraphrase validation ensures the quality of crowd-authored queries. However, our approach requires extensive crowd effort for pairwise comparisons, which would not scale as the number of paraphrases increases. As Quda is a large-scale corpus with human validation, we plan to train a machine learning model on it to identify low-quality paraphrases, which would alleviate this scalability issue.

9 Conclusion and Future Work

In this work, we present a large-scale corpus, named Quda, to help V-NLIs understand natural language queries by training and benchmarking machine/deep learning techniques. Beyond V-NLIs, our corpus is fundamental to various applications, including visualization recommendation, automatic visualization generation, etc. Quda contains 14,035 queries labeled with 10 low-level analytic tasks. To construct Quda, we first collect expert queries by recruiting data analysts. Then, we employ crowd workers to author paraphrases for each expert query. Next, to ensure the quality of the paraphrases, we again use the crowd to collect a pairwise comparison dataset and compute semantic equivalence scores using relative attributes [Parikh2011]; paraphrases with low scores are filtered out. We present four experiments to illustrate the usefulness and significance of Quda in different usage scenarios.

In the future, we plan to explore more research opportunities using Quda. First, multimodal interaction can improve the user experience and usability of visualization systems [Srinivasan2018]. We have built FreeNLI as a first step toward exploring natural language as an input modality, and we plan to continuously expand the scope of Quda and augment the prototype to support more usage scenarios, such as contextual queries. Second, we plan to improve the corpus construction process to balance data quality and cost. The cost of expert query collection and paraphrase generation is reasonable; however, paraphrase validation is expensive, and its cost increases quadratically with the number of paraphrases. With the help of Quda, we plan to train a classification model and investigate scalable approaches to quality control.

Acknowledgements.
The authors wish to thank A, B, and C. This work was supported in part by a grant from XYZ (# 12345-67890).

References