In recent years, both academia and industry have shown a growing interest in designing natural language interfaces (NLIs) to interactively access, explore, and analyze data in databases. Existing NLIs (cox2001multi; sun2010articulate; gao2015datatone; dhamdhere2017analyza; yu2019flowsense; setlur2016eviza; hoque2017applying; fast2018iris) allow users to formulate data-related questions in natural language (NL). They usually combine natural language processing (NLP) methods (e.g., semantic parsing) with human-computer interaction (HCI) techniques, and are able to translate ambiguous NL queries into formal database query languages (e.g., SQL) to facilitate data retrieval. The systems can then generate proper visualizations of the retrieved data, helping users gain quick insights. By using NLIs, users, especially novice data analysts and other users without a strong background in computer science, can conveniently explore and analyze data without specifying their exploration requirements through complex database query languages or tedious interface interactions (e.g., filtering, drag and drop) in data analytics tools.
However, when using these NLIs for data analysis, users need to empirically specify every data query to derive a series of meaningful data insights. Such an analysis proceeds iteratively through trial and error, so conducting effective and systematic data analysis remains time-consuming. In practice, real-world datasets are often large and complex, resulting in a large exploration space. Since novice data analysts do not necessarily have sufficient knowledge about data and application domains, it is challenging for them to efficiently formulate data queries without overlooking important data insights during exploration.
To address the aforementioned challenge, we develop an intelligent NLI with a next-step data query recommendation module. The system can automatically generate a series of appropriate and insightful natural language data queries and help users, especially inexperienced data analysts, decide what to ask next for subsequent data exploration steps.
To make effective query recommendations, it is crucial to capture users’ analytical interests behind a sequence of data queries and produce relevant and context-aware exploratory operations (sordoni2015hierarchical; milo2020automating). In this paper, we target data analysis for industry applications where the dataset often comes from multiple tables in relational databases and is accessed and manipulated via SQL. We focus on recommending three typical data operations in SQL: attribute selection, aggregation, and grouping. We propose a data-driven log-based recommendation model that suggests data queries on a dataset to a user by exploring the semantically relevant query sequences issued by other users for databases under similar topics. We assume that when users explore similar datasets, they are likely to share some common interests in their data analysis (chatzopoulou2009query; aligon2015collaborative; milo2018next). Moreover, our method fully exploits the contextual semantic information expressed in prior queries of the current user and dynamically adapts the recommendations to the user’s analytical focuses. We then design and develop an NLI, QRec-NLI, that integrates the recommendation model to help data analysts conveniently and efficiently analyze data through NL with step-wise exploration guidance. We formulate design requirements through a literature review and expert interviews with a group of data analysts and engineers from an international technology company. Their jobs mainly involve analyzing data from different domains and creating dashboards to communicate data insights. They confirmed that, in their daily work, they commonly need to analyze data from a domain they are not familiar with, and need assistance in exploring data comprehensively and discovering data insights efficiently. We worked closely with them in the past seven months and refined our system iteratively.
Eventually, our system, QRec-NLI, provides step-by-step NL-based data exploration recommendations. Users can specify data of their interest by typing in natural language and the system will generate proper visualizations to reveal data insights according to data types. Moreover, users are enabled to review history queries and the corresponding results, and organize them into a dashboard to communicate data insights.
The design and implementation of QRec-NLI differentiates itself from existing visualization recommendation systems without NL interaction support (lin2020dziban; wongsuphasawat2015voyager; wongsuphasawat2017voyager), as well as general NLIs (setlur2016eviza; gao2015datatone; hoque2017applying; narechania2020nl4dv) that focus on conveying users’ analytical interests to systems using NL and do not provide analytical guidance on what to ask next. In summary, the major contributions of this paper are as follows:
NLI system: we design and implement QRec-NLI, an NLI that enables users to perform interactive data analysis of different domains of user interest using NL and provides users with sequential exploration guidance for deciding what to ask next.
Query recommendation: we propose a data-driven next-step query recommendation model that produces semantically relevant and context-aware data queries for interested domains based on current users’ prior queries and other users’ queries for similar analysis domains.
Evaluation study: we conducted a user study to demonstrate the effectiveness of QRec-NLI in supporting sequential data exploration compared with a baseline without the recommendation module. Also, we discussed the lessons learned from the design and evaluation of QRec-NLI for future research in interactive data analysis.
2. Related Work
Our work builds upon prior research on natural language interfaces for database queries, natural language interfaces for data visualization, and next-step query recommendations for data analysis.
2.1. Natural Language Interfaces for Database Queries
Many NLIs have been developed to enable users to gain easy access to relational databases through NL, which can be classified into three types: keyword-based systems, parsing-based systems, and neural-based systems.
Keyword-based systems (kaufmann2007nlp; simitsis2008precis; zenz2009keywords; shekarpour2015sina) first identify keywords or other domain-specific or domain-independent language patterns in input questions. Then, they map those patterns to entities in the database schema. Parsing-based systems (saha2016athena; kaufmann2006querix; li2014constructing; li2014nalir) further leverage natural language processing techniques (e.g., part-of-speech tagging, dependency parsing, and entity recognition) to derive semantic and syntactic information of input queries and convert them into structural query forms (e.g., SQL). However, both keyword-based and parsing-based systems have limited capability in understanding diverse natural language queries and cannot handle complex reasoning tasks (e.g., question answering). More recently, the emergence of many NL-to-SQL benchmark datasets (e.g., ATIS (price1990evaluation), Scholar (iyer2017learning), Academic (li2014constructing), WikiSQL (zhong2017seq2sql), Spider (yu2018spider)) has given rise to many neural-network-based methods (rubin2020smbop; zhong2017seq2sql; guo2019towards; wang2019rat; bogin2019representing; xu2017sqlnet; yu2018syntaxsqlnet), which have better natural language understanding abilities and can achieve state-of-the-art (SOTA) performance on NL-to-SQL tasks. They typically use sequence-to-sequence deep learning models (e.g., RNN, transformer) to automatically learn the translation from NL queries to SQL queries. In our work, we choose SmBop (rubin2020smbop) as our NL-to-SQL engine. It adopts a transformer-based model architecture with semi-autoregressive bottom-up parsing, which achieves good performance on the Spider dataset at a faster speed. The Spider dataset contains complex and cross-domain NL questions and SQL queries in a zero-shot setting, where the database schemas and queries of the testing sets are new and unseen in the training sets. SmBop can therefore handle different users’ queries for datasets from different domains.
2.2. Natural Language Interfaces for Data Visualization
Prior research has explored how natural language interfaces can be employed to facilitate the interaction with data visualizations and help users comprehend data query and exploration results.
Cox et al. (cox2001multi) proposed a multimodal NLI that combines direct manipulation with natural language interaction to support data exploration. However, it supports only a small set of visualizations and NL queries without the ability to infer users’ analytical needs. Articulate (sun2010articulate) is a more intelligent NLI that first maps NL queries to some analytical tasks and then decides proper visual encodings based on the tasks and data characteristics. Flowsense (yu2019flowsense) leverages semantic parsing techniques to enable NL-based interactions in a dataflow system. It allows users to create and connect data flow components through NL. Since NL can be ambiguous, DataTone (gao2015datatone) detects and presents data ambiguities in users’ NL queries. Users can use the ambiguity widgets in DataTone to interactively resolve ambiguity and derive desired visualizations. Recently, Arpit et al. (narechania2020nl4dv) built a toolkit for creating NLIs to facilitate interaction with data visualizations. It employs a set of NLP techniques (e.g., dependency parsing) to comprehensively analyze NL data queries and further interprets explicit, ambiguous, or implicit references to data attributes, analytical tasks, and visualizations.
Conversation is also an important feature in NLI design for interactive data analysis. Analyza (dhamdhere2017analyza) focuses on question-answer features to allow layman users to interact with databases to finish their data exploration tasks. Iris (fast2018iris) is a conversational agent that can handle a sequence of user commands for complex exploratory tasks. Eviza (setlur2016eviza) and Evizeon (hoque2017applying) further improve the interpretation of follow-up questions in conversations, as well as interactions with visualizations. In addition, some commercial products, such as Tableau, Microsoft Power BI, and IBM Watson, integrate natural language interaction features to help users apply NL to create dashboards and further share data insights.
However, all these NLIs focus only on interpreting the NL queries that users have already issued. Users need to specify their data queries for every exploration step empirically, which is time-consuming and inefficient. These NLIs do not predict and recommend to users what to analyze in the next step. A recent work, Snowy (srinivasan2021snowy), suggests next-step NL queries for visual data analysis by considering underexplored interesting data subsets. However, it only accepts single-table data in CSV format, so it cannot be applied to more complex data analysis in industry, where data is often stored in multiple tables in relational databases and accessed by SQL. Moreover, the data interestingness used by Snowy is pre-defined using a fixed set of statistical data properties, which cannot adapt to analytical interests for different application domains. In our work, we target data analysis across multiple tables from different domains. We adopt a data-driven approach and consider both the semantics of application domains and prior relevant user queries for next-step exploration (e.g., data selection, transformation).
2.3. Next-step Query Recommendations for Data Analysis
To further reduce the manual effort for data analysis and insight discovery, many query recommendation techniques have been proposed to assist users in deciding next-step exploration actions. They can be roughly categorized into two groups: data-driven systems and log-based systems (milo2020automating).
Data-driven systems evaluate the interestingness of data insights (e.g., data subsets) generated by different exploration actions. The interestingness can be defined according to objective measures (e.g., information gain) or subjective criteria (e.g., unexpected values that users have not explored) (geng2006interestingness). The typical recommendations include grouping (e.g., roll-up (sathe2001intelligent), drill-down (joglekar2017interactive)), attribute-value pairs (drosou2013ymaldb), data charts (vartak2015seedb), and data cubes (sarawagi1998discovery; sarawagi2000user). However, data-driven systems cannot be well adapted to the various preferences of different users. In contrast, log-based systems (milo2018next; aligon2015collaborative; eirinaki2013querie; chatzopoulou2009query) recommend more personalized next-step actions based on prior queries of the current user or other users. The log-based systems generally assume that if two users share similar query requests, they may be interested in similar aspects of data. Thus, one user’s query can be utilized to make suggestions for the other. Log-based approaches involve two major steps. First, given a user’s query contexts, the approaches retrieve the most similar query sequences from query logs generated by the same user or other users. Then, they analyze the retrieved sequences to synthesize the final recommendations of next-step exploration for the current user. However, many log-based approaches measure the similarity based on prefixes of attributes without considering the semantics of attributes.
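The two steps of a generic log-based approach can be illustrated with a minimal Python sketch. It is not the model proposed in this paper: the representation of a query as a set of attribute names, the Jaccard similarity, and the names `jaccard`, `session_attributes`, and `recommend_next` are all assumptions made for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def session_attributes(session):
    """Union of the attribute names referenced by every query in a session."""
    attrs = set()
    for query in session:
        attrs |= set(query)
    return attrs


def recommend_next(current_session, logged_sessions, k=2):
    """Hypothetical two-step log-based recommender.

    Step 1: retrieve the k logged sessions whose prefix is most similar
            to the current session.
    Step 2: let each retrieved session vote for the query that followed
            that prefix, weighted by its similarity score.
    Each query is a frozenset of attribute names (an assumed encoding).
    """
    n = len(current_session)
    current_attrs = session_attributes(current_session)
    scored = []
    for session in logged_sessions:
        if len(session) <= n:
            continue  # no follow-up query to learn from
        score = jaccard(current_attrs, session_attributes(session[:n]))
        scored.append((score, session))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    votes = Counter()
    for score, session in scored[:k]:
        votes[session[n]] += score
    return [query for query, _ in votes.most_common()]


logs = [
    [frozenset({"price"}), frozenset({"price", "category"})],
    [frozenset({"price"}), frozenset({"price", "brand"})],
    [frozenset({"age"}), frozenset({"age", "city"})],
]
suggestions = recommend_next([frozenset({"price"})], logs)
```

Because the similarity here compares attribute names literally, a logged session about `cost` would never match a current session about `price`, which is the kind of limitation noted above for approaches that ignore attribute semantics.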
In our paper, we propose a log-based approach that explicitly considers the target analysis domain (e.g., “customer orders”), semantic meanings of user queries, as well as conceptual relationships among a sequence of queries. It generates exploration recommendations that can adapt to users’ domains of interest and query contexts. Furthermore, we incorporate our method into an NLI to enable interactive and user-friendly data exploration.
2.4. Visualization Recommendation for Data Analysis
Visualization recommendation techniques focus on automatically generating proper and desired visualizations for data analysts to explore data and discover insights (zeng2021evaluation).
Many previous studies adopt rule-based (narechania2020nl4dv; mackinlay1986automating; mackinlay2007show; wongsuphasawat2016towards) and learning-based (moritz2018formalizing; luo2018deepeye; dibia2019data2vis; hu2019vizml; 10.1145/3447548.3467224; li2021kg4vis) approaches to suggest visual encodings of specified data based on data attributes, tasks, and visual perception theory.
Moreover, other work (vartak2015seedb; key2012vizdeck; demiralp2017foresight; wongsuphasawat2015voyager; wongsuphasawat2017voyager; lin2020dziban; Raghunandan2021) builds interactive systems that utilize visualization recommendations to facilitate data exploration.
Some systems (vartak2015seedb; key2012vizdeck; demiralp2017foresight) present data of interest in a series of visualizations based on pre-defined statistical properties (e.g., deviation, outliers, correlation). However, the recommendations do not explicitly consider different users’ preferences.
Voyager (wongsuphasawat2015voyager; wongsuphasawat2017voyager) allows users to specify data or visualizations of interest. Then, the system presents a gallery of recommended visualizations to enable faceted and breadth-oriented exploration of data attributes and visual design choices. Dziban (lin2020dziban) further considers the context of data analysis for recommendations. It builds on the Draco (moritz2018formalizing) knowledge base, and employs chart similarity measures introduced in GraphScape (kim2017graphscape) to recommend charts that are perceptually similar to a specified “anchored” chart. However, Dziban cannot suggest visualizations for new data attributes.
In this paper, we also develop an interactive system for visual data analysis based on users’ interests and data patterns. Compared to prior visualization recommendation techniques, our work focuses on the usage and recommendation of a series of NL queries to support more effective visual analysis. Users can specify their analytical interests via NL, and the system can respond to their queries with intuitive visualizations and further guide the next-step exploration through NL query recommendations.
3. Design Requirements
We aim to develop an NLI that can recommend step-by-step exploration actions to data analysts to reduce the manual effort and expertise requirements for data analysis and facilitate interactive insight discovery. To identify design requirements of QRec-NLI, we first reviewed design requirements and implementations of previous NLIs for data analysis (cox2001multi; sun2010articulate; gao2015datatone; dhamdhere2017analyza; yu2019flowsense; setlur2016eviza; hoque2017applying; fast2018iris; narechania2020nl4dv; li2014constructing; li2014nalir; srinivasan2021snowy). Further, we worked closely with our industry collaborators for about five months to collect their feedback on the design requirements for building an NLI to facilitate their data analysis. Specifically, our collaborators comprise six data analysts (D1-D6) and a data visualization scientist (E1, who is also a co-author of this paper). Both the data analysts and the visualization research scientist are from an international technology company.
Our target users, data analysts, usually rely on some business intelligence (BI) tools (e.g., Tableau, Microsoft Excel) to analyze data from different business domains (e.g., operations), extract data facts manually, and communicate data insights via dashboards. However, they do not necessarily have expertise in every domain and may overlook important analysis actions and data aspects during their data explorations. Thus, they confirmed that it would be very helpful if some intuitive hints could be provided to guide their subsequent data exploration. Moreover, they mostly use direct manipulations (e.g., drag and drop) to convert data into dashboards, which they find tedious and which requires considerable effort in translating data-related problems into interface interactions. Our users perceived natural language as an intuitive and convenient way to formulate data questions that require multiple steps (e.g., data transformation, visualization). During the design process, we carried out weekly meetings with our industry collaborators, iteratively updated design requirements, and implemented system prototypes according to their feedback. Finally, we compiled a list of design requirements, which can be summarized as follows:
R1. Provide easy access to databases via natural language queries. Natural language interaction provides an intuitive and user-friendly way for users to interact with databases and conduct the flow of analysis (fast2018iris; dhamdhere2017analyza; setlur2016eviza; hoque2017applying). Our target users also desire to use NL to quickly formulate their data needs. In addition, to promote data discovery, NLIs need to offer hints on what data exists in databases and what queries systems support (setlur2016eviza; yu2019flowsense). D3 commented that the system should support autocompletion of data attributes when typing a question, which helps them compile their analytical questions.
R2. Present proper visualizations for the retrieved data from databases. Data visualization is an important and effective approach for analyzing big data and revealing data patterns (e.g., trends, outliers) (cox2001multi; sun2010articulate; gao2015datatone; narechania2020nl4dv). Our users also need to generate visualizations to share data insights within the organization. However, most of them (except D1 and D5) do not have expertise in data visualization. They expect that the system can automatically map the attributes and values of retrieved data to proper forms of visualization.
R3. Recommend next-step exploration actions according to users’ analysis contexts. In practice, data analysts need to explore data from different business domains and empirically extract data insights. However, D1 and D3 pointed out that they often need to spend lots of time finding interesting domain-specific data facts when the dataset is from a domain that they are not very familiar with. Moreover, the large data exploration space and high data complexity in databases make data exploration even more challenging (milo2018next; aligon2015collaborative). To reduce manual effort in insight discovery, D3 suggested that the system should present meaningful exploration actions (e.g., which attributes are relevant for the investigated domains). In addition, data analysis is a subjective and iterative process that involves multiple exploration steps. Analysts may have diverging analysis flows, i.e., their analytical interests can also change during exploration. Thus, the system is expected to offer context-aware recommendations that are dynamically adapted to users’ analytical focuses (sordoni2015hierarchical; milo2020automating; srinivasan2021snowy).
R4. Explain the relevance of system responses to users’ queries. We find that not all of the target users have expertise in database query languages (e.g., SQL) and visualization. Thus, the system should explain analysis operations powered by SQL (e.g., attribute selection) and generate visualizations in an understandable manner. D4 said that it would be better if the system could use NL to present suggestions on exploration actions (i.e., SQL-related operations). D3 added that the system needs to demonstrate the mappings between retrieved data and generated visualizations. Besides, D3 stated that the system should link input NL queries with the generated SQL, allowing him to verify if it retrieves the correct data from databases.
R5. Support an easy exploration of query history and data visualizations. Data analysis is a multi-step iterative process. Many of our target users mentioned that they often need to refer back to previous queries, review what insights they derive, and adjust the future exploration path. Afterward, the analysts need to select important data facts and the corresponding visualizations from prior queries and create a complete data story using a dashboard. D1 advised that the system should save and summarize user query sequences and restore prior queries on demand. And D2 recommended that the system needs to further help them organize the queries and visualizations of interest into a dashboard to report data insights.
Motivated by the derived requirements, we design and implement QRec-NLI, an NLI (Figure 1) that can recommend next-step exploration actions to facilitate the interactive and iterative data analysis. In this section, we first give an overview of the system framework. Then, we describe the recommendation model that generates semantically relevant and context-aware analysis actions in detail. Finally, we illustrate the visual components and interaction designs of the user interface.
4.1. System Framework
Figure 2 summarizes the system workflow. After the user loads a database of interest and inputs an NL query through the User Interface, the Query Analyzer translates the input into a SQL query using SmBop (rubin2020smbop), a SOTA NL2SQL model, and retrieves the data from databases. Then, the Visualization Generator automatically generates visualizations to show the data based on the data’s properties. Finally, the Recommendation Engine generates a set of next-step exploration suggestions in the form of SQL queries based on the prior user queries in the Query Log Repository, which covers the basic functionalities, including attribute selection, aggregation, and grouping. The suggested queries are then translated into NL and provided in the User Interface.
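The text does not spell out how the Visualization Generator maps data properties to charts. The following Python sketch shows one plausible rule-based mapping under stated assumptions: the rules, the chart names, and the functions `column_kind` and `choose_chart` are hypothetical and are not taken from QRec-NLI.

```python
from numbers import Number


def column_kind(values):
    """Classify a column as 'quantitative' or 'categorical' from its values."""
    non_null = [v for v in values if v is not None]
    is_numeric = non_null and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in non_null
    )
    return "quantitative" if is_numeric else "categorical"


def choose_chart(columns):
    """Pick a chart type for a query result (assumed rules, for illustration).

    columns: dict mapping column name -> list of values in that column.
    """
    kinds = sorted(column_kind(values) for values in columns.values())
    if kinds == ["quantitative"]:
        return "histogram"
    if kinds == ["categorical", "quantitative"]:
        return "bar"
    if kinds == ["quantitative", "quantitative"]:
        return "scatter"
    return "table"  # fall back to a plain table for anything else


# A grouped aggregate (one category, one measure) maps to a bar chart.
chart = choose_chart({"city": ["Paris", "Tokyo"], "total": [50.0, 5.0]})
```

A real generator would likely also consider cardinality, temporal types, and the SQL operation that produced the result; the fallback to a table keeps this sketch safe for result shapes it does not cover.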
4.2. Next-step Exploration Recommendations
We propose a log-based model to generate semantically relevant and context-aware next-step exploration actions based on all queries made by a current user and reference queries in query databases. When introducing our method, we start by introducing the problem settings and the data form. Then, we describe the four major steps for recommending what to ask next, including (1) reference query preparation, (2) initial exploration action recommendation, (3) context-aware exploration action recommendation, and (4) query recommendation translation.
4.2.1. Problem Setting & Data
We define the problem of next-step query recommendation as follows. Given the database from a target domain (e.g., customer service), the problem is to generate a list of query suggestions in natural language according to the current user’s prior queries and historical queries from other users, which serve as references. A database is a set of tables containing information about different entities (e.g., product and customer), linked through primary-key foreign-key pairs that describe their relationships.
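To make the data form concrete, the sketch below builds a toy two-table database linked by a primary-key foreign-key pair and runs the three SQL operations this paper targets (attribute selection, aggregation, and grouping). It uses Python's standard `sqlite3` module; the `customer` and `orders` tables and their contents are invented for illustration and do not come from the paper or from Spider.

```python
import sqlite3

# In-memory toy database: orders.customer_id is a foreign key that
# references the primary key of customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    city        TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    amount      REAL
);
INSERT INTO customer VALUES (1, 'Paris'), (2, 'Tokyo');
INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 5.0);
""")

# Attribute selection: pick one column of one table.
selected = conn.execute(
    "SELECT city FROM customer ORDER BY customer_id"
).fetchall()

# Aggregation: reduce a column to a single value.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Grouping: aggregate per group, joining the tables on the PK-FK pair.
per_city = conn.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders AS o
    JOIN customer AS c ON o.customer_id = c.customer_id
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()

conn.close()
```

In this toy setting, a recommendation such as "show the total order amount for each city" would correspond to the last query.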
In our work, the historical references are from an external large and cross-domain query dataset, Spider (yu2018spider), which contains 10,000 user SQL queries, covering 138 different domains, such as academy, government, and commerce. We group the reference queries according to their focal domains, where each group consists of a term describing the focal domain, a set of database schemas, and the users’ SQL queries on those schemas. These queries indicate the common analysis focuses of other users, serving as references for recommending queries to new users (chatzopoulou2009query; aligon2015collaborative; milo2018next).
In summary, our proposed model takes as input the database schema and the analysis domain, the current user’s prior queries, and the reference query dataset. It outputs next-step NL query candidates.
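The grouping of reference queries by focal domain described in this subsection could be held in a structure like the following Python sketch. The class `ReferenceGroup`, its field names, and the function `group_by_domain` are hypothetical; the paper's own notation and implementation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceGroup:
    """Reference queries that share one focal domain (assumed layout)."""
    domain: str                                   # term describing the focal domain
    schemas: list = field(default_factory=list)   # database schemas in the domain
    queries: list = field(default_factory=list)   # users' SQL queries on them


def group_by_domain(records):
    """Group (domain, schema, sql) triples into one ReferenceGroup per domain."""
    groups = {}
    for domain, schema, sql in records:
        group = groups.setdefault(domain, ReferenceGroup(domain))
        if schema not in group.schemas:
            group.schemas.append(schema)
        group.queries.append(sql)
    return groups


# Invented records for illustration only.
records = [
    ("commerce", "shop_db", "SELECT name FROM product"),
    ("commerce", "shop_db", "SELECT COUNT(*) FROM orders"),
    ("academy", "school_db", "SELECT title FROM course"),
]
groups = group_by_domain(records)
```

Keeping the domain term alongside its schemas and queries makes it straightforward to look up reference queries once a target domain has been chosen.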
4.2.2. Reference Query Preparation
After the target domain is decided, the recommendation model first scans the databases and selects the historical queries from semantically relevant domains as reference queries. For example, if a user chooses to explore the data in the