Repository that accompanies "An Evaluation Dataset for Intent Classificationand Out-of-Scope Prediction" (EMNLP 2019)
Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope---i.e., queries that do not fall into any of the system's supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. We evaluate a range of benchmark classifiers on our dataset along with several different out-of-scope identification schemes. We find that while the classifiers perform well on in-scope intent classification, they struggle to identify out-of-scope queries. Our dataset and evaluation fill an important gap in the field, offering a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.READ FULL TEXT VIEW PDF
Task oriented dialog systems typically first parse user utterances to
Predicting user behaviour on a website is a difficult task, which requir...
Building a machine learning driven spoken dialog system for goal-oriente...
Task oriented language understanding in dialog systems is often modeled ...
Text classification tends to struggle when data is deficient or when it ...
In this report, we provide a comparative analysis of different technique...
Modern dialog managers face the challenge of having to fulfill human-lev...
Repository that accompanies "An Evaluation Dataset for Intent Classificationand Out-of-Scope Prediction" (EMNLP 2019)
Task-oriented dialog systems have become ubiquitous, providing a means for billions of people to interact with computers using natural language. Moreover, the recent influx of platforms and tools such as Google’s DialogFlow or Amazon’s Lex for building and deploying such systems makes them even more accessible to various industries and demographics across the globe.
Tools for developing such systems start by guiding developers to collect training data for intent classification: the task of identifying which of a fixed set of actions the user wishes to take based on their query. Relatively few public datasets exist for evaluating performance on this task, and those that do exist typically cover only a very small number of intents (e.g. Coucke et al. (2018), which has 7 intents). Furthermore, such resources do not facilitate analysis of out-of-scope queries: queries that users may reasonably make, but fall outside of the scope of the system-supported intents.
Figure 1 shows example query-response exchanges between a user and a task-driven dialog system for personal finance. In the first user-system exchange, the system correctly identifies the user’s intent as an in-scope balance query. In the second and third exchanges, the user queries with out-of-scope inputs. In the second exchange, the system incorrectly identifies the query as in-scope and yields an unrelated response. In the third exchange, the system correctly classifies the user’s query as out-of-scope, and yields a fallback response.
Out-of-scope queries are inevitable for a task-oriented dialog system, as most users will not be fully cognizant of the system’s capabilities, which are limited by the fixed number of intent classes. Correctly identifying out-of-scope cases is thus crucial in deployed systems—both to avoid performing the wrong action and also to identify potential future directions for development. However, this problem has seen little attention in analyses and evaluations of intent classification systems.
|banking||transfer||move 100 dollars from my savings to my checking|
|work||pto request||let me know how to make a vacation request|
|meta||change language||switch the language setting over to german|
|auto & commute||distance||tell the miles it will take to get to las vegas from san diego|
|travel||travel suggestion||what sites are there to see when in evans|
|home||todo list update||nuke all items on my todo list|
|utility||text||send a text to mom saying i’m on my way|
|kitchen & dining||food expiration||is rice ok after 3 days in the refrigerator|
|small talk||tell joke||can you tell me a joke about politicians|
|credit cards||rewards balance||how high are the rewards on my discover card|
|out-of-scope||out-of-scope||how are my sports teams doing|
|out-of-scope||out-of-scope||create a contact labeled mom|
|out-of-scope||out-of-scope||what’s the extended zipcode for my address|
This paper fills this gap by analyzing intent classification performance with a focus on out-of-scope handling. To do so, we constructed a new dataset with 23,700 queries that are short and unstructured, in the same style made by real users of task-oriented systems. The queries cover 150 intents, plus out-of-scope queries that do not fall within any of the 150 in-scope intents.
We evaluate a range of benchmark classifiers and out-of-scope handling methods on our dataset. BERT (Devlin et al., 2019) yields the best in-scope accuracy, scoring 96% or above even when we limit the training data or introduce class imbalance. However, all methods struggle with identifying out-of-scope queries. Even when a large number of out-of-scope examples are provided for training, there is a major performance gap, with the best system scoring 66% out-of-scope recall. Our results show that while current models work on known classes, they have difficulty on out-of-scope queries, particularly when data is not plentiful. This dataset will enable future work to address this key gap in the research and development of dialog systems. All data introduced in this paper can be found at https://github.com/clinc/oos-eval.
We introduce a new crowdsourced dataset of 23,700 queries, including 22,500 in-scope queries covering 150 intents, which can be grouped into 10 general domains. The dataset also includes 1,200 out-of-scope queries. Table 1 shows examples of the data.111See the supplementary material for a full list of domains and intents, as well as additional examples.
We defined the intents with guidance from queries collected using a scoping
crowdsourcing task, which prompted crowd workers to provide questions and commands related to topic domains in the manner they would interact with an artificially intelligent assistant. We manually grouped data generated byscoping tasks into intents. To collect additional data for each intent, we used the rephrase and scenario crowdsourcing tasks proposed by Kang et al. (2018).222 In all cases, crowd workers were paid $0.20 per task. See the supplementary material for examples of each task. For each intent, there are 100 training queries, which is representative of what a team with a limited budget could gather while developing a task-driven dialog system. Along with the 100 training queries, there are 20 validation and 30 testing queries per intent.
Out-of-scope queries were collected in two ways. First, using worker mistakes: queries written for one of the 150 intents that did not actually match any of the intents. Second, using scoping and scenario tasks with prompts based on topic areas found on Quora, Wikipedia, and elsewhere. To help ensure the richness of this additional out-of-scope data, each of these task prompts contributed to at most four queries. Since we use the same crowdsourcing method for collecting out-of-scope data, these queries are similar in style to their in-scope counterparts.
The out-of-scope data is difficult to collect, requiring expert knowledge of the in-scope intents to painstakingly ensure that no out-of-scope query sample is mistakenly labeled as in-scope (and vice versa). Indeed, roughly only 69% of queries collected with prompts targeting out-of-scope yielded out-of-scope queries. Of the 1,200 out-of-scope queries collected, 100 are used for validation and 100 are used for training, leaving 1,000 for testing.
For all queries collected, all tokens were down-cased, and all end-of-sentence punctuation was removed. Additionally, all duplicate queries were removed and replaced.
In an effort to reduce bias in the in-scope data, we placed all queries from a given crowd worker in a single split (train, validation, or test). This avoids the potential issue of similar queries from a crowd worker ending up in both the train and test sets, for instance, which would make the train and test distributions unrealistically similar. We note that this is a recommendation from concurrent work by Geva et al. (2019). We also used this procedure for the out-of-scope set, except that we split the data into train/validation/test based on task prompt instead of worker.
In addition to the full dataset, we consider three variations. First, Small, in which there are only 50 training queries per each in-scope intent, rather than 100. Second, Imbalanced, in which intents have either 25, 50, 75, or 100 training queries. Third, OOS+, in which there are 250 out-of-scope training examples, rather than 100. These are intended to represent production scenarios where data may be in limited or uneven supply.
To quantify the challenges that our new dataset presents, we evaluated the performance of a range of classifier models333See the supplementary material for model details. and out-of-scope prediction schemes.
A linear support vector machine with bag-of-words sentence representations.
A convolutional neural network with non-static word embeddings initialized with GloVe(Pennington et al., 2014).
BERT: A neural network that is trained to predict elided words in text and then fine-tuned on our data (Devlin et al., 2019).
We use three baseline approaches for the task of predicting whether a query is out-of-scope: (1) oos-train, where we train an additional (i.e. ) intent on out-of-scope training data; (2) oos-threshold
To reduce the severity of the class imbalance between in-scope versus out-of-scope query samples (i.e., 15,000 versus 250 queries for OOS+), we investigate two strategies when using oos-binary: one where we undersample the in-scope data and train using 1,000 in-scope queries sampled evenly across all intents (versus 250 out-of-scope), and another where we augment the 250 OOS+ out-of-scope training queries with 14,750 sentences sampled from Wikipedia.
From a development point of view, the oos-train and oos-binary methods both require careful curation of an out-of-scope training set, and this set can be tailored to individual systems. The oos-threshold method is a more general decision rule that can be applied to any model that produces a probability. In our evaluation, the out-of-scope threshold was chosen to be the value which yielded the highest validation score across all intents, treating out-of-scope as its own intent.
|In-Scope Accuracy||Out-Of-Scope Recall|
We consider two performance metrics for all scenarios: (1) accuracy over the 150 intents, and (2) recall on out-of-scope queries. We use recall to evaluate out-of-scope since we are more interested in cases where such queries are predicted as in-scope, as this would mean a system gives the user a response that is completely wrong. Precision errors are less problematic as the fallback response will prompt the user to try again, or inform the user of the system’s scope of supported domains.
Table 2 presents results for all models across the four variations of the dataset. First, BERT is consistently the best approach for in-scope, followed by MLP. Second, out-of-scope query performance is much lower than in-scope across all methods. Training on less data (Small and Imbalanced) yields models that perform slightly worse on in-scope queries. The trend is mostly the opposite when evaluating out-of-scope, where recall increases under the Small and Imbalanced training conditions. Under these two conditions, the size of the in-scope training set was decreased, while the number of out-of-scope training queries remained constant. This indicates that out-of-scope performance can be increased by increasing the relative number of out-of-scope training queries. We do just that in the OOS+ setting—where the models were trained on the full training set as well as 150 additional out-of-scope queries—and see that performance on out-of-scope increases substantially, yet still remains low relative to in-scope accuracy.
In-scope accuracy using the oos-threshold approach is largely comparable to oos-train. Out-of-scope recall tends to be much higher on Full, but several models suffer greatly on the limited datasets. BERT and MLP are the top oos-threshold performers, and for several models the threshold approach provided erratic results, particularly FastText and Rasa.
Table 3 compares classifier performance using the oos-binary scheme. In-scope accuracy suffers for all models using the undersampling scheme when compared to training on the full dataset using the oos-train and oos-threshold approaches shown in Table 2. However, out-of-scope recall improves compared to oos-train on Full but not OOS+. Augmenting the out-of-scope training set appears to help improve both in-scope and out-of-scope performance compared to undersampling, but out-of-scope performance remains weak.
|Classifier||under||wiki aug||under||wiki aug|
|Our Dataset (This Work)||150||23,700||✓||✓||✓||✓|
|TREC-6, (Li and Roth, 2002)||6||5,952||✗||✗||✗||✗|
|TREC-50, (Li and Roth, 2002)||50||5,952||✗||✓||✓||✗|
|Web Apps, (Braun et al., 2017)||8||89||✗||✗||✓||✗|
|Ask Ubuntu, (Braun et al., 2017)||5||162||✗||✗||✓||✗|
|Chatbot Corpus, (Braun et al., 2017)||2||206||✓||✗||✓||✗|
|Snips, (Coucke et al., 2018)||7||14,484||✓||✗||✗||✗|
|Liu et al. (2019)||54||25,716||✓||✓||✗||✗|
In most other analyses and datasets, the idea of out-of-scope data is not considered, and instead the output classes are intended to cover all possible queries (e.g., TREC (Li and Roth, 2002)). Recent work by Hendrycks and Gimpel (2017) considers a similar problem they call out-of-distribution detection. They use other datasets or classes excluded during training to form the out-of-distribution samples. This means that the out-of-scope samples are from a small set of coherent classes that differ substantially from the in-distribution samples. Similar experiments were conducted for evaluating unknown intent discovery models in Lin and Xu (2019). In contrast, our out-of-scope queries cover a broad range of phenomena and are similar in style and often similar in topic to in-scope queries, representing things a user might say given partial knowledge of the capabilities of a system.
are the most similar to the in-scope part of our work, with the same type of conversational agent requests. Like our work, both of these datasets were bootstrapped using crowdsourcing. However, the Snips dataset has only a small number of intents and an enormous number of examples of each. Snips does present a low-data variation, with 70 training queries per intent, in which performance drops slightly. The dataset presented inLiu et al. (2019) has a large number of intent classes, yet also contains a wide range of samples per intent class (ranging from 24 to 5,981 queries per intent, and so is not constrained in all cases).
Braun et al. (2017) created datasets with constrained training data, but with very few intents, presenting a very different type of challenge. We also include the TREC query classification datasets (Li and Roth, 2002), which have a large set of labels, but they describe the desired response type (e.g., distance, city, abbreviation) rather than the action intents we consider. Moreover, TREC contains only questions and no commands. Crucially, none of the other datasets summarized in Table 4 offer a feasible way to evaluate out-of-scope performance.
The Dialog State Tracking Challenge (DSTC) datasets are another related resource. Specifically, DSTC 1 Williams et al. (2013), DSTC 2 Henderson et al. (2014a), and DSTC 3 Henderson et al. (2014b) contain “chatbot style” queries, but the datasets are focused on state tracking. Moreover, most if not all queries in these datasets are in-scope. In contrast, the focus of our analysis is on both in- and out-of-scope queries that challenge a virtual assistant to determine whether it can provide an acceptable response.
This paper analyzed intent classification and out-of-scope prediction methods with a new dataset consisting of carefully collected out-of-scope data. Our findings indicate that certain models like BERT perform better on in-scope classification, but all methods investigated struggle with identifying out-of-scope queries. Models that incorporate more out-of-scope training data tend to improve on out-of-scope performance, yet such data is expensive and difficult to generate. We believe our analysis and dataset will lead to developing better, more robust dialog systems.
All datasets introduced in this paper can be found at https://github.com/clinc/oos-eval.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP).