Recently, there has been a surge of excitement in developing chatbots for various purposes in research and enterprise. Data-driven approaches offered by common bot building platforms (e.g. Google Dialogflow, Amazon Alexa Skills Kit, Microsoft Bot Framework) make it possible for a wide range of users to easily create dialog systems with a limited amount of data in their domain of interest. Although most task-oriented dialog systems are built for a closed set of target domains, any failure to detect out-of-domain (OOD) utterances and respond with an appropriate fallback action can lead to a frustrating user experience. A number of prior approaches to OOD detection require both in-domain (IND) and OOD data [10, 14]. However, it is a formidable task to collect enough data to cover the, in theory, unbounded variety of OOD utterances. In contrast, later work introduced an in-domain verification method that requires only IND utterances. Subsequently, with the rise of deep neural networks, an autoencoder-based OOD detection method was proposed that surpasses prior approaches without access to OOD data. However, those approaches still impose restrictions: there must be multiple sub-domains to learn utterance representations, and one must set a decision threshold for OOD detection. This can prevent these methods from being used for most bots that focus on a single task.
The goal of this paper is to propose a novel OOD detection method that does not require OOD data, by utilizing counterfeit OOD turns in the context of a dialog. Most prior approaches do not consider dialog context and make predictions for each utterance independently. We will show that this independent decision leads to suboptimal performance even when actual OOD utterances are given to optimize the model, and that the use of dialog context helps reduce OOD detection errors. To consider dialog context, we need to connect the OOD detection task with the overall dialog task. Thus, for this work, we build upon Hybrid Code Networks (HCN), since HCNs achieve state-of-the-art performance in a data-efficient way for task-oriented dialogs, and propose AE-HCN, which extends HCN with an autoencoder (Figure 1). Furthermore, we release new dialog datasets, three publicly available dialog corpora augmented with OOD turns in a controlled way (exemplified in Table 1), to foster further research (https://github.com/sungjinl/icassp2019-ood-dataset.git).
Table 1: An example dialog augmented with OOD turns.

1  sys:  Hello, welcome to the Cambridge restaurant system. How may I help you?
2  usr:  What time is the next bus from squirrel hill to morewood?
3  sys:  Sorry I didn’t catch that. Could you please repeat?
4  usr:  Bring me to the nearest grocery store
5  sys:  Sorry I didn’t catch that. Could you please repeat?
6  usr:  So sorry man I want a moderately priced restaurant in the west part of town
In this section, we first present the standard HCN model. Then we introduce the proposed AE-HCN(-CNN) model, consisting of an autoencoder and a reconstruction score-aware HCN model. Finally, we describe the counterfeit data augmentation method for training the proposed model.
As shown in Figure 1, HCN considers a dialog as a sequence of turns. At each turn $t$, HCN takes a tuple $(u_t, a_{t-1}, f_t)$ as input to produce the next system action $a_t$ (a system action can be either a text output or an API call), where $u_t$ is a user utterance consisting of $n$ tokens, i.e., $u_t = (w_1, \ldots, w_n)$, $a_{t-1}$ is a one-hot vector encoding the previous system action, and $f_t$ is a contextual feature vector generated by domain-specific code. The user utterance is encoded as the concatenation of a bag-of-words representation and the average of the word embeddings of the user utterance:

$e(u_t) = [\,\mathrm{BoW}(u_t)\,;\ \tfrac{1}{n}\sum_{i=1}^{n} \mathrm{emb}(w_i)\,]$
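As a concrete illustration, this encoding can be sketched in a few lines of numpy. The vocabulary and embedding matrix below are toy stand-ins for the real ones (e.g. pretrained GloVe vectors):

```python
import numpy as np

def encode_utterance(tokens, vocab, emb):
    """Encode an utterance as [bag-of-words ; mean word embedding].

    vocab: dict mapping token -> index; emb: |V| x d embedding matrix.
    Out-of-vocabulary tokens are simply skipped in this sketch.
    """
    bow = np.zeros(len(vocab))
    vecs = []
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            bow[idx] += 1.0
            vecs.append(emb[idx])
    avg = np.mean(vecs, axis=0) if vecs else np.zeros(emb.shape[1])
    return np.concatenate([bow, avg])

# toy example: 3-word vocabulary with identity "embeddings"
vocab = {"book": 0, "a": 1, "table": 2}
emb = np.eye(3)
e = encode_utterance(["book", "a", "table"], vocab, emb)
```

The resulting vector has dimensionality |V| + d; in the real model, the embedding part would come from pretrained word vectors.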
The utterance encoding $e(u_t)$ is combined with $a_{t-1}$ and $f_t$ and fed to a recurrent layer that tracks the dialog state $h_t$; finally, a distribution over system actions is calculated by a dense layer with a softmax activation:

$p(a_t \mid h_t) = \mathrm{softmax}(W h_t + b)$
On top of HCN, AE-HCN additionally takes as input an autoencoder’s reconstruction score $s_t$ for the user utterance when updating the dialog state (Figure 1).
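In this notation, the AE-HCN dialog-state update can be sketched as follows (the concatenation order is illustrative):

```latex
h_t = \mathrm{RNN}\big(h_{t-1},\ [\, e(u_t)\,;\ a_{t-1}\,;\ f_t\,;\ s_t \,]\big)
```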
The autoencoder is a standard seq2seq model which projects a user utterance into a latent vector and then reconstructs the utterance from it. Specifically, the encoder reads $u_t$ using a GRU to produce a 512-dimensional hidden vector $h^e$, which in turn is linearly projected to a 200-dimensional latent vector $z$:

$z = W_z h^e + b_z$
The output of the decoder at step $i$ is a distribution over words:

$p(w_i \mid w_{<i}, z) = \mathrm{softmax}(W_o h^d_i + b_o)$

where the decoder GRU producing $h^d_i$ has 512 hidden units. The reconstruction score $s_t$ is the generation probability of $u_t$ normalized by its length; we use the length-normalized negative log-likelihood, so that a higher score means a poorer reconstruction:

$s_t = -\tfrac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_{<i}, z)$
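Given the per-token probabilities the decoder assigns to the original utterance, the score can be computed as below. This is a minimal sketch; the seq2seq model producing these probabilities is omitted:

```python
import numpy as np

def reconstruction_score(token_probs):
    """Length-normalized negative log-likelihood of the reconstruction.

    token_probs: probability the decoder assigns to each gold token.
    Higher scores mean the autoencoder reconstructs the utterance
    poorly, which is the signal AE-HCN uses for OOD detection.
    """
    token_probs = np.asarray(token_probs, dtype=float)
    return float(-np.mean(np.log(token_probs)))

# an in-domain utterance is reconstructed confidently ...
s_ind = reconstruction_score([0.9, 0.8, 0.95])
# ... while an OOD utterance is not, yielding a larger score
s_ood = reconstruction_score([0.2, 0.1, 0.05])
```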
2.4 Counterfeit Data Augmentation
To endow an AE-HCN(-CNN) model with the capability of detecting OOD utterances and producing fallback actions without requiring real OOD data, we augment the training data with counterfeit turns. We first select arbitrary turns in a dialog at random according to a counterfeit OOD probability $\rho$, and insert counterfeit turns before the selected turns. A counterfeit turn consists of a tuple $(u_t, a_{t-1}, f_t, s_t)$ as input and a fallback action as output. We copy $a_{t-1}$ and $f_t$ of each selected turn to the corresponding counterfeit turn, since OOD utterances do not affect the previous system action or the feature vector generated by domain-specific code. It remains to generate a counterfeit $u_t$ and $s_t$. Since we don’t know OOD utterances a priori, we randomly choose one of the user utterances of the same dialog to be the counterfeit $u_t$. This helps the model learn to detect OOD utterances, because a random user utterance is contextually inappropriate just as OOD utterances are. We generate the counterfeit $s_t$
by drawing a sample from a uniform distribution $\mathrm{Uniform}(s_{\max}, M)$, where $s_{\max}$ is the maximum reconstruction score over the training data and $M$ is an arbitrary large number. The rationale is that the reconstruction scores of OOD utterances are likely to be larger than $s_{\max}$, but we don’t know what distribution the reconstruction scores of OOD turns would follow. Thus we choose the most uninformed distribution, a uniform one, so that the model is encouraged to consider not only the reconstruction score but also other contextual features, such as the appropriateness of the user utterance given the context, changes in the domain-specific feature vector, and what action the system previously took.
To study the effect of OOD input on a dialog system’s performance, we use three task-oriented dialog datasets: bAbI6, initially collected for the second Dialog State Tracking Challenge (DSTC2), and GR and GM, taken from the Google multi-domain dialog datasets. Basic statistics of the datasets are shown in Table 2. bAbI6 deals with restaurant finding, GR with reserving a restaurant table, and GM with buying a movie ticket. We generated distinct action templates by replacing entities with slot types and consolidating templates based on dialog act annotations.
We augment test datasets (denoted as Test-OOD in Table 2) with real user utterances from other domains in a controlled way. Our OOD augmentations are as follows:
OOD utterances: user requests from a foreign domain, for which the desired system behavior is a fallback action;
segment-level OOD content: interjections within otherwise in-domain user requests, which are treated as valid user input and are supposed to be handled by the system in a regular way.
These two augmentation types reflect a specific dialog pattern of interest (see Table 1): first, the user utters a request from another domain at an arbitrary point in the dialog (each turn is augmented with a probability set to 0.2 for this study), and the system responds accordingly. This may go on for several turns in a row, each following turn being augmented with a probability set to 0.4 for this study. Eventually, the OOD sequence ends and the dialog continues as usual, with segment-level OOD content in which the user acknowledges their mistake. While we introduce the OOD augmentations in a controlled programmatic way, the actual OOD content is natural. The OOD utterances are taken from dialog datasets in several foreign domains: 1) the Frames dataset, covering travel booking (1198 utterances); 2) the Stanford Key-Value Retrieval Network Dataset, covering calendar scheduling, weather information retrieval, and city navigation (3030 utterances); and 3) Dialog State Tracking Challenge 1, covering bus information (968 utterances).
In order to avoid incomplete or elliptical phrases, we only took the first user utterance from each dialog. For segment-level OOD content, we mined utterances containing an explicit acknowledgment of a mistake from Twitter and Reddit conversation datasets (701 and 500 utterances, respectively).
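The augmentation pattern described above can be sketched as follows. The function and pool names are ours; `p_start` and `p_cont` correspond to the 0.2 and 0.4 probabilities used in this study:

```python
import random

def augment_test_dialog(user_turns, ood_pool, affirm_pool,
                        p_start=0.2, p_cont=0.4):
    """Weave OOD segments into a dialog's user turns (sketch).

    Before each in-domain turn, with probability p_start an OOD segment
    begins; it continues with probability p_cont per extra turn.  The
    first in-domain turn after a segment gets a segment-level
    acknowledgment ("so sorry man ...") prepended.
    """
    result = []
    for turn in user_turns:
        ood_here = False
        if random.random() < p_start:
            ood_here = True
            result.append(("OOD", random.choice(ood_pool)))
            while random.random() < p_cont:
                result.append(("OOD", random.choice(ood_pool)))
        if ood_here:
            turn = random.choice(affirm_pool) + " " + turn
        result.append(("IND", turn))
    return result
```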
Table 2: Basic statistics of the datasets, average turns per dialog.

| Dataset | Train | Dev  | Test | Test-OOD |
|---------|-------|------|------|----------|
| bAbI6   | 20.08 | 19.30 | 22.07 | 27.27 |
| GR      | 9.07  | 6.53  | 6.87  | 9.01  |
| GM      | 8.78  | 9.14  | 8.73  | 11.25 |
4 Experimental Setup and Evaluation
Table 3: Evaluation results. P@K means Precision@K; OOD F1 denotes the F1-score for OOD detection over utterances. bAbI6 is evaluated with P@1 and GR and GM with P@3, in each case on both Test and Test-OOD, alongside OOD F1.
We comparatively evaluate four different models: 1) an HCN model trained on in-domain training data; 2) an AE-HCN-Indep model, which is the same as the HCN model except that it handles OOD utterances using an independent autoencoder-based rule mimicking prior threshold-based approaches: when the reconstruction score is greater than a threshold, the fallback action is chosen, with the threshold set to the maximum reconstruction score of the training data; 3) AE-HCN and 4) AE-HCN-CNN models trained on training data augmented with counterfeit OOD turns, with the counterfeit OOD probability set to 15% and the upper bound of the uniform score distribution set to 30. We apply dropout to the user utterance encoding with probability 0.3. We use the Adam optimizer, with gradients computed on mini-batches of size 1 and clipped to norm value 5. The learning rate was held fixed throughout training, and all other hyperparameters were left as suggested in the original HCN work. We performed early stopping based on performance on the evaluation data to avoid overfitting. We first pretrain the autoencoder on in-domain training data and keep it fixed while training the other components.
The results are shown in Table 3. Since there are multiple actions that are appropriate for a given dialog context, we use per-utterance Precision@K as the performance metric. We also report the F1-score for OOD detection to measure the balance between precision and recall. The performance of HCN on Test-OOD is about 15 points lower on average than on Test, showing the detrimental impact of OOD utterances on models trained only on in-domain data. AE-HCN(-CNN) outperforms HCN on Test-OOD by a large margin of about 17 (20) points on average, while keeping the performance trade-off relative to Test minimal. Interestingly, AE-HCN-CNN performs even better than HCN on Test, indicating that, with the CNN encoder, counterfeit OOD augmentation acts as an effective regularization. In contrast, AE-HCN-Indep fails to robustly detect OOD utterances, resulting in much lower numbers on both metrics on Test-OOD as well as degraded performance on Test. This result indicates two crucial points: 1) the inherent difficulty of finding an appropriate threshold value without actually seeing OOD data; and 2) the limitation of models that do not consider context. For the first point, Figure 2 plots histograms of reconstruction scores for IND and OOD utterances of bAbI6 Test-OOD. If OOD utterances had been known a priori, the threshold should have been set to a much higher value than the maximum reconstruction score of IND training data (6.16 in this case).
For the second point, Table 4 shows the search for the best threshold value for AE-HCN-Indep on the bAbI6 task when given actual OOD utterances (which is highly unrealistic in a real-world scenario). Note that even the best performance, achieved at a threshold of 9, is still not as good as that of AE-HCN(-CNN). This implies that we can perform better OOD detection by jointly considering other contextual features.
Finally, we conduct a sensitivity analysis by varying the counterfeit OOD probability. Table 5 shows the performance of AE-HCN-CNN on bAbI6 Test-OOD with values ranging from 5% to 30%. The results indicate that our method yields good performance regardless of the particular value. This stability nicely contrasts with the high sensitivity of AE-HCN-Indep to threshold values shown in Table 4.
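For reference, the per-utterance Precision@K reported above can be computed as follows. This sketch (with names of our choosing) counts a turn as correct when the target action appears among the model's top-K scored actions:

```python
def precision_at_k(scores, gold, k):
    """1.0 if the gold action index is among the k highest-scoring
    actions, else 0.0 (per-utterance Precision@K)."""
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return 1.0 if gold in top_k else 0.0

def corpus_precision_at_k(all_scores, all_gold, k):
    """Average per-utterance Precision@K over a test set."""
    hits = [precision_at_k(s, g, k) for s, g in zip(all_scores, all_gold)]
    return sum(hits) / len(hits)

# toy example: two turns, three candidate actions each
p1 = corpus_precision_at_k([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]], [1, 2], k=1)
```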
We proposed a novel OOD detection method that requires neither OOD data nor restrictions such as multiple sub-domains or a hand-tuned decision threshold, by utilizing counterfeit OOD turns in the context of a dialog. We also release new dialog datasets, three publicly available dialog corpora augmented with natural OOD turns, to foster further research. In the presence of OOD utterances, our method outperforms state-of-the-art dialog models equipped with an OOD detection mechanism by a large margin of more than 17 points in Precision@K on average, while minimizing the performance trade-off on in-domain test data. The detailed analysis sheds light on the difficulty of optimizing context-independent OOD detection and justifies the necessity of context-aware OOD handling models. We plan to explore other ways of scoring OOD utterances beyond autoencoders; for example, variational autoencoders and generative adversarial networks have great potential. We are also interested in using generative models to produce more realistic counterfeit user utterances.
- (2017) Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of SIGDIAL 2017, pp. 207–219.
- (2016) Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
- (2014) On the properties of neural machine translation: encoder–decoder approaches. In Syntax, Semantics and Structure in Statistical Translation, pp. 103.
- (2017) Key-value retrieval networks for task-oriented dialogue. In Proceedings of SIGDIAL 2017, pp. 37–49.
- (2014) The second dialog state tracking challenge. In Proceedings of SIGDIAL 2014, pp. 263–272.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2014) Convolutional neural networks for sentence classification. In Proceedings of EMNLP 2014, pp. 1746–1751.
- (2015) Adam: a method for stochastic optimization. In ICLR.
- (2007) Out-of-domain utterance detection using classification confidences of multiple topics. IEEE Transactions on Audio, Speech, and Language Processing 15 (1), pp. 150–161.
- (2011) A two-stage domain selection framework for extensible multi-domain spoken dialogue systems. In Proceedings of SIGDIAL 2011, pp. 18–29.
- (2014) GloVe: global vectors for word representation. In Proceedings of EMNLP 2014, pp. 1532–1543.
- (2017) Neural sentence embedding using only in-domain sentences for out-of-domain sentence detection in dialog systems. Pattern Recognition Letters 88, pp. 26–32.
- (2018) Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.
- (2014) Detecting out-of-domain utterances addressed to a virtual personal assistant. In Proceedings of Interspeech 2014.
- (2017) Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274.
- (2013) The dialog state tracking challenge. In Proceedings of SIGDIAL 2013, pp. 404–413.