Contextual Out-of-Domain Utterance Handling With Counterfeit Data Augmentation

05/24/2019 · Sungjin Lee, et al.

Neural dialog models often lack robustness to anomalous user input and produce inappropriate responses, which leads to a frustrating user experience. Although there is a set of prior approaches to out-of-domain (OOD) utterance detection, they share a few restrictions: they rely on OOD data or multiple sub-domains, and their OOD detection is context-independent, which leads to suboptimal performance in a dialog. The goal of this paper is to propose a novel OOD detection method that does not require OOD data, by utilizing counterfeit OOD turns in the context of a dialog. To foster further research, we also release new dialog datasets: three publicly available dialog corpora augmented with OOD turns in a controllable way. Our method outperforms state-of-the-art dialog models equipped with a conventional OOD detection mechanism by a large margin in the presence of OOD utterances.







1 Introduction

Recently, there has been a surge of excitement in developing chatbots for various purposes in research and enterprise. Data-driven approaches offered by common bot-building platforms (e.g., Google Dialogflow, Amazon Alexa Skills Kit, Microsoft Bot Framework) make it possible for a wide range of users to easily create dialog systems with a limited amount of data in their domain of interest. Although most task-oriented dialog systems are built for a closed set of target domains, any failure to detect out-of-domain (OOD) utterances and respond with an appropriate fallback action can lead to a frustrating user experience. There is a set of prior approaches to OOD detection which require both in-domain (IND) and OOD data [10, 14]. However, it is a formidable task to collect sufficient data to cover the in-principle unbounded variety of OOD utterances. In contrast, [9] introduced an in-domain verification method that requires only IND utterances. Later, with the rise of deep neural networks, [12] proposed an autoencoder-based OOD detection method which surpasses prior approaches without access to OOD data. However, these approaches still have restrictions: there must be multiple sub-domains to learn utterance representations, and a decision threshold must be set for OOD detection. This can prohibit these methods from being used for most bots that focus on a single task.

The goal of this paper is to propose a novel OOD detection method that does not require OOD data, by utilizing counterfeit OOD turns in the context of a dialog. Most prior approaches do not consider dialog context and make predictions for each utterance independently. We will show that this independent decision leads to suboptimal performance even when actual OOD utterances are given to optimize the model, and that the use of dialog context helps reduce OOD detection errors. To consider dialog context, we need to connect the OOD detection task with the overall dialog task. Thus, for this work, we build upon Hybrid Code Networks (HCN) [15], since HCNs achieve state-of-the-art performance in a data-efficient way for task-oriented dialogs, and propose AE-HCNs, which extend HCNs with an autoencoder (Figure 1). Furthermore, we release new dialog datasets which are three publicly available dialog corpora augmented with OOD turns in a controlled way (exemplified in Table 1) to foster further research.

1 sys: Hello, welcome to the Cambridge restaurant system.
How may I help you?
2 usr: What time is the next bus from squirrel hill to morewood?
3 sys: Sorry I didn’t catch that. Could you please repeat?
4 usr: Bring me to the nearest grocery store
5 sys: Sorry I didn’t catch that. Could you please repeat?
6 usr: So sorry man I want a moderately priced restaurant in the west part of town
7 sys: api call
Table 1: Augmented dialog example (turns 2 and 4 are OOD utterances; "So sorry man" in turn 6 is segment-level OOD content).

2 Methods

In this section, we first present the standard HCN model. Then we introduce the proposed AE-HCN(-CNN) model, consisting of an autoencoder and a reconstruction score-aware HCN model. Finally, we describe the counterfeit data augmentation method for training the proposed model.

2.1 HCN

As shown in Figure 1, HCN considers a dialog as a sequence of turns. At each turn, HCN takes a tuple, $(u_t, a_{t-1}, f_t)$, as input to produce the next system action (a system action can be either a text output or an api call), where $u_t$ is a user utterance consisting of $n$ tokens, i.e., $u_t = (w_1, \ldots, w_n)$, $a_{t-1}$ is a one-hot vector encoding the previous system action, and $f_t$ is a contextual feature vector generated by domain-specific code. The user utterance is encoded as a concatenation of a bag-of-words representation and an average of word embeddings of the user utterance:

$$e_t = \Big[\mathrm{BOW}(u_t);\ \frac{1}{n}\sum_{i=1}^{n} \mathrm{emb}(w_i)\Big] \tag{1}$$

where $\mathrm{emb}(\cdot)$ denotes a word embedding layer initialized with GloVe [11] with 100 dimensions. HCN then considers the input tuple, $(e_t, a_{t-1}, f_t)$, to update the dialog state through an LSTM [6] with 200 hidden units:

$$h_t = \mathrm{LSTM}\big(h_{t-1}, [e_t; a_{t-1}; f_t]\big) \tag{2}$$

Finally, a distribution over system actions is calculated by a dense layer with a softmax activation:

$$p_t = \mathrm{softmax}(W h_t + b) \tag{3}$$
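As a concrete illustration of the encoding in equation 1, the following sketch builds the concatenated bag-of-words and mean-embedding vector; `vocab` (token → index) and `emb` (token → vector) are illustrative stand-ins for the vocabulary and the GloVe lookup, not names from the paper:

```python
import numpy as np

def encode_utterance(tokens, vocab, emb):
    """Encode an utterance as [bag-of-words ; mean word embedding],
    as in equation 1. `vocab` maps token -> index, `emb` maps
    token -> embedding vector (both are illustrative stand-ins)."""
    bow = np.zeros(len(vocab))
    for tok in tokens:
        bow[vocab[tok]] += 1.0                               # bag-of-words counts
    avg = np.mean([emb[tok] for tok in tokens], axis=0)      # mean word embedding
    return np.concatenate([bow, avg])                        # input feature for the LSTM
```

With 100-dimensional GloVe vectors, the resulting encoding has |V| + 100 dimensions.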

Figure 1: The architecture of AE-HCN, which is the same as HCN except for the autoencoder component.

2.2 AE-HCN

On top of HCN, AE-HCN additionally takes as input an autoencoder's reconstruction score $s_t$ for the user utterance when updating the dialog state (Figure 1):

$$h_t = \mathrm{LSTM}\big(h_{t-1}, [e_t; a_{t-1}; f_t; s_t]\big) \tag{4}$$

The autoencoder is a standard seq2seq model which projects a user utterance into a latent vector and reconstructs the user utterance. Specifically, the encoder reads $u_t$ using a GRU [3] to produce a 512-dimensional hidden vector $h^e$, which in turn gets linearly projected to a 200-dimensional latent vector $z$:

$$z = W^{p} h^{e} + b^{p} \tag{5}$$

The output of the decoder at step $i$ is a distribution over words:

$$p(w_i \mid w_{<i}, z) = \mathrm{softmax}(W^{d} h^{d}_i + b^{d}) \tag{6}$$

where the decoder GRU has 512 hidden units. The reconstruction score $s_t$ is the length-normalized negative log-likelihood of regenerating $u_t$:

$$s_t = -\frac{1}{n}\sum_{i=1}^{n} \log p(w_i \mid w_{<i}, z) \tag{7}$$
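Assuming the reconstruction score is the length-normalized negative log-likelihood (a reading consistent with OOD utterances receiving larger scores than in-domain ones), it can be computed from the decoder's per-token probabilities as:

```python
import numpy as np

def reconstruction_score(token_probs):
    """Length-normalized negative log-likelihood of the decoder
    regenerating the utterance (equation 7, under our reading):
    higher scores mean worse reconstruction, as expected for OOD."""
    return float(-np.mean(np.log(token_probs)))
```

For example, a decoder that assigns probability 0.5 to every token yields a score of ln 2 ≈ 0.69; less confident reconstructions score higher.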



2.3 AE-HCN-CNN

AE-HCN-CNN is a variant of AE-HCN where user utterances are encoded using a CNN layer with max-pooling (following [7]) rather than equation 1:

$$e_t = \mathrm{maxpool}\big(\mathrm{CNN}(w_1, \ldots, w_n)\big) \tag{8}$$

The CNN layer considers two kernel sizes (2 and 3) and has 100 filters for each kernel size.
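A minimal numpy sketch of this encoder follows: 1-D convolutions with kernel sizes 2 and 3 over the word-embedding sequence, ReLU, and max-pooling over time. The filter matrices here are random stand-ins for learned parameters, and we use fewer than the paper's 100 filters per kernel size for brevity:

```python
import numpy as np

def cnn_encode(emb_seq, filters):
    """Max-pooled 1-D CNN over a sequence of word embeddings.
    `filters[k]` has shape (n_filters, k * emb_dim): each row is one
    convolution filter of width k (stand-ins for learned weights)."""
    pooled = []
    for k, W in filters.items():
        # All length-k windows, each flattened to a (k * emb_dim)-vector.
        windows = np.array([np.concatenate(emb_seq[i:i + k])
                            for i in range(len(emb_seq) - k + 1)])
        feature_maps = np.maximum(0.0, windows @ W.T)  # ReLU activations
        pooled.append(feature_maps.max(axis=0))        # max-pool over time
    return np.concatenate(pooled)
```

The output dimension is the total number of filters across kernel sizes (200 in the paper's configuration).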

2.4 Counterfeit Data Augmentation

To endow an AE-HCN(-CNN) model with the capability of detecting OOD utterances and producing fallback actions without requiring real OOD data, we augment the training data with counterfeit turns. We first select arbitrary turns in a dialog at random according to a counterfeit OOD probability $\rho$, and insert counterfeit turns before the selected turns. A counterfeit turn consists of a tuple, $(\tilde{u}, a_{t-1}, f_t, \tilde{s})$, as input and a fallback action as output. We copy $a_{t-1}$ and $f_t$ of each selected turn to the corresponding counterfeit turn, since OOD utterances do not affect the previous system action or the feature vector generated by domain-specific code. Now we generate a counterfeit utterance $\tilde{u}$ and score $\tilde{s}$. Since we don't know OOD utterances a priori, we randomly choose one of the user utterances of the same dialog to be $\tilde{u}$. This helps the model learn to detect OOD utterances, because a random user utterance is contextually inappropriate just as OOD utterances are. We generate $\tilde{s}$ by drawing a sample from a uniform distribution, $\mathcal{U}[s_{\max}, S]$, where $s_{\max}$ is the maximum reconstruction score of the training data and $S$ is an arbitrarily large number. The rationale is that the reconstruction scores of OOD utterances are likely to be larger than $s_{\max}$, but we don't know what distribution the reconstruction scores of OOD turns would follow. Thus we choose the most uninformed distribution, i.e., a uniform distribution, so that the model is encouraged to consider not only the reconstruction score but also other contextual features, such as the appropriateness of the user utterance given the context, changes in the domain-specific feature vector, and what action the system previously took.
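The counterfeit augmentation above can be sketched as follows. The dict keys and the `FALLBACK` marker are illustrative, not the paper's data format; `s_max` is the maximum reconstruction score on training data and `big` the arbitrary upper bound $S$:

```python
import random

FALLBACK = "fallback_action"  # placeholder for the system's fallback action

def counterfeit_augment(dialog, s_max, rho=0.15, big=30.0, rng=random):
    """Insert a counterfeit OOD turn before each turn selected with
    probability rho. The counterfeit utterance is a random user
    utterance from the same dialog; its reconstruction score is
    drawn uniformly from [s_max, big]; the previous action and
    domain-specific features are copied from the selected turn."""
    out = []
    for turn in dialog:
        if rng.random() < rho:
            out.append({
                "utt": rng.choice(dialog)["utt"],    # contextually inappropriate utterance
                "prev_action": turn["prev_action"],  # copied: OOD doesn't change it
                "features": turn["features"],        # copied domain-specific features
                "score": rng.uniform(s_max, big),    # counterfeit reconstruction score
                "action": FALLBACK,                  # training target: fallback
            })
        out.append(turn)
    return out
```

Because the counterfeit score is uninformative on its own, the model must lean on dialog context to decide when to produce the fallback action.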

3 Datasets

To study the effect of OOD input on a dialog system's performance, we use three task-oriented dialog datasets: bAbI6 [2], initially collected for Dialog State Tracking Challenge 2 [5], and GR and GM, taken from the Google multi-domain dialog datasets [13]. Basic statistics of the datasets are shown in Table 2. bAbI6 deals with restaurant finding, GM with buying a movie ticket, and GR with reserving a restaurant table. We generated distinct action templates by replacing entities with slot types and consolidating actions based on dialog act annotations.

We augment test datasets (denoted as Test-OOD in Table 2) with real user utterances from other domains in a controlled way. Our OOD augmentations are as follows:

  • OOD utterances: user requests from a foreign domain — the desired system behavior for such input is a fallback action,

  • segment-level OOD content: interjections within the user's in-domain requests; these are treated as valid user input and are supposed to be handled by the system in the regular way.

These two augmentation types reflect a specific dialog pattern of interest (see Table 1): first, the user utters a request from another domain at an arbitrary point in the dialog (each turn is augmented with a probability set to 0.2 for this study), and the system answers accordingly. This may go on for several turns in a row; each following turn is augmented with a probability set to 0.4 for this study. Eventually, the OOD sequence ends and the dialog continues as usual, with segment-level OOD content in which the user affirms their mistake. While we introduce the OOD augmentations in a controlled programmatic way, the actual OOD content is natural. The OOD utterances are taken from dialog datasets in several foreign domains: 1) the Frames dataset [1]: travel booking (1198 utterances); 2) the Stanford Key-Value Retrieval Network Dataset [4]: calendar scheduling, weather information retrieval, city navigation (3030 utterances); 3) Dialog State Tracking Challenge 1 [16]: bus information (968 utterances).

To avoid incomplete or elliptical phrases, we only took the first user utterance from each dialog. For segment-level OOD content, we mined utterances with an explicit affirmation of a mistake from Twitter and Reddit conversation datasets (701 and 500 utterances, respectively).
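The test-set augmentation procedure described above (start an OOD segment with probability 0.2, continue it with probability 0.4) can be sketched as follows; the turn representation and the "fallback" marker are illustrative, not the released datasets' format:

```python
import random

def augment_with_ood(dialog_turns, ood_pool, p_start=0.2, p_cont=0.4, rng=random):
    """Insert sequences of real OOD user turns into a dialog: before
    any turn, start an OOD segment with probability p_start; each
    inserted OOD turn is followed by another with probability p_cont
    (probabilities from Section 3)."""
    out = []
    for turn in dialog_turns:
        if rng.random() < p_start:
            while True:
                out.append(("usr-ood", rng.choice(ood_pool)))  # foreign-domain request
                out.append(("sys", "fallback"))                # desired system behavior
                if rng.random() >= p_cont:
                    break
        out.append(turn)
    return out
```

With p_start = 0.2 and p_cont = 0.4, OOD segments have an expected length of 1/(1 − 0.4) ≈ 1.67 turns.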

bAbI6 Train Dev Test Test-OOD
# dialogs 1618 500 1117 1117
Avg. turns per dialog 20.08 19.30 22.07 27.27
GR Train Dev Test Test-OOD
# dialogs 1116 349 775 775
Avg. turns per dialog 9.07 6.53 6.87 9.01
GM Train Dev Test Test-OOD
# dialogs 362 111 252 252
Avg. turns per dialog 8.78 9.14 8.73 11.25
Table 2: Data statistics. The numbers of distinct system actions are 58, 247, and 194 for bAbI6, GR, and GM, respectively.

4 Experimental Setup and Evaluation

Domain bAbI6 GR GM
Test Data Test Test-OOD Test Test-OOD Test Test-OOD
Metrics P@1 P@1 OOD F1 P@3 P@3 OOD F1 P@3 P@3 OOD F1
HCN 53.41 41.95 0 58.89 41.65 0 41.18 27.08 0
AE-HCN-Indep 31.29 41.06 48.68 51.90 55.42 71.52 31.12 42.78 64.35
AE-HCN 53.58 55.04 73.41 56.97 58.90 74.67 40.61 48.59 69.31
AE-HCN-CNN 55.04 55.35 70.38 58.32 64.51 81.33 45.12 52.79 68.59
Table 3: Evaluation results. P@K means Precision@K. OOD F1 denotes the F1-score for OOD detection over utterances.

We comparatively evaluate four models: 1) an HCN model trained on in-domain training data; 2) an AE-HCN-Indep model, which is the same as the HCN model except that it deals with OOD utterances using an independent autoencoder-based rule to mimic [12]: when the reconstruction score is greater than a threshold, the fallback action is chosen; we set the threshold to the maximum reconstruction score of the training data; 3) AE-HCN and AE-HCN-CNN models trained on training data augmented with counterfeit OOD turns, with the counterfeit OOD probability set to 15% and the upper bound of the uniform distribution set to 30. We apply dropout to the user utterance encoding with probability 0.3. We use the Adam optimizer [8], with gradients computed on mini-batches of size 1 and clipped to norm 5. The learning rate was kept fixed throughout training, and all other hyperparameters were left as suggested in [8]. We performed early stopping based on the performance on the evaluation data to avoid overfitting. We first pretrain the autoencoder on in-domain training data and keep it fixed while training the other components.
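The context-independent rule used by the AE-HCN-Indep baseline fits in a few lines; a minimal sketch (function names are ours):

```python
def fit_threshold(train_scores):
    """AE-HCN-Indep sets the threshold to the maximum reconstruction
    score observed on in-domain training data."""
    return max(train_scores)

def is_ood(score, threshold):
    """Flag an utterance as OOD, triggering the fallback action,
    whenever its reconstruction score exceeds the threshold."""
    return score > threshold
```

As the results below show, any fixed threshold chosen without seeing OOD data is fragile, which is exactly what the context-aware AE-HCN(-CNN) models avoid.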

The results are shown in Table 3. Since there are multiple actions that can be appropriate for a given dialog context, we use per-utterance Precision@K as the performance metric. We also report the F1-score for OOD detection to measure the balance between precision and recall. The performance of HCN on Test-OOD is about 15 points lower on average than on Test, showing the detrimental impact of OOD utterances on models trained only on in-domain data. AE-HCN (AE-HCN-CNN) outperforms HCN on Test-OOD by a large margin, about 17 (20) points on average, while keeping the performance trade-off on Test to a minimum. Interestingly, AE-HCN-CNN performs even better than HCN on Test, indicating that, with the CNN encoder, counterfeit OOD augmentation acts as an effective regularizer. In contrast, AE-HCN-Indep fails to robustly detect OOD utterances, resulting in much lower numbers for both metrics on Test-OOD as well as hurting performance on Test. This result indicates two crucial points: 1) the inherent difficulty of finding an appropriate threshold value without actually seeing OOD data; 2) the limitation of models that do not consider context. For the first point, Figure 2 plots histograms of reconstruction scores for IND and OOD utterances of bAbI6 Test-OOD. If OOD utterances had been known a priori, the threshold should have been set to a much higher value than the maximum reconstruction score of the IND training data (6.16 in this case).

Figure 2: Histograms of AE reconstruction scores for the bAbI6 test data. The histograms for other datasets follow similar trends.

For the second point, Table 4 shows the search for the best threshold value for AE-HCN-Indep on the bAbI6 task when given actual OOD utterances (which is highly unrealistic in a real-world scenario). Note that the best performance, achieved at threshold 9, is still not as good as that of AE-HCN(-CNN). This implies that we can perform better OOD detection by jointly considering other context features.

Threshold Precision@1 OOD F1
6 40.39 48.38
7 42.56 50.46
8 43.69 51.08
9 52.21 63.86
10 47.27 44.44
Table 4: Performances of AE-HCN-Indep on bAbI6 Test-OOD with different thresholds.

Finally, we conduct a sensitivity analysis by varying the counterfeit OOD probability. Table 5 shows the performance of AE-HCN-CNN on bAbI6 Test-OOD with different rates, ranging from 5% to 30%. The results indicate that our method produces good performance regardless of the rate. This stability nicely contrasts with the high sensitivity of AE-HCN-Indep with regard to threshold values, as shown in Table 4.

Test Data Test Test-OOD Test-OOD
Counterfeit OOD Rate Precision@1 Precision@1 OOD F1
5% 55.25 55.48 69.72
10% 55.08 57.29 74.73
15% 55.04 55.35 70.38
20% 53.48 56.53 75.55
25% 53.72 56.66 73.13
30% 54.87 56.02 71.44
Table 5: Performances of AE-HCN-CNN on bAbI6 Test-OOD with varying counterfeit OOD rates.

5 Conclusion

We proposed a novel OOD detection method that requires no OOD data and imposes no restrictions such as the need for multiple sub-domains, by utilizing counterfeit OOD turns in the context of a dialog. We also release new dialog datasets, three publicly available dialog corpora augmented with natural OOD turns, to foster further research. In the presence of OOD utterances, our method outperforms state-of-the-art dialog models equipped with an OOD detection mechanism by a large margin (more than 17 points in Precision@K on average) while minimizing the performance trade-off on in-domain test data. The detailed analysis sheds light on the difficulty of optimizing context-independent OOD detection and justifies the necessity of context-aware OOD handling models. We plan to explore other ways of scoring OOD utterances than autoencoders; for example, variational autoencoders and generative adversarial networks have great potential. We are also interested in using generative models to produce more realistic counterfeit user utterances.


  • [1] L. E. Asri, H. Schulz, S. Sharma, J. Zumer, J. Harris, E. Fine, R. Mehrotra, and K. Suleman (2017) Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of SIGDIAL 2017, pp. 207–219. Cited by: §3.
  • [2] A. Bordes and J. Weston (2016) Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683. Cited by: §3.
  • [3] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. In Syntax, Semantics and Structure in Statistical Translation, pp. 103. Cited by: §2.2.
  • [4] M. Eric, L. Krishnan, F. Charette, and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. In Proceedings of SIGDIAL 2017, pp. 37–49. Cited by: §3.
  • [5] M. Henderson, B. Thomson, and J. D. Williams (2014) The second dialog state tracking challenge. In Proceedings of SIGDIAL 2014, pp. 263–272. Cited by: §3.
  • [6] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1.
  • [7] Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Cited by: §2.3.
  • [8] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR). Cited by: §4.
  • [9] I. Lane, T. Kawahara, T. Matsui, and S. Nakamura (2007) Out-of-domain utterance detection using classification confidences of multiple topics. IEEE Transactions on Audio, Speech, and Language Processing 15 (1), pp. 150–161. Cited by: §1.
  • [10] M. Nakano, S. Sato, K. Komatani, K. Matsuyama, K. Funakoshi, and H. G. Okuno (2011) A two-stage domain selection framework for extensible multi-domain spoken dialogue systems. In Proceedings of the SIGDIAL 2011 Conference, pp. 18–29. Cited by: §1.
  • [11] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014) 12, pp. 1532–1543. Cited by: §2.1.
  • [12] S. Ryu, S. Kim, J. Choi, H. Yu, and G. G. Lee (2017) Neural sentence embedding using only in-domain sentences for out-of-domain sentence detection in dialog systems. Pattern Recognition Letters 88, pp. 26–32. Cited by: §1, §4.
  • [13] P. Shah, D. Hakkani-Tür, G. Tür, A. Rastogi, A. Bapna, N. Nayak, and L. Heck (2018) Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871. Cited by: §3.
  • [14] G. Tur, A. Deoras, and D. Hakkani-Tür (2014) Detecting out-of-domain utterances addressed to a virtual personal assistant. In Fifteenth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [15] J. D. Williams, K. Asadi, and G. Zweig (2017) Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274. Cited by: §1.
  • [16] J. D. Williams, A. Raux, D. Ramachandran, and A. W. Black (2013) The dialog state tracking challenge. In Proceedings of SIGDIAL 2013, pp. 404–413. Cited by: §3.