With the increased use of ML models in real-life settings, it has become imperative for them to self-identify, at test time, examples on which they are likely to fail because these differ significantly from the model's training distribution.
In particular, for state-of-the-art deep classifiers such as those used in vision and language tasks, it has been observed that the raw probability value is poorly calibrated and can be high even for OOD inputs. This necessitates an auxiliary mechanism to detect them.
This task is not entirely novel, and has historically been explored in related forms under various names such as one-class classification and open classification. The recent stream of work on this started with , which proposed benchmark datasets for vision problems.  find that increasing the softmax temperature makes the resultant probability more discriminative for OOD detection.  propose using distances to per-class Gaussians in the intermediate representation learnt by the classifier; specifically, a Gaussian is fit for each training class from all training points in that class.  show that "correcting" the likelihood with the likelihood from a "background" model trained on noisy inputs better discriminates out-of-distribution examples. Recently,  propose using LOF , an older measure from the data mining literature, in the space of penultimate activations learnt by a classifier.
Apart from  and a few others, the majority of prior work uses vision problems and datasets, often image classification, as the setting in which to perform OOD detection. Certain methods, such as input gradient reversal from  or an end-to-end differentiable Generative Adversarial Network (GAN) as in , are not directly applicable to natural language inputs. Furthermore, image classification has several benchmarks with similar label spaces but differing input distributions, such as MNIST, CIFAR and SVHN; most of these works exploit this fact by picking one of these datasets as ID and another as OOD. In this work, we attempt to address these lacunae and specifically explore which OOD detection approaches work well on natural language, in particular intent classification.
This problem is highly relevant for task-oriented dialog systems, since intent classifiers can receive user utterances which do not fall in any of the domains defined by the current ontology or downstream functions. In particular, unsupervised detection approaches are important because it is difficult to curate this kind of data for training:
The size of in-domain data to train on can become arbitrarily large as the concerned dialog system gets more users and acquires the ability to handle newer intent classes. After a point, it becomes impractical to continue curating newer examples for training in proportion to the in-domain data. From there on, class imbalance would keep increasing.
By definition, OOD is an open class. For natural language intents, utterances can exhibit diverse sentence phenomena such as slang, rhetorical questions, code-mixed language, etc. User data can exhibit a large range of OOD behaviours, all of which may be difficult to encapsulate using a limited set of OOD examples at training time.
To the best of our knowledge, this is the first application of likelihood ratios approach for OOD Detection in natural language. Overall, our contributions are as follows:
We release ROSTD, a novel dataset of OOD sentences for intent classification. Our dataset is available at github.com/vgtomahawk/LR_GC_OOD/blob/master/data/fbrelease/OODrelease.tsv. We observe that existing datasets for OOD intent classification either
are too small (1000 examples), or
create OOD examples synthetically.
We show that performing OOD detection on ROSTD is more challenging than in the synthetic setting where OOD examples are created by holding out some fraction of intent classes. We further describe this dataset in §4.
We show that the marginal likelihood of a generative classifier provides a principled way of incorporating label information [like classifier uncertainty based approaches] while still testing for ID vs OOD with a likelihood function.
We show that using likelihood with a correction term from a “background” model, based on the formalism proposed in , is a much more effective approach than using the plain likelihood. We propose multiple ways of training such a background model for natural language inputs.
Our improvements hold on multiple datasets, both on our own dataset and on the existing SNIPS dataset .
All our methods estimate a score which is indicative of a data point being OOD; we refer to this function as the OOD score. This function may additionally be parametrized by the classifier distribution or the set of nearest neighbors, depending on the specific method in use.
Some of our evaluation measures are threshold independent; in this case, the score can be directly evaluated for its goodness at detecting OOD points. For measures which are threshold dependent, an optimal threshold maximizing macro-F1 is picked using the values of the score on a validation set.
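The threshold search can be sketched as follows. This is our own illustration: the function names and the use of midpoints between consecutive sorted scores as candidate thresholds are assumptions, not details from the original setup.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the two classes (0 = ID, 1 = OOD)."""
    f1s = []
    for c in (0, 1):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
    return float(np.mean(f1s))

def pick_threshold(scores, is_ood):
    """Pick the score threshold maximizing macro-F1 on a validation set.
    `scores`: higher means more OOD-like; `is_ood`: 1 for OOD points."""
    scores, is_ood = np.asarray(scores, float), np.asarray(is_ood)
    uniq = np.unique(scores)
    # Candidate thresholds: midpoints between consecutive unique scores.
    cands = (uniq[:-1] + uniq[1:]) / 2 if len(uniq) > 1 else uniq
    best = max(cands, key=lambda t: macro_f1(is_ood, (scores > t).astype(int)))
    return best, macro_f1(is_ood, (scores > best).astype(int))
```

Any scoring function from the following sections can be plugged into this search unchanged.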
Maximum Softmax Probability
Maximum Softmax Probability, or MSP, is a simple and intuitive baseline proposed by . MSP uses one minus the maximum softmax probability as the OOD score: the less "confident" the classifier is about its predicted outcome, i.e. the argmax label, the greater the score. Typically,

$$P(y \mid x) = \frac{\exp(z_y / T)}{\sum_{y'} \exp(z_{y'} / T)}$$

where $z_y$ denotes the logit for label $y$ and $T$ denotes the softmax temperature. Increasing $T$ smoothens the distribution while decreasing it makes the distribution peakier. We also try increased values of $T$, as they were shown to work better by .
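As a concrete sketch (the function name is our own; the score follows the description above):

```python
import numpy as np

def msp_score(logits, T=1.0):
    """OOD score = 1 - max softmax probability at temperature T.
    Higher scores indicate inputs the classifier is less confident on."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                       # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    return float(1.0 - probs.max())
```

Raising the temperature flattens the softmax, which was observed to make the resulting score more discriminative.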
Alternatively, both  and  propose using either of the following (the two differ only by a constant and have exactly the same minima, as we show in Appendix 1.1; the appendix can be read at github.com/vgtomahawk/LR_GC_OOD/tree/master/appendix):
Negative KL Divergence
w.r.t. the uniform distribution over labels, i.e. $-D_{\mathrm{KL}}\big(P(\cdot \mid x) \,\|\, \mathcal{U}\big)$.
We refer to this method as the KL score in our experiments. (One distinction from the two cited papers is that they use this quantity merely as an auxiliary training objective, and still use MSP as the score at test time; in contrast, we use it directly as the OOD score.) Here, we also experiment with a variant of this method which replaces the uniform distribution with the empirical label prior, i.e. the fraction of each class in the training set. We expect this variant to do better when ID classes are not distributed equally.
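A minimal sketch of this score follows; the function name is ours, and the `prior` argument implements the label-prior variant described above.

```python
import numpy as np

def neg_kl_score(probs, prior=None):
    """Score = -KL(p || q), where q is the uniform distribution by
    default, or the empirical label prior in the variant. Softmax
    distributions close to q score high, i.e. look more OOD."""
    p = np.asarray(probs, dtype=float)
    q = np.full_like(p, 1.0 / len(p)) if prior is None else np.asarray(prior, float)
    mask = p > 0                          # 0 * log 0 = 0 by convention
    return float(-np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

The score is maximized (at zero) exactly when the predicted distribution equals the reference distribution, and grows more negative the more peaked the prediction is.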
Local Outlier Factor
LOF, proposed by , is a measure based on local density defined with respect to nearest neighbours. Recently,  effectively used LOF in the intermediate representation learnt by a classifier for OOD detection of intents. The LOF measure can be defined in three steps:
First, the reachability distance of $a$ from $b$ is defined as $\text{reachdist}_k(a, b) = \max\{d_k(b),\, d(a, b)\}$. Here, $d_k(b)$ is the distance from $b$ to its $k$-th nearest neighbor, while $d$ is the distance measure being used. Intuitively, $\text{reachdist}_k(a, b)$ is lower-bounded by $d_k(b)$, but can become arbitrarily large.
Next, define a measure named local reachability density, or lrd. This is simply the reciprocal of the average reachability distance of $a$ from its $k$ nearest neighbors $N_k(a)$: $\text{lrd}_k(a) = \big(\frac{1}{|N_k(a)|} \sum_{b \in N_k(a)} \text{reachdist}_k(a, b)\big)^{-1}$.
Lastly, LOF is defined as the average ratio of the neighbors' local reachability density to the point's own: $\text{LOF}_k(a) = \frac{1}{|N_k(a)|} \sum_{b \in N_k(a)} \frac{\text{lrd}_k(b)}{\text{lrd}_k(a)}$.
Intuitively, if the “density” around a point’s nearest neighbours is higher than its own “density”, the point will have a higher LOF. Points with a higher score are more likely to be OOD.
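The three-step definition can be computed directly. The following numpy sketch is our own, is O(n²), and is only for illustration; the experiments use scikit-learn's implementation.

```python
import numpy as np

def lof_scores(X, k=3):
    """LOF for every row of X (n x d), computed from the definition:
    reach-dist -> local reachability density (lrd) -> LOF."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                      # exclude self-distance
    knn = np.argsort(D, axis=1)[:, :k]               # k nearest neighbours
    d_knn = np.take_along_axis(D, knn, axis=1)       # their distances, sorted
    kdist = d_knn[:, -1]                             # distance to k-th neighbour
    # reach-dist_k(a, b) = max(k-distance(b), d(a, b))
    reach = np.maximum(kdist[knn], d_knn)
    lrd = 1.0 / reach.mean(axis=1)                   # local reachability density
    return lrd[knn].mean(axis=1) / lrd               # neighbours' density vs own
```

On a tight cluster plus one distant point, the cluster members score near 1 while the distant point's score is large, matching the intuition above.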
 further show that using the large margin cosine loss, or LMCL, works better than the typical combination of softmax + cross-entropy:

$$L_{\mathrm{LMCL}} = -\frac{1}{N} \sum_{i} \log \frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j \neq y_i} e^{s \cos\theta_j}}, \qquad \cos\theta_j = \hat{W}_j^{\top} \hat{x}_i$$

Here, $m$ denotes the margin and $s$ a scaling factor. $W_j$ denotes the row in the final linear-layer weight matrix corresponding to label $j$, and $x_i$ denotes the penultimate-layer activations which are input to the final layer. We use $\hat{W}_j$ and $\hat{x}_i$ to denote the $\ell_2$-normalized versions of $W_j$ and $x_i$.
We denote this approach as LOF+LMCL. We directly use the authors' implementation (https://github.com/thuiar/DeepUnkID) for this approach.
Here, the score is based on the likelihood of the input according to a model trained on the ID training points. In the simplest case, this is simply a left-to-right language model learnt on all our training sentences. Later, we discuss another class of models, generative classifiers, which also provide a valid likelihood.
[16, 2] found that likelihood is poor at separating out OOD examples, in some cases even assigning them higher likelihood than the ID test split.  make similar observations when detecting OOD DNA sequences. They then propose likelihood ratio, or LLR, based methods, which we briefly revisit here.
Let $P_m(x)$ and $P_b(x)$ denote the probability of input $x$ according to the model and a background model respectively. $P_m$ is trained on the training set, while $P_b$ is trained on noised samples from the training set. Let $x_{<t}$ denote the prefix $x_1, \ldots, x_{t-1}$. The LLR is derived as in Equation 2:

$$\mathrm{LLR}(x) = \log \frac{P_m(x)}{P_b(x)} = \sum_{t} \big[ \log P_m(x_t \mid x_{<t}) - \log P_b(x_t \mid x_{<t}) \big] \quad (2)$$
The intuition is that "surface-level" features might cause OOD points to be assigned a reasonable probability. The hypothesis is that the background model captures these "surface-level" features, which persist after noising, so that their influence is removed when the likelihoods are divided. If so, the ratio would be a better OOD score than the plain likelihood.
How to introduce noise?
It is common practice in vision to perturb images slightly by adding a Gaussian noise vector of small magnitude. Since natural language utterances are sequences of discrete words, this does not extend directly to them.
A simple alternative for introducing noise into natural language inputs is random word substitution. Word-substitution-based noise has a long precedent of use, from negative sampling as in word2vec [15, 7]
to autoencoder-like objectives like BERT's .
More specifically, with probability $p_{\mathrm{noise}}$, we substitute each word with a word sampled from a noising distribution $q$.
$p_{\mathrm{noise}}$ is a hyperparameter to be tuned. We experiment with 3 different choices of $q$:
UNIFORM: i.e. each word is equally likely.
UNIGRAM: i.e. a word is sampled with probability proportional to its unigram frequency in the training corpus.
UNIROOT: i.e. a word is sampled with probability proportional to the square root of its frequency. Using such smoothed versions of the unigram frequency distribution has precedent in other NLP tasks. For instance, word2vec negative sampling draws negative contexts proportionally to $f(w)^{3/4}$, where $f(w)$ is the word's frequency (see the footnote concluding Page 2 of ).
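The three noising schemes can be sketched as below. The function and argument names are ours, and raw unigram counts aligned with the vocabulary are assumed to be available.

```python
import numpy as np

def noise_tokens(tokens, vocab, counts, p_noise=0.3, scheme="UNIROOT", seed=0):
    """Replace each token with probability p_noise by a word sampled
    from the chosen distribution over `vocab` (aligned with `counts`)."""
    rng = np.random.default_rng(seed)
    f = np.asarray(counts, dtype=float)
    if scheme == "UNIFORM":
        weights = np.ones_like(f)        # every word equally likely
    elif scheme == "UNIGRAM":
        weights = f                      # proportional to raw frequency
    elif scheme == "UNIROOT":
        weights = np.sqrt(f)             # square-root-smoothed frequency
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    q = weights / weights.sum()
    return [vocab[rng.choice(len(vocab), p=q)] if rng.random() < p_noise else tok
            for tok in tokens]
```

UNIROOT sits between the other two: rare words are sampled more often than under UNIGRAM, but frequent words still dominate more than under UNIFORM.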
Choice of architecture
Since this is, to our knowledge, the first work to extend the LLR method to NLP, we use a simple and standard architecture for the background model: a left-to-right LSTM language model  with a single layer. We vary the hidden state size as a hyperparameter. Note that the background model is not class-conditional; it does not use the labels in any way.
An additional consideration is that the background model should not have a very large number of parameters or a large time complexity at test time. In this regard too, an LSTM language model with a small state size is apt. We refer to approaches which use this architecture for the background model with the suffix +BackLM.
Typical classification models estimate the conditional probability of the label given the input, i.e. $P(y \mid x)$. An alternative paradigm learns to estimate $P(x \mid y)$, additionally estimating $P(y)$ from the training set label ratios. Using Bayes' rule,

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}$$
Classifiers of this paradigm are called generative classifiers, in contrast to the typical discriminative classifiers.
 compare the two paradigms and find generative classifiers useful for a) high sample efficiency, b) continual learning, and c) an explicit marginal likelihood. The last point is particularly useful for us, since we can use the explicit marginal likelihood $P(x) = \sum_{y} P(x \mid y)\, P(y)$, which is directly available from a trained generative classifier, as our OOD score.
 also propose a deep architecture for generative text classifiers, consisting of a unidirectional LSTM shared across classes and a label embedding matrix. The respective label embedding is concatenated to the current hidden state, and a final layer is then applied on this vector to give the distribution over the next word. The per-word cross-entropy serves as the loss function. We use a similar architecture, as illustrated in Figure 1.
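Once per-class log-likelihoods $\log P(x \mid y)$ are available from such a classifier, the marginal is a numerically stable logsumexp. This is a sketch with our own function names:

```python
import numpy as np

def marginal_loglik(class_logliks, log_prior):
    """log P(x) = logsumexp_y [ log P(x|y) + log P(y) ].
    Low marginal log-likelihood suggests the input is OOD."""
    a = np.asarray(class_logliks, float) + np.asarray(log_prior, float)
    m = a.max()                          # subtract max for numerical stability
    return float(m + np.log(np.exp(a - m).sum()))
```

Subtracting the maximum before exponentiating avoids underflow, which matters because per-token log-likelihoods of whole sentences are typically large negative numbers.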
For the threshold-dependent measures, we tune the threshold on the validation set. We use the following metrics to measure OOD detection performance:
FPR@k%TPR: on picking a threshold such that the OOD recall is k%, what fraction of the predicted OOD points are ID? FPR denotes False Positive Rate and TPR denotes True Positive Rate; note that "positive" here refers to the OOD class. We choose a high value of k. The lower this value, the better the OOD detection.
AUROC: measures the area under the Receiver Operating Characteristic, also known as the ROC curve; note that this curve is for the OOD class.  first proposed using this. The higher the value, the better the OOD detection. This metric is threshold independent.
AUPR: the area under the Precision-Recall curve is another threshold-independent metric, based on the Precision-Recall curve. Unlike AUROC, AUPR remains informative under class imbalance . AUPR-ID and AUPR-OOD correspond to taking ID and OOD respectively as the positive class.
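The FPR@k%TPR metric can be computed as in this sketch (our own implementation; AUROC and AUPR are available off the shelf, e.g. in scikit-learn):

```python
import numpy as np

def fpr_at_tpr(scores, is_ood, tpr=0.95):
    """Fraction of ID points scored above the threshold at which
    OOD recall (TPR, with OOD as positive) reaches `tpr`.
    Lower is better."""
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood).astype(bool)
    ood_sorted = np.sort(scores[is_ood])
    # Smallest threshold keeping at least `tpr` of OOD points above it.
    idx = int(np.floor((1.0 - tpr) * len(ood_sorted)))
    thresh = ood_sorted[idx]
    return float(np.mean(scores[~is_ood] >= thresh))
```

A perfect detector yields 0.0 here, since all ID points can be kept below the threshold while still recalling the required fraction of OOD points.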
We use two datasets for our experiments. The first, SNIPS, is a widely used, publicly available dataset, but does not contain actual OOD intents. The second, ROSTD, is a combination of a dataset released earlier  with new OOD examples collected by us. We briefly describe both in order; Table 1 also provides summary statistics for these datasets.
Released by , SNIPS consists of sentences spread over intent classes such as GetWeather, RateBook, etc. As discussed previously, it does not explicitly include OOD sentences.
We follow the procedure described in  to synthetically create OOD examples. Intent classes which in combination cover at least a given fraction of the training points are retained as ID. Examples from the remaining classes are treated as OOD and removed from the training set; in the validation and test sets, examples from these classes are relabelled to the single class label OOD. Besides not being genuinely OOD, another issue with this dataset is that the validation and test splits are quite small.
In §5, we report experiments with ID fractions of 75% and 25%, both of which were used in  (see the appendix for the experimental results with the 25% setting). We refer to these datasets as SNIPS,75% and SNIPS,25% respectively. Since multiple ID-OOD splits of the classes satisfy these ratios, our results are averaged across 5 randomly chosen splits.
We release a dataset of OOD sentences, curated to be explicitly OOD with respect to the English split of the recently released dataset of intents from , which serves as the ID dataset. We chose this dataset over SNIPS owing to its considerably larger size. The sentences were authored by human annotators, with instructions as described in the subsection Annotation Guidelines.
We use human annotators to author intents which are explicitly out-of-domain with respect to the English split of . The requirements and instructions for annotation were as follows:
The OOD utterances were authored by several distinct English-speaking annotators from Anglophone countries.
The annotators were asked to author sentences which were both grammatical and semantically sensible as English sentences. This was to prevent our OOD data from becoming trivial by inclusion of ungrammatical sentences, gibberish and nonsensical sentences.
The annotators were well informed of existing intent classes, to prevent them from authoring in-domain intents. This was done by presenting the annotators with examples from the training split of each intent class, with the option to scroll for more through a dropdown.
After the first round of annotators had authored such intents, each intent was post-annotated as in-domain vs out-of-domain by two fresh annotators who were not involved in the authoring stage.
If both annotators agreed that the example was OOD, it was retained. If both agreed it was ID, it was discarded. In the event the two annotators disagreed, an additional, third annotator was asked to label the example and adjudicate the disagreement.
During post-processing, we removed utterances which were shorter than three words.
We identify six qualitative categories which might make a sentence OOD, and manually assign each example to these categories. We summarize their distribution in Table 1. Note that since these categories are not mutually exclusive, an example may be assigned multiple categories.
The ID examples from , which we also use as the ID portion of ROSTD, have hierarchical class labels (see  for the full list). Hence, ROSTD has a large number of classes (12), not all of which are equally distinct from each other. To ensure that our results are not specific to settings with this kind of hierarchical label structure, we also experiment with retaining only the topmost or most "coarse" label on each example. We refer to this variant with "coarsened" labels as ROSTD-COARSE.
| Statistic | ROSTD | SNIPS |
|---|---|---|
| Unique word types | 11.5K | 11.4K |
| Mean utterance length | 6.85 | 6.79 |
| Number of ID classes | 12 / 3 (coarse) | 7 |
|ROSTD|MSP|54.22 ± 4.01|100.00 ± 0.00|70.75 ± 3.70|55.68 ± 6.36|
| |MSP,|55.45 ± 4.19|60.48 ± 3.17|76.94 ± 4.01|59.64 ± 6.49|
| | |55.31 ± 3.89|60.15 ± 3.11|76.86 ± 3.85|59.54 ± 6.10|
| | |83.24 ± 2.78|21.31 ± 8.38|95.78 ± 1.30|90.32 ± 1.97|
| |LOF|64.46 ± 2.57|42.49 ± 3.49|81.39 ± 2.38|46.89 ± 3.50|
| |LOF+LMCL|85.97 ± 2.00|15.03 ± 5.42|95.60 ± 0.75|82.71 ± 9.17|
| | |81.38 ± 0.19|18.92 ± 0.56|95.42 ± 0.11|87.38 ± 0.41|
| |+BackLM+UNIFORM|85.25 ± 0.72|36.65 ± 6.87|94.71 ± 0.49|91.10 ± 0.63|
| |+BackLM+UNIGRAM|82.27 ± 0.74|42.16 ± 3.62|93.62 ± 0.43|89.30 ± 0.50|
| |+BackLM+UNIROOT|87.42 ± 0.45|20.10 ± 5.25|96.35 ± 0.41|93.44 ± 0.37|
| | |86.25 ± 0.71|10.86 ± 1.08|97.42 ± 0.28|92.30 ± 0.99|
| |+BackLM+UNIFORM|89.60 ± 0.56|13.71 ± 5.64|97.67 ± 0.35|95.49 ± 0.42|
| |+BackLM+UNIGRAM|91.35 ± 2.62|10.55 ± 4.11|97.87 ± 0.49|95.86 ± 0.68|
| |+BackLM+UNIROOT|91.17 ± 0.32|7.41 ± 1.88|98.22 ± 0.26|96.47 ± 0.29|
|ROSTD-COARSE|MSP|59.99 ± 19.01|26.00 ± 34.32|71.63 ± 15.55|64.32 ± 19.36|
| |MSP,|64.62 ± 15.31|64.46 ± 9.84|78.39 ± 11.92|66.89 ± 11.76|
| | |65.36 ± 15.49|65.39 ± 4.84|79.05 ± 11.40|67.79 ± 19.43|
| | |81.56 ± 8.51|17.78 ± 15.70|93.47 ± 6.25|87.49 ± 8.94|
| |LOF|62.39 ± 9.01|46.55 ± 17.56|78.07 ± 12.23|45.80 ± 12.21|
| |LOF+LMCL|84.28 ± 3.44|15.24 ± 4.70|95.19 ± 1.03|76.63 ± 2.53|
| | |80.48 ± 0.27|20.78 ± 0.71|95.20 ± 0.07|86.87 ± 0.13|
| |+BackLM+UNIFORM|85.97 ± 0.65|30.65 ± 4.51|95.27 ± 0.47|91.98 ± 0.61|
| |+BackLM+UNIGRAM|84.46 ± 0.62|31.79 ± 3.04|94.93 ± 0.32|91.22 ± 0.50|
| |+BackLM+UNIROOT|88.25 ± 0.50|16.35 ± 1.32|96.82 ± 0.12|94.10 ± 0.20|
| | |86.67 ± 0.34|9.88 ± 0.44|97.58 ± 0.08|92.74 ± 0.29|
| |+BackLM+UNIFORM|89.32 ± 0.30|8.04 ± 0.69|97.83 ± 0.15|95.27 ± 0.32|
| |+BackLM+UNIGRAM|90.05 ± 0.73|6.69 ± 0.82|98.16 ± 0.15|95.61 ± 0.50|
| |+BackLM+UNIROOT|90.14 ± 0.39|6.78 ± 0.60|98.30 ± 0.09|95.96 ± 0.37|
|SNIPS, 75%|MSP|81.58 ± 7.68|16.68 ± 18.06|93.51 ± 4.49|85.03 ± 6.19|
| |MSP,|83.94 ± 6.82|31.32 ± 30.25|94.30 ± 4.50|88.44 ± 5.72|
| | |84.23 ± 7.22|29.28 ± 27.04|94.51 ± 4.38|88.71 ± 6.15|
| |LOF|66.07 ± 8.82|49.56 ± 13.49|79.65 ± 7.81|51.69 ± 13.08|
| |LOF+LMCL|76.24 ± 9.34|42.27 ± 20.64|90.37 ± 6.54|77.81 ± 10.53|
| | |63.51 ± 6.33|54.56 ± 12.13|81.72 ± 5.90|62.12 ± 13.28|
| |+BackLM+UNIFORM|74.74 ± 3.25|44.60 ± 12.01|90.02 ± 2.24|80.08 ± 3.31|
| |+BackLM+UNIGRAM|81.19 ± 3.53|27.00 ± 8.71|93.97 ± 1.85|87.57 ± 3.38|
| |+BackLM+UNIROOT|78.75 ± 3.25|35.24 ± 11.08|92.66 ± 1.89|84.84 ± 3.36|
| | |67.31 ± 7.06|44.60 ± 18.53|85.17 ± 7.18|68.11 ± 14.29|
| |+BackLM+UNIFORM|78.37 ± 6.60|29.28 ± 4.23|92.35 ± 2.77|82.84 ± 7.33|
| |+BackLM+UNIGRAM|85.47 ± 6.90|18.48 ± 11.26|95.79 ± 2.67|90.98 ± 6.73|
| |+BackLM+UNIROOT|81.91 ± 6.83|22.24 ± 6.26|94.15 ± 2.59|86.60 ± 7.03|
We compile the results of all our experiments in Table 3.
All experiments are averaged across 5 seeds. We use PyTorch 1.0 to implement our models; code is available at github.com/vgtomahawk/LR_GC_OOD.
The checkpoint with the highest validation F1 on the ID subset of the validation set is chosen as the final checkpoint for computing the other OOD evaluation metrics. For the label-agnostic approaches, the checkpoint with the lowest validation perplexity is chosen. For the +BackLM approaches, we pick the noising probability which works best on the validation set among the values we experimented with.
Base Classifier Architectures
For the discriminative classifier, we use a 1-layer bidirectional LSTM with embedding size 100, a projection layer from 100 to 300 dimensions (to project up the embeddings), and embeddings initialized with GloVe (glove.6B.100D) . The generative classifier approaches have a similar architecture, except that they are unidirectional and have additional label embeddings.
We use the scikit-learn 0.21.2 implementation of LOF (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html) . We fix the number of nearest neighbors but tune the contamination rate as a hyperparameter. We also corroborated over email correspondence with the authors of  that they had used a similar hyperparameter setting for LOF.
From Table 3, we see that the generative classifier marginal likelihood outperforms uncertainty-based and nearest-neighbour approaches by a reasonable margin on both datasets. It is also significantly better than the language model likelihood. This validates our hypothesis that generative classifiers effectively combine the benefits of likelihood-based and uncertainty-based approaches.
Furthermore, LLR-based approaches always outperform the respective likelihood-only approach, whether the base model is the language model or the generative classifier. Amongst the different noising methods, the performance improvement is typically largest with the UNIROOT approach we propose; for instance, on ROSTD, the UNIROOT variants achieve the best results on most metrics (Table 3).
A clear advantage of ROSTD, evident from the experiments, is that differences in performance between the various methods are much more pronounced on it than on SNIPS. On SNIPS, the simple MSP baseline itself reaches 80-90% of the best-performing approach on most metrics.
To the best of our knowledge, this is the first work to use an approach based on generative text classifiers for OOD detection. Our experiments show that this approach can significantly outperform existing paradigms on multiple datasets.
Furthermore, we are the first to flesh out ways of using the likelihood ratio based approaches, first formalized by  for OOD detection, in NLP. The original work had tested these approaches only on DNA sequences, which have a radically smaller vocabulary than natural language sentences. We propose UNIROOT, a new way of noising inputs which works better for natural language. Our method improves two different likelihood-based approaches on multiple datasets.
Lastly, we curate and publicly release ROSTD, a novel dataset of OOD intents w.r.t. the intents in . We hope ROSTD fosters further research and serves as a useful benchmark for OOD detection.
We thank Tony Lin and co-authors for promptly answering several questions about their paper, and Sachin Kumar for valuable discussion on methods. We also thank Hiroaki Hayashi and 3 anonymous reviewers for valuable comments.
- Breunig, Kriegel, Ng, and Sander (2000) LOF: identifying density-based local outliers. In ACM SIGMOD Record, Vol. 29, pp. 93–104. Cited by: §1, §2.
- Choi, Jang, and Alemi (2018) Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392. Cited by: §2.
- Coucke et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190. Cited by: item 4, §4.
- Davis and Goadrich (2006) The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240. Cited by: 3rd item.
- Devlin, Chang, Lee, and Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Cited by: §2.
- Dyer (2014) Notes on noise contrastive estimation and negative sampling. arXiv preprint arXiv:1410.8251. Cited by: item 2.
- Goldberg and Levy (2014) word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722. Cited by: item 3, §2.
- Guo, Pleiss, Sun, and Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1321–1330. Cited by: §1.
- Hendrycks and Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR 2017, Toulon, France. Cited by: §1, §2, 2nd item.
- Hendrycks, Mazeika, and Dietterich (2019) Deep anomaly detection with outlier exposure. In ICLR 2019, New Orleans, LA, USA. Cited by: §2.
- Lee, Lee, Lee, and Shin (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR 2018, Vancouver, BC, Canada. Cited by: §1, §2.
- Lee, Lee, Lee, and Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §1.
- Liang, Li, and Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR 2018, Vancouver, BC, Canada. Cited by: §1, §1, §2.
- Lin and Xu (2019) Deep unknown intent detection with margin loss. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5491–5496. Cited by: §1, §1, §2, §2, §4, §4, §5.
- Mikolov et al. (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119. Cited by: §2.
- Nalisnick et al. (2019) Do deep generative models know what they don't know? In ICLR 2019, New Orleans, LA, USA. Cited by: §2.
- Paszke et al. (2017) Automatic differentiation in PyTorch. OpenReview. Cited by: §5.
- Pedregosa et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830. Cited by: §5.
- Pennington, Socher, and Manning (2014) GloVe: global vectors for word representation. In Proceedings of EMNLP 2014, pp. 1532–1543. Cited by: §5.
- Ren et al. (2019) Likelihood ratios for out-of-distribution detection. arXiv preprint arXiv:1906.02845. Cited by: item 3, §1, §2, §6.
- Schuster, Gupta, Shah, and Lewis (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3795–3805. Cited by: title, §4, §4, §4, Table 2, §4, §6, footnote 7.
- Sundermeyer, Schlüter, and Ney (2012) LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association. Cited by: §2.
- Yogatama, Dyer, Ling, and Blunsom (2017) Generative and discriminative text classification with recurrent neural networks. arXiv preprint arXiv:1703.01898. Cited by: §2, §2.