Likelihood Ratios and Generative Classifiers for Unsupervised Out-of-Domain Detection In Task Oriented Dialog

12/30/2019 · Varun Gangal et al. · Carnegie Mellon University and Facebook

The task of identifying out-of-domain (OOD) input examples directly at test-time has seen renewed interest recently due to increased real-world deployment of models. In this work, we focus on OOD detection for natural language sentence inputs to task-based dialog systems. Our findings are three-fold: First, we curate and release ROSTD (Real Out-of-Domain Sentences From Task-oriented Dialog), a dataset of 4K OOD examples for the publicly available dataset from (Schuster et al. 2019). In contrast to existing settings, which synthesize OOD examples by holding out a subset of classes, our examples were authored by annotators with a priori instructions to be out-of-domain with respect to the sentences in an existing dataset. Second, we explore likelihood ratio based approaches as an alternative to the currently prevalent paradigms. Specifically, we reformulate and apply these approaches to natural language inputs. We find that they match or outperform the prevalent approaches on all datasets, with larger improvements on non-artificial OOD benchmarks such as our dataset. Our ablations validate that specifically using likelihood ratios, rather than plain likelihood, is necessary to discriminate well between OOD and in-domain data. Third, we propose learning a generative classifier and computing a marginal likelihood (ratio) for OOD detection. This allows us to use a principled likelihood while at the same time exploiting training-time labels. We find that this approach outperforms both simple likelihood (ratio) based and other prior approaches. To our knowledge, we are the first to investigate the use of generative classifiers for OOD detection at test-time.


1 Introduction

With the increased use of ML models in real-life settings, it has become imperative for models to self-identify, at test time, examples that differ significantly from their training distribution and on which they are therefore likely to fail.

In particular, for state-of-the-art deep classifiers such as those used in vision and language tasks, it has been observed that the raw probability value is poorly calibrated and overconfident [8], and can be high even for OOD inputs. This necessitates an auxiliary mechanism to detect such inputs.

This task is not entirely novel, and has historically been explored in related forms under various names such as one-class classification and open classification. The recent stream of work started with [9], which proposed benchmark datasets for OOD detection on vision problems. [13] find that increasing the softmax temperature makes the resultant probability more discriminative for OOD detection. [12] propose using distances to per-class Gaussians in the intermediate representation learnt by the classifier; specifically, a Gaussian is fit for each training class from all training points in that class. [20] show that "correcting" likelihood with the likelihood from a "background" model trained on noisy inputs is better at discriminating out-of-distribution examples. Recently, [14] propose using an older measure from the data mining literature named LOF [1] in the space of penultimate activations learnt by a classifier.

Apart from [14] and a few others, the majority of prior work uses vision problems and datasets, often image classification, as the setting in which to perform OOD detection. Certain methods, such as input gradient reversal from [13] or an end-to-end differentiable Generative Adversarial Network (GAN) as in [11], are not directly applicable to natural language inputs. Furthermore, image classification has several available benchmarks with similar label spaces (digits and numbers) but differing input distributions, such as MNIST, CIFAR and SVHN; most of these works exploit this fact by picking one of these datasets as ID and another as OOD. In this work, we attempt to address these lacunae and specifically explore which OOD detection approaches work well on natural language, in particular, intent classification.

This problem is greatly relevant for task-oriented dialog systems, since the intent classifier can receive user utterances which are not in any of the domains defined by the current ontology or its downstream functions. In particular, unsupervised detection approaches are important, as it is difficult to curate this kind of data for training because:

  1. The size of the in-domain training data can become arbitrarily large as the dialog system gains users and acquires the ability to handle newer intent classes. After a point, it becomes impractical to keep curating OOD examples in proportion to the in-domain data; from then on, class imbalance keeps increasing.

  2. By definition, OOD is an open class. For natural language intents, utterances can demonstrate diverse sentence phenomena such as slang, rhetorical questions, code-mixed language, etc. User data can exhibit a large range of OOD behaviours, all of which may be difficult to encapsulate using a limited set of OOD examples at training time.

To the best of our knowledge, this is the first application of the likelihood-ratio approach to OOD detection in natural language. Overall, our contributions are as follows:

  1. We release ROSTD, a novel dataset of OOD sentences for intent classification (available at github.com/vgtomahawk/LR_GC_OOD/blob/master/data/fbrelease/OODrelease.tsv). We observe that existing datasets for OOD intent classification

    • are too small (1000 examples), and

    • create OOD examples synthetically

    We show that performing OOD detection on ROSTD is more challenging than in the synthetic setting, where OOD examples are created by holding out some fraction of intent classes. We describe this dataset further in §4.

  2. We show that using the marginal likelihood of a generative classifier provides a principled way to incorporate label information [as classifier uncertainty based approaches do] while still testing for ID vs. OOD using a likelihood function.

  3. We show that using likelihood with a correction term from a “background” model, based on the formalism proposed in [20], is a much more effective approach than using the plain likelihood. We propose multiple ways of training such a background model for natural language inputs.

  4. Our improvements hold on multiple datasets: both our new dataset and the existing SNIPS dataset [3].

2 Methods

Figure 1: We illustrate the architecture of our generative classifier. $E_w$ and $E_y$ denote the word embeddings and the label embeddings respectively. The hidden state $h_t$ is concatenated with the label embedding for "Restaurant" before passing through the output layer. Best seen in color.

All our methods attempt to estimate a score indicative of a data point being OOD. For an input $x$, we refer to this function as $s(x)$; higher values of $s(x)$ indicate a more likely OOD point. This function may additionally be parametrized by the classifier distribution $P(y|x)$ or a set of nearest neighbors, based on the specific method in use.

Some of our evaluation measures are threshold independent; in this case, $s(x)$ can be directly evaluated for its goodness at detecting OOD points. For measures which are threshold dependent, an optimal threshold which maximizes macro-F1 is picked using the values of $s(x)$ on a validation set.
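As a minimal sketch (assuming numpy arrays of scores and binary OOD labels; the function name is ours), the threshold selection could look like:

```python
import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(val_scores, val_is_ood):
    """Pick the threshold on s(x) which maximizes macro-F1 on validation."""
    candidates = np.unique(val_scores)  # every observed score is a candidate
    f1s = [f1_score(val_is_ood, (val_scores >= t).astype(int), average="macro")
           for t in candidates]
    return candidates[int(np.argmax(f1s))]
```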

Maximum Softmax Probability

Maximum Softmax Probability, or MSP, is a simple and intuitive baseline proposed by [9]. MSP uses $s(x) = 1 - \max_y P(y|x)$. The less "confident" the classifier is about its predicted outcome, i.e. the argmax label, the greater $s(x)$ is. Typically,

$P(y|x) = \frac{\exp(z_y / T)}{\sum_{y'} \exp(z_{y'} / T)}$

Here, $z_y$ denotes the logit for label $y$, while $T$ denotes the softmax temperature. Increasing $T$ smoothens the distribution, while decreasing it makes the distribution peakier. We also try increased values of $T$, as these were shown to work better by [13].
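As a minimal sketch (function and variable names are ours, not from the paper), the MSP score can be computed from classifier logits as follows:

```python
import torch
import torch.nn.functional as F

def msp_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """s(x) = 1 - max_y P(y|x); a higher score suggests a more likely OOD input."""
    probs = F.softmax(logits / T, dim=-1)   # temperature-scaled softmax
    return 1.0 - probs.max(dim=-1).values   # low confidence -> high OOD score
```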

Softmax Entropy

Alternatively, both [11] and [10] propose using either of the following two quantities (the two differ only by a constant and have exactly the same minima, as we show in Appendix 1.1; the appendix can be read at github.com/vgtomahawk/LR_GC_OOD/tree/master/appendix):

  1. The entropy $H(P(\cdot|x))$ as $s(x)$.

  2. The negative KL divergence $-D_{KL}(P(\cdot|x) \| U)$ w.r.t. the uniform distribution over labels $U$.

We refer to this method as $-KL_U$ in our experiments. (One distinction from the two cited papers is that they use this quantity merely as an auxiliary training objective and still use MSP as $s(x)$ at test time; in contrast, we explicitly use it as $s(x)$.) Here, we also experiment with a variant of this method which replaces $U$ with the empirical label prior $\hat{P}$, where $\hat{P}_c = N_c / N$ is the fraction of class $c$ in the training set; we denote this variant $-KL_{\hat{P}}$. We expect it to do better when the ID classes are not distributed equally.
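A minimal sketch of both scores (names are ours; `prior` would hold the training-set label frequencies $\hat{P}$):

```python
import torch
import torch.nn.functional as F

def entropy_score(logits: torch.Tensor) -> torch.Tensor:
    """s(x) = H(P(.|x)); equals -KL(P || U) up to an additive constant."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def neg_kl_prior_score(logits: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
    """s(x) = -KL(P(.|x) || P_hat), the class-prior variant."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * (logp - prior.log())).sum(dim=-1)
```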

Local Outlier Factor

LOF, proposed by [1], is a measure based on local density defined with respect to nearest neighbours. Recently, [14] effectively used LOF in the intermediate representation learnt by a classifier for OOD detection of intents. The LOF measure can be defined in three steps:

  1. First, define the reachability distance $rd_k(a, b) = \max(kd(b), d(a, b))$. Here, $kd(b)$ is the distance from $b$ to its $k$th nearest neighbor, while $d$ is the distance measure being used. Intuitively, $rd_k(a, b)$ is lower-bounded by $kd(b)$, but can become arbitrarily large.

  2. Next, define a measure named local reachability density, $lrd_k(a)$. This is simply the reciprocal of the average reachability distance $rd_k(a, b)$ over the $k$ nearest neighbors $b$ of $a$.

  3. Lastly, LOF is defined as the ratio of the average $lrd_k$ of $a$'s $k$ nearest neighbours to $a$'s own $lrd_k(a)$.

Intuitively, if the "density" around a point's nearest neighbours is higher than its own "density", the point will have a higher LOF. Points with a higher score are more likely to be OOD.
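Since [14] compute LOF over features from a trained classifier, a sketch of the test-time usage with scikit-learn follows (the feature arrays here are placeholders; in practice they would be the penultimate activations of the intent classifier):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

train_feats = np.random.randn(1000, 128)  # placeholder: ID penultimate activations
test_feats = np.random.randn(50, 128)     # placeholder: test-time activations

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(train_feats)
# score_samples returns the *negative* LOF; negate so higher => more OOD.
s_x = -lof.score_samples(test_feats)
```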

[14] further show that training with the large margin cosine loss, or LMCL, works better than the typical combination of softmax + cross entropy:

$L_{LMC} = -\log \frac{e^{s(\cos\theta_y - m)}}{e^{s(\cos\theta_y - m)} + \sum_{y' \neq y} e^{s \cos\theta_{y'}}}, \qquad \cos\theta_{y'} = \hat{W}_{y'}^\top \hat{f}$

Here, $m$ denotes the margin and $s$ a scaling factor. $W_{y'}$ denotes the row in the final linear layer weight matrix corresponding to label $y'$, while $f$ denotes the penultimate layer activations which are input to the final layer. We use $\hat{W}_{y'}$ and $\hat{f}$ to denote the normalized $W_{y'}$ and $f$, i.e. $\hat{f} = f / \|f\|$.

We denote this approach as LOF+LMCL. We directly use the authors' implementation (https://github.com/thuiar/DeepUnkID) for this approach.
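A rough sketch of LMCL itself (the hyperparameter values for $s$ and $m$ below are illustrative, not those of [14]):

```python
import torch
import torch.nn.functional as F

def lmcl_loss(feats: torch.Tensor, W: torch.Tensor, y: torch.Tensor,
              s: float = 30.0, m: float = 0.35) -> torch.Tensor:
    """Large margin cosine loss over L2-normalized features and class weights."""
    f_hat = F.normalize(feats, dim=-1)      # normalize penultimate activations
    w_hat = F.normalize(W, dim=-1)          # normalize final-layer weight rows
    cos = f_hat @ w_hat.t()                 # cosine similarity to each class
    margin = F.one_hot(y, cos.size(-1)).float() * m  # subtract m at true class
    return F.cross_entropy(s * (cos - margin), y)
```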

Likelihood

Here, $s(x) = -\log p(x)$, where $p$ is a likelihood model trained on the ID training points. In the simplest case, $p$ is simply a left-to-right language model learnt on all our training sentences; we denote this approach $P_{LM}$. Later, we discuss another class of models which can give a valid likelihood, namely generative classifiers, which we denote $P_{GC}$.

Likelihood Ratio

[16, 2] found that likelihood is poor at separating out OOD examples, in some cases even assigning them higher likelihood than the ID test split. [20] make similar observations for detecting OOD DNA sequences. They then propose likelihood ratio (LLR) based methods, which we briefly revisit here.

Let $p(x)$ and $p_{bg}(x)$ denote the probability of $x$ according to the model and a background model respectively; $p$ is trained on the training set, while $p_{bg}$ is trained on noised samples from the training set. Let $x_{<t}$ denote the prefix $x_1, \ldots, x_{t-1}$. The LLR is derived as

$LLR(x) = \log \frac{p(x)}{p_{bg}(x)} = \sum_t \log \frac{p(x_t | x_{<t})}{p_{bg}(x_t | x_{<t})}$

The intuition is that "surface-level" features might cause OOD points to be assigned a reasonable probability. The hypothesis is that the background model captures these "surface-level" features, which persist even after noising, so that their influence is removed when the likelihoods are divided out. If so, $s(x) = -LLR(x)$ would be a better choice than the plain negative log-likelihood.
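A minimal sketch of the resulting score (the per-token log-probability functions are assumed to come from the two trained language models):

```python
def llr_score(tokens, id_logprobs, bg_logprobs) -> float:
    """s(x) = -LLR(x): OOD points gain little from the ID model over p_bg.

    `id_logprobs(tokens)` / `bg_logprobs(tokens)` are assumed to return the
    per-token log p(x_t | x_<t) under the ID and background LMs respectively.
    """
    llr = sum(id_logprobs(tokens)) - sum(bg_logprobs(tokens))
    return -llr
```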

How to introduce noise?

It is common practice in vision to perturb images slightly by adding a Gaussian noise vector of small magnitude. Since natural language utterances are sequences of discrete words, this does not extend directly to them.

A simple alternative for introducing noise into natural language inputs is random word substitution. Word substitution based noise has a long precedent of use, from negative sampling as in word2vec [15, 7] to autoencoder-like objectives as in BERT [5].

More specifically, with probability $p_{noise}$, we substitute each word with a word sampled from a noise distribution $q(\cdot)$; $p_{noise}$ is a hyperparameter to be tuned. We experiment with 3 different choices of $q$ (a code sketch follows the list):

  1. UNIFORM: $q(w) = 1/|V|$, i.e. each word is equally likely.

  2. UNIGRAM: $q(w) \propto freq(w)$, i.e. a word is sampled with probability proportional to its frequency in the training corpus. Using unigram frequency for the noise distribution is common practice in noise contrastive estimation [6].

  3. UNIROOT: $q(w) \propto \sqrt{freq(w)}$, i.e. a word is sampled with probability proportional to the square root of its frequency. Using such smoothed versions of the unigram frequency distribution has precedent in other NLP tasks; for instance, word2vec's negative sampling uses negative contexts sampled proportionally to $freq(w)^{3/4}$ [7] (see the footnote concluding page 2 of that paper).
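A sketch of these noising schemes (assumed inputs: `vocab`, a word list, and `freq`, unigram counts from the training set; the default $p_{noise}$ is illustrative):

```python
import random

def make_sampler(vocab, freq, scheme):
    """Build a word sampler for one of the three noise distributions q(w)."""
    if scheme == "UNIFORM":
        weights = [1.0] * len(vocab)
    elif scheme == "UNIGRAM":
        weights = [freq[w] for w in vocab]
    else:  # UNIROOT
        weights = [freq[w] ** 0.5 for w in vocab]
    return lambda: random.choices(vocab, weights=weights, k=1)[0]

def noise_sentence(words, sampler, p_noise=0.5):
    """Independently replace each word with probability p_noise."""
    return [sampler() if random.random() < p_noise else w for w in words]
```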

Choice of architecture

Since this is, to our knowledge, the first work to extend the LLR method to NLP, we use a simple and standard architecture for the language models: a single-layer, left-to-right LSTM language model [22]. We vary the hidden state size. Note that this model is non-class-conditional: it does not use the labels in any way.

An additional point of consideration is that $p_{bg}$ should not have a very large number of parameters, or a large time complexity at test time. In this regard too, an LSTM language model with a small state size is apt. We mark approaches which use this architecture for $p_{bg}$ with the suffix +BackLM.

Generative Classifier

Typical classification models estimate the conditional probability of the label given the input, i.e. $P(y|x)$. An alternative paradigm learns to estimate $P(x|y)$, additionally estimating $P(y)$ from the training set label ratios. Using Bayes rule,

$P(y|x) = \frac{P(x|y) P(y)}{\sum_{y'} P(x|y') P(y')}$

Classifiers of this paradigm are called generative classifiers, in contrast to the typical discriminative classifiers.

[23] compare the two paradigms and find generative classifiers useful for a) high sample efficiency, b) continual learning, and c) an explicit marginal likelihood. The last point is particularly useful for us, since we can use the explicit marginal likelihood from the classifier as our $s(x)$. Specifically, we use $P(x) = \sum_y P(x|y) P(y)$, which is directly available from a trained generative classifier, and set $s(x) = -\log P(x)$. Hereon, we refer to this approach as $P_{GC}$.

[23] also propose a deep architecture for generative text classifiers, consisting of a unidirectional LSTM shared across classes and a label embedding matrix. The respective label embedding is concatenated to the current hidden state, and a final layer is then applied on this vector to give the distribution over the next word. The per-word cross-entropy serves as the loss function. We use a similar architecture, as illustrated in Figure 1.
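The sketch below condenses this architecture and the marginal-likelihood score; all dimensions and names are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeClassifier(nn.Module):
    def __init__(self, vocab_size, n_labels, emb=100, hid=100, lab=50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.label_emb = nn.Embedding(n_labels, lab)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid + lab, vocab_size)

    def log_p_x_given_y(self, x, y):
        """Per-sentence log P(x|y): sum of next-word log-probs (x: [B, T])."""
        h, _ = self.lstm(self.word_emb(x[:, :-1]))        # states over prefixes
        lab = self.label_emb(y).unsqueeze(1).expand(-1, h.size(1), -1)
        logits = self.out(torch.cat([h, lab], dim=-1))    # concat label embedding
        logp = F.log_softmax(logits, dim=-1)
        return logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1).sum(dim=-1)

    def score(self, x, log_prior):
        """s(x) = -log P(x) = -logsumexp_y [log P(x|y) + log P(y)]."""
        per_y = [self.log_p_x_given_y(x, torch.full_like(x[:, 0], c)) + lp
                 for c, lp in enumerate(log_prior)]
        return -torch.logsumexp(torch.stack(per_y, dim=-1), dim=-1)
```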

3 Evaluation

For the threshold dependent measures, we tune the threshold on $s(x)$ using the validation set. We use the following metrics to measure OOD detection performance:

  • FPR@k%TPR: On picking a threshold such that the OOD recall (TPR) is k%, what fraction of the ID points are incorrectly flagged as OOD? FPR denotes False Positive Rate and TPR denotes True Positive Rate; note that "positive" here refers to the OOD class. We choose a high value of $k$, i.e. $k = 95$. The lower this value, the better the OOD detection.

  • AUROC: Measures the area under the Receiver Operating Characteristic (ROC) curve, with OOD as the positive class. [9] first proposed using this metric. The higher the value, the better the OOD detection. This metric is threshold independent.

  • AUPR: The area under the Precision-Recall curve is another threshold independent metric. Unlike AUROC, AUPR is sensitive to class imbalance [4], which makes it informative when one class is rare. $AUPR_{ID}$ and $AUPR_{OOD}$ correspond to taking ID and OOD respectively as the positive class.
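A sketch of how these metrics can be computed with scikit-learn (assuming numpy arrays `scores` for $s(x)$ and `is_ood` with 1 = OOD):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def ood_metrics(scores, is_ood, k=0.95):
    """AUROC, FPR at k% OOD recall, and AUPR with each class as positive."""
    auroc = roc_auc_score(is_ood, scores)
    fpr, tpr, _ = roc_curve(is_ood, scores)
    fpr_at_k = fpr[np.searchsorted(tpr, k)]        # FPR when TPR first reaches k
    aupr_ood = average_precision_score(is_ood, scores)       # OOD as positive
    aupr_id = average_precision_score(1 - is_ood, -scores)   # ID as positive
    return auroc, fpr_at_k, aupr_id, aupr_ood
```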

4 Datasets

We use two datasets for our experiments. The first, SNIPS, is a widely used, publicly available dataset which does not contain actual OOD intents. The second, ROSTD, is a combination of a dataset released earlier [21] with new OOD examples collected by us. We briefly describe both in order; Table 2 also provides useful summary statistics about these datasets.

SNIPS

Released by [3], SNIPS consists of sentences spread across 7 intent classes such as GetWeather, RateBook, etc. As discussed previously, it does not explicitly include OOD sentences.

We follow the procedure described in [14] to synthetically create OOD examples. Intent classes which in combination cover at least K% of the training points are retained as ID. Examples from the remaining classes are treated as OOD and removed from the training set; in the validation and test sets, examples from these classes are relabelled with the single class label OOD. Besides not being genuinely OOD, another issue with this dataset is that the validation and test splits are quite small, at 700 examples each.

In §5, we report experiments with K = 75 (see the appendix for the experimental results with K = 25); both of these ratios were used in [14]. We refer to these settings as SNIPS,75% and SNIPS,25% respectively. Since multiple ID-OOD splits of the classes satisfying these ratios are possible, our results are averaged across 5 randomly chosen splits.

ROSTD

We release a dataset of 4,590 OOD sentences. These sentences were curated to be explicitly OOD with respect to the English split of the recently released dataset of intents from [21], which serves as the ID dataset. This ID dataset contains intents from 12 intent classes. We chose it over SNIPS owing to its considerably larger size (roughly 3 times larger). The sentences were authored by human annotators with the instructions described in the subsection Annotation Guidelines.

Category | Examples | %
Overtly Powerful Action | 1. send Ameena $25 from Venmo account 2. fix a pot of coffee | 20.55
Action Memory | 1. What's the color of the paint I bought off Amazon 2. how much did I spend yesterday | 12.24
Declarative Statement | 1. I learned some good words. 2. I always bookmark my favorite website to go back in it anytime. 3. all Star Wars movie are great | 8.74
Underspecified Query | 1. On what website can I order medication? 2. how many jobs is having been lost | 33.94
Speculative Question | 1. Can I do all of my Amazon shopping through the app? 2. when is the next episode of General Hospital | 6.91
Subjective Question | 1. What color goes well with navy blue? 2. where can I learn something new every day? | 27.99

Table 1: We manually classify each OOD sentence in ROSTD into [1 or more] of 6 qualitative categories, named self-explanatorily. More examples per category can be seen in Table 1 of the appendix.

Annotation Guidelines

We used human annotators to author intents which are explicitly out-of-domain with respect to the English split of [21]. The requirements and instructions for annotation were as follows:

  1. The OOD utterances were authored by several distinct English-speaking annotators from Anglophone countries.

  2. The annotators were asked to author sentences which were both grammatical and semantically sensible as English sentences. This was to prevent our OOD data from becoming trivial by inclusion of ungrammatical sentences, gibberish and nonsensical sentences.

  3. The annotators were well informed of the existing intent classes, to prevent them from authoring in-domain intents. This was done by presenting the annotators with examples from the training split of each intent class, with the option to scroll for more through a dropdown.

  4. After the first round of annotators had authored such intents, each intent was post-annotated as in-domain vs out-of-domain by two fresh annotators who were not involved in the authoring stage.

  5. If both annotators agreed that the example was OOD, it was retained. If both agreed it was ID, it was discarded. In the event the two annotators disagreed, an additional, third annotator was asked to label the example and adjudicate the disagreement.

  6. During post-processing, we removed utterances which were shorter than three words.

Qualitative Analysis

We identify six qualitative categories which might be making the sentences OOD. We then manually assign each example to these categories. We summarize their distribution in Table 1. Note that since these categories are not mutually exclusive, an example may get assigned multiple categories.

Coarsening Labels

The ID examples from [21], which we also use as the ID portion of ROSTD, have hierarchical class labels [e.g. alarm/set_alarm, reminder/set_reminder, and weather/find; see [21] for the full list]. Hence, ROSTD has a large number of classes (12), not all of which are equally distinct from each other. To ensure that our results are not specific only to settings with this kind of hierarchical label structure, we also experiment with retaining only the topmost or most "coarse" label of each example. We refer to this variant with "coarsened" labels as ROSTD-COARSE.
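Assuming the slash-separated label format above, coarsening is a one-line transformation:

```python
def coarsen(label: str) -> str:
    """Keep only the top-level domain of a hierarchical intent label."""
    # e.g. "alarm/set_alarm" -> "alarm" (the "/" separator is assumed here)
    return label.split("/")[0]
```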

Statistic ROSTD SNIPS
Train-ID 30521 13084
Valid-ID 4181 700
Test-ID 8621 700
Actual OOD 4590 None
Unique Word Types 11.5K 11.4K
Unique Bigrams 47.3K 36.3K
Unique Trigrams 80.8K 52.2K
Mean Utterance Length 6.85 6.79
Number of ID classes 12/3 (Coarse) 7
Table 2: Dataset and Vocabulary Statistics contrasting ROSTD and SNIPS. Note that the ID part of ROSTD comes from the English portion of the publicly available data from [21]

5 Experiments

Dataset | Model | F1 ↑ | FPR@95%TPR ↓ | AUROC ↑ | AUPR_OOD ↑

ROSTD
MSP | 54.22 ± 4.01 | 100.00 ± 0.00 | 70.75 ± 3.70 | 55.68 ± 6.36
MSP, higher T | 55.45 ± 4.19 | 60.48 ± 3.17 | 76.94 ± 4.01 | 59.64 ± 6.49
$-KL_U$ | 55.31 ± 3.89 | 60.15 ± 3.11 | 76.86 ± 3.85 | 59.54 ± 6.10
$-KL_{\hat{P}}$ | 83.24 ± 2.78 | 21.31 ± 8.38 | 95.78 ± 1.30 | 90.32 ± 1.97
LOF | 64.46 ± 2.57 | 42.49 ± 3.49 | 81.39 ± 2.38 | 46.89 ± 3.50
LOF+LMCL | 85.97 ± 2.00 | 15.03 ± 5.42 | 95.60 ± 0.75 | 82.71 ± 9.17
$P_{LM}$ | 81.38 ± 0.19 | 18.92 ± 0.56 | 95.42 ± 0.11 | 87.38 ± 0.41
$P_{LM}$+BackLM+UNIFORM | 85.25 ± 0.72 | 36.65 ± 6.87 | 94.71 ± 0.49 | 91.10 ± 0.63
$P_{LM}$+BackLM+UNIGRAM | 82.27 ± 0.74 | 42.16 ± 3.62 | 93.62 ± 0.43 | 89.30 ± 0.50
$P_{LM}$+BackLM+UNIROOT | 87.42 ± 0.45 | 20.10 ± 5.25 | 96.35 ± 0.41 | 93.44 ± 0.37
$P_{GC}$ | 86.25 ± 0.71 | 10.86 ± 1.08 | 97.42 ± 0.28 | 92.30 ± 0.99
$P_{GC}$+BackLM+UNIFORM | 89.60 ± 0.56 | 13.71 ± 5.64 | 97.67 ± 0.35 | 95.49 ± 0.42
$P_{GC}$+BackLM+UNIGRAM | 91.35 ± 2.62 | 10.55 ± 4.11 | 97.87 ± 0.49 | 95.86 ± 0.68
$P_{GC}$+BackLM+UNIROOT | 91.17 ± 0.32 | 7.41 ± 1.88 | 98.22 ± 0.26 | 96.47 ± 0.29

ROSTD-COARSE
MSP | 59.99 ± 19.01 | 26.00 ± 34.32 | 71.63 ± 15.55 | 64.32 ± 19.36
MSP, higher T | 64.62 ± 15.31 | 64.46 ± 9.84 | 78.39 ± 11.92 | 66.89 ± 11.76
$-KL_U$ | 65.36 ± 15.49 | 65.39 ± 4.84 | 79.05 ± 11.40 | 67.79 ± 19.43
$-KL_{\hat{P}}$ | 81.56 ± 8.51 | 17.78 ± 15.70 | 93.47 ± 6.25 | 87.49 ± 8.94
LOF | 62.39 ± 9.01 | 46.55 ± 17.56 | 78.07 ± 12.23 | 45.80 ± 12.21
LOF+LMCL | 84.28 ± 3.44 | 15.24 ± 4.70 | 95.19 ± 1.03 | 76.63 ± 2.53
$P_{LM}$ | 80.48 ± 0.27 | 20.78 ± 0.71 | 95.20 ± 0.07 | 86.87 ± 0.13
$P_{LM}$+BackLM+UNIFORM | 85.97 ± 0.65 | 30.65 ± 4.51 | 95.27 ± 0.47 | 91.98 ± 0.61
$P_{LM}$+BackLM+UNIGRAM | 84.46 ± 0.62 | 31.79 ± 3.04 | 94.93 ± 0.32 | 91.22 ± 0.50
$P_{LM}$+BackLM+UNIROOT | 88.25 ± 0.50 | 16.35 ± 1.32 | 96.82 ± 0.12 | 94.10 ± 0.20
$P_{GC}$ | 86.67 ± 0.34 | 9.88 ± 0.44 | 97.58 ± 0.08 | 92.74 ± 0.29
$P_{GC}$+BackLM+UNIFORM | 89.32 ± 0.30 | 8.04 ± 0.69 | 97.83 ± 0.15 | 95.27 ± 0.32
$P_{GC}$+BackLM+UNIGRAM | 90.05 ± 0.73 | 6.69 ± 0.82 | 98.16 ± 0.15 | 95.61 ± 0.50
$P_{GC}$+BackLM+UNIROOT | 90.14 ± 0.39 | 6.78 ± 0.60 | 98.30 ± 0.09 | 95.96 ± 0.37

SNIPS, 75%
MSP | 81.58 ± 7.68 | 16.68 ± 18.06 | 93.51 ± 4.49 | 85.03 ± 6.19
MSP, higher T | 83.94 ± 6.82 | 31.32 ± 30.25 | 94.30 ± 4.50 | 88.44 ± 5.72
$-KL_U$ | 84.23 ± 7.22 | 29.28 ± 27.04 | 94.51 ± 4.38 | 88.71 ± 6.15
LOF | 66.07 ± 8.82 | 49.56 ± 13.49 | 79.65 ± 7.81 | 51.69 ± 13.08
LOF+LMCL | 76.24 ± 9.34 | 42.27 ± 20.64 | 90.37 ± 6.54 | 77.81 ± 10.53
$P_{LM}$ | 63.51 ± 6.33 | 54.56 ± 12.13 | 81.72 ± 5.90 | 62.12 ± 13.28
$P_{LM}$+BackLM+UNIFORM | 74.74 ± 3.25 | 44.60 ± 12.01 | 90.02 ± 2.24 | 80.08 ± 3.31
$P_{LM}$+BackLM+UNIGRAM | 81.19 ± 3.53 | 27.00 ± 8.71 | 93.97 ± 1.85 | 87.57 ± 3.38
$P_{LM}$+BackLM+UNIROOT | 78.75 ± 3.25 | 35.24 ± 11.08 | 92.66 ± 1.89 | 84.84 ± 3.36
$P_{GC}$ | 67.31 ± 7.06 | 44.60 ± 18.53 | 85.17 ± 7.18 | 68.11 ± 14.29
$P_{GC}$+BackLM+UNIFORM | 78.37 ± 6.60 | 29.28 ± 4.23 | 92.35 ± 2.77 | 82.84 ± 7.33
$P_{GC}$+BackLM+UNIGRAM | 85.47 ± 6.90 | 18.48 ± 11.26 | 95.79 ± 2.67 | 90.98 ± 6.73
$P_{GC}$+BackLM+UNIROOT | 81.91 ± 6.83 | 22.24 ± 6.26 | 94.15 ± 2.59 | 86.60 ± 7.03

Table 3: Performance of the baseline methods and our proposed models on ROSTD, ROSTD-COARSE and SNIPS. ↓ (↑) indicates lower (higher) is better. The +BackLM+Noise approaches (where Noise is one of the three noising schemes) outdo their non-LLR counterparts on most measures. For SNIPS, we omit the $-KL_{\hat{P}}$ variant, since the training set is almost evenly distributed between the ID classes. We can also observe that the differences in performance between different approaches are much more pronounced on ROSTD as compared to SNIPS.

We compile the results of all our experiments in Table 3.

Implementation

All experiments are averaged across 5 seeds. We use PyTorch 1.0 [17] to implement our models; code is available at github.com/vgtomahawk/LR_GC_OOD.

The checkpoint with the highest validation F1 on the ID subset of the validation set is chosen as the final checkpoint for computing the OOD evaluation metrics. For the label-agnostic approaches (e.g. $P_{LM}$), the checkpoint with the lowest validation perplexity is chosen. For the +BackLM approaches, we fix the word-substitution probability $p_{noise}$; we also experimented with other values, but found our chosen setting to work best.

Base Classifier Architectures

For the discriminative classifier, we use a 1-layer bidirectional LSTM with embedding size 100, a 100→300 projection layer (to project up the embeddings), and embeddings initialized with GloVe (glove.6B.100D) [19]. The generative classifier approaches have a similar architecture, except that the LSTM is unidirectional and there are additional label embeddings.

LOF implementation

We use the scikit-learn 0.21.2 implementation of LOF [18] (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html). We fix the number of nearest neighbors and tune the contamination rate as a hyperparameter. We also corroborated over email correspondence with the authors of [14] that they used a similar hyperparameter setting for LOF.

Figure 2: Effect of +BackLM+UNIROOT. (a) without the background correction; (b) with +BackLM+UNIROOT. In the right plot, the OOD score distribution has shifted considerably to the right relative to (a) and overlaps less with the ID set.

Observations

From Table 3, we see that $P_{GC}$, particularly with the +BackLM correction, outperforms the uncertainty-based and nearest-neighbour approaches by a reasonable margin on both datasets. Plain $P_{GC}$ is also significantly better than the language model likelihood $P_{LM}$. This validates our hypothesis that generative classifiers effectively combine the benefits of likelihood-based and uncertainty-based approaches.

Furthermore, the LLR based approaches outperform the respective likelihood-only approach, whether $P_{LM}$ or $P_{GC}$. Amongst the different noising methods, the performance improvement is typically largest with the UNIROOT approach we proposed; for instance, on ROSTD, $P_{GC}$+BackLM+UNIROOT improves over plain $P_{GC}$ on every metric.

An advantage of ROSTD that is clear from the experiments is that differences in performance between the various methods are much more pronounced on it than on SNIPS. On SNIPS, the simple MSP baseline with a higher temperature is itself able to reach 80-90% of the best performing approach on most metrics.

6 Conclusion

To the best of our knowledge, we are the first to use an approach based on generative text classifiers for OOD detection. Our experiments show that this approach can significantly outperform existing paradigms on multiple datasets.

Furthermore, we are the first to flesh out ways of using the likelihood ratio based approaches, first formalized by [20], for OOD detection in NLP. The original work tested these approaches only on DNA sequences, which have a radically smaller vocabulary than natural language sentences. We propose UNIROOT, a new way of noising inputs which works better for natural language. Our method improves two different likelihood-based approaches on multiple datasets.

Lastly, we curate and publicly release ROSTD, a novel dataset of OOD intents w.r.t. the intents in [21]. We hope ROSTD fosters further research and serves as a useful benchmark for OOD detection.

7 Acknowledgements

We thank Tony Lin and co-authors for promptly answering several questions about their paper, and Sachin Kumar for valuable discussion on methods. We also thank Hiroaki Hayashi and 3 anonymous reviewers for valuable comments.

References

  • [1] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander (2000) LOF: identifying density-based local outliers. In ACM SIGMOD Record, Vol. 29, pp. 93–104.
  • [2] H. Choi and E. Jang (2018) Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392.
  • [3] A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
  • [4] J. Davis and M. Goadrich (2006) The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • [6] C. Dyer (2014) Notes on noise contrastive estimation and negative sampling. arXiv preprint arXiv:1410.8251.
  • [7] Y. Goldberg and O. Levy (2014) word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
  • [8] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321–1330.
  • [9] D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France.
  • [10] D. Hendrycks, M. Mazeika, and T. G. Dietterich (2019) Deep anomaly detection with outlier exposure. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA.
  • [11] K. Lee, H. Lee, K. Lee, and J. Shin (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada.
  • [12] K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177.
  • [13] S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada.
  • [14] T. Lin and H. Xu (2019) Deep unknown intent detection with margin loss. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5491–5496.
  • [15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
  • [16] E. T. Nalisnick, A. Matsukawa, Y. W. Teh, D. Görür, and B. Lakshminarayanan (2019) Do deep generative models know what they don't know? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA.
  • [17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. OpenReview.
  • [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • [19] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • [20] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. arXiv preprint arXiv:1906.02845.
  • [21] S. Schuster, S. Gupta, R. Shah, and M. Lewis (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3795–3805.
  • [22] M. Sundermeyer, R. Schlüter, and H. Ney (2012) LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
  • [23] D. Yogatama, C. Dyer, W. Ling, and P. Blunsom (2017) Generative and discriminative text classification with recurrent neural networks. arXiv preprint arXiv:1703.01898.