Deep neural networks have significantly improved predictive accuracy for classification tasks in multiple domains. However, many applications also require reliable confidence estimates. In real-world settings that contain out-of-distribution (OOD) samples, the model should know when it cannot make a confident judgment rather than make an incorrect one. Studies show that traditional neural networks are prone to over-confidence, i.e., a high class probability for an incorrect class prediction (Guo et al., 2017; Hein et al., 2019; Ovadia et al., 2019). Therefore, calibrated predictive uncertainty is crucial to avoid those risks.
In this paper, we are interested in quantifying uncertainty to solve OOD detection in text classification, which underlies a wide range of Natural Language Processing (NLP) applications (Chang et al., 2020; Li and Ye, 2018). Although fine-tuning pre-trained transformers (Devlin et al., 2018) has achieved state-of-the-art accuracy on text classification tasks, these models still suffer from the same over-confidence problem as traditional neural networks, making their predictions untrustworthy (Hendrycks et al., 2020). One partial explanation is over-parameterization (Guo et al., 2017). Although transformers are pre-trained on a large corpus and capture rich semantic information, they easily become over-confident given limited labeled data during the fine-tuning stage (Kong et al., 2020). Overall, compared to the Computer Vision (CV) domain, there is less work on quantifying uncertainty in the NLP domain. Existing methods fall into Bayesian and non-Bayesian approaches.
Bayesian models quantify model uncertainty via Bayesian neural networks (BNNs) (Blundell et al., 2015; Louizos and Welling, 2017), which explicitly treat model parameters as distributions. Specifically, BNNs consider probabilistic uncertainty, i.e., aleatoric uncertainty and epistemic uncertainty (Kendall and Gal, 2017). Aleatoric uncertainty only considers data uncertainty caused by statistical randomness, while epistemic uncertainty refers to model uncertainty introduced by limited knowledge or ignorance of the collected data. Monte Carlo Dropout (Gal and Ghahramani, 2016) is a crucial technique to approximate variational Bayesian inference. It trains and evaluates a neural network with dropout (Srivastava et al., 2014) before each layer. BNNs have been explored for classification and regression in CV applications, but there has been less study in the NLP domain. A few works (Xiao and Wang, 2019; Van Landeghem et al., 2020; Ovadia et al., 2019) empirically evaluate uncertainty estimation in text classification. Other attempts adopt MC Dropout in deep active learning (Shen et al., 2017; Siddhant and Lipton, 2018) or machine translation (Zhou et al., 2020).
Non-Bayesian approaches use entropy (Shannon, 1948) or softmax scores as a measure of uncertainty, which only considers aleatoric uncertainty (Kendall and Gal, 2017). OOD detection in text classification using GRU (Chung et al., 2014) or LSTM (Hochreiter and Schmidhuber, 1997) has been studied in (Hendrycks and Gimpel, 2016; Hendrycks et al., 2018). Hendrycks et al. (2020) empirically study pre-trained transformers’ performance on OOD detection. They point out transformers cannot clearly separate in-distribution (ID) and OOD examples. In addition, OOD detection has also been studied in dialogue systems (Zheng et al., 2020) and document classification (Zhang et al., 2019; He et al., 2020). Another line of non-Bayesian methods involves the calibration of probabilities. Temperature scaling (Guo et al., 2017) calibrates softmax probabilities by adding a scalar parameter to each class in a post-processing step. Thulasidasan et al. (2019) explore the improvement of calibration and predictive uncertainty of models trained with mix-up (Zhang et al., 2017) in the NLP domain. Kong et al. (2020) use pseudo samples on and off the data manifold for calibration.
| Sentence | Probability | Evidence | Remark |
| --- | --- | --- | --- |
| Without calibration: | | | |
| 3. 'Deep learning is data hungry.' | [0.99, 0.01] | doesn't apply | Over-confidence |
| With calibration: | | | |
| 1. (review with clear sentiment) | [0.01, 0.99] | [1, 99] | Low uncertainty |
| 2. (review with conflicting sentiment) | [0.5, 0.5] | [50, 50] | High dissonance |
| 3. 'Deep learning is data hungry.' | [0.5, 0.5] | [1, 1] | High vacuity |

Table 1. Predictive uncertainty of sentiment analysis of restaurant reviews. The model without calibration demonstrates over-confidence. A well-calibrated classifier outputs the same expected probabilities for Cases 2 and 3, which have very different evidence.
Besides probabilistic uncertainty and BNNs, evidential uncertainty has been proposed based on belief/evidence theory and Subjective Logic (SL) (Jøsang, 2016; Jøsang et al., 2018). It considers different dimensions of uncertainty, such as vacuity (i.e., lack of evidence) and dissonance (i.e., uncertainty due to conflicting evidence). In the CV domain, Sensoy et al. (2018) propose evidential neural networks (ENNs) to explicitly model the uncertainty of class probabilities based on SL. An ENN treats the predictions as subjective opinions and uses a deterministic neural network to learn a function that collects evidence from data to form these opinions. Several works (Sensoy et al., 2020; Zhao et al., 2019; Hu et al., 2020) improve ENNs using regularization or generative models to ensure correct uncertainty estimation for unseen examples in image classification. However, those methods are designed for continuous feature spaces and are not applicable to discrete text.
- Why is it necessary to calibrate predictive uncertainty?
- What is the advantage of evidential uncertainty in OOD detection?
- How do we design a regularization method to calibrate the predictive uncertainty?
In Table 1, we assume that a classifier is trained only on the restaurant reviews dataset and has never seen examples from other domains. The probability column denotes the predicted softmax probability. The evidence column represents historical observations, denoted by Dirichlet distributions (no evidence when the Dirichlet parameters are $[1, 1]$). Before calibration, the classifier predicts Sentence 3, an obvious OOD example, as positive with high confidence. Thus it is necessary to calibrate predictive uncertainty to reduce over-confidence.
For a well-calibrated model, there are three common cases in predictions. Sentence 1 reflects correct, confident classification, where we have enough evidence with no conflicts. Sentence 2 is vague and contains conflicting information like 'bad' and 'acceptable'. The prediction results in equal probabilities because each class is supported by an equal amount of evidence, i.e., conflicting evidence or high dissonance. Finally, for an OOD sample such as Sentence 3, we lack the evidence to support any prediction. This results in high vacuity, with the Dirichlet distribution reducing to a uniform distribution. The model outputs the same predictive probability for Sentences 2 and 3, which have very different evidence. In this case, probabilistic uncertainty cannot distinguish the conflicting case from the out-of-distribution case, whereas evidential uncertainty decomposes the uncertainty based on its different root causes. This explains the advantage of evidential uncertainty over probabilistic uncertainty.
Figure 1 illustrates the prediction uncertainty of neural networks for the examples in Table 1. Assume we project the examples into a 2D space. Sentence 1 lies in a region with many negative examples, Sentence 2 lies in the boundary region, and Sentence 3 is far away from the ID region. Figure 1 (a) represents the prediction by traditional neural networks with softmax and demonstrates over-confidence: it only assigns high uncertainty (entropy) near the classification boundary. Hein et al. (2019) prove that ReLU-type neural networks produce arbitrarily high-confidence predictions far away from the training data. Figure 1 (b) represents the predictive entropy of a well-calibrated model. Figures 1 (c) and (d) show how evidential uncertainty decomposes the uncertainty in (b) based on different root causes. We observe high vacuity in OOD regions and high dissonance in ID boundary regions. Vacuity can effectively separate OOD samples from boundary ID examples because its cause is a lack of evidence: we can distinguish Sentence 3 from Sentence 2 in Figure 1 (c) but not in Figure 1 (b).
Finally, Figure 1 also illustrates OOD examples and adversarial examples. Adversarial examples (Szegedy et al., 2013; Carlini and Wagner, 2017; Madry et al., 2017) refer to instances with small feature perturbations. Many studies (Jia and Liang, 2017; Wallace et al., 2019; Jia et al., 2019) use adversarial examples to evaluate and improve neural networks' robustness. We can use diverse outliers to calibrate the model to output high uncertainty in the OOD region (Hendrycks et al., 2018). Additionally, adversarial examples can help detect OOD examples close to ID regions. Thus, our approach adopts a mixture of an auxiliary dataset of outliers and close adversarial examples to calibrate the predictive uncertainty. Diverse text data are easy to obtain as auxiliary outliers, but generating adversarial examples via common gradient-based approaches is infeasible for discrete text. We therefore apply methods (Stutz et al., 2019; Gilmer et al., 2018; Kong et al., 2020) that generate off-manifold adversarial examples from the embedding layer.
Our work provides the following key contributions: (i) We are the first to apply evidential uncertainty to solve OOD detection tasks in text classification. (ii) We propose an inexpensive framework that adopts both an auxiliary dataset of outliers and generated pseudo off-manifold samples to train a model with prior knowledge of a certain class, which has high vacuity for OOD samples. (iii) We validate our proposed method's performance via extensive experiments on OOD detection and uncertainty estimation in text classification. Our approach significantly outperforms all the counterparts.
We briefly provide the background knowledge of evidential uncertainty and its advantage over probabilistic uncertainty.
2.1. Subjective Opinions in SL
A multinomial opinion on a given proposition is represented by $\omega = (\boldsymbol{b}, u, \boldsymbol{a})$, where the domain is $\mathbb{X} = \{1, \dots, K\}$, a random variable $X$ takes values in $\mathbb{X}$, and $K = |\mathbb{X}| \geq 2$. $\boldsymbol{b} = (b_1, \dots, b_K)$ denotes the belief mass function over $\mathbb{X}$, $u$ denotes the uncertainty mass representing vacuity of evidence, and $\boldsymbol{a} = (a_1, \dots, a_K)$ represents the base rate distribution over $\mathbb{X}$, with $u + \sum_{k=1}^{K} b_k = 1$. Then the projected probability distribution of a multinomial opinion is given by:

$$P(k) = b_k + a_k u, \quad \forall k \in \mathbb{X}.$$
A multinomial probability density over a domain of cardinality $K$ is represented by the $K$-dimensional Dirichlet PDF, where the special case with $K = 2$ is the Beta PDF as a binomial opinion. Let $\mathbb{X}$ be a domain of $K$ mutually disjoint elements, $\boldsymbol{\alpha}$ the strength vector over $\mathbb{X}$, and $\boldsymbol{p}$ the probability distribution over $\mathbb{X}$. The Dirichlet PDF is:

$$\mathrm{Dir}(\boldsymbol{p} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} p_k^{\alpha_k - 1},$$

where $B(\boldsymbol{\alpha})$ is the multivariate beta function serving as the normalizing constant, $\alpha_k \geq 0$, and $p_k \neq 0$ if $\alpha_k < 1$.
We term evidence a measure of the amount of supporting observations collected from data in favor of a sample being classified into a certain class. Let $e_k$ be the evidence derived for the singleton $k \in \mathbb{X}$. The total strength $\alpha_k$ for the belief of each singleton is given by:

$$\alpha_k = e_k + a_k W, \quad \forall k \in \mathbb{X},$$

where $W$ is a non-informative prior weight representing the amount of uncertain evidence and $a_k$ is the base rate distribution. Given the Dirichlet PDF, the expected probability distribution over $\mathbb{X}$ is:

$$\mathbb{E}[p_k] = \frac{\alpha_k}{\sum_{j=1}^{K} \alpha_j} = \frac{e_k + a_k W}{W + \sum_{j=1}^{K} e_j}.$$

The observed evidence in the Dirichlet PDF can be mapped to the multinomial opinion by:

$$b_k = \frac{e_k}{S}, \qquad u = \frac{W}{S},$$

where $S = \sum_{k=1}^{K} \alpha_k$ is the total strength. We set the base rate $a_k = 1/K$ and the non-informative prior weight $W = K$, and hence $\alpha_k = e_k + 1$ for each $k$, as these are the default values in subjective logic.
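As a concrete check of this mapping, the following plain-Python sketch (assuming the default priors $a_k = 1/K$ and $W = K$) converts a class evidence vector into belief masses, vacuity, and the projected probability $P(k) = b_k + a_k u$:

```python
def opinion_from_evidence(evidence, W=None):
    """Map class evidence e_k to a subjective opinion, using the default
    SL priors: base rate a_k = 1/K and prior weight W = K."""
    K = len(evidence)
    W = K if W is None else W
    a = [1.0 / K] * K                                    # base rate distribution
    alpha = [e + ak * W for e, ak in zip(evidence, a)]   # alpha_k = e_k + 1
    S = sum(alpha)                                       # total strength S = W + sum(e)
    belief = [e / S for e in evidence]                   # belief mass b_k = e_k / S
    vacuity = W / S                                      # uncertainty mass u = W / S
    projected = [b + ak * vacuity for b, ak in zip(belief, a)]  # P(k) = b_k + a_k u
    return belief, vacuity, projected

# No evidence at all (an OOD-like input): full vacuity, uniform prediction.
b, u, p = opinion_from_evidence([0, 0])    # u == 1.0, p == [0.5, 0.5]
```

Note that the belief masses and vacuity always sum to one, matching the constraint $u + \sum_k b_k = 1$ above.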
2.2. Uncertainty Dimensions
Jøsang et al. (2018) define multiple dimensions of a subjective opinion based on the formalism of SL. Vacuity refers to uncertainty caused by insufficient information to understand a given opinion. It corresponds to the uncertainty mass, $u$, of an opinion in SL:

$$vac(\omega) = u = \frac{W}{S} = \frac{K}{\sum_{k=1}^{K} \alpha_k}.$$
Dissonance denotes uncertainty due to conflicting evidence, where no single belief is clearly supported. We observe high dissonance when the same amount of evidence supports multiple extremes of belief. Given a multinomial opinion with non-zero belief masses, the measure of dissonance can be obtained by:

$$diss(\omega) = \sum_{k=1}^{K} \frac{b_k \sum_{j \neq k} b_j \,\mathrm{Bal}(b_j, b_k)}{\sum_{j \neq k} b_j},$$

where the relative mass balance between a pair of belief masses $b_j$ and $b_k$ is expressed by:

$$\mathrm{Bal}(b_j, b_k) = \begin{cases} 1 - \dfrac{|b_j - b_k|}{b_j + b_k}, & \text{if } b_j b_k \neq 0, \\ 0, & \text{otherwise}. \end{cases}$$
The above two uncertainty measures (i.e., vacuity and dissonance) can be interpreted using class-level evidence measures of subjective opinions. As in Table 1, given two classes (positive and negative), we have three subjective opinions $\omega_1$, $\omega_2$, $\omega_3$, represented by two-class evidence measures: $\omega_1 = (1, 99)$ represents low uncertainty (low entropy, dissonance, and vacuity), which implies high confidence in a decision-making context; $\omega_2 = (50, 50)$ indicates high inconclusiveness due to highly conflicting evidence, which gives high entropy and high dissonance; and $\omega_3 = (1, 1)$ shows the case of high vacuity, which is commonly observed in OOD samples. Therefore, vacuity can effectively distinguish OOD samples from boundary samples because it represents a lack of evidence.
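The dissonance formula can be sketched in a few lines of plain Python (again assuming the default priors $a_k = 1/K$ and $W = K$); of the three opinions above, only the conflicting one then scores high:

```python
def balance(bj, bk):
    """Relative mass balance Bal(b_j, b_k) between two belief masses."""
    return 1.0 - abs(bj - bk) / (bj + bk) if bj * bk != 0 else 0.0

def dissonance(evidence, W=None):
    """Dissonance of the opinion formed from class evidence, with the
    default SL priors (a_k = 1/K, W = K)."""
    K = len(evidence)
    W = K if W is None else W
    S = sum(evidence) + W
    b = [e / S for e in evidence]           # belief masses b_k = e_k / S
    diss = 0.0
    for k in range(K):
        denom = sum(b[j] for j in range(K) if j != k)
        if denom > 0:
            diss += b[k] * sum(b[j] * balance(b[j], b[k])
                               for j in range(K) if j != k) / denom
    return diss

# (1, 99): confident -> low dissonance; (50, 50): conflict -> high dissonance;
# (1, 1): hardly any evidence -> vacuity (not dissonance) is what flags it.
```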
3.1. Calibrating Evidential Neural Networks
ENNs (Sensoy et al., 2018) predict an evidence vector for a Dirichlet distribution instead of a softmax probability. Given a sample with input features $x$ and ground-truth label $y$ (one-hot), let $\boldsymbol{e}(x; \theta)$ represent the evidence vector predicted by the classifier with parameters $\theta$. The corresponding Dirichlet distribution has parameters $\boldsymbol{\alpha} = \boldsymbol{e}(x; \theta) + 1$. The Dirichlet density $\mathrm{Dir}(\boldsymbol{p} \mid \boldsymbol{\alpha})$ is the prior on the multinomial distribution $\mathrm{Mult}(y \mid \boldsymbol{p})$. We then optimize the following sum of squared loss for classification:

$$\mathcal{L}(x, y; \theta) = \sum_{k=1}^{K} \left[ \left( y_k - \frac{\alpha_k}{S} \right)^2 + \frac{\alpha_k (S - \alpha_k)}{S^2 (S + 1)} \right], \qquad S = \sum_{k=1}^{K} \alpha_k.$$
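For reference, the closed form of this loss from Sensoy et al. (2018), the expected squared error under the predicted Dirichlet, is easy to compute directly; the snippet below is an illustrative plain-Python version for a single sample:

```python
def enn_sos_loss(alpha, y):
    """Closed form of the sum-of-squares loss E_{p ~ Dir(alpha)} ||y - p||^2
    from Sensoy et al. (2018), for a single sample (y is one-hot)."""
    S = sum(alpha)
    loss = 0.0
    for a_k, y_k in zip(alpha, y):
        p_k = a_k / S                        # expected class probability
        loss += (y_k - p_k) ** 2             # squared prediction error
        loss += p_k * (1 - p_k) / (S + 1)    # variance term of the Dirichlet
    return loss
```

Evidence concentrated on the correct class yields a small loss, while evidence on the wrong class is heavily penalized.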
Since Eq. (3.1) only relies on the class labels of training samples, it does not directly measure the quality of the predicted Dirichlet distributions, so the uncertainty estimates may not be accurate. Thus, we propose a regularization method that combines ENNs and language models to quantify evidential uncertainty in text classification tasks. Formally, we are given a set of samples $\{(x_i, y_i)\}$, where $x_i$ refers to the input embedding of a sentence or document and $y_i$ is its label. Let $P_{\mathrm{out}}$ and $P_{\mathrm{in}}$ be the distributions of the OOD and ID samples, respectively. Let $f$ denote the function of the pre-trained feature extraction layers and $g$ denote the task-specific layers; we use $\theta$ to represent the parameters of $f$ and $g$. Then we fine-tune our model by optimizing the following loss function over the parameters:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim P_{\mathrm{in}}}\big[\mathcal{L}(x, y; \theta)\big] + \lambda_1 \, \mathbb{E}_{x \sim P_{\mathrm{in}}}\big[vac(x; \theta)\big] - \lambda_2 \, \mathbb{E}_{\tilde{x} \sim P_{\mathrm{out}}}\big[vac(\tilde{x}; \theta)\big]$$
The first term refers to the vanilla ENN classification loss of Eq. (3.1), which ensures a reasonable estimation of the ID samples' class probabilities. The second term reduces the vacuity estimated on ID samples, and the third term increases the vacuity estimated on OOD samples; $\lambda_1$ and $\lambda_2$ are the trade-off parameters. The goal of minimizing Eq. (10) is to achieve high classification accuracy, low vacuity output for ID samples, and high vacuity output for OOD samples. To ensure the model's generalization to the whole data space, the choice of an effective $P_{\mathrm{out}}$ is crucial. Although generative models have achieved success in the CV domain (Sensoy et al., 2020; Hu et al., 2020), they do not apply to discrete text data. We adopt two methods that have achieved success in the NLP domain to obtain effective OOD regularization: (i) using auxiliary OOD datasets; (ii) generating off-manifold adversarial examples.
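To make the interplay of the three terms concrete, here is a toy plain-Python version of the objective for a single ID/OOD pair. The names `lam1`/`lam2` are illustrative stand-ins for the trade-off parameters, and the classification term is simplified to the squared error of the expected probabilities:

```python
def vacuity(alpha):
    """Vacuity u = K / S of a Dirichlet with strength vector alpha (W = K)."""
    return len(alpha) / sum(alpha)

def regularized_loss(alpha_id, y_id, alpha_ood, lam1=0.01, lam2=1.0):
    """Toy three-term objective: classification loss on an ID sample, plus
    low-vacuity pressure on ID and high-vacuity pressure on OOD."""
    S = sum(alpha_id)
    cls = sum((y_k - a_k / S) ** 2 for a_k, y_k in zip(alpha_id, y_id))
    #      term 1: ID classification   term 2: keep ID vacuity low
    #                                  term 3: push OOD vacuity high (negated)
    return cls + lam1 * vacuity(alpha_id) - lam2 * vacuity(alpha_ood)
```

A model that outputs a near-uniform Dirichlet on an OOD sample attains a lower loss than one that outputs a confident Dirichlet there, which is exactly the behavior the third term rewards.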
3.2. Utilizing Auxiliary Datasets
Auxiliary datasets disjoint from the test datasets can be used to calibrate the neural networks' over-confidence on unseen samples. A critical finding of (Hendrycks et al., 2018) is that the diversity of the auxiliary dataset is important. Hu et al. (2020) report that methods using diverse examples beat methods that only use close adversarial examples (Hein et al., 2019; Sensoy et al., 2020) for OOD detection in image classification. Our empirical observations also find that randomly generated sentences (we randomly sample words and concatenate them into fake sentences) do not improve performance. One partial explanation is that these "sentences" do not contain useful semantic information. This is similar to the CV domain, where CNN models do not extract valuable features from random-pixel images. Since it is easy to obtain a large corpus of diverse text data, utilizing a real dataset is inexpensive and straightforward. Let $P_{\mathrm{aux}}$ be the distribution of the auxiliary OOD dataset; the regularization can be written as:

$$\max_{\theta} \; \mathbb{E}_{\tilde{x} \sim P_{\mathrm{aux}}}\big[vac(\tilde{x}; \theta)\big]$$
3.3. Utilizing Off-Manifold Samples
Kong et al. (2020) encourage the model to output uniform distributions on pseudo off-manifold samples to alleviate over-confidence in OOD regions. In contrast, we apply off-manifold samples by enforcing the model to predict high vacuity:

$$\max_{\theta} \; \mathbb{E}_{\tilde{x} \sim P_{\mathrm{adv}}}\big[vac(\tilde{x}; \theta)\big],$$

where $P_{\mathrm{adv}}$ denotes the distribution of the adversarial examples. The off-manifold samples are generated by adding relatively large perturbations pointing outside of the data manifold. In our NLP tasks, the data manifold refers to the embedding space because the original text is not continuous. Formally, given a training ID sample (embedding) $x$, we generate the off-manifold sample $\tilde{x}$ along the adversarial direction:

$$\tilde{x} = x + \epsilon \cdot \frac{g}{\|g\|_2} \in S(x, \epsilon), \qquad g = \nabla_x \mathcal{L}(x, y; \theta),$$

where $S(x, \epsilon)$ denotes a sphere centered at $x$ with radius $\epsilon$. The radius $\epsilon$ is relatively large to ensure that the sphere lies outside of the data manifold (Gilmer et al., 2018; Stutz et al., 2019). The adversarial direction is calculated from the gradient of the classification loss.
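A minimal sketch of this generation step in plain Python, with an assumed toy gradient standing in for the true classification-loss gradient $\nabla_x \mathcal{L}$:

```python
import math

def off_manifold_sample(x, grad, eps):
    """Move an ID embedding x a (relatively large) distance eps along the
    adversarial direction, landing on the sphere S(x, eps)."""
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0   # guard against zero gradient
    return [xi + eps * gi / norm for xi, gi in zip(x, grad)]

x = [0.2, -0.5, 1.0]                 # a toy "embedding"
g = [1.0, 2.0, -2.0]                 # assumed gradient of the loss w.r.t. x
x_tilde = off_manifold_sample(x, g, eps=3.0)
# ||x_tilde - x||_2 equals eps, so the sample lies on the sphere S(x, eps)
```

In practice the gradient would come from backpropagation through the embedding layer of the classifier rather than being supplied by hand.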
Off-manifold samples can improve uncertainty estimation in close OOD regions. However, the generalization of adversarial samples relies on the diversity of the features in the training data. Hu et al. (2020) report that adversarial examples are more effective on CIFAR-10 than on SVHN (Netzer et al., 2011), because CIFAR-10 contains more diverse features than SVHN, a dataset of only street numbers. Our empirical observations find that off-manifold samples help when combined with pre-trained transformers, but they do not provide significant improvement for vanilla GRUs/LSTMs. This is consistent with the empirical study (Hendrycks et al., 2020) in which pre-trained transformers outperform vanilla models in generalization towards OOD regions. The embeddings of pre-trained transformers contain rich features that benefit the generated adversarial examples. Thus, following (Kong et al., 2020), we evaluate off-manifold regularization on BERT (Devlin et al., 2018).
3.4. Mixture Regularization
Auxiliary dataset regularization provides an overall calibration improvement, while off-manifold regularization focuses more on the close OOD region. We replace the last term in Eq. (10), which represents the uncertainty regularization for OOD data, with the mixture of Eqs. (11) and (13) to obtain the final objective function:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim P_{\mathrm{in}}}\big[\mathcal{L}(x, y; \theta)\big] + \beta_1 \, \mathbb{E}_{x \sim P_{\mathrm{in}}}\big[vac(x; \theta)\big] - \beta_2 \, \mathbb{E}_{\tilde{x} \sim P_{\mathrm{aux}}}\big[vac(\tilde{x}; \theta)\big] - \beta_3 \, \mathbb{E}_{\tilde{x} \sim P_{\mathrm{adv}}}\big[vac(\tilde{x}; \theta)\big],$$

where $\beta_1$, $\beta_2$, $\beta_3$ denote the weight parameters of each regularization term. The overall framework and the detailed algorithm can be seen in Figure 2 and Algorithm 1. In each iteration, we first minimize the classification loss and the estimated vacuity on ID samples. Then we maximize the vacuity on auxiliary outliers. Finally, we generate off-manifold samples and maximize the vacuity estimation on them.
We conduct OOD detection experiments on a wide range of datasets. In each scenario, we train the model on the ID training set $D_{\mathrm{train}}^{\mathrm{in}}$. We then evaluate the model on the ID test set $D_{\mathrm{test}}^{\mathrm{in}}$ and an OOD test set $D_{\mathrm{test}}^{\mathrm{out}}$ to see whether the model can distinguish between ID and OOD examples. Our experiments consist of three parts: (i) we follow the work in (Hendrycks et al., 2018) to fine-tune a simple two-layer GRU classifier (Cho et al., 2014) using different methods; (ii) we then extend the evaluation to pre-trained language models (BERT), as in (Kong et al., 2020); and (iii) we report the OOD detection performance and illustrate the advantage of evidential uncertainty via the predictive uncertainty distribution.
We follow the same benchmark as (Hendrycks et al., 2018) and use the same three datasets for training and evaluation: (i) 20News refers to the 20 Newsgroups dataset, which contains news articles with 20 categories. (ii) SST denotes the Stanford Sentiment Treebank (Socher et al., 2013), a collection of movie reviews for sentiment analysis. (iii) TREC consists of 5,952 individual questions with 50 classes. Finally, WikiText-2 is a corpus of Wikipedia articles used for language modeling. To compare fairly with (Hendrycks et al., 2018), we also use its sentences as the auxiliary OOD examples during training.
We use the following datasets as OOD test sets: (i) SNLI refers to the hypotheses portion of the SNLI dataset (Bowman et al., 2015) used for natural language inference. (ii) IMDB (Maas et al., 2011) consists of highly polar movie reviews used for sentiment classification. (iii) M30K refers to the English portion of Multi30K (Elliott et al., 2016), a dataset of image descriptions. (iv) WMT16 denotes the English portion of the test set from WMT16. (v) Yelp is a dataset of restaurant reviews.
4.2. Comparing Schemes
We compare several recent methods for quantifying uncertainty or OOD detection in text classification. (i) MSP refers to maximum softmax probability, a baseline for OOD detection (Hendrycks and Gimpel, 2016). (ii) DP refers to Monte Carlo Dropout (Gal and Ghahramani, 2016), which applies dropout at both train and test time; we run it ten times and use the average MSP as the uncertainty score. (iii) TS is a post-hoc calibration method using temperature scaling (Guo et al., 2017); we fine-tune the temperature parameter via the validation set. (iv) MRC denotes Manifold Regularization Calibration (Kong et al., 2020), which adopts on- and off-manifold regularization to improve the calibration of BERT. (v) OE refers to Outlier Exposure (Hendrycks et al., 2018), which enforces uniform confidence on an auxiliary OOD dataset. (vi) ENN (Sensoy et al., 2018) is our base classifier, which uses deep learning models to explicitly model SL uncertainty. Most of the baselines with a softmax function use the negative of the maximum softmax score ($-\max_k p_k$) as the uncertainty score, which behaves similarly to predictive entropy. ENN uses predictive entropy. Our proposed model uses vacuity as the detection score.
Metrics: the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR), and the False Alarm Rate at 90% recall (FAR90). A higher AUROC indicates a higher probability that a positive example receives a higher score than a negative example, which means better detection accuracy. AUPR is similar to AUROC, but it also accounts for the positive class's base rate; higher AUPR is better. FAR90 measures the probability that a negative example raises a false alarm, assuming that 90% of all positive examples are detected; lower FAR90 is better.
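As a reference implementation of the less standard of these metrics, the plain-Python sketch below computes the false alarm rate at a given recall level (OOD examples are treated as positives, and higher scores mean more uncertain):

```python
def far_at_recall(scores_ood, scores_id, recall=0.9):
    """False Alarm Rate at a given recall: pick the uncertainty threshold
    that detects `recall` of the OOD (positive) examples, then report the
    fraction of ID (negative) examples scoring at or above it."""
    pos = sorted(scores_ood)
    idx = min(round((1 - recall) * len(pos)), len(pos) - 1)
    thr = pos[idx]                        # ~`recall` of positives are >= thr
    return sum(s >= thr for s in scores_id) / len(scores_id)
```

With `recall=0.9` this is FAR90; lower values mean fewer ID examples are wrongly flagged as OOD.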
For the GRU experiments, we use the source code of MSP and OE from (Hendrycks et al., 2018). We follow the same pre-processing steps, and the ratio of $D_{\mathrm{test}}^{\mathrm{in}}$ to $D_{\mathrm{test}}^{\mathrm{out}}$ is 1:5 in each scenario. We implement ENN, DP, and our model on the same two-layer GRUs. We pre-train the base classifier for five epochs and fine-tune five more epochs for OE and our model using WikiText-2, except for DP, which we pre-train for ten epochs to reach the same accuracy as the others. We evaluate our model with auxiliary dataset regularization (+OE).
For the experiments on BERT, we follow the same setting as (Kong et al., 2020), which also contains the implementations of multiple baselines. We still set the ratio of $D_{\mathrm{test}}^{\mathrm{in}}$ to $D_{\mathrm{test}}^{\mathrm{out}}$ to 1:5 to be consistent with the previous experiments. We construct sequence classifiers with one linear layer on top of the pooled output of a pre-trained uncased BERT-base model and then fine-tune it with the different methods for ten epochs. We evaluate auxiliary dataset regularization (+OE), adversarial regularization (+AD), and the mixture method (MIX).
We train all the baselines with their default parameters and report the average results. In the GRU experiments, we fine-tune the hyperparameters of our model's Adam optimizer considering the performance of both OOD detection and ID classification accuracy, and use the same setting in all the experiments. For the experiments on BERT, we use the same regularization weights in all +OE and MIX runs and in all +AD runs, and the same Adam optimizer settings in all experiments, but we use a slightly different regularization weight for each ID dataset, fine-tuned considering the accuracy and vacuity on the validation ID set. For more details, refer to Section 4.7 and our source code: https://github.com/snowood1/BERT-ENN.
4.4. Out-of-Distribution Detection
In Table 2, our model on GRU significantly outperforms the other approaches on SST and achieves the overall best results on TREC. The exception is 20News, where OE slightly outperforms ours. One partial explanation is that simple GRUs cannot handle accuracy and uncertainty estimation simultaneously on longer texts: the average accuracy of all the models is only 73%, which indicates that the models have not learned the correct evidence.
Table 3 shows that pre-trained models still suffer from over-confidence. DP does not outperform MSP, which is consistent with (Vernekar et al., 2019) in that MC Dropout only measures uncertainty in ID settings. TS still relies on softmax probability and tunes its temperature parameter on the validation (ID) set, so it does not generalize well to unseen data. Therefore, effective OOD detection models require regularization with OOD examples. OE, which uses a diverse real auxiliary dataset, beats MRC, which adopts adversarial examples, except in the close OOD setting SST vs. IMDB. Our model (MIX) applies both regularizations and beats both of them.
Table 4 further analyzes the contribution of each regularization. Both +OE and +AD improve the performance of the vanilla ENN. +OE outperforms the baseline OE, which indicates the effectiveness of evidential uncertainty when using the same regularization. While +OE provides an overall improvement, +AD is especially effective in distinguishing close OOD examples, e.g., in SST vs. IMDB and SST vs. Yelp, where both cases involve movies or reviews. In sum, applying the mixture of both regularizations achieves the best and most stable overall performance.
4.5. Predictive Uncertainty Distribution
We use boxplots to show the uncertainty distributions of the different models deployed on BERT in Figure 3. The baselines use entropy as the measure of uncertainty. Our proposed model uses vacuity (Vac) and the square root of dissonance (Dis), both ranging over [0, 1]; we also show the entropy of our model (Ent). The top row shows the predictive uncertainty on the ID test set and compares it with that on all the OOD datasets; we concatenate all five OOD datasets as the OOD examples in these experiments. The bottom row shows each model's predictive uncertainty for correct and misclassified examples in the ID test set. OE is the best counterpart in OOD detection; however, OE fails to give a distinct separation between ID and OOD data on SST. Besides, all the counterparts predict high uncertainty for misclassified ID samples, the same as for OOD samples, so they will misclassify some boundary ID samples as OOD samples. In contrast, our model decomposes the uncertainty into vacuity and dissonance: high vacuity is observed only in the OOD region, while boundary ID samples have higher dissonance but low vacuity. This explains the advantage of adopting vacuity for distinguishing between boundary ID and OOD examples.
4.6. Parameter Study
The most important parameters are the perturbation radius $\epsilon$ and the regularization weights. $\epsilon$ greatly influences the performance of adversarial regularization; we find that a single setting of $\epsilon$ achieves the best performance across all of our experiments. Figure 4 shows the FAR90 of our model using off-manifold regularization (+AD) in the scenario SST (ID) vs. IMDB (OOD); we observe the same trend in all the other scenarios. When $\epsilon$ is too small, the generated samples might be too close to the manifold and may harm the confidence in the ID region, while too much perturbation leads to ineffective samples for regularization.
We also compare the effect of the weights of the different regularization terms in the mixture objective. We find that +OE provides an overall improvement in calibration, so we fix its weight $\beta_2$. We try different values of the adversarial weight $\beta_3$ to better distinguish close OOD examples. $\beta_1$ is tuned via the validation ID set within three possible values: 0, 0.01, and 0.1. Since the first term in Eq. (10) already assigns considerable confidence to training samples during the classification process, it also reduces the ID samples' vacuity, and a large $\beta_1$ may also affect the accuracy. Therefore, we only use a small $\beta_1$ to slightly scale the vacuity of ID examples. A summary of the different weights can be seen in Table 5.
5. Related work
Our study is related to uncertainty quantification (Blundell et al., 2015; Gal and Ghahramani, 2016; Sensoy et al., 2018), OOD detection (Hendrycks and Gimpel, 2016; Hendrycks et al., 2018), and confidence calibration (Guo et al., 2017; Thulasidasan et al., 2019; Kong et al., 2020). We have discussed the NLP applications of these fields in the Introduction.
Other baselines not included in our experiments include Deep Ensemble (Lakshminarayanan et al., 2017), which averages the softmax outputs of five models with different initializations. A recent empirical study (Ovadia et al., 2019) shows that Deep Ensemble performs better than Dropout and Temperature Scaling under dataset shift in NLP tasks using LSTMs (Hochreiter and Schmidhuber, 1997). However, fine-tuning multiple pre-trained transformer models is computationally expensive, and the advantage of our considered baseline OE over this method has been reported in (Meinke and Hein, 2019); therefore we do not consider it as a baseline in our paper. Another line of work, stochastic variational Bayesian inference (Blundell et al., 2015; Louizos and Welling, 2017; Wen et al., 2018), can be applied to CNN models but is hard to apply to other architectures such as LSTMs (Ovadia et al., 2019). Sensoy et al. (2018) and Hu et al. (2020) also demonstrate the advantage of ENNs over multiple stochastic variational Bayesian inference methods.
Quantifying uncertainty is essential for reliable classification, but it has received less attention in the NLP domain. We are the first to apply evidential uncertainty based on SL to solve OOD detection in text classification. We combine ENNs and language models to measure vacuity and dissonance. Our proposed model uses auxiliary datasets of outliers and off-manifold samples to train a model with prior knowledge of a certain class, which has high vacuity for OOD samples. Extensive experiments show that our approach significantly outperforms all the counterparts.
Acknowledgements. The research reported herein was supported in part by NSF awards DMS-1737978, DGE-2039542, OAC-1828467, OAC-1931541, and DGE-1906630, ONR awards N00014-17-1-2995 and N00014-20-1-2738, Army Research Office Contract No. W911NF2110032, and an IBM faculty award (Research).
- Word-level uncertainty estimation for black-box text classifiers using RNNs. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 5541–5546.
- Weight uncertainty in neural networks. In ICML, pp. 1613–1622.
- A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
- Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
- Taming pretrained transformers for extreme multi-label text classification. In KDD, pp. 3163–3171.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Multi30K: multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pp. 70–74.
- Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML, pp. 1050–1059.
- Adversarial spheres. arXiv preprint arXiv:1801.02774.
- On calibration of modern neural networks. arXiv preprint arXiv:1706.04599.
- Towards more accurate uncertainty estimation in text classification. In EMNLP, pp. 8362–8372.
- Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In CVPR, pp. 41–50.
- A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
- Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100.
- Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Multidimensional uncertainty-aware evidential neural networks. arXiv preprint arXiv:2012.13676.
- Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.
- Towards robust and discriminative sequential data learning: when and how to perform adversarial training? In KDD, pp. 1665–1673.
- Uncertainty characteristics of subjective opinions. In Fusion, pp. 1998–2005.
- Subjective logic. Springer.
- What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS, pp. 5574–5584.
- Calibrated language model fine-tuning for in- and out-of-distribution data. arXiv preprint arXiv:2010.11506.
- Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, pp. 6402–6413.
- Learning adversarial networks for semi-supervised text classification via policy gradient. In KDD, pp. 1715–1723.
- Multiplicative normalizing flows for variational Bayesian neural networks. In ICML, Vol. 70, pp. 2218–2227.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150.
- Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1.
- Towards neural networks that provably know when they don’t know. arXiv preprint arXiv:1909.12180. Cited by: §5.
- Reading digits in natural images with unsupervised feature learning. Cited by: §3.3.
- Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In NeurIPS, pp. 13991–14002. Cited by: §1, §1, §5.
- Uncertainty-aware deep classifiers using generative models. arXiv preprint arXiv:2006.04183. Cited by: §1, §3.1, §3.2.
- Evidential deep learning to quantify classification uncertainty. In NeurIPS, pp. 3183–3193. Cited by: §3.1, §4.2, §5, §5.
- A mathematical theory of communication. The Bell system technical journal 27 (3), pp. 379–423. Cited by: §1.
Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928. Cited by: §1.
- Deep bayesian active learning for natural language processing: results of a large-scale empirical study. arXiv preprint arXiv:1808.05697. Cited by: §1.
- Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp. 1631–1642. Cited by: §4.1.
Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research15 (1), pp. 1929–1958. Cited by: §1.
- Disentangling adversarial robustness and generalization. In CVPR, pp. 6976–6987. Cited by: §1, §3.3.
- Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
- On mixup training: improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, pp. 13888–13899. Cited by: §1, §5.
Predictive uncertainty for probabilistic novelty detection in text classification. In Proceedings ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning, Cited by: §1.
- Out-of-distribution detection in classifiers via generation. arXiv preprint arXiv:1910.04241. Cited by: §4.4.
- Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125. Cited by: §1.
- Flipout: efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386. Cited by: §5.
- Quantifying uncertainties in natural language processing tasks. In AAAI, Vol. 33, pp. 7322–7329. Cited by: §1.
- Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §1.
- Mitigating uncertainty in document classification. arXiv preprint arXiv:1907.07590. Cited by: §1.
- Quantifying classification uncertainty using regularized evidential neural networks. arXiv preprint arXiv:1910.06864. Cited by: §1.
- Out-of-domain detection for natural language understanding in dialog systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1198–1209. Cited by: §1.
Uncertainty-aware curriculum learning for neural machine translation. In ACL, pp. 6934–6944. Cited by: §1.