Out-of-distribution (OOD) detection has become an important topic for deep learning applications, since deep neural networks (DNNs) are known to be over-confident on abnormal samples, such as unrecognizable images (Nguyen et al., 2015) and out-of-distribution samples (Hendrycks and Gimpel, 2017). OOD detection is closely tied to safety, because test samples cannot be controlled once DNNs are deployed in the real world.
To prevent DNNs from making over-confident predictions on OOD samples, post-training with outlier samples is commonly used. Fine-tuning with a few selected OOD samples (Hendrycks et al., 2019) or adversarial noise (Hein et al., 2019) can improve the detection of unseen OOD samples.
Meanwhile, soft labeling has become a common output-regularization trick for training DNNs for various purposes. For example, label smoothing (Szegedy et al., 2016) improves the test accuracy of DNNs by preventing overfitting (He et al., 2019; Müller et al., 2019). Knowledge distillation (Hinton et al., 2015), a kind of soft labeling (Yuan et al., 2019), can compress a teacher model into a smaller student, or improve the accuracy of its student networks (Xie et al., 2019).
Despite the popularity of soft labeling, how it affects the OOD detection of DNNs has not been explored. In this study, we assume that regularizing predictions on incorrect classes by soft labeling determines the OOD detection performance of DNNs. We analyze and empirically verify this assumption based on two major results: (a) label smoothing deteriorates the OOD detection of DNNs, and (b) soft labels generated by a teacher model distill its OOD detection performance into its student models. In particular, the test accuracy that a teacher model loses through outlier exposure is recovered or even improved in its student models, while the high OOD detection performance is preserved.
Based on the empirical results, we claim that a lottery ticket of soft labeling for OOD-robust DNNs exists, and that how to regularize the predictions of DNNs on incorrect classes is a compelling direction for future work on the generalization of DNNs, not only to unseen in-distribution (ID) samples but also to OOD samples.
2.1 Outlier Exposure
Outlier exposure (Hendrycks et al., 2019) fine-tunes a model with some OOD samples so that it predicts a uniform distribution for the OOD training samples:

$$\mathcal{L}_{OE} = H(y, p_{in}) + \lambda H(u, p_{out}) \qquad (1)$$

where $p_{in}$ is the prediction for an ID sample, $p_{out}$ is the prediction for an OOD sample, $y$ is a one-hot represented ground truth, $\lambda$ is a hyper-parameter, $H$ is cross-entropy, and $u$ is the uniform distribution over all classes.
Despite a significant improvement in OOD detection, training with additional OOD samples has two drawbacks. First, the original test accuracy is often degraded after outlier exposure, as a trade-off between OOD detection and the original task. Second, we cannot cover all possible OOD samples in training, because there are infinitely many of them.
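The outlier exposure objective described above can be sketched in a few lines. This is a minimal, per-sample illustration under stated assumptions, not the official implementation: the function names, the list-based representation of logits, and the default weight `lam=0.5` are hypothetical choices made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def outlier_exposure_loss(id_logits, id_label, ood_logits, lam=0.5):
    """Cross-entropy on an ID sample plus a weighted uniform-target term on an OOD sample.

    `lam` plays the role of the hyper-parameter in the loss; 0.5 is only an
    illustrative default (assumption).
    """
    num_classes = len(id_logits)
    one_hot = [1.0 if k == id_label else 0.0 for k in range(num_classes)]
    uniform = [1.0 / num_classes] * num_classes
    id_term = cross_entropy(one_hot, softmax(id_logits))
    ood_term = cross_entropy(uniform, softmax(ood_logits))
    return id_term + lam * ood_term
```

The OOD term is smallest when the model outputs flat logits for the outlier, which is exactly the behavior the fine-tuning is meant to encourage.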
2.2 Soft Labeling as an Output Regularization
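Given a one-hot represented ground truth $y$ of a training sample $x$, soft labeling is defined as

$$y^{S} = (1-\alpha)\, y + \alpha\, s \qquad (2)$$

where $\alpha$ is a hyper-parameter for soft labeling, $s$ is a soft target that satisfies $\sum_{k=1}^{K} s_k = 1$ and $s_k \geq 0$, and $K$ is the number of classes. Then, with $p$ denoting the prediction for $x$, the training loss with soft labeling is

$$\mathcal{L}_{S} = H(y^{S}, p) = (1-\alpha)\, H(y, p) + \alpha\, H(s, p) \qquad (3)$$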
Note that soft labeling regularizes the predictions on all classes, including the incorrect ones.
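To make the role of the soft target concrete, the sketch below mixes a one-hot label with an arbitrary soft target and evaluates the cross-entropy against a prediction. Because cross-entropy is linear in its target, the loss splits into a ground-truth term and a regularization term on the soft target. The helper names and the list-based representation are assumptions made for illustration.

```python
import math

def cross_entropy(target, pred):
    """H(target, pred) = -sum_k target_k * log(pred_k); pred entries must be positive."""
    return -sum(t * math.log(p) for t, p in zip(target, pred))

def soft_label(one_hot, soft_target, alpha):
    """Convex combination of the one-hot label and a soft target."""
    return [(1.0 - alpha) * y + alpha * s for y, s in zip(one_hot, soft_target)]

def soft_label_loss(one_hot, soft_target, pred, alpha):
    """Cross-entropy between the mixed label and the prediction."""
    return cross_entropy(soft_label(one_hot, soft_target, alpha), pred)

# Illustrative values (assumptions): three classes, true class is index 1.
y = [0.0, 1.0, 0.0]
s = [0.2, 0.5, 0.3]
p = [0.1, 0.7, 0.2]
alpha = 0.3

mixed = soft_label_loss(y, s, p, alpha)
split = (1.0 - alpha) * cross_entropy(y, p) + alpha * cross_entropy(s, p)
# `mixed` and `split` agree up to floating-point error.
```

The second term is the only place where the incorrect classes receive a non-zero target, which is why the choice of soft target matters for what the model learns about them.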
The training objectives of both label smoothing (Szegedy et al., 2016) and knowledge distillation (Hinton et al., 2015) are represented by Eq. (3) (Yuan et al., 2019). Label smoothing is a soft labeling that regularizes DNNs to predict a uniform distribution $u$ over all classes:

$$\mathcal{L}_{LS} = (1-\alpha)\, H(y, p) + \alpha\, H(u, p) \qquad (4)$$
In knowledge distillation, the soft target of a student model consists of the prediction of its teacher model:

$$\mathcal{L}_{KD} = (1-\alpha)\, H(y, p) + \alpha\, H(p^{t}, p) \qquad (5)$$

where $p^{t}$ is the prediction of the teacher model. Knowledge distillation is a kind of output regularization of student models by the teacher's predictions (Yuan et al., 2019), used for model compression (Hinton et al., 2015) or generalization (Xie et al., 2019).
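The two soft-labeling variants differ only in where the soft target comes from. The following sketch builds both training targets. It deliberately omits the temperature scaling that is common in knowledge distillation, and the function names are hypothetical.

```python
def label_smoothing_target(one_hot, alpha):
    """Mix the one-hot label with the uniform distribution over all classes."""
    num_classes = len(one_hot)
    return [(1.0 - alpha) * y + alpha / num_classes for y in one_hot]

def distillation_target(one_hot, teacher_pred, alpha):
    """Mix the one-hot label with the teacher's predicted distribution."""
    return [(1.0 - alpha) * y + alpha * t for y, t in zip(one_hot, teacher_pred)]

# Illustrative values (assumptions): four classes, true class is index 1.
y = [0.0, 1.0, 0.0, 0.0]
smoothed = label_smoothing_target(y, alpha=0.1)

# A teacher that predicts the uniform distribution reproduces label smoothing.
uniform_teacher = [0.25, 0.25, 0.25, 0.25]
distilled_uniform = distillation_target(y, uniform_teacher, alpha=0.1)

# A non-uniform teacher assigns different targets to different incorrect classes.
teacher = [0.05, 0.80, 0.10, 0.05]
distilled = distillation_target(y, teacher, alpha=0.5)
```

With a uniform teacher the two targets coincide, so label smoothing can be read as distillation from a teacher that carries no information about which incorrect classes are more plausible.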
2.3 Experimental Setting
In this paper, we train WRN-40-2 (Zagoruyko and Komodakis, 2016) on the SVHN, CIFAR-10, and CIFAR-100 datasets (ID). We follow the experimental setting in the official code of outlier exposure (https://github.com/hendrycks/outlier-exposure), except that we train for 150 epochs. In addition, we follow the hyper-parameter settings of knowledge distillation in (Müller et al., 2019). For the evaluation of OOD detection, we use the MNIST, Fashion-MNIST, SVHN (or CIFAR-10), LSUN, and TinyImageNet datasets as OOD samples, and AUROC as the evaluation measure.
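As an illustration of the evaluation measure, the sketch below computes AUROC from per-sample detection scores. It assumes the maximum softmax probability as the score, which is a common baseline but is not stated in the text above, and it treats ID samples as the positive class. The pairwise formulation is quadratic in the number of samples and is meant for clarity rather than speed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_probability(logits):
    """Detection score (assumption): higher means 'more likely in-distribution'."""
    return max(softmax(logits))

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random OOD sample.

    Ties count as one half, which matches the area under the ROC curve.
    """
    wins = 0.0
    for s_in in id_scores:
        for s_out in ood_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Illustrative logits (assumptions): confident on ID inputs, nearly flat on OOD inputs.
id_scores = [max_softmax_probability(z) for z in ([5.0, 0.0, 0.0], [0.0, 4.0, 0.0])]
ood_scores = [max_softmax_probability(z) for z in ([0.1, 0.0, 0.0], [0.0, 0.0, 0.2])]
detection_auroc = auroc(id_scores, ood_scores)
```

An AUROC of 0.5 means the score cannot separate ID from OOD samples at all, while 1.0 means every ID sample scores above every OOD sample.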
3 Soft Labeling Affects OOD Detection
3.1 Label Smoothing and OOD Detection
Figure 1 shows the effects of label smoothing with different $\alpha$ on test accuracy, expected calibration error (ECE) (Guo et al., 2017), and detection of the OOD datasets. As shown in (Lukasik et al., 2020), ECE starts to increase when the amount of label smoothing exceeds the optimal value (red dotted lines). Although the test accuracy on CIFAR-10 deteriorates for some $\alpha$, the test accuracy is always improved when ECE is minimized (Müller et al., 2019).
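For reference, ECE can be computed by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin. The sketch below is a minimal version with equal-width bins. The bin count of 10 and the function name are assumptions, not the setting used for Figure 1.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: predicted confidence in [0, 1] for each sample.
    correct: booleans, True where the prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that is right three times out of four at confidence 0.75 has an ECE of zero under this definition, whereas one that is right half the time at confidence 0.95 has an ECE of roughly 0.45.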
Even though label smoothing improves test accuracy and ECE (Müller et al., 2019), it makes DNNs vulnerable to out-of-distribution samples and unable to distinguish ID from OOD datasets. Label smoothing always deteriorates OOD detection regardless of the magnitude of $\alpha$, and a larger $\alpha$ results in more degradation. In particular, the WRN models trained on SVHN and CIFAR-10 show significant AUROC drops once the ECE starts to increase (after the red dotted line).
We can explain why label smoothing hurts the OOD detection of DNNs from two perspectives. First, combining Eq. (1) and (4), we can interpret the output regularization of label smoothing as outlier exposure applied to ID samples. Label smoothing can thus deteriorate OOD detection by making DNNs unable to discriminate OOD samples from ID samples. As $\alpha$ increases, the effect of the output regularization in Eq. (4) grows and, acting as outlier exposure on the ID datasets, further deteriorates OOD detection performance.
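The correspondence between label smoothing and outlier exposure can be written out explicitly. The derivation below is a sketch under assumed notation: $p$ is the prediction for an ID sample, $y$ its one-hot label, $u$ the uniform distribution over all classes, $H$ cross-entropy, $\alpha \in (0, 1)$ the label smoothing hyper-parameter, and $\lambda$ the outlier exposure weight.

```latex
\begin{aligned}
\mathcal{L}_{LS}
  &= (1-\alpha)\, H(y, p) + \alpha\, H(u, p) \\
  &= (1-\alpha) \left[ H(y, p) + \frac{\alpha}{1-\alpha}\, H(u, p) \right] \\
  &= (1-\alpha) \left[ H(y, p_{in}) + \lambda\, H(u, p_{out}) \right]
     \quad \text{with } p_{out} = p_{in} = p,\ \lambda = \frac{\alpha}{1-\alpha}.
\end{aligned}
```

Up to the positive factor $(1-\alpha)$, label smoothing is therefore the outlier exposure objective in which each ID sample also plays the role of the outlier. Since $\alpha / (1-\alpha)$ grows monotonically with $\alpha$, a larger $\alpha$ corresponds to a stronger push toward uniform predictions on ID samples.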
Knowledge distillation offers the second perspective on the negative effect of label smoothing on OOD detection. Note that label smoothing is knowledge distillation from a teacher model that has perfectly learned to treat the ID samples as OOD and predicts a uniform distribution for all ID samples. Thus, we assume that the soft labels of incorrect classes generated by a teacher model determine the OOD detection performance of its student model, and we empirically verify this assumption in Section 3.2.
3.2 Knowledge Distillation and OOD Detection
In this section, we show that OOD detection performance is determined by the soft labels. Specifically, soft labels that are generated by a teacher model determine the performance of its student model. Figure 2 shows the OOD detection performance of teacher models and their student models in various settings. For a student model, we use the same architecture (WRN-40-2) as its teacher, because our concern is to analyze the effects of soft labeling, not model compression.
Figure 2 (top) shows the OOD detection AUROCs of the WRN-40-2 models (SVHN, CIFAR-10, and CIFAR-100) and their student models. A teacher and its student model have similar AUROCs regardless of the OOD test dataset.
In Figure 2 (bottom), we fine-tune the teacher models with various OOD samples (MNIST, TinyImageNet, and MNIST+TinyImageNet) to improve OOD detection by outlier exposure. The OOD detection of the teacher models improves on different OOD datasets, depending on the exposed OOD samples. We find that the OOD detection performance of the student models is always consistent with that of their teacher models (OE), regardless of the choice of training OOD samples for the teacher.
Notably, when we use MNIST+TinyImageNet for outlier exposure of the teacher model, both the teacher and its student detect the test OOD samples almost perfectly. Exposing a model to a wide variety of OOD samples at training time is unrealistic, because there are infinitely many kinds of OOD samples. However, the experimental results are worth noting, because the student model is trained only on ID samples with soft labels, and no OOD sample is directly used to train it. Although how to generate such soft labels without a perfectly OOD-robust teacher remains an open question, the results show that there exist soft labels that make DNNs robust to various OOD datasets without OOD training.
OOD detection performance is also distilled into a student model whose architecture differs from that of its teacher. In Table 2, we use DenseNet (Huang et al., 2017) with 40 hidden layers and a growth rate of 12 as the student model of WRN-40-2. Note that the number of trainable parameters of DenseNet (1.1 M) is half that of WRN-40-2 (2.2 M). Even though the size and architecture of the student differ from those of its teacher, the OOD detection AUROCs of the teacher and student are consistent. The results imply that the effect of soft labeling on OOD detection is model-agnostic. Hence, if we find a soft labeling method for OOD-robust DNNs, it can be used generally across DNN architectures.
Orthogonal to OOD detection, one disadvantage of post-training with OOD samples is the degradation of the original classification accuracy (Hendrycks et al., 2019; Hein et al., 2019). However, we find that both the test accuracy and the ECE of the student models (+OD) are similar to or better than those of the original model before outlier exposure (baseline) (Table 1). The improvement in test accuracy results from soft labeling, because soft labels help a model avoid overfitting regardless of the type of soft labeling (Yuan et al., 2019).
In this study, we show that soft labeling of incorrect classes is closely linked with the OOD detection of DNNs. Note that the student models in Figure 2 are trained without any OOD sample, yet can achieve almost perfect OOD detection AUROCs. The results verify that constructing OOD-robust DNNs is possible without modifying the model or post-training with OOD samples.
The limitation of our study is that a concrete soft labeling method for OOD-robust DNNs is not yet known and remains an open question. Instead, we focus on showing that such soft labeling exists.
We postulate that finding an output regularization of incorrect classes that makes DNNs robust to unseen OOD samples is possible and worth exploring in future work. Note that proper soft labeling can improve not only OOD detection, but also the classification accuracy on unseen ID samples and confidence calibration (Table 1). In addition, OOD-robust soft labeling is model-agnostic and can be applied to various model architectures.
- Guo et al. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1321–1330.
- He et al. (2019). Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567.
- Hein et al. (2019). Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–50.
- Hendrycks and Gimpel (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations.
- Hendrycks et al. (2019). Deep anomaly detection with outlier exposure. In International Conference on Learning Representations.
- Hinton et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Huang et al. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- Lukasik et al. (2020). Does label smoothing mitigate label noise? arXiv preprint arXiv:2003.02819.
- Müller et al. (2019). When does label smoothing help? In Advances in Neural Information Processing Systems, pp. 4696–4705.
- Nguyen et al. (2015). Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436.
- Szegedy et al. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- Xie et al. (2019). Self-training with noisy student improves ImageNet classification. arXiv preprint arXiv:1911.04252.
- Yuan et al. (2019). Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723.
- Zagoruyko and Komodakis (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.