Soft Labeling Affects Out-of-Distribution Detection of Deep Neural Networks

07/07/2020 ∙ by Doyup Lee, et al.

Soft labeling has become a common output regularization for the generalization and model compression of deep neural networks (DNNs). However, the effect of soft labeling on out-of-distribution (OOD) detection, an important topic in machine learning safety, has not been explored. In this study, we show that soft labeling can determine OOD detection performance. Specifically, the way soft labeling regularizes the outputs for incorrect classes can deteriorate or improve OOD detection. Based on the empirical results, we postulate a direction of future work for OOD-robust DNNs: a proper output regularization by soft labeling can construct OOD-robust DNNs without additional training on OOD samples or modification of the models, while improving classification accuracy.






1 Introduction

Out-of-distribution (OOD) detection has become an important topic for deep learning applications since deep neural networks (DNNs) were found to be over-confident on abnormal samples, both unrecognizable ones (Nguyen et al., 2015) and ones drawn from out-of-distribution (Hendrycks and Gimpel, 2017). OOD detection is closely tied to safety, because test samples cannot be controlled once DNNs are deployed in the real world.

To prevent DNNs from making over-confident predictions on OOD samples, post-training with outlier samples is commonly used. Fine-tuning with a few selected OOD samples (Hendrycks et al., 2019) or adversarial noise (Hein et al., 2019) can improve the detection of unseen OOD samples.

Meanwhile, soft labeling has become a common output regularization trick for training DNNs for various purposes. For example, label smoothing (Szegedy et al., 2016) improves the test accuracy of DNNs by preventing overfitting (He et al., 2019; Müller et al., 2019). Knowledge distillation (Hinton et al., 2015), a kind of soft labeling (Yuan et al., 2019), can compress a teacher model into a smaller student, or improve the accuracy of its student networks (Xie et al., 2019).

Despite the popularity of soft labeling, how soft labeling affects OOD detection of DNNs has not been explored. In this study, we assume that regularizing predictions on incorrect classes by soft labeling determines OOD detection performance of DNNs. We analyze and empirically verify our assumption, based on two major results: a) label smoothing deteriorates OOD detection of DNNs, and b) soft labels, generated by a teacher model, distill OOD detection performance into its student models. In particular, the degraded test accuracy of a teacher model with outlier exposure is recovered or improved in its student models, while conserving the high performance of OOD detection.

Based on the empirical results, we claim that a lottery ticket of soft labeling for OOD-robust DNNs exists, and how to regularize the predictions of DNNs on incorrect classes is a compelling direction of future work for generalization of DNNs not only on unseen in-distribution (ID) samples, but also on OOD samples.

2 Preliminaries

2.1 Outlier Exposure

Outlier exposure (Hendrycks et al., 2019) finetunes a model with some OOD samples so that it predicts a uniform distribution for the OOD training samples:

$$\mathcal{L}_{OE} = H(y, p) + \lambda H(U, p_{ood}) \qquad (1)$$

where $p$ is the prediction for an ID sample, $p_{ood}$ is the prediction for an OOD sample, $y$ is the one-hot ground truth, $\lambda$ is a hyper-parameter, $H$ is cross-entropy, and $U$ is the uniform distribution over all classes.
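The objective described above can be sketched for a single ID sample and a single OOD sample. This is a minimal illustration in plain Python, not the official outlier exposure code; the function names and the default `lam=0.5` are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))

def outlier_exposure_loss(id_logits, id_label, ood_logits, lam=0.5):
    """Cross-entropy on the ID sample plus lam times the cross-entropy
    between the uniform distribution and the OOD prediction.
    lam is a hypothetical default chosen for illustration."""
    num_classes = len(id_logits)
    one_hot = [1.0 if k == id_label else 0.0 for k in range(num_classes)]
    uniform = [1.0 / num_classes] * num_classes
    id_term = cross_entropy(one_hot, softmax(id_logits))
    ood_term = cross_entropy(uniform, softmax(ood_logits))
    return id_term + lam * ood_term

# A confident prediction on the OOD sample is penalized more than a flat one.
flat = outlier_exposure_loss([5.0, 0.0, 0.0], 0, [0.0, 0.0, 0.0])
peaked = outlier_exposure_loss([5.0, 0.0, 0.0], 0, [5.0, 0.0, 0.0])
```

The OOD term is smallest when the prediction on the OOD sample is exactly uniform, which is the behavior the finetuning is meant to encourage.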

Despite a significant improvement of OOD detection, training on additional OOD samples has two drawbacks. First, the original test accuracy is often degraded after outlier exposure, as a trade-off between OOD detection and the original task. Second, we cannot consider all possible OOD samples in training, because there are infinitely many OOD samples.

In this study, soft labeling prevents the degradation of classification accuracy and often improves the test accuracy (Table 1). In addition, we show that soft labels can make DNNs robust to OOD samples without any OOD training sample or model modification (Figure 2).

2.2 Soft Labeling as an Output Regularization

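Given the one-hot ground truth $y$ of a training sample $x$, soft labeling is defined as

$$y^{s} = (1-\alpha)\, y + \alpha\, s \qquad (2)$$

where $\alpha$ is a hyper-parameter for soft labeling, $s$ is a soft target that satisfies $s_i \geq 0$ and $\sum_{i=1}^{K} s_i = 1$, and $K$ is the number of classes. Then, the training loss with soft labeling is

$$\mathcal{L} = H(y^{s}, p) = (1-\alpha) H(y, p) + \alpha H(s, p) \qquad (3)$$

where $p$ is the prediction of the model for $x$ and $H$ is cross-entropy. Note that soft labeling regularizes the predictions for all classes, including the incorrect ones.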
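As a small numerical check of the definition above, the sketch below assumes the soft label is the convex combination (1 - alpha) * y + alpha * s and confirms that the cross-entropy against the mixed target equals the same mixture of the two cross-entropy terms. The vectors `y`, `s`, `p` and the value of `alpha` are made-up illustrative numbers.

```python
import math

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))

def soft_label(one_hot, soft_target, alpha):
    """Mix the one-hot label with a soft target (assumed convex combination)."""
    return [(1.0 - alpha) * y + alpha * s for y, s in zip(one_hot, soft_target)]

y = [0.0, 1.0, 0.0]   # one-hot ground truth (class 1 is correct)
s = [0.2, 0.5, 0.3]   # an arbitrary soft target that sums to 1
p = [0.1, 0.7, 0.2]   # a model prediction
alpha = 0.1

y_soft = soft_label(y, s, alpha)

# Cross-entropy is linear in the target, so both forms of the loss agree.
loss_mixed_target = cross_entropy(y_soft, p)
loss_mixed_terms = (1.0 - alpha) * cross_entropy(y, p) + alpha * cross_entropy(s, p)
```

Because `s` puts mass on the incorrect classes 0 and 2, the second term penalizes the model for how it distributes probability over those classes, which is the part of the loss this study is concerned with.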

The training objectives of both label smoothing (Szegedy et al., 2016) and knowledge distillation (Hinton et al., 2015) can be written in the form of Eq (3) (Yuan et al., 2019). Label smoothing (Szegedy et al., 2016) is a soft labeling that regularizes DNNs to predict a uniform distribution $U$ over all classes:

$$\mathcal{L}_{LS} = (1-\alpha) H(y, p) + \alpha H(U, p) \qquad (4)$$
In knowledge distillation, the soft target of a student model is the prediction of its teacher model:

$$\mathcal{L}_{KD} = (1-\alpha) H(y, p) + \alpha H(p^{t}, p) \qquad (5)$$

where $p^{t}$ is the prediction of the teacher model. Knowledge distillation is a kind of output regularization of student models by the teacher’s predictions (Yuan et al., 2019) for model compression (Hinton et al., 2015) or generalization (Xie et al., 2019).
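The two soft labeling schemes differ only in where the soft target comes from. The sketch below, which assumes the same convex-combination form of soft label and uses hypothetical helper names, builds both targets and shows that a teacher predicting the uniform distribution yields exactly the label smoothing target.

```python
def mix(one_hot, soft_target, alpha):
    """Convex combination of the one-hot label and a soft target."""
    return [(1.0 - alpha) * y + alpha * s for y, s in zip(one_hot, soft_target)]

def label_smoothing_target(one_hot, alpha):
    """Soft target is the uniform distribution over all classes."""
    k = len(one_hot)
    return mix(one_hot, [1.0 / k] * k, alpha)

def distillation_target(one_hot, teacher_probs, alpha):
    """Soft target is the teacher's predicted distribution."""
    return mix(one_hot, teacher_probs, alpha)

one_hot = [0.0, 1.0, 0.0, 0.0]
smoothed = label_smoothing_target(one_hot, alpha=0.1)

# A teacher that ranks the incorrect classes differently from one another.
teacher = [0.05, 0.80, 0.10, 0.05]
distilled = distillation_target(one_hot, teacher, alpha=0.1)

# A teacher that outputs the uniform distribution reproduces label smoothing.
uniform_teacher = [0.25, 0.25, 0.25, 0.25]
same_as_smoothing = distillation_target(one_hot, uniform_teacher, alpha=0.1)
```

Label smoothing treats every incorrect class identically, while a trained teacher passes on its own relative confidences for the incorrect classes.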

2.3 Experimental Setting

In this paper, we train WRN-40-2 (Zagoruyko and Komodakis, 2016) with the SVHN, CIFAR-10, and CIFAR-100 datasets (ID). We follow the experimental setting in the official code of outlier exposure, except that we use 150 epochs for training. In addition, we follow the hyper-parameter settings of knowledge distillation in (Müller et al., 2019). For the evaluation of OOD detection, we use the MNIST, Fashion-MNIST, SVHN (or CIFAR-10), LSUN, and TinyImageNet datasets as OOD samples, and AUROC as the evaluation measure.
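For readers unfamiliar with the measure, the sketch below computes AUROC from per-sample detection scores. It assumes the score is the maximum softmax probability (the baseline of Hendrycks and Gimpel, 2017) and that ID samples are the positive class; the text does not state which score is used, so both choices are assumptions, as are the function names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_score(logits):
    """Detection score: higher means 'more likely in-distribution' (assumed)."""
    return max(softmax(logits))

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random
    OOD sample, counting ties as one half. O(n*m), fine for a sketch."""
    wins = 0.0
    for a in id_scores:
        for b in ood_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Toy logits: the model is confident on ID inputs and flatter on OOD inputs.
id_scores = [max_softmax_score(z) for z in ([6.0, 0.0, 0.0], [0.0, 4.0, 1.0])]
ood_scores = [max_softmax_score(z) for z in ([1.0, 1.0, 1.0], [2.0, 1.0, 1.5])]
toy_auroc = auroc(id_scores, ood_scores)
```

An AUROC of 1.0 means every ID sample receives a higher score than every OOD sample, and 0.5 corresponds to chance-level separation.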

| ID Dataset | Metric | Baseline | +OE | +OD |
|---|---|---|---|---|
| SVHN | Acc | 97.02 | 96.82 | 97.17 |
| SVHN | ECE | 2.38 | 2.65 | 2.28 |
| CIFAR-10 | Acc | 95.12 | 94.74 | 95.10 |
| CIFAR-10 | ECE | 3.85 | 4.07 | 3.49 |
| CIFAR-100 | Acc | 76.63 | 75.58 | 76.80 |
| CIFAR-100 | ECE | 12.06 | 14.79 | 10.61 |

Table 1: Test accuracy and expected calibration error (ECE) of WideResNet (Baseline) trained with SVHN, CIFAR-10, and CIFAR-100. TinyImageNet is used as the OOD training data for OE (Outlier Exposure). OD (Outlier Distillation) denotes the student model of the OE model.

3 Soft Labeling Affects OOD Detection

MNIST 84.28 90.95 93.30
Fashion-MNIST 95.16 96.03 96.61
SVHN 94.19 94.50 94.19
LSUN 99.99 99.94 99.94
TinyImageNet 99.99 99.78 99.83
Table 2: OOD detection performance of outlier exposure and outlier distillation. WRN-OE, which is finetuned with TinyImageNet as OOD, is used as the teacher model for the two student models (WRN and DenseNet).
Figure 1: Test accuracy and expected calibration error (top) and OOD detection AUROC (bottom) of WRN, trained with SVHN (left), CIFAR-10 (middle), and CIFAR-100 (right) respectively. The red dotted line marks the label smoothing value that minimizes ECE. OOD detection deteriorates steadily as label smoothing increases. When ECE starts to increase (after the red dotted line), dramatic drops of AUROC appear for the models trained with SVHN (ID) and CIFAR-10 (ID).
Figure 2: OOD detection AUROCs of a teacher model and its student model. (Top) teacher models trained with the SVHN, CIFAR-10, and CIFAR-100 datasets and their student models. (Bottom) teacher models of CIFAR-10 are finetuned with MNIST, TinyImageNet, and MNIST+TinyImageNet by outlier exposure. WRN-40-2 is used as the model architecture of both teacher and student models.

3.1 Label Smoothing and OOD Detection

Figure 1 shows the effects of label smoothing with different $\alpha$ on test accuracy, expected calibration error (ECE) (Guo et al., 2017), and detection of the OOD datasets. As shown in (Lukasik et al., 2020), ECE starts to increase when the label smoothing is larger than the optimal value (red dotted lines). Although the test accuracy on CIFAR-10 deteriorates for a particular $\alpha$, the test accuracy is always improved when ECE is minimized (Müller et al., 2019).
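Since ECE is central to this comparison, a minimal sketch of how it is commonly computed (following the binning scheme of Guo et al., 2017) may help. The number of bins (15), the use of the maximum softmax probability as the confidence, and the function name are assumptions for illustration; the text does not state the binning used.

```python
def expected_calibration_error(confidences, correct, num_bins=15):
    """Weighted average, over equal-width confidence bins, of the gap
    between accuracy and mean confidence within each bin.

    confidences: predicted-class probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Fully confident but right only half of the time: a calibration gap of 0.5.
overconfident = expected_calibration_error([1.0, 1.0], [True, False])

# 90% confidence with 90% accuracy: well calibrated, gap close to zero.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
```

The returned value is a fraction in [0, 1]; multiplying it by 100 gives a percentage.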

Even though label smoothing improves test accuracy and ECE (Müller et al., 2019), it makes DNNs vulnerable to out-of-distribution samples and unable to distinguish ID from OOD datasets. Label smoothing always deteriorates OOD detection regardless of the magnitude of $\alpha$, and a larger $\alpha$ results in more degradation of OOD detection. In particular, WRN models trained with SVHN and CIFAR-10 show significant AUROC drops once the ECE starts to increase (after the red dotted line).

We can infer why label smoothing hurts the OOD detection of DNNs from two perspectives. First, combining Eq (1) and (4), we can interpret the output regularization of label smoothing as an outlier exposure of ID samples. Label smoothing can therefore deteriorate the OOD detection of DNNs by making them unable to discriminate OOD samples from ID samples. When the magnitude of $\alpha$ increases, the effect of the output regularization in Eq (4) increases and deteriorates the OOD detection performance, acting as outlier exposure on the ID datasets.
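The first perspective can be made explicit with a short derivation. It assumes the usual forms of the two losses, namely $(1-\alpha) H(y, p) + \alpha H(U, p)$ for label smoothing and $H(y, p) + \lambda H(U, p_{ood})$ for outlier exposure, with $H$ the cross-entropy, $U$ the uniform distribution, $p$ the prediction for an ID sample, and $0 \le \alpha < 1$:

```latex
\begin{align}
\mathcal{L}_{LS}
  &= (1-\alpha)\, H(y, p) + \alpha\, H(U, p) \\
  &= (1-\alpha) \left[ H(y, p) + \frac{\alpha}{1-\alpha}\, H(U, p) \right] \\
  &= (1-\alpha)\, \mathcal{L}_{OE}
     \quad \text{with } \lambda = \frac{\alpha}{1-\alpha}
     \text{ and } p_{ood} = p .
\end{align}
```

Up to the positive factor $(1-\alpha)$, label smoothing is thus an outlier exposure objective in which the ID sample itself plays the role of the outlier, and the effective weight $\lambda = \alpha / (1-\alpha)$ grows monotonically with $\alpha$.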

Meanwhile, knowledge distillation offers another view of the negative effect of label smoothing on OOD detection. Note that label smoothing is knowledge distillation with a teacher model that has perfectly learned the ID samples as OOD and predicts a uniform distribution for all ID samples. Thus, we assume that the soft labels of incorrect classes generated by a teacher model determine the OOD detection performance of its student model, and we empirically verify this assumption in Section 3.2.

3.2 Knowledge Distillation and OOD Detection

In this section, we show that OOD detection performance is determined by the soft labels. Specifically, the soft labels generated by a teacher model determine the performance of its student model. Figure 2 shows the OOD detection performance of teacher models and their student models in various settings. For a student model, we use the same architecture (WRN-40-2) as its teacher, because our concern is to analyze the effects of soft labeling, not model compression.
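One way to see how a soft label shapes the student's output on incorrect classes is through the gradient of the cross-entropy with respect to the logits, which is the standard result softmax(z) - target when the target sums to one. The sketch below is an illustration of that general fact, not an analysis taken from the experiments; it checks the identity numerically using hypothetical helper names and made-up numbers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))

def grad_wrt_logits(target, logits):
    """Analytic gradient of H(target, softmax(logits)): softmax - target."""
    return [p - t for p, t in zip(softmax(logits), target)]

def numeric_grad(target, logits, h=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        up[k] += h
        down = list(logits)
        down[k] -= h
        diff = cross_entropy(target, softmax(up)) - cross_entropy(target, softmax(down))
        grads.append(diff / (2.0 * h))
    return grads

soft_target = [0.7, 0.2, 0.1]    # e.g. a mix of a one-hot label and a teacher
student_logits = [2.0, -1.0, 0.5]

analytic = grad_wrt_logits(soft_target, student_logits)
numeric = numeric_grad(soft_target, student_logits)

# The gradient vanishes only when the student reproduces the soft target.
matched_logits = [math.log(t) for t in soft_target]
at_optimum = grad_wrt_logits(soft_target, matched_logits)
```

The loss is minimized only when the student's probabilities equal the soft target on every class, so whatever the teacher assigns to the incorrect classes is transferred to the student.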

Figure 2 (top) shows OOD detection AUROCs of the WRN-40-2 models (SVHN, CIFAR-10, and CIFAR-100), and their student models. The teacher and its student model have similar AUROCs regardless of test datasets (OOD).

In Figure 2 (bottom), we finetune the teacher models with various OOD samples (MNIST, TinyImageNet, and MNIST+TinyImageNet) to improve OOD detection by outlier exposure. The OOD detection of the teacher models is improved in different OOD datasets, according to the exposed OOD samples. We find that OOD detection performance of student models is always consistent with their teacher models (OE), regardless of the choice of training OOD samples for the teacher.

In particular, when we use MNIST+TinyImageNet for outlier exposure of the teacher model, both the teacher and its student detect the test OOD samples almost perfectly. Exposing various OOD samples at training time is an unrealistic setting, because there are infinitely many cases of OOD. However, the experimental results are worth noting, because the student model is trained only on ID samples with soft labels, and no OOD sample is directly used to train it. Although how to generate soft labels without a perfectly OOD-robust teacher remains an open question, the results show the existence of a soft labeling that makes DNNs robust to various OOD datasets without OOD training.

OOD detection performance is also distilled into a student model that has a different architecture from its teacher model. In Table 2, we use DenseNet (Huang et al., 2017) with 40 layers and a growth rate of 12 as the student model of WRN-40-2. Note that the number of trainable parameters of DenseNet (1.1 M) is half that of WRN-40-2 (2.2 M). Even though the size and architecture of the student differ from those of its teacher, the OOD detection AUROCs of the teacher and student are consistent. The results imply that the effect of soft labeling on OOD detection is model-agnostic. Hence, if we find a soft labeling method for OOD-robust DNNs, it can be used generally across DNN architectures.

Orthogonal to OOD detection, one disadvantage of post-training with OOD samples is a degradation of the original classification accuracy (Hendrycks et al., 2019; Hein et al., 2019). However, we find that both the test accuracy and ECE of the student models (+OD) are similar to or better than those of the original model before outlier exposure (Baseline) (Table 1). The improvement in test accuracy results from soft labeling, because soft labels help a model avoid overfitting regardless of the type of soft labeling (Yuan et al., 2019).

4 Discussion

In this study, we show that soft labeling of incorrect classes is closely linked with the OOD detection of DNNs. Note that the student models in Figure 2 are trained without any OOD sample, yet they can reach almost perfect OOD detection AUROCs. The results verify that constructing OOD-robust DNNs is possible without modifying the model or post-training on OOD samples.

The limitation of our study is that a concrete soft labeling method for OOD-robust DNNs is not identified and remains an open question. Instead, we focus on showing that such a soft labeling exists.

We postulate that finding an output regularization of incorrect classes that makes DNNs robust to unseen OOD samples is possible and worth exploring in future work. Note that proper soft labeling can improve not only OOD detection, but also the classification accuracy on unseen ID samples and confidence calibration (Table 1). In addition, OOD-robust soft labeling is model-agnostic and can be applied generally to various model architectures.


References

  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330.
  • T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li (2019) Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567.
  • M. Hein, M. Andriushchenko, and J. Bitterwolf (2019) Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–50.
  • D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations.
  • D. Hendrycks, M. Mazeika, and T. Dietterich (2019) Deep anomaly detection with outlier exposure. In International Conference on Learning Representations.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
  • M. Lukasik, S. Bhojanapalli, A. K. Menon, and S. Kumar (2020) Does label smoothing mitigate label noise?. arXiv preprint arXiv:2003.02819.
  • R. Müller, S. Kornblith, and G. E. Hinton (2019) When does label smoothing help?. In Advances in Neural Information Processing Systems, pp. 4696–4705.
  • A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
  • Q. Xie, E. Hovy, M. Luong, and Q. V. Le (2019) Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252.
  • L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng (2019) Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146.